Learning Bayesian Networks: Naïve and Non-Naïve Bayes
Hypothesis Space: fixed size, stochastic, continuous parameters
Learning Algorithm: direct computation, eager, batch
Multivariate Gaussian Classifier
The multivariate Gaussian classifier is equivalent to a simple Bayesian network with a single arc y -> x. It models the joint distribution P(x, y) under the assumption that the class-conditional distributions P(x | y) are multivariate Gaussians.
- P(y): multinomial random variable (a K-sided coin)
- P(x | y = k): multivariate Gaussian with mean µ_k and covariance matrix Σ_k
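As a concrete illustration, here is a minimal sketch of this classifier in Python/NumPy; the function names (fit_gaussian_classifier, predict) and the use of scipy.stats for the Gaussian density are my own choices, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_classifier(X, y):
    """Estimate P(y = k) and the per-class Gaussian P(x | y = k) by direct computation."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = {
            "prior": len(Xk) / len(X),        # P(y = k)
            "mean": Xk.mean(axis=0),          # mu_k
            "cov": np.cov(Xk, rowvar=False),  # Sigma_k
        }
    return params

def predict(params, x):
    """Choose the class maximizing log P(y = k) + log P(x | y = k)."""
    score = {k: np.log(p["prior"]) + multivariate_normal.logpdf(x, p["mean"], p["cov"])
             for k, p in params.items()}
    return max(score, key=score.get)
```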
Naïve Bayes Model
The network consists of a class node y with arcs to the feature nodes x_1, x_2, x_3, ..., x_n. Each node contains a probability table:
- y: P(y = k)
- x_j: P(x_j = v | y = k), the class-conditional probability
Interpret this as a generative model: choose the class k according to P(y = k), then generate each feature independently according to P(x_j = v | y = k). The feature values are conditionally independent given the class: P(x_i, x_j | y) = P(x_i | y) · P(x_j | y).
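The generative story can be written down directly. This is a sketch with made-up probability tables for two classes and three binary features (the numbers are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tables: P(y = k), and P(x_j = 1 | y = k) with rows = features, columns = classes.
p_y = np.array([0.6, 0.4])
p_x1_given_y = np.array([[0.9, 0.2],
                         [0.3, 0.7],
                         [0.5, 0.1]])

def generate_example():
    """Sample (x, y) from the Naive Bayes generative model."""
    k = rng.choice(len(p_y), p=p_y)                             # choose the class from P(y)
    x = rng.random(p_x1_given_y.shape[0]) < p_x1_given_y[:, k]  # generate each feature independently
    return x.astype(int), k
```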
Representing P(x_j | y)
Many representations are possible:
- Univariate Gaussian: if x_j is a continuous random variable, we can use a normal distribution and learn the mean µ and variance σ².
- Multinomial: if x_j is a discrete random variable with x_j ∈ {v_1, ..., v_m}, we construct the conditional probability table

              y = 1                  y = 2                  ...  y = K
  x_j = v_1   P(x_j = v_1 | y = 1)   P(x_j = v_1 | y = 2)   ...  P(x_j = v_1 | y = K)
  x_j = v_2   P(x_j = v_2 | y = 1)   P(x_j = v_2 | y = 2)   ...  P(x_j = v_2 | y = K)
  ...
  x_j = v_m   P(x_j = v_m | y = 1)   P(x_j = v_m | y = 2)   ...  P(x_j = v_m | y = K)

- Discretization: convert continuous x_j into a discrete variable.
- Kernel density estimates: apply a kind of nearest-neighbor algorithm to compute P(x_j | y) in the neighborhood of the query point.
Discretization via Mutual Information
Many discretization algorithms have been studied. One of the best is mutual information discretization: to discretize feature x_j, grow a decision tree considering only splits on x_j. Each leaf of the resulting tree will correspond to a single value of the discretized x_j.
Stopping rule (applied at each node): stop when

  I(x_j; y) < log2(N - 1)/N + Δ/N
  Δ = log2(3^K - 2) - [K·H(S) - K_l·H(S_l) - K_r·H(S_r)]

where S is the training data in the parent node; S_l and S_r are the examples in the left and right children; K, K_l, and K_r are the corresponding numbers of classes present in these examples; I is the mutual information, H is the entropy, and N is the number of examples in the node.
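A sketch of this stopping test in Python (the helper names entropy and mdlp_should_stop are mine; the rule itself is the one above):

```python
import numpy as np

def entropy(labels):
    """H(S) over a sequence of integer class labels, in bits."""
    counts = np.bincount(np.asarray(labels))
    p = counts[counts > 0] / len(labels)
    return -np.sum(p * np.log2(p))

def mdlp_should_stop(parent, left, right):
    """Return True if the candidate split on x_j should be rejected."""
    N = len(parent)
    K, Kl, Kr = len(set(parent)), len(set(left)), len(set(right))
    # I(x_j; y) at this node = information gain of the split
    gain = (entropy(parent)
            - (len(left) / N) * entropy(left)
            - (len(right) / N) * entropy(right))
    delta = np.log2(3**K - 2) - (K * entropy(parent) - Kl * entropy(left) - Kr * entropy(right))
    return gain < (np.log2(N - 1) + delta) / N
```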
Kernel Density Estimators
Define the Gaussian kernel with parameter σ:

  K(x_j, x_{i,j}) = 1/(√(2π)·σ) · exp(-(x_j - x_{i,j})² / (2σ²))

and estimate

  P(x_j | y = k) = (1/N_k) · Σ_{i : y_i = k} K(x_j, x_{i,j})

where N_k is the number of training examples in class k.
Kernel Density Estimators (2)
This is equivalent to placing a Gaussian bump of height 1/N_k on each training data point from class k and then adding them up.
[Figure: the individual Gaussian bumps contributing to P(x_j | y), plotted against x_j.]
Kernel Density Estimators (3)
[Figure: the resulting probability density P(x_j | y), plotted against x_j.]
The value chosen for σ is critical.
[Figure: the same kernel density estimate computed with σ = 0.15 and with σ = 0.50.]
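A sketch of the estimator in Python/NumPy (function names are illustrative; train_x is assumed to hold the values x_{i,j} of the single feature being modeled):

```python
import numpy as np

def gaussian_kernel(xj, xij, sigma):
    """K(x_j, x_{i,j}): a Gaussian bump of width sigma centered on a training value."""
    return np.exp(-(xj - xij) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def kde_class_conditional(xj, train_x, train_y, k, sigma=0.15):
    """Estimate P(x_j | y = k) as the average kernel over the class-k training points."""
    xk = train_x[train_y == k]                     # the N_k training values from class k
    return gaussian_kernel(xj, xk, sigma).mean()   # (1/N_k) * sum of kernels
```

Varying sigma between 0.15 and 0.50 reproduces the under- and over-smoothing behavior shown in the figure.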
Naïve Bayes Learns a Linear Threshold Unit
For multinomial and discretized attributes (but not Gaussian ones), Naïve Bayes gives a linear decision boundary.

  P(x | Y = y) = P(x_1 = v_1 | Y = y) · P(x_2 = v_2 | Y = y) ··· P(x_n = v_n | Y = y)

Define a discriminant function for class 1 versus class K:

  h(x) = P(Y = 1 | X) / P(Y = K | X)
       = [P(x_1 = v_1 | Y = 1)/P(x_1 = v_1 | Y = K)] ··· [P(x_n = v_n | Y = 1)/P(x_n = v_n | Y = K)] · [P(Y = 1)/P(Y = K)]
Log of Odds Ratio

  log [P(y = 1 | x)/P(y = K | x)] = log([P(x_1 = v_1 | y = 1)/P(x_1 = v_1 | y = K)] ··· [P(x_n = v_n | y = 1)/P(x_n = v_n | y = K)] · [P(y = 1)/P(y = K)])
                                  = log [P(x_1 = v_1 | y = 1)/P(x_1 = v_1 | y = K)] + ... + log [P(x_n = v_n | y = 1)/P(x_n = v_n | y = K)] + log [P(y = 1)/P(y = K)]

Suppose each x_j is binary and define

  α_{j,0} = log [P(x_j = 0 | y = 1)/P(x_j = 0 | y = K)]
  α_{j,1} = log [P(x_j = 1 | y = 1)/P(x_j = 1 | y = K)]
Log Odds (2)
Now rewrite as

  log [P(y = 1 | x)/P(y = K | x)] = Σ_j (α_{j,1} - α_{j,0})·x_j + Σ_j α_{j,0} + log [P(y = 1)/P(y = K)]

We classify into class 1 if this quantity is ≥ 0 and into class K otherwise.
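This linear form is easy to verify in code. The sketch below (function names are my own) converts the Naïve Bayes tables for binary features into weights w and a bias b, then classifies with a linear threshold:

```python
import numpy as np

def log_odds_weights(p1, pK, prior1, priorK):
    """Turn Naive Bayes tables for binary features into linear threshold parameters.

    p1[j] = P(x_j = 1 | y = 1), pK[j] = P(x_j = 1 | y = K).
    """
    a1 = np.log(p1 / pK)                     # alpha_{j,1}
    a0 = np.log((1 - p1) / (1 - pK))         # alpha_{j,0}
    w = a1 - a0                              # per-feature weights
    b = a0.sum() + np.log(prior1 / priorK)   # bias term
    return w, b

def classify(x, w, b):
    """Class 1 if the log odds ratio is >= 0, otherwise class K."""
    return 1 if x @ w + b >= 0 else "K"
```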
Learning the Probability Distributions by Direct Computation
P(y = k) is just the fraction of training examples belonging to class k. For multinomial variables, P(x_j = v | y = k) is the fraction of training examples in class k where x_j = v. For Gaussian variables, µ̂_{jk} is the average value of x_j for training examples in class k, and σ̂_{jk} is the sample standard deviation of those points:

  σ̂_{jk} = sqrt( (1/N_k) · Σ_{i : y_i = k} (x_{i,j} - µ̂_{jk})² )
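A sketch of the direct computation for one Gaussian feature (the function name is mine; the 1/N_k normalization matches the formula above):

```python
import numpy as np

def gaussian_params(X, y, j, k):
    """Direct computation of mu_hat_{jk} and sigma_hat_{jk} for feature j and class k."""
    xk = X[y == k, j]                         # values of feature j in class k
    mu = xk.mean()                            # mu_hat_{jk}
    sigma = np.sqrt(np.mean((xk - mu) ** 2))  # sigma_hat_{jk}, with 1/N_k normalization
    prior = np.mean(y == k)                   # P(y = k): fraction of examples in class k
    return prior, mu, sigma
```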
Improved Probability Estimates via Laplace Corrections
When we have very little training data, direct probability computation can give probabilities of 0 or 1. Such extreme probabilities are too strong and cause problems.
Suppose we are estimating a probability P(z) and we have n_0 examples where z is false and n_1 examples where z is true. Our direct estimate is

  P(z = 1) = n_1 / (n_0 + n_1)

Laplace estimate: add 1 to the numerator and 2 to the denominator:

  P(z = 1) = (n_1 + 1) / (n_0 + n_1 + 2)

This says that in the absence of any evidence, we expect P(z = 1) = 0.5, but our belief is weak (equivalent to one example for each outcome).
Generalized Laplace estimate: if z has K different outcomes, then we estimate

  P(z = k) = (n_k + 1) / (n_0 + ... + n_{K-1} + K)
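A one-function sketch (illustrative name) of the generalized Laplace estimate for a discrete variable with K outcomes:

```python
import numpy as np

def laplace_estimate(values, num_outcomes):
    """Generalized Laplace estimate: P(z = k) = (n_k + 1) / (N + K)."""
    counts = np.bincount(np.asarray(values), minlength=num_outcomes)
    return (counts + 1) / (len(values) + num_outcomes)

# With a single observation of outcome 1 out of two possible outcomes, the direct
# estimate would be [0.0, 1.0]; the Laplace estimate softens it to [1/3, 2/3].
print(laplace_estimate([1], num_outcomes=2))
```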
Naïve Bayes Applied to Diabetes Diagnosis
Bayes nets and causality: Bayes nets work best when the arrows follow the direction of causality. Two things with a common cause are likely to be conditionally independent given the cause, and arrows in the causal direction capture this independence.
In a Naïve Bayes network, the arrows are often not in the causal direction:
- diabetes does not cause pregnancies
- diabetes does not cause age
But some arrows are correct:
- diabetes does cause the level of blood insulin and blood glucose
Non-Naïve Bayes
Manually construct a graph in which all arcs are causal. Learning the probability tables is still easy: for example, P(Mass | Age, Preg) involves counting the number of patients of a given age and number of pregnancies that have a given body mass.
Classification (see the sketch below):

  P(D = d | A, P, M, I, G) = P(I | D = d) · P(G | I, D = d) · P(D = d | A, M, P) / P(I, G)
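A sketch of the classification step, assuming the conditional probability tables have already been learned and are stored in plain dictionaries (the table keys and structure here are my own invention for illustration):

```python
def posterior_diabetes(a, p, m, i, g, tables):
    """Compute P(D = d | A, P, M, I, G) from the causal factorization above."""
    scores = {}
    for d in (0, 1):
        scores[d] = (tables["I|D"][(i, d)]
                     * tables["G|I,D"][(g, i, d)]
                     * tables["D|A,M,P"][(d, a, m, p)])
    # Normalize over d instead of computing the denominator P(I, G) explicitly.
    z = sum(scores.values())
    return {d: s / z for d, s in scores.items()}
```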
Evaluation of Naïve Bayes
[Table: comparison of LMS, Logistic, LDA, Trees, Nets, NNbr, SVM, and NB on the criteria of mixed data, missing values, outliers, monotone transformations, scalability, irrelevant inputs, linear combinations, interpretability, and accuracy.]
Naïve Bayes is very popular, particularly in natural language processing and information retrieval, where there are many features compared to the number of examples. In applications with lots of data, Naïve Bayes does not usually perform as well as more sophisticated methods.
Naïve Bayes Summary
Advantages of Bayesian networks:
- They produce stochastic classifiers, which can be combined with utility functions to make optimal decisions.
- It is easy to incorporate causal knowledge, and the resulting probabilities are easy to interpret.
- The learning algorithms are very simple if all variables are observed in the training data.
Disadvantages of Bayesian networks:
- The fixed-size hypothesis space may underfit or overfit the data, and may not contain any good classifiers if the prior knowledge is wrong.
- Continuous features are harder to handle.