Introduction to Machine Learning



Bayesian Classification
Varun Chandola
January 3, 07

Contents

1 Learning Probabilistic Classifiers
  1.1 Treating Output Label Y as a Random Variable
  1.2 Computing Posterior for Y
  1.3 Computing Class Conditional Probabilities
2 Naive Bayes Classification
  2.1 Naive Bayes Assumption
  2.2 Maximizing Likelihood
  2.3 Maximum Likelihood Estimates
  2.4 Adding Prior
  2.5 Using Naive Bayes Model for Prediction
  2.6 Naive Bayes Example
3 Gaussian Discriminant Analysis
  3.1 Moving to Continuous Data
  3.2 Quadratic and Linear Discriminant Analysis

1 Learning Probabilistic Classifiers

Training data, $D = [\langle x_i, y_i \rangle]_{i=1}^{N}$:

1. {circular,large,light,smooth,thick}, malignant
2. {circular,large,light,irregular,thick}, malignant
3. {oval,large,dark,smooth,thin}, benign
4. {oval,large,light,irregular,thick}, malignant
5. {circular,small,light,smooth,thick}, benign

Testing: predict $y^*$ for $x^*$.

Option 1: Functional approximation, $y^* = f(x^*)$
Option 2: Probabilistic classifier, $P(Y = \text{benign} \mid X = x^*)$ and $P(Y = \text{malignant} \mid X = x^*)$

Test example: $x^*$ = {circular, small, light, irregular, thin}

What is $P(Y = \text{benign} \mid x^*)$? What is $P(Y = \text{malignant} \mid x^*)$?

If we have not observed any training data, the best probabilistic estimate we can provide is $P(Y = \text{benign}) = P(Y = \text{malignant}) = 0.5$. But if we know how many times Y takes each value in a randomly sampled data set, we can make a better estimate.
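The remark about counting how often Y takes each value can be sketched in a few lines of Python. This is only an illustration on the five training labels listed above; the helper name `class_frequencies` is made up for this sketch and does not come from the lecture.

```python
from collections import Counter

# Labels of the five training examples listed above (examples 1 to 5).
labels = ["malignant", "malignant", "benign", "malignant", "benign"]


def class_frequencies(labels):
    """Return the relative frequency of each label (hypothetical helper)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


prior = class_frequencies(labels)
print(prior)
```

With three malignant and two benign examples, the estimate moves from 0.5 each to 3/5 and 2/5.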

1.1 Treating Output Label Y as a Random Variable

Y takes two values (Class 1: malignant; Class 2: benign).
What is $p(y)$? $\mathrm{Ber}(\theta)$.
How do you estimate $\theta$? Treat the labels in the training data as binary samples. Done that last week!
Posterior mean of $\theta$ under a $\mathrm{Beta}(\alpha_0, \beta_0)$ prior:

$$\mathbb{E}[\theta \mid D] = \frac{\alpha_0 + N_1}{\alpha_0 + \beta_0 + N}$$

where $N_1$ is the number of Class 1 examples among the $N$ training examples.

Can we just use $p(y \mid \theta)$ for predicting future labels? No: it is just a prior for Y and ignores the features.

1.2 Computing Posterior for Y

What is the probability of $x^*$ being malignant?
$P(X = x^* \mid Y = \text{malignant})$?
$P(Y = \text{malignant})$?
$P(Y = \text{malignant} \mid X = x^*)$?

By Bayes' rule:

$$P(Y = \text{malignant} \mid X = x^*) = \frac{P(X = x^* \mid Y = \text{malignant})\, P(Y = \text{malignant})}{P(X = x^* \mid Y = \text{malignant})\, P(Y = \text{malignant}) + P(X = x^* \mid Y = \text{benign})\, P(Y = \text{benign})}$$

1.3 Computing Class Conditional Probabilities

Class conditional probability of the random variable X:
Step 1: Assume a probability distribution for X, $p(X)$.
Step 2: Learn its parameters from training data.

But X is a multivariate discrete random variable!
How many parameters are needed? $2(2^D - 1)$
How much training data is needed?

Note that X, a vector of D binary features, can take $2^D$ values. The probability distribution must therefore specify the probability of observing each possibility. Given that all probabilities sum to 1, we need $2^D - 1$ probabilities. We need these probabilities for each value of Y, hence $2(2^D - 1)$ probabilities. To estimate the probabilities reliably, one needs to observe each possible realization of X at least a few times, which means that we need large amounts of training data!

2 Naive Bayes Classification

2.1 Naive Bayes Assumption

All features are independent given the class label.
Each variable can be assumed to be a Bernoulli random variable.

$$P(X = x \mid Y = \text{malignant}) = \prod_{j=1}^{D} p(x_j \mid Y = \text{malignant})$$

$$P(X = x \mid Y = \text{benign}) = \prod_{j=1}^{D} p(x_j \mid Y = \text{benign})$$

[Figure: graphical model with class node y and feature nodes x_1, x_2, ..., x_D]

Only need $2D$ parameters.

Training a Naive Bayes Classifier

Find parameters that maximize the likelihood of the training data.
What is a training example?
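To make the gap between the two parameter counts concrete, the following sketch tabulates $2(2^D - 1)$ against $2D$ for a few values of D. It assumes binary features and two classes, as in the notes; the function names are invented for this illustration.

```python
def full_joint_param_count(num_features, num_classes=2):
    """Parameters for an unrestricted distribution over binary feature vectors.

    Each class needs 2**D probabilities, minus 1 because they sum to 1.
    """
    return num_classes * (2 ** num_features - 1)


def naive_bayes_param_count(num_features, num_classes=2):
    """Parameters under the Naive Bayes assumption: one Bernoulli per feature and class."""
    return num_classes * num_features


for d in (5, 10, 20, 30):
    print(d, full_joint_param_count(d), naive_bayes_param_count(d))
```

For the five-feature tumor data, this gives 62 parameters for the unrestricted model against 10 for Naive Bayes, and the first count doubles with every added feature.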

A training example is the pair $\langle x_i, y_i \rangle$, not $x_i$ alone.

What are the parameters?
$\theta$ for Y (the class prior)
$\theta_{\text{benign}}$ and $\theta_{\text{malignant}}$ (or $\theta_1$ and $\theta_2$), each a vector with one entry $\theta_{cj}$ per feature j

Joint probability distribution of (X, Y):

$$p(x_i, y_i) = p(y_i \mid \theta)\, p(x_i \mid y_i) = p(y_i \mid \theta) \prod_j p(x_{ij} \mid \theta_{y_i j})$$

2.2 Maximizing Likelihood

Likelihood for D:

$$l(D \mid \Theta) = \prod_i \Big( p(y_i \mid \theta) \prod_j p(x_{ij} \mid \theta_{y_i j}) \Big)$$

Log-likelihood for D:

$$ll(D \mid \Theta) = N_1 \log \theta + N_2 \log(1 - \theta) + \sum_j \big[ N_{1j} \log \theta_{1j} + (N_1 - N_{1j}) \log(1 - \theta_{1j}) \big] + \sum_j \big[ N_{2j} \log \theta_{2j} + (N_2 - N_{2j}) \log(1 - \theta_{2j}) \big]$$

$N_1$ = # malignant training examples, $N_2$ = # benign training examples
$N_{1j}$ = # malignant training examples with $x_j = 1$, $N_{2j}$ = # benign training examples with $x_j = 1$

The log-likelihood can be derived using the following results. The summation $\sum_i \log p(y_i \mid \theta)$ can be expanded and reordered by class. For each class c, the contribution to the sum is $N_c \log \theta_c$, where $N_c$ is the number of training examples with c as the class label and $\theta_c$ is the class prior for class c (here $\theta_1 = \theta$ and $\theta_2 = 1 - \theta$). The double summation $\sum_i \sum_j \log p(x_{ij} \mid \theta_{y_i j})$ is the same as $\sum_j \sum_i \log p(x_{ij} \mid \theta_{y_i j})$. The inner sum can be expanded and ordered by class. For each class c, the contribution to the sum is $\sum_{i : y_i = c} \log p(x_{ij} \mid \theta_{cj})$.

2.3 Maximum Likelihood Estimates

Maximize with respect to $\theta$, assuming Y to be Bernoulli:

$$\hat{\theta}_c = \frac{N_c}{N}$$

Assuming each feature is binary ($x_j \mid (y = c) \sim \mathrm{Bernoulli}(\theta_{cj})$, $c \in \{1, 2\}$):

$$\hat{\theta}_{cj} = \frac{N_{cj}}{N_c}$$

Algorithm 1: Naive Bayes Training for Binary Features

  N_c = 0, N_cj = 0 for all c, j
  for i = 1 : N do
    c ← y_i
    N_c ← N_c + 1
    for j = 1 : D do
      if x_ij = 1 then
        N_cj ← N_cj + 1
      end if
    end for
  end for
  θ̂_c = N_c / N, θ̂_cj = N_cj / N_c
  return θ̂_c, θ̂_cj

2.4 Adding Prior

Add a prior to $\theta$ and to each $\theta_{cj}$.
Beta prior for $\theta$: $\theta \sim \mathrm{Beta}(a_0, b_0)$
Beta prior for $\theta_{cj}$: $\theta_{cj} \sim \mathrm{Beta}(a, b)$

Posterior estimates:

$$p(\theta \mid D) = \mathrm{Beta}(N_1 + a_0,\; N - N_1 + b_0)$$

$$p(\theta_{cj} \mid D) = \mathrm{Beta}(N_{cj} + a,\; N_c - N_{cj} + b)$$
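The counting algorithm and the effect of the Beta priors can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code. It uses the five training examples from Section 1 with an assumed binary encoding (1 = circular, large, light, smooth, thick; 0 = the other value of each feature) and labels 1 = malignant, 2 = benign. With all prior parameters set to 0 it returns the maximum likelihood estimates; with positive values it returns the means of the Beta posteriors given above.

```python
def train_naive_bayes(X, y, a=0.0, b=0.0, a0=0.0, b0=0.0):
    """Count-based Naive Bayes training for binary features and classes 1 and 2.

    Returns (theta, theta_cj): theta estimates P(y = 1), and theta_cj[c][j]
    estimates P(x_j = 1 | y = c). Assumes every class appears in y when
    a + b == 0, otherwise the division by N_c fails.
    """
    N = len(y)
    D = len(X[0])
    N_c = {1: 0, 2: 0}
    N_cj = {1: [0] * D, 2: [0] * D}
    for x_i, c in zip(X, y):
        N_c[c] += 1
        for j in range(D):
            if x_i[j] == 1:
                N_cj[c][j] += 1
    theta = (N_c[1] + a0) / (N + a0 + b0)
    theta_cj = {
        c: [(N_cj[c][j] + a) / (N_c[c] + a + b) for j in range(D)]
        for c in (1, 2)
    }
    return theta, theta_cj


# Assumed encoding of the five examples: [circular, large, light, smooth, thick].
X = [
    [1, 1, 1, 1, 1],  # 1. malignant
    [1, 1, 1, 0, 1],  # 2. malignant
    [0, 1, 0, 1, 0],  # 3. benign
    [0, 1, 1, 0, 1],  # 4. malignant
    [1, 0, 1, 1, 1],  # 5. benign
]
y = [1, 1, 2, 1, 2]

theta_mle, theta_cj_mle = train_naive_bayes(X, y)
theta_post, theta_cj_post = train_naive_bayes(X, y, a=1, b=1, a0=1, b0=1)
print(theta_mle, theta_cj_mle[1])
print(theta_post, theta_cj_post[1])
```

On this small data set several maximum likelihood estimates equal exactly 1 (every malignant example is large, light, and thick), so any test example lacking one of those features would get probability 0 of being malignant. The Beta(1, 1) prior pulls those estimates to 4/5, which is one practical reason for adding a prior.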

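2.5 Using Naive Bayes Model for Prediction

$$p(y = c \mid x, D) \propto p(y = c \mid D) \prod_j p(x_j \mid y = c, D)$$

MLE approach, MAP approach?
Bayesian approach:

$$p(y = 1 \mid x, D) \propto \left[ \int \mathrm{Ber}(y = 1 \mid \theta)\, p(\theta \mid D)\, d\theta \right] \prod_j \left[ \int \mathrm{Ber}(x_j \mid \theta_{1j})\, p(\theta_{1j} \mid D)\, d\theta_{1j} \right]$$

The integrals reduce to the posterior means:

$$\bar{\theta} = \frac{N_1 + a_0}{N + a_0 + b_0}, \qquad \bar{\theta}_{cj} = \frac{N_{cj} + a}{N_c + a + b}$$

Test example: $x^*$ = {cir, small, light}

We can predict a label in three ways. The first is to use the MLE for all the parameters, the second is to use the MAP estimates, and the third is to use the Bayesian averaging approach. In each case, we plug the parameter estimates into:

$$P(Y = \text{malignant} \mid X = x^*) \propto \hat{\theta}_1\, \hat{\theta}_{\text{malignant,cir}}\, \hat{\theta}_{\text{malignant,small}}\, \hat{\theta}_{\text{malignant,light}}$$

$$P(Y = \text{benign} \mid X = x^*) \propto \hat{\theta}_2\, \hat{\theta}_{\text{benign,cir}}\, \hat{\theta}_{\text{benign,small}}\, \hat{\theta}_{\text{benign,light}}$$

where $\hat{\theta}_1$ and $\hat{\theta}_2$ are the class priors, and the two products are normalized to sum to 1. The MLE and MAP approaches use the MLE and MAP estimates of the parameters, respectively, to compute these probabilities.

2.6 Naive Bayes Example

#   Shape  Size   Color  Type
1   cir    large  light  malignant
2   cir    large  light  benign
3   cir    large  light  malignant
4   ovl    large  light  benign
5   ovl    large  dark   malignant
6   ovl    small  dark   benign
7   ovl    small  dark   malignant
8   ovl    small  light  benign
9   cir    small  dark   benign
10  cir    large  dark   malignant

3 Gaussian Discriminant Analysis

3.1 Moving to Continuous Data

Naive Bayes is still applicable!
Each variable is modeled with a univariate Gaussian (normal) distribution.

$$p(y \mid x) \propto p(y) \prod_j p(x_j \mid y) = p(y) \prod_j \frac{1}{\sqrt{2\pi\sigma_j^2}}\, e^{-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}} = p(y)\, \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)}$$

where $\Sigma$ is a diagonal matrix with $\sigma_1^2, \sigma_2^2, \ldots, \sigma_D^2$ as the diagonal entries and $\mu$ is the vector of means. This amounts to treating x as a multivariate Gaussian with zero covariance between features.

Gaussian Discriminant Analysis

Class conditional density:

$$p(x \mid y = 1) = \mathcal{N}(\mu_1, \Sigma_1), \qquad p(x \mid y = 2) = \mathcal{N}(\mu_2, \Sigma_2)$$

Posterior density for y:

$$p(y = 1 \mid x) = \frac{p(y = 1)\, \mathcal{N}(\mu_1, \Sigma_1)}{p(y = 1)\, \mathcal{N}(\mu_1, \Sigma_1) + p(y = 2)\, \mathcal{N}(\mu_2, \Sigma_2)}$$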
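The claim that a product of univariate Gaussians equals a multivariate Gaussian with diagonal covariance, and the resulting class posterior, can be checked numerically. The sketch below uses only the Python standard library. The means, variances, prior, and function names are invented for illustration and are not taken from the lecture.

```python
import math


def univariate_gaussian(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def product_of_univariates(x, mu, var):
    """Naive Bayes class-conditional density: product over features."""
    p = 1.0
    for x_j, mu_j, var_j in zip(x, mu, var):
        p *= univariate_gaussian(x_j, mu_j, var_j)
    return p


def diagonal_mvn(x, mu, var):
    """Multivariate Gaussian density with diagonal covariance diag(var)."""
    D = len(x)
    det = math.prod(var)  # determinant of a diagonal matrix
    quad = sum((x_j - mu_j) ** 2 / var_j for x_j, mu_j, var_j in zip(x, mu, var))
    return math.exp(-0.5 * quad) / ((2 * math.pi) ** (D / 2) * math.sqrt(det))


def posterior_class1(x, prior1, mu1, var1, mu2, var2):
    """p(y = 1 | x) by Bayes' rule for two Gaussian class conditionals."""
    n1 = prior1 * diagonal_mvn(x, mu1, var1)
    n2 = (1.0 - prior1) * diagonal_mvn(x, mu2, var2)
    return n1 / (n1 + n2)


# Hypothetical parameters for two classes in two dimensions.
mu1, var1 = [0.0, 0.0], [1.0, 4.0]
mu2, var2 = [2.0, 1.0], [1.0, 4.0]
x = [0.0, 0.0]

print(product_of_univariates(x, mu1, var1), diagonal_mvn(x, mu1, var1))
print(posterior_class1(x, 0.5, mu1, var1, mu2, var2))
```

The two densities on the first printed line agree up to floating-point rounding. With equal priors and equal variances, a point halfway between the two means, such as [1.0, 0.5] here, gets posterior 0.5.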

The posterior for y depends on x through the Mahalanobis distance of x from the two class means.

One can interpret Gaussian Discriminant Analysis geometrically by noting that the quantity in the exponent of the pdf of a multivariate Gaussian, $(x - \mu)^\top \Sigma^{-1} (x - \mu)$, is the squared Mahalanobis distance between an example x and the mean $\mu$ in the D-dimensional space. For better understanding, consider the eigendecomposition of $\Sigma$, i.e., $\Sigma = U \Lambda U^\top$, where U is an orthonormal matrix of eigenvectors with $U^\top U = I$ and $\Lambda$ is a diagonal matrix of eigenvalues. We can rewrite the inverse of $\Sigma$ as:

$$\Sigma^{-1} = (U \Lambda U^\top)^{-1} = U \Lambda^{-1} U^\top = \sum_{i=1}^{D} \frac{1}{\lambda_i} u_i u_i^\top$$

where $u_i$ is the i-th eigenvector and $\lambda_i$ is the corresponding eigenvalue. The Mahalanobis distance between x and $\mu$ can then be rewritten as:

$$(x - \mu)^\top \Sigma^{-1} (x - \mu) = (x - \mu)^\top \left( \sum_{i=1}^{D} \frac{1}{\lambda_i} u_i u_i^\top \right) (x - \mu) = \sum_{i=1}^{D} \frac{1}{\lambda_i} (x - \mu)^\top u_i u_i^\top (x - \mu) = \sum_{i=1}^{D} \frac{y_i^2}{\lambda_i}$$

where $y_i = (x - \mu)^\top u_i$ is the coordinate of $x - \mu$ along the i-th eigenvector (not a class label). Setting this quantity equal to a constant gives the equation of an ellipse in D-dimensional space. Thus the points on an ellipse around the mean have the same probability density under a Gaussian.

3.2 Quadratic and Linear Discriminant Analysis

Using non-diagonal covariance matrices for each class: Quadratic Discriminant Analysis (QDA)
Quadratic decision boundary
If $\Sigma_1 = \Sigma_2 = \Sigma$: Linear Discriminant Analysis (LDA)
Parameter sharing or tying
Results in a linear decision surface
No quadratic term
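The "no quadratic term" statement can be verified by writing out the log of the posterior ratio. The following is a short derivation sketch under the Gaussian class-conditional model above; the shorthand $\pi_c = p(y = c)$ and the names $w$ and $b$ are introduced here for convenience and are not part of the original notes.

```latex
% QDA: general case with class-specific covariances
\log \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)}
  = \log \frac{\pi_1}{\pi_2}
    - \frac{1}{2} \log \frac{|\Sigma_1|}{|\Sigma_2|}
    - \frac{1}{2} (x - \mu_1)^\top \Sigma_1^{-1} (x - \mu_1)
    + \frac{1}{2} (x - \mu_2)^\top \Sigma_2^{-1} (x - \mu_2)

% LDA: with \Sigma_1 = \Sigma_2 = \Sigma the log-determinant term vanishes
% and the x^\top \Sigma^{-1} x terms cancel
\log \frac{p(y = 1 \mid x)}{p(y = 2 \mid x)}
  = w^\top x + b,
\qquad
w = \Sigma^{-1} (\mu_1 - \mu_2),
\qquad
b = -\frac{1}{2} \mu_1^\top \Sigma^{-1} \mu_1
    + \frac{1}{2} \mu_2^\top \Sigma^{-1} \mu_2
    + \log \frac{\pi_1}{\pi_2}
```

The decision boundary is the set of points where this log ratio is zero. In the first case it is a quadratic surface in x; in the second it is the hyperplane $w^\top x + b = 0$.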
