Introduction to Machine Learning Lecture 3. Mehryar Mohri Courant Institute and Google Research

Size: px

Start display at page:

Download "Introduction to Machine Learning Lecture 3. Mehryar Mohri Courant Institute and Google Research"

Eileen Francis
7 years ago
Views:

1 Introduction to Machine Learning Lecture 3 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

2 Bayesian Learning

3 Terminology: Bayes Formula/Rule Pr[Y X] = posterior probability likelihood Pr[X Y ]Pr[Y ]. Pr[X] evidence prior 3

4 Loss Function Definition: function L: Y Y R + indicating the penalty for an incorrect prediction. L(y, y) : loss for prediction of y instead of y. Examples: zero-one loss: standard loss function in classification; L(y, y )=1 y=y for y, y Y. non-symmetric losses: e.g., for spam classification; L( ham, spam) L( spam, ham). squared loss: standard loss function in regression; L(y, y )=(y y) 2. 4

5 Input space Classification Problem X : e.g., set of documents. feature vector Φ(x) R N associated to x X. notation: feature vector x R N. example: vector of word counts in document. Output or target space : set of classes; e.g., sport, business, art. Y Problem: given x, predict the correct class associated to x. y Y 5

6 Bayesian Prediction Definition: the expected conditional loss of predicting y Y is Bayesian decision: predict class minimizing expected conditional loss, that is y =argmin by zero-one loss: L[y x] = y Y L(y, y)pr[y x]. L[y x] =argmin by y =argmax by y Y L(y, y)pr[y x]. Pr[y x]. Maximum a Posteriori (MAP) principle. 6

7 Binary Classification - Illustration 1 Pr[y 1 x] 0 x Pr[y 2 x] 7

8 Maximum a Posteriori (MAP) Definition: the MAP principle consists of predicting according to the rule y =argmax y Y Pr[y x]. Equivalently, by the Bayes formula: y =argmax y Y Pr[x y]pr[y] Pr[x] =argmax y Y Pr[x y]pr[y]. How do we determine Pr[x y] and Pr[y]? Density estimation problem. 8

9 Application - Maximum a Posteriori Formulation: hypothesis set H. ĥ =argmax h H Pr[h O] =argmax h H Pr[O h]pr[h] Pr[O] =argmax h H Pr[O h]pr[h]. Example: determine if a patient has a rare disease H ={d, nd}, given laboratory test O ={pos, neg}. With Pr[d]=.005, Pr[pos d] =.98, Pr[neg nd] =.95, if the test is positive, what should be the diagnosis? Pr[pos d]pr[d]= = Pr[pos nd]pr[nd]=(1.95).(1.005)=.04975>

10 Density Estimation Data: sample drawn i.i.d. from set some distribution D, x 1,...,x m X. Xaccording to Problem: find distribution p out of a set P that best estimates D. Note: we will study density estimation specifically in a future lecture. 10

11 Maximum Likelihood Likelihood: probability of observing sample under distribution p P, which, given the independence assumption is m Pr[x 1,...,x m ]= p(x i ). Principle: select distribution maximizing sample probability m p = argmax p(x i ), p P i=1 m or p = argmax log p(x i ). p P i=1 i=1 11

12 Example: Bernoulli Trials Problem: find most likely Bernoulli distribution, given sequence of coin flips H, T, T, H, T, H, T, H, H, H, T, T,..., H. Bernoulli distribution: p(h) =θ, p(t )=1 θ. Likelihood: Solution: dl(p) dθ l = N(H) θ l(p) =logθ N(H) (1 θ) N(T ) = N(H)logθ + N(T )log(1 θ). is differentiable and concave; N(T ) 1 θ =0 θ = N(H) N(H)+N(T ). 12

13 Example: Gaussian Distribution Problem: find most likely Gaussian distribution, given sequence of real-valued observations Normal distribution: Likelihood: Solution: p(x) µ =0 µ = 1 m l 3.18, 2.35,.95, 1.175,... ( 1 p(x) = exp (x µ)2 2πσ 2 2σ 2 l(p) = 1 m 2 m (x i µ) 2 log(2πσ2 ) 2σ 2. is differentiable and concave; m i=1 x i p(x) σ 2 i=1 =0 σ 2 = 1 m ) m x 2 i µ 2. i=1. 13

14 ML Properties Problems: the underlying distribution may not be among those searched. overfitting: number of examples too small wrt number of parameters. Pr[y] =0 if class y does not appear in sample! smoothing techniques. 14

15 Additive Smoothing Definition: the additive or Laplace smoothing for estimating Pr[y], y Y, from a sample of size m is defined by α =0 : ML estimator (MLE). MLE after adding Pr[y] = α y + α m + α Y. to the count of each class. Bayesian justification based on Dirichlet prior. poor performance for some applications, such as n-gram language modeling. 15

16 Estimation Problem Conditional probability: for large N, number of features, difficult to estimate. Pr[x y] =Pr[x 1,...,x N y]. even if features are Boolean, that is x i {0, 1}, there are 2 N possible feature vectors! may need very large sample. 16

17 Naive Bayes Conditional independence assumption: for any y Y, Pr[x 1,...,x N y] =Pr[x 1 y]...pr[x N y]. given the class, the features are assumed to be independent. strong assumption, typically does not hold. 17

18 Example - Document Classification Features: presence/absence of word x i. Estimation of Pr[x i y] : frequency of word x i among documents labeled with y, or smooth estimate. Estimation of Classification: Pr[y] y =argmax y Y : frequency of class y in sample. Pr[y] N Pr[x i y]. i=1 18

19 Naive Bayes - Binary Classification Classes: Y = { 1, +1}. Decision based on sign of log-odd ratios: log Pr[+1 x] Pr[ 1 x] log Pr[+1 x] Pr[ 1 x] =logpr[+1] Pr[x +1] Pr[ 1] Pr[x 1] ; in terms of =log Pr[+1] N i=1 Pr[x i +1] Pr[ 1] N i=1 Pr[x i 1] =log Pr[+1] Pr[ 1] + N i=1 contribution of feature/expert i to decision log Pr[x i +1] Pr[x i 1]. 19

20 Naive Bayes = Linear Classifier Theorem: assume that x i {0, 1} for all i [1,N]. Then, the Naive Bayes classifier is defined by where and x sgn(w x + b), Proof: observe that for any i [1,N], log Pr[x i +1] Pr[x i 1] = w i =log Pr[x i=1 +1] Pr[x i =1 1] log Pr[x i=0 +1] Pr[x i =0 1] b =log Pr[+1] Pr[ 1] + N i=1 log Pr[x i=0 +1] Pr[x i =0 1]. log Pr[x i =1 +1] Pr[x i =1 1] log Pr[x i =0 +1] x i +log Pr[x i =0 +1] Pr[x i =0 1] Pr[x i =0 1]. 20

21 Summary Bayesian prediction: requires solving density estimation problems. often difficult to estimate Pr[x y] for x R N. but, simple and easy to apply; widely used. Naive Bayes: strong assumption. straightforward estimation problem. specific linear classifier. sometimes surprisingly good performance. 21

1 Maximum likelihood estimation

COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N