Naïve Bayes. Vibhav Gogate The University of Texas at Dallas

1 Naïve Bayes Vibhav Gogate The University of Texas at Dallas

2 Supervised Learning of Classifiers
Find f.
- Given: a training set $\{(x_i, y_i) \mid i = 1, \dots, n\}$
- Find: a good approximation to $f : X \to Y$
Examples: what are X and Y?
- Spam detection: map email to {Spam, Ham}
- Digit recognition: map pixels to {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
- Stock prediction: map new, historic prices, etc. to $\mathbb{R}$ (the real numbers)
The first two examples, where Y is a discrete set, are classification problems.

3 Bayesian Categorization/Classification
- Let the set of categories be $\{c_1, c_2, \dots, c_n\}$.
- Let E be the description of an instance.
- Determine the category of E by computing, for each $c_i$: $P(c_i \mid E) = \frac{P(c_i)\, P(E \mid c_i)}{P(E)}$
- P(E) can be ignored (it is a normalization constant): $P(c_i \mid E) \propto P(c_i)\, P(E \mid c_i)$
- Select the class with the maximum probability.
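The rule on this slide can be sketched in a few lines of Python. The function name `posterior` and all numbers below are hypothetical, chosen only to illustrate that dividing by P(E) rescales the scores without changing which class wins.

```python
def posterior(priors, likelihoods):
    """Return P(c | E) for every category c, given P(c) and P(E | c)."""
    # Unnormalized scores P(c) * P(E | c).
    scores = {c: priors[c] * likelihoods[c] for c in priors}
    # P(E) is the sum of the scores; it only rescales them.
    evidence = sum(scores.values())
    return {c: s / evidence for c, s in scores.items()}

# Hypothetical numbers, for illustration only.
priors = {"spam": 0.3, "ham": 0.7}          # P(c)
likelihoods = {"spam": 0.05, "ham": 0.01}   # P(E | c)

post = posterior(priors, likelihoods)
best = max(post, key=post.get)              # class with the maximum probability
```

Taking the argmax over the unnormalized scores would select the same class, which is why the slide drops P(E).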

4 Text classification
- Classify emails: Y = {Spam, NotSpam}
- Classify news articles: Y = {what is the topic of the article?}
- Classify webpages: Y = {Student, professor, project, ...}
What to use for features, X?

5 Features X are the word sequence in the document: $X_i$ is the i-th word in the article

6 Features for Text Classification
- X is the sequence of words in the document.
- X (and hence P(X | Y)) is huge!
- An article has at least 1000 words: $X = \{X_1, \dots, X_{1000}\}$
- $X_i$ represents the i-th word in the document, i.e., the domain of $X_i$ is the entire vocabulary, e.g., Webster's Dictionary (or more), 10,000 words, etc.
- $10{,}000^{1000} = 10^{4000}$ possible word sequences; atoms in the universe: roughly $10^{80}$. We may have a problem.
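The size comparison can be checked with exact integer arithmetic. This is a small sketch; the vocabulary size and document length come from the slide, while the figure for atoms in the universe is a commonly quoted order-of-magnitude estimate and should be read as an assumption.

```python
vocab_size = 10_000            # words in the vocabulary (from the slide)
doc_length = 1_000             # words per article (from the slide)
atoms_in_universe = 10 ** 80   # order-of-magnitude estimate (assumption)

# Number of distinct word sequences of that length.
# Python integers are arbitrary precision, so this is exact.
n_sequences = vocab_size ** doc_length

# How many powers of ten separate the two quantities.
gap_in_orders_of_magnitude = 4000 - 80
```

A table with one probability per word sequence is therefore out of the question, which motivates the simplifying assumptions on the following slides.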

7 Bag of Words Model
- Typical additional assumption: position in the document doesn't matter: $P(X_i = x_i \mid Y = y) = P(X_k = x_i \mid Y = y)$ (all positions have the same distribution)
- Ignore the order of words.
- Sounds really silly, but often works very well!
- From now on: $X_i$ is Boolean ("word i is in the document"), and $X = (X_1, \dots, X_n)$.

8 Bag of Words Approach
word      count
aardvark  0
about     2
all       2
Africa    1
apple     0
anxious   0
...
gas       1
...
oil       1
Zaire     0
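A count vector like the one on this slide can be built with the standard library. This is a minimal sketch: the tokenizer (lowercase, letters and apostrophes only), the vocabulary, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Map a document to per-word counts over a fixed vocabulary."""
    # Crude tokenizer (assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Word order is discarded; words outside the vocabulary are ignored.
    return {word: counts[word] for word in vocabulary}

# Hypothetical vocabulary and document.
vocab = ["aardvark", "about", "all", "africa", "gas", "oil"]
doc = "All about oil: gas and oil prices in Africa, and all about demand."

features = bag_of_words(doc, vocab)
```

Replacing each count with `count > 0` gives the Boolean "word i is in the document" features described on the previous slide.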

9 Bayesian Categorization
$P(y_i \mid X) \propto P(y_i)\, P(X \mid y_i)$
Need to know:
- Priors: $P(y_i)$
- Conditionals: $P(X \mid y_i)$
$P(y_i)$ are easily estimated from data: if $n_i$ of the examples in D are in $y_i$, then $P(y_i) = n_i / |D|$.
Conditionals: $X = (X_1, \dots, X_n)$, so we must estimate $P(X_1, \dots, X_n \mid y_i)$.
- Too many possible instances to estimate (exponential in n)!
- Even with the bag of words assumption!

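10 Need to Simplify Somehow
Too many probabilities $P(x_1 \wedge x_2 \wedge x_3 \mid y_i)$:
- $P(x_1 \wedge x_2 \wedge x_3 \mid \text{spam})$
- $P(x_1 \wedge x_2 \wedge \neg x_3 \mid \text{spam})$
- $P(x_1 \wedge \neg x_2 \wedge x_3 \mid \text{spam})$
- ...
- $P(\neg x_1 \wedge \neg x_2 \wedge \neg x_3 \mid \text{spam})$
Can we assume some are the same?
$P(x_1 \wedge x_2 \mid y_i) = P(x_1 \mid y_i)\, P(x_2 \mid y_i)$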

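11 Conditional Independence
X is conditionally independent of Y given Z if the probability distribution for X is independent of the value of Y, given the value of Z:
$P(X = x \mid Y = y, Z = z) = P(X = x \mid Z = z)$ for all x, y, z
Equivalent to: $P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$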

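12 Naïve Bayes
- Naïve Bayes assumption: features are independent given the class: $P(X_1, X_2 \mid Y) = P(X_1 \mid Y)\, P(X_2 \mid Y)$
- More generally: $P(X_1, \dots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$
- How many parameters now? Suppose X is composed of n binary features.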
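One way to work out the parameter count is shown below. It assumes, in addition to the n binary features stated on the slide, that the class Y is also binary; that second assumption is mine and only affects the constant factor.

```latex
% Full joint conditional: one free parameter per feature configuration, per class
\#\text{params}\bigl[P(X_1, \dots, X_n \mid Y)\bigr] = 2\,(2^n - 1)

% Naive Bayes: one free parameter P(X_i = 1 \mid Y = y) per feature, per class
\#\text{params}\Bigl[\textstyle\prod_{i=1}^{n} P(X_i \mid Y)\Bigr] = 2n

% Example with n = 30:
2\,(2^{30} - 1) = 2{,}147{,}483{,}646 \quad \text{versus} \quad 2 \cdot 30 = 60
```

In both cases one more parameter is needed for the prior P(Y). The count drops from exponential to linear in n.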

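13 The Naïve Bayes Classifier
Given:
- Prior P(Y)
- n conditionally independent features $X_1, \dots, X_n$ given the class Y
- For each $X_i$, the likelihood $P(X_i \mid Y)$
(Graphical model: Y is the parent of $X_1, X_2, \dots, X_n$.)
Decision rule: $y^* = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$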
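The classifier described on this slide amounts to an argmax over classes of the prior times the per-feature likelihoods. Here is a minimal sketch with two binary features; the function name `nb_predict`, the table layout, and every probability are hypothetical.

```python
def nb_predict(x, prior, likelihood):
    """Return the class y maximizing P(y) * prod_i P(x_i | y).

    prior[y] is P(Y = y); likelihood[y][i][v] is P(X_i = v | Y = y).
    """
    def score(y):
        s = prior[y]
        for i, value in enumerate(x):
            s *= likelihood[y][i][value]
        return s
    return max(prior, key=score)

# Hypothetical parameters for two binary features.
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": [{1: 0.8, 0: 0.2}, {1: 0.1, 0: 0.9}],
    "ham":  [{1: 0.1, 0: 0.9}, {1: 0.5, 0: 0.5}],
}

label_a = nb_predict((1, 0), prior, likelihood)  # spam: 0.288, ham: 0.030
label_b = nb_predict((0, 1), prior, likelihood)  # spam: 0.008, ham: 0.270
```

Multiplying raw probabilities is fine for two features; with many features the product should be replaced by a sum of logs.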

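14 MLE for the parameters of NB
- Given a dataset, count occurrences for all pairs: $\text{Count}(X_i = x_i, Y = y)$. How many pairs?
- MLE for discrete NB is simply:
- Prior: $P(Y = y) = \frac{\text{Count}(Y = y)}{\sum_{y'} \text{Count}(Y = y')}$
- Likelihood: $P(X_i = x \mid Y = y) = \frac{\text{Count}(X_i = x, Y = y)}{\sum_{x'} \text{Count}(X_i = x', Y = y)}$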
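The counting procedure can be sketched as follows. The function name `fit_nb_mle`, the dictionary layout keyed by `(feature index, class)`, and the four-example dataset are assumptions for illustration.

```python
from collections import Counter, defaultdict

def fit_nb_mle(X, y):
    """Maximum likelihood estimates for discrete naive Bayes by counting."""
    class_counts = Counter(y)
    prior = {c: n / len(y) for c, n in class_counts.items()}

    # pair_counts[(i, c)][v] = Count(X_i = v, Y = c)
    pair_counts = defaultdict(Counter)
    for row, c in zip(X, y):
        for i, value in enumerate(row):
            pair_counts[(i, c)][value] += 1

    # P(X_i = v | Y = c) = Count(X_i = v, Y = c) / Count(Y = c)
    likelihood = {
        (i, c): {v: n / class_counts[c] for v, n in counts.items()}
        for (i, c), counts in pair_counts.items()
    }
    return prior, likelihood

# Hypothetical toy dataset: two binary features, two classes.
X = [(1, 0), (1, 1), (0, 1), (0, 0)]
y = ["spam", "spam", "ham", "ham"]
prior, likelihood = fit_nb_mle(X, y)
```

Note that `likelihood[(0, "spam")]` has no entry for the value 0, because that pair never occurs in the data. Its MLE is zero, which is exactly the problem that smoothing addresses.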

15 NAÏVE BAYES CALCULATIONS

16 Subtleties of NB Classifier: #1 Violating the NB Assumption
- Usually, features are not conditionally independent: $P(X_1, \dots, X_n \mid Y) \neq \prod_i P(X_i \mid Y)$
- The naïve Bayes assumption is often violated, yet the classifier performs surprisingly well in many cases.
- Plausible reason: we only need the probability of the correct class to be the largest! Example: in two-way classification we just need to land on the correct side of 0.5, not recover the actual probability (0.51 is as good as 0.99).

17 Subtleties of NB Classifier: #2 Insufficient Training Data
- What if you never see a training instance where $X_1 = a$ and $Y = b$?
- Example: you never saw Y = Spam together with $X_1$ = Enlargement, so $P(X_1 = \text{Enlargement} \mid Y = \text{Spam}) = 0$.
- Thus, no matter what values $X_2, X_3, \dots, X_n$ take: $P(X_1 = \text{Enlargement}, X_2 = a_2, \dots, X_n = a_n \mid Y = \text{Spam}) = 0$. Why?
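The closing question has a short answer under the naïve Bayes factorization: the joint likelihood is a product, and one zero factor makes the whole product zero. Written out with the slide's symbols:

```latex
P(X_1 = \text{Enlargement}, X_2 = a_2, \dots, X_n = a_n \mid Y = \text{Spam})
  = \underbrace{P(X_1 = \text{Enlargement} \mid Y = \text{Spam})}_{=\,0}
    \;\prod_{i=2}^{n} P(X_i = a_i \mid Y = \text{Spam})
  = 0
```

So a single unseen (feature value, class) pair vetoes the class, however strongly the remaining features support it.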

18 For Binary Features: We already know the answer!
- MAP: use the most likely parameter.
- A Beta prior is equivalent to extra observations for each feature.
- As $N \to \infty$, the prior is forgotten.
- But for small sample sizes, the prior is important!

19 That's Great for Binomial
- Works for Spam / Ham
- What about multiple classes? E.g., given a Wikipedia page, predicting its type

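20 Multinomials: Laplace Smoothing
- Laplace's estimate: pretend you saw every outcome k extra times: $P_{LAP,k}(x) = \frac{c(x) + k}{N + k|X|}$, where c(x) is the count of outcome x, N is the number of observations, and |X| is the number of possible outcomes.
- Example observations: H H T
- What's Laplace with k = 0?
- k is the strength of the prior.
- Can derive this as a MAP estimate for a multinomial with Dirichlet priors.
- Laplace for conditionals: smooth each condition independently: $P_{LAP,k}(x \mid y) = \frac{c(x, y) + k}{c(y) + k|X|}$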
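The effect of k is easy to see on the slide's H H T observations. This is a minimal sketch; the function name `laplace_estimate` is hypothetical, and the set of possible outcomes must be supplied explicitly so that unseen outcomes also receive their k pseudo-counts.

```python
from collections import Counter

def laplace_estimate(observations, outcomes, k=1):
    """Add-k estimate: (count(x) + k) / (N + k * number_of_outcomes)."""
    counts = Counter(observations)
    total = len(observations) + k * len(outcomes)
    return {x: (counts[x] + k) / total for x in outcomes}

flips = ["H", "H", "T"]
outcomes = ["H", "T"]

p_k0 = laplace_estimate(flips, outcomes, k=0)      # plain MLE: H = 2/3
p_k1 = laplace_estimate(flips, outcomes, k=1)      # H = 3/5
p_k100 = laplace_estimate(flips, outcomes, k=100)  # H = 102/203, close to 1/2
```

With k = 0 the estimate reduces to the maximum likelihood estimate; as k grows, the estimate is pulled toward the uniform distribution. For the conditional version, the same function would be applied separately to the observations of each class.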

21 Probabilities: Important Detail!
- $P(\text{spam} \mid X_1, \dots, X_n) \propto P(\text{spam}) \prod_i P(X_i \mid \text{spam})$
- Any more potential problems here? We are multiplying lots of small numbers: danger of underflow! E.g., $0.5^{57} \approx 7 \times 10^{-18}$
- Solution? Use logs and add: $p_1 \cdot p_2 = e^{\log p_1 + \log p_2}$
- Always keep in log form.
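The underflow is easy to reproduce with double-precision floats. In this sketch the 100 probabilities of 1e-5 are hypothetical; the point is only that the direct product collapses to zero while the sum of logs stays well within range.

```python
import math

# Hypothetical per-feature likelihoods: 100 factors of 1e-5.
probs = [1e-5] * 100

# Direct product: the true value 1e-500 is below the smallest
# representable double, so the running product underflows to 0.0.
product = 1.0
for p in probs:
    product *= p

# Log form: add logs instead of multiplying probabilities.
log_score = sum(math.log(p) for p in probs)   # about -1151.29
```

Because log is monotonic, comparing `log_score` across classes picks the same winner as comparing the products would, without ever forming the product.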

22 NB for Text Classification: Learning
Learning phase: estimate $P(Y_m)$ and $P(X_i \mid Y_m)$

23 NB for Text Classification: Classification Given a new document having length L
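The learning and classification phases can be put together in a short sketch. It assumes a multinomial bag-of-words model with add-k smoothing and log-space scoring; the function names `train` and `classify`, the choice to skip out-of-vocabulary words, and the toy corpus are all assumptions rather than the slides' exact procedure.

```python
import math
from collections import Counter

def train(docs, k=1):
    """docs: list of (tokens, label). Returns vocabulary and log-parameters."""
    vocab = {w for tokens, _ in docs for w in tokens}
    label_counts = Counter(label for _, label in docs)
    word_counts = {c: Counter() for c in label_counts}
    for tokens, c in docs:
        word_counts[c].update(tokens)

    log_prior = {c: math.log(n / len(docs)) for c, n in label_counts.items()}
    log_lik = {}
    for c in label_counts:
        # Add-k smoothing over the vocabulary.
        total = sum(word_counts[c].values()) + k * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + k) / total) for w in vocab}
    return vocab, log_prior, log_lik

def classify(tokens, vocab, log_prior, log_lik):
    """Return argmax_c of log P(c) + sum over the L words of log P(w | c)."""
    def score(c):
        # Words never seen in training are skipped (an assumption).
        return log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
    return max(log_prior, key=score)

# Hypothetical toy corpus.
docs = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
]
model = train(docs)
label_1 = classify("buy cheap pills".split(), *model)
label_2 = classify("monday meeting".split(), *model)
```

The score sums one log-likelihood term per word position in the new document, so a document of length L contributes L terms plus the log prior.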

24 Example: (Borrowed from Dan Jurafsky)

25 Bayesian Learning
What if features are continuous? E.g., character recognition: $X_i$ is the i-th pixel.
$P(Y \mid X) \propto P(X \mid Y)\, P(Y)$
- Posterior: $P(Y \mid X)$
- Data likelihood: $P(X \mid Y)$
- Prior: $P(Y)$

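26 Bayesian Learning
What if features are continuous? E.g., character recognition: $X_i$ is the i-th pixel.
$P(Y \mid X) \propto P(X \mid Y)\, P(Y)$
$P(X_i = x \mid Y = y_k) = N(\mu_{ik}, \sigma_{ik})$, where $N(\mu_{ik}, \sigma_{ik}) = \frac{1}{\sigma_{ik} \sqrt{2\pi}}\, e^{-\frac{(x - \mu_{ik})^2}{2 \sigma_{ik}^2}}$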
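The Gaussian class-conditional density for a single feature can be written directly from its definition. A minimal sketch; the function name `gaussian_pdf` is hypothetical, and `sigma` is taken to be the standard deviation.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Density at the mean of a standard normal: 1 / sqrt(2 * pi).
peak = gaussian_pdf(0.0, 0.0, 1.0)
```

Note that this is a density, not a probability, so values above 1 are possible when sigma is small.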

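27 Gaussian Naïve Bayes
$P(Y \mid X) \propto P(X \mid Y)\, P(Y)$
$P(X_i = x \mid Y = y_k) = N(\mu_{ik}, \sigma_{ik})$, where $N(\mu_{ik}, \sigma_{ik}) = \frac{1}{\sigma_{ik} \sqrt{2\pi}}\, e^{-\frac{(x - \mu_{ik})^2}{2 \sigma_{ik}^2}}$
Sometimes assume the variance is independent of Y (i.e., $\sigma_i$), or independent of $X_i$ (i.e., $\sigma_k$), or both (i.e., $\sigma$).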

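28 Learning Gaussian Parameters
Maximum likelihood estimates:
- Mean: $\hat{\mu}_{ik} = \frac{\sum_j X_i^j\, \delta(Y^j = y_k)}{\sum_j \delta(Y^j = y_k)}$
- Variance: $\hat{\sigma}_{ik}^2 = \frac{\sum_j (X_i^j - \hat{\mu}_{ik})^2\, \delta(Y^j = y_k)}{\sum_j \delta(Y^j = y_k)}$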

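29 Learning Gaussian Parameters
Maximum likelihood estimates:
- Mean: $\hat{\mu}_{ik} = \frac{\sum_j X_i^j\, \delta(Y^j = y_k)}{\sum_j \delta(Y^j = y_k)}$
- Variance: $\hat{\sigma}_{ik}^2 = \frac{\sum_j (X_i^j - \hat{\mu}_{ik})^2\, \delta(Y^j = y_k)}{\sum_j \delta(Y^j = y_k)}$
Here the superscript j denotes the j-th training example, and $\delta(x) = 1$ if x is true, else 0.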
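The maximum likelihood estimates reduce to per-class, per-feature averages. The sketch below uses only the standard library; the function names, the parameter layout, and the four-point dataset are hypothetical, and the variance divides by the class count (the MLE), not by the count minus one.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """MLE of the prior, per-feature mean, and per-feature variance per class."""
    rows_by_class = defaultdict(list)
    for row, c in zip(X, y):
        rows_by_class[c].append(row)

    params = {}
    for c, rows in rows_by_class.items():
        n_c = len(rows)
        n_features = len(rows[0])
        mean = [sum(r[i] for r in rows) / n_c for i in range(n_features)]
        var = [sum((r[i] - mean[i]) ** 2 for r in rows) / n_c
               for i in range(n_features)]
        params[c] = {"prior": n_c / len(y), "mean": mean, "var": var}
    return params

def predict(x, params):
    """argmax over classes of log P(y) + sum_i log N(x_i; mu_iy, var_iy)."""
    def log_score(c):
        p = params[c]
        s = math.log(p["prior"])
        for xi, mu, var in zip(x, p["mean"], p["var"]):
            # Fails if var == 0; real data may need a small variance floor.
            s += -0.5 * math.log(2.0 * math.pi * var) - (xi - mu) ** 2 / (2.0 * var)
        return s
    return max(params, key=log_score)

# Hypothetical toy data: two continuous features, two classes.
X = [(1.0, 2.0), (3.0, 4.0), (10.0, 12.0), (12.0, 16.0)]
y = ["a", "a", "b", "b"]
params = fit_gaussian_nb(X, y)
label_near_a = predict((2.5, 3.0), params)
label_near_b = predict((11.0, 13.0), params)
```

Each class gets its own mean and variance for every feature, matching the $\mu_{ik}$, $\sigma_{ik}$ indexing; sharing a variance across classes or features would only change how `var` is pooled.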

31 What you need to know about Naïve Bayes
- Naïve Bayes classifier
  - What's the assumption
  - Why we use it
  - How do we learn it
  - Why is Bayesian estimation important
- Text classification
  - Bag of words model
- Gaussian NB
  - Features are still conditionally independent
  - Each feature has a Gaussian distribution given class
