Bayes and Naïve Bayes. cs534-machine Learning

Size: px

Start display at page:

Download "Bayes and Naïve Bayes. cs534-machine Learning"

Tamsyn Franklin
10 years ago
Views:

1 Bayes and aïve Bayes cs534-machine Learning

2 Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule In LDA we assumed that is Gaussian with different means and shared covariance matrix

Classifier, because of the use of the Bayes rule In LDA we

3 Joint Density estimation More generally, learning x is a density estimation problem, which is a challenging task Consider the case where x is a -dimensional binary vector Learning the joint distribution of x involves estimating 2 1parameters For a large, this number is prohibitively large Typically we don t have enough data to estimate the joint distribution accurately It is common to encounter the situation where no training examples have the exact x 1, 2,. value combination then x 0 for all values of.

large, this number is prohibitively large Typically we don t have enough data to estimate the joint distribution accurately It

4 aïve Bayes Assumption Assume that each feature is independent from one another given the class label Definition: is conditionally independent of given, if the probability distribution governing is independent of the value of, given the value of,,, Often denoted as, For example:, If, are conditional independent given, we have:,

probability distribution governing is independent of the value of, given the

5 aïve Bayes Classifier Under aïve Bayes assumption, we have: x Thus, no need to estimate the joint distribution Instead, only need to estimate for each feature Example: with binary features and classes, we reduce the number of parameters from 2 1 to Significantly reduces overfitting

for each feature Example: with binary features and classes, we reduce

6 Case study: spam filtering Bag-of-words to describe s Represent an by a vector whose dimension = the number of words in our vocabulary Example: Bernoulli feature x i =1 if the ith word is present x i =0 if the ith word is not present The ordering/position of the words do not matter

Example: Bernoulli feature x i =1 if the ith word is present x i =0 if

7 MLE for aïve Bayes with Bernoulli Model Suppose our training set contained s, likelihood estimate of the parameters are : the maximum P( y 1) P( x i, where i.e., the fraction of spam s where P( x i 1 1 y 1) 1 y 0) i 1 1 i 0 0, 1 is the number of i.e., the fraction of the nonspam s where x i spam s appeared x i appeared

$the fraction of spam emails where P( x i 1 1 y 1) 1 y 0) i 1 1 i 0 0, 1 is the$

8 Multinomial Model Instead of treating the absence/presence of each word in the dictionary as a Bernoulli, we can treat each word in the as rolling a sided die, where is the total size of the dictionary Each of the words has a fixed probability of being selected This allows us to take the counts of the words into consideration

die, where is the total size of the dictionary Each of the words has a fixed

9 MLE for aïve Bayes with Multinomial model The likelihood of observing one MLE estimate for the -th word in the dictionary: Total number of parameters:

10 Problem with MLE Suppose you picked up the new word Mahalanobis in your class and started using it in your x Because Mahalanobis (say it s the n+1 th word in the vocabulary) has never appeared in any of the training s, the probability estimate for this word will be p n+1 1 =0 and p n+1 0 =0 P( x y) 0 ow P( x y ) = for both y=0 and y=1 i i Given limited training data, MLE can result in probabilities of 0 or 1. Such extreme probabilities are too strong and cause problems.

probability estimate for this word will be p n+1 1 =0 and p n+1 0 =0 P( x y) 0 ow P( x y ) = for both y=0 and y=1 i i

11 Bayesian Parameter Estimation In the Bayesian framework, we consider the parameters we are trying to estimate to be random variables We give them a prior distribution, which is used to capture our prior believe about the parameter When the data is sparse and this allows us to fall back to the prior and avoid the issues faced by MLE

which is used to capture our prior believe about the parameter When the data is

12 Example: Bernoulli Given a unfair coin, we want to estimate - the probability of head We toss the coin times, and observe heads MLE estimate: ow let s go Bayesian and assume that is a random variable and has a prior distribution For reasons that will become clear later, we assume the following prior for : ~,, ;, 1, 1

Bayesian and assume that is a random variable and has a prior distribution For

13 Beta distribution uniform : U-shape uni-model symmetric

14 Posterior distribution of 1 1, 1 1, 1 oting that the posterior has exactly the same form as the prior. This is not a coincidence, it is due to careful selection of the prior distribution conjugate prior

15 Maximum a-posterior (MAP) A full Bayesian treatment will not care about estimating the parameter Instead, it will use the posterior distribution to make predictions for the target variable But if we do need to have a concrete estimate of the parameter We can use Maximum A Posterior estimation, that is

make predictions for the target variable But if we do need to have a

16 MAP for Bernoulli 1 D, 1 Mode: the maximum point for, is 1 2 In this case we have: 1 2 Let 2,2, we have Comparing the MLE, it is like adding some fake coin tosses (one head, one tail) into the observed data this is also called sometimes as Laplacian smoothing

like adding some fake coin tosses (one head, one tail) into the

17 Laplace Smoothing Suppose we estimate a probability P(z) and we have n 0 examples where 0and n 1 examples where 1 MLE estimate is n P ( z = 1) = 1 n 0 + n 1 Laplace Smoothing. Add 1 to the numerator and 2 to the denominator n1 1 P( z 1) n0 n1 2 If we don t observe any examples, we expect P(z=1) = 0.5, but our belief is weak (equivalent to seeing one example of each outcome).

Add 1 to the numerator and 2 to the denominator n1 1 P( z 1) n0 n1 2 If we don t observe any

18 MAP estimation for Multinomials The conjugate prior for multinomial is called Dirichlet distribution For outcomes, a Dirichlet distribution has parameters, each serves a similar purpose as Beta distribution s parameters Acting as fake observation(s) for the corresponding output, the count depends on the value of the parameter Laplace smoothing for this case: P( z k) n n k 1 K

purpose as Beta distribution s parameters Acting as fake observation(s) for the corresponding

19 MAP for aïve Bayes Spam Filter When estimating 1 and 0 MLE MAP Bernoulli case: P( x i 1 y MLE Multinomial case: 0 0) i 0 0 P( x 1 total # of word in ns s total # of words in ns s total # of word in ns s1 0 total # of words in ns s V where is the size of the vocabulary 1 2 When encounter a new word that has not appeared in training set, now the probabilities do not go to zero i y MAP 0) i 0 0

of word in ns emails1 0 total # of words in ns emails V where is the size of the vocabulary 1 2 When

20 aïve Bayes Summary Generative classifier learn P( x y ) and P(y) Use Bayes rule to compute P( y x ) for classification Assumes conditional independence between features given class labels Greatly reduces the numbers of parameters to learn MAP estimation (or Laplace smoothing) is necessary to avoid overfitting and extreme probability values

given class labels Greatly reduces the numbers of parameters to learn MAP estimation

1 Maximum likelihood estimation

COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N