CMPSCI 240: Reasoning about Uncertainty

Size: px

Start display at page:

Download "CMPSCI 240: Reasoning about Uncertainty"

Kelley Wright
8 years ago
Views:

1 CMPSCI 240: Reasoning about Uncertainty Lecture 18: Spam Filtering and Naive Bayes Classification Andrew McGregor University of Massachusetts Last Compiled: April 9, 2015

2 Review Total Probability If A 1,..., A n partition Ω then for any event B: P(B) P(B A 1 )P(A 1 ) P(B A n )P(A n ) n P(B A i )P(A i ) Bayes Theorem If A 1,..., A n partition Ω then for any event B: P(A i B) P(A i B) P(B) i1 P(B A i )P(A i ) n j1 P(B A j)p(a j )

3 Spam Filtering When you receive an you have two hypotheses : H 1 is spam H 2 is not spam and note that these are partitioning events. You have some observed data e.g., D contains the word cash How do you use the observed data to pick one of the hypotheses? We want to pick the maximum a posteriori hypothesis (MAP), i.e., the hypothesis H i than maximizes P(H i D).

events. You have some observed data e.g.

4 Using Bayes Theorem for Spam Filtering By Bayes theorem: P(H i D) P(H i D) P(D) P(D H i )P(H i ) We need to know the prior probabilities of H 1 and H 2 : P(H 1 )? P(H 2 )? i.e., before we knew anything above the , how likely was the to be spam or not. We also need to know that likelihood of the observed data given each hypothesis: P(D H 1 )? P(D H 2 )? i.e., for each hypothesis, what s the probability that we would see the observed data?

We also need to know that likelihood of the observed data given each hypothesis: P(D H 1 )? P(D H 2 )? i.e., for each hypothesis, what s the probability that we would see the observed data?

5 Computing Priors and Likelihoods To compute priors and likelihoods we use training or historical data. Suppose that we have analyzed 1000 pervious s and found: 700 of the s were spam 300 of the s were not spam From this, it would be natural to believe that 70% of future s were spam and that 30% of future s were not spam. Hence, we set the priors to be P (H 1 ) 0.7 and P (H 2 ) 0.3. Suppose we have also found that 350 of the spam s include the word cash 100 of the non-spam s include the word cash From this, it s natural to set the likelihoods to be P (D H 1 ) 1/2 and P (D H 2 ) 1/3.

believe that 70% of future emails were spam and that 30

6 Putting it all together If P (H 1 ) 0.7, P (H 2 ) 0.3, P (D H 1 ) 1/2, P (D H 2 ) 1/3 then: P(H 1 D) P(H 1 D) P(D) and similarly P(D H 1 )P(H 1 ) 1/ P(H 2 D) 0.3 1/3 0.1 Therefore the MAP hypothesis is H 1.

1 D) P(D) and similarly P(D H 1 )P(H 1 ) 1/2 0.7 0.

7 Combining Observed Data Suppose we have two pieces of observed data: D 1 contains the word cash D 2 contains the word pharmacy The MAP hypothesis is the one maximizing P(H j D 1 D 2 ) where P(H j D 1 D 2 ) P(D 1 D 2 H j )P(H j ) P(D 1 D 2 H 1 )P(H 1 ) + P(D 2 D 2 H 2 )P(H 2 )

MAP hypothesis is the one maximizing P(H j D 1 D 2 ) where P(H j D 1 D

8 Naive Bayes Classification In Naive Bayes Classification (NBC) we assume that the observed data is independent conditioned on either of the hypotheses, i.e., P(D 1 D 2 H 1 ) P(D 1 H 1 )P(D 2 H 1 ) P(D 1 D 2 H 2 ) P(D 1 H 2 )P(D 2 H 2 ) And therefore NBC picks the hypothesis H j that maximizes P(D 1 D 2 H j )P(H j ) P(D 1 H j )P(D 2 H j )P(H j ) Hence, we just need to compute the priors P(H 1 ), P(H 2 ) and the likelihoods P(D 1 H 1 ), P(D 2 H 1 ), P(D 1 H 2 ), P(D 2 H 2 ) from the training data.

therefore NBC picks the hypothesis H j that maximizes P(D 1 D 2 H j )P(H j ) P(D 1 H j )P(D 2 H j )P(H j ) Hence, we just

9 Picking an Hypothesis Suppose we have multiple hypotheses H 1, H 2,... H k and we have priors for these hypotheses P(H 1 ), P(H 2 ),..., P(H k ). If we observe some event D, the MAP hypothesis is the hypothesis H i that maximizes P(H i D) P(D H i)p(h i ) j P(D H j)p(h j ) To find this we need likelihoods P(D H j ) for each hypothesis H j. If we observe multiple events D 1, D 2,..., D t, the MAP hypothesis is the hypothesis H i that maximizes P(H i D 1 D 2... D t ) P(D 1 D 2... D t H i )P(H i ) j P(D 1 D 2... D t H j )P(H j ) In Naive Bayes Classification we assume that for each j: P(D 1 D 2... D t H j ) P(D 1 H j )P(D 2 H j )... P(D t H j )

P(D H j ) for each hypothesis H j. If we observe multiple events D 1, D 2,..., D t, the MAP hypothesis is the hypothesis H i that maximizes P(H i D 1 D 2.

10 Combing Evidence: Bird Watching You see a blue parrot and you hypothesize that it s a Norwegian Blue Your birdwatching book says: Only 10% of blue parrots are Norwegian Blues Norwegian Blues spend 60% of their time lying down Other blue parrots only spend 20% of their time lying down 80% of Norwegian Blues have lovely plumage 20% of other blue parrots have lovely plumage. The blue parrot you observe is lying down (D 1 ) and has lovely plumage (D 2 ): what is the likelihood it s a Norwegian Blue (N)? P (N D 1 D 2 ) P (D 1 D 2 N) P (N) P (D 1 D 2 N) P (N) + P (D 1 D 2 N c ) P (N c ) Not enough information for P (D 1 D 2 N) & P (D 1 D 2 N c ) Could use NBC: Assume P (D 1 D 2 N) P (D 1 N) P (D 2 N) and P (D 1 D 2 N c ) P (D 1 N c ) P (D 2 N c )

plumage. The blue parrot you observe is lying down (D 1 ) and has lovely plumage (D 2 ): what is the likelihood it s a Norwegian Blue (N)?

Math 141. Lecture 2: More Probability! Albyn Jones 1. jones@reed.edu www.people.reed.edu/ jones/courses/141. 1 Library 304. Albyn Jones Math 141

Math 141. Lecture 2: More Probability! Albyn Jones 1. jones@reed.edu www.people.reed.edu/ jones/courses/141. 1 Library 304. Albyn Jones Math 141 Math 141 Lecture 2: More Probability! Albyn Jones 1 1 Library 304 jones@reed.edu www.people.reed.edu/ jones/courses/141 Outline Law of total probability Bayes Theorem the Multiplication Rule, again Recall