BUILDING A SPAM FILTER USING NAÏVE BAYES
Spam or not Spam: that is the question.

From: "" <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay

Anyone can buy real estate with no money down
Stop paying rent TODAY!
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
Change your life NOW!
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================
Categorization/Classification Problems

Given:
- A description of an instance x \in X, where X is the instance language or instance space. (Issue: how do we represent text documents?)
- A fixed set of categories: C = \{c_1, c_2, \ldots, c_n\}

Determine:
- The category of x: c(x) \in C, where c(x) is a categorization function whose domain is X and whose range is C.

We want to automatically build categorization functions ("classifiers").
EXAMPLES OF TEXT CATEGORIZATION
- Categories = SPAM?: spam / not spam
- Categories = TOPICS: finance / sports / asia
- Categories = OPINION: like / hate / neutral
- Categories = AUTHOR: Shakespeare / Marlowe / Ben Jonson (the Federalist Papers)
Bayesian Methods for Classification

Uses Bayes' theorem to build a generative model that approximates how the data is produced.

First step:

P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}

where C: categories, X: instance to be classified.

Uses the prior probability of each category given no information about an item. Categorization produces a posterior probability distribution over the possible categories given a description of each instance.
Maximum a posteriori (MAP) Hypothesis

Let c_{MAP} be the most probable category. Then goodbye to that nasty normalization!

c_{MAP} = \arg\max_{c \in C} P(c \mid X)
        = \arg\max_{c \in C} \frac{P(X \mid c)\, P(c)}{P(X)}
        = \arg\max_{c \in C} P(X \mid c)\, P(c)    (as P(X) is constant)

No need to compute P(X)!
Maximum Likelihood Hypothesis

If all hypotheses are a priori equally likely, then to find the maximally likely category c_{ML} we only need to consider the P(X \mid c) term:

c_{ML} = \arg\max_{c \in C} P(X \mid c)

This is the Maximum Likelihood Estimate (MLE).
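To make the "no need to compute P(X)" point concrete, here is a minimal Python sketch with invented priors and likelihoods (the numbers are not from the slides): comparing the unnormalized products P(X|c)P(c) already yields the MAP category.

```python
# Toy MAP decision: P(X) is shared by every category, so we can skip it.
# The numbers below are made up for illustration.
priors = {"spam": 0.4, "not_spam": 0.6}            # P(c)
likelihoods = {"spam": 0.002, "not_spam": 0.0001}  # P(X | c) for one instance X

# Unnormalized posteriors P(X | c) * P(c)
scores = {c: likelihoods[c] * priors[c] for c in priors}
c_map = max(scores, key=scores.get)

print(scores)   # {'spam': 0.0008, 'not_spam': 6e-05}
print(c_map)    # 'spam' -- same answer we would get after dividing by P(X)
```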
Naïve Bayes Classifiers: Step 1

Assume that instance X is described by an n-dimensional vector of attributes \langle x_1, x_2, \ldots, x_n \rangle. Then

c_{MAP} = \arg\max_{c \in C} P(c \mid x_1, x_2, \ldots, x_n)
        = \arg\max_{c \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c)\, P(c)}{P(x_1, x_2, \ldots, x_n)}
        = \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\, P(c)
Naïve Bayes Classifier: Step 2

To estimate:
- P(c_j): can be estimated from the frequency of classes in the training examples.
- P(x_1, x_2, \ldots, x_n \mid c_j): Problem! O(|X|^n \cdot |C|) parameters are required to estimate the full joint probability distribution.

Solution: the Naïve Bayes Conditional Independence Assumption

P(x_1, x_2, \ldots, x_n \mid c_j) = \prod_i P(x_i \mid c_j)

which gives

c_{MAP} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j)
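A rough illustration of why the independence assumption matters, using arbitrary sizes (20 binary attributes, 2 categories are assumptions for the example) to compare the parameter counts of the full joint distribution with the factored model:

```python
# Parameter counts for n binary attributes (illustrative arithmetic only).
n_attributes = 20          # n
values_per_attribute = 2   # |X|
n_categories = 2           # |C|

full_joint = (values_per_attribute ** n_attributes) * n_categories   # O(|X|^n * |C|)
naive_bayes = n_attributes * values_per_attribute * n_categories     # O(n * |X| * |C|)

print(full_joint)   # 2097152
print(naive_bayes)  # 80
```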
Naïve Bayes Classifier for Binary Variables

Flu example: class Flu; binary features X_1, \ldots, X_5 = runny nose, sinus, cough, fever, muscle ache.

Conditional Independence Assumption: features are independent of each other given the class:

P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C)\, P(X_2 \mid C) \cdots P(X_5 \mid C)
Learning the Model

(Bayes net: class C with children X_1, X_2, X_3, X_4, X_5, X_6.)

First attempt: maximum likelihood estimates. Given training data for N individuals, where count(X = x) is the number of those individuals for which X = x (e.g. Flu = true), for each category c and each value x of a variable X:

\hat{P}(c) = \frac{count(C = c)}{N}

\hat{P}(x \mid c) = \frac{count(X = x, C = c)}{count(C = c)}
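A minimal sketch of these counting estimates on a tiny, made-up flu dataset (the five records and the two feature names are invented for illustration, not taken from the slides):

```python
# Maximum likelihood estimates from counts on a made-up binary dataset.
# Each record: (class, {feature: value}).
data = [
    ("flu",   {"runnynose": True,  "cough": True}),
    ("flu",   {"runnynose": True,  "cough": False}),
    ("noflu", {"runnynose": False, "cough": True}),
    ("noflu", {"runnynose": False, "cough": False}),
    ("noflu", {"runnynose": True,  "cough": False}),
]

N = len(data)
classes = {c for c, _ in data}

# P^(c) = count(C = c) / N
prior = {c: sum(1 for cls, _ in data if cls == c) / N for c in classes}

# P^(x | c) = count(X = x, C = c) / count(C = c)
def cond_prob(feature, value, c):
    in_class = [feats for cls, feats in data if cls == c]
    return sum(1 for f in in_class if f[feature] == value) / len(in_class)

print(prior)                                  # P(flu) = 0.4, P(noflu) = 0.6
print(cond_prob("runnynose", True, "flu"))    # 1.0
print(cond_prob("runnynose", True, "noflu"))  # ~0.333
```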
Problem with Max Likelihood for Naïve Bayes

(Flu example again: X_1, \ldots, X_5 = runny nose, sinus, cough, fever, muscle ache.)

P(X_1, \ldots, X_5 \mid Flu) = P(X_1 \mid Flu)\, P(X_2 \mid Flu) \cdots P(X_5 \mid Flu)

What if there are no training cases where the patient had muscle aches but no flu?

\hat{P}(X_5 = t \mid \neg flu) = \frac{count(X_5 = t, C = \neg flu)}{count(C = \neg flu)} = 0

So if X_5 = t, then P(X_1, \ldots, X_5 \mid \neg flu) = 0.

Zero probabilities overwhelm any other evidence!
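A small demonstration of the problem with invented per-feature probabilities: a single zero estimate drives the whole product to zero, no matter how strong the other factors are.

```python
# Suppose P^(muscle_ache = t | noflu) came out 0 because that combination
# never appeared in training. All probabilities here are made up.
p_given_noflu = [0.3, 0.4, 0.5, 0.6, 0.0]   # last factor is the zero estimate
p_given_flu   = [0.9, 0.8, 0.7, 0.6, 0.5]

def product(ps):
    result = 1.0
    for p in ps:
        result *= p
    return result

print(product(p_given_flu))    # ~0.151
print(product(p_given_noflu))  # 0.0 -- noflu can never win, whatever the prior
```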
Add-1 (Laplace) Smoothing to Avoid Overfitting

\hat{P}(x \mid c) = \frac{count(X = x, C = c) + 1}{count(C = c) + |X|}

where |X| is the number of values of X_i (here 2).

Slightly better version:

\hat{P}(x \mid c) = \frac{count(X = x, C = c) + \alpha}{count(C = c) + \alpha|X|}

where \alpha controls the extent of smoothing.
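A direct translation of the two smoothing formulas into small Python helpers; the counts passed in the example calls are hypothetical.

```python
# Add-1 and add-alpha smoothing for a binary feature, following the formulas above.
# count_xc = count(X = x, C = c), count_c = count(C = c), n_values = |X| (here 2).

def laplace(count_xc, count_c, n_values=2):
    """Add-1 (Laplace) estimate of P^(x | c)."""
    return (count_xc + 1) / (count_c + n_values)

def add_alpha(count_xc, count_c, n_values=2, alpha=0.5):
    """Add-alpha estimate; alpha controls the extent of smoothing."""
    return (count_xc + alpha) / (count_c + alpha * n_values)

# The problematic zero count from the flu example no longer gives probability 0:
print(laplace(0, 10))            # 1/12 ~ 0.083
print(add_alpha(0, 10, 2, 0.5))  # 0.5/11 ~ 0.045
```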
Using Naïve Bayes Classifiers to Classify Text: Basic Method

As a generative model:
1. Randomly pick a category c according to P(c).
2. For a document of length N, for each word position i: generate word w_i according to P(w \mid c).

P(c, D = \langle w_1, w_2, \ldots, w_N \rangle) = P(c) \prod_{i=1}^{N} P(w_i \mid c)

This is a Naïve Bayes classifier for multinomial variables. Note that word order really doesn't matter here: the same parameters are used for each position. The result is the "bag of words" model, which views a document not as an ordered list of words but as a multiset.
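A short sketch of the bag-of-words point, using invented word probabilities: two documents with the same word counts but different word order receive exactly the same score.

```python
# Under the multinomial model the document score depends only on word counts,
# not word order. The probabilities below are invented for the sketch.
p_word_given_c = {"buy": 0.2, "now": 0.1, "estate": 0.05}
prior_c = 0.4

def joint_score(words, prior, p_w):
    score = prior
    for w in words:
        score *= p_w[w]       # P(c) * prod_i P(w_i | c)
    return score

doc1 = ["buy", "estate", "now", "buy"]
doc2 = ["now", "buy", "buy", "estate"]   # same multiset, different order

print(joint_score(doc1, prior_c, p_word_given_c))  # ~8e-05
print(joint_score(doc2, prior_c, p_word_given_c))  # identical to doc1
```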
Naïve Bayes: Learning (First Attempt)

From the training corpus, extract the Vocabulary. Calculate the required estimates of the P(c) and P(w \mid c) terms.

For each c_j in C:

P(c_j) = \frac{count_{docs}(C = c_j)}{|docs|}

where count_{docs}(\cdot) is the number of documents for which \cdot is true.

For each word w_i \in Vocabulary and c \in C:

P(w_i \mid c) = \frac{count_{doctokens}(W = w_i, C = c)}{count_{doctokens}(C = c)}

where count_{doctokens}(\cdot) is the number of tokens, over all documents, for which \cdot is true of that document and that token.
Naïve Bayes: Learning (Second Attempt)

Laplace smoothing must be done over the vocabulary items. We can assume we have at least one instance of each category, so we don't need to smooth the class priors.

Assume a single new word, UNK, that occurs nowhere within the training document set. Map all unknown words in the documents to be classified (test documents) to UNK.

For 0 < \alpha \leq 1:

P(w_i \mid c) = \frac{count_{doctokens}(W = w_i, C = c) + \alpha}{count_{doctokens}(C = c) + \alpha(|V| + 1)}
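A sketch of this training procedure on an invented two-document corpus (the class names, the documents, and the choice of α = 1 are assumptions made for the example):

```python
# Training the multinomial model with add-alpha smoothing and an UNK slot.
from collections import Counter

docs = [
    ("spam",    "buy estate now now"),
    ("notspam", "meeting agenda estate"),
]
alpha = 1.0

vocab = {w for _, text in docs for w in text.split()}
classes = {c for c, _ in docs}

# P(c): document frequencies (left unsmoothed, as on the slide)
prior = {c: sum(1 for cls, _ in docs if cls == c) / len(docs) for c in classes}

# Token counts per class
tokens = {c: Counter() for c in classes}
for c, text in docs:
    tokens[c].update(text.split())

def p_word(w, c):
    """P(w | c) with add-alpha smoothing over |V| + 1 outcomes (vocabulary + UNK)."""
    if w not in vocab:
        w = "UNK"                      # unseen test words all map to UNK
    numer = tokens[c][w] + alpha       # Counter returns 0 for unseen keys, incl. UNK
    denom = sum(tokens[c].values()) + alpha * (len(vocab) + 1)
    return numer / denom

print(prior)                    # P(spam) = 0.5, P(notspam) = 0.5
print(p_word("now", "spam"))    # (2 + 1) / (4 + 6) = 0.3
print(p_word("zzz", "spam"))    # UNK: (0 + 1) / (4 + 6) = 0.1
```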
Naïve Bayes: Classifying

Compute c_{NB} using either

c_{NB} = \arg\max_{c} P(c) \prod_{i=1}^{N} P(w_i \mid c)

or

c_{NB} = \arg\max_{c} P(c) \prod_{w \in V} P(w \mid c)^{count(w)}

where count(w) is the number of times word w occurs in the document. (The two are equivalent.)
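A quick check of the claimed equivalence, using made-up parameters for a single class (a real classifier would compute this score for every class and take the argmax):

```python
# The per-position product and the count-exponent form give the same score.
import math
from collections import Counter

p_c = 0.5
p_w_given_c = {"buy": 0.3, "now": 0.2, "estate": 0.1}

doc = ["buy", "now", "now", "estate"]

# Form 1: P(c) * prod over positions i of P(w_i | c)
score_positions = p_c
for w in doc:
    score_positions *= p_w_given_c[w]

# Form 2: P(c) * prod over vocabulary words of P(w | c) ** count(w)
counts = Counter(doc)
score_counts = p_c
for w, k in counts.items():
    score_counts *= p_w_given_c[w] ** k

print(score_positions)                              # ~0.0006
print(score_counts)                                 # ~0.0006
print(math.isclose(score_positions, score_counts))  # True
```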
PANTEL AND LIN: SPAMCOP

Uses a Naïve Bayes classifier: M is spam if P(Spam \mid M) > P(NonSpam \mid M).

Method:
- Tokenize the message using the Porter stemmer.
- Estimate P(w_k \mid C) using the m-estimate (a form of smoothing).
- Remove words that do not satisfy certain conditions.

Train: 160 spams, 466 non-spams. Test: 277 spams, 346 non-spams.
Results: error rate of 4.33%. Worse results using trigrams.
Naïve Bayes Is (Was) Not So Naïve

- Naïve Bayes took first and second place in the KDD-CUP 97 competition among 16 (then) state-of-the-art algorithms. Goal: a financial services industry direct mail response prediction model: predict whether the recipient of the mail will actually respond to the advertisement. 750,000 records.
- A good, dependable baseline for text classification, but not the best by itself!
- Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem.
- Very fast: learning requires one pass over the data; testing is linear in the number of attributes and the document collection size.
- Low storage requirements.
Engineering: Underflow Prevention

Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow. Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities. The class with the highest final un-normalized log probability score is still the most probable:

c_{NB} = \arg\max_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \Big]
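A minimal sketch of the log-space computation, with toy probabilities chosen small enough to make the point:

```python
# Summing log probabilities instead of multiplying probabilities (toy values).
import math

log_prior = math.log(0.5)
log_p_words = [math.log(p) for p in (1e-5, 2e-7, 3e-6)]   # tiny P(x_i | c_j) values

# A product of many tiny numbers can underflow to 0.0; the sum of logs will not.
log_score = log_prior + sum(log_p_words)
print(log_score)   # ~ -40.35; the argmax over classes is unchanged
```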
REFERENCES

Mosteller, F. & Wallace, D. L. (1984). Applied Bayesian and Classical Inference: The Case of the Federalist Papers (2nd ed.). New York: Springer-Verlag.

Pantel, P. & Lin, D. (1998). SPAMCOP: A Spam Classification and Organization Program. In Proc. of the 1998 AAAI Workshop on Learning for Text Categorization.

Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1), 1-47.