Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007
Outline
- Naïve Bayes
  - Components
  - ML vs. MAP
  - Benefits
- Feature Preparation
  - Filtering
  - Decay
- Extended Examples
  - Spell Checking
  - Spam Filtering
- Ensemble Learning
Bayes Rule
P(c | d) = P(d | c) P(c) / P(d)
Three parts. What can be said about the third part (the denominator) for classification tasks?
Bayes Rule
Three parts. What can be said about the third part, P(d), for classification tasks?
- Unnecessary if we only care about the classification, not the probability estimate.
- May result in division by zero in domains where previously unseen features arise. Can you think of such a domain?
So the denominator is either ignored entirely, or treated as a normalizing constant in tasks where we do need the estimates.
Bayes Rule
What about the class priors P(c)? How do they affect the probability estimates? Do we need the class priors?
Bayes Rule: ML and MAP
ML (Maximum Likelihood) selects the class that maximizes P(d | c)
- Class priors are uniform, or ignored
MAP (Maximum a Posteriori) selects the class that maximizes P(d | c) P(c)
Both are embodiments of Ockham's razor
ML may be problematic when the data set is small
MAP may be less appropriate when the class priors are suspect
Bayes Rule
Finally, if we assume conditional independence of the features:
P(d | c) = P(f1 | c) P(f2 | c) ... P(fn | c)
Is this assumption reasonable?
Bayes Rule: Naïve Bayes
And, finally, we arrive at MAP Naïve Bayes:
c* = argmax over c of P(c) P(f1 | c) P(f2 | c) ... P(fn | c)
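To make the decision rule concrete, here is a minimal Python sketch of MAP Naïve Bayes over categorical features; the `priors` and `cond_probs` tables and the log-space scoring are illustrative assumptions, not part of the original slides.

```python
import math

def naive_bayes_map(features, priors, cond_probs):
    """Pick the class c maximizing P(c) * prod_i P(f_i | c).

    priors:     dict class -> P(c)
    cond_probs: dict class -> dict feature -> P(f | c)
    Log-probabilities are summed rather than multiplying raw probabilities,
    to avoid numeric underflow on long feature lists.
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for f in features:
            # Assumes every feature has an estimated probability for every
            # class (smoothing, next slide, is one way to guarantee this).
            score += math.log(cond_probs[c][f])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Dropping the `math.log(prior)` term (or using uniform priors) gives the ML variant from the previous slide.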
Smoothing
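The slide gives only the heading; a common fix for the zero-probability problem raised earlier is Laplace (add-one) smoothing, sketched here under that assumption for binary present/absent features.

```python
from collections import defaultdict

def smoothed_cond_probs(examples, alpha=1.0):
    """Estimate P(f | c) with add-alpha (Laplace) smoothing.

    examples: list of (feature_set, class_label) pairs.
    Returns dict class -> dict feature -> smoothed probability.
    A feature never seen with a class gets alpha / (count(c) + 2*alpha)
    instead of zero.
    """
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(int))
    vocabulary = set()
    for features, label in examples:
        class_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
            vocabulary.add(f)
    return {c: {f: (feature_counts[c][f] + alpha) / (n_c + 2 * alpha)
                for f in vocabulary}
            for c, n_c in class_counts.items()}
```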
Time/Space Complexities
Naïve Bayes training: O(examples * features)
Decision tree training: O(examples * features^2)
What about space?
Feature Preparation
- Filtering
  - TFIDF (Lift)
  - Mutual Information
- Time Decay
- Tokenization
Feature Filtering
Why?
- Efficiency: text classification often involves a huge number of features; remove features while maintaining accuracy. Features which are independent of the class provide no information.
- Accuracy: helps prevent overfitting.
Feature Filtering: Lift
The lift of a feature value is the ratio of the confidence of the feature value to the expected confidence of the feature value.
Local (individual example) confidence vs. global (all examples) confidence.
How do we use lift?
- Order features by lift
- Keep the top X features, or the features above a certain threshold
Feature Filtering: TFIDF
TFIDF is one lift measure which is useful in text classification tasks.
TFIDF (Term Frequency, Inverse Document Frequency)
- Intuitively, it's how important a word is to a document in a collection
- Has its own Wikipedia page
- TFIDF, or TF/DF, is the term frequency tf_i (fraction of tokens that are term i) divided by the document frequency df_i = |{d in D : term i appears in d}| / |D|
Examples (word = tf/df = lift):
- some = .005 / .8 = .006
- a = .01 / 1 = .01
- football = .01 / .05 = .2
- Packers = .01 / .01 = 1
Feature Filtering: TFIDF
TFIDF Filtering Benefits
- Accuracy (a 2.2% increase at Yahoo!)
- Speed and memory (fewer features)
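A small Python sketch of the lift-style filtering described on the last two slides: compute tf/df per word and keep only the highest-lift features. The corpus layout (a list of token lists) and the `keep_top` cutoff are assumptions made for illustration.

```python
from collections import Counter

def tfdf_lift(documents):
    """documents: list of token lists. Returns dict word -> tf/df lift."""
    n_docs = len(documents)
    total_tokens = sum(len(doc) for doc in documents)
    term_counts = Counter(tok for doc in documents for tok in doc)
    doc_counts = Counter(tok for doc in documents for tok in set(doc))
    lift = {}
    for word, count in term_counts.items():
        tf = count / total_tokens        # overall term frequency
        df = doc_counts[word] / n_docs   # fraction of documents containing the word
        lift[word] = tf / df
    return lift

def keep_top(lift, x):
    """Order features by lift and keep the top x."""
    return set(sorted(lift, key=lift.get, reverse=True)[:x])
```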
Feature Filtering: Mutual Information Another manner of filtering is to measure how well a feature discriminates between classes.
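The slide states the idea without a formula; one standard measure is the mutual information I(F; C) between a feature's presence and the class, estimated from counts as in this hypothetical sketch.

```python
import math
from collections import Counter

def mutual_information(examples, feature):
    """I(F; C) for the binary 'feature is present' indicator vs. the class.

    examples: list of (feature_set, class_label) pairs.
    """
    n = len(examples)
    joint = Counter((feature in feats, label) for feats, label in examples)
    f_marg = Counter(f for f, _ in joint.elements())
    c_marg = Counter(c for _, c in joint.elements())
    mi = 0.0
    for (f, c), count in joint.items():
        p_fc = count / n
        p_f = f_marg[f] / n
        p_c = c_marg[c] / n
        mi += p_fc * math.log(p_fc / (p_f * p_c))
    return mi
```

Features with near-zero mutual information are exactly the ones that are (nearly) independent of the class and so can be filtered out.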
Feature Decay
We already saw how we can weight individual feature values with lift. We can also weight an example as a whole. Often we want to reduce the contribution of an example after it gets old.
decay = reduce the contribution of an example to the classifier as time t passes
Use the exponential-decay ("chemistry") formula: N_t = N_0 * e^(-lambda * t)
- 7-day half life: lambda = ln(2)/7 ≈ 0.099
- 180-day half life: lambda = ln(2)/180 ≈ 0.00385
Example: 180-day half life, 30 days old: a weight of 1.0 is decayed to 0.89
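A one-function sketch of this decay weighting; the half-life parameterization follows the numbers on the slide, everything else is illustrative.

```python
import math

def decay_weight(age_days, half_life_days=180.0):
    """Exponential decay N_t = N_0 * exp(-lambda * t) with N_0 = 1.

    lambda = ln(2) / half_life, so the weight halves every half_life_days.
    """
    lam = math.log(2) / half_life_days
    return math.exp(-lam * age_days)

# decay_weight(30, 180) -> ~0.89, matching the slide's example
```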
Tokenization
Add phrases as features: use a sliding window.
Example spam: "Mr. Holloway, I invite you to use our consolidated student loan services. We can save you $50,000 on your student loans."
Window of size 2, new features: "Mr. Holloway", "Holloway I", "I invite", "invite you", "you to", "to use", and so on.
Use lift to weed out poor combinations.
Why? If we know of dependencies, but want to keep the independence assumption, explicitly adding the dependent features as new features may improve performance.
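A short sketch of the size-2 sliding window; the whitespace tokenizer is an assumption made to keep the example small.

```python
def sliding_window_features(text, window=2):
    """Return the word n-grams of the given window size as feature strings."""
    tokens = text.split()  # assumes simple whitespace tokenization
    return [" ".join(tokens[i:i + window])
            for i in range(len(tokens) - window + 1)]

# sliding_window_features("Mr. Holloway, I invite you to use ...")
# -> ["Mr. Holloway,", "Holloway, I", "I invite", "invite you", "you to", ...]
```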
Example 1: Spell Checker
Spell Checker from Peter Norvig http://www.norvig.com/spell-correct.html Source code provided in Python, Scheme, Perl, C, Java, Haskell, F#, Ruby, Erlang, and Rebol
Spell Checker
P(c), the language model, is the probability that a proposed correction c stands on its own. Intuitively: how likely is c to appear in an English text?
- P("the") would have a relatively high probability
- P("zxzxzxzyyy") would be near zero
Should we use words, or phrases, or something else?
P(w | c), the error model, is the probability that w would be typed in a text when the author meant c. Intuitively: how likely is it that the author would type w by mistake when c was intended?
Spell Checker
Where does P(c) come from?
- Read in a bunch of books, webpages, Wikipedia, etc.
- Google makes its phrase-count data available (http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html) - 24 GB compressed, just to warn you
What about unseen classes (words never observed in the corpus)?
Spell Checker
Where does P(w | c) come from?
Trivial model:
- Use edit distance to generate and score possibilities
- Consider only possibilities that have already been seen (real words / phrases)
Can you think of another way to get these probabilities?
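A compact sketch in the spirit of Norvig's corrector (see the URL above): generate edit-distance-1 candidates, keep only words seen in the corpus, and pick the most frequent. The uniform error model and the `WORD_COUNTS` table are simplifying assumptions for illustration.

```python
import string

# WORD_COUNTS is assumed to be a dict word -> count built from a large corpus.
WORD_COUNTS = {}

def edits1(word):
    """All strings one insertion, deletion, replacement, or transposition away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """argmax over candidates c of P(c), with a uniform (trivial) error model."""
    candidates = (({word} if word in WORD_COUNTS else set())
                  or {c for c in edits1(word) if c in WORD_COUNTS}
                  or {word})
    return max(candidates, key=lambda c: WORD_COUNTS.get(c, 0))
```

With a corpus of real errors (next slide), the uniform error model would be replaced by learned edit probabilities.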
Spell Checker Can you think of another way to get these probabilities? Get a corpus of spelling errors, and count how likely it is to make each insertion, deletion, or alteration, given the surrounding characters. Incorporate feedback from users
Spell Checker Could we personalize this spell checker? Would it make sense to do so? Any questions about this example? Comments?
Example 2: Spam Filter
Spam Filter
From Paul Graham's essays:
- A Plan for Spam (http://www.paulgraham.com/spam.html)
- Better Bayesian Filtering (http://www.paulgraham.com/better.html) - better tokenization (more separators)
Note: these are non-personalized filters
Spam Filter: Feature Preparation
1. Gather spam and non-spam emails
2. Convert the emails to sets of features (sometimes called a "bag of words")
   - Tokenize
   - Use TFIDF to remove common words
   - Remove duplicates (should we do this?)
Example: "The CSGA is meeting for lunch today. Free pizza will be served at the meeting." => CSGA, meeting, lunch, today, free, pizza, served
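A tiny sketch of step 2; the stop-word set stands in for the TFIDF filtering step and is an assumption, as is the regular-expression tokenizer.

```python
import re

STOP_WORDS = {"the", "is", "for", "will", "be", "at", "a", "an", "of", "to"}

def email_features(text):
    """Tokenize, drop common words, and return the de-duplicated feature set."""
    tokens = re.findall(r"[A-Za-z$]+", text.lower())
    return {tok for tok in tokens if tok not in STOP_WORDS}

# email_features("The CSGA is meeting for lunch today. Free pizza will be served at the meeting.")
# -> {"csga", "meeting", "lunch", "today", "free", "pizza", "served"}
```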
Spam Filter
"I get a lot of email containing the word 'Lisp', and (so far) no spam that does."
P(c): c is binary (spam, not spam)
Graham uses an equal number of spam and non-spam messages, which amounts to ML.
What are the conditions under which we should think seriously about this parameter? (Remember the ML vs. MAP discussion.)
Spam Filter
P(f | c): just count the tokens and divide by the number of emails in the class.
Any observations?
Example values of P(f | spam):
  perl       0.01
  python     0.01
  tcl        0.01
  scripting  0.01
  morris     0.01
  graham     0.01491078
  guarantee  0.9762507
  cgi        0.9734398
  paul       0.027040077
  quite      0.030676773
  pop3       0.042199217
  various    0.06080265
  prices     0.9359873
  managed    0.06451222
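A sketch of this counting step, reusing the `email_features` idea above; the per-class data layout is an assumption.

```python
from collections import Counter

def token_probabilities(emails):
    """P(f | c): fraction of emails in the class that contain each token.

    emails: list of feature sets for a single class (e.g. all spam).
    """
    n = len(emails)
    counts = Counter(tok for features in emails for tok in features)
    return {tok: count / n for tok, count in counts.items()}
```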
Spam Filter
How to use the spam filter:
1. A new email arrives. It is converted to tokens, just as the training examples were.
2. For each token in the new email, we look up (constant time) the probability, and multiply the probabilities together.
3. We then have the probability that it's spam and the probability that it's not spam. We choose the greater of the two (MAP) and filter the email appropriately.
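Putting the pieces together as the slide describes; the probability tables come from `token_probabilities` above, and working in log space (plus a fallback for unseen tokens) is an added assumption, not something the slides prescribe.

```python
import math

def classify(features, spam_probs, ham_probs, prior_spam=0.5):
    """Return "spam" or "not spam" by comparing P(c) * prod P(f | c)."""
    def score(probs, prior):
        s = math.log(prior)
        for f in features:
            # Unseen tokens fall back to a small probability (an assumption;
            # a smoothed estimate would be the more principled choice).
            s += math.log(probs.get(f, 1e-4))
        return s
    spam_score = score(spam_probs, prior_spam)
    ham_score = score(ham_probs, 1.0 - prior_spam)
    return "spam" if spam_score > ham_score else "not spam"
```

The bias toward "not spam" mentioned on the next slide could be added by requiring the spam score to beat the other by some margin.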
Spam Filter Improvements
1. Add bias: we would rather misclassify spam as not spam than the reverse
2. Personalize: how do we do this?
Any other ideas?
Ensemble Version Using AdaBoost
- Increase the weights of misclassified examples
- Use the weights directly with Bayes
- Generate a fixed number of classifiers
- Does not change the runtime or space complexities
- May be similar to learning in humans: "Learning a boosted naive Bayesian classifier can be done by rehearsing past experiences" (Elkan 1997)
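One way to "use the weights directly with Bayes", as a hedged sketch: replace raw counts with sums of example weights when estimating the conditional probabilities. The AdaBoost weight-update step itself is omitted, and the smoothing constant is an assumption.

```python
from collections import defaultdict

def weighted_cond_probs(examples, weights, alpha=1.0):
    """P(f | c) estimated from weighted counts, for use inside boosting.

    examples: list of (feature_set, class_label); weights: parallel list of floats.
    """
    class_weight = defaultdict(float)
    feature_weight = defaultdict(lambda: defaultdict(float))
    vocabulary = set()
    for (features, label), w in zip(examples, weights):
        class_weight[label] += w
        for f in features:
            feature_weight[label][f] += w
            vocabulary.add(f)
    return {c: {f: (feature_weight[c][f] + alpha) / (class_weight[c] + 2 * alpha)
                for f in vocabulary}
            for c in class_weight}
```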
Ensemble Approaches
Results on the Pima Indians Diabetes and German Credit datasets.
(Elkan, C. Boosting and Naive Bayesian Learning. 1997.)
Summary
- From Bayes Rule to Naïve Bayes
- MAP vs. ML
- Practicality
- Spell Checker
- Spam Filter
- Ensemble Version
Questions / Comments
Sources