Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007

1 Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007

2 Outline: Naïve Bayes (Components, ML vs. MAP, Benefits); Feature Preparation (Filtering, Decay); Extended Examples (Spell Checking, Spam Filtering); Ensemble Learning

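3 Bayes Rule: $P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$. Three parts: the likelihood P(d | c), the prior P(c), and the denominator P(d). What can be said about the third part for classification tasks?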

4 Bayes Rule: Three parts. What can be said about the third part (the denominator) for classification tasks? It is unnecessary if we only care about the classification, not the probability estimate. It may result in division by zero in domains where previously unseen features arise. Can you think of such a domain? So the denominator is either ignored entirely, or treated as a constant in tasks where we need the estimates.

5 Bayes Rule: What about the class priors P(c)? How do they affect the probability estimates? Do we need the class priors?

6 Bayes Rule: ML and MAP. ML (Maximum Likelihood) selects the class that maximizes P(d | c); class priors are uniform, or ignored. MAP (Maximum a Posteriori) selects the class that maximizes P(d | c) P(c). Both are embodiments of Ockham's razor. ML may be problematic when the data set is small. MAP may be less appropriate when the class priors are suspect.
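The difference between the two rules shows up as soon as the priors are skewed. The following is a minimal sketch with made-up numbers (the likelihoods, priors, and class names are all hypothetical), not figures from the lecture:

```python
# Hypothetical values for a single document d.
likelihood = {"spam": 0.02, "ham": 0.01}  # P(d | c)
prior = {"spam": 0.2, "ham": 0.8}         # P(c)


def ml_class(likelihood):
    """ML: pick the class that maximizes P(d | c), ignoring priors."""
    return max(likelihood, key=likelihood.get)


def map_class(likelihood, prior):
    """MAP: pick the class that maximizes P(d | c) * P(c)."""
    return max(likelihood, key=lambda c: likelihood[c] * prior[c])


ml_choice = ml_class(likelihood)
map_choice = map_class(likelihood, prior)
```

With these numbers ML prefers "spam" because its likelihood is larger, while MAP prefers "ham" because the prior outweighs the likelihood ratio.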

7 Bayes Rule: Finally, if we assume conditional independence of the features $f_i$ given the class, the likelihood factors as $P(d \mid c) = \prod_i P(f_i \mid c)$. Is this assumption reasonable?

8 Bayes Rule: Naïve Bayes. And, finally, we arrive at MAP Naïve Bayes: $c_{MAP} = \arg\max_c P(c) \prod_i P(f_i \mid c)$

9 Smoothing
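Smoothing keeps a feature that was never seen with a class from forcing the whole product of probabilities to zero. A minimal sketch of one common scheme, add-one (Laplace) smoothing, follows; whether the lecture used this exact variant is an assumption, and the function name, `alpha` parameter, and toy data are hypothetical:

```python
from collections import Counter


def smoothed_likelihoods(tokens_in_class, vocabulary, alpha=1.0):
    """Estimate P(f | c) with additive smoothing.

    tokens_in_class: every token occurrence observed in one class
                     (assumed to be drawn from `vocabulary`).
    vocabulary:      all tokens known to the classifier.
    alpha:           pseudo-count added to every token (1.0 = Laplace).
    """
    counts = Counter(tokens_in_class)
    denominator = len(tokens_in_class) + alpha * len(vocabulary)
    return {token: (counts[token] + alpha) / denominator for token in vocabulary}


vocabulary = {"free", "pizza", "lisp"}
probs = smoothed_likelihoods(["free", "free", "pizza"], vocabulary)
```

Here "lisp" never occurs in the class, yet it still receives a small non-zero probability, and the estimates still sum to one.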

10 Time/Space Complexities. Naïve Bayes training: O(examples × features). Decision tree: O(examples × features^2). What about space?

11 Feature Preparation. Filtering: TFIDF (Lift), Mutual Information; Time Decay; Tokenization

12 Feature Filtering: Why? Efficiency: text classification often involves a huge number of features; we want to remove features while maintaining accuracy; features which are independent of the class provide no information. Accuracy: filtering helps prevent over-fitting.

13 Feature Filtering: Lift. The lift of a feature value is the ratio of the confidence of the feature value to the expected confidence of the feature value, that is, local (individual example) confidence vs. global (all examples) confidence. How do we use lift? Order features by lift, then keep the top X features, or the features above a certain threshold.

14 Feature Filtering: TFIDF. TFIDF is one lift measure which is useful in text classification tasks. TFIDF (Term Frequency, Inverse Document Frequency): intuitively, it is how important a word is to a document in a collection. It has its own Wikipedia page. TFIDF, or TF/DF, is $\text{tf} / \text{df}$, where tf is the frequency of the term in the document and df is the fraction of documents in the collection D that contain the term. Examples (word: tf/df = lift): some: .005/.8 = .006; a: .01/1 = .01; football: .01/.05 = .2; Packers: .01/.01 = 1
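The four examples above can be reproduced and used for filtering in a few lines. This is a sketch only: the tf and df values are the ones from the slide, while the function name and the 0.1 cut-off are hypothetical choices for illustration.

```python
def lift(tf, df):
    """TF/DF lift: term frequency in the document over document frequency."""
    return tf / df


# (tf, df) pairs taken from the slide's examples.
examples = {
    "some": (0.005, 0.8),
    "a": (0.01, 1.0),
    "football": (0.01, 0.05),
    "Packers": (0.01, 0.01),
}

lifts = {word: lift(tf, df) for word, (tf, df) in examples.items()}

# Order features by lift, then keep those above a threshold (assumed: 0.1).
ranked = sorted(lifts, key=lifts.get, reverse=True)
kept = [word for word in ranked if lifts[word] >= 0.1]
```

Common words such as "a" and "some" fall below the cut-off, while topical words such as "football" and "Packers" survive.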

15 Feature Filtering: TFIDF. TFIDF filtering benefits: accuracy (2.2% increase, Yahoo!); speed and memory (fewer features).

16 Feature Filtering: Mutual Information Another manner of filtering is to measure how well a feature discriminates between classes.
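One way to make "how well a feature discriminates between classes" concrete is the mutual information between a binary feature (present or absent) and a binary class, computed from document counts. The sketch below uses the standard definition $I(F;C) = \sum_{f,c} P(f,c) \log_2 \frac{P(f,c)}{P(f)P(c)}$; the function name and the counts are hypothetical, and the slide itself does not say which estimator the lecture used.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between a binary feature and a binary class.

    n11: documents with the feature, in class 1
    n10: documents with the feature, in class 0
    n01: documents without the feature, in class 1
    n00: documents without the feature, in class 0
    """
    n = n11 + n10 + n01 + n00
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for joint, feature_total, class_total in cells:
        if joint > 0:  # an empty cell contributes nothing to the sum
            mi += (joint / n) * math.log2(joint * n / (feature_total * class_total))
    return mi


independent = mutual_information(25, 25, 25, 25)
perfect = mutual_information(50, 0, 0, 50)
```

A feature that is independent of the class scores 0 bits, and a feature that perfectly separates two equally sized classes scores 1 bit, so ranking by this value and keeping the top features mirrors the lift-based filtering.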

17 Feature Decay. We already saw how we can weight the individual feature values with lift. We can also weight an example as a whole. Often we want to reduce the contribution of an example after it gets old. Decay = reduce the contribution of an example to the classifier. Use the chemistry formula $N_t = N_0 e^{-\lambda t}$, where $\lambda = \ln 2 / t_{1/2}$ for a half-life $t_{1/2}$ (for instance, a 7-day half-life gives $\lambda = \ln 2 / 7$ per day). Example: 180-day half-life, 30 days old: 1.0 is decayed to 0.89.
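The worked example can be checked directly from the half-life form of the decay formula. A minimal sketch (the function and parameter names are hypothetical):

```python
import math


def decayed_weight(initial_weight, age_days, half_life_days):
    """Exponential decay N_t = N_0 * exp(-lambda * t), lambda = ln(2) / half-life."""
    decay_constant = math.log(2) / half_life_days
    return initial_weight * math.exp(-decay_constant * age_days)


# The slide's example: 180-day half-life, example is 30 days old.
example = decayed_weight(1.0, age_days=30, half_life_days=180)
```

The result rounds to 0.89, matching the slide, and an example exactly one half-life old would contribute 0.5.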

18 Tokenization. Add phrases as features, using a sliding window. Example spam: "Mr. Holloway, I invite you to use our consolidated student loan services. We can save you $50,000 on your student loans." Window of size 2, new features: "Mr. Holloway", "Holloway I", "I invite", "invite you", "you to", "to use", and so on. Use lift to weed out poor combinations. Why? If we know of dependencies, but want to keep the independence assumption, explicitly adding the dependent features as a new feature may improve performance.
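The sliding window over the example spam can be sketched as follows. The function name is hypothetical, and dropping commas before splitting is a simplification chosen so the output matches the phrases listed on the slide:

```python
def sliding_window_phrases(text, window=2):
    """Return every run of `window` consecutive words as a phrase feature."""
    # Simplification: commas are treated as whitespace; other punctuation is kept.
    words = text.replace(",", " ").split()
    return [
        " ".join(words[i:i + window])
        for i in range(len(words) - window + 1)
    ]


spam = "Mr. Holloway, I invite you to use our consolidated student loan services."
phrases = sliding_window_phrases(spam)
```

Each phrase then becomes a candidate feature alongside the single words, and lift can be used to discard the uninformative ones.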

19 Example 1: Spell Checker

20 Spell Checker from Peter Norvig Source code provided in Python, Scheme, Perl, C, Java, Haskell, F#, Ruby, Erlang, and Rebol

21 Spell Checker. P(c), the language model, is the probability of a proposed correction c. Intuitively: how likely is c to appear in an English text? P("the") would have a relatively high probability; P("zxzxzxzyyy") would be near zero. Should we use words or phrases or something else? P(w | c), the error model, is the probability that w would be typed in a text when the author meant c. Intuitively: how likely is it that the author would type w by mistake when c was intended?

22 Spell Checker. Where does P(c) come from? Read in a bunch of books, webpages, Wikipedia, etc. Google makes its phrase-count data available (24 GB compressed, just to warn you). What about unseen classes?

23 Spell Checker. Where does P(w | c) come from? Trivial model: use edit distance to generate and score possibilities, and consider only possibilities that have already been seen (real words / phrases). Can you think of another way to get these probabilities?
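The trivial model can be sketched in the spirit of Norvig's essay: generate every string one edit away from the typed word, keep only those seen in the corpus, and let the language model P(c) pick among them. This is an illustrative sketch rather than Norvig's published code; the toy corpus and all names below are hypothetical, and the error model is the crude assumption that every known word one edit away is equally likely.

```python
from collections import Counter

# Hypothetical toy corpus standing in for books, webpages, Wikipedia, etc.
CORPUS = "the quick brown fox jumps over the lazy dog the end".split()
WORD_COUNTS = Counter(CORPUS)
TOTAL_WORDS = sum(WORD_COUNTS.values())
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def language_model(word):
    """P(c): relative frequency of the word in the corpus."""
    return WORD_COUNTS[word] / TOTAL_WORDS


def edits1(word):
    """All strings one deletion, transposition, replacement, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + ch + b[1:] for a, b in splits if b for ch in LETTERS]
    inserts = [a + ch + b for a, b in splits for ch in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only candidates that have already been seen (real words)."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Prefer the word itself if known, else known words one edit away."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=language_model)
```

For example, `correct("teh")` yields "the" and `correct("dgo")` yields "dog", while a word with no known neighbour is returned unchanged.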

24 Spell Checker Can you think of another way to get these probabilities? Get a corpus of spelling errors, and count how likely it is to make each insertion, deletion, or alteration, given the surrounding characters. Incorporate feedback from users

25 Spell Checker Could we personalize this spell checker? Would it make sense to do so? Any questions about this example? Comments?

26 Example 2: Spam Filter

27 Spam Filter. From Paul Graham's essays "A Plan for Spam" and "Better Bayesian Filtering" (better tokenization, more separators). Note: These are non-personalized filters.

28 Spam Filter: Feature Preparation. 1. Gather spam and non-spam emails. 2. Convert the emails to sets of features (sometimes called "bag of words"): tokenize, use TFIDF to remove common words, remove duplicates (should we do this?). Example: "The CSGA is meeting for lunch today. Free pizza will be served at the meeting." => CSGA, meeting, lunch, today, free, pizza, served
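The conversion in step 2 can be sketched as below. Two simplifications are assumptions of this sketch, not of the lecture: a small hand-written stop list stands in for the TFIDF filter, and all tokens are lower-cased, so "CSGA" comes out as "csga".

```python
import re

# Hypothetical stop list standing in for "use TFIDF to remove common words".
COMMON_WORDS = {"the", "is", "for", "will", "be", "at"}


def email_to_features(text):
    """Tokenize, drop common words, and remove duplicates (order preserved)."""
    features = []
    for token in re.findall(r"[A-Za-z0-9']+", text):
        token = token.lower()
        if token not in COMMON_WORDS and token not in features:
            features.append(token)
    return features


email = "The CSGA is meeting for lunch today. Free pizza will be served at the meeting."
features = email_to_features(email)
```

Removing duplicates means "meeting" counts once even though it appears twice, which is exactly the design question the slide raises.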

29 Spam Filter. "I get a lot of email containing the word "Lisp", and (so far) no spam that does." P(C): C is binary (spam, not spam). Graham uses an equal number of spam and non-spam messages, which is effectively ML. What are the conditions under which we should think seriously about this parameter? (Remember the ML vs. MAP discussion.)

30 Spam Filter: P(F | c). Just count the tokens and divide by the number of emails in the class. Any observations? Examples of P(f | spam): perl 0.01; python 0.01; tcl 0.01; scripting 0.01; morris 0.01; graham; guarantee; cgi; paul; quite; pop; various; prices; managed

31 Spam Filter: How to use the spam filter. 1. A new email arrives. It is converted to tokens as the training examples were. 2. For each token in the new email, we look up (constant time) the probability, and multiply them together. 3. We then have the probability that it is spam and the probability that it is not spam. We choose the greater of the two (MAP) and filter the email appropriately.
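The three steps above can be sketched as a small MAP Naïve Bayes classifier. Several details are assumptions of this sketch rather than statements about Graham's filter: the training emails are invented, log-probabilities are summed instead of multiplying raw probabilities (to avoid underflow), and add-one smoothing is applied so an unseen token does not zero out a class.

```python
import math
from collections import Counter


def train(emails_by_class):
    """emails_by_class maps a class label to a list of token sets."""
    total = sum(len(emails) for emails in emails_by_class.values())
    model = {}
    for label, emails in emails_by_class.items():
        # Number of emails in the class that contain each token.
        counts = Counter(token for email in emails for token in email)
        model[label] = {
            "prior": len(emails) / total,
            "counts": counts,
            "n_emails": len(emails),
        }
    return model


def log_score(model, label, tokens):
    """log P(c) + sum of log P(f | c) over the tokens of the new email."""
    entry = model[label]
    score = math.log(entry["prior"])
    for token in tokens:
        # Add-one smoothing (assumed) over the two outcomes present / absent.
        p = (entry["counts"][token] + 1) / (entry["n_emails"] + 2)
        score += math.log(p)
    return score


def classify(model, tokens):
    """MAP decision: the class with the greater score."""
    return max(model, key=lambda label: log_score(model, label, tokens))


# Hypothetical training data, already converted to token sets.
model = train({
    "spam": [{"free", "prices", "guarantee"}, {"free", "loan", "guarantee"}],
    "not spam": [{"lisp", "python", "meeting"}, {"lunch", "meeting", "pizza"}],
})
```

With this toy model, `classify(model, {"free", "guarantee"})` returns "spam" and `classify(model, {"lisp", "meeting"})` returns "not spam".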

32 Spam Filter: Improvements. 1. Add bias: we would rather misclassify spam as not spam than misclassify legitimate email as spam. 2. Personalize: how do we do this? Any other ideas?

33 Ensemble Version: Using AdaBoost. Increase the weights of misclassified examples. Use the weights directly with Bayes. Generate a fixed number of classifiers. Does not change the runtime or space complexities. May be similar to learning in humans: "Learning a boosted naive Bayesian classifier can be done by rehearsing past experiences" (Elkan 1997).

34 Ensemble Approaches. Data sets: Diabetes in Pima Indians; German Credit. Source: Elkan, C. Boosting and Naive Bayesian Learning.

35 Summary: From Bayes Rule to Naïve Bayes; MAP vs. ML; Practicality; Spell Checker; Spam Filter; Ensemble Version

36 Questions / Comments

37 Sources
