Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007




Outline: Naïve Bayes (Components, ML vs. MAP, Benefits); Feature Preparation (Filtering, Decay); Extended Examples (Spell Checking, Spam Filtering, Ensemble Learning)

Bayes Rule P(c|d) = P(d|c) P(c) / P(d). Three parts: the likelihood P(d|c), the prior P(c), and the evidence P(d). What can be said about the third part for classification tasks?

Bayes Rule Three parts. What can be said about the third part for classification tasks? Unnecessary if we only care about the classification, not the probability estimation. May result in division by zero in domains where previously unseen features arise. Can you think of such a domain? So the denominator is either ignored entirely, or treated as a normalizing constant in tasks where we need the estimates.

Bayes Rule What about the class priors P(C)? How do they affect the probability estimates? Do we need the class priors?

Bayes Rule: ML and MAP ML (Maximum Likelihood) selects the class that maximizes P(d|c); the class priors are uniform, or ignored. MAP (Maximum a Posteriori) selects the class that maximizes P(d|c)P(c). Both are embodiments of Ockham's razor. ML may be problematic when the data is small; MAP may be less appropriate when the class priors are suspect.

Bayes Rule Finally, if we assume conditional independence of the features, P(d|c) = P(f1|c) P(f2|c) ... P(fn|c). Is this assumption reasonable?

Bayes Rule: Naïve Bayes And, finally, we arrive at MAP Naïve Bayes: choose the class c that maximizes P(c) times the product of P(fi|c) over all features fi.

Smoothing
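The smoothing formula on this slide did not survive transcription; a standard choice, and a plausible stand-in, is Laplace (add-one) smoothing of the per-class feature estimates (the function name and numbers below are illustrative):

```python
def smoothed_prob(count_fc, count_c, vocab_size, alpha=1.0):
    """Laplace-smoothed estimate of P(f | c).

    count_fc:   times feature f occurred in class c
    count_c:    total feature occurrences in class c
    vocab_size: number of distinct features
    alpha:      pseudo-count; alpha=1.0 is add-one smoothing
    """
    return (count_fc + alpha) / (count_c + alpha * vocab_size)

# A feature never seen in the class keeps a small nonzero probability,
# so one unseen token no longer zeroes out the whole product.
p_unseen = smoothed_prob(0, 1000, 5000)
p_seen = smoothed_prob(10, 1000, 5000)
```

This directly addresses the division-by-zero / zero-probability issue raised on the Bayes Rule slides.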

Time/Space Complexities Training: Naïve Bayes O(examples × features); Decision Tree O(examples × features²). What about space?

Feature Preparation: Filtering (TFIDF / Lift, Mutual Information), Time Decay, Tokenization

Feature Filtering Why? Efficiency: text classification often involves a huge number of features; we want to remove features while maintaining accuracy. Features which are independent of the class provide no information. Accuracy: filtering helps prevent over-fitting.

Feature Filtering: Lift The lift of a feature value is the ratio of the confidence of the feature value to the expected confidence of the feature value. Local (individual example) confidence vs. global (all examples) confidence How do we use lift? Order features by lift Keep top X features, or features above a certain threshold

Feature Filtering: TFIDF TFIDF is one lift measure which is useful in text classification tasks. TFIDF (Term Frequency, Inverse Document Frequency): intuitively, it's how important a word is to a document in a collection. Has its own Wikipedia page. Here TFIDF, or TF/DF, is tf / df: the word's frequency in the document divided by the fraction of documents containing it. Examples (word = tf/df = lift): some = .005/.8 = .006; a = .01/1 = .01; football = .01/.05 = .2; Packers = .01/.01 = 1
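The slide's worked examples can be reproduced with a tiny helper (hypothetical; tf is the word's frequency in the document, df the fraction of documents in the collection containing it):

```python
def tf_df_lift(tf, df):
    """Lift of a word: term frequency divided by document frequency."""
    return tf / df

# Matching the slide: common words score low, distinctive words high.
lifts = {
    "some": tf_df_lift(0.005, 0.8),      # ~.006
    "a": tf_df_lift(0.01, 1.0),          # .01
    "football": tf_df_lift(0.01, 0.05),  # .2
    "Packers": tf_df_lift(0.01, 0.01),   # 1
}
```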

Feature Filtering: TFIDF TFIDF Filtering Benefits Accuracy (2.2% increase at Yahoo!) Speed and memory (fewer features)

Feature Filtering: Mutual Information Another manner of filtering is to measure how well a feature discriminates between classes.
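The slide's own formula is not in the transcription; as a sketch of the idea, mutual information between a binary feature and a binary class can be computed from a 2×2 contingency table of counts:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI (in bits) between a binary feature and a binary class.
    nXY = count of examples with feature=X, class=Y."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # For each cell: P(x,y) * log2( P(x,y) / (P(x) P(y)) )
    for nxy, nx, ny in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if nxy:
            mi += (nxy / n) * math.log2(nxy * n / (nx * ny))
    return mi
```

A feature independent of the class scores 0 and can be dropped; a perfectly discriminative feature scores 1 bit.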

Feature Decay We already saw how we can weight the individual feature values with lift. We can also weight an example as a whole. Often we want to reduce the contribution of an example after it gets old: decay = reduce the contribution of an example to the classifier over time t. Use the chemistry (radioactive decay) formula N_t = N_0 e^(-λt), with λ = ln 2 / half-life: a 7-day half-life gives λ ≈ 0.099; a 180-day half-life gives λ ≈ 0.00385. Example: with a 180-day half-life, a 30-day-old example's weight of 1.0 is decayed to 0.89.
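The slide's numbers follow directly from the decay formula; a small sketch:

```python
import math

def decay_weight(age_days, half_life_days):
    """Exponential decay of an example's weight: exp(-lambda * t),
    where lambda = ln(2) / half-life."""
    lam = math.log(2) / half_life_days
    return math.exp(-lam * age_days)

# lambda for a 7-day half life is ~0.099, for 180 days ~0.00385;
# a 30-day-old example under a 180-day half life keeps weight ~0.89.
```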

Tokenization Add phrases as features: use a sliding window. Example spam: Mr. Holloway, I invite you to use our consolidated student loan services. We can save you $50,000 on your student loans. Window of size 2, new features: Mr. Holloway, Holloway I, I invite, invite you, you to, to use, and so on. Use lift to weed out poor combinations. Why? If we know of dependencies, but want to keep the independence assumption, explicitly adding the dependent features as new features may improve performance.
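The sliding-window trick can be sketched as follows (naive whitespace split; a real filter would also handle punctuation and other separators):

```python
def window_features(text, size=2):
    """Every run of `size` consecutive tokens becomes one phrase feature."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

feats = window_features("I invite you to use our consolidated student loan services")
# yields "I invite", "invite you", "you to", ...
```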

Example 1: Spell Checker

Spell Checker from Peter Norvig http://www.norvig.com/spell-correct.html Source code provided in Python, Scheme, Perl, C, Java, Haskell, F#, Ruby, Erlang, and Rebol

Spell Checker P(c), the language model, is the probability of a proposed correction c on its own. Intuitively: how likely is c to appear in an English text? P("the") would have a relatively high probability; P("zxzxzxzyyy") would be near zero. Should we use words, phrases, or something else? P(w|c), the error model, is the probability that w would be typed in a text when the author meant c. Intuitively: how likely is it that the author would type w by mistake when c was intended?

Spell Checker Where does P(c) come from? Read in a bunch of books, webpages, Wikipedia, etc. Google makes its phrase-count data available (http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html): 24 GB compressed, just to warn you. What about unseen classes?

Spell Checker Where does P(w|c) come from? Trivial model: use edit distance to generate and score possibilities; consider only possibilities that have already been seen (real words / phrases). Can you think of another way to get these probabilities?

Spell Checker Can you think of another way to get these probabilities? Get a corpus of spelling errors, and count how likely it is to make each insertion, deletion, or alteration, given the surrounding characters. Incorporate feedback from users
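A minimal sketch in the spirit of Norvig's checker, with a toy corpus standing in for "a bunch of books" and a uniform error model (every known word at edit distance 1 is considered equally likely; corpus and names are illustrative):

```python
import re
from collections import Counter

# P(c): word counts from a toy corpus stand in for the language model.
CORPUS = "the quick brown fox jumped over the lazy dog and the fox slept"
COUNTS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))

def edits1(word):
    """All strings one deletion, transposition, replacement,
    or insertion away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Prefer the word itself if known, else the most frequent known
    word at edit distance 1, else give the word back unchanged."""
    candidates = ({word} & COUNTS.keys()) or (edits1(word) & COUNTS.keys()) or {word}
    return max(candidates, key=COUNTS.__getitem__)
```

Weighting candidates by a learned error model, as discussed above, replaces the "most frequent candidate wins" tie-break.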

Spell Checker Could we personalize this spell checker? Would it make sense to do so? Any questions about this example? Comments?

Example 2: Spam Filter

Spam Filter From Paul Graham's essays: A Plan for Spam (http://www.paulgraham.com/spam.html) and Better Bayesian Filtering (http://www.paulgraham.com/better.html), which adds better tokenization (more separators). Note: these are non-personalized filters.

Spam Filter Feature Preparation 1. Gather spam and non-spam emails. 2. Convert the emails to sets of features (sometimes called a "bag of words"): tokenize; use TFIDF to remove common words; remove duplicates (should we do this?). Example: The CSGA is meeting for lunch today. Free pizza will be served at the meeting. => CSGA, meeting, lunch, today, free, pizza, served

Spam Filter "I get a lot of email containing the word "Lisp", and (so far) no spam that does." P(C): C is binary (spam, not spam). Graham uses an equal number of spam and non-spam messages, i.e., ML. Under what conditions should we think seriously about this parameter? (Remember the ML vs. MAP discussion.)

Spam Filter P(F|c): just count the tokens and divide by the number of emails in the class. Any observations? Example values of P(f|spam): perl 0.01, python 0.01, tcl 0.01, scripting 0.01, morris 0.01, graham 0.01491078, guarantee 0.9762507, cgi 0.9734398, paul 0.027040077, quite 0.030676773, pop3 0.042199217, various 0.06080265, prices 0.9359873, managed 0.06451222

Spam Filter How to use the spam filter 1. A new email arrives. It is converted to tokens as the training examples were. 2. For each token in the new email, we look up (constant time) the probability, and multiply the probabilities together. 3. We then have the probability that it's spam and the probability that it's not spam. We choose the greater of the two (MAP) and filter the email appropriately.
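The steps above can be sketched as follows. The per-token probabilities are illustrative stand-ins for a learned table like the one on the previous slide; tokens not in the table are simply skipped here (smoothing would handle them properly), and logs are used because multiplying many small probabilities underflows floating point:

```python
import math

# Hypothetical learned values of P(token | spam) and P(token | not spam).
P_SPAM = {"guarantee": 0.97, "prices": 0.93, "perl": 0.01}
P_HAM = {"guarantee": 0.03, "prices": 0.07, "perl": 0.99}

def classify(tokens, prior_spam=0.5):
    """MAP decision: compare log P(spam) + sum of log P(t|spam)
    against the same quantity for not-spam."""
    log_spam = math.log(prior_spam)
    log_ham = math.log(1 - prior_spam)
    for t in tokens:
        if t in P_SPAM:
            log_spam += math.log(P_SPAM[t])
            log_ham += math.log(P_HAM[t])
    return "spam" if log_spam > log_ham else "not spam"
```

Note the tie-break goes to "not spam", which is one cheap way to express the bias discussed on the next slide.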

Spam Filter Improvements 1. Add bias We would rather misclassify as not spam than spam 2. Personalize How do we do this? Any other ideas?

Ensemble Version Using AdaBoost: increase the weights of misclassified examples; use the weights directly with Bayes; generate a fixed number of classifiers. Does not change the runtime or space complexities. May be similar to learning in humans: learning a boosted naive Bayesian classifier can be done by rehearsing past experiences (Elkan 1997).
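The "use weights directly with Bayes" step amounts to letting each example contribute its boosting weight, rather than 1, to the counts; a sketch (names and data illustrative):

```python
from collections import defaultdict

def weighted_counts(examples, weights):
    """Each (tokens, label) example adds its boosting weight w, not 1,
    to the class and token tallies; a weight-aware Naive Bayes is then
    estimated from these tallies as usual."""
    class_w = defaultdict(float)
    token_w = defaultdict(float)
    for (tokens, label), w in zip(examples, weights):
        class_w[label] += w
        for t in set(tokens):
            token_w[(t, label)] += w
    return class_w, token_w

cw, tw = weighted_counts(
    [(["free", "pizza"], "ham"), (["free", "loans"], "spam")],
    [1.0, 2.0],  # the misclassified second example was up-weighted
)
```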

Ensemble Approaches Results on the Pima Indians Diabetes and German Credit datasets. Elkan, C. Boosting and Naive Bayesian Learning. 1997.

Summary From Bayes Rule to Naïve Bayes MAP vs. ML Practicality Spell Checker Spam Filter Ensemble Version

Questions / Comments

Sources