BUILDING A SPAM FILTER USING NAÏVE BAYES. CIS 391 - Intro to AI






Spam or not spam: that is the question.

From: "" <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay

Anyone can buy real estate with no money down. Stop paying rent TODAY! There is no need to spend hundreds or even thousands for similar courses. I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW!
=================================================
Click Below to order: http://www.wholesaledaily.com/sales/nmd.htm
=================================================

Categorization/Classification Problems

Given:
- A description of an instance x ∈ X, where X is the instance language or instance space. (Issue: how do we represent text documents?)
- A fixed set of categories: C = {c_1, c_2, ..., c_n}

Determine:
- The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

We want to automatically build categorization functions ("classifiers").

EXAMPLES OF TEXT CATEGORIZATION

- Categories = SPAM?: spam / not spam
- Categories = TOPICS: finance / sports / asia
- Categories = OPINION: like / hate / neutral
- Categories = AUTHOR: Shakespeare / Marlowe / Ben Jonson; the Federalist Papers

Bayesian Methods for Classification

Uses Bayes' theorem to build a generative model that approximates how the data is produced. First step:

    P(C | X) = P(X | C) P(C) / P(X)

where C: categories, X: instance to be classified. Uses the prior probability of each category given no information about an item. Categorization produces a posterior probability distribution over the possible categories given a description of each instance.
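As a minimal numeric sketch in Python (all numbers are invented for illustration, not from the slides), Bayes' theorem turns a class prior and a class-conditional likelihood into a posterior:

# Invented numbers for one observed message X.
p_spam, p_ham = 0.4, 0.6                  # priors P(C)
px_spam, px_ham = 0.03, 0.001             # likelihoods P(X | C)

p_x = px_spam * p_spam + px_ham * p_ham   # evidence P(X)
posterior_spam = px_spam * p_spam / p_x   # Bayes' theorem: P(spam | X)
print(posterior_spam)                     # ~0.952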

Maximum a Posteriori (MAP) Hypothesis

Let c_MAP be the most probable category. Then goodbye to that nasty normalization!

    c_MAP = argmax_{c ∈ C} P(c | X)
          = argmax_{c ∈ C} P(X | c) P(c) / P(X)
          = argmax_{c ∈ C} P(X | c) P(c)     (as P(X) is constant)

No need to compute P(X)!
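A one-line check, reusing the invented numbers from the sketch above: since P(X) divides every category's score by the same constant, ranking the unnormalized scores P(X | c) P(c) picks the same winner as ranking the full posteriors:

scores = {"spam": 0.03 * 0.4, "ham": 0.001 * 0.6}   # P(X | c) * P(c), P(X) dropped
c_map = max(scores, key=scores.get)                 # the argmax is unchanged
print(c_map)                                        # -> 'spam'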

Maximum Likelihood Hypothesis

If all hypotheses are a priori equally likely, then to find the maximally likely category c_ML we only need to consider the P(X | c) term:

    c_ML = argmax_{c ∈ C} P(X | c)

This is the Maximum Likelihood Estimate (MLE).

Naïve Bayes Classifiers: Step 1

Assume that instance X is described by an n-dimensional vector of attributes: X = ⟨x_1, x_2, ..., x_n⟩. Then:

    c_MAP = argmax_{c ∈ C} P(c | x_1, x_2, ..., x_n)
          = argmax_{c ∈ C} P(x_1, x_2, ..., x_n | c) P(c) / P(x_1, x_2, ..., x_n)
          = argmax_{c ∈ C} P(x_1, x_2, ..., x_n | c) P(c)

Naïve Bayes Classifier: Step 2

To estimate:
- P(c_j): can be estimated from the frequency of classes in the training examples.
- P(x_1, x_2, ..., x_n | c_j): Problem! O(|X|^n · |C|) parameters are required to estimate the full joint probability distribution.

Solution: the Naïve Bayes Conditional Independence Assumption:

    P(x_1, ..., x_n | c_j) = ∏_i P(x_i | c_j)

so that

    c_MAP = argmax_{c ∈ C} P(c) ∏_i P(x_i | c)

Naïve Bayes Classifier for Binary Variables

Class: Flu. Features X_1 ... X_5: runny nose, sinus, cough, fever, muscle ache.

Conditional Independence Assumption: features are independent of each other given the class:

    P(X_1, ..., X_5 | C) = P(X_1 | C) P(X_2 | C) ... P(X_5 | C)
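Under this assumption the joint likelihood of the five symptoms reduces to a product of per-feature terms. A minimal sketch, with invented per-symptom probabilities:

# Invented P(X_i = true | flu): runny nose, sinus, cough, fever, muscle ache.
p_given_flu = [0.9, 0.7, 0.8, 0.6, 0.5]
x = [True, False, True, True, False]    # one observed patient

likelihood = 1.0
for p, xi in zip(p_given_flu, x):
    likelihood *= p if xi else (1 - p)  # P(X_i = x_i | flu)
print(likelihood)                       # 0.9 * 0.3 * 0.8 * 0.6 * 0.5 = 0.0648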

Learning the Model

(Model: class node C with feature nodes X_1 ... X_6.)

First attempt: maximum likelihood estimates. Given training data for N individuals, where count(X = x) is the number of those individuals for which X = x (e.g., Flu = true), then for each category c and each value x of a variable X:

    P̂(c) = count(C = c) / N
    P̂(x | c) = count(X = x, C = c) / count(C = c)
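A minimal sketch of these maximum likelihood estimates over a toy table of (muscle ache, flu) observations; the data values are invented for illustration:

# Toy data: (muscle_ache, flu) observations for N = 5 individuals.
data = [(True, True), (True, True), (False, True), (False, False), (False, False)]

N = len(data)
count_flu = sum(1 for _, flu in data if flu)                      # count(C = flu)
p_flu = count_flu / N                                             # P̂(flu) = 3/5
count_ache_and_flu = sum(1 for ache, flu in data if ache and flu)
p_ache_given_flu = count_ache_and_flu / count_flu                 # P̂(ache | flu) = 2/3
print(p_flu, p_ache_given_flu)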

Problem with Maximum Likelihood for Naïve Bayes

(Features X_1 ... X_5: runny nose, sinus, cough, fever, muscle ache.)

    P(X_1, ..., X_5 | Flu) = P(X_1 | Flu) P(X_2 | Flu) ... P(X_5 | Flu)

What if there are no training cases where the patient had muscle aches (X_5 = t) but no flu?

    P̂(X_5 = t | ¬flu) = count(X_5 = t, ¬flu) / count(¬flu) = 0

So if a new instance has X_5 = t:

    P(X_1, ..., X_5 | ¬flu) = 0

Zero probabilities overwhelm any other evidence!

Add-1 (Laplace) Smoothing to Avoid Overfitting

    P̂(x | c) = (count(X = x, C = c) + 1) / (count(C = c) + |X|)

where |X| is the number of values of X (here 2).

Slightly better version:

    P̂(x | c) = (count(X = x, C = c) + α) / (count(C = c) + α|X|)

where α controls the extent of smoothing.
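A sketch of both estimators in one hypothetical helper (the name smoothed_estimate is ours, not from the slides); num_values is |X| and alpha the extent of smoothing:

def smoothed_estimate(count_xc, count_c, num_values, alpha=1.0):
    """Add-alpha estimate of P(X = x | C = c); alpha = 1 gives add-1 (Laplace) smoothing."""
    return (count_xc + alpha) / (count_c + alpha * num_values)

# A zero count no longer produces a zero probability:
print(smoothed_estimate(0, 10, num_values=2))   # 1/12 instead of 0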

Using Naïve Bayes Classifiers to Classify Text: Basic Method

As a generative model:
1. Randomly pick a category c according to P(c).
2. For a document of length N, for each word position i: generate word w_i according to P(w | c).

    P(c, D = ⟨w_1, w_2, ..., w_N⟩) = P(c) ∏_{i=1}^{N} P(w_i | c)

This is a Naïve Bayes classifier for multinomial variables. Note that word order really doesn't matter here: the same parameters are used for each position. The result is the "bag of words" model, which views a document not as an ordered list of words but as a multiset.
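A sketch of this generative story with an invented unigram model P(w | spam); because every position draws from the same distribution, only word counts matter, which is exactly the bag-of-words view:

import random

# Invented unigram model P(w | spam) over a tiny vocabulary.
p_w_given_spam = {"free": 0.4, "money": 0.3, "meeting": 0.1, "report": 0.2}

def generate_doc(p_w_given_c, length):
    words, probs = zip(*p_w_given_c.items())
    return random.choices(words, weights=probs, k=length)  # each position drawn i.i.d. from P(w | c)

print(generate_doc(p_w_given_spam, 5))   # e.g. ['free', 'money', 'free', 'report', 'free']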

Naïve Bayes: Learning (First Attempt)

From the training corpus, extract the Vocabulary. Then calculate the required estimates of the P(c) and P(w | c) terms.

For each c_j in C:

    P(c_j) = count_docs(C = c_j) / |docs|

where count_docs(c) is the number of documents of category c.

For each word w_i ∈ Vocabulary and c ∈ C:

    P(w_i | c) = count_doctokens(W = w_i, C = c) / count_doctokens(C = c)

where count_doctokens(·) counts tokens over all documents of category c (in the numerator, only tokens equal to w_i).
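A minimal sketch of these counts over an invented three-document corpus (count_docs becomes doc_counts, count_doctokens becomes token_counts):

from collections import Counter, defaultdict

# Toy training corpus: (document, category) pairs, invented for illustration.
corpus = [("buy cheap meds now", "spam"),
          ("meeting notes attached", "ham"),
          ("cheap cheap offer", "spam")]

doc_counts = Counter(c for _, c in corpus)
token_counts = defaultdict(Counter)     # token_counts[c][w] = count_doctokens(W = w, C = c)
for doc, c in corpus:
    token_counts[c].update(doc.split())

p_c = {c: n / len(corpus) for c, n in doc_counts.items()}
p_w_given_c = {c: {w: n / sum(cnt.values()) for w, n in cnt.items()}
               for c, cnt in token_counts.items()}
print(p_c["spam"], p_w_given_c["spam"]["cheap"])   # 2/3 and 3/7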

Naïve Bayes: Learning (Second Attempt)

Laplace smoothing must be done over the vocabulary items. (We can assume we have at least one instance of each category, so we don't need to smooth the P(c) estimates.)

Assume a single new word UNK that occurs nowhere within the training document set, and map all unknown words in documents to be classified (test documents) to UNK. For 0 < α ≤ 1:

    P(w_i | c) = (count_doctokens(W = w_i, C = c) + α) / (count_doctokens(C = c) + α(|V| + 1))
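Continuing the sketch above, a hypothetical helper for the smoothed estimate, with unseen test words mapped to UNK:

def p_word_given_class(w, c, token_counts, vocab, alpha=1.0):
    """Smoothed P(w | c); unseen test words are mapped to the pseudo-word UNK."""
    if w not in vocab:
        w = "UNK"                        # UNK never occurs in training, so its count is 0
    count_wc = token_counts[c][w]        # a Counter returns 0 for missing words
    total_c = sum(token_counts[c].values())
    return (count_wc + alpha) / (total_c + alpha * (len(vocab) + 1))   # +1 for UNK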

Naïve Bayes: Classifying

Compute c_NB using either

    c_NB = argmax_{c ∈ C} P(c) ∏_{i=1}^{N} P(w_i | c)

or

    c_NB = argmax_{c ∈ C} P(c) ∏_{w ∈ V} P(w | c)^count(w)

where count(w) is the number of times word w occurs in the document. (The two are equivalent.)
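Putting the sketches above together, the first form of the decision rule (the per-token product; the per-vocabulary-word power form would give the same answer):

def classify(doc, p_c, token_counts, vocab, alpha=1.0):
    """Return argmax_c P(c) * prod_i P(w_i | c)."""
    def score(c):
        s = p_c[c]
        for w in doc.split():
            s *= p_word_given_class(w, c, token_counts, vocab, alpha)
        return s
    return max(p_c, key=score)

vocab = {w for doc, _ in corpus for w in doc.split()}
print(classify("cheap meds offer", p_c, token_counts, vocab))   # -> 'spam'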

PANTEL AND LIN: SPAMCOP

Uses a Naïve Bayes classifier: M is spam if P(Spam | M) > P(NonSpam | M).

Method:
- Tokenize the message using the Porter stemmer.
- Estimate P(x_k | C) using the m-estimate (a form of smoothing).
- Remove words that do not satisfy certain conditions.

Train: 160 spams, 466 non-spams. Test: 277 spams, 346 non-spams.

Results: error rate of 4.33%. Worse results using trigrams.

Naïve Bayes Is (Was) Not So Naïve

- First and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms. Goal: a financial services industry direct mail response prediction model: predict whether the recipient of mail will actually respond to the advertisement. 750,000 records.
- A good, dependable baseline for text classification (but not the best by itself!).
- Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem.
- Very fast: learning takes one pass over the data; testing is linear in the number of attributes and document collection size.
- Low storage requirements.

Engineering: Underflow Prevention

Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow. Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities. The class with the highest final un-normalized log probability score is still the most probable:

    c_NB = argmax_{c_j ∈ C} [ log P(c_j) + ∑_{i ∈ positions} log P(w_i | c_j) ]
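The same decision rule in log space, reusing the helpers sketched earlier; only the scoring changes, not the winner:

import math

def classify_log(doc, p_c, token_counts, vocab, alpha=1.0):
    """Same winner as classify(), but summing logs avoids underflow on long documents."""
    def log_score(c):
        return math.log(p_c[c]) + sum(
            math.log(p_word_given_class(w, c, token_counts, vocab, alpha))
            for w in doc.split())
    return max(p_c, key=log_score)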

REFERENCES

Mosteller, F. & Wallace, D. L. (1984). Applied Bayesian and Classical Inference: The Case of the Federalist Papers (2nd ed.). New York: Springer-Verlag.

Pantel, P. & Lin, D. (1998). SpamCop: A spam classification and organization program. In Proc. of the 1998 AAAI Workshop on Learning for Text Categorization.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.