1 Introductory Comments

First, I would like to point out that I got this material from two sources: The first was a page from Paul Graham's website at www.paulgraham.com/ffb.html, and the second was a paper by I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, and C. D. Spyropoulos, titled "An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages," which appeared in the Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pages 160-167). The Graham paper is interesting, but is written more for those with almost no mathematical background, and it doesn't explain the math behind the algorithm; and, even though Graham's paper gives a link to a page describing the math, that linked page also does not do an adequate job, since it does not place the result proved and used there in its proper Bayesian context. Here in these notes I will give a more formal treatment, and will be explicit about the conditional independence assumptions that one makes.

2 Bayesian Probability

In this section I will prove a few basic results that we will use. Some of these results are proved in your book, but I will prove them here again anyway, to make these notes self-contained. First, we have Bayes's Theorem:

Theorem (Bayes's Theorem). Suppose that S is a sample space, and Σ is a σ-algebra on S having probability measure P. Further, suppose that we have a partition of S into (disjoint) events C_1, C_2, ..., C_k; that is,
$$S = \bigcup_{i=1}^{k} C_i, \quad \text{and, for } i \neq j,\ C_i \cap C_j = \emptyset.$$
Then, for any event A and any i = 1, 2, ..., k, we have
$$P(C_i \mid A) = \frac{P(A \mid C_i)\, P(C_i)}{\sum_{j=1}^{k} P(A \mid C_j)\, P(C_j)}.$$
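Before the proof, here is a quick numerical sanity check of the formula for a two-part partition; every number in it is invented purely for illustration:

```python
# Bayes's Theorem for a k = 2 partition; all numbers are made up.

prior = [0.6, 0.4]        # P(C_1), P(C_2); must sum to 1
likelihood = [0.2, 0.5]   # P(A | C_1), P(A | C_2)

# Law of total probability: P(A) = sum_j P(A | C_j) P(C_j)
p_a = sum(lk * pr for lk, pr in zip(likelihood, prior))

# Bayes's Theorem: P(C_i | A) = P(A | C_i) P(C_i) / P(A)
posterior = [lk * pr / p_a for lk, pr in zip(likelihood, prior)]

print(posterior)       # ~[0.375, 0.625]
print(sum(posterior))  # ~1.0 -- posteriors over a partition sum to 1
```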

Proof. The proof is really obvious, once you know what everything means. First, we note that A can be partitioned as follows:
$$A = \bigcup_{j=1}^{k} (A \cap C_j),$$
where we notice that the sets A ∩ C_j are all disjoint, since the sets C_j are disjoint. Thus, we have
$$P(A) = \sum_{j=1}^{k} P(A \cap C_j) = \sum_{j=1}^{k} P(A \mid C_j)\, P(C_j). \tag{1}$$
The second equality here just follows from the definition of conditional probability,
$$P(C \mid D) = \frac{P(C \cap D)}{P(D)}.$$
Now, then, via obvious manipulations we get
$$P(C_i \mid A) = \frac{P(A \mid C_i)\, P(C_i)}{P(A)},$$
and then, replacing the P(A) in the denominator with the right-hand side of (1), the theorem now follows.

The theorem we will actually use is what I will call the Spam Theorem. The setup is as follows: Suppose we have two classes of objects, spam and legitimate email. We will use Spam to denote the spam messages, and Legit to denote the legitimate ones. We also suppose we have a series of k attributes that a Spam or Legit message can have. Our sample space for the probability theory will be as follows: S will be all ordered (k+1)-tuples of the form (Spam, t_1, ..., t_k) or (Legit, t_1, ..., t_k), where each t_i is 1 or 0; t_i = 0 means the message does not have attribute i, and t_i = 1 means it does have attribute i. In total there are 2^{k+1} elements in S. These elements of S you can think of as the space of all possible message attributes. For example, if k = 3, a typical element of S looks like (Spam, 1, 0, 1), which would mean that you have a Spam message having attributes 1 and 3, but not attribute 2. These k attributes you can think of as corresponding to whether or not the message contains a particular word. So, for example, attribute 1 might be "The message contains the word SAVE," attribute 2 might be "The message contains the word MONEY," and attribute 3 might be "The message contains the word NOW." Thus, a message having attribute vector (Spam, 1, 0, 1) would be a spam email containing the words SAVE and NOW, but not the word MONEY.

We assume that we have some probability measure P on Σ = 2^S. Now, we let Spam denote the subset of S consisting of all vectors with Spam for the first entry; and we let Legit denote the vectors in S that begin with Legit. We let W_i denote the subset of S consisting of all vectors with t_i = 1; that is, you can think of a message as having an attribute vector lying inside W_i if it is a message that contains the ith word somewhere in its text. We further make the following, crucial conditional independence assumptions:

Conditional Independence Assumptions. Suppose that W_{i_1}, ..., W_{i_l} are distinct words chosen from among W_1, ..., W_k. Then, we assume that
$$P(W_{i_1} \cap W_{i_2} \cap \cdots \cap W_{i_l} \mid \text{Spam}) = \prod_{j=1}^{l} P(W_{i_j} \mid \text{Spam});$$
and
$$P(W_{i_1} \cap W_{i_2} \cap \cdots \cap W_{i_l} \mid \text{Legit}) = \prod_{j=1}^{l} P(W_{i_j} \mid \text{Legit}).$$

Comments: The first assumption here roughly says that, given that a message is spam, the probability that it contains any given word on our list is independent of whether it contains any other combination of the other words on our list. This is maybe not a valid assumption, since, for instance, if a spam contains the word SAVE, it very likely also contains the word MONEY; so these probabilities are not independent, or uncorrelated. Nonetheless, we will assume that these conditions hold in order to keep our model simple. The second assumption has a similar interpretation. A small sketch of what the factorization buys us appears below.
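To make the factorization concrete, here is a minimal sketch for the k = 3 word list above; the conditional probabilities in it are invented purely for illustration:

```python
# Conditional independence assumption for the SAVE/MONEY/NOW example;
# the conditional probabilities below are made up.

p_word_given_spam = {"SAVE": 0.30, "MONEY": 0.25, "NOW": 0.40}

# Under the assumption, the probability that a spam message contains any
# chosen combination of listed words factors into a product:
#   P(W_SAVE and W_NOW | Spam) = P(W_SAVE | Spam) * P(W_NOW | Spam)
p_save_and_now = p_word_given_spam["SAVE"] * p_word_given_spam["NOW"]
print(p_save_and_now)  # 0.12

# In real mail the factorization is only approximate: SAVE and MONEY
# tend to co-occur, so their true joint conditional probability would
# exceed the product of the individual ones.
```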

Now we come to our theorem.

Spam Theorem. Suppose that S, Σ, and P are as above, and let p_i = P(Spam | W_i). Further, suppose that the conditional independence assumptions hold. Then, for any subset W_{i_1}, ..., W_{i_l} of W_1, ..., W_k, we have that
$$P(\text{Spam} \mid W_{i_1} \cap \cdots \cap W_{i_l}) = \frac{p_{i_1} \cdots p_{i_l}}{p_{i_1} \cdots p_{i_l} + \left(\frac{x}{1-x}\right)^{l-1} (1 - p_{i_1}) \cdots (1 - p_{i_l})},$$
where x = P(Spam).

This theorem is telling us that if we are given a document containing the words W_{i_1}, ..., W_{i_l}, then we can calculate the probability that it is a spam message by plugging the respective probabilities into the formula. How do we determine the probabilities p_i? That will be described in the next section. For now we will just assume we know what they are. Let us now see how to prove the theorem.

Proof. We apply Bayes's Theorem with the disjoint events C_1 = Spam and C_2 = Legit = C_1^c (which partition S), and with A = W_{i_1} ∩ ⋯ ∩ W_{i_l}. With these choices of parameters we get that the numerator in the formula for P(C_1 | A) in Bayes's Theorem has the value
$$P(A \mid C_1)\, P(C_1) = P(C_1) \prod_{j=1}^{l} P(W_{i_j} \mid C_1) = P(C_1) \prod_{j=1}^{l} \frac{P(C_1 \mid W_{i_j})\, P(W_{i_j})}{P(C_1)} = \frac{1}{x^{l-1}} \prod_{j=1}^{l} p_{i_j}\, P(W_{i_j}).$$
To get this expansion we have used our conditional independence assumption, together with the basic fact that P(F | G) = P(G | F) P(F) / P(G). Note here that we also made the substitution P(C_1) = P(Spam) = x. To get the denominator in Bayes's formula for P(C_1 | A), we have to find P(A | C_1) P(C_1) + P(A | C_2) P(C_2). We have already found the first term here; so we just need to find the second term, which is
$$P(A \mid C_2)\, P(C_2) = P(C_2) \prod_{j=1}^{l} P(W_{i_j} \mid C_2) = P(C_2) \prod_{j=1}^{l} \frac{P(C_2 \mid W_{i_j})\, P(W_{i_j})}{P(C_2)} = \frac{1}{(1-x)^{l-1}} \prod_{j=1}^{l} (1 - p_{i_j})\, P(W_{i_j}).$$

So, we have
$$P(C_1 \mid A) = \frac{\frac{1}{x^{l-1}} \prod_{j=1}^{l} p_{i_j}\, P(W_{i_j})}{\frac{1}{x^{l-1}} \prod_{j=1}^{l} p_{i_j}\, P(W_{i_j}) + \frac{1}{(1-x)^{l-1}} \prod_{j=1}^{l} (1 - p_{i_j})\, P(W_{i_j})} = \frac{p_{i_1} \cdots p_{i_l}}{p_{i_1} \cdots p_{i_l} + \left(\frac{x}{1-x}\right)^{l-1} (1 - p_{i_1}) \cdots (1 - p_{i_l})}.$$
The theorem now follows.

3 Applying the Spam Theorem to do the Filtering

The idea is to start with, say, 1000 messages, with 500 of them spam and 500 legitimate. Then you find the best 20 words that appear in a significant percentage of the spam and legit messages. By "best" I mean here that the word is biased towards being in either a spam or a legit message. For example, if I were told that I am about to receive a message containing the word "sell," and if I know that I typically get as many spams as legit emails, then there is a better than 50 percent chance that the message is spam. On the other hand, if I were told the message contains the word "the," then that would tell me little about whether the message is spam or legit; that is, I would only know the message is spam with 50 percent chance. So, the word "the" would not go on my list of best words, but the word "sell" just might. The correct balance between how often a word appears in emails and how biased it is towards being spam or legit, in determining whether the word is one of the best words, is left up to the user.

After we have our list of words, we then compute the numbers p_i = P(Spam | W_i) as follows: Given a word W_i, let N be the total number of my 500 spam emails containing W_i, and let M be the total number of my 500 legit emails containing W_i. Then, we let P(W_i | Spam) = N/500 and P(W_i | Legit) = M/500. The quantity P(Spam) is the parameter x in the Spam Theorem, which is left up to the user to supply. Finally, by Bayes's Theorem we have
$$p_i = \frac{P(W_i \mid \text{Spam})\, P(\text{Spam})}{P(W_i \mid \text{Spam})\, P(\text{Spam}) + P(W_i \mid \text{Legit})\, P(\text{Legit})} = \frac{xN}{xN + M(1-x)}.$$
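Here is a minimal sketch of this training step and of the Spam Theorem's combining formula, assuming the 500/500 split described above; the function names and the example counts are invented for illustration:

```python
# Training step under a 500 spam / 500 legit split, plus the Spam
# Theorem's combining formula. Names and example counts are made up.

def estimate_p(n, m, x):
    """p_i = P(Spam | W_i) = xN / (xN + M(1 - x)), where N of the 500
    spam messages and M of the 500 legit messages contain word W_i."""
    return x * n / (x * n + m * (1 - x))

def spam_probability(ps, x):
    """Combine p_{i_1}, ..., p_{i_l} via the Spam Theorem with prior x."""
    prod, prod_neg = 1.0, 1.0
    for p in ps:
        prod *= p            # p_{i_1} * ... * p_{i_l}
        prod_neg *= 1.0 - p  # (1 - p_{i_1}) * ... * (1 - p_{i_l})
    l = len(ps)
    return prod / (prod + (x / (1.0 - x)) ** (l - 1) * prod_neg)

# Example: a word found in 400 of the 500 spams and 25 of the 500 legit
# messages, with the user-supplied prior x = 0.5.
print(estimate_p(400, 25, 0.5))  # 400/425 ~ 0.941
```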

Now suppose you are given some message. It is not too difficult to have a computer program read through the message to determine which of our 20 best words appear in it. After this is done, suppose that the words appearing in the message that are among our 20 best are W_{i_1}, ..., W_{i_l}. Then the computer can calculate the probability that the message is spam using the Spam Theorem. If the chance that the message is spam is sufficiently high, then you can have the computer reject the message, or put it into some kind of holding folder that you can look through from time to time, just in case a legit message was misclassified as spam.
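A sketch of this classification step follows; it repeats the combining formula from the previous sketch, and the word list, prior x, and threshold are all assumptions chosen for illustration:

```python
# Classification step: find which of the ~20 best words occur in the
# message, combine their p_i values via the Spam Theorem, and compare
# against a threshold. The word list, x, and threshold are made up.

def spam_probability(ps, x):
    prod, prod_neg = 1.0, 1.0
    for p in ps:
        prod *= p
        prod_neg *= 1.0 - p
    return prod / (prod + (x / (1.0 - x)) ** (len(ps) - 1) * prod_neg)

def classify(message_text, best_words, x=0.5, threshold=0.9):
    present = set(message_text.upper().split())
    ps = [p for word, p in best_words.items() if word in present]
    if not ps:
        return "legit"  # none of the best words appeared
    return "spam" if spam_probability(ps, x) >= threshold else "legit"

# Each best word is mapped to its estimated p_i = P(Spam | W_i).
best_words = {"SAVE": 0.90, "MONEY": 0.85, "NOW": 0.70}
print(classify("Act NOW and SAVE big MONEY", best_words))  # spam
```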