Spam Detection. A Bayesian approach to filtering spam




Kunal Mehrotra, Shailendra Watave

Abstract

The ever-increasing menace of spam is bringing down productivity. More than 70% of email messages are spam, and it has become a challenge to separate such messages from the legitimate ones. We have developed a spam identification engine which employs a naïve Bayesian classifier to identify spam. This probabilistic classifier was trained on TREC 2006, a corpus of known spam/legitimate messages, and it takes into account a comprehensive set of phrasal and domain-specific features (non-phrasal features, viz. emails containing attachments, emails sent from the .edu domain, etc.) that are arrived at by using standard dimensionality reduction algorithms. The cost of classifying a legitimate message as spam (false positive) far outweighs the cost of classifying spam as legitimate (false negative). This cost sensitivity was incorporated into the spam engine, and we have achieved high precision and recall, thereby reducing the false positive rate.

Keywords: Naïve Bayesian Classifier, Support Vector Machines, Precision, Recall

1. Introduction

Spam is unsolicited email that is sent indiscriminately to mailing lists, individuals and newsgroups. This misuse of the electronic messaging system is becoming rampant because spamming is economically feasible. A recent study says that more than 70% of the total messages sent over the internet are spam [1]. Spam brings down productivity, as users have to sift through their inbox to segregate legitimate email messages from spam. Hence the development of an effective and efficient spam filter is imperative.

We have developed a spam identification engine that identifies and segregates spam messages from legitimate ones. The classical naïve Bayesian approach was used to develop the spam filter. The naïve Bayesian classifier has become highly prevalent because the resulting system is comparatively simple. It is a probabilistic classifier based on Bayes' theorem, and it assumes that the features are conditionally independent of one another given the class.

The TREC 2006 email corpus was used to train and test our filter. We used 70% (approx. 25,475) of the messages in the corpus to train our filter. The remaining 30% (approx. 10,918) of the messages were used to test the filter.

2. Literature Survey

Recently, varied techniques have been applied to identify spam. The technique proposed by Sahami et al. was among the first studies that focused on this task. The naïve Bayesian approach was preferred because of its robustness and ease of implementation in a cost-sensitive decision framework. Jason Rennie's ifile program was the first anti-spam filter developed using the Bayes classifier. A few others have also implemented variations of this technique. Paul Graham wrote an article, A Plan for Spam, which was intended for a general audience and was well received. Other techniques such as RIPPER, ensembles of decision trees, boosting, instance-based learning and SVM were proposed subsequently. Experiments conducted by Drucker et al. verified the effectiveness of the SVM techniques. The study concluded that SVM and boosting are the top-performing methods.

3. Project Description

The objective is to implement a naïve Bayesian anti-spam filter that segregates spam from ham, and to measure its efficacy using various cost-sensitive measures. The results are compared with those of a third-party filter, LIBSVM, which is based on another classification technique called Support Vector Machine (SVM).

A supervised learning approach is used to enable the filter to differentiate between spam and ham. The filter is trained on 70% of the spam and ham corpus. Training requires feature extraction and calculation of the spam probability of each extracted feature, f_i, using the naïve Bayesian expression:

$$P(\mathrm{SPAM} \mid f_i) = \frac{P(f_i \mid \mathrm{SPAM})\,P(\mathrm{SPAM})}{P(f_i)}$$

$$P(\mathrm{SPAM} \mid f_i) = \frac{P(f_i \mid \mathrm{SPAM})\,P(\mathrm{SPAM})}{P(f_i \mid \mathrm{SPAM})\,P(\mathrm{SPAM}) + k \cdot P(f_i \mid \mathrm{HAM})\,P(\mathrm{HAM})}$$

We base our calculation on the assumption that an email is equally likely to be spam or ham a priori. That is, the prior probabilities are P(SPAM) = P(HAM) = 0.5. A factor k has been introduced that can be tuned to reduce the number of false positives, i.e. the number of ham messages misclassified as spam.
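The per-feature training formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `spam_probability`, the count-based estimate of P(f | class) (the fraction of training messages of that class that contain the feature), and the neutral value returned for unseen features are all hypothetical choices.

```python
def spam_probability(spam_with_f, n_spam, ham_with_f, n_ham,
                     k=1.0, p_spam=0.5, p_ham=0.5):
    """Estimate P(SPAM | f) for one feature, with the tunable factor k.

    Assumption (not from the paper): P(f | class) is estimated as the
    fraction of training messages of that class containing the feature.
    """
    p_f_given_spam = spam_with_f / n_spam
    p_f_given_ham = ham_with_f / n_ham

    numerator = p_f_given_spam * p_spam
    # k > 1 inflates the ham term, pulling the spam probability down
    # and so making false positives less likely.
    denominator = numerator + k * p_f_given_ham * p_ham

    if denominator == 0:
        # Feature never seen in training: treat it as neutral (assumption).
        return 0.5
    return numerator / denominator


# A feature seen in 80 of 100 spam and 20 of 100 ham training messages.
print(spam_probability(80, 100, 20, 100))         # k = 1
print(spam_probability(80, 100, 20, 100, k=2.0))  # larger k lowers the score
```

With k = 1 the expression reduces to the plain Bayes formula; raising k is one way to trade some spam recall for fewer misclassified ham messages.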

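Each incoming email is classified by tokenizing it and using the precalculated spam probability of each feature to label the email as SPAM or HAM with the following naïve Bayesian expression:

$$P(\mathrm{SPAM} \mid f_1, \ldots, f_n) = \frac{P(f_1, \ldots, f_n \mid \mathrm{SPAM})\,P(\mathrm{SPAM})}{P(f_1, \ldots, f_n)}$$

$$P(\mathrm{SPAM} \mid f_1, \ldots, f_n) = \frac{P(f_1, \ldots, f_n \mid \mathrm{SPAM})\,P(\mathrm{SPAM})}{P(f_1, \ldots, f_n \mid \mathrm{SPAM})\,P(\mathrm{SPAM}) + P(f_1, \ldots, f_n \mid \mathrm{HAM})\,P(\mathrm{HAM})}$$

Since the naïve Bayes classifier estimates the class-conditional probability by assuming that the attributes are conditionally independent, the above equation can be rewritten as:

$$P(\mathrm{SPAM} \mid f_1, \ldots, f_n) = \frac{P(\mathrm{SPAM}) \prod_{i=1}^{n} P(f_i \mid \mathrm{SPAM})}{P(\mathrm{SPAM}) \prod_{i=1}^{n} P(f_i \mid \mathrm{SPAM}) + P(\mathrm{HAM}) \prod_{i=1}^{n} P(f_i \mid \mathrm{HAM})}$$

Since P(SPAM) = P(HAM) = 0.5:

$$P(\mathrm{SPAM} \mid f_1, \ldots, f_n) = \frac{\prod_{i=1}^{n} P(f_i \mid \mathrm{SPAM})}{\prod_{i=1}^{n} P(f_i \mid \mathrm{SPAM}) + \prod_{i=1}^{n} P(f_i \mid \mathrm{HAM})}$$

Applying Bayes' theorem to each factor, P(f_i | SPAM) = P(SPAM | f_i) P(f_i) / P(SPAM), and likewise for HAM, gives:

$$P(\mathrm{SPAM} \mid f_1, \ldots, f_n) = \frac{\prod_{i=1}^{n} P(\mathrm{SPAM} \mid f_i)\,P(f_i) / P(\mathrm{SPAM})}{\prod_{i=1}^{n} P(\mathrm{SPAM} \mid f_i)\,P(f_i) / P(\mathrm{SPAM}) + \prod_{i=1}^{n} P(\mathrm{HAM} \mid f_i)\,P(f_i) / P(\mathrm{HAM})}$$

Since P(SPAM) = P(HAM) = 0.5:

$$P(\mathrm{SPAM} \mid f_1, \ldots, f_n) = \frac{\prod_{i=1}^{n} P(\mathrm{SPAM} \mid f_i)\,P(f_i)}{\prod_{i=1}^{n} P(\mathrm{SPAM} \mid f_i)\,P(f_i) + \prod_{i=1}^{n} P(\mathrm{HAM} \mid f_i)\,P(f_i)}$$

Dividing the numerator and denominator by $\prod_{i=1}^{n} P(f_i)$, we get:

$$P(\mathrm{SPAM} \mid f_1, \ldots, f_n) = \frac{\prod_{i=1}^{n} P(\mathrm{SPAM} \mid f_i)}{\prod_{i=1}^{n} P(\mathrm{SPAM} \mid f_i) + \prod_{i=1}^{n} P(\mathrm{HAM} \mid f_i)}$$

Since P(SPAM | f_i) = 1 - P(HAM | f_i):

$$P(\mathrm{SPAM} \mid f_1, \ldots, f_n) = \frac{\prod_{i=1}^{n} P(\mathrm{SPAM} \mid f_i)}{\prod_{i=1}^{n} P(\mathrm{SPAM} \mid f_i) + \prod_{i=1}^{n} \bigl(1 - P(\mathrm{SPAM} \mid f_i)\bigr)}$$

Here n = 15. That is, the fifteen most interesting features in the tokenized email are used to classify it as either SPAM or HAM. The interestingness of each feature is its distance from the neutral value 0.5:

$$I_f = \lvert 0.5 - P_f \rvert$$

where P_f = P(SPAM | f) is the spam probability of the feature f computed during training.

Mistakenly blocking a legitimate (ham) message is more severe than letting a spam message pass the filter. Let:

H -> S denote HAM misclassified as SPAM
S -> H denote SPAM misclassified as HAM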
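The final combining formula and the selection of the fifteen most interesting features can be sketched as follows. This is a hedged illustration, not the authors' code: the function names, the dictionary `feature_probs` mapping tokens to P(SPAM | f), the reading of interestingness as the absolute distance from 0.5, and the handling of unknown tokens (they are simply skipped) are assumptions made for the example.

```python
import math


def most_interesting(probs, n=15):
    """Keep the n probabilities farthest from the neutral value 0.5."""
    return sorted(probs, key=lambda p: abs(0.5 - p), reverse=True)[:n]


def combine(probs):
    """Combine per-feature P(SPAM | f_i) values into P(SPAM | f_1..f_n)."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    if spam + ham == 0:
        # Contradictory certain evidence (some p = 0 and some p = 1):
        # fall back to the neutral score (assumption).
        return 0.5
    return spam / (spam + ham)


def spam_score(tokens, feature_probs, n=15):
    """Score a tokenized email; tokens unseen in training are ignored."""
    probs = [feature_probs[t] for t in set(tokens) if t in feature_probs]
    return combine(most_interesting(probs, n))


# Hypothetical trained probabilities for three tokens.
feature_probs = {"viagra": 0.99, "meeting": 0.05, "free": 0.80}
print(spam_score(["free", "viagra", "offer"], feature_probs))
```

An email with no known tokens yields empty products, so the score falls back to 0.5, which matches the equal priors assumed in the text.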

Assuming that H -> S is λ times more costly than S -> H, we classify a message as spam only if:

$$P(\mathrm{SPAM} \mid f_1, \ldots, f_n) > \lambda \cdot P(\mathrm{HAM} \mid f_1, \ldots, f_n)$$

Since P(HAM | f_1, ..., f_n) = 1 - P(SPAM | f_1, ..., f_n), the classification criterion can be reformulated as follows:

$$P(\mathrm{SPAM} \mid f_1, \ldots, f_n) > t, \quad \text{with } t = \frac{\lambda}{1 + \lambda}$$

Here λ determines the severity of the penalty for misclassifying a legitimate email as spam. This cost sensitivity is incorporated into the system as the threshold t = λ / (1 + λ). The model is reconfigured and evaluated at different severity levels of λ. The table below details the levels of cost sensitivity that were considered:

| λ | Threshold t = λ / (1 + λ) | What such a cost sensitivity means |
|---|---|---|
| 999 | 0.999 | Blocked messages are discarded without further processing. |
| 9 | 0.9 | Blocking a legitimate message is penalized mildly more than letting a spam message pass. This models the fact that re-sending a blocked message involves more work (by the sender) than manually deleting a spam message. |
| 1 | 0.5 | The recipient does not care much about losing a legitimate message. |
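The mapping from the cost ratio λ to the decision threshold t is small enough to show directly. The sketch below is illustrative only; the function names `threshold` and `is_spam` are hypothetical.

```python
def threshold(lam):
    """Decision threshold t = lambda / (1 + lambda)."""
    return lam / (1.0 + lam)


def is_spam(p_spam, lam):
    """Classify as spam only if P(SPAM | features) exceeds the threshold."""
    return p_spam > threshold(lam)


for lam in (1, 9, 999):
    print(lam, threshold(lam))

# The same score is blocked at lambda = 9 but allowed through at lambda = 999.
print(is_spam(0.95, 9), is_spam(0.95, 999))
```

Comparing against t is equivalent to the original inequality P(SPAM | ...) > λ · P(HAM | ...), because the two posterior probabilities sum to 1.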

4. Cost-sensitive evaluation measures

A classification model is usually evaluated on accuracy and error rate. Since the cost of classifying a legitimate message as spam (false positive) far outweighs the cost of classifying spam as legitimate (false negative), cost sensitivity is built into the accuracy and error rate by treating each legitimate message as if it were λ messages. As a result, when a legitimate message is misclassified, it counts as λ errors. Thus:

$$\mathrm{WAcc} = \frac{\lambda \cdot n_{L \to L} + n_{S \to S}}{\lambda \cdot N_L + N_S} \qquad \mathrm{WErr} = \frac{\lambda \cdot n_{L \to S} + n_{S \to L}}{\lambda \cdot N_L + N_S}$$

where n_{X -> Y} is the number of messages of class X classified as class Y (L for legitimate, S for spam), and N_L and N_S are the total numbers of legitimate and spam messages.

A better measure of the filter is how its results compare with the baseline case in which no filter is used. A new measure, called Total Cost Ratio (TCR), is used for this purpose. TCR is defined as the ratio of the baseline weighted error rate to the weighted error rate. That is:

$$\mathrm{TCR} = \frac{\mathrm{WErr}^{b}}{\mathrm{WErr}} = \frac{N_S}{\lambda \cdot n_{L \to S} + n_{S \to L}}$$

where WErr^b is the baseline weighted error rate:

$$\mathrm{WErr}^{b} = \frac{N_S}{\lambda \cdot N_L + N_S}$$

A higher TCR indicates better performance. If the TCR is less than 1, then not using the filter is better. An effective spam filter should achieve a TCR greater than 1 to be useful in real-world applications. As shown in the ensuing experiments, we ran our filter with different values of λ on a variety of test cases to evaluate its efficacy under different scenarios.
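The three weighted measures can be computed from the four confusion counts, as in the sketch below. The function name and argument names are hypothetical, and the counts in the usage lines are illustrative values chosen for the example rather than figures reported in the paper.

```python
def weighted_metrics(l_to_l, l_to_s, s_to_s, s_to_l, lam):
    """Return (WAcc, WErr, TCR) for the given confusion counts.

    l_to_s counts legitimate messages classified as spam (false positives);
    s_to_l counts spam messages classified as legitimate (false negatives).
    """
    n_l = l_to_l + l_to_s          # total legitimate messages
    n_s = s_to_s + s_to_l          # total spam messages
    denom = lam * n_l + n_s

    w_acc = (lam * l_to_l + s_to_s) / denom
    w_err = (lam * l_to_s + s_to_l) / denom
    w_err_baseline = n_s / denom   # no filter: every spam message gets through

    # With no errors at all the ratio is unbounded.
    tcr = w_err_baseline / w_err if w_err > 0 else float("inf")
    return w_acc, w_err, tcr


# Illustrative counts: 4 missed spam messages and no false positives.
print(weighted_metrics(2500, 0, 2496, 4, lam=9))

# Illustrative counts: 5 false positives at lambda = 999 push TCR below 1.
print(weighted_metrics(2495, 5, 2500, 0, lam=999))
```

The second call shows why a very large λ is so demanding: a handful of blocked legitimate messages already outweighs all the spam that a missing filter would let through.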

5. Experimental Results

We conducted a series of experiments; the results are tabulated below. Each test case consisted of a collection of spam and non-spam messages. All the tests were executed with three different values of λ. The messages that made up the test cases are:

Test Case 1: A total of 5000 messages consisting of:
- 2500 non-spam messages from the training set
- 2500 spam messages from the training set

Test Case 2: A total of 5000 messages consisting of:
- 1250 non-spam messages from the test set
- 1250 spam messages from the test set
- 1250 non-spam messages from the training set
- 1250 spam messages from the training set

Test Case 3: A total of 5000 messages consisting of:
- 2500 non-spam messages from the test set
- 2500 spam messages from the test set

Test Case 4: A total of 10917 messages consisting of:
- 3778 non-spam messages from the test set
- 7139 spam messages from the test set

| Test Case | λ | Spam Precision | Weighted Accuracy | TCR |
|---|---|---|---|---|
| Test Case 1 | 1 | 99.92% | 99.92% | 625 |
| Test Case 2 | 1 | 98.04% | 98.04% | 25.51 |
| Test Case 3 | 1 | 96.86% | 96.86% | 15.92 |
| Test Case 4 | 1 | 96.46% | 96.46% | 18.49 |
| Test Case 1 | 9 | 99.92% | 99.98% | 625 |
| Test Case 2 | 9 | 97.92% | 99.45% | 18.38 |
| Test Case 3 | 9 | 96.70% | 99.02% | 10.20 |
| Test Case 4 | 9 | 96.38% | 98.55% | 11.99 |
| Test Case 1 | 999 | 99.92% | 99.99% | 625 |
| Test Case 2 | 999 | 97.72% | 99.83% | 0.6088 |
| Test Case 3 | 999 | 96.36% | 99.71% | 0.3487 |
| Test Case 4 | 999 | 95.84% | 99.56% | 0.434 |

Table 1. Results on TREC 2006 corpus.

Figure 1: Weighted Accuracy vs. λ for different test inputs

Figure 2: TCR vs. λ (λ = 1, 9, 999) for different test inputs (Test Cases 1 to 4)

6. Screenshots

Screenshots of the spam filter are shown below to demonstrate the working of the application.

The above screenshot is the Training controls screen. An optional textbox is presented to provide the path to the spam and ham training sets. Once the training is completed, the first 100 features are displayed in the table at the bottom of the screen.

In the next screenshot we load an inbox with sample messages, as shown below.

Now we test how the filter classifies the sample messages into spam and ham messages. The results are provided in the table, as shown below.

7. Conclusion

The efficacy of our filter engine is evaluated against three levels of penalty (λ = 1, λ = 9, λ = 999). A high value (> 1) of the cost-sensitive measure Total Cost Ratio at λ = 9 (threshold = 0.9) suggests that our filter is fit to be used in real-world applications. However, the performance of the filter degrades to TCR < 1 when a threshold of 0.999 (λ = 999) is enforced, which makes the model infeasible when blocked messages are deleted straightaway.

The comparison of the naïve Bayesian approach with the SVM technique is still in the works. We are in the process of fine-tuning the penalty parameters C, k so as to achieve improved accuracy. Some of our preliminary work on this is as follows:

| Test case | λ 1:1 | λ 1:9 | λ 1:999 |
|---|---|---|---|
| Test 1 | Accuracy = 88.96% (4448/5000) | Accuracy = 79.72% (3986/5000) (classification) | Accuracy = 76.44% (3822/5000) |
| Test 2 | Accuracy = 81.84% (4092/5000) | Accuracy = 50% (2500/5000) | Accuracy = 50% (2500/5000) |
| Test 3 | Accuracy = 65.36% (3268/5000) | Accuracy = 50% (2500/5000) | Accuracy = 50% (2500/5000) |
| Test 4 | Accuracy = 75.387% (8230/10917) | Accuracy = 34.6066% (3778/10917) | Accuracy = 34.6066% (3778/10917) |

The λ 1:1 column was obtained with the training set optimized by cross-validation (cross-validation accuracy = 85.4132%).

As can be seen, the accuracy on Test 1 is 88.96% when 2500 spam and 2500 ham training messages are validated on SVM. This is quite low considering that the SVM filter is classifying part of its own training set. As suggested by Chih-Jen Lin et al. [7], a "grid search" on C and γ using cross-validation is being performed. All the possible pairs of (C, γ) are tried and the one with the best cross-validation accuracy is picked. It is suggested to try exponentially growing sequences of C and γ to identify good parameters (for example, C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3). We are hopeful of configuring LIBSVM to achieve satisfactory accuracy on the training set in the coming days.
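The grid search described above can be outlined in Python. This is a minimal sketch under stated assumptions: `cv_accuracy` is a hypothetical callable that would train an SVM with the given (C, γ) and return its cross-validation accuracy, and the toy scoring function in the usage lines exists only to make the example runnable; it does not reflect any measured result.

```python
import math
from itertools import product


def exponential_grid(lo_exp, hi_exp, step=2):
    """Powers of two from 2**lo_exp to 2**hi_exp, stepping the exponent."""
    return [2.0 ** e for e in range(lo_exp, hi_exp + 1, step)]


C_VALUES = exponential_grid(-5, 15)      # 2^-5, 2^-3, ..., 2^15
GAMMA_VALUES = exponential_grid(-15, 3)  # 2^-15, 2^-13, ..., 2^3


def grid_search(cv_accuracy, c_values=C_VALUES, gamma_values=GAMMA_VALUES):
    """Try every (C, gamma) pair and return the best pair with its score."""
    best = max(product(c_values, gamma_values),
               key=lambda pair: cv_accuracy(*pair))
    return best, cv_accuracy(*best)


def toy_accuracy(c, gamma):
    """Stand-in for real cross-validation; peaks at C = 2^3, gamma = 2^-7."""
    return -(abs(math.log2(c) - 3) + abs(math.log2(gamma) + 7))


print(len(C_VALUES) * len(GAMMA_VALUES), "pairs to evaluate")
print(grid_search(toy_accuracy))
```

An exhaustive search over this grid needs one cross-validation run per pair, which is why a coarse exponential grid is tried first before refining around the best region.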

References

[1] Androutsopoulos, J. Koutsias, K.V. Chandrinos, George Paliouras, and C.D. Spyropoulos (2000). An Evaluation of Naive Bayesian Anti-Spam Filtering.
[2] Gordon V. Cormack and Thomas R. Lynam (2006). Overview of the TREC 2006 Spam Track.
[3] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz (1998). A Bayesian approach to filtering junk e-mail.
[4] Mingjun Lan, Wanlei Zhou (2005). Spam Filtering based on Preference Ranking.
[5] Paul Graham (2002). A Plan for Spam, http://paulgraham.com/spam.html
[6] Ahmed Obied. Bayesian Spam Filtering.
[7] Chih-Chung Chang and Chih-Jen Lin (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm