Spam Detection. A Bayesian approach to filtering spam

Kunal Mehrotra, Shailendra Watave

Abstract

The ever-increasing menace of spam is bringing down productivity. More than 70% of email messages are spam, and it has become a challenge to separate such messages from the legitimate ones. We have developed a spam identification engine which employs a naïve Bayesian classifier to identify spam. This probabilistic classifier was trained on TREC 2006, a corpus of known spam/legitimate messages, and it takes into account a comprehensive set of phrasal and domain-specific features (non-phrasal features, viz. emails containing attachments, emails sent from the .edu domain, etc.) that are arrived at by using standard dimensionality reduction algorithms. The cost of classifying a legitimate message as spam (false positive) far outweighs the cost of classifying spam as legitimate (false negative). This cost sensitivity was incorporated into the spam engine, and we have achieved high precision and recall, thereby reducing the false positive rate.

Keywords: Naïve Bayesian Classifier, Support Vector Machines, Precision, Recall

1. Introduction

Spam is unsolicited email that is sent indiscriminately to mailing lists, individuals and newsgroups. This misuse of the electronic messaging system is becoming rampant because spamming is economically feasible. A recent study says that more than 70% of the total messages sent over the internet are spam [1]. Spam brings down productivity, as users have to sift through their inbox to segregate legitimate email messages from spam. Hence the development of an effective and efficient spam filter is imperative.

We have developed a spam identification engine that identifies and segregates spam messages from legitimate ones. The classical naïve Bayesian approach was used to develop the spam filter. The naïve Bayesian classifier has become highly prevalent because the resulting system is less complex. It is a probabilistic classifier based on Bayes' theorem, and it assumes that the features are conditionally independent of one another given the class. The TREC 2006 email corpus was used to train and test our filter. We used 70% (approx. 25,475) of the total messages from the corpus to train our filter. The remaining 30% (approx. 10,918) of the messages were used to test the filter.

2. Literature Survey

Recently, varied techniques have been applied to identify spam. The technique proposed by Sahami et al. was among the first studies that focused on this task. The naïve Bayesian approach was preferred because of its robustness and ease of implementation in a cost-sensitive decision framework. Jason Rennie's ifile program was the first anti-spam filter developed using the Bayes classifier. A few others have also implemented variations of the above technique. Paul Graham wrote an article, A Plan for Spam, which was intended for a general audience and was well received. Other techniques such as RIPPER, ensembles of decision trees, boosting, instance-based learning and SVM were proposed subsequently. Experiments conducted by Drucker et al. verified the effectiveness of the SVM techniques. The study concluded that SVM and boosting are the top-performing methods.

3. Project Description

The objective is to implement a naïve Bayesian anti-spam filter to segregate spam from ham and to measure its efficacy using various cost-sensitive measures. The results are compared against a third-party filter, LIBSVM, which is based on another classification technique called the Support Vector Machine (SVM).

A supervised learning approach is used to enable the filter to differentiate between spam and ham. The filter is trained on 70% of the spam and ham corpus. Training requires feature extraction and the calculation of the spam probability of each extracted feature, f_i, using the naïve Bayesian expression:

$$P(\mathrm{SPAM} \mid f_i) = \frac{P(f_i \mid \mathrm{SPAM})\, P(\mathrm{SPAM})}{P(f_i)}$$

$$P(\mathrm{SPAM} \mid f_i) = \frac{P(f_i \mid \mathrm{SPAM})\, P(\mathrm{SPAM})}{P(f_i \mid \mathrm{SPAM})\, P(\mathrm{SPAM}) + k \cdot P(f_i \mid \mathrm{HAM})\, P(\mathrm{HAM})}$$

We base our calculation on the assumption that an email is equally likely to be SPAM or HAM, that is, the prior probabilities are P(SPAM) = P(HAM) = 0.5. A factor k has been introduced that can be tuned to reduce the number of false positives (the number of HAM messages misclassified as SPAM).
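The training step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the representation of a message as a list of tokens, and the estimation of P(f_i | SPAM) and P(f_i | HAM) as the fraction of messages in each class that contain the feature are all assumptions made for the example.

```python
from collections import Counter


def feature_spam_probabilities(spam_docs, ham_docs, k=1.0):
    """Estimate P(SPAM | f) for every feature f seen in training.

    spam_docs / ham_docs: non-empty lists of token lists (assumed input format).
    k: factor applied to the HAM term (k > 0); a larger k lowers P(SPAM | f)
       and so biases the filter against false positives.
    """
    prior_spam = prior_ham = 0.5  # equal priors, as assumed in the text

    # Count in how many messages of each class a feature occurs.
    spam_counts = Counter(f for doc in spam_docs for f in set(doc))
    ham_counts = Counter(f for doc in ham_docs for f in set(doc))

    probs = {}
    for f in set(spam_counts) | set(ham_counts):
        p_f_spam = spam_counts[f] / len(spam_docs)  # estimate of P(f | SPAM)
        p_f_ham = ham_counts[f] / len(ham_docs)     # estimate of P(f | HAM)
        numerator = p_f_spam * prior_spam
        denominator = numerator + k * p_f_ham * prior_ham
        probs[f] = numerator / denominator
    return probs


if __name__ == "__main__":
    spam = [["free", "money"], ["free", "offer"]]
    ham = [["meeting", "free"], ["meeting", "notes"]]
    print(feature_spam_probabilities(spam, ham, k=1.0))
    print(feature_spam_probabilities(spam, ham, k=2.0))
```

In this toy corpus a feature that occurs only in spam gets probability 1 and one that occurs only in ham gets 0. A practical filter would probably clamp such extreme values before multiplying them together, which is a design choice the text does not discuss.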

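Each incoming email is validated by tokenizing it and using the precalculated spam probability of each feature to classify the email as SPAM or HAM with the following naïve Bayesian expression:

$$P(\mathrm{SPAM} \mid f_1, \dots, f_n) = \frac{P(f_1, \dots, f_n \mid \mathrm{SPAM})\, P(\mathrm{SPAM})}{P(f_1, \dots, f_n)}$$

$$P(\mathrm{SPAM} \mid f_1, \dots, f_n) = \frac{P(f_1, \dots, f_n \mid \mathrm{SPAM})\, P(\mathrm{SPAM})}{P(f_1, \dots, f_n \mid \mathrm{SPAM})\, P(\mathrm{SPAM}) + P(f_1, \dots, f_n \mid \mathrm{HAM})\, P(\mathrm{HAM})}$$

Since the naïve Bayes classifier estimates the class-conditional probability by assuming that attributes are conditionally independent, the above equation can be rewritten as:

$$P(\mathrm{SPAM} \mid f_1, \dots, f_n) = \frac{P(\mathrm{SPAM}) \prod_{i=1}^{n} P(f_i \mid \mathrm{SPAM})}{P(\mathrm{SPAM}) \prod_{i=1}^{n} P(f_i \mid \mathrm{SPAM}) + P(\mathrm{HAM}) \prod_{i=1}^{n} P(f_i \mid \mathrm{HAM})}$$

Since P(SPAM) = P(HAM) = 0.5:

$$P(\mathrm{SPAM} \mid f_1, \dots, f_n) = \frac{\prod_{i=1}^{n} P(f_i \mid \mathrm{SPAM})}{\prod_{i=1}^{n} P(f_i \mid \mathrm{SPAM}) + \prod_{i=1}^{n} P(f_i \mid \mathrm{HAM})}$$

Applying Bayes' theorem to each factor, P(f_i | SPAM) = P(SPAM | f_i) P(f_i) / P(SPAM), and likewise for HAM:

$$P(\mathrm{SPAM} \mid f_1, \dots, f_n) = \frac{\prod_{i=1}^{n} P(\mathrm{SPAM} \mid f_i)\, P(f_i) / P(\mathrm{SPAM})}{\prod_{i=1}^{n} P(\mathrm{SPAM} \mid f_i)\, P(f_i) / P(\mathrm{SPAM}) + \prod_{i=1}^{n} P(\mathrm{HAM} \mid f_i)\, P(f_i) / P(\mathrm{HAM})}$$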

Since P(SPAM) = P(HAM) = 0.5:

$$P(\mathrm{SPAM} \mid f_1, \dots, f_n) = \frac{\prod_{i=1}^{n} P(\mathrm{SPAM} \mid f_i)\, P(f_i)}{\prod_{i=1}^{n} P(\mathrm{SPAM} \mid f_i)\, P(f_i) + \prod_{i=1}^{n} P(\mathrm{HAM} \mid f_i)\, P(f_i)}$$

Dividing numerator and denominator by the product of the P(f_i), we get:

$$P(\mathrm{SPAM} \mid f_1, \dots, f_n) = \frac{\prod_{i=1}^{n} P(\mathrm{SPAM} \mid f_i)}{\prod_{i=1}^{n} P(\mathrm{SPAM} \mid f_i) + \prod_{i=1}^{n} P(\mathrm{HAM} \mid f_i)}$$

Since P(SPAM | f_i) = 1 - P(HAM | f_i):

$$P(\mathrm{SPAM} \mid f_1, \dots, f_n) = \frac{\prod_{i=1}^{n} P(\mathrm{SPAM} \mid f_i)}{\prod_{i=1}^{n} P(\mathrm{SPAM} \mid f_i) + \prod_{i=1}^{n} \bigl(1 - P(\mathrm{SPAM} \mid f_i)\bigr)}$$

Here n = 15. That is, the fifteen most interesting features in the tokenized email are considered when classifying it as SPAM or HAM. The interestingness of each feature is computed as its distance from the neutral value 0.5:

$$I_f = \lvert 0.5 - P_f \rvert$$

where P_f = P(SPAM | f) is the precomputed probability of SPAM given the feature f.

Mistakenly blocking a legitimate (ham) message is more severe than letting a spam message pass the filter. Let

H -> S denote HAM misclassified as SPAM
S -> H denote SPAM misclassified as HAM

Assuming that H -> S is λ times more costly than S -> H, we classify a message as spam only if:

$$P(\mathrm{SPAM} \mid f_1, \dots, f_n) > \lambda \cdot P(\mathrm{HAM} \mid f_1, \dots, f_n)$$

Since

$$P(\mathrm{HAM} \mid f_1, \dots, f_n) = 1 - P(\mathrm{SPAM} \mid f_1, \dots, f_n)$$

the classification criterion can be reformulated as follows:

$$P(\mathrm{SPAM} \mid f_1, \dots, f_n) > t, \quad \text{with } t = \frac{\lambda}{1 + \lambda}$$

Here λ determines the severity of the penalty for misclassifying a legitimate email as SPAM. This cost sensitivity is incorporated into the system as the threshold t = λ / (1 + λ). The model is reconfigured and evaluated at different severity levels of λ. The table below details the levels of cost sensitivity that have been considered:

| λ | Threshold t = λ / (1 + λ) | What such a cost sensitivity means |
|---|---|---|
| 999 | 0.999 | Blocked messages are discarded without further processing. |
| 9 | 0.9 | Blocking a legitimate message is penalized mildly more than letting a spam message pass. This models the fact that re-sending a blocked message involves more work (by the sender) than manually deleting a spam message. |
| 1 | 0.5 | The recipient does not care much about losing a legitimate message. |
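The classification rule above, which combines the n most interesting features and compares the result with the threshold t = λ / (1 + λ), can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the function names, the dictionary of per-feature probabilities, the neutral value 0.5 for unseen tokens, and the neutral result when both products are zero are all choices made for the example.

```python
import math


def spam_posterior(tokens, feature_probs, n=15, unknown=0.5):
    """Combine per-feature P(SPAM | f) values into P(SPAM | f_1, ..., f_n)."""
    # Look up each distinct token; unseen tokens get a neutral value (assumption).
    probs = [feature_probs.get(t, unknown) for t in set(tokens)]
    # Keep the n most interesting features, i.e. those with the largest |0.5 - P_f|.
    probs.sort(key=lambda p: abs(0.5 - p), reverse=True)
    top = probs[:n]
    spam_term = math.prod(top)                  # product of P(SPAM | f_i)
    ham_term = math.prod(1.0 - p for p in top)  # product of 1 - P(SPAM | f_i)
    if spam_term + ham_term == 0.0:
        return 0.5  # contradictory evidence (both a 0 and a 1); stay neutral (assumption)
    return spam_term / (spam_term + ham_term)


def is_spam(tokens, feature_probs, lam=9.0, n=15):
    """Flag a message as spam only if the posterior exceeds t = lam / (1 + lam)."""
    threshold = lam / (1.0 + lam)
    return spam_posterior(tokens, feature_probs, n) > threshold


if __name__ == "__main__":
    probs = {"free": 0.9, "money": 0.8, "meeting": 0.1}
    message = ["free", "money"]
    for lam in (1.0, 9.0, 999.0):
        print(lam, spam_posterior(message, probs), is_spam(message, probs, lam))
```

With these made-up probabilities the posterior for the message is roughly 0.973, so it is flagged at λ = 1 and λ = 9 but passes at λ = 999, where the threshold is 0.999.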

4. Cost-sensitive evaluation measures

A classification model is usually evaluated on accuracy and error rate. Since the cost of classifying a legitimate message as spam (false positive) far outweighs the cost of classifying spam as legitimate (false negative), cost sensitivity is built into accuracy and error rate by treating each legitimate message as if it were λ messages. As a result, when a legitimate message is misclassified, it counts as λ errors. Thus,

$$WAcc = \frac{\lambda \cdot N_{L \to L} + N_{S \to S}}{\lambda \cdot N_L + N_S}$$

$$WErr = \frac{\lambda \cdot N_{L \to S} + N_{S \to L}}{\lambda \cdot N_L + N_S}$$

where N_L and N_S are the numbers of legitimate and spam messages, and N_{X -> Y} is the number of messages of class X classified as class Y.

A better measure of the filter is a comparison of its results with the baseline case in which no filter is used. A new measure, called the Total Cost Ratio (TCR), is used for this purpose. TCR is defined as the ratio of the baseline weighted error rate to the weighted error rate. That is,

$$TCR = \frac{WErr_b}{WErr} = \frac{N_S}{\lambda \cdot N_{L \to S} + N_{S \to L}}$$

where WErr_b is the baseline weighted error rate:

$$WErr_b = \frac{N_S}{\lambda \cdot N_L + N_S}$$

A higher TCR indicates better performance. If the TCR is less than 1, then not using the filter is better. An effective spam filter should achieve a TCR value greater than 1 to be useful in real-world applications. As shown in the ensuing experiments, we ran our filter with different values of λ on a variety of test cases to evaluate its efficacy under different scenarios.
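The weighted measures can be computed directly from the four counts involved. The Python sketch below is illustrative only: the function and parameter names are assumptions, and the counts used in the example are made up rather than taken from the experiments.

```python
def weighted_metrics(n_ham, n_spam, ham_as_spam, spam_as_ham, lam):
    """Return (WAcc, WErr, TCR) for one test run.

    n_ham, n_spam: number of legitimate and spam messages (N_L, N_S).
    ham_as_spam:   legitimate messages classified as spam (L -> S).
    spam_as_ham:   spam messages classified as legitimate (S -> L).
    lam:           cost factor lambda; each legitimate message counts lam times.
    """
    denom = lam * n_ham + n_spam
    ham_ok = n_ham - ham_as_spam    # L -> L
    spam_ok = n_spam - spam_as_ham  # S -> S
    w_acc = (lam * ham_ok + spam_ok) / denom
    weighted_errors = lam * ham_as_spam + spam_as_ham
    w_err = weighted_errors / denom
    # Baseline (no filter): all spam passes and no ham is blocked,
    # so TCR = WErr_b / WErr = N_S / (lam * N_{L->S} + N_{S->L}).
    tcr = n_spam / weighted_errors if weighted_errors else float("inf")
    return w_acc, w_err, tcr


if __name__ == "__main__":
    # Illustrative counts only: 3 false positives and 50 false negatives.
    for lam in (1, 9, 999):
        print(lam, weighted_metrics(2500, 2500, 3, 50, lam))
```

With these illustrative counts the TCR is about 32.5 at λ = 9 but drops to about 0.82 at λ = 999, which shows how a handful of false positives can push the TCR below 1 once λ is large.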

5. Experimental Results

We conducted a series of experiments, and the results are tabulated below. Each test case consisted of a collection of spam and non-spam messages. All the tests were executed with three different values of λ. The messages that were part of the test cases are:

Test Case 1: A total of 5000 messages consisting of
  - 2500 non-spam messages from the training set
  - 2500 spam messages from the training set

Test Case 2: A total of 5000 messages consisting of
  - 1250 non-spam messages from the test set
  - 1250 spam messages from the test set
  - 1250 non-spam messages from the training set
  - 1250 spam messages from the training set

Test Case 3: A total of 5000 messages consisting of
  - 2500 non-spam messages from the test set
  - 2500 spam messages from the test set

Test Case 4: A total of 10917 messages consisting of
  - 3778 non-spam messages from the test set
  - 7139 spam messages from the test set

| Test Case | λ | Spam Precision | Weighted Accuracy | TCR |
|---|---|---|---|---|
| Test Case 1 | 1 | 99.92% | 99.92% | 625 |
| Test Case 2 | 1 | 98.04% | 98.04% | 25.51 |
| Test Case 3 | 1 | 96.86% | 96.86% | 15.92 |
| Test Case 4 | 1 | 96.46% | 96.46% | 18.49 |
| Test Case 1 | 9 | 99.92% | 99.98% | 625 |
| Test Case 2 | 9 | 97.92% | 99.45% | 18.38 |
| Test Case 3 | 9 | 96.70% | 99.02% | 10.20 |
| Test Case 4 | 9 | 96.38% | 98.55% | 11.99 |
| Test Case 1 | 999 | 99.92% | 99.99% | 625 |
| Test Case 2 | 999 | 97.72% | 99.83% | 0.6088 |
| Test Case 3 | 999 | 96.36% | 99.71% | 0.3487 |
| Test Case 4 | 999 | 95.84% | 99.56% | 0.434 |

Table 1. Results on TREC 2006 corpus.

Figure 1: Weighted Accuracy vs. λ for different test inputs

Figure 2: TCR vs. λ for different test inputs (plot of TCR at λ = 1, 9 and 999 for Test Cases 1 to 4)

6. Screenshots

Screenshots of the spam filter are shown below to demonstrate the working of the application.

The above screenshot is the training controls screen. An optional textbox is presented to provide the path to the spam and ham training sets. Once the training is completed, the first 100 features are displayed in the table at the bottom of the screen.

In the next screenshot we load an inbox with sample messages as shown below.

Now we test how the filter classifies the sample messages into spam and ham messages. The results are provided in the table as shown below.

7. Conclusion

The efficacy of our filter engine is evaluated against three levels of penalty (λ = 1, λ = 9, λ = 999). A high value (> 1) of the cost-sensitive measure Total Cost Ratio at λ = 9 (threshold = 0.9) suggests that our filter is fit to be used in real-world applications. However, the performance of the filter degrades to TCR < 1 when a threshold of 0.999 (for λ = 999) is enforced, making the model infeasible when blocked messages are deleted straightaway.

The comparison of the naïve Bayesian approach with the SVM technique is still in the works. We are in the process of fine-tuning the penalty parameters C, k so as to achieve improved accuracy. Some of our preliminary results are as follows (the λ 1:1 model was optimized on the training set with a cross-validation accuracy of 85.4132%):

| Test case | λ 1:1 | λ 1:9 | λ 1:999 |
|---|---|---|---|
| Test 1 | 88.96% (4448/5000) | 79.72% (3986/5000) | 76.44% (3822/5000) |
| Test 2 | 81.84% (4092/5000) | 50% (2500/5000) | 50% (2500/5000) |
| Test 3 | 65.36% (3268/5000) | 50% (2500/5000) | 50% (2500/5000) |
| Test 4 | 75.387% (8230/10917) | 34.6066% (3778/10917) | 34.6066% (3778/10917) |

As can be seen, the accuracy on Test 1 is 88.96% when 2500 SPAM and 2500 HAM training messages are validated on the SVM. This is quite low, considering that the SVM filter is classifying part of its own training set. As suggested by Chih-Jen Lin et al. [7], a "grid search" on C and γ using cross-validation is being performed. All the possible pairs of (C, γ) are tried, and the one with the best cross-validation accuracy is picked. It is suggested to try exponentially growing sequences of C and γ to identify good parameters (for example, C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3). We are hopeful of configuring LIBSVM to achieve satisfactory accuracy on the training set in the coming days.
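The grid search over C and γ described above can be sketched in Python as follows. This is a hedged outline, not the authors' procedure: `cv_accuracy` is a hypothetical callback standing in for whatever routine trains the SVM with a given (C, γ) pair and returns its cross-validation accuracy, and the toy scoring function in the example is invented purely to make the sketch runnable.

```python
from itertools import product


def exponential_grid(start_exp, stop_exp, step=2):
    """Return [2**start_exp, 2**(start_exp + step), ..., 2**stop_exp]."""
    return [2.0 ** e for e in range(start_exp, stop_exp + 1, step)]


# Exponentially growing sequences, following the ranges quoted in the text.
C_VALUES = exponential_grid(-5, 15)      # 2^-5, 2^-3, ..., 2^15
GAMMA_VALUES = exponential_grid(-15, 3)  # 2^-15, 2^-13, ..., 2^3


def grid_search(cv_accuracy, c_values, gamma_values):
    """Try every (C, gamma) pair and keep the one with the best score.

    cv_accuracy(C, gamma) -> float is a hypothetical callback that returns
    the cross-validation accuracy of a model trained with those parameters.
    """
    best_pair, best_score = None, float("-inf")
    for c, gamma in product(c_values, gamma_values):
        score = cv_accuracy(c, gamma)
        if score > best_score:
            best_pair, best_score = (c, gamma), score
    return best_pair, best_score


if __name__ == "__main__":
    # Toy stand-in for a real cross-validation run (assumption for the demo).
    def toy_score(c, gamma):
        return -abs(c - 8.0) - abs(gamma - 2.0 ** -5)

    print(grid_search(toy_score, C_VALUES, GAMMA_VALUES))
```

The two sequences give 11 values of C and 10 values of γ, so an exhaustive search evaluates 110 pairs, each of which requires a full cross-validation run.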

References

[1] I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, George Paliouras, and C.D. Spyropoulos (2000). An Evaluation of Naive Bayesian Anti-Spam Filtering.
[2] Gordon V. Cormack and Thomas R. Lynam (2006). Overview of the TREC 2006 Spam Track.
[3] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz (1998). A Bayesian approach to filtering junk e-mail.
[4] Mingjun Lan, Wanlei Zhou (2005). Spam Filtering based on Preference Ranking.
[5] Paul Graham (2002). A Plan for Spam, http://paulgraham.com/spam.html
[6] Ahmed Obied. Bayesian Spam Filtering.
[7] Chih-Chung Chang and Chih-Jen Lin (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm