Applications of Support Vector Machine Based on Boolean Kernel to Spam Filtering

Modern Applied Science, Vol. 3, No. 10, October 2009

Shugang Liu & Kebin Cui
School of Computer Science and Technology, North China Electric Power University
Hebei 071003, China
E-mail: lsg69@sohu.com, cepuckb@163.com

Abstract

Spam is now so widespread that it harms the daily use of e-mail. Among the main spam filtering technologies, the support vector machine (SVM) is widely applied because it is efficient and achieves high classification accuracy. The main problem in applying an SVM is how to choose the kernel function. To address this problem, this paper proposes a spam filtering algorithm based on an SVM with a Boolean kernel. The algorithm filters on attributes such as the IP address, subject words, content keywords, and attachment information. These attributes make up the feature vectors, which are classified by SVM-MDNF, an SVM based on a Boolean kernel. The experimental results show that the algorithm has high classification accuracy, recall ratio, and precision ratio. The algorithm is of value in both theory and application.

Keywords: Spam, Support vector machine, Boolean kernel

1. Introduction

E-mail is one of the main means by which people exchange information on the Internet. As the Internet has become so widely used, sending and receiving e-mail has become part of daily life for a considerable number of people. However, along with its convenience, the Internet has also brought the existence and wide spread of spam, which causes people a great deal of trouble. Both work efficiency and mood suffer when people have to spend time and effort every day identifying unwanted e-mail. Distinguishing spam automatically is therefore of real significance and practical value (Shawe-Taylor J, Cristianini N. 2005).

Spam refers to promotional e-mails, carrying all kinds of publicity such as advertisements and electronic publications, that the receivers have neither requested nor agreed to in advance. Spam filtering technologies can be divided into two kinds according to where the filter is executed: server-side spam filtering and client-side spam filtering.
If we instead classify the technologies by filtering method, there are three kinds: spam filtering based on blacklists and whitelists, spam filtering based on rules, and spam filtering based on content.

1) Spam Filtering Based on Blacklist/Whitelist

Any e-mail sent by a sender in the whitelist is considered legitimate, while any e-mail sent by a sender in the blacklist is treated as spam. This method has recently been widely used in spam filtering. Usually a blacklist and a whitelist are collected. The entries in these lists can be e-mail addresses, the DNS names of e-mail servers, or IP addresses. They help receivers check senders in real time.

2) Spam Filtering Based on Rules

This method requires people to define a set of rules. An e-mail is treated as spam if it matches one or several of the rules. The rules typically include header analysis, detection of mass mailing, exact keyword matching, and checks on other features of the e-mail.

3) Spam Filtering Based on Content

In practice, the senders of spam change continuously, so blacklists and whitelists have great limitations. Rule-based filtering also has disadvantages: the rules are written by people, and rules written by users who lack experience are less valid and less accurate. Many experts have therefore proposed analyzing the content of an e-mail first and then deciding whether it is spam. This method combines spam filtering with other technologies, such as text classification and information filtering, and requires the algorithms of text classification and information filtering to be introduced into spam filtering.

To address the spam problem, a great number of measures have been adopted, such as extensions of e-mail protocols, certification of e-mail servers, spam filtering, and legislation. Among these measures, spam filtering is the most practical. Many text classification algorithms have now been applied to content-based spam filtering, such as Bayes, decision trees, k-nearest neighbor, and support vector machines (Wang Bin, Pan Wenfeng. 2005). Among them, SVM has been the most successful in spam filtering.

2. Evaluation Standards of Spam Filtering Systems

The performance evaluation of spam filtering often makes use of indexes from text classification. Whether a text classification method is mature is judged by its mapping accuracy and mapping speed. The mapping speed is determined by the complexity of the mapping algorithm; the mapping accuracy is evaluated with measures from information retrieval. The following are the definitions of two common indexes from information retrieval as used in the spam filtering field, the recall ratio and the precision ratio (C.J. van Rijsbergen. 1979).

Def 1: The recall ratio is the ratio of the amount of spam that has been filtered to the amount of e-mails that should be filtered:

$$\text{Recall} = \frac{\text{amount of filtered spam}}{\text{amount of e-mails that should be filtered}} \tag{1}$$

Def 2: The precision ratio is the ratio of the amount of spam that has been filtered to the amount of e-mails that have been filtered:

$$\text{Precision} = \frac{\text{amount of filtered spam}}{\text{amount of e-mails that have been filtered}} \tag{2}$$

Both the recall ratio and the precision ratio reflect the quality of e-mail classification. They should be considered together rather than attending to only one of them, so the F1 test value is often used to evaluate the classification result as a whole:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$

The recall ratio, precision ratio, and F1 test value can each be calculated as a micro average or as a macro average.
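Definitions (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name and the example counts are hypothetical, and "filtered spam" is taken to mean spam that the filter correctly caught.

```python
def precision_recall_f1(true_positive, false_positive, false_negative):
    """Compute precision, recall and F1 for the spam class from raw counts.

    true_positive:  spam that was filtered
    false_positive: normal e-mails that were filtered by mistake
    false_negative: spam that was not filtered
    """
    filtered = true_positive + false_positive        # e-mails that have been filtered
    should_filter = true_positive + false_negative   # e-mails that should be filtered
    precision = true_positive / filtered if filtered else 0.0
    recall = true_positive / should_filter if should_filter else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only for illustration.
p, r, f1 = precision_recall_f1(true_positive=90, false_positive=10, false_negative=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, it stays close to the smaller of the two ratios, which is why it penalizes a filter that trades one ratio for the other.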
The macro average computes the recall ratio, precision ratio, and test value separately for each class and then averages them over the classes; the micro average pools the decisions of all classes and computes the recall ratio, precision ratio, and test value once. Ultimately, every e-mail filtering algorithm aims to meet the performance requirements on recall ratio and precision ratio in e-mail classification.

3. Support Vector Machine Based on Boolean Kernel Function

The support vector machine (SVM) (Zhang Yang, Li Zhanhuai, Tang Yan, Cui Kebin. 2004) is a learning method proposed by Vapnik and the research group he led at Bell Laboratories. The method is based on statistical learning theory. SVM was developed from the optimal separating plane for linear classification, and its basic idea is the maximum margin. "Optimal" means that the separating plane is required not only to separate the two classes of text correctly but also to maximize the margin between them. Maximizing the margin is in effect a way of controlling the generalization ability.

A linear support vector machine separates the positive and negative examples by constructing the optimal hyperplane $\langle W, X \rangle + b = 0$ in the input space. Here $\langle \cdot, \cdot \rangle$ denotes the inner product, $W \in R^n$, $b \in R$, and $y_i \in \{+1, -1\}$ is the class label of example $X_i$, such that:

$$y_i(\langle W, X_i \rangle + b) - 1 \ge 0, \quad i = 1, 2, \ldots, d \tag{4}$$

It can be proved that the optimal separating plane is the one that minimizes $\frac{1}{2}\|W\|^2$ in the input space. To solve this problem, we transform it into its dual form by Lagrangian optimization. In the dual form, the multipliers $\alpha_i$ must satisfy the constraints:

$$\sum_{i=1}^{d} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \quad i = 1, 2, \ldots, d \tag{5}$$
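The step from the primal problem to its dual can be made explicit. The following is a sketch of the standard Lagrangian derivation, not taken from the paper; the symbol $L$ is introduced here for illustration. Attaching a multiplier $\alpha_i \ge 0$ to each constraint in (4) gives

```latex
L(W, b, \alpha) = \frac{1}{2}\|W\|^2
  - \sum_{i=1}^{d} \alpha_i \left[ y_i(\langle W, X_i \rangle + b) - 1 \right]

\frac{\partial L}{\partial W} = 0 \;\Rightarrow\; W = \sum_{i=1}^{d} \alpha_i y_i X_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{d} \alpha_i y_i = 0

L = \sum_{i=1}^{d} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{d} \sum_{j=1}^{d} \alpha_i \alpha_j y_i y_j \langle X_i, X_j \rangle
```

The second line recovers the equality in constraint (5), and the third line, obtained by substituting the stationarity conditions back into $L$, depends on the examples only through their inner products. That property is what later allows the inner product to be replaced by a kernel function.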

Under constraint (5), the dual problem is to maximize over $\alpha$ the function:

$$Q(\alpha) = \sum_{i=1}^{d} \alpha_i - \frac{1}{2} \sum_{i=1}^{d} \sum_{j=1}^{d} \alpha_i \alpha_j y_i y_j \langle X_i, X_j \rangle \tag{6}$$

Here $\alpha_i$ is the Lagrange multiplier corresponding to constraint (4) in the primal problem. This is a quadratic optimization problem under inequality constraints, and it has a unique solution. It can easily be proved that only a part (often a small part) of the $\alpha_i$ in the solution are nonzero, and the corresponding examples are the support vectors. Solving the problem above gives the optimal separating function:

$$f(X) = \mathrm{sgn}\left( \sum_{i=1}^{d} \alpha_i y_i \langle X_i, X \rangle + b \right) \tag{7}$$

In this function, the summation in fact runs only over the support vectors, and $b$ is the separating threshold. It can be computed from any support vector (one that satisfies (4) with equality) or from the median of any pair of support vectors taken from the two classes:

$$b = y_s - \sum_{i=1}^{d} \alpha_i y_i \langle X_i, X_s \rangle, \quad \alpha_s > 0 \tag{8}$$

Here $\mathrm{sgn}()$ is the sign function.

With a non-linear mapping $\varphi$, vectors of the input space can be transformed into vectors of a higher-dimensional space, which is called the feature space. A non-linear SVM uses the mapping $\varphi$ in this way, so $X_i$ and $X$ in the equations above are replaced by $\varphi(X_i)$ and $\varphi(X)$ respectively. We then get:

$$f(X) = \mathrm{sgn}\left( \sum_{i=1}^{d} \alpha_i y_i \langle \varphi(X_i), \varphi(X) \rangle + b \right) \tag{9}$$

where

$$b = y_s - \sum_{i=1}^{d} \alpha_i y_i \langle \varphi(X_i), \varphi(X_s) \rangle, \quad \alpha_s > 0 \tag{10}$$

We call a function of the form $K(x, y) = \langle \varphi(x), \varphi(y) \rangle$ a kernel function. Some common kernel functions include:

1) Gaussian radial basis function: $K(x, x_i) = \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right)$
2) Polynomial: $K(x, x_i) = (\langle x, x_i \rangle + 1)^d$, where the degree $d$ is a positive integer (not the number of examples used in the sums above)
3) Hyperbolic tangent: $K(x, x_i) = \tanh(\beta \langle x, x_i \rangle + b)$
4) Spline kernel function: $K(x, x_i) = B_{2n+1}(x - x_i)$

Choosing different kernel functions gives different non-linear support vector machines. If $x$ and $y$ in the kernel functions above are Boolean, we can suppose that $U \in \{0,1\}^n$, $V \in \{0,1\}^n$, $\sigma > 0$, $p \in N$, and that $I$ represents the unit vector. Then:

$$K_{MDNF}(U, V) = -1 + \prod_{i=1}^{n} (\sigma u_i v_i + 1) \tag{11}$$

We call $K_{MDNF}$ the Monotone Disjunctive Normal Form (MDNF) kernel function. The MDNF kernel function is the kernel function used by the SVM algorithm in this paper.
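A minimal sketch of the MDNF kernel may help make the definition concrete. It assumes the product form $K_{MDNF}(U, V) = -1 + \prod_i (\sigma u_i v_i + 1)$ over Boolean vectors; the function names and the default $\sigma = 1$ are illustrative choices, not taken from the paper. For 0/1 inputs each factor is either $1$ or $1 + \sigma$, so the product collapses to $(1 + \sigma)^{\langle U, V \rangle}$, which the second function uses.

```python
def mdnf_kernel(u, v, sigma=1.0):
    """K_MDNF(U, V) = -1 + prod_i (sigma * u_i * v_i + 1) for Boolean vectors."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    product = 1.0
    for ui, vi in zip(u, v):
        product *= sigma * ui * vi + 1.0
    return product - 1.0


def mdnf_kernel_closed_form(u, v, sigma=1.0):
    """Equivalent form for 0/1 inputs: (1 + sigma) ** <U, V> - 1."""
    shared = sum(ui * vi for ui, vi in zip(u, v))  # features set to 1 in both vectors
    return (1.0 + sigma) ** shared - 1.0


u = [1, 0, 1, 1]
v = [1, 1, 0, 1]
print(mdnf_kernel(u, v, sigma=1.0))              # two shared features: 2**2 - 1
print(mdnf_kernel_closed_form(u, v, sigma=1.0))
```

With $\sigma = 1$ the kernel value is the number of non-empty conjunctions of features that are true in both vectors, which is why the kernel corresponds to a feature space of monotone conjunctions.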
4. SVM Spam Filtering Based on Boolean Kernel Function and the Experiment Results

4.1 The Strategy of SVM Spam Filtering Based on Boolean Kernel

This experiment uses the Enron-spam e-mail dataset. The dataset has two parts: "preprocessed", the set of e-mails that have already been preprocessed, and "raw", the original e-mails, which can be preprocessed as needed. Our experiment draws both the training set and the testing set from the preprocessed part. We select 2000 e-mails, of which 1100 are spam and 900 are normal e-mails. The specific procedure of SVM spam filtering based on the Boolean kernel is as follows:

1) First, we apply standard preprocessing to the dataset. We remove noise words (such as spelling mistakes) and keep only the words whose document frequency is between 2 and 8000. We assign different weights to the subject and the body text of every e-mail, giving the subject the higher weight so that words appearing in the subject count for more. Taking the subject, the body text, and other features of the e-mail into consideration, we obtain the feature vector of every e-mail.

2) Binarize the features in the feature vector, that is, give every feature the value 0 or 1. Since we use the MDNF Boolean kernel here, the feature vector has to be transformed into a Boolean feature vector.

3) Filter spam with the SVM based on the MDNF Boolean kernel.

To verify whether the algorithm is valid, we use k-fold cross-validation in our experiment. The e-mails are separated into k parts; k-1 parts are used for training and the remaining part for testing. The procedure is repeated k times so that every part is tested once. Finally, the average of the test values is used as the result for evaluation. Here we set k to 10.

4.2 Experiment Result and Analysis

In this experiment, we compare the classification accuracy of the spam filtering algorithm based on the Boolean kernel SVM with that of three other algorithms: Naïve Bayes, linear SVM, and non-linear SVM based on radial basis functions. The result is shown in Table 1.

The comparison shows that the SVM based on the MDNF Boolean kernel has the highest classification accuracy, followed by the non-linear SVM based on radial basis functions. Naïve Bayes has the lowest.

Classification accuracy alone cannot fully evaluate the effectiveness of an e-mail classification algorithm. We therefore evaluate the algorithms further using the precision ratio, recall ratio, and F1 defined in Section 2. Table 2 compares the recall ratio, precision ratio, and F1, which allow a more comprehensive evaluation of the validity of the algorithm. The experimental results show that the SVM based on the MDNF Boolean kernel has the best spam filtering effect of the four.
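Steps 2) and 3) of Section 4.1, together with the k-fold evaluation, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the binarization rule (a feature becomes 1 when its weight exceeds a threshold of 0) is an assumption, all function names are hypothetical, and the classifier is supplied through a `train_and_score` callback. The `majority_baseline` below is only a placeholder so the sketch runs; it is not an SVM, and an SVM with the MDNF kernel would be plugged in at that point.

```python
import random


def binarize(features, threshold=0.0):
    """Step 2: map each weighted feature to 1 if it exceeds the threshold, else 0."""
    return [1 if value > threshold else 0 for value in features]


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, train_and_score, k=10, seed=0):
    """Train on k-1 folds, test on the remaining one, and average the k scores."""
    scores = []
    for test_idx in k_fold_indices(len(samples), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        scores.append(train_and_score(
            [samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in test_idx], [labels[i] for i in test_idx],
        ))
    return sum(scores) / len(scores)


def majority_baseline(train_x, train_y, test_x, test_y):
    """Placeholder scorer: predict the most frequent training label (not an SVM)."""
    majority = max(set(train_y), key=train_y.count)
    return sum(1 for y in test_y if y == majority) / len(test_y)


# Toy data, invented for illustration: 11 spam (1) and 9 normal (0) e-mails.
weighted = [[0.0, 2.5, 0.0], [1.0, 0.0, 3.0]] * 10
samples = [binarize(row) for row in weighted]
labels = [1] * 11 + [0] * 9
average_accuracy = cross_validate(samples, labels, majority_baseline, k=10)
print(average_accuracy)
```

Passing the classifier in as a callback keeps the evaluation loop independent of the learning algorithm, so the same folds can be reused to compare several classifiers, as the comparison in Section 4.2 does.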
5. Conclusion

After analyzing the characteristics of spam, we propose a spam filtering algorithm based on an SVM with the MDNF Boolean kernel, in which the feature vector is built from the e-mail subject, body text, and other features. The experiment shows that this algorithm has higher classification accuracy and better recall and precision ratios than Naïve Bayes, linear SVM, and SVM based on radial basis functions. In future experiments, we will apply SVMs with other Boolean kernels to spam filtering in the hope of further improvement.

References

C.J. van Rijsbergen. (1979). Information Retrieval (2nd edition). London: Butterworths.

http://www.cs.cmu.edu/~enron/

Shawe-Taylor J, Cristianini N. (2005). Kernel Methods for Pattern Analysis. Beijing: China Machine Press, 2005: 60-74.

Wang Bin, Pan Wenfeng. (2005). Content-based spam filtering technology. Journal of Chinese Information Processing, 19(5): 1-10.

Zhang Yang, Li Zhanhuai, Tang Yan, Cui Kebin. (2004). DRC-BK: Mining Classification Rules with Help of SVM. In Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'04), Lecture Notes in Artificial Intelligence, Volume 3056, Springer-Verlag.

Table 1. Comparison of classification accuracy

| Classification algorithm | Classification accuracy |
|---|---|
| NB | 92.5% |
| Linear SVM | 93.8% |
| RBF kernel SVM | 94.7% |
| MDNF-SVM | 97.8% |

Table 2. Comparison of recall, precision and F1

| Classification algorithm | Recall | Precision | F1 |
|---|---|---|---|
| NB | 90.4% | 88.7% | 89.5% |
| Linear SVM | 91.2% | 90.5% | 90.8% |
| RBF kernel SVM | 92.2% | 92.5% | 92.3% |
| MDNF-SVM | 94.2% | 95.5% | 94.8% |
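As a consistency check on Table 2, the F1 column can be recomputed from the recall and precision columns with equation (3). The worked example below uses the MDNF-SVM row; it is an added illustration, with the intermediate values rounded.

```latex
F_1 = \frac{2 \times 0.955 \times 0.942}{0.955 + 0.942}
    = \frac{1.79922}{1.897}
    \approx 0.9485
```

This rounds to the 94.8% reported in the table.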