An Evaluation of Naïve Bayesian Anti-Spam Filtering Techniques

Proceedings of the 2007 IEEE Workshop on Information Assurance
United States Military Academy, West Point, NY, 20-22 June 2007

Vikas P. Deshpande, Robert F. Erbacher, and Chris Harris

Abstract: An efficient anti-spam filter that blocks all spam without blocking any legitimate messages is a growing need. To address this problem, we examine the effectiveness of a statistically based approach, the Naïve Bayesian anti-spam filter, which is content-based and self-learning (adaptive) in nature. Additionally, we designed a derivative filter based on relative numbers of tokens. We train the filters using a large corpus of legitimate messages and spam, and we test the filters using new incoming personal messages. More specifically, four filtering techniques available for a Naïve Bayesian filter are evaluated. We look at the effectiveness of each technique, and we evaluate different threshold values in order to find an optimal anti-spam filter configuration. Based on cost-sensitive measures, we conclude that additional safety precautions are needed for a Bayesian anti-spam filter to be put into practice. However, our technique can make a positive contribution as a first-pass filter.

Index Terms: Spam filter, Naïve Bayesian, Evaluation

I. INTRODUCTION

Spam continues to be a growing problem, accounting for upwards of 90% of all e-mail today [15]. While spam filters have become more effective [16] and widespread, many spam messages continue to be delivered to end users. The difficulty in eliminating spam lies in differentiating it from a legitimate message. However, the message content of spam typically forms a distinct category rarely observed in legitimate messages, making it possible for text classifiers to be used for anti-spam filtering.

The goal of this research is to examine the effectiveness of Naïve Bayesian anti-spam filters and the effect of parameter settings on the effectiveness of spam filtering. Additionally, we look at a novel modification to existing filters and incorporate it into the evaluation.

A Naïve Bayesian anti-spam filter is a text categorization technique based on a machine-learning algorithm. Proposed by Sahami et al.
[10], the text categorization technique shows some impressive results on new, unseen incoming messages. The filter requires training that can be provided by a previous set of spam and legitimate messages. It keeps track of each word that occurs only in spam, only in legitimate messages, and in both. Based on these occurrence statistics for words (also called tokens), incoming unseen messages are processed and classified accordingly.

Author notes: Vikas P. Deshpande recently completed his master's degree in computer science at Utah State University. He can be contacted at vikas@cc.usu.edu. Robert F. Erbacher is an assistant professor in the Computer Science Department at Utah State University. He can be contacted at Robert.Erbacher@usu.edu. Chris Harris is a master's student in the Computer Science Department at Utah State University. He can be contacted at cwharris@cc.usu.edu.

ISBN 9-9999-9999-9/99/$20.00 2007 IEEE

II. BAYESIAN CLASSIFIER

A. Bayesian Classifier

A Bayesian classifier is the application of a Bayesian network to the process of text classification. Bayesian networks are probabilistic networks that are used as problem-solving models in different fields of work. In our case, a Bayesian network is used to represent a probability distribution of specified text contained in a spam email. In such a graph, a node represents a random variable, and a directed edge indicates a probabilistic dependency from the variable denoted by the parent node to that of the child. Hence, it is implied that any node in the network is conditionally independent of its non-descendants, given its parents. Each node is associated with a conditional probability table that indicates the distribution over that node for any possible assignment of values to its parents [10, 13].

We formulate the Bayesian network to solve our classification problem. Let C be the class variable that indicates to which class (legitimate / spam) a message belongs, and let node X denote any attribute (token, in our case) in the message. For our purposes, $c_k$ denotes a particular class and $x$ the given specific values of the required attributes. Each specific value is 0 or 1, depending on whether the attribute is present in the message.
The problem of determining the class can be solved using Bayes' theorem:

$$P(C = c_k \mid X = x) = \frac{P(X = x \mid C = c_k)\, P(C = c_k)}{P(X = x)}$$

The probability $P(X = x \mid C = c_k)$ is difficult to calculate, as there is a high chance that the X attribute might be dependent on some other set of attributes. One way to overcome this difficulty is to assume independence of the attributes. This is the basic assumption of the Naïve Bayesian filter.

B. Naïve Bayesian Filter

A Naïve Bayesian model is the most restrictive form in the feature dependence spectrum. Research has been done regarding the performance of spam filters that allow some degree of dependence between features. This study can be formalized by introducing the notion of k-dependence Bayesian classifiers. A k-dependence Bayesian classifier is a Bayesian network where each feature is allowed to have a maximum of k parents. Based on this definition, we can say that a Naïve Bayesian filter is a 0-dependence Bayesian classifier. We can also state that an ideal Bayesian filter (i.e., a full Bayesian filter with no independence assumptions) is an (N-1)-dependence Bayesian classifier, where N is the number of domain features. By varying the value of k, one can move step by step along the feature dependence spectrum and analyze the performance of the spam filter at every step.

It is also worth noting that as k grows, there are more conditioning variables with the same amount of data. This implies a larger probability space for estimation with the same data, causing inaccuracy in the probability estimates and leading to an overall decrease in performance. This performance problem has been observed in many domains while going from k = 2 to k = 3.

Using a Bayesian network, we can model the complex dependencies between features to infer the solution class. As the number of features increases, it becomes increasingly difficult for a message to be classified with all its dependencies. As a result, spam filters implement a Naïve Bayesian concept where features are assumed to be independent of each other. If we assume the attributes are conditionally independent of each other, the probability becomes:

$$P(X = x \mid C = c_k) = \prod_{i} P(X_i = x_i \mid C = c_k)$$

$$P(C = c_k \mid X = x) = \frac{\prod_{i} P(X_i = x_i \mid C = c_k)\, P(C = c_k)}{P(X = x)}$$

If an email message is considered to be a set of attributes (i.e., words), then using a Bayesian network, we can calculate the probability that a message belongs to a specific class, namely, legitimate or spam.

III. EXPERIMENT

Our experiment comprises two phases: the training phase and the classification phase. In the training phase, the filter is trained using a known corpus of spam and good emails. The tokens appearing in each corpus and their total occurrences are maintained in a database.
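The training phase just described can be illustrated with a minimal sketch. It assumes that messages are plain strings and that the "database" is simply a pair of in-memory counters; the function name `train` and the sample messages are hypothetical and are not taken from the paper's implementation.

```python
from collections import Counter


def train(good_messages, bad_messages):
    """Count token occurrences separately for legitimate mail and spam."""
    good_table = Counter()  # token -> occurrences in legitimate messages
    bad_table = Counter()   # token -> occurrences in spam messages
    for text in good_messages:
        good_table.update(text.split())
    for text in bad_messages:
        bad_table.update(text.split())
    return good_table, bad_table


# Tiny illustrative corpus (hypothetical data).
good_table, bad_table = train(
    ["meeting at noon", "lunch at noon"],
    ["free money now", "free offer"],
)
```

A persistent store would replace the counters in practice; the counters are used here only to keep the sketch self-contained.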
Based on their occurrences in each set of spam and good emails, each token is assigned a probability reflecting how strongly its presence indicates that an email is spam. Then, using this knowledge of tokens, the filter classifies every new incoming email in the classification phase. Once the status of a new email is confirmed, all its tokens are also recorded, thus updating the database. This self-learning function of our filter makes it unique among the other available spam filters. Even if the filter misclassifies a message, the user can rectify it, and the spam filter would update its database accordingly. Thus, the filter learns from its mistakes, too.

We used 1250 legitimate messages and 11350 spam messages. Spam messages were collected from an archive provided by Nik Martin, available at the site hosted by Paul Graham (www.paulgraham.com) [6]. Since these are selections from real messages and the selection of the training set is random, we expect the results described here to be representative of different training and testing data sets. The proportion of spam to legitimate messages is quite large, making it more likely that legitimate messages are misclassified as spam. This makes the situation more challenging, as the cost of false positives is much higher than that of false negatives. We feel that by minimizing the false positives in such a situation, we have achieved an efficient Bayesian spam filter. Moreover, by recording tokens from such a huge quantity of spam, we have covered almost all the topics for spam and are in a good position to classify new incoming emails as spam. Additionally, this proportion is more realistic of real-world scenarios, given typical proportions of e-mail.

Each word in each email message is considered to be a token. The whole message, including the header, is parsed for tokens. The token separator is a blank space. Words quoted in double and single quotes, numbers, and all words separated by blank spaces are also considered as tokens. The number of tokens under study and used for classification is 90930. Since we are not using any type of lemmatizer, we consider different forms of the same word as different tokens.
For example, run, running and runner are all considered different tokens even though they stem from the single word run. There are studies [7] that show the positive effect of a lemmatizer on a filter's performance. The implementation of a lemmatizer is one of the topics of our future study.

Only the message content is used for classification purposes. Doing so eliminates the interference of tokens present in headers in determining the status of a message. In this way, there is no bias among the classification techniques that are considered for evaluation, as some techniques consider only a few (or a percentage of) tokens for assigning a final score to the message.

Our evaluation was conducted in the classification phase. We evaluated four effective filtering techniques of the Bayesian spam filter for their classification performance. We evaluated these techniques using cost-sensitive measures, as we believe that the cost of a false positive is much higher than that of a false negative. Eighty new incoming messages were tested in two batches (first batch: 50; second batch: 30) to obtain significant evaluation results. These tested messages belong to the same email account previously used in the training phase. The effective configuration of each technique was used for evaluation purposes. In the standard deviation technique, the value of the standard deviation was set to 0.4, and in the percentage technique, 30 percent of the total tokens were used to calculate the final score. The tabulated results and related plotted data are explained in the results section.
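The tokenization rule described in this section (blank space as the separator, no lemmatizer) can be sketched in a few lines. The function name `tokenize` and the sample text are assumptions made for illustration; note that Python's `str.split()` splits on any run of whitespace, which is slightly broader than a single blank space.

```python
def tokenize(message_body):
    """Split a message on whitespace; no lemmatization is applied,
    so different forms of the same word remain different tokens."""
    return message_body.split()


tokens = tokenize("run running runner 42 'quoted' run")
distinct = set(tokens)  # 'run' appears twice but is one distinct token
```

Numbers and quoted words survive as tokens in their own right, which mirrors the description above.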

A. Description of Four Filters Tested

Once the Naïve Bayesian filter is trained using huge datasets of spam and non-spam messages, it is ready to perform its basic function of filtering, i.e., classifying new, unseen incoming messages. Currently, there are many classification techniques used with Naïve Bayesian filters available on the market. We discuss in detail the four significant techniques that we tested.

1) All Tokens Filter

This technique demands the use of all tokens from a new email for classification. As each token is associated with a probability that determines the chances of the email being spam, the tokens from each new email are used to calculate a combined probability that assigns a final score to the email. A new token in an email (i.e., one with no record in the database) would be assigned a probability of 0.4. This assumption has been implemented in practice and found successful in Naïve Bayesian filters. It implies that a new token is considered to be a good token rather than a part of a spam. It also indicates the positive approach adopted by spam filters, since the cost of a false positive is much higher than that of a false negative. However, we turn off this feature for the purposes of our evaluation because we do not want to favor one technique (by taking a positive approach) over others.

This global technique makes sense, as we parsed all tokens from the training datasets to build the database used for classification. Hence, it is logical to use the same technique for classification. It should be noted that the classification phase is critical due to the heavy cost of a false positive, as compared to the training phase, where we know exactly whether an email is spam or not.

This technique might be deceived by an email in which there is a long story of how a person got rich instantly, followed by a link to a spam site. Such emails would contain a large number of good tokens as compared to a typical spam. There is a high possibility that such emails would deceive spam filters and be categorized as good email.
But it is equally true that spammers avoid writing a long story, as it is very likely that email readers would rather delete than read a long article from some unknown source. Thus, the all tokens method is found to be effective in practical filters. For example, Bill Yerazunis has used this technique in his Controllable Regex Mutilator (CRM114).

2) Fixed Number of Tokens Filter

The use of a fixed number of tokens, successfully implemented by Paul Graham [5], takes only a fixed number of tokens from a new email into consideration for assigning a final score to it. The number can vary from 15 to 25, but these tokens are assumed to be the most effective in the given email. The effective tokens are the ones with probabilities that deviate the most from 0.5, i.e., a token can be either a good token or a bad one. The combined probabilities of these tokens assign a final score to the given new email. In this way, the most effective tokens are emphasized for the task. This technique directly targets those words that are found most of the time in either legitimate emails or spam. As a result, the final score will most probably end up near 1 if the email is spam, or near 0 otherwise. Thus, this technique alleviates the doubt in email classification that arises when the final score ends up near 0.5. This measure of effectiveness was proposed by Sahami et al., who calculated it with the help of the mathematical formula for mutual information.

It is recommended that the same token not be counted more than once while calculating a final score. This way, the filter makes an unbiased decision with no interference from any specific token, even if it occurs multiple times in the message. The number of tokens used (15/20/25) is a personal decision, based on the success of the spam filter on personal emails. If the number of tokens in a new email happens to be less than a fixed number, say 10, then the use of all tokens is the logical back-up technique for classification.

This technique has some advantages over other techniques. First, to avoid the problem of false positives, the threshold value can be raised from 0.5 to any value near 0.9. Second, in the case of huge emails, the classification would be faster.
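The selection rule of the fixed number of tokens filter can be sketched as follows. This is an illustrative sketch only: the function name `select_fixed`, its parameter names, and the sample probabilities are assumptions, and the default of 0.4 for unknown tokens is borrowed from the all tokens description rather than stated for this technique.

```python
def select_fixed(tokens, spam_prob, n=15, min_tokens=10, unknown=0.4):
    """Return the spam probabilities of the n tokens deviating most from 0.5.

    tokens     -- tokens of the new email, in order of appearance
    spam_prob  -- mapping token -> spam probability (assumed to exist already)
    min_tokens -- below this many distinct tokens, fall back to all tokens
    unknown    -- probability assumed for tokens with no database record
    """
    unique = list(dict.fromkeys(tokens))  # count each token only once
    probs = [spam_prob.get(tok, unknown) for tok in unique]
    if len(unique) < min_tokens:
        return probs  # back-up technique: use all tokens
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]


# Hypothetical probabilities for illustration.
spam_prob = {"free": 0.99, "meeting": 0.02, "the": 0.5, "offer": 0.9}
top_two = select_fixed(["free", "the", "meeting", "free", "offer"],
                       spam_prob, n=2, min_tokens=3)
fallback = select_fixed(["free", "the"], spam_prob, n=2, min_tokens=3)
```

Sorting by the absolute distance from 0.5 treats strongly good and strongly bad tokens symmetrically, which is the property the text relies on.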
3) Standard Deviation Threshold Filter

This technique, like the previous technique, considers only the effective tokens. However, it emphasizes the spam probability of tokens rather than the number of tokens. If the standard deviation threshold ($\sigma_T$) has the value x, then all tokens with a spam probability in the range of 0.5 - x to 0.5 + x are discarded. The remaining tokens are the effective ones used to calculate the combined probability and assign a final score to the new email. Bogofilter, a spam filter that is currently available on the market, has adopted this approach. The value of $\sigma_T$ can be varied based on the filter's success on one's personal messages. The value found to be most successful is 0.4. Thus, the tokens under consideration would be the ones with probabilities lower than 0.1 or higher than 0.9.

The specialty of the technique is that it assigns the score to the email independent of its size. Based on the content of an email, there might be only ten effective tokens, or sometimes there may be more than 100. But for every classification, only effective tokens with probabilities of 0.9 and above or 0.1 and lower would be considered. Like the previous technique, the score in this case would be near 1 if the email is spam, or near 0 otherwise. Thus, it is less likely that the score would end up near 0.5, where the possibility of false positives arises. The same token should not be considered more than once, to avoid interference from any specific token that occurs several times in the message. The threshold, as in the previous technique, can be raised to 0.9 to reduce the possibility of false positives. The processing time for classification varies according to the size of the email.

4) Relative Number of Tokens Filter

We developed and evaluated a novel technique along with the existing and well-known techniques. The goal of this technique was to examine the impact of considering a relative number of tokens, in contrast with the fixed number of tokens of the previously discussed techniques. Since the Naïve Bayesian filter is trained with the contents of email messages, it is logical to apply the same content-based approach for classification as well. In this technique, we select some percentage (say 30 percent) of effective tokens out of the total tokens of an email message. These tokens are used to calculate the combined probability and assign a final score to the email message. The percentage value can be tuned, based on the success of the filter on personal email messages.

This approach is a combination of both the above techniques: the use of a fixed number of tokens and the use of a standard deviation. It values both the effectiveness and the number of tokens while classifying a message. So, if an email contains 100 tokens, then the 30 most effective tokens among them will be used for classification. There is a high possibility that many of these 30 tokens would fall in the range discarded by the $\sigma_T$ threshold. In this way, we utilize the advantages of both the above techniques. As it is a content-based approach, there are chances that the final score of an email might fall near 0.5. To avoid false positives, the threshold can be raised to a higher value.

B. Experimental Filter

The experimental Bayesian filter is a content-based approach. This attribute gives this approach an advantage over other approaches. Spammers cannot modify the content to deceive the filters, as content is the only reason to send spam in the first place. Content in this case includes headers and the message itself.

First, the filter must be trained to work accordingly. A considerable number of good emails and spam are required to train the filter. Two tables are maintained, one each for legitimate email and spam. Let us call them the good table and the bad table, respectively. The good table contains the tokens that occur in the good emails, along with their number of occurrences. Similarly, the data for bad emails is maintained in the bad table.
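For comparison, the two token selection rules from Section III.A, the standard deviation threshold filter (technique 3) and the relative number of tokens filter (technique 4), can be sketched as follows. The function names, the strict inequality at the band edges, and the rounding used to turn a percentage into a token count are all assumptions made for this illustration; the text does not specify them.

```python
def select_by_deviation(probs, sigma_t=0.4):
    """Discard probabilities inside the band 0.5 - sigma_t .. 0.5 + sigma_t."""
    return [p for p in probs if abs(p - 0.5) > sigma_t]


def select_relative(probs, fraction=0.3):
    """Keep the given fraction of tokens, most extreme probabilities first."""
    k = max(1, round(len(probs) * fraction))  # assumed rounding rule
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:k]


# Hypothetical spam probabilities of the distinct tokens in one email.
probs = [0.99, 0.93, 0.6, 0.5, 0.3, 0.02, 0.7, 0.45, 0.55, 0.2]
by_deviation = select_by_deviation(probs)  # size depends on the content
by_fraction = select_relative(probs)       # size depends on the email length
```

The first rule can return any number of tokens, including none, while the second always returns a count proportional to the size of the email.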
Based on these two tables, we build another table using the Bayesian formula of probability [4, 5]:

$$P(\text{bad} \mid \text{token}) = \frac{P(\text{token} \mid \text{bad})\, P(\text{bad})}{P(\text{token} \mid \text{bad})\, P(\text{bad}) + P(\text{token} \mid \text{good})\, P(\text{good})}$$

where
P(token | bad) = probability of the token being present, given that the email is spam.
P(token | good) = probability of the token being present, given that the email is good.
P(bad | token) = probability of the email being spam, given that the specific token is present.

This spam probability table contains all tokens that occur in all emails, along with the probability of an email being spam when that token is present. For every new email, a fixed number of effective tokens are collected to calculate the combined probability. This number can vary from 3 to 25, depending on the success of the filter for select messages.

$$\text{Score} = \frac{\prod_{t} P(t)}{\prod_{t} P(t) + \prod_{t} \bigl(1 - P(t)\bigr)}$$

If the score rises above a threshold value, the email is declared spam; otherwise it is declared a good email. Effective tokens are those whose probabilities differ the most, on either side, from the threshold value. These tokens are either significantly good tokens or significantly bad tokens, and they are responsible for deciding the overall status of the message.

The challenges present are speed, efficiency, database size, and the need for training data. The larger the set of tokens, the greater the size of the database and the longer the time for training. So, there is a need to consider only those tokens that make an impact in deciding the status of an email. Since training and email classification occur during the same period of time, special care must be taken to make both operations as independent as possible.

If a token is present only in the good table, its probability in the spam probability table is recorded as 0.1. If a token is present only in the bad table, its probability in the spam probability table is recorded as 0.9.

C. Cost Evaluation

1) Effect of False Positives

A false positive is mistakenly classifying a legitimate email as spam, and a false negative is mistakenly classifying a spam as a legitimate email. The cost of a false positive is much higher than that of a false negative.
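The spam probability table and combined score of Section III.B can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes that P(token | bad) and P(token | good) are estimated as occurrence counts divided by the number of messages in each corpus (capped at 1), and that the prior P(bad) is a parameter defaulting to 0.5. The function and parameter names are hypothetical.

```python
from math import prod


def token_spam_probability(token, good_table, bad_table,
                           n_good, n_bad, p_bad=0.5):
    """Estimate P(bad | token) with Bayes' rule from the two count tables."""
    g = good_table.get(token, 0)
    b = bad_table.get(token, 0)
    if g == 0 and b == 0:
        raise KeyError(f"token not in either table: {token!r}")
    if b == 0:
        return 0.1  # present only in the good table
    if g == 0:
        return 0.9  # present only in the bad table
    p_tok_bad = min(1.0, b / n_bad)    # assumed estimate of P(token | bad)
    p_tok_good = min(1.0, g / n_good)  # assumed estimate of P(token | good)
    numerator = p_tok_bad * p_bad
    return numerator / (numerator + p_tok_good * (1.0 - p_bad))


def combined_score(probs):
    """Score = prod(P(t)) / (prod(P(t)) + prod(1 - P(t)))."""
    spam_side = prod(probs)
    good_side = prod(1.0 - p for p in probs)
    return spam_side / (spam_side + good_side)


# Hypothetical counts: 10 good messages and 100 spam messages.
good_table = {"meeting": 4}
bad_table = {"free": 30, "meeting": 1}
p_free = token_spam_probability("free", good_table, bad_table, 10, 100)
p_meeting = token_spam_probability("meeting", good_table, bad_table, 10, 100)
score = combined_score([0.9, 0.9])
```

Because one-sided tokens are recorded as 0.1 or 0.9, the two products in `combined_score` cannot both be zero for such tokens, which avoids a division by zero in this sketch.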
The existence of false positives destroys the faith of users in their spam filter, because email users tend to delete spam from a bulk folder without reading it, and deleting legitimate messages (due to spam filters) is unacceptable. In that case, it is acceptable to allow some false negatives rather than having any false positives.

Let $L \to S$ be the false positive error type and $S \to L$ be the false negative error type. Assuming that $L \to S$ is λ times costlier than $S \to L$, we classify a message as spam if:

$$\frac{P(C = \text{spam} \mid X = x)}{P(C = \text{legitimate} \mid X = x)} > \lambda$$

In our case, where we adopt the Naïve Bayesian filter's independence assumption, the assumption is taken to hold. Since $P(C = \text{spam} \mid X = x) = 1 - P(C = \text{legitimate} \mid X = x)$, this leads to the following criterion:

$$P(C = \text{spam} \mid X = x) > t, \quad \text{where } t \text{ is the threshold value}$$

Thus $t = \lambda / (1 + \lambda)$, as $\lambda = t / (1 - t)$.

Depending on the action taken on a spam folder, the threshold value can be altered. If spam is deleted directly once it is classified, then t is held as high as 0.999 (λ = 999), i.e., blocking a legitimate message is as bad as letting 999 spam messages pass the filter. Lower values of λ are acceptable depending on the different configurations made available for the spam folder. If the configuration is set up to return the email to the sender, asking him to send it to a private unfiltered email address of the recipient, then λ = 9 (t = 0.9) seems to be reasonable. Even λ = 1 (t = 0.5) is acceptable if the recipient happens to go through every email in the bulk folder before manually deleting them.

Two factors can be used in this context to measure the performance of a filter, namely, spam precision and spam recall. Let $n_{L \to S}$ and $n_{S \to L}$ be the numbers of $L \to S$ and $S \to L$ errors, and let $n_{L \to L}$ and $n_{S \to S}$ count the correctly treated legitimate and spam messages, respectively. Spam recall (SR) and spam precision (SP) are defined as follows:

$$SR = \frac{n_{S \to S}}{n_{S \to S} + n_{S \to L}} \qquad SP = \frac{n_{S \to S}}{n_{S \to S} + n_{L \to S}}$$

2) Total Cost Ratio

The evaluation factors that are frequently used in classification are accuracy (Acc) and the error rate (Err = 1 - Acc). Accuracy is the number of correct classifications, i.e., spam correctly classified as spam and legitimate messages as legitimate, out of the total messages. The error rate is the ratio of the sum of false positives and false negatives to the total messages.

$$Acc = \frac{n_{L \to L} + n_{S \to S}}{N_L + N_S} \qquad Err = \frac{n_{L \to S} + n_{S \to L}}{N_L + N_S}$$

where $N_L$ and $N_S$ are the numbers of legitimate and spam messages, respectively.

In our cost-sensitive evaluation, we assume that the cost of a false positive is much higher than that of a false negative. However, the above formulae for accuracy and error rate do not consider the cost-sensitive factor. Let us assume that the error cost of a false positive is λ times greater than that of a false negative; the implication being that we treat a legitimate message as being worth λ messages. So, if a legitimate message is misclassified, it counts as λ errors, and if it is classified correctly, it counts as λ successes. This assumption can be formulated as a weighted accuracy (WAcc) and a weighted error rate (WErr = 1 - WAcc):

$$WAcc = \frac{\lambda\, n_{L \to L} + n_{S \to S}}{\lambda N_L + N_S} \qquad WErr = \frac{\lambda\, n_{L \to S} + n_{S \to L}}{\lambda N_L + N_S}$$

To get a better idea of the filter's performance in terms of accuracy and error rate, we must compare these factors with a baseline approach [7]. In the baseline approach, we assume that no filter is active, i.e., all spam passes the filter, and legitimate messages are never blocked.
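The measures of this subsection can be sketched in code under the usual cost-sensitive definitions: t = λ / (1 + λ), spam recall as correctly classified spam over all spam, spam precision as correctly classified spam over all messages classified as spam, and weighted accuracy in which each legitimate message counts λ times. The function names and the example counts are illustrative assumptions.

```python
def threshold_from_lambda(lam):
    """t = lam / (1 + lam); e.g. lam = 9 corresponds to t = 0.9."""
    return lam / (1 + lam)


def spam_recall(n_ss, n_sl):
    """Share of all spam that was classified as spam."""
    return n_ss / (n_ss + n_sl)


def spam_precision(n_ss, n_ls):
    """Share of messages classified as spam that really were spam."""
    return n_ss / (n_ss + n_ls)


def weighted_accuracy(lam, n_ll, n_ss, n_ls, n_sl):
    """Accuracy in which every legitimate message counts lam times."""
    n_legit = n_ll + n_ls  # N_L: all legitimate messages
    n_spam = n_ss + n_sl   # N_S: all spam messages
    return (lam * n_ll + n_ss) / (lam * n_legit + n_spam)


# Illustrative counts: 6 false positives, 0 false negatives,
# 19 legitimate and 25 spam messages classified correctly.
recall = spam_recall(n_ss=25, n_sl=0)
precision = spam_precision(n_ss=25, n_ls=6)
w_acc = weighted_accuracy(lam=1, n_ll=19, n_ss=25, n_ls=6, n_sl=0)
w_err = 1 - w_acc
```

With λ = 1 the weighted accuracy reduces to ordinary accuracy; larger values of λ penalize false positives more heavily.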
The weighted accuracy and error rate of the baseline are:

$$WAcc^{b} = \frac{\lambda N_L}{\lambda N_L + N_S} \qquad WErr^{b} = \frac{N_S}{\lambda N_L + N_S}$$

We calculate the Total Cost Ratio (TCR) to compare with the baseline approach [7]:

$$TCR = \frac{WErr^{b}}{WErr} = \frac{N_S}{\lambda\, n_{L \to S} + n_{S \to L}}$$

The higher the value of the Total Cost Ratio, the better the performance. With TCR < 1, the baseline approach is the better option, implying that the absence of a filter gives better results than the use of the filter. If cost is measured as wasted time, then TCR compares the time wasted manually deleting all spam messages with the sum of the time wasted manually deleting all spam messages misclassified as legitimate ($n_{S \to L}$) and the time wasted recovering all legitimate messages mistakenly classified as spam ($\lambda\, n_{L \to S}$).

IV. RESULTS

A. False Positives and False Negatives

Table 1 lists the false positives, false negatives, and correct classifications of all four techniques with different configurations for the threshold. Table 2 lists spam recall, spam precision, weighted accuracy, baseline weighted accuracy, and total cost ratio (TCR) for the same. TCR is calculated for all techniques for different threshold values. In a cost-sensitive evaluation, TCR can be used as a measure of better performance.

Table 1 indicates the fall in the number of false positives and the rise in the number of false negatives as the threshold is raised, for all four techniques. Table 2 indicates that the fixed token technique outperforms the others for every value of λ. Unlike the other techniques, the fixed token approach gives excellent results for λ = 999. The all token approach is the worst among them all. Our percentage approach performs better than the standard deviation approach for λ = 1 and λ = 999.

Table 1: The results (false positives, false negatives and correct classifications) of four filtering techniques for different values of λ.
Filter Technique   λ     False pos. (L→S)   False neg. (S→L)   Correct L (L→L)   Correct S (S→S)
All Token          1     11                 0                  14                25
Fixed Token        1     6                  0                  19                25
Std Deviation      1     9                  1                  16                24
Percentage         1     9                  0                  16                25
All Token          9     11                 0                  14                25
Fixed Token        9     5                  0                  20                25
Std Deviation      9     6                  1                  19                24
Percentage         9     9                  0                  16                25
All Token          999   9                  3                  16                22
Fixed Token        999   0                  5                  25                20
Std Deviation      999   4                  4                  21                21
Percentage         999   3                  6                  22                19

Based on both tables, we can say that by lowering the threshold value from 0.999 to 0.5, we have risked an increase in the number of false positives. But at the same time, the evaluation has shown an increase in TCR values, indicating that an increase in false positives does not prove costly. However, in practice, no user would use a threshold of 0.5, as it implies that the user has to go through every spam email before deleting it. The filter would just be helping the user in locating the spam. An ideal filter would be one where spam messages are deleted without the supervision of the user and no legitimate messages are deleted in the process.

Table 2: The results (TCR) of four filtering techniques for different values of λ.

Filter Technique   λ     Spam Recall   Spam Precision   Weighted Accuracy   Baseline Weighted Accuracy   TCR
All Token          1     100%          69.4%            78%                 50%                          2.3
Fixed Token        1     100%          80.7%            88%                 50%                          4.2
Std Deviation      1     96%           72.7%            80%                 50%                          2.5
Percentage         1     100%          73.5%            82%                 50%                          2.8
All Token          9     100%          69.4%            60.4%               90%                          0.25
Fixed Token        9     100%          83.3%            82%                 90%                          0.56
Std Deviation      9     96%           80%              78%                 90%                          0.45
Percentage         9     100%          73.5%            67.6%               90%                          0.31
All Token          999   88%           71%              64.0%               99.9%                        0.002
Fixed Token        999   80%           100%             100%                99.9%                        5
Std Deviation      999   84%           84%              84%                 99.9%                        0.006
Percentage         999   76%           86.4%            88.0%               99.9%                        0.008

One can observe that the value of $n_{S \to L}$ (false negatives) is much less than that of $n_{L \to S}$ (false positives). This is due to the fact that the number of spam messages used in the training phase is much greater than the number of legitimate messages. Our filter, being a self-learner, would improve its performance in the future and would keep the value of $n_{L \to S}$ as small as possible. We believe that, after a period of time, our filter would reach its peak performance and remain constant thereafter. The ideal filter should give a spam precision of 100 percent, a spam recall of 100 percent, and a positive value for TCR for all values of λ.

B. Total Cost Ratio

The values of TCR and weighted accuracy show the better performance of 5 tokens over the others for each value of λ. The performance degrades as we consider a higher number of tokens for the classification. However, the effective configuration still cannot be used as a stand-alone first-pass filter for λ = 999 and λ = 9. It needs the help of other techniques, such as blacklists and whitelists, for effective spam filtering. To find an optimal number of tokens, we evaluated further, covering the range of 3 to 12 tokens.
Tables 3 and 4 list their results. The results for 7 and 10 tokens were identical to each other and remained constant for different values of λ. However, the results for 3 fixed tokens were the worst, and the results for 12 fixed tokens were near those of 5, 7, and 10 fixed tokens. It can be said that, in the case of the fixed token approach, the filter reaches optimal performance in the range of 5 to 12 tokens and degrades thereafter. This observation is confirmed by the plotted graphs (Figures 1 and 2). They indicate the maximum peak (i.e., TCR value) in the range of 5 to 12.

Table 3: The results (false positives, false negatives and correct classifications) of four configurations (3, 7, 10 and 12) of the fixed token approach for different values of λ.

Filter Technique   λ     False pos. (L→S)   False neg. (S→L)   Correct L (L→L)   Correct S (S→S)
Fixed 3            1     10                 7                  5                 8
Fixed 7            1     1                  1                  14                14
Fixed 10           1     1                  1                  14                14
Fixed 12           1     2                  0                  13                15
Fixed 3            9     10                 7                  5                 8
Fixed 7            9     1                  1                  14                14
Fixed 10           9     1                  1                  14                14
Fixed 12           9     2                  0                  13                15
Fixed 3            999   7                  12                 8                 3
Fixed 7            999   1                  1                  14                14
Fixed 10           999   1                  1                  14                14
Fixed 12           999   1                  0                  14                15

Table 4: The results (TCR) of four configurations (3, 7, 10 and 12) of the fixed token approach for different values of λ.

Filter Technique   λ     Spam Recall   Spam Precision   Weighted Accuracy   Baseline Weighted Accuracy   TCR
Fixed 3            1     53.33%        44.44%           43.33%              50%                          0.882
Fixed 7            1     93.33%        93.33%           93.33%              50%                          7.45
Fixed 10           1     93.33%        93.33%           93.33%              50%                          7.45
Fixed 12           1     100%          88.24%           93.33%              50%                          7.45
Fixed 3            9     53.33%        44.44%           35.33%              90%                          0.155
Fixed 7            9     93.33%        93.33%           93.33%              90%                          1.5
Fixed 10           9     93.33%        93.33%           93.33%              90%                          1.5
Fixed 12           9     100%          88.24%           88%                 90%                          0.833
Fixed 3            999   20%           30%              53.3%               99.9%                        0.002
Fixed 7            999   93.33%        93.33%           93.33%              99.9%                        0.015
Fixed 10           999   93.33%        93.33%           93.33%              99.9%                        0.015
Fixed 12           999   100%          93.75%           93.34%              99.9%                        0.015

[Figure 1 (plot not reproduced): TCR on the vertical axis (0 to 8) against the number of fixed tokens on the horizontal axis (3, 5, 7, 10, 12, 15, 20, 25), one series per threshold.]

Figure 1: TCR for effective configurations of the fixed token approach at threshold values of t = 0.5 and t = 0.9 (λ = 1 and λ = 9).
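As a quick consistency check, the TCR column of Table 4 can be recomputed from the counts in Table 3, assuming the simplified form TCR = N_S / (λ · false positives + false negatives), where N_S is the number of spam messages in the batch (15 here, the sum of the false negative and correctly classified spam counts). The function name and the guard for an error-free filter are assumptions of this sketch.

```python
def total_cost_ratio(lam, n_spam, n_ls, n_sl):
    """TCR = N_S / (lam * n_ls + n_sl); above 1 the filter beats no filter."""
    weighted_errors = lam * n_ls + n_sl
    if weighted_errors == 0:
        return float("inf")  # no errors at all: TCR is unbounded
    return n_spam / weighted_errors


# Counts from Table 3 (15 spam messages in this batch).
print(round(total_cost_ratio(1, 15, 10, 7), 3))  # Fixed 3, lam = 1 -> 0.882
print(round(total_cost_ratio(9, 15, 1, 1), 3))   # Fixed 7, lam = 9 -> 1.5
print(round(total_cost_ratio(9, 15, 2, 0), 3))   # Fixed 12, lam = 9 -> 0.833
```

These three values agree with the corresponding TCR entries in Table 4.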

[Figure 2 (plot not reproduced): TCR on the vertical axis (0 to 0.025) against the number of fixed tokens on the horizontal axis (3, 5, 7, 10, 12, 15, 20, 25).]

Figure 2: TCR for effective configurations of the fixed token approach at a threshold value of t = 0.999 (λ = 999).

C. Fixed Token for Various λ

An evaluation of the fixed token approach was conducted with 15 tokens. To find an optimal anti-spam filter, we further evaluated the fixed token approach with different numbers of fixed tokens, i.e., 5, 15, 20, and 25. Tables 5 and 6 list the results.

Table 5: The results (false positives, false negatives and correct classifications) of four configurations (5, 15, 20 and 25) of the fixed token approach for different values of λ.

Fixed Number of Tokens   λ     False pos. (L→S)   False neg. (S→L)   Correct L (L→L)   Correct S (S→S)
5                        1     4                  1                  21                24
15                       1     7                  1                  18                24
20                       1     8                  1                  17                25
25                       1     10                 0                  15                25
5                        9     4                  2                  21                23
15                       9     6                  1                  19                24
20                       9     8                  0                  17                25
25                       9     10                 0                  15                25
5                        999   1                  3                  24                22
15                       999   4                  1                  21                24
20                       999   7                  0                  18                25
25                       999   9                  0                  16                25

Table 6: The results (TCR) of four configurations (5, 15, 20 and 25) of the fixed token approach for different values of λ.

Fixed Number of Tokens   λ     Spam Recall   Spam Precision   Weighted Accuracy   Baseline Weighted Accuracy   TCR
5                        1     96%           85.7%            90%                 50%                          5.0
15                       1     96%           77.4%            84%                 50%                          3.1
20                       1     100%          75.8%            84%                 50%                          3.1
25                       1     100%          71.4%            80%                 50%                          2.5
5                        9     92%           85.2%            84.8%               90%                          0.66
15                       9     96%           80.0%            78.0%               90%                          0.45
20                       9     100%          75.8%            71.2%               90%                          0.35
25                       9     100%          71.4%            64.0%               90%                          0.28
5                        999   88%           95.7%            96.0%               99.9%                        0.02
15                       999   96%           85.7%            84.0%               99.9%                        0.006
20                       999   100%          78.1%            72.0%               99.9%                        0.003
25                       999   100%          73.5%            64.0%               99.9%                        0.002

V. DISCUSSION

In each spam filter, keeping the cost of false positives to a minimum is the prime priority. Our experiments show that all techniques of the Bayesian approach allow some number of false positives; however, some techniques keep the figure to a minimum.

Many other techniques have been studied that can be used along with a simple content-based filter to improve its performance. A lemmatizer that converts each word to its base form can be included in our filter. In that way, modifications of the same word would not escape the attention of the anti-spam filter. For example, s*e*x would be treated similarly to sex [8]. A stoplist that removes the 100 most frequent words of the British National Corpus (BNC) from messages is also helpful in cases like that of the all tokens method, where each word is responsible for assigning a final score to a message. A user can also add words manually to their stoplist to tune their personal spam filter.

Our evaluation does not take into consideration non-textual factors like images and attachments. There is a high probability that an email with no textual content but only an image or an attachment is spam. Training the filter with a corpus containing non-textual content would improve its effectiveness during the classification phase. In the case of a hyperlink, we can have a web crawler that visits the mentioned site and applies the same Bayesian approach to rate that page. If the score of that page goes above the threshold value, there is a possibility that the message containing the hyperlink is spam. Thus, hyperlinks would be useful in assigning a final score to a message with the help of a web crawler.

There are some specific traits that help us detect spam. For example, the absence of a sender's address and the use of dark red colors are traits commonly found in spam. These traits are termed attributes. Attributes can be textual phrases like "only above 21 years". If the filter is trained for such attributes, there is a proven study of a positive change in results during the classification phase of the filter [1].

A Bayesian filter can also be used for classifying messages into different folders [2]. It can suggest to which specific folder a new message belongs. People create folders to organize their messages for archival purposes, for messages that need replies and, of course, a bulk folder to have a second look at messages before deleting them as spam.
But people with a large number of folders find it difficult to organize their messages. Given that a text classifier can choose the correct folder 85 percent of the time, chances are high that the appropriate folder would be among the first three guesses. Thus, the user is restricted to choosing a folder out of three guesses, as opposed to choosing a folder out of some 20 odd folders [9].
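The folder suggestion described above can be sketched as a ranking over per-folder scores. The example below is a minimal multinomial naive Bayes ranker with add-one smoothing, offered only as an illustration and not as the implementation used in [9]; the training messages, the folder names and the function names `train` and `top_folders` are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: (folder, message text) pairs.
TRAINING = [
    ("work", "meeting agenda project deadline"),
    ("work", "project report review meeting"),
    ("family", "dinner sunday grandma birthday"),
    ("travel", "flight booking hotel itinerary"),
    ("bulk", "free offer click here buy now"),
]


def train(messages):
    """Count tokens per folder and messages per folder."""
    token_counts = defaultdict(Counter)
    folder_counts = Counter()
    for folder, text in messages:
        folder_counts[folder] += 1
        token_counts[folder].update(text.lower().split())
    return token_counts, folder_counts


def top_folders(text, token_counts, folder_counts, k=3):
    """Return the k folders with the highest naive Bayes log score."""
    vocab = {tok for counts in token_counts.values() for tok in counts}
    total_msgs = sum(folder_counts.values())
    scores = {}
    for folder, n_msgs in folder_counts.items():
        counts = token_counts[folder]
        # Add-one (Laplace) smoothing so unseen tokens do not zero the score.
        denom = sum(counts.values()) + len(vocab)
        score = math.log(n_msgs / total_msgs)  # log prior of the folder
        for tok in text.lower().split():
            score += math.log((counts[tok] + 1) / denom)
        scores[folder] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    counts, folders = train(TRAINING)
    # The user picks from these three guesses instead of from every folder.
    print(top_folders("project meeting tomorrow", counts, folders))
```

Returning a short ranked list rather than a single guess is the design choice that matters here: the user confirms one of three suggestions, so an occasional wrong first guess costs little.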

Experiments have been conducted to study the workings of a Bayesian filter with dependent features. In our study, we assumed that features were independent of each other. However, the results for a 1-order dependent Bayesian filter are better than those for the naïve Bayesian filter. This implies that phrases like "Click here" and "Buy Free" are better indicators of spam than the independent words Click, here, Buy, and Free. If the order of dependency is increased above three, the results tend to decline. As the order increases, we have more conditional variables with the same amount of data. This complicates the probability estimates and, hence, leads to an overall decrease in predictive accuracy [11].

To get an efficiency above 99 percent with less than 1 percent false positives, a content-based filter would not be enough. Other approaches can be integrated with the Bayesian filter to get the desired result. For example, the rule-based and blacklist approaches are simple to integrate, and doing so has proved helpful in spam filtering. The signature-based and "filters fight back" approaches are some of the advanced techniques that have also proved their usefulness when paired with a Bayesian spam filter.

VI. CONCLUSION

Our cost-sensitive evaluation suggests that a content-based filter using a Bayesian approach alone is not sufficient to function as an anti-spam filter, due to the large number of false positives. However, the fixed token approach has been found to be the most effective among the four techniques evaluated in the report. The fixed token approach achieves its peak performance when the number of effective tokens selected to classify a message falls in the range of 5 to 12. This configuration performs satisfactorily for t = 0.9 (λ = 9) and t = 0.5 (λ = 1). No configuration performs well enough to be used for t = 0.999 (λ = 999). To obtain an optimal anti-spam filter, we suggest the use of a lemmatizer, a stoplist and integration with other techniques, such as the blacklist and rule-based methods.
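The threshold and cost figures quoted for each λ can be checked with a short sketch. It assumes the standard cost-sensitive definitions of weighted accuracy, baseline weighted accuracy and total cost ratio (TCR), together with t = λ / (1 + λ); these formulas are not restated in this section, but they reproduce the Table 6 values from the Table 5 counts. The `Counts` class and the function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Classification counts for one filter configuration (one row of Table 5)."""
    false_pos: int  # legitimate messages classified as spam
    false_neg: int  # spam messages classified as legitimate
    legit_ok: int   # legitimate messages classified correctly
    spam_ok: int    # spam messages classified correctly


def threshold(lam: float) -> float:
    """Spam-probability threshold t assumed to correspond to cost factor lambda."""
    return lam / (1.0 + lam)


def weighted_accuracy(c: Counts, lam: float) -> float:
    """Accuracy with each legitimate message counted lambda times."""
    n_legit = c.legit_ok + c.false_pos
    n_spam = c.spam_ok + c.false_neg
    return (lam * c.legit_ok + c.spam_ok) / (lam * n_legit + n_spam)


def baseline_weighted_accuracy(c: Counts, lam: float) -> float:
    """Weighted accuracy of using no filter: every message is let through."""
    n_legit = c.legit_ok + c.false_pos
    n_spam = c.spam_ok + c.false_neg
    return (lam * n_legit) / (lam * n_legit + n_spam)


def total_cost_ratio(c: Counts, lam: float) -> float:
    """Cost of using no filter divided by the cost of using the filter."""
    n_spam = c.spam_ok + c.false_neg
    cost = lam * c.false_pos + c.false_neg
    return float("inf") if cost == 0 else n_spam / cost


if __name__ == "__main__":
    row = Counts(false_pos=4, false_neg=1, legit_ok=21, spam_ok=24)  # 5 tokens
    for lam in (1, 9, 999):
        print(lam, threshold(lam), weighted_accuracy(row, lam),
              total_cost_ratio(row, lam))
```

For the 5-token, λ = 1 row of Table 5 (4, 1, 21, 24), the sketch gives a weighted accuracy of 0.90 and a TCR of 5.0, the values listed in Table 6. Because each false positive is multiplied by λ in the denominator of the TCR, a single blocked legitimate message at λ = 999 is enough to push the ratio far below 1.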
The results are based on 80 personalized messages, and we expect better results in the future as our filter follows a self-learning algorithm. Due to less variability in the content of spam messages, only some tokens were able to make an impact on the classification process. This made catching spam messages without blocking legitimate messages a bit difficult. As a first-pass filter, however, the few tokens technique must be analyzed to give maximum effectiveness. Thus, we believe that our fixed token technique with 5 to 12 token configurations would be able to make a positive contribution as a first-pass filter. However, this number might differ a bit from person to person depending on the type of spam message received.

REFERENCES

[1] Androutsopoulos, I., Koutsias, J., Chandrinos, K., and Spyropoulos, C., An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal email messages, Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000.
[2] Cohen, W., Learning rules that classify e-mail, In AAAI Spring Symposium on Machine Learning in Information Access, 1996.
[3] Domingos, P. and Pazzani, M., Beyond independence: Conditions for the optimality of the simple Bayesian classifier, Proceedings of the 13th Int. Conference on Machine Learning, 1996.
[4] Graham, P., A Plan for Spam. <http://www.paulgraham.com/spam.html> August 2002.
[5] Graham, P., Better Bayesian Filtering. <http://www.paulgraham.com/better.html> January 2003.
[6] Graham, P. <http://www.paulgraham.com/antispam.html> August 2002.
[7] Potamias, G., Moustakis, V., and Van Someren, M. (eds.), An evaluation of naïve Bayesian anti-spam filtering, Proceedings of the Workshop on Machine Learning in the New Information Age, 2000.
[8] Provost, J., Naive Bayes vs. Rule-Learning in Classification of Email, Artificial Intelligence Lab, University of Texas at Austin, Technical Report AI-TR-99-284.
[9] Rennie, J., ifile: An application of machine learning to e-mail filtering, KDD-2000 Text Mining Workshop.
[10] Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E., A Bayesian approach to filtering junk email, AAAI Workshop on Learning for Text Categorization, 1998, AAAI Technical Report WS-98-05.
[11] Sahami, M., Learning limited dependence Bayesian classifiers, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.
[12] Spertus, E., Smokey: Automatic recognition of hostile messages, Proceedings of the 14th National Conference on AI and the 9th Conference on Innovative Applications of AI, 1997.
[13] Langley, P., Wayne, I., and Thompson, K., An analysis of Bayesian classifiers, Proceedings of the 10th National Conference on AI, 1992.
[14] <http://www.spamlaws.com/> June 2004.
[15] Brad Stone, Spam Doubles, Finding New Ways to Deliver Itself, New York Times, Late Edition - Final, Section A, Page 1, Column 2, December 6, 2006. http://select.nytimes.com/gst/abstract.html?res=F10812FD3D550C758CDDAB0994DE404482
[16] http://www.spamfo.co.uk/component/option,com_content/task,view/id,291