Searching for Interacting Features for Spam Filtering




Searching for Interacting Features for Spam Filtering

Chuanliang Chen 1, Yun-Chao Gong 2, Rongfang Bie 1,*, and X. Z. Gao 3
1 Department of Computer Science, Beijing Normal University, Beijing 100875, China
2 Software Institute, Nanjing University, Nanjing, China
3 Department of Electrical Engineering, Helsinki University of Technology, Otakaari 5 A, 02150 Espoo, Finland
C.L.Chen86@gmail.com, rfbie@bnu.edu.cn, gao@cc.hut.fi

Abstract. In this paper, we propose a novel feature selection method, INTERACT, to select relevant words of emails for spam email filtering, i.e. classifying an email as spam or legitimate. Four traditional feature selection methods from the text categorization domain, Information Gain, Gain Ratio, Chi Squared, and ReliefF, are also used for performance comparison. Three classifiers, Support Vector Machine (SVM), Naïve Bayes, and a novel classifier, Locally Weighted learning with Naïve Bayes (LWNB), are discussed in this paper. Four popular datasets are employed as the benchmark corpora in our experiments to examine the capabilities of these five feature selection methods and the three classifiers. In our simulations, we discover that the LWNB improves on Naïve Bayes and gains higher prediction results by learning local models, and its performance is sometimes better than that of the SVM. Our study also shows that INTERACT can yield better classifier performance than the other four traditional methods for spam email filtering.

Key words: Interacting Features, Feature Selection, Naïve Bayes, Spam Filtering.

1 Introduction

The increasing popularity of electronic mail has intrigued direct marketers to flood the mailboxes of millions of users with unsolicited messages. These messages are usually referred to as spam or, more formally, Unsolicited Bulk E-mail (UBE), and may advertise anything, from vacations to get-rich schemes [1].
The negative effects of spam influence people's daily lives: filling mailboxes, engulfing important personal mail, wasting network bandwidth, and consuming users' time and energy to deal with it, not to mention all the other problems associated with it (crashed mail servers, pornography advertisements sent to children, etc.). A study in 1997 indicated that spam messages constituted approximately 10% of the incoming messages to a corporate network [4]. CAUBE.AU reports that its statistics show the volume of spam is increasing at an alarming rate, and some people claim they are even abandoning their email accounts because of spam [3].

* Corresponding author.

This situation seems to be worsening with time, and without appropriate counter-measures, spam messages could eventually undermine the usability of e-mail. These serious threats make spam filtering, whose task is to rule out unsolicited emails automatically from the email stream, ever more important and in need of solving. In recent years, many studies have addressed spam filtering based on machine learning, because attempts to introduce legal measures against spam mailing have had limited effect. Several supervised learning algorithms have been successfully applied to spam filtering: Naïve Bayes [5,6,7,8], Support Vector Machine [9,10], Memory Based Learning methods [11,12], and Decision Tree [13]. Among these classification methods, Naïve Bayes is particularly attractive for spam filtering, as its performance is surprisingly good [12]. The Naïve Bayes classifier has been the filtering engine of many commercial anti-spam software packages. Therefore, in this paper, we aim at improving the prediction ability of Naïve Bayes by introducing locally learned models. In order to train or test classifiers, it is necessary to go through a large corpus of spam and legitimate emails. The e-mails of a corpus have to be preprocessed to extract the words (features) belonging to the message subjects, the bodies, and/or the attachments. As the number of features in a corpus can end up being very high, it is usual to choose those features that best represent each message before carrying out filter training, to prevent the classifiers from over-fitting [14]. The effectiveness of the classifiers relies on the appropriate choice of these features; the preprocessing steps of e-mail feature extraction and the selection of the most representative features are crucial for the performance of the filters [15]. In this paper, a novel feature selection method, INTERACT, and a novel classifier, LWNB, are introduced to deal with spam filtering. The remainder of this paper is organized as follows.
Section 2 demonstrates the INTERACT algorithm for spam filtering. We explain the principles of e-mail representation and preprocessing in Section 3. The classifiers used in this paper are presented in Section 4. We report the performances of the five feature selection methods and three classifiers using the F measure and accuracy in Section 5. Section 6 concludes our study with a few remarks.

2 INTERACT Algorithm

Interacting features challenge current feature selection methods for classification. A feature by itself may have little correlation with the target concept; however, when it is combined with some other features, it can be strongly correlated with the target concept [2]. Many traditional feature selection methods unintentionally remove these features and thus result in poor classification performance. The INTERACT algorithm can efficiently handle feature interaction at much lower time cost than traditional methods. A brief description of the INTERACT algorithm is presented below; more details can be found in [2]. The INTERACT algorithm searches for interacting features by solving two key problems: how to update c-contribution effectively, and how to deal with the feature order problem. The c-contribution of a feature is an indicator of how significantly the elimination of that feature will affect consistency. In particular, the c-contribution of an irrelevant feature is zero. To solve the first problem, the INTERACT algorithm calculates c-contribution efficiently with a hashing mechanism [2]: each instance is inserted into a hash table, with its values on the features in S_list used as the hash key, where S_list is the set of ranked features not yet eliminated (S_list is initialized with the full set of features). Instances with the same hash key are inserted into the same entry of the hash table and update the stored label information. For the second problem, we assume that the set of features can be divided into a subset S1 containing relevant features and a subset S2 containing irrelevant ones. The INTERACT algorithm intends to remove the features in S2 first and to preserve the features in S1, which are more likely to remain in the final set of selected features. The INTERACT algorithm achieves this by applying a heuristic that ranks the individual features by symmetrical uncertainty (SU) in descending order, so that the (heuristically) most relevant feature is positioned at the beginning of the list. SU is described in information theory books and numerical recipes, and is often used as a fast correlation measure to evaluate the relevance of individual features [12,17]. INTERACT is a filtering algorithm that employs backward elimination to remove features with no or low c-contribution. Given a full set of N features and a class attribute C, INTERACT finds a feature subset S_best for the class concept [2]. The algorithm consists of two major parts: first, the features are ranked in descending order of their symmetrical uncertainty values; second, the features are evaluated one by one, starting from the end of the ranked feature list. The process is shown as follows.
Algorithm 1. INTERACT Algorithm.

Input:   F, the full feature set with N features {F_1, F_2, ..., F_N};
         C, the class label;
         δ, a predefined threshold.
Output:  S_best, the subset of selected features.

Process:
  S_best = {}
  for i = 1 to N do
      calculate SU_{F_i,C} for F_i
      append F_i to S_best
  end
  sort S_best in descending order according to SU_{F_i,C}
  F <- last element of S_best
  repeat
      if F != NULL then
          p <- c-contribution of F
          if p <= δ then
              remove F from S_best
          end
      end
      F <- preceding element of S_best
  until F = NULL
  return S_best
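Algorithm 1 can be sketched in a few lines of Python. The following is a rough illustration only: the inconsistency-count form of c-contribution, the hashing via a dictionary keyed by feature-value tuples, and all function names are our own reconstruction from the description above, not the authors' code.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(xs, ys):
    # SU(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y))
    hx, hy = entropy(xs), entropy(ys)
    hxy = entropy(list(zip(xs, ys)))
    return 2 * (hx + hy - hxy) / (hx + hy) if hx + hy > 0 else 0.0

def inconsistency_rate(rows, labels, feats):
    # Hash instances by their values on `feats`; within one bucket, the
    # inconsistent part is everything beyond the majority class.
    buckets = defaultdict(Counter)
    for row, y in zip(rows, labels):
        buckets[tuple(row[f] for f in feats)][y] += 1
    return sum(sum(c.values()) - max(c.values()) for c in buckets.values()) / len(rows)

def interact(rows, labels, delta=0.0):
    n_feats = len(rows[0])
    # Rank features by SU, most relevant first.
    ranked = sorted(range(n_feats),
                    key=lambda f: symmetrical_uncertainty([r[f] for r in rows], labels),
                    reverse=True)
    s_best = list(ranked)
    # Backward elimination, starting from the least relevant end of the ranking.
    for f in reversed(ranked):
        remaining = [g for g in s_best if g != f]
        if remaining:
            c_contribution = (inconsistency_rate(rows, labels, remaining)
                              - inconsistency_rate(rows, labels, s_best))
            if c_contribution <= delta:
                s_best = remaining
    return s_best
```

On an XOR-like dataset, the two interacting features each have zero SU individually but jointly determine the class, so this sketch keeps both while dropping a constant (irrelevant) third feature.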

3 Preprocessing of Corpus and Message Representation

3.1 Feature Selection Methods for Comparison

Four other feature selection methods are used in this paper to test the capability of the INTERACT algorithm: the Chi Squared (χ²) statistic, Information Gain, Gain Ratio, and ReliefF. Their definitions are given as follows. In the formulas below, m is the number of classes (in the spam filtering domain, m is 2), and C_i denotes the i-th class. V is the number of partitions a feature can split the training set into. Let N be the total number of samples, N_{C_i} the number of samples of class C_i, N^{(v)} the number of samples in the v-th partition, and N_{C_i}^{(v)} the number of samples in the v-th partition belonging to class C_i.

Chi Squared: the Chi Squared statistic compares the observed frequency with the prior (expected) frequency of the same class:

  χ² = Σ_{i=1}^{m} Σ_{v=1}^{V} (N_{C_i}^{(v)} − N̄_{C_i}^{(v)})² / N̄_{C_i}^{(v)},   (1)

where N̄_{C_i}^{(v)} = (N^{(v)} / N) · N_{C_i} denotes the prior frequency.

Information Gain: Information Gain measures the feature's impact on decreasing entropy, and is defined as:

  InfoGain = −Σ_{i=1}^{m} (N_{C_i}/N) log(N_{C_i}/N) + Σ_{v=1}^{V} (N^{(v)}/N) Σ_{i=1}^{m} (N_{C_i}^{(v)}/N^{(v)}) log(N_{C_i}^{(v)}/N^{(v)}).   (2)

Gain Ratio: Gain Ratio was first used in C4.5 and is defined as:

  GainRatio = InfoGain / [−Σ_{i=1}^{m} (N_{C_i}/N) log(N_{C_i}/N)].   (3)

ReliefF: the key idea of Relief is to estimate features according to how well their values distinguish among instances that are near to each other. ReliefF is an extension of Relief that estimates the probabilities more reliably and handles incomplete and multiclass data sets. More details can be found in [17].

3.2 Corpus Preprocessing and Message Representation

Each e-mail in the corpora is represented as a set of words. After analyzing all the e-mails of a corpus, a dictionary with N words/features is formed. Every e-mail is then represented as a feature vector of N elements, where the i-th element is a binary variable indicating whether the i-th word occurs in that e-mail.
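As an illustration, the Chi Squared score (1) and Information Gain (2) of a single binary word feature can be computed as follows. This is a sketch with our own helper names, assuming the feature and labels are given as plain Python lists.

```python
import math

def chi_squared(feature, labels):
    # feature: binary word-presence values; labels: class label per message.
    classes, partitions = sorted(set(labels)), sorted(set(feature))
    n = len(labels)
    chi2 = 0.0
    for v in partitions:
        n_v = feature.count(v)
        for c in classes:
            observed = sum(1 for f, y in zip(feature, labels) if f == v and y == c)
            prior = (n_v / n) * labels.count(c)   # expected frequency, Eq. (1)
            chi2 += (observed - prior) ** 2 / prior
    return chi2

def info_gain(feature, labels):
    # Eq. (2): class entropy minus entropy conditioned on the feature value.
    def h(ys):
        n = len(ys)
        return -sum((ys.count(c) / n) * math.log2(ys.count(c) / n)
                    for c in set(ys))
    n = len(labels)
    conditional = sum((feature.count(v) / n)
                      * h([y for f, y in zip(feature, labels) if f == v])
                      for v in set(feature))
    return h(labels) - conditional
```

A word that perfectly separates the two classes reaches an Information Gain of 1 bit, and its χ² score equals the sample size.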
During preprocessing, we perform word stemming, stop-word removal, and Document Frequency Thresholding (DFT) in order to reduce the dimension of the feature space. The HTML tags of the e-mails are also removed during preprocessing. Finally, we extract the first 5,000 tokens of the dictionary, ranked by mutual information, to form the corpora used in this paper.

4 Classifiers for Spam Filtering

In this paper, we use three classifiers to test the capabilities of the aforementioned feature selection methods: Support Vector Machine (SVM), Naïve Bayes, and Locally Weighted learning with Naïve Bayes (LWNB), an improvement of Naïve Bayes that we are the first to introduce into the spam filtering domain. We only briefly introduce the LWNB here; more details can be found in [1]. In the LWNB, Naïve Bayes is learned locally, in the same way that linear regression is used in locally weighted linear regression. A local Naïve Bayes model is fit to a subset of the data in the neighborhood of the instance whose class value is to be predicted [1]. The training samples in this neighborhood are weighted, with farther ones assigned less weight, and the classification is then obtained from these local Naïve Bayes models. The subset of the data used to train each locally weighted Naïve Bayes model is determined by a nearest-neighbors algorithm: the first k nearest neighbors are selected to form this subset, where k is a user-specified parameter. How is the weight of each instance of the subset determined? As in [1], we use a linear weighting function in our experiments, defined as:

  f_linear(i) = 1 − d_i / d_k,   (4)

where d_i is the Euclidean distance to the i-th nearest neighbor x_i. Obviously, with f_linear the weight decreases linearly with the distance. Empirical study shows that the LWNB is not particularly sensitive to the choice of k as long as k is not too small [1]; too small a k may cause the local Naïve Bayes model to fit the noise in the data.
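The neighborhood selection and the linear weighting of Eq. (4) can be sketched as follows (helper names are ours; under this scheme the k-th neighbor always receives weight 0):

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def weighted_neighborhood(train_x, query, k):
    # Pick the k nearest training instances and weight each one by
    # f_linear = 1 - d_i / d_k  (Eq. (4)), where d_k is the k-th distance.
    ranked = sorted(range(len(train_x)),
                    key=lambda i: euclidean(train_x[i], query))[:k]
    d_k = euclidean(train_x[ranked[-1]], query)
    return [(i, 1.0 if d_k == 0 else 1.0 - euclidean(train_x[i], query) / d_k)
            for i in ranked]
```

A local Naïve Bayes model would then be trained only on the returned (index, weight) pairs, one model per query instance.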
Naïve Bayes calculates the posterior probability of class c_l for a test instance with m attribute values a_1, a_2, ..., a_m as follows:

  p(c_l | a_1, a_2, ..., a_m) = p(c_l) Π_{j=1}^{m} p(a_j | c_l) / Σ_{l'=1}^{|C|} [p(c_{l'}) Π_{j=1}^{m} p(a_j | c_{l'})],   (5)

where |C| is the total number of classes. In the LWNB, the individual probabilities on the right-hand side of (5) are estimated from the weighted data. The prior probability for class c_l becomes:

  p(c_l) = (1 + Σ_{i=1}^{n} I(c_i = c_l) w_i) / (|C| + Σ_{i=1}^{n} w_i),   (6)

where c_i is the class value of the i-th training instance, w_i is its weight, and the indicator function I(x = y) is 1 iff x = y.
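The smoothed, weight-based prior of Eq. (6) can be sketched as below (the helper name is ours); the conditional probabilities are estimated from the same weights in an analogous way.

```python
def weighted_prior(labels, weights, classes):
    # Eq. (6): Laplace-style smoothed class prior computed from the
    # per-instance neighborhood weights rather than raw counts.
    total = sum(weights)
    return {c: (1 + sum(w for y, w in zip(labels, weights) if y == c))
               / (len(classes) + total)
            for c in classes}
```

With equal weights this reduces to the usual smoothed frequency estimate; heavier (nearer) neighbors pull the prior toward their class.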

The attributes of the data are assumed nominal; numeric attributes are discretized. The conditional probability of attribute value a_j given class c_l is:

  p(a_j | c_l) = (1 + Σ_{i=1}^{n} I(a_{ij} = a_j) I(c_i = c_l) w_i) / (n_j + Σ_{i=1}^{n} I(c_i = c_l) w_i),   (7)

where n_j is the number of distinct values of attribute j, and a_{ij} is the value of attribute j in the i-th instance.

5 Experiments and Analysis

5.1 Corpora in the Simulations

The experiments are based on four popular benchmark corpora, PU1, PU2, PUA, and Ling Spam, all available at [16]. In all PU corpora and the Ling Spam corpus, attachments, HTML tags, and header fields other than the subject are removed, leaving only subject lines and mail body texts. In order to address privacy, each token of a corpus is encoded as a unique integer. The details of each corpus are given below.

PU1 Corpus: the PU1 corpus consists of 1,099 messages, of which 481 are spam and 618 legitimate. The spam rate is 43.77%.

PU2 Corpus: the PU2 corpus contains fewer messages than PU1: 721 in total, of which 579 are labeled legitimate and 142 spam.

PUA Corpus: the PUA corpus has 1,142 messages, half of which, i.e., 571 messages, are marked as spam and the other half legitimate.

Ling Spam Corpus: the Ling Spam corpus includes 2,412 legitimate messages from a linguistics mailing list and 481 spam messages collected by the author. The spam rate is 16.63%. Unlike the PU corpora, the messages of the Ling Spam corpus come from different sources: the legitimate messages are collected from a spam-free, topic-specific mailing list and the spam ones from a personal mailbox. The distribution of mails is therefore less similar to a normal user's mail stream, which makes the messages of the Ling Spam corpus easier to separate.

5.2 Performance Measures

We use two popular evaluation metrics from the text categorization domain to measure the performance of the classifiers: accuracy and the F measure.

Accuracy: accuracy is the percentage of correct predictions among the total predictions. It is defined as:

  Accuracy = (P_c / P_t) × 100%,   (8)

where P_c is the number of correct predictions and P_t is the number of total predictions. The higher the accuracy, the better.

F measure: the F measure is defined as:

  F = 2RP / (R + P),   (9)

where R is Recall, the percentage of the messages of a given category that are classified correctly, and P is Precision, the percentage of the messages predicted for a given class that are classified correctly. The F measure ranges from 0 to 1, and the higher, the better.

5.3 Results and Analysis

The following classification performance is measured through 10-fold cross-validation. We select all of the interacting features, i.e., features with non-negative c-contribution. Table 1 summarizes the dimension reduction achieved by INTERACT feature selection.

Table 1. Summary of results of INTERACT-selected features on the four benchmark corpora.

  Corpus                                          PU1   PU2   PUA   Ling Spam
  Num. of features with non-negative c-contribution   43    43    42    64

From Table 1, we find that the dimensionality of the data is reduced sharply after removing irrelevant features with INTERACT. Therefore, we simply run the classifiers on these data rather than reducing them further by adjusting the parameter δ. From Table 1 we can also conclude that many irrelevant words/features exist in the corpora for spam filtering: more than 99% of the features are removed by INTERACT. The following histograms show the performances of the three classifiers, SVM (using a linear kernel), Naïve Bayes, and LWNB, on the four corpora. For the other four feature selection methods used for comparison, we select the first M features according to the features' scores, where M is the number of interacting features found by the INTERACT algorithm. From Fig. 1 and Fig. 2, we discover that the INTERACT algorithm can improve the performances of all three classifiers: their performances on the reduced corpora are equal to or better than those on the full corpora, evaluated by both accuracy and the F measure.
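The two metrics of Eqs. (8) and (9) can be sketched as follows (our own helper names; the F measure is computed here for the spam class as the positive class):

```python
def accuracy(y_true, y_pred):
    # Eq. (8): correct predictions over total predictions.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f_measure(y_true, y_pred, positive="spam"):
    # Eq. (9): harmonic mean of precision and recall for the positive class.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    predicted = sum(p == positive for p in y_pred)
    actual = sum(t == positive for t in y_true)
    if tp == 0:
        return 0.0
    precision, recall = tp / predicted, tp / actual
    return 2 * recall * precision / (recall + precision)
```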
For example, the performances of the SVM on the PU1 and PU2 corpora reduced by INTERACT are equal to those on the full corpora, and its performance on the PUA corpus reduced by INTERACT is better than that on the full corpus. However, the performance of the SVM on the Ling Spam corpus reduced by INTERACT is slightly worse than that on the full corpus. The feature selection capability of INTERACT is clearly better than that of the other popular feature selection methods. The competitive performances of the classifiers on the data selected by INTERACT show that only a few relevant words can still distinguish between spam and legitimate emails. This is true in practice: for example, it is

well known that words such as "buy", "purchase", and "jobs" usually appear in spam e-mails, and they are thus useful email category distinguishers.

Fig. 1. Performances of the aforementioned three classifiers and four feature selection methods on the PU1, PU2, PUA, and Ling Spam benchmark corpora, with the accuracy evaluation measure. [Histograms omitted; each panel, (a) PU1, (b) PU2, (c) PUA, (d) Ling Spam, compares INTERACT, Chi Squared, InfoGain, GainRatio, ReliefF, and the full feature set for SVM, LWNB, and Naïve Bayes.]

The performance of the LWNB is also promising. On the Ling Spam corpus, its performance is even better than that of the SVM, a well-known powerful classifier. On the PU1 and Ling Spam corpora, the LWNB successfully improves the performance of Naïve Bayes by using locally weighted models. However, its performance is worse than that of Naïve Bayes on the PU2 and PUA corpora. The reason may be that the task of spam filtering suits the class conditional independence hypothesis of Naïve Bayes; that is, given the class label, the frequencies of the words in an email are conditionally independent of one another. Based on careful observation, another question arises: why does the LWNB perform poorly on the full corpora? The reason is that many irrelevant features exist in the full corpora, which can also be concluded from the feature selection results of INTERACT. When determining the neighbors, all the features take part in calculating the distance, and too many irrelevant features conceal the truly useful effects of the relevant features, with the result that the LWNB finds wrong or irrelevant neighbors for generating the locally weighted Naïve Bayes models.
However, the LWNB is still a promising classifier for spam filtering when combined with an excellent feature selection method such as INTERACT.

Fig. 2. Performances of the aforementioned classifiers and four feature selection methods on the PU1, PU2, PUA, and Ling Spam benchmark corpora, with the F measure evaluation measure. [Histograms omitted; each panel, (a) PU1, (b) PU2, (c) PUA, (d) Ling Spam, compares INTERACT, Chi Squared, InfoGain, GainRatio, ReliefF, and the full feature set for SVM, LWNB, and Naïve Bayes.]

6 Conclusions

In this paper, we present our work on spam filtering. First, we introduce the INTERACT algorithm to select interacting words/features for spam filtering; four other traditional feature selection methods are also run in the experiments for performance comparison. Second, we propose a novel classifier, LWNB, to improve the performance of Naïve Bayes, one of the most popular classifiers in the spam filtering area. In total, three classifiers, SVM, Naïve Bayes, and LWNB, are run on four corpora preprocessed by the five feature selection methods and on the corresponding full corpora in our simulations. Two popular evaluation metrics, accuracy and the F measure, are used to measure the performances of these three classifiers. Our empirical study shows that INTERACT feature selection can improve the performance of all three classifiers, and that its feature selection ability is better than that of the four traditional methods. We briefly analyze the reasons for the performance of INTERACT and the other four methods. We also find that the LWNB can improve the performance of Naïve Bayes and is sometimes superior to the SVM.

Acknowledgements.
The research work presented in this paper was supported by grants from the National Natural Science Foundation of China (Project No. 10601064). X. Z. Gao's research work was funded by the Academy of Finland under Grant 214144.

References

1. Frank, E., Hall, M., Pfahringer, B.: Locally Weighted Naive Bayes. In: Proc. of the Conference on Uncertainty in Artificial Intelligence, pp. 249-256. Morgan Kaufmann, 2003.
2. Zhao, Z., Liu, H.: Searching for Interacting Features. In: Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, 2007.
3. CAUBE.AU: http://www.caube.org.au/spamstats.html, 2006.
4. Cranor, L.F., LaMacchia, B.A.: Spam! Communications of the ACM, 41(8):74-83, 1998.
5. Sahami, M., Dumais, S., Heckerman, D., et al.: A Bayesian approach to filtering junk e-mail. In: Learning for Text Categorization, Madison, Wisconsin, 1998.
6. Schneider, K.M.: A Comparison of Event Models for Naïve Bayes Anti-Spam E-Mail Filtering. In: Proc. of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, pp. 307-314, 2003.
7. Androutsopoulos, I., Paliouras, G., Karkaletsis, V., et al.: Learning to Filter Spam E-mail: A Comparison of a Naïve Bayesian and a Memory-based Approach. In: Proc. of the Workshop on Machine Learning and Textual Information Access, pp. 1-13, 2000.
8. Zhang, L., Zhu, J., Yao, T.: An Evaluation of Statistical Spam Filtering Techniques. ACM Trans. Asian Lang. Inf. Process., 3(4):243-269, 2004.
9. Drucker, H., Wu, D., Vapnik, V.N.: Support vector machines for spam categorization. IEEE Trans. on Neural Networks, 10(5):1048-1054, 1999.
10. Kolcz, A., Alspector, J.: SVM-based filtering of e-mail spam with content-specific misclassification costs. In: Proc. of the TextDM'01 Workshop on Text Mining, held at the 2001 IEEE International Conference on Data Mining, 2001.
11. Sakkis, G., Androutsopoulos, I., Paliouras, G., et al.: A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists. Information Retrieval, 6(1):49-73, 2003.
12. Yu, L., Liu, H.: Feature selection for high-dimensional data: a fast correlation-based filter solution. In: Proc. of ICML 2003, 2003.
13. Carreras, X., Marquez, L.: Boosting trees for anti-spam email filtering. In: Proc. of RANLP-01, 4th Int. Conference on Recent Advances in Natural Language Processing, 2001.
14. Méndez, J.R., Iglesias, E.L., Fdez-Riverola, F., et al.: Analyzing the Impact of Corpus Preprocessing on Anti-Spam Filtering Software. Research on Computing Science, 17:129-138, 2005.
15. Méndez, J.R., Fdez-Riverola, F., Díaz, F., et al.: A Comparative Performance Study of Feature Selection Methods for the Anti-spam Filtering Domain. In: Proc. of Advances in Data Mining, Applications in Medicine, Web Mining, Marketing, Image and Signal Mining (ICDM 2006), pp. 106-120, 2006.
16. Email Benchmark Corpus, http://www.aueb.gr/users/ion/publications.html, 2006.
17. Kononenko, I.: Estimating Attributes: Analysis and Extensions of RELIEF. In: Proc. of the European Conference on Machine Learning, pp. 171-182, 1994.