International Journal on Artificial Intelligence Tools
Vol. XX, No. X (2006) 1-20
World Scientific Publishing Company

WORDS VS. CHARACTER N-GRAMS FOR ANTI-SPAM FILTERING

IOANNIS KANARIS, KONSTANTINOS KANARIS, IOANNIS HOUVARDAS, and EFSTATHIOS STAMATATOS
Dept. of Information and Communication Systems Eng., University of the Aegean, Karlovassi, Samos - 83200, Greece
stamatatos@aegean.gr

Received (Day Month Year)
Revised (Day Month Year)
Accepted (Day Month Year)

The increasing number of unsolicited e-mail messages (spam) reveals the need for the development of reliable anti-spam filters. The vast majority of content-based techniques rely on word-based representation of messages. Such approaches require reliable tokenizers for detecting the token boundaries. As a consequence, a common practice of spammers is to attempt to confuse tokenizers using unexpected punctuation marks or special characters within the message. In this paper we explore an alternative low-level representation based on character n-grams which avoids the use of tokenizers and other language-dependent tools. Based on experiments on two well-known benchmark corpora and a variety of evaluation measures, we show that character n-grams are more reliable features than word-tokens despite the fact that they increase the dimensionality of the problem. Moreover, we propose a method for extracting variable-length n-grams which produces optimal classifiers among the examined models under cost-sensitive evaluation.

Keywords: anti-spam filtering, machine learning, n-grams.

1. Introduction

Nowadays, e-mail is one of the cheapest and fastest available means of communication. However, a major problem of any internet user is the increasing number of unsolicited commercial e-mail, or spam. Spam messages waste both valuable time of the users and important bandwidth of internet connections. Moreover, they are usually associated with annoying material (e.g., pornographic site advertisements) or the distribution of computer viruses.
Hence, there is an increasing need for effective anti-spam filters that either automate the detection and removal of spam messages or inform the user of potential spam messages. Early spam filters were based on blacklists of known spammers and handcrafted rules for detecting typical spam phrases (e.g., “free pics”). The development of such filters is a time-consuming procedure. Moreover, they can easily be fooled by using forged e-mail addresses or variations of known phrases that are still readable by a human (e.g., “f*r*e*e”). Hence, new rules have to be incorporated continuously to maintain the effectiveness of the filter.

Recent advances in applying machine learning techniques to text categorization inspired researchers to develop content-based spam filters. In more detail, a collection of both known spam and legitimate (non-spam or ham) messages is used by a supervised learning algorithm (e.g., decision trees, support vector machines, etc.) to develop a model for automatically classifying new, unseen messages to one of these two categories. That way, it is easy to develop personalized filters suitable for either a specific user or a mailing list moderator.

Spam detection is not a typical text categorization task since it has some intriguing characteristics. In particular, both spam and legitimate messages can cover a variety of topics and genres. In other words, both classes are not homogeneous. Moreover, the length of e-mail messages varies from a couple of text lines to dozens of text lines. In addition, the message may contain grammatical errors and strange abbreviations (sometimes intentionally used by spammers in order to fool anti-spam filters). Therefore, the learning model should be robust in such conditions. Furthermore, besides the content of the body of the e-mail messages, useful information can be found in the e-mail address of the sender, attachments, etc. Such additional non-textual information can considerably assist the effectiveness of spam filters. [2] Last, but not least, spam detection is a cost-sensitive procedure. In the case of a fully-automated anti-spam filter, the cost of characterizing a legitimate message as spam is much higher than letting a few spam messages pass. This fact of crucial importance should be considered in evaluating spam detection approaches.

All supervised learning algorithms require a suitable representation of the messages, usually in the form of an attribute vector. So far, the vast majority of machine learning approaches to spam detection use the bag of words representation, that is, each message is considered as a set of words that occur a certain number of times.
[2, 3, 4, 5] Putting it another way, the context information for a word is not taken into account. The word-based text representations require language-dependent tools, such as a tokenizer (to split the message into tokens) and usually a lemmatizer plus a list of stop words (to reduce the dimensionality of the problem). A common practice of spammers is to attempt to confuse tokenizers, using structures such as “f.r.e.e.”, “f-r-e-e”, “f r e e”, etc. Moreover, there are no effective lemmatizers available for every natural language, especially for morphologically rich languages. On the other hand, word n-grams, i.e., contiguous sequences of n words, have also been examined. [6] Such approaches attempt to take advantage of contextual phrasal information (e.g., “buy now”) that distinguishes spam from legitimate messages. However, word n-grams considerably increase the dimensionality of the problem and the results so far are not encouraging.

In this paper, we focus on a different but simple text representation. In particular, each message is considered as a bag of character n-grams, that is, strings of length n. For example, the character 4-grams of the beginning of this paragraph would be: |In_t|, |n_th|, |_thi|, |this|, |his_|, |is_p|, |s_pa|, |_pap|, |pape|, |aper|, etc. (We use | and _ to denote n-gram boundaries and a single space character, respectively.) Character n-grams are
able to capture information on various levels: lexical (|the_|, |free|), word-class (|ed_|, |ing_|), punctuation mark usage (|!!!|, |f.r.|), etc. In addition, they are robust to grammatical errors (e.g., the word-tokens “assignment” and “asignment” share the majority of character n-grams) and strange usage of abbreviations, punctuation marks, etc. The bag of character n-grams representation is language-independent and does not require any text preprocessing (tokenizer, lemmatizer, or other deep NLP tools). It has already been used in several tasks including language identification, [7] authorship attribution, [8] and topic-based text categorization [9] with remarkable results in comparison to word-based representations.

An important characteristic of the character-level n-grams is that they avoid (at least to a great extent) the problem of sparse data that arises when using word-level n-grams. That is, there are far fewer character combinations than word combinations, and therefore fewer n-grams will have zero frequency. On the other hand, the proposed representation still produces a considerably larger feature set in comparison with traditional bag of words representations. Therefore, learning algorithms able to deal with high dimensional spaces should be used. Support Vector Machines (SVM) is a supervised learning algorithm based on the structural risk minimization principle. [10] One of the most remarkable properties of SVMs is that their learning ability is independent of the feature space dimensionality, because they measure the complexity of the hypotheses based on the margin with which they separate the data, instead of the number of features. The application of SVMs to text categorization tasks has shown the effectiveness of this approach when dealing with high dimensional and sparse data.

In this paper, we compare character n-gram representations with traditional word-based representations in the framework of content-based anti-spam filtering. No extra information coming from the e-mail address of the sender, attachments, etc. is taken into account.
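To make the representation concrete, the following is a minimal Python sketch of character n-gram extraction. The function name `char_ngrams` is hypothetical (it does not come from the paper), and spaces are replaced by "_" only for display, following the notation used above. The second part illustrates the robustness claim with the two spellings mentioned in the text.

```python
def char_ngrams(text, n):
    """Return every overlapping character n-gram of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Spaces are shown as "_" purely for display, following the paper's notation.
grams = [g.replace(" ", "_") for g in char_ngrams("In this paper", 4)]
print(grams)

# Robustness to misspellings: the two spellings share most of their 3-grams.
correct = set(char_ngrams("assignment", 3))
misspelt = set(char_ngrams("asignment", 3))
shared = correct & misspelt
print(len(shared), "of", len(correct), "3-grams are shared")
```

No tokenizer is involved at any point: the message is treated as a plain sequence of characters, which is why punctuation tricks such as “f.r.e.e.” simply yield additional n-grams rather than breaking the representation.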
Experiments on two publicly available corpora using a variety of cost-sensitive evaluation measures provide strong evidence that character n-gram representations produce more effective models in comparison with the word-based representations. Moreover, we propose a method for extracting variable-length character n-grams and show that this representation produces optimal classifiers among the examined models when considering a cost-sensitive evaluation.

The rest of this paper is organized as follows: Section 2 includes related work on content-based anti-spam filtering. Section 3 describes the character n-gram representation while Section 4 comprises the method for variable-length character n-gram selection. Section 5 gives an overview of the evaluation measures we used and Section 6 describes the corpora and the performed experiments. Finally, Section 7 summarizes the conclusions drawn from this study and indicates future work directions.

2. Related Work

Probably the first study employing machine learning methods for anti-spam filtering was published in the late 1990s. [2] A Bayesian classifier was trained on manually categorized legitimate and spam messages and its performance on unseen cases was remarkable.
Since then, several machine learning algorithms have been tested on this task, including boosting decision trees and support vector machines, [5] memory-based algorithms, [4] and ensembles of classifiers based on stacking. [2] On the other hand, a number of text representations have been proposed, dealing mainly with word tokens and inspired by information retrieval research. One common method is to use binary attributes corresponding to word occurrence. [2, 4] Alternative methods include word (term) frequencies, [6] tf-idf, [5] and word-position-based attributes. [3] The dimensionality of the resulting attribute vectors is usually reduced by removing attributes that correspond to words occurring only a few times. Recent work [3] has shown that the removal of the most frequent words (like “and”, “to”, etc.) considerably improves the classification accuracy. Another common practice is to use a lemmatizer [3] for converting each word-type to its lemma (e.g., “copies” becomes “copy”). Naturally, the performance of the lemmatizer affects the accuracy of the filter and makes the method language-dependent. Finally, the dimensionality of the attribute vector can be further reduced by applying a feature selection method [4] that ranks the attributes according to their significance in distinguishing between the two classes. Only a predefined number of top ranked attributes are then used in the learning model.

In addition, word n-grams have also been proposed [6, 3] but, so far, the results are not encouraging. Although such a representation captures phrasal information, which is sometimes particularly crucial, the dimensionality of the problem increases significantly. Moreover, the sparse data problem arises since there are many word combinations with low frequency of occurrence. Recently, language modeling techniques have also been tested for anti-spam filtering with promising results. [5, 6]

In 2005, the Text REtrieval Conference (TREC) expanded its tracks with the addition of the spam track, aiming at providing a standard evaluation of anti-spam filtering approaches.
To this end, a collection of evaluation corpora, [7] both public and private, was compiled and a methodology for filter evaluation was developed. [8] Notably, the TREC spam track focused on the ability of the filter to evolve and improve its performance with use.

A few recent studies attempt to utilize a character-level representation of e-mail messages. In Ref. (9) a suffix-tree approach is described which outperforms a traditional Bayesian classifier that is based on a bag of words representation. On the other hand, a representation based on the combination of character 2-grams and 3-grams is proposed in Ref. (20). However, preliminary results in an e-mail categorization task (where many message classes are available) show that approaches based on word-based representations perform slightly better. Finally, one of the best performing participant systems in the TREC 2005 spam track was based on compression models working on the character level. [2]

Research in spam detection was considerably assisted by publicly available benchmark corpora, which allow different approaches to be evaluated on the same testing ground. An important issue is that legitimate messages usually contain personal information of the users which should not become publicly available. One solution is to
collect legitimate messages from mailing lists (e.g., Ling-Spam (b)) or directly from users willing to donate them (e.g., SpamAssassin (c)). Another solution is to attempt to obscure information about senders and receivers, [5] or encode the words of the body of the messages so that they become unreadable. [6] Recently, the publicly-available Enron corpus has been used as a source of legitimate messages. [22]

3. The Bag-of-Character N-grams Approach

First, for a given n, we extract the L most frequent character n-grams of the training corpus. Let $\langle g_1, g_2, \ldots, g_L \rangle$ be the ordered list (in decreasing frequency) of the most frequent n-grams of the training corpus. Then, each message is represented as a vector of length L, $\langle x_1, x_2, \ldots, x_L \rangle$, where $x_i$ depends on $g_i$. In more detail, we examine two representations:

Binary: The value of $x_i$ may be 1 (if $g_i$ is included at least once in the message) or 0 (if $g_i$ is not included in the message).

Term Frequency (TF): The value of $x_i$ corresponds to the frequency of occurrence (normalized by the message length) of $g_i$ in the message.

The produced vectors can be arbitrarily long. On one hand, if L is chosen too small, the messages are not represented adequately. On the other hand, if L is chosen too large, the dimensionality of the problem increases significantly. In the experiments described in the following sections, all the character n-grams that appear more than 3 times in the training corpus are taken into account.

A feature selection method can then be applied to the resulting vectors, so that only the most significant attributes contribute to the classification model. A feature selection method that proved to be quite effective for text categorization tasks is information gain. [4] The information gain of a feature $x_i$ is defined as the expected reduction in entropy by taking $x_i$ as given:

$$IG(C, x_i) = H(C) - H(C \mid x_i) \tag{1}$$

where C denotes the class of the message ($C \in \{spam, legitimate\}$) and H(C) is the entropy of C. In other words, $IG(C, x_i)$ is the information gained by knowing $x_i$.
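The representation and the information gain of Eq. (1) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: all function names are hypothetical, message length is measured in characters (the text does not say how it is measured), and `information_gain` handles discrete attribute values only, so TF attributes would need to be discretized first. The toy corpus is invented for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def frequent_ngrams(messages, n, min_count=4):
    """n-grams seen at least `min_count` times (i.e. more than 3 by default)."""
    counts = Counter(g for m in messages for g in char_ngrams(m, n))
    return [g for g, c in counts.most_common() if c >= min_count]


def vectorize(message, vocabulary, n, binary=True):
    """Binary or length-normalized TF vector over a fixed n-gram vocabulary."""
    counts = Counter(char_ngrams(message, n))
    if binary:
        return [1 if counts[g] else 0 for g in vocabulary]
    length = max(len(message), 1)  # assumption: length measured in characters
    return [counts[g] / length for g in vocabulary]


def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """IG(C, x) = H(C) - H(C | x) for one discrete attribute."""
    total = len(labels)
    conditional = 0.0
    for v in set(values):
        subset = [c for x, c in zip(values, labels) if x == v]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


# Toy corpus (invented); a tiny frequency threshold is used because it is so small.
messages = ["free pics free", "buy now free", "meeting at noon", "see you at noon"]
labels = ["spam", "spam", "legitimate", "legitimate"]
vocabulary = frequent_ngrams(messages, 4, min_count=2)
vectors = [vectorize(m, vocabulary, 4) for m in messages]
gains = {g: information_gain([v[i] for v in vectors], labels)
         for i, g in enumerate(vocabulary)}
ranked = sorted(vocabulary, key=gains.get, reverse=True)  # most significant first
```

Keeping only the first m entries of `ranked` gives the reduced attribute set that is passed to the classifier.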
Information gain helps us to sort the features according to their significance in distinguishing between spam and legitimate messages. Only the first m most significant attributes are then taken into account. The produced vectors (of length m) of the training set are used to train an SVM classifier. The Weka [23] implementation of SVM was used (default parameters were set in all reported experiments).

(b) Available at: http://www.aueb.gr/users/ion/data/ling-spam_public.tar.gz
(c) Available at: http://spamassassin.apache.org/publiccorpus/

4. Variable-length N-gram Selection

The approach described above is able to provide fixed-length character n-grams. In this paper we also propose a method for extracting variable-length character n-gram features based on an existing approach for extracting multiword terms (i.e., word n-grams of variable length) from texts. The original approach aimed at information retrieval applications. [24] In this study, we slightly modified this approach in order to apply it to character n-grams. The main idea is to compare each n-gram with similar n-grams (either immediately longer or shorter) and keep the dominant n-grams. Therefore, we need a function able to express the “glue” that sticks the characters together within an n-gram. For example, the glue of the n-gram |the_| will be higher than the glue of the n-gram |thea|.

4.1. Dominant N-gram Extraction

To extract the dominant character n-grams in a corpus we modified the algorithm LocalMaxs introduced in Ref. (24). It is an algorithm that computes local maxima by comparing each n-gram with similar n-grams. Given that:

g(C) is the glue of n-gram C, that is, the power holding its characters together.

ant(C) is an antecedent of an n-gram C, that is, a shorter string of size n-1.

succ(C) is a successor of C, that is, a longer string of size n+1, i.e., having one extra character either on the left or right side of C.

Then, the dominant n-grams are selected according to the following rules:

$$\begin{aligned}
&\text{if } (C.length > 3): && g(C) \geq g(ant(C)) \,\wedge\, g(C) > g(succ(C)), && \forall\, ant(C), succ(C) \\
&\text{if } (C.length = 3): && g(C) > g(succ(C)), && \forall\, succ(C)
\end{aligned} \tag{2}$$

In this study, we only consider 3-grams, 4-grams, and 5-grams as candidate n-grams. Note that 3-grams are only compared with successor n-grams. Moreover, 5-grams are only compared with antecedent n-grams. So, it is expected that the proposed algorithm will favor 3-grams and 5-grams against 4-grams.

4.2. Representing the Glue

To measure the glue holding the characters of an n-gram together various measures have been proposed, including specific mutual information, [25] the φ² measure, [26] etc.
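The selection rules of Section 4.1 can be sketched as follows, independently of which glue measure is chosen. This is a hedged illustration, not the authors' code: the function names are hypothetical, the glue values are assumed to be given as a dictionary, the comparison with antecedents is taken to be non-strict (as in the LocalMaxs algorithm), and the glue values in the example are invented.

```python
def antecedents(gram):
    """The (n-1)-grams obtained by dropping one character from either end."""
    return {gram[:-1], gram[1:]}


def is_dominant(gram, glue, successors):
    """Apply the selection rules to one candidate n-gram.

    glue: dict mapping each candidate n-gram to its glue value (assumed given).
    successors: dict mapping an n-gram to the candidate (n+1)-grams extending it
    by one character on the left or on the right.
    """
    g = glue[gram]
    beats_successors = all(g > glue[s] for s in successors.get(gram, ()))
    if len(gram) == 3:
        return beats_successors  # 3-grams are compared with successors only
    not_below_antecedents = all(g >= glue[a] for a in antecedents(gram) if a in glue)
    return not_below_antecedents and beats_successors


def select_dominant(glue):
    """Return the dominant n-grams among candidates of length 3 to 5."""
    successors = {}
    for gram in glue:
        for shorter in antecedents(gram):
            successors.setdefault(shorter, set()).add(gram)
    return {gram for gram in glue
            if 3 <= len(gram) <= 5 and is_dominant(gram, glue, successors)}


# Invented glue values, for illustration only ("_" stands for a space).
glue = {"the": 0.30, "he_": 0.25, "hea": 0.05, "the_": 0.60, "thea": 0.10}
dominant = select_dominant(glue)
```

Since no 6-grams are among the candidates, a 5-gram has no successors in `successors` and is therefore compared with its antecedents only, which matches the remark above about 5-grams.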
In this study, we adopt the Symmetrical Conditional Probability (SCP) proposed in Ref. (27). The SCP of a bigram |xy| is the product of the conditional probabilities of each given the other:

$$SCP(x, y) = p(x \mid y) \cdot p(y \mid x) = \frac{p(x, y)}{p(y)} \cdot \frac{p(x, y)}{p(x)} = \frac{p(x, y)^2}{p(x)\, p(y)} \tag{3}$$
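As a quick numerical check of Eq. (3), the short Python sketch below computes the SCP of a bigram both as a product of conditional probabilities and in its closed form. The function name `scp` and the probability values are made up for the illustration.

```python
import math


def scp(p_xy, p_x, p_y):
    """Symmetrical Conditional Probability of a bigram: p(x,y)^2 / (p(x) p(y))."""
    return p_xy ** 2 / (p_x * p_y)


# Invented probabilities: p(x, y) = 0.02, p(x) = 0.05, p(y) = 0.04.
p_xy, p_x, p_y = 0.02, 0.05, 0.04
product_form = (p_xy / p_y) * (p_xy / p_x)  # p(x|y) * p(y|x) = 0.5 * 0.4
closed_form = scp(p_xy, p_x, p_y)
print(product_form, closed_form)
```

Both forms give a value of about 0.2. The measure reaches 1 only when x and y always occur together, which is the sense in which it expresses how strongly the two parts stick to each other.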

Given a character n-gram $c_1 \ldots c_n$, a dispersion point defines two subparts of the n-gram. An n-gram of length n contains n-1 possible dispersion points (e.g., if * denotes a dispersion point, then the 3-gram |the| has two dispersion points: |t*he| and |th*e|). Then, the SCP of the n-gram $c_1 \ldots c_n$ given the dispersion point $c_1 \ldots c_{n-1} * c_n$ is:

$$SCP((c_1 \ldots c_{n-1}), c_n) = \frac{p(c_1 \ldots c_n)^2}{p(c_1 \ldots c_{n-1})\, p(c_n)} \tag{4}$$

The SCP measure can easily be extended to account for any possible dispersion point (since this measure is based on fair dispersion point normalization, it will be called fairSCP). Hence, the fairSCP of the n-gram $c_1 \ldots c_n$ is as follows:

$$fairSCP(c_1 \ldots c_n) = \frac{p(c_1 \ldots c_n)^2}{\dfrac{1}{n-1} \sum_{i=1}^{n-1} p(c_1 \ldots c_i)\, p(c_{i+1} \ldots c_n)} \tag{5}$$

5. Evaluation Measures

5.1. Total Cost Ratio

Two well known measures from the information retrieval community, recall and precision, can describe in detail the effectiveness of a spam detection approach. In more detail, given that $n_{S \rightarrow S}$ is the number of spam messages correctly recognized, $n_{S \rightarrow L}$ is the number of spam messages incorrectly categorized as legitimate, and $n_{L \rightarrow S}$ is the number of legitimate messages incorrectly classified as spam, then spam recall and spam precision can be defined as follows:

$$Spam\ Recall = \frac{n_{S \rightarrow S}}{n_{S \rightarrow S} + n_{S \rightarrow L}} \tag{6}$$

$$Spam\ Precision = \frac{n_{S \rightarrow S}}{n_{S \rightarrow S} + n_{L \rightarrow S}} \tag{7}$$

In intuitive terms, spam recall is an indication of filter effectiveness (the higher the recall, the fewer spam messages pass) while spam precision is an indication of filter safety (the higher the precision, the fewer legitimate messages are blocked). However, spam detection is a cost-sensitive classification task. So, it is much worse to misclassify a legitimate message as spam than vice versa. Therefore, we need an evaluation measure that incorporates an indication of this cost. A cost factor λ is assigned to each legitimate message, that is, each legitimate message is considered as λ messages. [3, 4] In other words, if a legitimate message is misclassified, λ errors occur. A cost-sensitive evaluation measure, the Total Cost Ratio (TCR), can then be defined [3, 4] as follows:

$$TCR = \frac{n_{S \rightarrow S} + n_{S \rightarrow L}}{\lambda \cdot n_{L \rightarrow S} + n_{S \rightarrow L}} \tag{8}$$

The higher the TCR, the better the performance of the approach. In addition, if TCR is lower than 1, then the filter should not be used (the cost of blocking legitimate messages is too high). To be in accordance with previous studies, three cost scenarios were examined:

Low cost scenario (λ=1): This corresponds to an anti-spam filter that lets a message classified as spam reach the mailbox of the receiver along with a warning that the message is probably spam.

Medium cost scenario (λ=9): This corresponds to an anti-spam filter that blocks a message classified as spam and informs the sender to resend the message.

High cost scenario (λ=999): This corresponds to a fully-automated filter that deletes a message classified as spam without notifying either the receiver or the sender.

5.2. ROC Graphs

Receiver Operating Characteristics (ROC) graphs, which originated in signal detection theory, [28] are a useful tool for visualizing the performance of classifiers. Since they offer a reliable representation of the classifier properties under imbalanced class distribution and unequal classification error costs, they are especially popular in the machine learning community. [29] A ROC graph depicts relative trade-offs between benefits and costs. In more detail, given that the false positive (fp) rate and the true positive (tp) rate are defined as follows (with $n_{L \rightarrow L}$ the number of legitimate messages correctly classified):

$$fp\ rate = \frac{n_{L \rightarrow S}}{n_{L \rightarrow L} + n_{L \rightarrow S}} \tag{9}$$

$$tp\ rate = \frac{n_{S \rightarrow S}}{n_{S \rightarrow S} + n_{S \rightarrow L}} \tag{10}$$

the performance of a classifier is depicted in a two-dimensional space in which the false positive rate is plotted on the x axis and the true positive rate is plotted on the y axis. Alternatively, the two axes may correspond to ham misclassification (= fp rate) and spam misclassification (= 1 - tp rate), respectively. [8] Given that the classifier can assign a probability or score to an instance, a curve is plotted in the ROC space by varying the threshold used to produce binary classification results.
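The threshold sweep that produces a ROC curve can be sketched in Python as follows. This is a minimal illustration, assuming each message has a numeric score where higher means more spam-like; the function name `roc_points` and the scores are invented for the example.

```python
def roc_points(scores, is_spam):
    """Return (fp rate, tp rate) pairs obtained by lowering the decision threshold.

    A message is classified as spam when its score is at or above the threshold;
    each distinct score value is used as a threshold once.
    """
    n_spam = sum(is_spam)
    n_ham = len(is_spam) - n_spam
    ranked = sorted(zip(scores, is_spam), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged as spam
    tp = fp = 0
    for i, (score, spam) in enumerate(ranked):
        if spam:
            tp += 1  # a spam message correctly flagged
        else:
            fp += 1  # a legitimate message wrongly flagged
        # Emit a point only after all messages sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_ham, tp / n_spam))
    return points


# Invented scores for three spam and two legitimate messages.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
is_spam = [True, True, False, True, False]
curve = roc_points(scores, is_spam)
```

Each point of `curve` is one possible operating point of the same classifier; choosing among them is where the misclassification cost comes in.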
An important property of the ROC graphs is that they are insensitive to changes in class distribution. [29] That is, if the proportion of spam to ham messages changes, the ROC curve of a classifier remains the same. Note that recall-precision graphs are affected drastically by such changes. For spam filtering applications, this is a crucial factor since the amount of spam messages that reaches a specific mailbox continuously changes. Another interesting property of ROC graphs is that they are able to produce a single value
that represents the expected performance. The most common method is to calculate the area under the ROC curve (AUC). The AUC of a classifier corresponds to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.

Finally, the operating conditions of a classifier (i.e., different classification error costs) may be translated into iso-performance lines in the ROC space, that is, lines with the same slope. [29] Let λ be the cost of misclassifying a ham message. The slope of the iso-performance lines is defined as:

$$slope = \lambda \cdot \frac{n_{L \rightarrow L} + n_{L \rightarrow S}}{n_{S \rightarrow S} + n_{S \rightarrow L}} \tag{11}$$

Lines closer to the upper left corner of the ROC space correspond to better classifiers. Given the ROC curves of a set of classifiers, a classifier may be optimal if and only if it lies on the ROC convex hull (ROCCH). If the operating conditions of the classifier change, the ROCCH remains the same but a different portion of the ROCCH should be examined for identifying the optimal classifier.

6. Experiments

6.1. Corpora and Settings

In this study we rely on two widely-used corpora to evaluate the usefulness of the character n-grams for spam filtering. The first corpus is Ling-Spam, consisting of 2,893 e-mails: 481 spam messages and 2,412 legitimate messages taken from postings of a mailing list about linguistics. This corpus has a relatively low spam rate (16%) and the legitimate messages are not as heterogeneous as the messages found in the personal inbox of a specific user. However, it has already been widely used in previous studies [3, 4, 9] and comparison of our results with word-based methods is feasible. Moreover, it provides evidence about the effectiveness of our approach as assistance to mailing list moderators. The bare version of this corpus was used (no lemmatizing or stop-word removal was performed) in order to extract accurate character n-gram frequencies.
Unfortunately, this corpus was already converted to lower case, so it was not possible to explore the significance of upper case characters.

The second corpus is a part of the publicly-available SpamAssassin corpus. The legitimate messages of this corpus were collected from public forums as well as from direct donation of specific users. In more detail, the corpus we used consists of 2,000 ham and 1,500 spam messages. This corpus has been preprocessed in order to remove attachments, headers, and html-tags. Since it has been constructed by many individual users, the legitimate messages are expected to be more heterogeneous than the messages found in the personal mailbox of a single user. Moreover, since the messages are in their original form, it is possible to explore whether or not case-sensitive character n-grams perform better than case-insensitive character n-grams.

In all the experiments described below, a ten-fold cross-validation procedure was followed. That is, the entire corpus was divided into ten equal parts; in each fold a different part is used as test set and the remaining parts as training set. Final results come from averaging the results of each fold. Finally, in all cases, an SVM classifier was used as learning algorithm with default values (linear kernel, C=1).

[Figure 1: four panels plotting spam recall (%) and spam precision (%), on a scale from 94 to 100, against the number of attributes (0 to 4000) for 3-grams, 4-grams, and 5-grams.]
Fig. 1. Spam recall and spam precision of the proposed approach based on character 3-grams, 4-grams, and 5-grams and varying number of attributes on the Ling-Spam corpus. Top: binary attributes. Bottom: TF attributes.

6.2. Results on Ling-Spam

Three sets of experiments were performed based on character 3-gram, 4-gram, and 5-gram representations, respectively. In all three cases, both binary and TF attributes were examined. Moreover, different values of the m attributes left after the feature selection procedure were tested (m starts from 250 and then varies from 500 to 4000 by 500). For evaluating the performance of the classifiers we use the recall, precision, and TCR measures in order to be able to compare them with reported results of previous word-based studies on the same corpus.

[Figure 2: six panels plotting TCR against the number of attributes (0 to 4000) for 3-grams, 4-grams, and 5-grams.]
Fig. 2. Results of cost-sensitive evaluation on the Ling-Spam corpus. TCR values for λ=1 (top), λ=9 (middle), and λ=999 (bottom) and varying number of attributes and n-gram length. Left column: binary attributes. Right column: TF attributes.

Table 1. Comparison of cost-sensitive evaluation (λ=1, 9, and 999) of the proposed approach with previously published results on Ling-Spam. Best reported results for spam recall, spam precision, and TCR are given. ST results refer to a sub-corpus of Ling-Spam.

| Approach | λ | No. of attributes | Recall | Precision | TCR |
|---|---|---|---|---|---|
| NB | 1 | 100 | 82.35% | 99.02% | 5.41 |
| MBL | 1 | 600 | 88.60% | 97.39% | 7.18 |
| SG | 1 | 300 | 89.60% | 98.70% | 8.60 |
| ST | 1 | - | 97.22% | 100% | 35.97 |
| Proposed | 1 | 3,500 | 98.50% | 99.60% | 52.75 |
| NB | 9 | 100 | 77.57% | 99.45% | 3.82 |
| MBL | 9 | 700 | 81.93% | 98.79% | 3.64 |
| SG | 9 | 100 | 84.80% | 98.80% | 4.08 |
| ST | 9 | - | 98.89% | 98.89% | 9.01 |
| Proposed | 9 | 3,500 | 98.50% | 99.60% | 19.76 |
| NB | 999 | 300 | 63.67% | 100% | 2.86 |
| MBL | 999 | 600 | 59.91% | 100% | 2.49 |
| ST | 999 | - | 97.78% | 100% | 45.04 |
| Proposed | 999 | 3,500 | 98.50% | 99.60% | 0.25 |

The results of the application of our approach to Ling-Spam are shown in Fig. 1. As can be seen, for binary attributes, 4-grams seem to provide the most reliable representation (for m>2000). On the other hand, for TF attributes there is no clear winner. More significantly, binary attributes seem to provide better spam precision results while TF attributes are better in terms of spam recall. In most cases, spam recall was higher than 97% while, at the same time, spam precision was higher than 98%. Moreover, a few thousand features are required to get these results. This is in contrast to previous word-based approaches that deal with a limited number (a few hundred) of attributes.

The results of the cost-sensitive evaluation are shown in Fig. 2. In particular, TCR values for 3-grams, 4-grams, and 5-grams are given for varying number of attributes. Results are given for both binary and TF attributes as well as the three evaluation scenarios (λ=1, 9, and 999, respectively). As can be seen, in all three scenarios, a representation based on character 4-grams with binary attributes provides the best results. This holds for a relatively high number of attributes (m>2500). For λ=1 and λ=9 the TCR results are well above 1, indicating the effectiveness of the filter.
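The TCR values reported for the proposed approach in Table 1 can be related to its recall and precision with a small amount of arithmetic. Writing R for spam recall and P for spam precision, the definitions of Section 5.1 give TCR = 1 / (λ·R·(1-P)/P + (1-R)), which no longer depends on the corpus size. The Python sketch below evaluates this expression; the function name is hypothetical, and the check assumes that the table entries were computed from the rounded recall and precision shown there, which the text does not state.

```python
def tcr_from_rates(recall, precision, cost):
    """TCR expressed through spam recall R and spam precision P.

    With N spam messages: spam missed = N * (1 - R), and legitimate messages
    blocked = N * R * (1 - P) / P, so N cancels out of the ratio.
    """
    blocked_legitimate = recall * (1.0 - precision) / precision
    missed_spam = 1.0 - recall
    return 1.0 / (cost * blocked_legitimate + missed_spam)


# Recall and precision of the proposed approach as listed in Table 1.
recall, precision = 0.985, 0.996
for cost in (1, 9, 999):
    print(cost, round(tcr_from_rates(recall, precision, cost), 2))
# prints: 1 52.75 / 9 19.76 / 999 0.25 (one pair per line)
```

The same expression shows why the high cost scenario is so demanding: with λ=999 even a precision of 99.6% makes the first term of the denominator close to 4, so TCR can exceed 1 only when precision is essentially 100%.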
On the other hand, for λ=999, the TCR results are less than 1, indicating that the filter should not be used at all. However, this scenario is unlikely to be used in practice.

Table 1 shows a comparison of the proposed approach with previously published results on the same corpus in terms of spam recall, spam precision, and TCR values. In more detail, best results achieved by four methods are reported: a Naïve Bayes (NB) classifier, [3] a Memory-Based Learner (MBL), [4] and a Stacked Generalization approach (SG) [2] using word-based features, and a Suffix Tree (ST) [9] approach based on character-level information. The number of attributes that correspond to the best results of each method is also given. It should be noted that the results for the ST approach refer
to a sub-corpus of Ling-Spam with a proportion of spam to legitimate messages approximately equal to that of the entire Ling-Spam corpus (200 spam and 1,000 legitimate messages). Moreover, no results were reported for the SG approach based on the high cost scenario.

As concerns the TCR, the proposed approach is by far more effective than word-based approaches for the low and medium cost scenarios. This is due to the fact that it manages to achieve high spam recall while maintaining spam precision at an equally high level. ST is also quite competitive. This provides extra evidence that character-based representations are better able to capture the characteristics of spam messages. On the other hand, the proposed approach failed to produce a TCR value greater than 1 for the high cost scenario. That is because the precision failed to be 100%. It must be underlined that previous studies [3, 4] show that TCR is not stable for the high cost scenario and it is common for TCR to exceed 1 only for very specific settings of the filter.

6.3. Results on SpamAssassin

Using the SpamAssassin corpus we are able to evaluate the character n-grams approach based on case-sensitive and case-insensitive datasets. Three sets of experiments were performed based on character 3-gram, 4-gram, and 5-gram binary representations, respectively. All character n-grams appearing more than 3 times in the corpus constitute the initial feature set. Then, information gain is applied to this initial feature set to reduce the dimensionality. Different values of the m attributes left after the feature selection procedure were tested (m varies from 500 to 4,000 by 500) for each character n-gram category. Additionally, traditional word-based models were also constructed for comparative purposes. First, a bare bag of words approach and, second, a bag of words approach in combination with a lemmatizer and stop words removal were tested using the TMG toolbox. [30]
For evaluating the performance of the produced classifiers we use the ROC graphs since they offer an insight into the properties of the produced filters.

Fig. 3 shows the performance of the case-sensitive and case-insensitive character n-gram-based classifiers in terms of the AUC. As can be seen, case-sensitive character n-grams outperform case-insensitive character n-grams. Moreover, the 3-gram model is the most effective for this corpus, followed by 4-grams and 5-grams. This contrasts with the case of Ling-Spam, where 4-grams were found to perform better. Recall that SpamAssassin legitimate messages are not as homogeneous in topic as Ling-Spam messages. Some long character n-grams taken from very common words of the Ling-Spam corpus (e.g., “language”, “linguistics”, etc.) tend to be the most important for identifying ham messages. There are no such keywords in the SpamAssassin corpus. Therefore, shorter character n-grams prevail in this corpus.

Fig. 4 depicts the performance of the case-sensitive models in comparison to the word-based models. Again, 3-grams are the most effective features. On the other hand, word-based models outperform 4-grams and 5-grams. Finally, the use of lemmatizer and stop word removal improves the results for low dimensional datasets (m<2,500) while the bare bag of words approach is better for higher dimensionality.

[Figure 3: AUC (0.992 to 0.999) against the number of attributes (0 to 6000) for CS 3-grams, CS 4-grams, CS 5-grams, CI 3-grams, CI 4-grams, and CI 5-grams.]
Fig. 3. The performance of the case-sensitive (CS) and case-insensitive (CI) character n-gram models on the SpamAssassin corpus.

[Figure 4: AUC (0.986 to 0.998) against the number of attributes (0 to 6000) for BOW, BOW+lem, CS 3-grams, CS 4-grams, and CS 5-grams.]
Fig. 4. The performance of the word-based models (simple bag-of-words and bag-of-words with lemmatizer and stop word removal) and the case-sensitive (CS) character n-gram models on the SpamAssassin corpus.
[Figure: AUC of the BOW, BOW+lem, CS 3-gram, and VL n-gram models.]
Fig. 5. Performance of the word-based approaches (simple bag-of-words and bag-of-words with lemmatizer and stop word removal), the case sensitive (CS) 3-gram model and the variable-length (VL) n-gram model on the SpamAssassin corpus.

6.4. Results with Variable-length N-grams

So far, all the experiments were based on fixed-length character n-grams. In order to test the approach proposed in Section 4 for extracting variable-length n-grams, we performed the following experiment. An initial large feature set consisting of case sensitive character n-grams of variable length is extracted from the training corpus. This feature set includes the L most frequent n-grams for certain values of n. That is, for L=1,000, the 1,000 most frequent 3-grams, the 1,000 most frequent 4-grams, and the 1,000 most frequent 5-grams compose the initial feature set. The proposed variable-length feature selection method is applied to this initial feature set. The resulting feature set is used to train an SVM classifier. This procedure was followed for L ranging from 1,000 to 5,000 with a step of 500. Fig. 5 shows the performance (in terms of AUC) of the variable-length character n-grams in comparison with the case sensitive 3-grams and the word-based approaches. Note that the variable-length model is not able to produce a predefined number of attributes. However, by varying the value of L, it is possible to get roughly as many attributes as the other models produce. The variable-length n-gram approach outperforms the word-based
approaches and is quite competitive with the case sensitive 3-gram model. However, the latter is still the best performing model in most of the cases.

Note that the AUC measure is an overall indication of the effectiveness of the classifier. A closer look at the ROC graphs reveals that the variable-length n-gram model is optimal when considering a cost-sensitive evaluation. In particular, Fig. 6 shows the ROC graphs for the case sensitive 3-gram and the word-based models (m=5,000) as well as the variable-length n-gram model (L=4,500, producing 5,277 selected attributes). The ROCCH is also depicted. As can be seen, the 3-gram model and the variable-length n-gram model dominate the ROCCH. The former is better for both very low and very high (not shown in the figure) levels of fp rate, while the latter is better for the middle fp rate level. On the other hand, word-based models are outperformed.

[Figure: tp rate against fp rate for the BOW+lem, BOW, CS 3-gram, and VL n-gram models, with the ROCCH.]
Fig. 6. ROC graphs for the word-based models, the case sensitive 3-gram model, and the variable-length n-gram model. The ROC convex hull indicating optimal performance among the examined models is also depicted.

Recall from Section 5.2 that, according to the operating conditions of the classifier, a specific part of the ROCCH corresponds to the optimal classifier. In the framework of a cost-sensitive evaluation, the optimal classifier is identified based on the iso-performance lines (see Eq. ). Fig. 7 shows the best iso-performance lines for different costs of misclassifying a legitimate message as spam: λ=1, 9, and 999. In all three cases, the part of the ROCCH taken up by the variable-length n-gram model corresponds to the optimal classifier. Therefore, despite the fact that the 3-gram model is more effective in general, as indicated by the AUC measure, the variable-length n-gram model is optimal in a cost-sensitive evaluation.
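The way an operating point is picked from the ROC convex hull can be made concrete with a short sketch. It assumes the standard expected-cost formulation of ROC analysis, in which an iso-performance line has slope λ·P(legitimate)/P(spam) when spam is the positive class and a missed spam message costs 1; this is an assumed formulation rather than a reproduction of the equation referenced in the text. The function names, the sample points, and the class prior are illustrative.

```python
def _cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Upper convex hull of (fp rate, tp rate) points, left to right.

    The trivial classifiers (0, 0) and (1, 1) are always included.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the previous point while it lies on or below the segment
        # joining its predecessor to p (that is, no clockwise turn).
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def optimal_point(hull, lam, p_spam):
    """Hull vertex touched by the best iso-performance line.

    lam: cost of misclassifying a legitimate message as spam, relative
    to a cost of 1 for a missed spam message.
    p_spam: assumed prior probability of spam.
    """
    slope = lam * (1.0 - p_spam) / p_spam
    # Maximising tp - slope * fp minimises the expected cost.
    return max(hull, key=lambda pt: pt[1] - slope * pt[0])


# Hypothetical operating points of several classifiers.
points = [(0.1, 0.8), (0.2, 0.7), (0.3, 0.95)]
hull = roc_convex_hull(points)
print(hull)
print(optimal_point(hull, lam=1, p_spam=0.5))
print(optimal_point(hull, lam=999, p_spam=0.5))
```

With these toy points, the point (0.2, 0.7) falls below the hull and can never be optimal. A very large λ makes the iso-performance lines so steep that only points with a near-zero fp rate can win, which is consistent with the instability of the high cost scenario noted in Section 6.2.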
Note that the attributes of the case sensitive 3-gram model of the previous experiment were selected from an initial feature set of 22,673 attributes (i.e., all character 3-grams that appear more than 3 times in the corpus). On the other hand, the variable-length n-gram attributes were selected from an initial feature set of 13,500 attributes (since L=4,500). Fig. 8 shows the composition of this variable-length n-gram model produced for L=4,500 (resulting in 5,277 attributes). The prevailing character n-gram category is 3-grams, followed by 5-grams. However, the contribution of the longer n-grams seems to be of crucial importance for constructing more reliable cost-sensitive filters. Interestingly, the variable-length n-gram set and the case sensitive 3-gram set have only 926 3-grams in common. That is, only 18% of the case sensitive 3-grams are included in the variable-length n-gram set. This means that many 3-grams not selected by information gain are now included in the feature set and are helpful for producing more effective classifiers.

[Figure: tp rate against fp rate for the CS 3-gram and VL n-gram models, with the ROCCH and iso-performance lines for λ=1, λ=9, and λ=999.]
Fig. 7. ROC graphs for the case sensitive 3-gram model and the variable-length n-gram model as well as the ROC convex hull. Iso-performance lines indicating the optimal classifier for different cost values (λ=1, 9, and 999) of misclassifying a legitimate message are also depicted.

7. Conclusions

In this paper we presented a comparison of words and character n-grams in the framework of content-based anti-spam filtering. A series of experiments using two benchmark corpora and a variety of cost-sensitive evaluation measures provides strong evidence that character-level information is better able to discriminate between spam and
legitimate messages. The most important property of the character n-gram approach is that it avoids the use of tokenizers, lemmatizers and other language-dependent tools. Those tools are quite vulnerable to spammers. On the other hand, a model based on character-level information can capture nuances of the spamminess of a message and, more importantly, it is not easy for spammers to fool it by incorporating punctuation marks or other special symbols within a word.

[Figure: number of selected 3-grams, 4-grams, and 5-grams.]
Fig. 8. The composition of the variable-length n-gram model for L=4,500 (5,277 extracted attributes).

One corpus used in this study comprised homogeneous legitimate messages, since they were taken from a mailing list about linguistics. The legitimate messages of the other corpus are quite heterogeneous, more so than the messages found in the mailbox of a single user, since they consist of messages donated by different users. This difference was reflected in the results of the character n-gram models on these two corpora. When considering heterogeneous legitimate messages, short n-grams (3-grams) produced the best models, while longer n-grams (4-grams) were optimal when considering homogeneous legitimate messages. This indicates the ability of the character-level approach to adapt to the properties of a specific corpus or the mailbox of a specific user.

In addition to character n-grams of fixed length, we also proposed a method for extracting variable-length character n-grams based on an existing approach, originally used for extracting multi-word terms for information retrieval applications. Results of cost-sensitive evaluation indicate that the variable-length n-gram model is the most effective in all three examined cost scenarios (i.e., low, medium, and high cost). Although the majority of the variable-length n-grams are 3-grams, they have only a few members in common with the fixed-length 3-gram set.
Hence, the information included in the variable-length n-grams is quite different in comparison to the information represented by case sensitive 3-grams.
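The two quantities behind this comparison, the initial candidate set of Section 6.4 (the L most frequent 3-grams, 4-grams, and 5-grams) and the overlap between two selected feature sets, can be sketched as follows. Only the construction of the candidate set and the overlap count are shown; the variable-length selection method of Section 4 is not reproduced. Function names and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def top_ngrams(messages, n, L):
    """The L most frequent character n-grams (case sensitive)."""
    counts = Counter(
        msg[i:i + n] for msg in messages for i in range(len(msg) - n + 1)
    )
    # Most frequent first; ties broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:L]]


def initial_feature_set(messages, L, lengths=(3, 4, 5)):
    """Concatenate the top-L lists for each n: at most len(lengths) * L candidates."""
    features = []
    for n in lengths:
        features.extend(top_ngrams(messages, n, L))
    return features


def overlap(fixed_set, variable_set):
    """Number of shared n-grams and their share of the fixed-length set."""
    common = set(fixed_set) & set(variable_set)
    return len(common), len(common) / len(set(fixed_set))


# Toy usage with hypothetical messages.
msgs = ["abcabc", "abcd"]
print(initial_feature_set(msgs, L=2))
print(overlap(["abc", "xyz"], initial_feature_set(msgs, L=2)))
```

Because each length contributes its own top-L list, the candidate set grows linearly with L, while the number of attributes that survive selection is determined by the selection method rather than fixed in advance.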
An interesting direction for future work is to test the character n-gram representation in the framework of on-line evaluation of anti-spam filtering. This implies that the set of significant character n-grams should evolve with use, as new messages arrive and are classified by the filter, so as to better capture the properties of spam and legitimate messages.

References

1. F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys, 34(1) (2002) pp. 1-47.
2. M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, A Bayesian approach to filtering junk e-mail, In Proc. of AAAI Workshop on Learning for Text Categorization (1998).
3. I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, G. Paliouras, and C.D. Spyropoulos, An evaluation of naive Bayesian anti-spam filtering, In eds. G. Potamias, V. Moustakis, and M. van Someren, Proc. of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (2000) pp. 9-17.
4. G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, and P. Stamatopoulos, A memory-based approach to anti-spam filtering for mailing lists, Information Retrieval, 6(1) (2003) pp. 49-73.
5. H. Drucker, D. Wu, and V. Vapnik, Support vector machines for spam categorization, IEEE Trans. Neural Networks, 10 (1999) pp. 1048-1054.
6. I. Androutsopoulos, G. Paliouras, and E. Michelakis, Learning to filter unsolicited commercial e-mail, Technical report 2004/2, NCSR "Demokritos" (2004).
7. W. Cavnar and J. Trenkle, N-gram-based text categorization, In Proc. 3rd Int'l Symposium on Document Analysis and Information Retrieval (1994) pp. 161-169.
8. V. Keselj, F. Peng, N. Cercone, and C. Thomas, N-gram-based author profiles for authorship attribution, In Proc. of the Conference Pacific Assoc. Comp. Linguistics (2003).
9. H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, Text classification using string kernels, The Journal of Machine Learning Research, 2 (2002) pp. 419-444.
10. V. Vapnik, The Nature of Statistical Learning Theory, (Springer, New York 1995).
11. T. Joachims, Text categorization with support vector machines: learning with many relevant features, In Proc. of the European Conference on Machine Learning (1998).
12. G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, and P. Stamatopoulos, Stacking classifiers for anti-spam filtering of e-mail, In Proc. of 6th Conf. Empirical Methods in Natural Language Processing (2001) pp. 44-50.
13. J. Hovold, Naive Bayes spam filtering using word-position-based attributes, In Proc. of the Second Conference on Email and Anti-Spam (2005).
14. Y. Yang and J.O. Pedersen, A comparative study on feature selection in text categorization, In Proc. of the 14th Int. Conference on Machine Learning (1997) pp. 412-420.
15. B. Medlock, An adaptive, semi-structured language model approach to spam filtering on a new corpus, In Proc. of the 3rd Conference on Email and Anti-Spam (2006).
16. E. Terra, Simple language models for spam detection, In Proc. of the 14th Text Retrieval Conference (2005).
17. G. Cormack and T. Lynam, Spam corpus creation for TREC, In Proc. of 2nd Conference on Email and Anti-Spam (2005).
18. G. Cormack and T. Lynam, TREC 2005 spam track overview, In Proc. of the 14th Text Retrieval Conference (2005).
19. R. Pampapathi, B. Mirkin, and M. Levene, A suffix tree approach to text categorisation applied to spam filtering, http://arxiv.org/abs/cs.ai/0503030.
20. H. Berger, M. Koehle, and D. Merkl, On the impact of document representation on classifier performance in e-mail categorization, In Proc. of the 4th International Conference on Information Systems Technology and its Applications (2005) pp. 19-30.
21. A. Bratko and B. Filipic, Spam filtering using character-level Markov models: experiments for the TREC 2005 spam track, In Proc. of the 14th Text Retrieval Conference (2005).
22. V. Metsis, I. Androutsopoulos, and G. Paliouras, Spam filtering with naive Bayes - which naive Bayes?, In Proc. of the 3rd Conference on Email and Anti-Spam (2006).
23. I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools with Java Implementations, (Morgan Kaufmann, San Francisco 2000).
24. J. Silva and G. Lopes, A local maxima method and a fair dispersion normalization for extracting multiword units, In Proc. of the 6th Meeting on the Mathematics of Language (1999) pp. 369-381.
25. K. Church and K. Hanks, Word association norms, mutual information and lexicography, Computational Linguistics, 16(1) (1990) pp. 22-29.
26. W. Gale and K. Church, Concordance for parallel texts, In Proc. of the 7th Annual Conference for the new OED and Text Research, (Oxford, 1991) pp. 40-62.
27. J. Silva, G. Dias, S. Guilloré, and G. Lopes, Using LocalMaxs algorithm for the extraction of contiguous and non-contiguous multiword lexical units, Lecture Notes on Artificial Intelligence, 1695 (Springer, 1999) pp. 113-132.
28. J. Egan, Signal Detection Theory and ROC Analysis, Series in Cognition and Perception, (New York: Academic Press, 1975).
29. F. Provost and T. Fawcett, Robust classification for imprecise environments, Machine Learning 42(3) (2001) pp. 203-231.
30. D. Zeimpekis and E. Gallopoulos, Design of a MATLAB toolbox for term-document matrix generation, In Proc. Workshop on Clustering High Dimensional Data and its Applications, (2005) pp. 38-48.