An Evaluation of Naive Bayesian Anti-Spam Filtering



Proceedings of the workshop on Machine Learning in the New Information Age, G. Potamias, V. Moustakis and M. van Someren (eds.), 11th European Conference on Machine Learning, Barcelona, Spain, pp. 9-17, 2000.

Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos, George Paliouras and Constantine D. Spyropoulos
Software and Knowledge Engineering Laboratory
National Centre for Scientific Research "Demokritos"
153 10 Ag. Paraskevi, Athens, Greece
phone: +30--650397 fax: +30--653275
E-mail: {ionandr, jkoutsi, kostel, paliourg, costass}@iit.demokritos.gr

Abstract

It has recently been argued that a Naive Bayesian classifier can be used to filter unsolicited bulk e-mail ("spam"). We conduct a thorough evaluation of this proposal on a corpus that we make publicly available, contributing towards standard benchmarks. At the same time we investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists on the filter's performance, issues that had not been previously explored. After introducing appropriate cost-sensitive evaluation measures, we reach the conclusion that additional safety nets are needed for the Naive Bayesian anti-spam filter to be viable in practice.

1 Introduction

Unsolicited bulk e-mail, electronic messages posted blindly to thousands of recipients, is becoming alarmingly common. Although most users find these postings (called "spam") annoying and delete them immediately, the low cost of e-mail is a strong incitement for direct marketers advertising anything from vacations to get-rich schemes. A 1997 study (Cranor & LaMacchia, 1998) found that 10% of the incoming e-mail to a corporate network was spam. Apart from wasting time, spam costs money to users with dial-up connections, wastes bandwidth, and may expose under-aged recipients to unsuitable (e.g. pornographic) content.

Some anti-spam filters are already available.¹ These rely mostly on manually constructed pattern-matching rules that need to be tuned to each user's incoming messages, a task requiring time and expertise. Furthermore, the characteristics of spam (e.g. products advertised, frequent terms) change over time, requiring the rules to be maintained.
A system that would learn automatically to separate spam from other "legitimate" messages would, therefore, present significant advantages.

Several machine learning algorithms have been applied to text categorization (e.g. Apte & Damerau, 1994; Lewis, 1996; Dagan et al., 1997; see Sebastiani, 1999, for a survey). These algorithms learn to classify documents into fixed categories, based on their content, after being trained on manually categorized documents. Algorithms of this kind have also been used to thread e-mail (Lewis & Knowles, 1997), classify e-mail into folders (Cohen, 1996; Payne & Edwards, 1997), identify interesting news articles (Lang, 1995), etc. To the best of our knowledge, however, only one attempt has ever been made to apply a machine learning algorithm to anti-spam filtering (Sahami et al., 1998).

Sahami et al. trained a Naive Bayesian classifier (Duda & Hart, 1973; Mitchell, 1997) on manually categorized legitimate and spam messages, reporting impressive precision and recall on unseen messages. It may be surprising that text categorization can be effective in anti-spam filtering: unlike other text categorization tasks, it is the act of blindly mass-mailing a message that makes it spam, not its actual content. Nevertheless, it seems that the language of spam constitutes a distinctive genre, and that spam messages are often about topics rarely mentioned in legitimate messages, making it possible to train a text classifier for anti-spam filtering.

¹ See, for example, http://www.tucows.com. Consult http://www.cauce.org, http://www.junkemail.org, and http://spam.abuse.net for related resources and legal issues.

Text categorization research has benefited from publicly available manually categorized document collections, like the Reuters corpus (Lewis, 1992), that have been used as benchmarks. Creating similar resources for anti-spam filtering is not straightforward, because a user's incoming e-mail stream cannot be made public without violating his/her privacy. A useful approximation of such a stream, however, can be made by mixing spam messages with messages extracted from spam-free public archives of mailing lists. Towards that direction, we test Sahami et al.'s approach on a mixture of spam messages and messages sent via the Linguist list,² a moderated (hence, spam-free) list about the profession and science of linguistics. The resulting corpus, dubbed Ling-Spam, is made publicly available for others to use as a benchmark.³

The Linguist messages are, of course, more topic-specific than most users' incoming e-mail. They are less standardized, however, than one might expect (e.g. they contain job postings, software availability announcements, even flame-like responses), to the extent that useful preliminary conclusions about anti-spam filtering of a user's incoming e-mail can be reached with Ling-Spam, at least until better public corpora become available. With a more direct interpretation, our experiments can be seen as a study on anti-spam filters for open unmoderated mailing lists or newsgroups.

Unlike Sahami et al., we use ten-fold cross-validation, which makes our results less prone to random variation. Our experiments also shed more light on the behavior of Naive Bayesian anti-spam filtering by investigating the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists, issues not covered by Sahami et al.'s study. Furthermore, we show how evaluation measures that incorporate a decision-theoretic notion of cost can be employed. Our results confirm Sahami et al.'s high precision and recall. A cost-sensitive evaluation, however, suggests that complementary safety nets are needed for the Naive Bayesian filter to be viable.

Section 2 discusses Naive Bayesian classification; section 3 lists Sahami et al.'s results; section 4 describes our filtering system, the Ling-Spam corpus, and our results; section 5 introduces cost-sensitive evaluation measures; and section 6 concludes.

² Archived at http://listserv.linguistlist.org/archives/linguist.html.
³ The Ling-Spam corpus is available from http://www.iit.demokritos.gr/~ionandr/publications.htm.

2 Naive Bayesian classification

Each message is represented by a vector $\vec{x} = \langle x_1, x_2, x_3, \ldots, x_n \rangle$, where $x_1, \ldots, x_n$ are the values of attributes $X_1, \ldots, X_n$. Following Sahami et al., we use binary attributes: $X_i = 1$ if some characteristic represented by $X_i$ is present in the message; otherwise $X_i = 0$. In our experiments, attributes correspond to words, i.e. each attribute shows if a particular word (e.g. "adult") is present. To select among all possible attributes, we follow Sahami et al. and compute the mutual information ($MI$) of each candidate attribute $X$ with the category-denoting variable $C$:

$$MI(X; C) = \sum_{x \in \{0,1\},\; c \in \{spam,\, legitimate\}} P(X = x, C = c) \cdot \log \frac{P(X = x, C = c)}{P(X = x) \cdot P(C = c)}$$

The attributes with the highest $MI$s are selected. Probabilities are estimated as frequency ratios from the training corpus (see Mitchell, 1997, for better estimators that we plan to incorporate in the future). From Bayes' theorem and the theorem of total probability, given the vector $\vec{x} = \langle x_1, \ldots, x_n \rangle$ of a document $d$, the probability that $d$ belongs to category $c$ is:

$$P(C = c \mid \vec{X} = \vec{x}) = \frac{P(C = c) \cdot P(\vec{X} = \vec{x} \mid C = c)}{\sum_{k \in \{spam,\, legitimate\}} P(C = k) \cdot P(\vec{X} = \vec{x} \mid C = k)}$$

The probabilities $P(\vec{X} \mid C)$ are practically impossible to estimate directly (the possible values of $\vec{X}$ are too many, and there are data-sparseness problems). The Naive Bayesian classifier makes the simplifying assumption that $X_1, \ldots, X_n$ are conditionally independent given the category $C$. Then:

$$P(C = c \mid \vec{X} = \vec{x}) = \frac{P(C = c) \cdot \prod_{i=1}^{n} P(X_i = x_i \mid C = c)}{\sum_{k \in \{spam,\, legitimate\}} P(C = k) \cdot \prod_{i=1}^{n} P(X_i = x_i \mid C = k)}$$

where $P(X_i \mid C)$ and $P(C)$ can be easily estimated as relative frequencies from the training corpus. Several studies have found the Naive Bayesian classifier to be surprisingly effective (Langley et al., 1992; Domingos & Pazzani, 1996), despite the fact that its independence assumption is usually over-simplistic.

Mistakenly blocking a legitimate message (classifying it as spam) is generally more severe than letting a spam message pass the filter (classifying a spam message as legitimate). Let $L \to S$ and $S \to L$ denote the two error types. Assuming that $L \to S$ is $\lambda$ times more costly than $S \to L$, we classify a message as spam if:

$$\frac{P(C = spam \mid \vec{X} = \vec{x})}{P(C = legitimate \mid \vec{X} = \vec{x})} > \lambda$$

To the extent that the independence assumption holds and the probability estimates are accurate, a classifier adopting this criterion achieves optimal results (Duda & Hart, 1973). In our case, $P(C = spam \mid \vec{X} = \vec{x}) = 1 - P(C = legitimate \mid \vec{X} = \vec{x})$, which leads to an alternative reformulation of the criterion:

$$P(C = spam \mid \vec{X} = \vec{x}) > t, \quad \text{with} \quad t = \frac{\lambda}{1 + \lambda}, \quad \lambda = \frac{t}{1 - t}$$

Sahami et al. set the threshold $t$ to 0.999 ($\lambda = 999$); i.e. blocking a legitimate message is as bad as letting 999 spam messages pass the filter. Such a high value of $\lambda$ is reasonable when blocked messages are discarded without further processing, as most users would consider losing a legitimate message unacceptable. Alternative configurations are possible, however, where lower values of $\lambda$ are reasonable. Instead of deleting a blocked message, it could be returned to the sender, with a request to re-send it to a private un-filtered e-mail address of the recipient (see also Hall, 1998). The private address would never be advertised (e.g. on web pages), making it unlikely to receive spam directly; and the request to re-send could include a frequently changing riddle (e.g. "Include in the subject the capital of France.") to ensure that replies are not sent by spam-generating robots. In that case, $\lambda = 9$ ($t = 0.9$) seems reasonable: blocking a legitimate message is penalized mildly more than letting a spam message pass, to model the fact that re-sending a blocked message involves more work (by the sender) than manually deleting a spam message. Even $\lambda = 1$ ($t = 0.5$) may be acceptable, if the recipient does not care about extra work imposed on the sender.

3 Previous results

Table 1 summarizes Sahami et al.'s results. If $n_{L \to S}$ and $n_{S \to L}$ are the numbers of $L \to S$ and $S \to L$ errors, and $n_{L \to L}$, $n_{S \to S}$ are the numbers of correctly treated legitimate and spam messages, then spam recall ($SR$) and spam precision ($SP$) are:

$$SR = \frac{n_{S \to S}}{n_{S \to S} + n_{S \to L}} \qquad SP = \frac{n_{S \to S}}{n_{S \to S} + n_{L \to S}}$$

In the second experiment of table 1, candidate attributes included not only word-attributes, but also attributes showing if particular hand-picked phrases (e.g. "be over 21") were present. In the third and fourth experiments, non-textual candidate attributes were added, showing if messages had manually chosen properties (e.g. attachments). Sahami et al.'s phrasal and non-textual attributes introduce a manual configuration stage, as one has to select manually phrases and non-textual characteristics to be treated as candidate attributes. Since our target was to explore fully automatic anti-spam filtering, we have limited ourselves to word-attributes.
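The attribute selection and classification steps of section 2 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' GATE-based implementation: the function names, the representation of messages as sets of words, and the toy messages are all hypothetical, and add-one smoothing is used in place of the plain frequency ratios of the paper so that no estimated probability is exactly 0 or 1.

```python
import math
from collections import Counter


def threshold_from_lambda(lam):
    """Threshold t on P(spam | x) that is equivalent to the ratio test with cost lam."""
    return lam / (1.0 + lam)


def mutual_information(docs, labels, word):
    """MI between the binary attribute 'word is present' and the category."""
    n = len(docs)
    joint = Counter((word in d, c) for d, c in zip(docs, labels))
    p_x = Counter(word in d for d in docs)
    p_c = Counter(labels)
    # Unobserved (x, c) pairs contribute 0 (the usual 0 * log 0 = 0 convention).
    return sum(
        (k / n) * math.log((k / n) / ((p_x[x] / n) * (p_c[c] / n)))
        for (x, c), k in joint.items()
    )


def train(docs, labels, n_attributes):
    """Keep the n_attributes words with the highest MI; estimate P(C) and P(X_i | C)."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary, key=lambda w: -mutual_information(docs, labels, w))
    attributes = ranked[:n_attributes]
    prior, cond = {}, {}
    for c in set(labels):
        in_c = [d for d, l in zip(docs, labels) if l == c]
        prior[c] = len(in_c) / len(docs)
        # ASSUMPTION: add-one smoothing, not the plain frequency ratios of the paper.
        cond[c] = {w: (sum(w in d for d in in_c) + 1) / (len(in_c) + 2)
                   for w in attributes}
    return attributes, prior, cond


def p_spam(doc, model):
    """P(C = spam | x) under the conditional-independence assumption."""
    attributes, prior, cond = model
    log_joint = {}
    for c in prior:
        log_joint[c] = math.log(prior[c]) + sum(
            math.log(cond[c][w] if w in doc else 1.0 - cond[c][w])
            for w in attributes
        )
    top = max(log_joint.values())  # subtract the maximum for numerical stability
    total = sum(math.exp(v - top) for v in log_joint.values())
    return math.exp(log_joint["spam"] - top) / total


def is_spam(doc, model, lam):
    """Block the message only if P(spam | x) exceeds t = lam / (1 + lam)."""
    return p_spam(doc, model) > threshold_from_lambda(lam)


# Toy, made-up messages represented as sets of words (hypothetical data).
docs = [{"free", "money", "now"}, {"free", "adult", "offer"},
        {"syntax", "workshop", "paper"}, {"paper", "deadline", "linguistics"}]
labels = ["spam", "spam", "legitimate", "legitimate"]
model = train(docs, labels, n_attributes=3)
message = {"free", "money"}
print(is_spam(message, model, lam=1), is_spam(message, model, lam=999))
```

On this toy data the same message is blocked at λ = 1 but passes at λ = 999, which illustrates how a larger λ trades spam recall for protection of legitimate messages.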

Table 1: Results of Sahami et al. (500 attributes, threshold 0.999, λ = 999)

Attributes                      Total     Testing   % Spam   Spam       Spam
                                Messages  Messages           Precision  Recall
words only                      1789      251       88.2%    97.1%      94.3%
words + phrases                 1789      251       88.2%    97.6%      94.3%
words + phrases + non-textual   1789      251       88.2%    100.0%     98.3%
words + phrases + non-textual   2815      222       ~20%     92.3%      80.0%

4 Experiments with Ling-Spam

Our experiments were all performed on the Ling-Spam corpus, which consists of:
- 2412 Linguist messages, obtained by randomly downloading digests from the archives, separating their messages, and removing text added by the list's server.
- 481 spam messages, received by the first author. Attachments, HTML tags, and duplicate spam messages received on the same day were not included.

Spam is 16.6% of the corpus, a figure close to the spam rates of the authors, Sahami et al.'s fourth experiment, and rates reported elsewhere (Cranor & LaMacchia, 1998).

Our implementation of the Naive Bayesian filter (developed on GATE) includes a lemmatizer that converts each word to its base form, and a stop-list that removes from messages the 100 most frequent words of the British National Corpus (BNC).⁴ The two modules can be enabled or disabled, allowing their effect to be measured. To reduce random variation, ten-fold cross-validation was employed, and averaged scores are reported.

In a first series of experiments, the number of retained attributes (highest MI) ranged from 50 to 700 by 50, for all combinations of enabled/disabled lemmatizer and stop-list. Three thresholds were tried: t = 0.999 (λ = 999), t = 0.9 (λ = 9), and t = 0.5 (λ = 1). As discussed in section 2, these represent three scenarios: deleting blocked messages; issuing a re-send request and accounting for the sender's extra work; and issuing a re-send request ignoring the sender's extra work.

Figures 1-3 show that the filter achieved impressive spam recall and precision at all three thresholds, verifying in that sense the findings of Sahami et al. In all cases, lemmatization seems to improve results. The stop-list has a positive effect for λ = 1 and λ = 9, but its effect looks negligible for λ = 999. Without a single evaluation measure, however, to be used instead of spam precision and recall, it is difficult to check if the effects of the lemmatizer and the stop-list are statistically significant.

For λ = 999, blocking a legitimate message is much more severe than letting a spam message pass the filter. Hence, it seems reasonable to assume that the best configuration is the one that maximizes spam precision. This is achieved with 300 attributes and the lemmatizer enabled (100% spam precision, 63% spam recall; here, the effect of the stop-list is negligible). For λ = 1 and λ = 9, however, it is hard to tell which configuration (combination of precision and recall) is best. Again, a single measure is needed; and it must be sensitive to our cost. We discuss this next.⁵

⁴ GATE, including the lemmatizer, is available from http://www.dcs.shef.ac.uk/research/groups/nlp. BNC frequency lists are available from ftp://ftp.itri.bton.ac.uk/pub/bnc.
⁵ The F-measure, used in information retrieval and extraction to combine recall and precision, is unsuitable to our purposes, because its weighting factor cannot be easily related to our notion of cost.

Figure 1: Spam precision and recall at t = 0.5 (λ = 1)
Figure 2: Spam precision and recall at t = 0.9 (λ = 9)
Figure 3: Spam precision and recall at t = 0.999 (λ = 999)
(Plots of spam precision against spam recall; legend entries include: no lemmatizer, no stop-list; no lemmatizer, top-100 stop-list; with lemmatizer, no stop-list.)

5 Cost-sensitive evaluation measures

In classification tasks, two commonly used evaluation measures are accuracy ($Acc$) and error rate ($Err = 1 - Acc$). In our case:

$$Acc = \frac{n_{L \to L} + n_{S \to S}}{N_L + N_S} \qquad Err = \frac{n_{L \to S} + n_{S \to L}}{N_L + N_S}$$

where $N_L$ and $N_S$ are the numbers of legitimate and spam messages to be classified. Accuracy and error rate assign equal weights to the two error types ($L \to S$ and $S \to L$). When selecting the threshold of the classifier (section 2), however, we assumed that $L \to S$ is $\lambda$ times more costly than $S \to L$. To make accuracy and error rate sensitive to this cost, we treat each legitimate message as if it were $\lambda$ messages: when a legitimate message is misclassified, this counts as $\lambda$ errors; and when it is classified correctly, this counts as $\lambda$ successes. This leads to weighted accuracy ($WAcc$) and weighted error rate ($WErr = 1 - WAcc$):

$$WAcc = \frac{\lambda \cdot n_{L \to L} + n_{S \to S}}{\lambda \cdot N_L + N_S} \qquad WErr = \frac{\lambda \cdot n_{L \to S} + n_{S \to L}}{\lambda \cdot N_L + N_S}$$

Table 2: Results on Ling-Spam for best no. of attributes (2893 total messages, 16.6% spam, 10-fold cross validation, attributes ranging from 50 to 700 by a step of 50)

Filter Configuration         λ     No. of  Spam     Spam       Weighted   Baseline   TCR
                                   attr.   Recall   Precision  Accuracy   W. Acc.
(a) bare                     1     50      81.10%   96.85%     96.408%    83.374%    4.63
(b) stop-list                1     50      82.35%   97.13%     96.649%    83.374%    4.96
(c) lemmatizer               1     100     82.35%   99.02%     96.926%    83.374%    5.41
(d) lemmatizer + stop-list   1     100     82.78%   99.49%     97.064%    83.374%    5.66
(a) bare                     9     200     76.94%   99.46%     99.419%    97.832%    3.73
(b) stop-list                9     200     76.11%   99.47%     99.401%    97.832%    3.62
(c) lemmatizer               9     100     77.57%   99.45%     99.432%    97.832%    3.82
(d) lemmatizer + stop-list   9     100     78.41%   99.47%     99.450%    97.832%    3.94
(a) bare                     999   200     73.82%   99.43%     99.912%    99.980%    0.23
(b) stop-list                999   200     73.40%   99.43%     99.912%    99.980%    0.23
(c) lemmatizer               999   300     63.67%   100.00%    99.993%    99.980%    2.86
(d) lemmatizer + stop-list   999   300     63.05%   100.00%    99.993%    99.980%    2.86

When using accuracy or error rate (weighted or not), it is important to compare to a simplistic "baseline" approach, to avoid misinterpreting the often high accuracy and low error rate scores. As baseline, we use the case where no filter is present: legitimate messages are (correctly) never blocked, and spam messages (mistakenly) always pass the filter. The weighted accuracy and error rate of the baseline are:

$$WAcc^b = \frac{\lambda \cdot N_L}{\lambda \cdot N_L + N_S} \qquad WErr^b = \frac{N_S}{\lambda \cdot N_L + N_S}$$

To compare easily with the baseline, we introduce the total cost ratio ($TCR$):

$$TCR = \frac{WErr^b}{WErr} = \frac{N_S}{\lambda \cdot n_{L \to S} + n_{S \to L}}$$

Greater $TCR$ indicates better performance. For $TCR < 1$, not using the filter is better. If cost is proportional to wasted time, $TCR$ measures how much time is wasted to delete manually all spam messages when no filter is present ($N_S$), compared to the time wasted to delete manually any spam messages that passed the filter ($n_{S \to L}$) plus the time needed to recover from mistakenly blocked legitimate messages ($\lambda \cdot n_{L \to S}$).

Table 2 lists spam recall, spam precision, weighted accuracy, baseline weighted accuracy, and TCR, for various configurations of the filter, and for the number of attributes that led to the highest TCR with each configuration.
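The measures of sections 3 and 5 are simple functions of the four confusion counts, and a short Python sketch makes the role of λ concrete. The function names are hypothetical and chosen for illustration; the corpus sizes (2412 legitimate, 481 spam) come from section 4, while the error counts in the final example are invented and are not results from the paper.

```python
import math


def spam_recall(n_ss, n_sl):
    return n_ss / (n_ss + n_sl)


def spam_precision(n_ss, n_ls):
    return n_ss / (n_ss + n_ls)


def weighted_accuracy(n_ll, n_ls, n_ss, n_sl, lam):
    """WAcc: every legitimate message counts as lam messages."""
    return (lam * n_ll + n_ss) / (lam * (n_ll + n_ls) + n_ss + n_sl)


def baseline_weighted_accuracy(n_legit, n_spam, lam):
    """WAcc of the 'no filter' baseline: no L->S errors, every spam passes."""
    return lam * n_legit / (lam * n_legit + n_spam)


def total_cost_ratio(n_ls, n_sl, n_spam, lam):
    """TCR = N_S / (lam * n_LS + n_SL); infinite if the filter makes no error."""
    cost = lam * n_ls + n_sl
    return math.inf if cost == 0 else n_spam / cost


# Corpus sizes from section 4.
N_L, N_S = 2412, 481
for lam in (1, 9, 999):
    # Baseline weighted accuracy in percent, as in the "Baseline W. Acc." column.
    print(lam, round(100 * baseline_weighted_accuracy(N_L, N_S, lam), 3))

# HYPOTHETICAL error counts, for illustration only.
n_ls, n_sl = 2, 100
for lam in (1, 9, 999):
    print(lam, round(total_cost_ratio(n_ls, n_sl, N_S, lam), 2))
```

With the same hypothetical errors, TCR falls from well above 1 at λ = 1 to well below 1 at λ = 999, because each blocked legitimate message then weighs as much as 999 missed spam messages.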
Figures 4-6 show TCR for different numbers of attributes, and λ = 1, 9, 999. In all cases, ten-fold cross validation was used, and average $WAcc$ is reported. TCR is computed as $WErr^b$ divided by the average $WErr$. Increasing the number of attributes beyond a certain point generally degrades performance, because attributes with low MI do not discriminate well between the two categories.

At all three λ values, the highest TCR scores were obtained with the lemmatizer enabled. The stop-list had an additional positive effect for λ = 1 and λ = 9, but not for λ = 999. The differences, however, are not always statistically significant. For λ = 1, paired single-tailed t-tests on $WAcc$ between all filter configurations of table 2 confirm only that configurations (b) and (d) are better than (a) at p < 0.05. All four configurations, however, are significantly better than the baseline at p < 0.01. For λ = 9, none of the hypotheses of table 2, e.g. that configuration (d) is better than (a), are statistically significant at p < 0.05, but all configurations are, again, significantly better than the baseline at p < 0.01. For λ = 999, the filter achieves TCR > 1 only with the lemmatizer enabled. The stop-list has essentially no effect, and both configurations (c) and (d) are significantly better than the baseline at p < 0.01.

Figure 4: TCR at t = 0.5 (λ = 1)
Figure 5: TCR at t = 0.9 (λ = 9)
Figure 6: TCR at t = 0.999 (λ = 999)
Figure 7: TCR for variable training corpus size, with lemmatizer and stop-list
(Figures 4-6 plot TCR against the number of retained attributes, from 50 to 700; legend entries include: no lemmatizer, no stop-list; no lemmatizer, top-100 stop-list; with lemmatizer, no stop-list. Figure 7 plots TCR against the size of the training corpus, from 10% to 100%, where 100% is 2603 messages, for λ = 1 with 100 attributes, λ = 9 with 100 attributes, and λ = 999 with 300 attributes.)

Overall, for λ = 1 and λ = 9 the filter demonstrates a stable behavior, with TCR constantly greater than 1. For λ = 999, however, the filter achieves TCR > 1 only for one particular number of attributes (300), because the $L \to S$ error is penalized so heavily that a single blocked legitimate message is enough for $WAcc^b$ to exceed $WAcc$ (the filter makes no such error at 300 attributes). In a real application, it is unlikely that one would be able to pin-point precisely the optimal number of attributes, which casts doubts over the applicability of the filter for λ = 999.

Even more worrying, for λ = 999, are the results of a second series of experiments we performed, this time varying the size of the training corpus.
At every ten-fold repetition, Ling-Spam was divided into ten parts, with one part reserved for testing. From each one of the remaining nine parts, only x% was used for training, with x ranging from 10 to 100 by 10. Figure 7 shows the resulting TCR scores for λ = 1, 9, 999. All experiments were conducted with the lemmatizer and stop-list enabled, and with the best numbers of attributes, as in table 2.
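The splitting procedure described above can be sketched as follows. This is a minimal illustration under assumptions the text does not confirm: messages are identified by index, they are shuffled once with a fixed seed, and the x% retained from each training part is simply its first x%. The function name `subsampled_folds` is hypothetical.

```python
import random


def subsampled_folds(n_messages, percent, n_parts=10, seed=0):
    """Yield (training indices, test indices) for each of the n_parts repetitions.

    One part is held out for testing; from each remaining part only
    `percent` percent of the messages are kept for training.
    """
    indices = list(range(n_messages))
    random.Random(seed).shuffle(indices)  # ASSUMPTION: one random shuffle
    parts = [indices[i::n_parts] for i in range(n_parts)]
    for held_out, test in enumerate(parts):
        training = []
        for j, part in enumerate(parts):
            if j != held_out:
                keep = len(part) * percent // 100  # ASSUMPTION: first x% of the part
                training.extend(part[:keep])
        yield training, test


# Training-set sizes for the 2893 Ling-Spam messages at x = 10, 50, 100.
for percent in (10, 50, 100):
    sizes = [len(training) for training, _ in subsampled_folds(2893, percent)]
    print(percent, min(sizes), max(sizes))
```

A classifier would be trained on each `training` list and scored on the matching `test` list, and the per-repetition scores averaged as in the other experiments.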

Unlike λ = 1 and λ = 9, for λ = 999 the filter reached TCR > 1 only with 100% of the training corpus, and one cannot easily assume that TCR would remain > 1 given more training. (We attribute the initial peak of TCR to the fact that with very little training the classifier tends to classify all messages to the most frequent category, legitimate, which protects it from making a costly $L \to S$ error.) These findings suggest that when λ = 999, the filter is not safe enough to use.

6 Conclusions

Our cost-sensitive evaluation suggests that, despite its high spam recall and precision, the Naive Bayesian filter is not viable when blocked messages are deleted (a situation we modelled with λ = 999). With additional safety nets, however, like re-sending to private addresses, the cost of blocking a legitimate message is lower (we used λ = 1 and λ = 9), and the filter has a stable significant positive contribution.

We plan to implement anti-spam filters based on alternative machine learning algorithms, and compare them to the Naive Bayesian filter. We expect automatic anti-spam filtering to become an important member of an emerging family of junk-filtering tools for the Internet, which will include tools to remove advertisements (Kushmerick, 1999), and block hostile or pornographic material (Forsyth, 1996; Spertus, 1997).

References

1. Apte, C., and Damerau, F. Automated Learning of Decision Rules for Text Categorization. ACM Transactions on Information Systems, 12(3):233-251, 1994.
2. Cohen, W.W. Learning Rules that Classify E-Mail. Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, Stanford, California, 1996.
3. Cranor, L.F., and LaMacchia, B.A. Spam! Communications of ACM, 41(8):74-83, 1998.
4. Dagan, I., Karov, Y., and Roth, D. Mistake-Driven Learning in Text Categorization. Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing, pp. 55-63, Providence, Rhode Island, 1997.
5. Domingos, P., and Pazzani, M. Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. Proceedings of the 13th Int. Conference on Machine Learning, pp. 105-112, Bari, Italy, 1996.
6. Duda, R.O., and Hart, P.E. Bayes Decision Theory. Chapter 2 in Pattern Classification and Scene Analysis, pp. 10-43. John Wiley, 1973.
7. Hall, R.J. How to Avoid Unwanted Email. Communications of ACM, 41(3):88-95, 1998.
8. Kushmerick, N. Learning to Remove Internet Advertisements. Proceedings of the 3rd International Conference on Autonomous Agents, pp. 175-181, Seattle, Washington, 1999.
9. Lang, K. Newsweeder: Learning to Filter Netnews. Proceedings of the 12th Int. Conference on Machine Learning, pp. 331-339, Stanford, California, 1995.
10. Langley, P., Wayne, I., and Thompson, K. An Analysis of Bayesian Classifiers. Proceedings of the 10th National Conference on AI, pp. 223-228, San Jose, California, 1992.
11. Lewis, D. Feature Selection and Feature Extraction for Text Categorization. Proceedings of the DARPA Workshop on Speech and Natural Language, pp. 212-217, Harriman, New York, 1992.
12. Lewis, D. Training Algorithms for Linear Text Classifiers. Proceedings of the 19th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 298-306, Konstanz, Germany, 1996.
13. Lewis, D. and Knowles, K.A. Threading Electronic Mail: A Preliminary Study. Information Processing and Management, 33(2):209-217, 1997.
14. Mitchell, T.M. Machine Learning. McGraw-Hill, 1997.
15. Payne, T.R. and Edwards, P. Interface Agents that Learn: An Investigation of Learning Issues in a Mail Agent Interface. Applied Artificial Intelligence, 11(1):1-32, 1997.
16. Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E. A Bayesian Approach to Filtering Junk E-Mail. In Learning for Text Categorization - Papers from the AAAI Workshop, pp. 55-62, Madison Wisconsin. AAAI Technical Report WS-98-05, 1998.
17. Sebastiani, F. Machine Learning in Automated Text Categorisation. Technical Report B4-31, Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, 1999. http://faure.iei.pi.cnr.it/~fabrizio.
18. Spertus, E. Smokey: Automatic Recognition of Hostile Messages. Proceedings of the 14th National Conference on AI and the 9th Conference on Innovative Applications of AI, pp. 1058-1065, Providence, Rhode Island, 1997.