Filtering Junk


Filtering Junk E-Mail: A Performance Comparison between Genetic Programming & Naïve Bayes

September

Prepared by:
Hooman Katirai
4A Year Computer Engineering Student
Department of Electrical & Computer Engineering
University of Waterloo
Waterloo, Ontario

Presented to:
Dr. Dale Schuurmans
Professor of Computer Science
Logic Programming & Artificial Intelligence Group
University of Waterloo
Waterloo, Ontario

Table of Contents

Summary
Motivation
    Spam is a major problem
    The need for a filter at the e-mail client
Problem To Be Solved
Related Work
Approach
Background
    The Naïve Bayes Approach to Text Classification
        The Generative Naïve Bayes Model
        Training a Naïve Bayes Classifier
        Using a Naïve Bayes Classifier
    The Genetic Programming Approach
        The Fitness Function
        Parse Trees: The Representation of Solutions
        Crossover
        Mutation
        GP Summary
Implementation Issues
    The Training Set And The Test Set
    Feature Selection
    Poor Performance due to lack of precision
    Learning Issues
Experimental Results
Conclusions
Recommendations
References
Appendix 1: Table of Expenses

Table of Figures

Table 1: Contingency Table or Confusion Matrix
Table 2: Numerical Operators
Table 3: Word Operators
Table 4: Recall and Precision for Naïve Bayes and Genetic Programming
Table 5: E-mail Signatures Can Be Harvested as Useful Features
Figure 1: The Reduction of a Simple Tree
Figure 2: Representing the Order of Operations in Parse Trees
Figure 3: A Simple Tree Containing a Feature Detector
Figure 4: The Crossover Operation
Figure 5: The Mutation Operation

1. Summary

This paper describes the application of genetic programming as a novel approach to the problem of filtering junk e-mail. We benchmark our results against the common standard: the Naïve Bayes classifier. While the genetically programmed classifier demonstrated a precision comparable to that of Naïve Bayes, it was slightly outperformed in recall. Since both learning methods gave similar results, it is recommended that a larger study be undertaken to ascertain whether these differences are indeed statistically significant. Further, it is recommended that the performance of these classifiers be tested in a richer feature space more typical of real-world classifiers.

Although the genetically programmed classifier greatly outperformed the Naïve Bayes classifier in speed, it is concluded that a more efficient implementation of Naïve Bayes needs to be used in order to provide a fair comparison.

We show that, when left unabated, e-mail signatures (also known as taglines) reduce the value of several important features in junk e-mail detection; however, it is also shown that these e-mail signatures may be harvested as advantageous features if some of their components are removed and noted as a feature. We therefore recommend that a better parser capable of meeting these criteria be implemented.

To aid the reader in the theoretical aspects of our work, we have included introductory background for both approaches, including a full derivation of the generative Naïve Bayes model.

2. Motivation

2.1 Spam is a major problem

Unsolicited Bulk E-mail (UBE), commonly known as junk e-mail or spam, is to most users merely a nuisance: an annoying but sometimes unavoidable reality of life in an increasingly internet-centric world. Ask a parent concerned about pornographic materials landing in their children's mailbox and you'll get a very different response. Ask the person who must deal with such distasteful messages on a daily basis and you'll find similar concerns. In short, spam is an unauthorized intrusion into a virtual space, the e-mail box, rented by citizens for their own purposes.

Spam is often distasteful (an AT&T study found that 11 percent of spam contained adult content [3]), but parents and users aren't the only ones concerned. UUNet, one of the larger ISPs, has a team of 6 people with an annual budget of USD $1 million to combat spam [1]. Another ISP, Netcom, estimates that 10% of a customer's bill is devoted to fighting spam [2].

Why are these ISPs spending so much money to fight spam? Because, in addition to offending users, spam accounts for over 30% of the e-mail on major ISPs such as AOL and Mindspring [1]. One third of ISPs have reported system outages caused by spam [4]. Such high volumes of junk e-mail clog network bandwidth and consume large amounts of storage space on file servers, where many users may receive duplicates of the same message [10]. AOL in particular has been hard hit because of its publicly available e-mail directory, which allows spammers to quickly gather a large number of e-mail addresses. Even large ISPs such as Pacific Bell have experienced complete shutdowns in service due to spam [6]. A survey conducted by the respected research firm International Data Corporation (IDC) found that spam was ranked by ISPs as their number two problem [8].

There are other social costs as well. For example, since many UBE mailers obtain their e-mail addresses from Usenet newsgroups, many people have become reluctant to post messages on public forums, reducing the vibrancy of the internet news community.
Briefly, spam is a major problem that needs to be brought under effective control; however, it is difficult for governments to legislate on this issue since the internet is a global phenomenon. Thus users must take the steps required to protect themselves, and this often involves the application of intelligent agents, or softbots, to make decisions on a human's behalf. It is with this purpose, to advance knowledge in the means of constructing such softbots, that this study was undertaken.

2.2 The need for a filter at the e-mail client

There are many techniques that can be used to filter junk e-mail before it is deposited into a user's e-mail account. For example, messages containing well-known spam sites in their relay paths are very likely to be spam themselves. Such messages should be automatically filtered at the mail server level and never be delivered to the user's mailbox. The focus of this paper is the design of a filter to remove UBE when these other methods have failed. Indeed, methods that filter e-mail before it reaches a user are generally limited in their ability to remove most of the spam received (see footnote 1).

Since we have assumed that all spam filters with 100% precision have already been employed at the router and mail-server levels, it is reasonable to assume that the filter at the e-mail client level will make mistakes from time to time. This gives rise to some important issues. Firstly, users are hesitant, if not unwilling, to employ filters that can potentially remove legitimate e-mail, especially if the filter has been given the ability to delete a message before the user is given an opportunity to view it. Users like to feel in control. Therefore, instead of automatically deleting those messages classified as spam, an e-mail client should, in our opinion, relocate those messages to a special folder which the user can check occasionally to ensure that no legitimate e-mails were misclassified.

Some researchers have proposed that messages with high spam ratings should be filtered. For example, in [10] Sahami et al. propose that messages with a very high junk confidence rating should be automatically removed. To reduce the chances of error, Sahami et al. proposed a high cutoff of 99.9%, but even at this level they note that some mistakes were made. This strengthens our argument that messages should not be deleted but moved to a different folder where the user can rapidly delete them in tandem, although it may be far more likely for a user to delete a legitimate message if it is moved to such a folder. Relocating rather than deleting alleviates the user's fear of missing a legitimate and possibly important message.

Furthermore, we note that it is desirable to make such a filter an automated one. This is because junk e-mail not only takes time to delete, but it sometimes contains offensive content (such as pornography), which makes the cost of viewing it greater than the time needed to sort out the junk [10]. Many e-mail packages, including Microsoft Outlook, allow users to manually create rules by which they can detect and sort junk e-mail.
This alternative to the automated approach is clearly inadequate to the task of filtering junk e-mail, since it assumes the user is savvy enough to create such rules, and since such rules will be unable to adapt to changes in the nature of UBE over time [10]. Thus it is desirable for the filter to learn directly from a user's mail repository, since such a filter can automatically adapt to the characteristics of the user's junk and legitimate e-mail [10].

Footnote 1: An example of such a filter would involve rejecting e-mails from well-known spamming domains using so-called real-time spamming blacklists; however, such filters are limited in their ability to combat spam. This is because most UBE producers show little regard for honest or normal business practices [9]. To reduce the traceability of their bulk e-mail to its source, many UBE senders falsify the e-mail header, including the FROM field, making an e-mail filter that relies on this field of limited value. Spam producers are also known to abuse the relaying feature of the Simple Mail Transfer Protocol (SMTP), the protocol used to transport internet e-mail. This feature allows one mail server to relay messages to intermediary mail servers. Spammers abuse this feature by using it to send masses of e-mail through other people's servers without their knowledge. This, in effect, offloads almost the entire cost of the bulk mailing to the victim's mail server. Further, if the receiver attempts to trace the message back to its origin, it will lead to the victim's mail server and not to the spammer's domain.

3. Problem to be solved

We now provide a more formal definition of the problem to be solved. We wish to create an information filter to autonomously classify e-mails from an input stream consisting of incoming e-mails into two output streams, one representing the category belonging to Unsolicited Bulk E-mail (UBE) and the other representing legitimate messages. We define UBE as unsolicited messages advertising, soliciting or advocating a product, service, web site, viewpoint, get-rich-quick scheme or other fraudulent organization, that were sent in bulk to many users and without the prior consent of the receivers.

4. Related Work

Although there has been much work done in autonomous text categorization over the last few decades, only a small amount of work has gone into the classification of e-mail messages. Fewer still are the papers focussed on the autonomous identification of junk e-mail. We list the ones deemed most relevant here:

1. MAXIM (Lashkari, Metral & Maes 1993) is an e-mail based assistant that uses Memory-Based Reasoning (MBR) to predict whether a user would file, delete or read an e-mail, although junk e-mail is never specifically addressed.
2. Magi (Payne 1994), a mail interface agent that uses decision trees to model a user profile. Magi attempts to automatically route new messages to relevant folders.
3. RIPPER algorithm (Cohen 1996). Cohen suggests new methods for automatically learning rules for classifying e-mail into categories; however, he never specifically addresses the category of junk e-mail in his paper.
4. Genetic Document Classifier (Clack C., Farrington J., Lidwell P. & Yu T. 1997). This was the first published text classifier to use genetic programming. It routed in-bound documents (including e-mails) to a central classifier which autonomously routed documents to interested research groups within a large organization.
5. Smokey (Spertus 1997), an e-mail assistant that detected flames, an internet slang term for hostile or angry messages, usually in retaliation for some act or event such as an unwelcome newsgroup posting.
6. Microsoft Outlook 98. A keyword-based junk e-mail filter was introduced in the Outlook 98 beta but was later withdrawn following legal concerns.
7. Bayesian Junk E-mail Filter (Sahami M., Dumais S., Heckerman D. & Horvitz E. 1998). A junk e-mail filter based on an enhanced naïve Bayes classifier. Recall and precision were improved when phrases and header-specific information were added as features.

Our work differs from the others in three important aspects:

1. Our work is specifically focussed on the filtering of junk e-mail.
2. We use a novel approach (Genetic Programming) based on the work of Clack et al.
3. We present an empirical comparison between the genetic programming and naïve Bayes approaches.

5. Approach

We solve this problem using the novel approach of Genetic Programming. Genetic Programming has already been shown to be effective in classifying web pages [27] and in general document classification [23]. To provide a basis for comparison, we have also solved this problem using a traditional Naïve Bayes classifier, the most common classifier used in practice today. After delving into the theory behind the Bayesian and Genetic approaches, we examine the specifics of each implementation, which we follow with an analysis of the outcomes and some recommendations.

6. Background

6.1 General Classification Theory

One may conceptually model a document classifier as a deterministic function which maps documents, represented as sequences of word events, to categories. In our case we are concerned with a binary classification task; i.e. we wish to classify e-mails into one of two categories: the discriminating class, i.e. junk, and the default class, non-junk. In filter terminology, if a classifier classifies a document into the discriminating class, the document is said to have been accepted. Conversely, if the document is classified as non-junk, we say the document has been rejected.

A classifier's decision to accept or reject a document is based on the features of the document. In general, text classifiers only use a small subset of the words found in a domain as features. This list of words is called the classifier's vocabulary, and it is sometimes also referred to as a distinguished words list. In both the Naïve Bayes and genetic programming based classifiers, the features are the individual frequencies of the words in the vocabulary.

An ideal filter will accept all documents belonging to the discriminating class and reject all others. In practice, however, a filter designer is generally faced with the tradeoffs of recall and precision, which are defined as follows.

                      Classifier Accepted    Classifier Rejected
    Expert says yes            a                      c
    Expert says no             b                      d

Table 1: Contingency Table or Confusion Matrix. Each entry in the table represents the number of documents with the specified outcome; i.e. a is the number of times the classifier accepted a document that belonged to the discriminating class.

$$\text{Recall} = \frac{a}{a + c} \qquad \text{Precision} = \frac{a}{a + b} \tag{1}$$

Recall is the percentage of documents in the discriminating class that were accepted. Precision is the percentage of accepted documents belonging to the discriminating class.

Many classifiers also provide a confidence rating which asserts the degree to which the document belongs to the discriminating class. Often these confidence ratings are expressed as a percentage, where 100% represents a document that completely belongs to the discriminating class and 0% represents a document that is completely irrelevant. Often information filters are designed to accept documents with a confidence above a certain threshold while rejecting documents below that threshold.

6.2 The Naïve Bayes Approach to text classification

Theory

We use a generative probabilistic model to explain the Naïve Bayes classifier. For those unfamiliar with generative probability models, we recommend the materials of Brendan Frey.

Definition of Terms

To simplify future explanation we now define some terms. A vocabulary V is an ordered collection of words, i.e. $V = \{v_1, v_2, v_3, \ldots, v_{|V|}\}$. Similar to a human's vocabulary, which represents the words which a human understands, the classifier's vocabulary represents the only words the classifier will use to determine a document's category; i.e. all words in a document which are not in the classifier's vocabulary are ignored. We represent a document D as an ordered collection of word events, written $D = \{w_{d_1}, w_{d_2}, \ldots, w_{d_{|D|}}\}$, where each $w_{d_i}$ represents a word from the vocabulary. We write $w_{d_i}$ to denote the i-th word in document D. A classifier is a machine (in the mathematical sense) that deterministically returns a class $c \in C = \{c_1, c_2, c_3, \ldots, c_{|C|}\}$ given a particular document D and a collection of parameters $\theta$.

The Generative Naïve Bayes Model (adapted from [11])

To generate a document D, we first pick the length $|D|$ for the document and then generate a document based on this length. Note this means that we are assuming the document length to be independent of the category and that each word is generated independent of the length.

$$P(D \mid c_j; \theta) = P(|D|) \prod_{i=1}^{|D|} P(w_{d_i} \mid c_j; \theta; w_{d_1}, \ldots, w_{d_{i-1}}) \tag{1}$$

Notice that we require the generation of each word to depend on the words that preceded it. Although this is true in practice, we now relax this requirement by adopting the standard Naïve Bayes assumption, i.e. that

$$P(w_{d_i} \mid c_j; \theta; w_{d_1}, \ldots, w_{d_{i-1}}) = P(w_{d_i} \mid c_j; \theta) \tag{2}$$

The reader may object to the above assumption, noting that it requires each word to be generated independent of its context, and that it further requires each word to be generated independent of its position, an assumption certainly violated in practice. Computational linguists have found, however, that this model produces good results in classifying text documents [11] [12] [13] [14] and in detecting junk e-mail [10]. Furthermore, this assumption greatly reduces the number of parameters required in the generative model. Continuing, we substitute (2) into (1) to give:

$$P(D \mid c_j; \theta) = P(|D|) \prod_{i=1}^{|D|} P(w_{d_i} \mid c_j; \theta) \tag{3}$$

Training a Naïve Bayes Classifier

To train a naïve Bayes classifier we must estimate the parameters of our model. These parameters are the class-conditional word probabilities and the prior class probabilities, written

$$\theta_{w \mid c} = \{\, P(w_i \mid c_j; \theta) : w_i \in V,\ c_j \in C \,\} \tag{4}$$

$$\theta_c = \{\, P(c_j \mid \theta) : c_j \in C \,\} \tag{5}$$

where $w_i$ denotes the i-th word in the vocabulary V, $c_j$ denotes the j-th class in C, and

$$\sum_{i=1}^{|V|} P(w_i \mid c_j; \theta) = 1, \qquad \sum_{j=1}^{|C|} P(c_j \mid \theta) = 1. \tag{6}$$

These parameters are respectively estimated by:

$$P(w_i \mid c_j; \hat\theta) = \frac{\sum_{D_k \in c_j} N(w_i, D_k)}{\sum_{t=1}^{|V|} \sum_{D_k \in c_j} N(w_t, D_k)} \tag{7}$$

and

$$P(c_j \mid \hat\theta) = \frac{\text{number of training documents in category } c_j}{|\mathcal{D}|} \tag{8}$$

where $N(x, y)$ is the number of occurrences of word $x \in V$ in document y and $|\mathcal{D}|$ is the total number of training documents. The document length is not needed as a parameter for classification, since we have assumed its distribution to be the same for all classes.

To eliminate zero probabilities for infrequently occurring words we apply Laplacian smoothing, which keeps the sum of all word probabilities within a class at 1 while eliminating zero probabilities. Smoothing is necessary to prevent the product term in (3) from going to zero every time a given document contains a vocabulary word that did not occur in the training data of the given class c. We cannot exclude such terms from the product, because doing so assumes a probability of 1, i.e. that the word is omnipresent, an assumption that completely defies our training data. We therefore employ smoothing as an intuitively reasonable means of dealing with this problem.
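The estimates above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the study: the function name `train_naive_bayes`, the token-list input format and the `alpha` parameter are assumptions introduced here. With `alpha=0` the word probabilities are the raw relative frequencies of (7); with `alpha=1` they use the add-one form usually meant by Laplacian smoothing. The priors are the plain relative frequencies of (8).

```python
from collections import Counter

def train_naive_bayes(docs, labels, vocabulary, alpha=1.0):
    """Estimate class priors and per-class word probabilities.

    docs       -- list of token lists (one list per training document)
    labels     -- class name of each document, parallel to docs
    vocabulary -- the only words the classifier looks at
    alpha      -- hypothetical smoothing knob: 0 gives eq. (7), 1 gives add-one smoothing
    """
    vocab = set(vocabulary)
    classes = sorted(set(labels))
    doc_counts = Counter(labels)
    word_counts = {c: Counter() for c in classes}
    for tokens, c in zip(docs, labels):
        # Words outside the vocabulary are ignored, as in the text.
        word_counts[c].update(t for t in tokens if t in vocab)

    # Eq. (8): fraction of training documents in each category.
    priors = {c: doc_counts[c] / len(docs) for c in classes}

    word_probs = {}
    for c in classes:
        total = sum(word_counts[c].values())
        denom = alpha * len(vocab) + total
        if denom == 0:
            raise ValueError(f"class {c!r} has no vocabulary words and alpha is 0")
        word_probs[c] = {w: (alpha + word_counts[c][w]) / denom for w in vocab}
    return priors, word_probs
```

Calling the function with `alpha=0.0` on a small corpus shows the zero-probability problem directly: any vocabulary word unseen in a class gets probability 0, which would zero the whole product in (3).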

Applying smoothing to equations (7) and (8) respectively gives:

$$P(w_i \mid c_j; \hat\theta) = \frac{1 + \sum_{D_k \in c_j} N(w_i, D_k)}{|V| + \sum_{t=1}^{|V|} \sum_{D_k \in c_j} N(w_t, D_k)} \tag{9}$$

and

$$P(c_j \mid \hat\theta) = \frac{1 + \sum_{k=1}^{|\mathcal{D}|} P(c_j \mid D_k)}{|C| + |\mathcal{D}|} \tag{10}$$

where $N(x, y)$ is the number of occurrences of word $x \in V$ in document y; $|C|$ is the number of categories in C; $|\mathcal{D}|$ is the total number of training documents; and $P(c_j \mid D_k) \in \{0, 1\}$. That is, $P(c_j \mid D_k) = 1$ if document $D_k$ belongs to category $c_j$; otherwise $P(c_j \mid D_k) = 0$.

Using a Naïve Bayes Classifier

Using a Naïve Bayes classifier involves calculating $P(c_j \mid D; \hat\theta)$, where $c_j$ is an element of C and D is the document we wish to classify. To calculate this quantity we use Bayes' Rule for conditional probabilities, i.e.

$$P(c_j \mid D; \hat\theta) = \frac{P(c_j \mid \hat\theta)\, P(D \mid c_j; \hat\theta)}{P(D \mid \hat\theta)} \tag{11}$$

Substituting (9) and (10) into (11), and expanding $P(D \mid c_j; \hat\theta)$ with (3), gives the following, where the length term $P(|D|)$ appears in both numerator and denominator and cancels:

$$P(c_j \mid D; \hat\theta) = \frac{P(c_j \mid \hat\theta) \prod_{i=1}^{|D|} P(w_{d_i} \mid c_j; \hat\theta)}{\sum_{n=1}^{|C|} P(c_n \mid \hat\theta) \prod_{i=1}^{|D|} P(w_{d_i} \mid c_n; \hat\theta)} \tag{12}$$

To classify a document into a category, we simply assign the category for which $P(c_j \mid D; \hat\theta)$ is maximized, i.e.

$$\text{Class of } D = \arg\max_{c_j \in C} P(c_j \mid D; \hat\theta) \tag{13}$$

If a probability is not required, and if every document must be classified, we can incorporate certain simplifications to this model. First, we can ignore the denominator of the probability in equation (12), since it is independent of $c_j$. Second, instead of computing a large product in (12), we can compute the log of the product. The latter is advantageous because it changes a product of many small numbers in the numerator, often too small to be represented by hardware-supported floating point representations, into a sum of reasonably sized numbers. Once this sum has been calculated, we can convert it back to its original domain using the inverse log function. A similar tactic can be used to compute the product terms in the denominator. We cannot, however, incorporate the first simplification, since we require a probability or confidence score for each document. The second simplification, the computing of the sum of logarithms in place of a product, is however achievable.

6.3 The Genetic Programming Approach

Since its inception in 1992 by Koza, Genetic Programming (GP) has found many applications in the field of machine learning. Genetic Programming is a subset of a larger family of techniques known as evolutionary computation. In the evolutionary computation paradigm the programmer does not explicitly write the program which is said to be the outcome of the evolutionary programming process. Rather, the programmer creates an evolutionary environment wherein the computer, according to a process laid down by the programmer, evolves the programs which are said to be the outcome of the evolutionary process. Typically this process occurs with minimal or no supervision; however, other configurations which accept more user feedback are also possible.

Genetic programming is an evolutionary computation paradigm where the programs are represented as trees and the evolutionary mechanism is accomplished through two operators, namely crossover and mutation. The genetic programming process was created as an attempt to model, in a manner however crude, the evolutionary process known in biology as natural selection. This analogy to nature will now be explored as a means of familiarizing the reader with the genetic programming method. Once this big picture has been established, we will proceed to fill in the details.

The Big Picture: A Comparison Between Nature and Genetic Programming

In Nature, many different species compete to survive and reproduce. In GP, different programs representing possible solutions to a problem compete to survive and reproduce.

In Nature, the species best adapted to their environment have the best chance to reproduce.
This is often called the law of the survival of the fittest. In GP, the best solution to the problem we are trying to solve is the solution best adapted to the problem, and therefore the most fit. The problem can be anything from recognizing a face to fitting a curve or, in the case of this paper, the classification of a document.

In Nature, many different males will try to compete for one female (or vice versa). In GP, this analogy continues through a process called tournament selection, where several programs will compete with each other to mate with another program.

In Nature, the genes of the offspring will consist of a cross between those of its mother and father. In GP, this happens too, through the process known as crossover, although in GP a child may have more than two parents.

In Nature, genetic code can occur in a child completely independent of its parents. In GP, this is carried out through the evolutionary operator known as mutation, where immediately after being born a child may receive code independent of that of its parents.

In Nature, the parents gradually die off, to be eventually replaced by their children. In steady-state GP, which is the technique used in the classifier, the parents always give birth to fraternal twins and they immediately die after so doing.

In Nature, populations become increasingly adapted to their environment through the evolutionary process of natural selection. In GP, the population taken as a whole becomes better and better at solving the given problem through the processes of crossover and mutation. Eventually an individual solution will be found that meets some certain standard or some kind of criteria. This single solution is said to be the outcome of the genetic programming process.

Genetic programming is best used when no well-known solution to a problem exists. It can only be attempted if some function can be written to quantitatively determine how well any solution solves a given problem. In the few short years since its inception in 1992, it has demonstrated its ability to solve some difficult engineering problems, sometimes evolving better solutions than have been written by humans [14].

It is also important to distinguish the field of genetic programming from that of genetic algorithms. The difference mainly lies in the fact that genetic programming evolves actual programs represented as trees, whereas genetic algorithms evolve bit strings. The extra versatility afforded by this difference needs no expounding.

The Fitness Function

In Nature, some species are better suited to their environment than others. The species best adapted to their environment have the best chance to reproduce. Notice that this implies that there is some way of measuring how well any given program solved the problem. This measuring device is called the fitness function.

One type of fitness function gives a score that ranges between zero and infinity, with zero representing an optimal solution and larger numbers representing increasingly worse solutions. Thus the closer a candidate solution's fitness is to zero, the better it solves the problem, and the larger its fitness, the worse the solution. When the fitness is measured in this way it is called the standard fitness, which is how we have represented fitness values in the genetic classifier.

How the Fitness Function is Calculated

Calculating the fitness function for a particular program always involves finding the error between the program's answer and some ideal response.
In the case of our classifier, the ideal response is a human's judgement as to whether or not a given document is spam. We represent this human response in numeric form as a percentage denoting the confidence that the e-mail is an Unsolicited Bulk E-mail (UBE). Thus if the given document is a piece of junk e-mail the ideal response is 100%, and if an e-mail is a legitimate piece of e-mail, i.e. non-junk, the ideal response is 0%. For a particular document, the error is the difference between the program's answer and the ideal. To both exaggerate the error and force the error to be always positive, we square the error. Thus for a set of documents the fitness function can be calculated as the sum of squared errors over all documents; or, in the language of mathematics,

$$\text{Fitness} = \sum_{i=1}^{|\mathcal{D}|} (v_i - a_i)^2 \tag{14}$$

where $|\mathcal{D}|$ is the number of training documents, $v_i$ is the value returned by the classifier for the i-th document, and $a_i$ is equal to 100 for junk documents and 0 for non-junk documents.

This fitness function, however, encourages bias in the classifier towards the category which comprises most of the training documents, especially when the training documents of one class far outnumber those of the others. In such cases one will often find the most fit solutions during the early stages of training to be those that always choose the category with the larger number of examples. Unfortunately, since these solutions will be more fit than other members of the population, they will have a greater chance to mate. Often one will find most of the population to be polluted with the genes of these individuals before any better solutions emerge. This greatly impedes learning. Therefore a fitness function that balances the contributions between the two categories is more desirable. One such function with this property is the sum of the mean squared errors of each category, i.e.

$$\text{Fitness} = \frac{1}{|\mathcal{D}_{\text{Spam}}|} \sum_{i=1}^{|\mathcal{D}|} P(C = \text{Spam} \mid D_i)\,(v_i - a_i)^2 + \frac{1}{|\mathcal{D}_{\text{Non-Spam}}|} \sum_{i=1}^{|\mathcal{D}|} P(C = \text{Non-Spam} \mid D_i)\,(v_i - a_i)^2 \tag{15}$$

where:

$a_i \in \{0, 100\}$ is the correct answer for the i-th document;
$v_i$, with $0 < v_i < 100$, is the answer returned by the classifier for the i-th document;
$|\mathcal{D}_{\text{Spam}}|$ is the number of junk e-mail documents;
$|\mathcal{D}_{\text{Non-Spam}}|$ is the number of non-junk e-mail documents;
$P(C = \text{Spam} \mid D_i) = 1$ if document i is a junk e-mail document and 0 otherwise;
$P(C = \text{Non-Spam} \mid D_i) = 1$ if document i is a non-junk e-mail document and 0 otherwise.

This fitness function was used after the fitness function in (14) yielded disappointing results.

There are other fitness functions that can solve the given problem. One possibility is using van Rijsbergen's E-measure [25], since it combines the precision and recall in a single number, a desirable property since we wish to maximize both. The E-measure accepts a single parameter, $\beta$, which determines the relative weight put on recall and precision.
$$E = 1 - \frac{(\beta^2 + 1)\, P R}{\beta^2 P + R} \tag{16}$$

Here P is the precision and R is the recall. Although designed as an effectiveness measure for information retrieval, this measure can be used as a standard fitness value, since it takes on a value of 0 in the ideal case and increasingly larger values for increasingly worse solutions. This measure also satisfies our criteria if $\beta$ is chosen to emphasize precision, since it will not bias the learning towards the category with the larger number of training examples. The use of this fitness function did yield some interesting solutions, including one with 100% precision and 40% recall using a parameter of 0.4 (i.e. recall is only 2/5 as important as precision). The performance of equation (15), however, was better in obtaining solutions with both a high precision and recalls greater than 60%. Therefore only equation (15) was ultimately used.
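The three candidate fitness functions discussed above can be compared on a handful of classifier outputs. The sketch below is illustrative only: the function names, the acceptance threshold of 50, the conventions for empty denominators, and the name `beta` for the E-measure weight are assumptions made here, not details taken from the study.

```python
def sum_squared_error(outputs, targets):
    """Eq. (14): squared error summed over all documents."""
    return sum((v - a) ** 2 for v, a in zip(outputs, targets))

def balanced_fitness(outputs, targets, spam_target=100):
    """Eq. (15): mean squared error of each category, added together."""
    spam = [(v - a) ** 2 for v, a in zip(outputs, targets) if a == spam_target]
    legit = [(v - a) ** 2 for v, a in zip(outputs, targets) if a != spam_target]
    if not spam or not legit:
        raise ValueError("both categories need at least one document")
    return sum(spam) / len(spam) + sum(legit) / len(legit)

def precision_recall(outputs, targets, threshold=50, spam_target=100):
    """Precision and recall from the contingency counts a, b, c.

    A document is 'accepted' when its output exceeds the (assumed) threshold.
    """
    a = sum(1 for v, t in zip(outputs, targets) if v > threshold and t == spam_target)
    b = sum(1 for v, t in zip(outputs, targets) if v > threshold and t != spam_target)
    c = sum(1 for v, t in zip(outputs, targets) if v <= threshold and t == spam_target)
    precision = a / (a + b) if a + b else 0.0   # convention chosen for this sketch
    recall = a / (a + c) if a + c else 0.0      # convention chosen for this sketch
    return precision, recall

def e_measure(precision, recall, beta):
    """Eq. (16): 0 is ideal, larger values are worse."""
    if precision == 0 and recall == 0:
        return 1.0
    b2 = beta ** 2
    return 1 - (b2 + 1) * precision * recall / (b2 * precision + recall)
```

On outputs `[90, 80, 20, 60]` with targets `[100, 100, 0, 0]`, the single false positive (the 60) dominates both squared-error measures, while `precision_recall` reports it as a drop in precision only. All three measures share the standard-fitness convention that lower is better.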

Parse Trees: The Representation Scheme Of Solutions

In genetic programming, each program in a population is represented by a tree. These trees are similar to the parse trees used by compilers to evaluate expressions. The tree structure used is as follows. The terminals, or leaves, of the tree consist solely of numerical constants or words. The non-terminals are of two kinds: word operators and numerical operators. We list both in Tables 3 and 2 respectively.

Table 2: Numerical Operators

    Type          Symbols
    Arithmetic    +  -  /  *
    Relational    =<  >  >=  <
    Logical       AND  OR  NOT
    Non-Linear    Min  Max  ABS  Square Root

Table 3: Word Operators

    Name          Description
    Freq(x)       Returns the frequency of word x in the document
    Exists(x)     Returns 1 if word x exists and 0 otherwise

The use of trees to represent expressions will be illustrated by example. Suppose we desire to represent the expression 2*8. In a parse tree this would be a * node whose two children are the leaves 2 and 8, which reduces to the single number 16.

Fig. 1: The reduction of a simple tree

Every non-terminal reduces to a single number, and therefore a tree of such entities also reduces to a single number. We use another example to demonstrate order of operations. Figure 2 shows a parse tree representation of (5-3)*2*8, which is reduced step by step, innermost subtree first, to a single number.

Fig. 2: Representing order of operations in parse trees
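The reduction of a tree to a single number can be sketched as a short recursive evaluator. The nested-tuple representation, the operator names and the `word_counts` dictionary are all assumptions made for this illustration; only a few of the operators from Tables 2 and 3 are included, and division is left out.

```python
import operator

# Hypothetical operator table: a small subset of the numerical operators.
BINARY_OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "min": min,
    "max": max,
}

def evaluate(node, word_counts):
    """Reduce a parse tree to a single number.

    node        -- a number (terminal) or a tuple (operator, child, ...)
    word_counts -- word -> frequency in the document being classified
    """
    if isinstance(node, (int, float)):
        return node
    op, *args = node
    if op == "freq":                      # word operator: Freq(x)
        return word_counts.get(args[0], 0)
    if op == "exists":                    # word operator: Exists(x)
        return 1 if word_counts.get(args[0], 0) > 0 else 0
    values = [evaluate(child, word_counts) for child in args]
    if op == "abs":
        return abs(values[0])
    if op in BINARY_OPS:
        return BINARY_OPS[op](values[0], values[1])
    raise ValueError(f"unknown operator: {op!r}")
```

For example, `evaluate(("*", 2, 8), {})` reduces the first tree from the text, and `evaluate(("*", ("*", ("-", 5, 3), 2), 8), {})` reduces one possible tree for (5-3)*2*8. Trees containing `("freq", word)` return different values for different documents, because the result depends on `word_counts`.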

This isn't terribly exciting. The interesting part happens when feature detectors are introduced into the trees. These feature detectors introduce information about the current document into the tree. An example of a feature detector might be the number of times the word "classifier" appears in a document. Figure 3 represents a tree with a feature detector.

Fig. 3: A Simple Tree Containing a Feature Detector

This simple tree returns the frequency of the word "classifier" in the current document. Like the other trees shown thus far, this tree also reduces to a single number; however, this number will vary according to the number of times the word "classifier" appears in the given document. Of course, more complex trees can be constructed by incorporating both word and numerical operators. It should also be noted that only word operators can accept words as input and, conversely, only numerical operators can accept numbers as input. It follows that all words will have word operators as their parents.

Further, following Koza's recommendations, we have used closed division in place of standard division. In closed division, a divisor of zero will not result in an error; rather, a large value with the sign of the dividend is returned. For the indeterminate value, i.e. 0/0, a zero is returned. This follows Koza's assertion that the operators in a genetic program should be able to accept all possible values which their descendants may generate.

Crossover

Crossover in the world of genetic programming is the equivalent of sexual reproduction.

Fig. 4: The Crossover Operation

It is the single most important evolutionary operation in genetic programming. In crossover, two solutions are sexually combined to form new offspring that are hybrids of both parents. The parents are selected from the population through a process called tournament selection, which is described as follows. First, a solution is selected at random. This solution represents the female. Then the genetic program chooses ten other solutions at random. These solutions represent the males. Of these ten, the one with the best fitness is selected to mate with the female. This method simulates biological mating patterns in which two or more members of the same sex compete to mate with a particular member of the opposite sex.

Once the parents have been selected, the creation of offspring by crossover is accomplished by randomly selecting a subtree in each parent and swapping them. This produces two children that contain code from both their parents. This is shown in Fig. 4, where the two bolded subtrees in the parents are swapped to create two children.

Mutation

Mutation is also an important feature in genetic programming, because it is the only way that a child can receive genetic code independent of its parents. There are two types of mutation in the classifier. In the first type, only a non-terminal can replace a non-terminal, and in the second, one subtree replaces another subtree. Figure 5 below demonstrates both.

Fig. 5: The Mutation Operation
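Tournament selection and subtree crossover as described above can be sketched as follows. This is a hedged illustration rather than the classifier's actual code: trees are assumed to be nested tuples of the form `(operator, child, ...)`, word-operator nodes such as `("freq", "free")` are treated as indivisible leaves so that a swap never places a word under a numerical operator, and the ten males are drawn with replacement from the whole population for simplicity.

```python
import random

WORD_OPS = ("freq", "exists")   # assumed names for the word operators

def subtree_paths(tree, path=()):
    """Yield the path (tuple of child indices) to every subtree, root included."""
    yield path
    if isinstance(tree, tuple) and tree[0] not in WORD_OPS:
        for i, child in enumerate(tree[1:], start=1):
            yield from subtree_paths(child, path + (i,))

def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replace_subtree(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new),) + tree[i + 1:]

def crossover(mother, father, rng):
    """Swap one randomly chosen subtree between the parents; returns two children."""
    p1 = rng.choice(list(subtree_paths(mother)))
    p2 = rng.choice(list(subtree_paths(father)))
    s1, s2 = get_subtree(mother, p1), get_subtree(father, p2)
    return replace_subtree(mother, p1, s2), replace_subtree(father, p2, s1)

def tournament_select(population, fitness, rng, rivals=10):
    """Pick a random female, then the fittest of `rivals` random males.

    Standard fitness is assumed, so the lowest value wins.
    """
    female = rng.choice(population)
    males = [rng.choice(population) for _ in range(rivals)]
    return female, min(males, key=fitness)
```

Because crossover only exchanges material, the two children together contain exactly as many nodes as the two parents did. Passing an explicit `random.Random` instance as `rng` keeps a run reproducible.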

We note, however, that mutation must result in a valid tree; that is, the tree must be reducible to a single number. Therefore a word can only replace a word and a number can only replace another number.

Genetic Programming Summary

The genetic programming process can be summarized as follows:

1. A population of random solutions is generated using some type of random tree generation algorithm.
2. The bestfitness variable is set to the highest possible value.
3. Until a solution is found to satisfy some predetermined stop criteria, or until a certain number of generations have been completed, the following steps are repeated:
   i. Two parents are selected from the population of solutions using tournament selection and their fitnesses are calculated.
   ii. The solutions represented by these two parents are combined to form two new solutions using the operation called crossover.
   iii. The children may undergo the process of mutation with a certain probability.
   iv. The fitness of each of the two children is evaluated and compared to the bestfitness variable. If the child's fitness is lower than the current value stored in the bestfitness variable, it replaces the value in bestfitness.
   v. If a parent's fitness is equal to bestfitness it is kept; otherwise it is removed from the population. In this manner the best solution in the population is never killed.
4. The solution with the best fitness is taken as the outcome of the genetic programming cycle.

If there are N members in the population at any given time, a generation is defined as N/2 iterations, rounded up to the next integer.
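The loop summarized above can be sketched generically, with the individuals, the fitness function and the two evolutionary operators supplied by the caller. Everything named here (`steady_state_gp`, its parameters and defaults) is hypothetical. One detail is an assumption beyond the listed steps: the parents' fitnesses also update the best value seen so far, so that a parent better than both of its children is not discarded in step v.

```python
import math
import random

def steady_state_gp(population, fitness, crossover, mutate, rng,
                    max_generations=20, mutation_rate=0.1,
                    stop_fitness=0.0, rivals=10):
    """Steady-state GP loop; standard fitness (lower is better) is assumed."""
    population = list(population)
    best, best_fitness = None, math.inf          # step 2: worst possible value
    for _ in range(max_generations):
        # One generation is N/2 iterations, rounded up.
        for _ in range(math.ceil(len(population) / 2)):
            # i. tournament selection
            female = rng.choice(population)
            male = min((rng.choice(population) for _ in range(rivals)), key=fitness)
            for parent in (female, male):        # assumption: parents count too
                if fitness(parent) < best_fitness:
                    best, best_fitness = parent, fitness(parent)
            # ii. crossover produces two children
            children = list(crossover(female, male, rng))
            # iii. each child mutates with a fixed probability
            children = [mutate(c, rng) if rng.random() < mutation_rate else c
                        for c in children]
            # iv. compare the children with the best fitness so far
            for child in children:
                f = fitness(child)
                if f < best_fitness:
                    best, best_fitness = child, f
            # v. parents die unless they hold the best fitness
            for parent in (female, male):
                if fitness(parent) != best_fitness and parent in population:
                    population.remove(parent)
            population.extend(children)
            if best_fitness <= stop_fitness:     # stop criterion
                return best, best_fitness
    return best, best_fitness

# Toy run: individuals are integers, fitness is the absolute value,
# and "crossover" simply halves each parent.
toy_best, toy_fitness = steady_state_gp(
    [8], abs, lambda a, b, rng: (a // 2, b // 2), lambda c, rng: c,
    random.Random(0), mutation_rate=0.0)
```

The toy operators carry no biological meaning; they only make the control flow easy to follow. In a real classifier the individuals would be parse trees and `crossover` and `mutate` would be the tree operators described earlier.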

7.0 Implementation Issues

7.1 The Training Set and the Test Set

Initially we had a collection of 972 unsolicited bulk e-mail documents, which were obtained from a user who had saved his junk e-mail over an approximately two-year period. Approximately one quarter of the spam documents duplicated others in the set, and much work went into the detection and removal of these duplicates. Often these duplicates differed only in small aspects, such as the time at which they were sent or the mail servers through which the message was relayed. In some cases the messages differed only by the addition of extra spaces or newline characters. To eliminate such differences we removed the message headers using a simple PERL script. From this point onward finding duplicates was a more straightforward task, and a PERL script was used to this end, resulting in the removal of 271 duplicates. After the removal of duplicates, 701 spam documents remained, and the collection of non-spam documents stood at 102 documents.

We passed these documents through a series of filters. The first removed the HTML tags embedded in some messages. The second removed the 60 most common words in the English language, a common practice in text learning [10][23], since it is felt that these words occur too frequently to be of much discriminating value. Third, we applied stemming, a technique that attempts to reduce the many forms of a word to their root form. For example, an ideal stemming algorithm would convert the words "runs", "running" and "ran" to "run". We used Porter's stemming algorithm as implemented by Frakes in [24] due to its simplicity and high execution speed. We did not use other, more sophisticated stemming algorithms due to their prohibitive time and computing costs. Porter's algorithm is fast and efficient, taking only a fraction of a second to complete. In our experience other, more sophisticated stemming algorithms would take seven to eight seconds to complete on our test machine, a 466 MHz Intel Celeron processor with 256 MB of RAM.
Clearly such delays are unacceptable to a user whose machine would be tied up for more than three minutes only to read 18 pieces of mail.

After passing our documents through these filters, we split them into a training set consisting of 671 spam documents and 72 non-spam documents, and a test set of 30 spam documents and 30 non-spam documents. Although legitimate messages occur far more frequently for a user than spam, we thought it expedient to raise the number of spam documents to an equal footing in order to obtain more accurate percentages of UBE recall and precision. The alternative, given our small number of samples, would require that our recall and precision be calculated from a test set containing only a few junk e-mail messages.

7.2 Feature Selection and the Need to Reduce the Classifier's Vocabulary

Similar to a human being's vocabulary, which consists of the words that a human understands, a classifier's vocabulary represents the only words that a classifier will use to determine the class of a document. We created the classifier's vocabulary by first writing a PERL script to create a separate word list for the junk and non-junk categories. Each word list contained the frequency of each word and the number of documents in which each word occurred. The initial count of words yielded over

unique words over both classes, some words consisting solely of punctuation. We further noticed that our set followed a Zipfian distribution (see footnote 2), a common occurrence in document corpora. Furthermore, according to [21], it is a conventional rule of thumb in pattern recognition practice to use five to ten times as many training samples as features for each class in order to estimate the probability distributions. Since without feature selection we would require a whopping 1.2 million training documents (the number of unique words × 2 classes × 5 samples per feature), it becomes pertinent to reduce our features as much as possible, both to increase the speed of learning and to reduce the required number of training samples. Although we did not perform an exhaustive study, we found 550 features to be adequate. This is consistent with many other practitioners of text learning who have found that fewer features often yield better performance in text-learning domains [11][17][18], including Sahami et al. [10], who used 500 features for classifying junk e-mail, and Mladenic, who found that systems using only 1-3% of the total words in a category demonstrated little or no loss in performance [20].

Despite papers such as [16], which support the claim that document frequency is faster than and generally as effective as other feature selection techniques in the text domain, and despite the author's past experience with web page classification that seemed to support this assertion, the author noticed very poor results (approximately 50% precision) when document frequency was used as a feature selection criterion. Another common feature selection criterion for text categorization is mutual information with the class variable [10][11][17][18]. It is calculated as follows:

\[ MI(X; C) = \sum_{x \in X} \sum_{c \in C} P(X = x, C = c) \, \log \frac{P(X = x, C = c)}{P(X = x) \, P(C = c)} \]

When mutual information was used as the feature selection criterion, the results were found to be much better, and these results are those reported in this paper.

7.3 Poor Performance Caused by Lack of Precision

After noting poor classification performance in the Naïve Bayes classifier, a closer examination showed that product terms in equation (12) were sometimes stored as zero.
Closer inspection revealed that this happened more often for longer documents. We later realized that the source of the problem was that the product terms in equation (12) were so small that they fell below the smallest value representable in the double-precision floating point type afforded by our C++ compiler. Since longer documents typically had more vocabulary words than shorter ones, the product terms went to zero more frequently for longer documents. To remedy this problem an infinite-precision floating point package was used, greatly increasing classifier performance, although at the expense of time complexity. In some cases the computation times increased from a few seconds to minutes. In hindsight, a better implementation would have been one that computed the log of equation (12). The advantage of a log-based approach is that it converts a large product of small numbers into a sum of reasonably sized numbers. This circumvents the need for an infinite-precision package, and logs can be computed with little performance penalty when a lookup table is used. Once the sum of the logs has been computed, it can easily be transformed back to the original domain using the inverse log function. Thus a probability can still be computed.

Footnote 2: Zipf's law, named after George Zipf, its discoverer, is an observation about the relationship between the ranking of the frequency of an event and the frequency itself. This means that the number of times the second most commonly used word occurs is approximately 1/2 the number of times the most popular word is used; the number of times the third most popular word is used is close to 1/3 the number of times the most popular word is used; and so on. In general, if the most popular word appears N times, the k-th most popular word appears approximately (1/k) * N times [22].

8. Learning Issues

Stopping Criterion and the Prevention of Overfitting

Many iterative learning algorithms suffer from the problem of overfitting, and genetic programming is no exception. In most data sets there will be noise, and it is important for the learning algorithm to generalize based on the signal and not the noise. To prevent overfitting in the genetic classifier during training, we cross-validated the best member of the population after each generation using the recall and precision over the test set. This allowed the operator to observe the generalizability of the decision rule. When these measures began to decline, we stopped the genetic classifier's training.

9. Experimental Results

Since genetic programming is not a deterministic process, its output will vary from run to run. Thus we have provided the mean and standard deviation over six runs, with one run being discarded since it was a degenerate case where learning did not occur. We labeled the GP run with the highest combined sum of recall and precision as the best run. The GP results were produced using a population of 300 trees that were initially generated using an algorithm that randomly generated trees with depths greater than 6. We obtained excellent results using a tournament size of 10 and poor results using a value of 5.

                  Naïve Bayes   GP Best   GP Mean   GP Std. Dev
Junk Recall       76.67%        70.0%
Junk Precision    95.83%        95.45%

Table 4: Recall and Precision for Naïve Bayes and Genetic Programming

10. Conclusions

Although the precision of the best genetically programmed classifier was comparable to that of the Naïve Bayes classifier, its recall trailed the Naïve Bayes classifier by 6.67%.
Nevertheless, we have shown that it is possible to construct a genetically programmed classifier with reasonable performance. In fact, the two performance measures are so close that we cannot be sure that the differences are statistically significant.
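The underflow described in Section 7.3, and the log-based remedy suggested there, can be illustrated with a small Python sketch. It is an illustration only, not the report's C++ code. Equation (12) is assumed here to be a class prior multiplied by one conditional probability per word; the function names and the toy probabilities are hypothetical. Subtracting the largest log score before applying the inverse log is an addition of this sketch, made so that the conversion back to probabilities does not underflow again.

```python
import math

def naive_product(prior, word_probs, words):
    """Direct product of the prior and per-word probabilities (prone to underflow)."""
    score = prior
    for w in words:
        score *= word_probs[w]
    return score

def log_score(prior, word_probs, words):
    """Same quantity in the log domain: a sum of moderately sized negative numbers."""
    return math.log(prior) + sum(math.log(word_probs[w]) for w in words)

def posteriors(log_scores):
    """Turn per-class log scores into probabilities that sum to one."""
    top = max(log_scores.values())          # shift so the largest term is exp(0) = 1
    shifted = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(shifted.values())
    return {c: v / total for c, v in shifted.items()}

# Toy document: one vocabulary word repeated 100 times (hypothetical numbers).
words = ["offer"] * 100
priors = {"junk": 0.5, "legit": 0.5}
probs = {"junk": {"offer": 2e-5}, "legit": {"offer": 1e-5}}

products = {c: naive_product(priors[c], probs[c], words) for c in priors}
logs = {c: log_score(priors[c], probs[c], words) for c in priors}
post = posteriors(logs)
```

In this toy case both direct products underflow to zero in double precision, so the classes cannot be distinguished, while the log scores remain finite and still rank the junk class higher.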

11. Recommendations

Although both classifiers demonstrated similar results, the Naïve Bayes classifier outperformed the genetically programmed classifier in recall by 6.67%. It is recommended that more data sets be produced, both to ascertain whether these differences are statistically significant and to demonstrate how well these two learning methods scale to larger numbers of training documents.

Although the two classifiers were not explicitly timed, the genetic classifier's computation time for a document was qualitatively instantaneous, while the Naïve Bayes classifier took a considerable amount of time (sometimes over 40 seconds on a 466 MHz Pentium) to output a classification decision. This can be attributed to our use of a high-precision math library to deal with the very small numbers in the product term of equation (12). We cannot, however, regard this as a fair comparison of classification time, since implementations of the Naïve Bayes algorithm exist which are much faster than the one used. It is therefore recommended that a Naïve Bayes classifier based on the sums of logs be implemented to facilitate a fairer comparison of classification time.

We have restricted the features in the genetic programming code to closely match those made available to the Naïve Bayesian classifier, to provide a fair basis of comparison between these two learning methods. Like Naïve Bayes, genetic programming can be extended to incorporate many other features beyond unigrams (single words), such as n-grams (phrases or syllables of words). Often, the winner of TREC, an annual text retrieval competition, is a variant of Naïve Bayes extended to incorporate selected n-grams and other features. It is therefore recommended that more work be done to compare the performance of a genetic classifier against a Bayesian classifier in a richer feature space that incorporates features such as the pair-distances between words and the frequencies of phrases and n-grams.
Description                                                       Spam   Non-spam
Messages containing signatures                                    28%    36%
Signatures containing "remove"                                    86%    0%
Signatures containing a name and/or a title, address              0%     72%
Messages with punctuation repeated 3 or more times                76%    38%
Messages containing repeated punctuation, with tagline
and forwarding messages ignored                                   36%    2%

Table 5: E-mail signatures can be harvested as useful features

Many e-mails end with a personal signature or tagline. These e-mail signatures often contain repeated punctuation denoting their starting boundary, followed by a person's name, title, address and, in some cases, a toll-free number. Signatures such as these are uncommon in unsolicited bulk e-mail; however, their individual components are frequently found in junk e-mail. In a random sample of 50 spam and 50 non-spam messages, 28% of the spam messages were found to contain a tagline; however, 86% of these messages contained the word "remove" or "delete", compared to 0% in the non-spam category, demonstrating the utility of this information as a high-precision feature. Furthermore, we note that 72% of the non-junk messages contained a name, title and

address, compared to 0% in the spam category, posing yet another useful feature in the e-mail signature.

We further note, as have Sahami et al. in [10], that spam messages often contain repeated punctuation; for example, one might find the phrase "HUGE SAVINGS!!!!" in a spam message. However, we also note that repeated punctuation is often contained in the e-mail signatures of non-spam messages, diminishing its value as a feature. A look at our random samples revealed that spam messages were only twice as likely to contain repeated punctuation; however, if the repeating punctuation in the e-mail signatures was removed and forwarding headers such as "---- original message ----" were ignored, the spam messages were 18 times more likely to contain repeated punctuation than the non-spam messages, a ninefold increase in feature effectiveness. Thus we have shown that repeated punctuation is another useful feature for the detection of junk e-mail. We have also shown that the utility of this feature can be greatly increased if the repeating punctuation associated with e-mail signatures is detected and removed.

It is therefore recommended that a better parser be constructed to detect and parse e-mail signatures, so that their repeating punctuation can be removed and their features can be extracted. We hypothesize that such an approach will be very helpful in increasing classifier performance.
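One way the recommended parsing step might look is sketched below in Python. This is a rough heuristic under stated assumptions, not the parser proposed above. It assumes that a signature begins at a line made up solely of one repeated punctuation mark (such as "-----") within the last few lines of the message, and that forwarding headers look like "---- original message ----". The patterns, the function names and the eight-line window are all hypothetical choices.

```python
import re

# The same punctuation mark three or more times in a row, e.g. "!!!!".
REPEATED_PUNCT = re.compile(r"([^\w\s])\1{2,}")
# A forwarding header such as "---- original message ----".
FORWARD_HEADER = re.compile(r"^\s*-+\s*original message\s*-+\s*$", re.IGNORECASE)
# A line consisting only of one repeated mark, assumed to open a signature.
SIGNATURE_START = re.compile(r"^\s*([^\w\s])\1{2,}\s*$")

def strip_signature(lines, max_signature_lines=8):
    """Drop a trailing signature that starts at a delimiter line near the end."""
    start = max(0, len(lines) - max_signature_lines)
    for i in range(start, len(lines)):
        if SIGNATURE_START.match(lines[i]):
            return lines[:i]
    return lines

def has_repeated_punctuation(message):
    """Repeated-punctuation feature with forwarding headers and signature ignored."""
    lines = [ln for ln in message.splitlines() if not FORWARD_HEADER.match(ln)]
    body = strip_signature(lines)
    return any(REPEATED_PUNCT.search(ln) for ln in body)

spam = "HUGE SAVINGS!!!!\nClick now"
legit = "Hi Bob,\nSee you at noon.\n-----\nJane Doe\nManager\n1-800-555-0100"
forwarded = "FYI\n---- original message ----\nLunch at noon?"
```

On these three toy messages the feature fires only for the spam example, whereas searching the raw text of the legitimate message would also match its signature delimiter.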

References

[1] Internet Week, May, CMP Media Inc., Manhasset, New York.
[2] New York Times, March.
[3] Cranor, Lorrie & LaMacchia, Brian. Spam! AT&T Labs Technical Report, March 1998.
[4] Internet Week, May 11, CMP Media Inc., Manhasset, New York, 1998.
[5] Commercial Internet Exchange and Internet Law and Policy Forum, June 1998.
[6] Marshall, Jonathan. 'Spam' Overloads Pac Bell: Flood of junk e-mail knocks out service. San Francisco Chronicle, March.
[7] CNET The Net, February.
[8] Levitt, Mark & Comiskey, Mike. Bright Light Focuses on Eliminating Spam. IDC Corporation, July.
[9] Hoffman, P. & Crocker, D. Unsolicited Bulk E-mail: Mechanisms for Control. Internet Mail Consortium.
[10] Sahami, M. & Dumais, S. & Heckerman, D. & Horvitz, E. A Bayesian Approach to Filtering Junk E-mail, in Learning for Text Categorization: Papers from the 1998 Workshop. AAAI Technical Report.
[11] Nigam, K. & McCallum, A. & Thrun, S. & Mitchell, T. Text Classification from labeled and unlabeled documents.
[12] Friedman, Nir & Geiger, Dan & Goldszmidt, Moses. Bayesian network classifiers, in Machine Learning, Vol. 2.
[13] Lewis, D. Naive (Bayes) at forty: The independence assumption in information retrieval, in ECML-98: Proceedings of the Tenth European Conference on Machine Learning.
[14] Lewis, D. Text Representation for intelligent text retrieval: A classification oriented view. In Paul S. Jacobs, editor, Text-Based Intelligent Systems. Lawrence Erlbaum, NJ.
[15] Banzhaf et al. Genetic Programming: An Introduction. Morgan Kaufmann, San Francisco.
[16] Yang, Y. & Pedersen, J. A comparative study on feature selection in text categorization, in International Conference on Machine Learning (ICML).

[17] Lewis, D. D. Feature selection and feature extraction for text categorization. Morgan Kaufmann, San Francisco.
[18] Koller, D. and Sahami, M. Hierarchically classifying documents using very few words, in International Conference on Machine Learning (ICML).
[19] Jaakkola, T. S. and Haussler. Exploiting generative models in discriminative classifiers.
[20] Mladenic, D. Feature subset selection in text-learning, in Proc. of the 10th European Conference on Machine Learning.
[21] Jain, A. & Chandrasekaran. Dimensionality and sample size considerations in pattern recognition practice, in Handbook of Statistics, Vol. 2 (Krishnaiah, P. and Kanal, L., editors). Amsterdam: North-Holland Publishing Company.
[22] Zipf, G. K. Human Behaviour and the Principle of Least Effort. Addison Wesley.
[23] Clack, C. & Farrington, J. & Lidwell, P. & Yu, T. Autonomous Document Classification for Business, in Proceedings of The ACM Agents Conference.
[24] Frakes, W. Stemming Algorithms, in Information Retrieval and Data Structures. Englewood Cliffs, New Jersey: Prentice Hall.
[25] van Rijsbergen. Information Retrieval. Butterworths, London, 2nd edition.
[26] Lewis, D. "Evaluating and optimizing autonomous text classification systems", in SIGIR '95.
[27] Katirai, H. Genetic Programming and its Application to the Classification of Web Pages. Department of Electrical and Computer Engineering Technical Report, University of Waterloo, Waterloo, Ontario.

Appendix 1: Table of Expenses
[As required by the E&CE 499 Report Outline]

Estimated Expenses: $0
Actual Expenses: $0