IMPROVING NAIVE BAYESIAN SPAM FILTERING




Master Thesis

IMPROVING NAIVE BAYESIAN SPAM FILTERING

Jon Kågström
Mid Sweden University
Department for Information Technology and Media
Spring 2005

Abstract

Spam, or unsolicited e-mail, has become a major problem for companies and private users. This thesis explores the problems associated with spam and some different approaches attempting to deal with it. The most appealing methods are those that are easy to maintain and prove to have satisfactory performance. Statistical classifiers are such a group of methods, as their ability to filter spam is based upon previous knowledge gathered through collected and classified e-mails. A learning algorithm which uses the Naive Bayesian classifier has shown promising results in separating spam from legitimate mail. Tokenization, probability estimation and feature selection are processes performed prior to classification, and all have a significant influence upon the performance of spam filtering. The main objective of this work is to examine and empirically test the currently known techniques used for each of these processes and to investigate the possibilities for improving the classifier performance. Firstly, it is shown how a filter and wrapper approach can be used to find tokenization delimiter subsets that improve classification. After this, seven probability estimators are tested and compared in order to demonstrate which of them improve the performance. Finally, a survey of commonly used methods for the feature selection process is performed and recommendations for their use are presented.

Acknowledgments

I would like to thank my supervisor Iskra Popova for guiding me in how to write a thesis. She has helped me to improve the structure and pointed out where the content needed enrichment. For me it has been a learning process. I also want to thank everyone who makes an effort to fight spam, and there are a lot of you!

Glossary

classifier: A person or machine that is sorting out the constituents of a substance.
clique: A maximum complete subgraph of a graph.
complete graph: A graph in which all vertices are connected with each other.
corpus: A collection of natural language text used for accumulating statistics. More specifically, in this thesis a corpus is a map between a word (token) and its frequency.
degrees of freedom: The number of values that are free to vary in a statistical calculation.
delimiter: A character that marks the beginning or end of a unit of data.
entropy: A measure of the disorder that exists in a system.
feature: A prominent part or characteristic.
inducer: A machine learning algorithm that produces a classifier which, in turn, assigns a class to each instance.
misclassification: An actual spam classified as good, or an actual good message classified as spam.
monotonicity: The function $f$ is monotone if, whenever $x \le y$, then $f(x) \le f(y)$. Stated differently, a monotone function is one that preserves the order.
mutually exclusive: Describing two events, conditions, or variables which cannot occur at once.
n-gram: $n$ features are considered at a time. See also unigram.
null hypothesis: Predicts that two distributions are the same.
probability distribution: A list of the probabilities associated with each of the values of a random variable. Throughout this work discrete (non-continuous) probability distributions are used. The probability mass function is denoted by $p(x_i)$, where $x_i$ is a random discrete variable.
sample space: The set of all possible outcomes of an experiment.
significance level: The decision criterion for accepting or rejecting the null hypothesis.
space: Character used to separate words.
spam: Unsolicited, usually commercial, e-mail sent to a large number of addresses.
token: A distinguishing characteristic (feature).
transitivity: In mathematics, a binary relation R over a set X is transitive if it holds for all a, b, and c in X, that if a is related to b and b is related to c, then a is related to c.
unigram: Only one feature is considered at a time. See also n-gram.
white-space: The characters space, tab, line-feed and other characters that leave an empty space.
χ²-distribution: Its shape alters as the degrees of freedom change, becoming more symmetrical as the degrees of freedom increase. With more than 30 degrees of freedom it approximates the normal distribution.

Abbreviations

ASCII: American Standard Code for Information Interchange
BNS: Bi-Normal Separation
ELE: Expected Likelihood Estimate
HTML: Hyper Text Markup Language
IP: Internet Protocol
ISP: Internet Service Provider
KL: Kullback-Leibler
kNN: k-Nearest Neighbor
L-S: Ling-Spam
MLE: Maximum Likelihood Estimate
NB: Naive Bayesian
NNet: Neural Network
SA: SpamAssassin
SBS: Sequential Backward Selection
SBFS: Sequential Backward Floating Selection
SFS: Sequential Forward Selection
SFFS: Sequential Forward Floating Selection
SMTP: Simple Mail Transfer Protocol
SVM: Support Vector Machine
XML: Extensible Markup Language

Notation

α: Significance level.
χ²: Chi-square statistic.
λ: Weight of each good message.
A: Alphabet.
B: Number of bins (distinct items).
C: Class vector.
D: Delimiter set.
EW_Acc: Extended Weighted Accuracy.
F: Feature vector.
IG: Information Gain.
J: Fitness function.
KL: Kullback-Leibler divergence.
L: Label vector.
M: Message corpus.
n_gg: Number of good messages classified as good.
n_gs: Number of good messages classified as spam.
n_ss: Number of spam messages classified as spam.
n_sg: Number of spam messages classified as good.
N: Total number of training instances.
N_j: Frequency of frequency, i.e. the number of distinct items seen j times.
p: Precision.
p_MLE: Maximum likelihood estimator.
p_Abs: Absolute estimator.
p_Lap: Laplace estimator.
p_ELE: Expected likelihood estimator.
p_Lid: Lidstone estimator.
p_WB: Witten-Bell estimator.
p_GT: Good-Turing estimator.
p_Rob: Bayesian estimator.
p-value: Calculated probability in classical statistics.
P: Probability.
PR: Probability ratio.
Q: Production of a tokenizer.
r: Recall.
R: Rank.
t: Threshold value for spam cutoff.
T: Tokenizer.
W_Acc: Weighted accuracy.
X, Y: Events.

Table of Contents

1 Introduction
  1.1 The problem of spam
  1.2 Research objectives
  1.3 Thesis Outline
2 Method
  2.1 Literature Survey
  2.2 Experimental work
3 Techniques to eliminate spam
  3.1 Hiding the e-mail address
  3.2 Pattern matching, whitelists and blacklists
  3.3 Rule based filters
  3.4 Statistical filters
  3.5 E-mail verification
  3.6 Distributed blacklists of spam sources
  3.7 Distributed blacklist of spam signatures
  3.8 Money e-mail stamps
  3.9 Proof-of-work e-mail stamps
  3.10 Legal measures
  3.11 Conclusion
4 Statistical Classifiers
  4.1 Features and classes
  4.2 Text categorization
  4.3 Basics about Probability Theory
  4.4 Bayes theorem
  4.5 Classical vs. Bayesian statistics
    4.5.1 Using statistics
    4.5.2 Using statistics
    4.5.3 Objective and subjective probabilities
    4.5.4 Inference differences
    4.5.5 Example of statistical spam classification
      4.5.5.1 Classical statistics
      4.5.5.2 Bayesian statistics
5 Naive Bayesian Spam Filtering
  5.1 The model
  5.2 Naive Bayesian Classifier
  5.3 Measuring the performance
    5.3.1 Precision and recall
    5.3.2 Weighted accuracy
    5.3.3 Cross validation
    5.3.4 Benchmark corpora
  5.4 Message tokenization
    5.4.1 Definitions
    5.4.2 Delimiter Interaction
      5.4.2.1 Non-transitivity
      5.4.2.2 Non-monotonicity
    5.4.3 Dimensionality reduction in the search for a good delimiter subset
    5.4.4 Filters and wrappers
  5.5 Probability estimation
    5.5.1 Absolute Estimate (p_abs)
    5.5.2 Laplace Estimate (p_lap)
    5.5.3 Expected Likelihood Estimate (p_ELE)
    5.5.4 Lidstone Estimate (p_Lid)
    5.5.5 Witten-Bell smoothing (p_WB)
    5.5.6 Good-Turing Estimate (p_SGT)
    5.5.7 Bayesian smoothing (p_Rob)
  5.6 Feature Selection
    5.6.1 Information Gain
    5.6.2 χ² statistics
    5.6.3 Probability Ratio
6 Experimental Results
  6.1 Delimiter selection
    6.1.1 Filter for delimiter selection using KL-divergence
    6.1.2 Wrapper approach for delimiter selection
    6.1.3 Experimental settings
    6.1.4 Results
    6.1.5 Analysis
    6.1.6 Conclusion
    6.1.7 Future work
  6.2 Probability estimation
    6.2.1 Experimental settings
    6.2.2 Results
    6.2.3 Analysis
    6.2.4 Conclusion
    6.2.5 Future work
  6.3 Feature selection
    6.3.1 Experimental settings
    6.3.2 Results
    6.3.3 Analysis
    6.3.4 Conclusion
    6.3.5 Future work
7 Summary
References

List of Figures

Figure 1. A model of Naive Bayesian spam filtering.
Figure 2. Pseudo code inspired by (Spence & Sajda 1998) for the modified SFFS.
Figure 3. Illustration of the filter selection procedure.
Figure 4. Illustration of the wrapper selection procedure.
Diagram 1. $p_{MLE}(x_i) = 0.8$ is smoothed by $p_{Rob}$ as the data points increase.
Diagram 2. Performance as the number of delimiters increases.
Diagram 3. Performance as a function of the number of features selected on PU1.
Diagram 4. Performance as a function of the number of features selected on PU2.
Diagram 5. Performance as a function of the number of features selected on PU3.
Diagram 6. Performance as a function of the number of features selected on PUA.
Diagram 7. Performance as a function of the number of features selected on SA.
Diagram 8. Performance as a function of the number of features selected on L-S.
Diagram 9. The average time to classify one message for the different feature selectors.

List of Tables

Table 1. Illustration of the non-transitive relationship between delimiters.
Table 2. Corpora used in the experiments.
Table 3. Delimiter subsets used as base-line.
Table 4. Delimiter subsets automatically found by the SFFS for different corpora.
Table 5. Performance of the different delimiter subsets on Ling-Spam.
Table 6. Performance of the different delimiter subsets on SpamAssassin.
Table 7. Performance of the different delimiter subsets on Personal.
Table 8. Probability estimators tested on PU1.
Table 9. Probability estimators tested on PU2.
Table 10. Probability estimators tested on PU3.
Table 11. Probability estimators tested on PUA.
Table 12. Probability estimators tested on Ling-Spam.
Table 13. Probability estimators tested on SpamAssassin.
Table 14. Mean results for λ = 1.
Table 15. Mean results for λ = 9.
Table 16. Mean results for λ = 99.
Table 17. Overall performance of the tested estimators.

1 Introduction

1.1 The problem of spam

The Internet has opened new channels of communication, enabling an e-mail to be sent to a relative thousands of kilometers away. This medium of communication opens doors for virtually free mass e-mailing, reaching hundreds of thousands of users within seconds. However, this freedom of communication can be misused. In the last couple of years spam has become a phenomenon that threatens the viability of communication via e-mail.

It is difficult to develop an accurate and useful definition of spam, although every e-mail user will quickly recognize spam messages. The Merriam-Webster Online Dictionary 1 defines spam as "unsolicited usually commercial e-mail sent to a large number of addresses". Some purposes of spam other than the commercial are to express political or religious opinions, deceive the target audience with promises of fortune, spread meaningless chain letters and infect the receiver's computer with viruses. Even though one can argue that what is spam for one person can be an interesting mail message for another, most people agree that spam is a public frustration.

Spam has become a serious problem because in the short term it is usually economically beneficial to the sender. The low cost of e-mail as a communication medium virtually guarantees profits. Even if a very small percentage of people respond to the spam advertising message by buying the product, this can be worth the money and the time spent on sending bulk e-mails. Commercial spammers are often represented by people or companies that have no reputation to lose. Because of technological obstacles with the e-mail infrastructure, it is difficult and time-consuming to trace the individual or the group responsible for sending spam. Spammers make it even more difficult by hiding or forging the origin of their messages. Even if they are traced, the decentralized architecture of the Internet, with no central authority, makes it hard to take legal actions against spammers.

Spam has increased steadily over the last years, according to Brightmail 2. At present, March 2004, 62% of all e-mails on the Internet are spam, compared to 45% a year ago. The major problem concerning spam is that it is the receiver who is paying for the spam in terms of their time, bandwidth and disk space. This can be very costly even for a small company with only 20 employees who each receive 20 spam e-mails a day. If it takes 5 seconds to classify and remove a spam, then the company will spend about half an hour every day separating spam from legitimate e-mail. The statistics show that 20 spam messages per day is a very low number for a company that is susceptible to spam. There are other problems associated with spam: messages can have content that is offensive to people and might cause general psychological annoyance, a large amount of spam messages can crash unprotected mail servers, legitimate personal e-mails can easily be lost, and more.

There is an immediate need to control the steadily growing spam flood. A great deal of on-going research is trying to resolve the problem. However, e-mail users are impatient and therefore there is a growing need for rapidly available anti-spam solutions to protect them.

1 Merriam-Webster Online Dictionary, http://www.m-w.com/, 2004-03.
2 Brightmail, http://brightmail.com/, 2004-03.

1.2 Research objectives

There are many different approaches available at present attempting to solve the spam issue. One of the most promising methods for filtering spam, with regards to performance and ease of implementation, is that of statistical filters. These filters learn to distinguish (or classify) between spam and legitimate e-mail messages as they are being used. In addition, they automatically adapt as the content of spam messages changes. The objective of this thesis is to explore the statistical filter called the Naive Bayesian classifier and to investigate the possibilities for improving its performance. After dissecting the segments of its operation, this work focuses on the three specific areas described below.

Before a message can be classified as either spam or legitimate it is first split into tokens; this process is called tokenizing. As this text is being read, tokenizing into tokens (words) is actually taking place, as space is being used as a delimiter. Similarly, an e-mail message can be split into tokens using space or any other character as a delimiter. The first objective of this work is to examine how the selection of delimiters affects the classifier's performance and to offer recommendations for choosing delimiters.

The classification of an e-mail message as spam is based upon the knowledge gathered from the statistics about tokens appearing in previous e-mail messages. When a message is to be classified, each token is looked up in the training data. For example, the token Viagra may have appeared 5 times in previous spams and 0 times in previous legitimate e-mails. These are the frequencies of a token in the training data. From these frequencies it is possible to estimate the probability that a token is found in a spam or a legitimate e-mail. The most straightforward technique is to divide the frequency by the total number of tokens previously seen. Higher frequencies give better probability estimates, but whenever a token is either not present in any of the previous messages or has a low frequency, there are better ways of estimating its probability. Our second objective is to examine how different probability estimators affect the spam classification performance.

A feature is a characteristic of an object. For example, in image recognition a feature could be a color, and in the case of classifying e-mails it is a token or a word. E-mail messages are written using natural languages, which contain thousands of distinct words. The number of words is the dimensionality of the message. The primary purpose of feature selection is to reduce the dimensionality in order to increase the speed of the computation. Our third objective was to conduct comparative analyses between three commonly used feature selection methods.

1.3 Thesis Outline

The thesis is structured in seven chapters. Chapter two discusses the method used. Chapter three briefly describes currently developed techniques to eliminate spam, aiming to show that none of the existing schemes is fully developed or offers complete spam elimination. It also emphasizes that statistical filters have, so far, proved to be the most successful method in dealing with spam.

Chapter four gives an overview of statistical classifiers and provides the reader with some necessary basic mathematical understanding for this work.

Chapter five presents a general conceptualized model of a Naive Bayesian spam filter and explains the theory used by the Naive Bayesian classifier. This chapter also covers the currently used techniques for the three phases performed prior to the classification of messages, starting with an examination of how the selection of delimiters affects the classifier performance. Following this, seven different approaches are presented for smoothing the probabilities for the training data, and finally three different methods for selecting features are demonstrated.

Chapter six is devoted to the experimental work performed. It contains three sections, one for each experiment, each with its own conclusions. In this way the results presented in the sections are summarized and the conclusions are derived.

Chapter seven is a summarization of all the work performed.

2 Method

The methodology used throughout the thesis consisted of a theoretical study, requiring a literature survey, and practical work involving several experiments.

2.1 Literature Survey

Articles found on the Internet are the most commonly used research material for this work. Google and Citeseer 3 were frequently used to find articles of interest. The spam phenomenon is still in its infancy and it was therefore natural to use the Internet as the main source of information. The theory behind statistical filters is well established and a number of books on statistics served as primary literature in this area. Books on Formal Languages, Artificial Intelligence and Discrete Mathematics were often consulted throughout the work on this thesis.

2.2 Experimental work

The experimental work is supported by some theoretical background. The empirical results obtained were verified against the theoretical ones whenever they were available. To carry out the experiments, a test environment in C++ was built. In order to avoid rebuilding the environment for different tests, experiments were defined in an XML file that is read at run-time. For example, the corpus to use, the probability estimator and the feature selection method are defined in the XML file.

3 http://citeseer.ist.psu.edu/ is a database with articles.

3 Techniques to eliminate spam

There are several approaches which deal with spam. This section briefly summarizes some common methods to avoid spam and briefly describes the spam filtering techniques used at present.

3.1 Hiding the e-mail address

The simplest approach to avoid spam is to keep the e-mail address hidden from spammers. The e-mail address can be revealed only to trusted parties. For communication with less trusted parties a temporary e-mail account can be used. If the e-mail address is published on a web page it can be disguised for e-mail spiders 4 by inserting a tag that is requested to be removed before replying. Robots will collect the e-mail address with the tag, while humans will understand that the tag has to be removed in order to retrieve the correct e-mail address. For most users this method is insufficient. Firstly, it is time consuming to implement techniques that will keep the e-mail address safe, and secondly, the disguised address could not only mislead robots, but also the inattentive human. Once the e-mail address is exposed, there is no further protection against spam.

3.2 Pattern matching, whitelists and blacklists

This is a content-based pattern matching approach where the incoming e-mail is matched against some patterns and classified as either spam or legitimate. Many e-mail programs have this feature, which is often referred to as message rules or message filters. This technique mostly consists of plain string matching. Whitelists and blacklists, which basically are lists of friends and foes, fall into this category. Whenever an incoming e-mail matches an entry in the whitelist, the rule is to allow that e-mail through. However, whenever an e-mail has a match against the blacklist, it is classified as spam. This method can reduce spam up to a certain level and requires constant updating as spam evolves. It is time consuming to determine what rules to use and it is hard to obtain good results with this technique. In Mertz D. 2002 some simple rules are presented. The author claims that he was capable of catching about 80% of all spam he received. However, he also stated that the rules used had, unfortunately, relatively high false positive rates. Basically, this technique is a simpler version of the more sophisticated rule based filters which are discussed below.

3.3 Rule based filters

This is a popular content-based method deployed by spam filtering software such as SpamAssassin 5. Rule-based filters apply a set of rules to every incoming e-mail. If there is a match, the e-mail is assigned a score that indicates spaminess or non-spaminess. If the total score exceeds a threshold, the e-mail is classified as spam. The rules are generally built up of regular expressions and they come with the software. The rule set must be updated regularly as spam changes, in order for the filtering of spam to be successful. Updates are retrieved via the Internet. The test results from the comparison of anti-spam programs presented in Holden 2003 show that SpamAssassin finds about 80% of all spam, while statistical filters (discussed later) find close to 99% of all spam.

4 E-mail spiders, or e-mail robots, are computer programs that scan and collect e-mail addresses from the Internet.
5 SpamAssassin, http://www.spamassassin.org/index.html

The advantage of rule-based filters is that they require no training to perform reasonably well. Rules are implemented by humans and they can be very complex. Before a newly written rule is ready for use, it requires extensive testing to make sure it only classifies spam as spam and not legitimate messages as spam. Another disadvantage of this technique is the need for frequent updates of the rules. Once a spammer finds a way to deceive the filter, the spam messages will get through all filters with the same set of rules.

3.4 Statistical filters

In Sahami et al. 1998 it is shown that it is possible to achieve remarkable results by using a statistical spam classifier. Since then many statistical filters have appeared. The reason for this is simple: they are easy to implement, have a very good performance and require little maintenance. Statistical filters require training on both spam and non-spam messages and will gradually become more efficient. They are trained personally on the legitimate and spam e-mails of the user, hence it is very hard for a spammer to deceive the filter. A more in-depth discussion on statistical filters will follow in the next chapter.

3.5 E-mail verification

E-mail verification is a challenge-response system that automatically sends out a one-time verification e-mail to the sender. The only way for an e-mail to pass through the filter is if the sender successfully responds to the challenge. The challenge in the verification e-mail is often a hyperlink for the sender to click. When this link is clicked, all e-mails from that sender are allowed through. Bluebottle 6 and ChoiceMail 7 are two such systems. The advantage of this method is that it is able to filter almost 100% of the spam. However, there are two drawbacks associated with this method. The sender is required to respond to the challenge, which necessitates extra care: if the challenge is not recognized, the e-mail will be lost. Verifications can also be lost due to technical obstacles such as firewalls and other e-mail response systems. It can also cause problems for automated e-mail responses such as online orders and newsletters. The verification e-mail also generates more traffic.

3.6 Distributed blacklists of spam sources

These filters use a distributed blacklist to determine whether or not an incoming e-mail is spam. The distributed blacklist resides on the Internet and is frequently updated by the users of the filter. If a spam passes through a filter, the user reports the e-mail to the blacklist. The blacklist is updated and will now protect other users from the sender of that specific e-mail. This class of blacklists keeps a record of known spam sources, such as IP numbers that allow SMTP relaying. The problem involved in using a filter relying entirely on these blacklists is that it will generally classify many legitimate e-mails as spam (false positives). Another downside is the time taken for the network-based lookup. These solutions may be useful for companies assuming that all their e-mail communications are with other serious, non-listed businesses.

6 Bluebottle, http://www.bluebottle.com/
7 ChoiceMail, http://www.digiportal.com/index.html

Companies offering this service include MAPS 8, ORDB 9 and Spamcop 10.

3.7 Distributed blacklist of spam signatures

These blacklists work in the same manner as those described in 3.6. The difference is that these blacklists consist of spam message signatures instead of spam sources. When a user receives a spam, that user can report the message signature (typically a hash code of the e-mail) to the blacklist. In this way, one user will be able to warn all other users that a certain message is spam. To avoid non-spam being added to a distributed blacklist, many different users must have reported the same signature. Spammers have found an easy way to fool these filters: they simply add a random string to every spam. This will prevent the e-mail from being detected in the blacklist. However, spam fighters attempt to overcome this problem by adapting their signature algorithms to allow some random noise. The advantage is that these kinds of filters rarely classify legitimate messages as spam. The greatest disadvantage is that they are not able to recall much of the spam. Vipul's Razor 11 uses such a blacklist and states that it catches 60%-90% of all incoming spam. Another disadvantage is the time taken for the network lookup.

3.8 Money e-mail stamps

The idea of e-mail stamps is not new, having been discussed since 1992, but it is not until recently that major companies have considered using it to combat spam. The sender would have to pay a small fee for the stamp. This fee could be minor for legitimate e-mail senders, while it could destroy business for spammers who send millions of e-mails daily. There are two stamp types: money stamps and proof-of-work stamps (discussed later). Goodmail Systems 12 is developing a system for money stamps. The basic idea is to insert a unique encrypted id into the header of each sent e-mail. If the recipient's ISP is also participating in the system, the id is sent to Goodmail where it is decrypted. Goodmail will now be able to identify and charge the sender of the e-mail. Today there are many issues requiring solutions before such a system can be deployed. Who receives the money? Where is tax paid? Who is allowed to sell stamps? Since this is a centralized solution, what about scalability? It would also be the end of many legitimate newsletters.

3.9 Proof-of-work e-mail stamps

At the beginning of 2004, Bill Gates, Microsoft's chairman, suggested that the spam problem could be solved within two years by adding a proof-of-work stamp to each e-mail. Camram 13 is a system that uses proof-of-work stamps. Instead of taking a micro fee from the sender, a cheat-proof mathematical puzzle is sent. The puzzle requires a certain amount of computational power to be solved (a matter of seconds). When a solution is found, it is sent back to the receiver and the e-mail is allowed to pass to the receiver. The puzzle Camram is using is called Hashcash 14.

8 Mail Abuse Prevention System LLC (MAPS), http://mailabuse.com/
9 Open Relay DataBase (ORDB), http://ordb.org/
10 Spamcop, http://www.spamcop.net/
11 Vipul's Razor, http://razor.sourceforge.net/
12 Goodmail, http://www.goodmailsystems.com/
13 Camram, http://www.camram.org/
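To illustrate the mechanism, the following is a minimal proof-of-work sketch in C++ (the language used for the thesis's test environment). It is not Camram or Hashcash itself: real Hashcash hashes a structured stamp header with SHA-1 and counts leading zero bits, while this sketch substitutes std::hash and an arbitrary difficulty, so every name and parameter here is an assumption made for illustration.

    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <string>

    // Hashcash-style proof-of-work (illustrative only): find a nonce such
    // that the hash of (header + nonce) has its lowest `difficulty` bits
    // equal to zero. Real Hashcash uses SHA-1; std::hash is a stand-in.
    std::uint64_t mintStamp(const std::string& header, unsigned difficulty) {
        const std::uint64_t mask = (std::uint64_t{1} << difficulty) - 1;
        std::hash<std::string> h;
        for (std::uint64_t nonce = 0;; ++nonce)
            if ((h(header + ':' + std::to_string(nonce)) & mask) == 0)
                return nonce;  // the sender spends CPU time searching for this
    }

    // Verifying a stamp costs the receiver only a single hash evaluation.
    bool checkStamp(const std::string& header, std::uint64_t nonce,
                    unsigned difficulty) {
        const std::uint64_t mask = (std::uint64_t{1} << difficulty) - 1;
        return (std::hash<std::string>{}(header + ':' + std::to_string(nonce))
                & mask) == 0;
    }

    int main() {
        const std::string header = "from:alice@example.com;to:bob@example.com";
        const unsigned difficulty = 20;       // ~2^20 hashes on average to mint
        std::uint64_t nonce = mintStamp(header, difficulty);
        std::cout << "nonce " << nonce << " valid: "
                  << checkStamp(header, nonce, difficulty) << "\n";
    }

The asymmetry is the point of the scheme: in this sketch, minting a stamp costs the sender about 2^20 hash evaluations on average, while verifying one costs the receiver a single hash.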

Whether it is money or proof-of-work e-mail stamps, many oppose the idea, not only because e-mailing should be free, but also because it will not solve the spam problem. To make this approach effective, most ISPs would have to join the stamp program. As long as there are ISPs that are not integrated into the stamp system, spammers could use their servers for mass e-mailing. It could then still be possible for legitimate e-mailers to pay to send e-mails, while spam is still flooding into the inboxes of users. Many non-profit legitimate mass e-mailers would probably have to abandon their newsletters due to the sending cost. Historically, spammers have been able to deceive most of the other anti-spam filters, and this could also be the case with the stamp system.

3.10 Legal measures

In recent years many nations have introduced anti-spam laws. In December 2003, President George W. Bush signed the CAN-SPAM 15 act, the Controlling the Assault of Non-Solicited Pornography and Marketing Act. The law prohibits the use of forged header information in bulk commercial e-mail. It also requires spam to include opt-out instructions. Violations can result in fines of $250 per e-mail, capped at $6 million. In April 2004 the first four spammers were charged under the CAN-SPAM law. The trial is still on, but if the court manages to send out a strong message, this could deter some spammers. The European Union introduced an anti-spam law on the 31st of October 2003 called the Directive on Privacy and Electronic Communications. This new law requires that companies gain consent before they send out commercial e-mails. Many argue that this law is toothless, since most of the spam comes from outside the EU. In the long run legislation can be used to slow down the spam flood to some extent, but it will require an international movement. Legislation will not be able to solve the spam problem by itself, at least not in the near future.

3.11 Conclusion

The most commonly used methods for eliminating spam were described in this chapter. Perhaps legislation is the best option in the long run. However, it requires a worldwide effort and this process could be slow. Presently users need to protect themselves, and for the moment statistical filters are the most promising method for this purpose. They have superior performance, can adapt automatically as spam changes and in many cases are computationally efficient.

14 Hashcash, http://www.hashcash.org/
15 Information about spam laws can be found at http://www.spamlaws.com/.

4 Statistical Classifiers

A classifier's task is to assign a pattern to its class. The pattern can be a speech signal, an image or simply a text document. For example, in spam classification the classifier would assign a message to either the spam or the legitimate class. Historically, rule-based classifiers were mainly used until the end of the 1980s. Rule-based classifiers are simple but require classification rules to be written. Writing rules for high accuracy is difficult and time consuming. By the end of the 1980s, when computers were becoming more efficient, statistical classifiers started to emerge. Statistical classifiers use machine learning to build the classifier from previously labeled (the class is known) training data. For example, a statistical spam classifier is trained on labeled legitimate and spam messages, and a speech recognition classifier is trained on different labeled voices. The classifier uses characteristics of the pattern to classify it into one of several predefined classes. Any such characteristic can be referred to as a feature.

4.1 Features and classes

A feature is any characteristic, aspect, quality or attribute of an object. For example, the eye color of a person or the words in a text document are features. A good feature is one that is distinctive for the class of the object. For example, the word Viagra is found in many spam messages but not in many legitimate ones, hence it is a good feature. In most cases many features make the classification more accurate. The combination of $n$ features can be represented as an $n$-dimensional vector, called a feature vector. The feature vector is defined as $F = \{f_1, f_2, \ldots, f_n\}$ where $f_i$, $1 \le i \le n$, is a feature. The $n$-dimensionality of the feature vector is called the feature space. By examining a feature vector, the classifier's task is to determine its class. If $m$ is the number of classes, then the class vector is defined as $C = \{c_1, c_2, \ldots, c_m\}$, where $c_k$, $1 \le k \le m$, is a unique class.

4.2 Text categorization

Text categorization is the problem of classifying text documents into a category or class. Text categorization is becoming more popular as the amount of digital textual information grows. The problem of classifying an e-mail message as a spam or a legitimate message can be considered a text categorization problem. Another popular area of use is categorization of Web pages into hierarchical catalogues. Statistical text classifiers can be divided into two categories: generative and discriminative. The generative approach uses an intermediate step to estimate parameters, while the discriminative models the probability of a document belonging to a class directly. There are arguments for using discriminative methods instead of involving the intermediate step of generative approaches. Recent studies (Ng and Jordan 2001) have shown that the performances of generative and discriminative approaches are highly dependent on the size of the corpus training data. There are many statistical filters in the literature. An extensive study (Yang Y. and Liu X, 1998) compared several filters including Support Vector Machines (SVM), k-Nearest-Neighbor (kNN),

Neural Networks (NNet) and Naive Bayesian (NB). NB is the only generative algorithm of these four. A brief introduction to these classifiers follows.

SVM (Vapnik 1995) separates two classes with vectors that pass through training data points. The separation is measured as the distance between the support vectors and is called the margin. The time involved in finding support vectors that maximize the margin is, in the worst-case scenario, quadratic. SVMs have shown promising results concerning text categorization problems in several studies (Yang Y. & Liu X, 1998). A recent study (Androutsopoulos 2004) demonstrated that its performance was good with reference to the spam domain.

Another classifier, k-Nearest Neighbor (kNN), maps a document to features and measures the similarity to the k nearest training documents. Scores are created for each of the classes of the k nearest documents based on the similarity. The document is then classified as the class with the greatest similarity. This approach has been available for over four decades and has proved to be the top performer on the Reuters corpus (topic classification of text documents).

Neural Networks (NNet) are commonly used in pattern analysis and have been applied to text categorization by Wiener et al. 1995 & Yang Y. and Liu X, 1998. In a study (Chen D. et al.) the NNet was outperformed by NB. NNets are expensive to train and memory consuming as the number of features grows.

Among the described classifiers, the NB classifier is the simplest in terms of its ease of implementation. When compared to the others, it is also computationally efficient. Tests carried out by Yang Y. and Liu X, 1998 showed that NB underperformed the others. Another study (Androutsopoulos 2000a) shows that NB outperforms kNN. For this work NB is used as the classifier, not only for its simplicity and computational efficiency, but also because of a belief that with a good probability estimator and careful feature selection it does not necessarily under-perform discriminative methods. In the rest of this section the basic elements of the theory relevant to NB will be clarified by using simple examples from everyday life. The detailed description of NB and its elements will follow in the next chapter.

4.3 Basics about Probability Theory

The probability that an event $X$ occurs is a number that can be obtained by dividing the number of times $X$ occurs by the total number of events. The probability is always between 0 and 1, or it can be expressed as a percentage. For example, the probability of a six-sided die showing 6 is $P(6) = \frac{1}{6}$, or approximately equal to 16.67%.

Two events are independent if they do not affect each other's probabilities. For example, the events tossing a coin and rolling a die are independent, because the probability of the coin landing on its head is not affected by the probability of rolling a six on a die.

For independent events, the probability of both occurring is called a joint probability and it is calculated as the product of the individual probabilities. The probability for event $X$ is $P(X)$ and for event $Y$ is $P(Y)$. If $X$ and $Y$ are independent, then their joint probability is expressed as

$$P(X \wedge Y) = P(X) \cdot P(Y). \qquad (1)$$

Using the example with the coin and the die, the probability that the coin lands on head and a six is rolled is

$$P(HEAD) \cdot P(6) = P(HEAD \wedge 6) = \frac{1}{2} \cdot \frac{1}{6} = \frac{1}{12}.$$

Events are dependent when their probabilities do affect each other. In this context conditional probabilities are defined and calculated. For example, the probability of drawing a heart from a complete card deck is 25%. If a second card is drawn without reinserting the first card, the probability of that card being a heart is lower, since one card has already been removed. Therefore, the probability of drawing the second card is dependent on the first one. Conditional probability is denoted as $P(Y \mid X)$, which is read as the probability that $Y$ occurs given that $X$ has occurred. Conditional probability is formally defined as

$$P(Y \mid X) = \frac{P(X \wedge Y)}{P(X)}. \qquad (2)$$

If the events $X$ and $Y$ are independent, then the conditional probability for $Y$, given that $X$ has occurred, is equal to the probability of $Y$. This result can be easily obtained by substituting (1) into (2):

$$P(Y \mid X) = \frac{P(X \wedge Y)}{P(X)} = \frac{P(X) \cdot P(Y)}{P(X)} = P(Y). \qquad (3)$$

The unconditional (prior) probability of an event $X$, $P(X)$, is the probability of the event before any evidence is presented. The evidence is the perception that affects the degree of belief in an event. The conditional probability of an event is the probability of the event after the evidence is presented. For example, let the event that a person has an anti-virus program be $V$ and the event that a person has a spam-filter be $S$. If there is evidence that 60% of all people have an anti-virus program and that 20% of all people have both a spam-filter and an anti-virus program, then the probability of a person having a spam-filter, given that he/she has an anti-virus program, can be calculated as follows:

$$P(S \mid V) = \frac{P(V \wedge S)}{P(V)} = \frac{0.20}{0.60} = \frac{1}{3}.$$

The spam classification in NB is based upon Bayes theorem, which defines the relationship between the conditional probabilities of two events.

4.4 Bayes theorem

Bayes theorem provides a way to calculate the probability of a hypothesis, here the event $Y$, given the observed training data, here represented as $X$:

$$P(Y \mid X) = \frac{P(X \mid Y) \, P(Y)}{P(X)}. \qquad (4)$$

This simple formula has enormous practical importance in many applications. It is often easier to calculate the probabilities $P(X \mid Y)$, $P(Y)$, $P(X)$ when it is the probability $P(Y \mid X)$ that is required. This theorem is central to Bayesian statistics, which calculates the probability of a new event on the basis of earlier probability estimates derived from empirical data. The following section explains the different ways of performing statistical analyses using the classical and the Bayesian statistics.

4.5 Classical vs. Bayesian statistics

4.5.1 Using statistics

Statistics is used to draw conclusions from data and to predict the future, in order to answer research questions such as "Is there a relationship between a student's IQ and height?". To answer such a question, students' IQ and height data must be collected. This can be achieved by performing experiments. For example, an IQ test is given and the height of each student is recorded. A plot is made of IQ versus height and it is then possible to detect whether or not a correlation exists. Statistical tests can be applied to answer research questions, to confirm or reject a certain hypothesis. There are two essential statistical methods: classical (or frequentist) (Hinton R. 2004) and Bayesian (Bullard F. 2001 & Heckerman D. 1995 & Lee P. 2004).

4.5.2 Using statistics

Consider the scores of one hundred students who have taken a test. Each test is marked with a score between zero and fifty. The collected data is somewhat uninformative as it is merely a list of numbers. To improve the presentation it is possible to add up the number of people who achieved the same mark. This is called the frequency for each mark. For example, 3 people scored one mark, 10 scored another, etc. This information can now be represented as a histogram, where the mark is assigned to the x-axis and the number of students with that mark along the y-axis. This presentation is called the frequency distribution. Frequency distributions are important in statistical analysis as they provide an informative representation of the data. Statistical tests can be applied to frequency distributions to answer research questions, to confirm or reject a certain hypothesis.

4.5.3 Objective and subjective probabilities

In classical statistics all attention is devoted to the observed data, the frequencies, which are generally collected from repeated trials. For example, in the case where the research question consists of deciding whether a particular die is biased or not, multiple rolls are required to obtain sufficient data. The data is then used as evidence to determine whether the observed results are

significantly different from those expected for a non-biased die. This can be used as evidence that the die is biased.

Now consider the scenario where a Casino employee, who is an expert on biased dice, is present and claims that there is a 98% certainty that a particular die is biased. However, this additional information does not benefit the test, as such subjective degrees of belief are ignored in classical statistics. As opposed to classical, Bayesian statistics takes a subjective degree of belief into account, the prior data. It allows us to use the information offered by the expert in our prediction as to whether or not the die is biased. In fact, many experts could be consulted and asked for their opinions. With Bayesian statistics it is possible to combine subjective probabilities with the collected data to obtain the probability of the die being biased.

4.5.4 Inference differences

Another difference between classical and Bayesian statistics is how their inference is performed. In classical statistics an initial assumption or hypothesis about the research question is first made. It is usually called a null hypothesis. A single or several alternative hypotheses can also be defined. Then the relevant evidence or data are collected. This evidence measures how different the observed results are from those expected if the null hypothesis were true. The measurement is given in terms of a calculated probability called the p-value. It is the probability of obtaining the observation found in the collected data, or other observations which are even more extreme. The significance level is the degree of certainty that is required in order to reject the null hypothesis in favor of the alternative. A typical significance level of 5% is usually used; the notation is α = 0.05. For this significance level, the probability of incorrectly rejecting the null hypothesis when it is actually true is 0.05. If higher protection is needed, a lower α can be selected. Once the significance level is determined and the p-value is calculated, the following conclusion is drawn: if the probability of observing the actual data under the null hypothesis is small (< α), the null hypothesis is not true and it can be rejected. This means that the alternative hypothesis is accepted. The converse is not true: if the p-value is big (> α), then there is only insufficient evidence to reject the null hypothesis.

For example, let the null hypothesis, "the die is unbiased", be assumed to be true. If the die is rolled many times and 70% of all outcomes are sixes, the statistical test will calculate the probability of obtaining a six in 70% of outcomes or more (the p-value). The probability distribution for the outcomes observed using an unbiased die is used for this calculation. If the p-value is lower than 0.05 then, according to classical statistics, the null hypothesis can be rejected and the die will be considered to be biased.

In Bayesian statistics, however, a probability is really an estimate of a belief in a particular hypothesis. The belief that a six occurs once in every six rolls of the die comes from both prior considerations about a fair die and the empirical results that have been observed in the past. Bayesian statistics evaluates the probability of a six by taking the previously collected data into consideration. For many researchers this approach is more intuitive than the inference of classical statistics.
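The die example can be written out as a standard formalization, assuming a binomial model for the roll counts (the thesis text does not name the distribution). Under the null hypothesis of a fair die, the number of sixes $X$ in $n$ rolls is binomial with success probability $\frac{1}{6}$, so the p-value for observing $k$ or more sixes (here, 70% of the rolls) is

$$p = P(X \ge k \mid H_0) = \sum_{j=k}^{n} \binom{n}{j} \left(\frac{1}{6}\right)^{j} \left(\frac{5}{6}\right)^{n-j},$$

and the null hypothesis is rejected whenever $p < \alpha = 0.05$.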

4.5.5 Example of statistical spam classification

In order to implement either the classical or the Bayesian statistics to classify an e-mail message as spam, some data must be available. This usually occurs as a collection of e-mail messages labeled as spam or non-spam, referred to as a corpus (training data). In addition, the information about the frequency distribution of the tokens in the corpus is available. This means that every token is accompanied by the number of times it has been seen in the corpus.

4.5.5.1 Classical statistics

When testing a new e-mail message the starting point is the null hypothesis, stated as "the message is spam". The alternative hypothesis is "the message is non-spam". The significance level is chosen to be 5%, or α = 0.05. To classify a message as spam, a frequency distribution of its tokens is created and compared to the previous training data (spam corpus) with an appropriate statistical test (χ² tests can be used to analyze frequency data). The statistical test will give a probability (p-value) and if it is lower than the significance level, the null hypothesis is rejected and the message is concluded not to be spam. Otherwise the null hypothesis is accepted.

4.5.5.2 Bayesian statistics

Bayesian probabilistic reasoning has been used in machine learning since the 1960s, especially in medical diagnosis. It was not until 1998 that Sahami et al. applied a Bayesian approach to classify spam. With Bayesian statistics the probability of a model based on the data is calculated, as opposed to classical statistics, which calculates the probability of the data given a hypothesis (model). To illustrate this, consider the previous example where classical statistics was used to verify the null hypothesis (the message being spam): either the message is classified as spam or not. Bayesian statistics instead calculates the probability of a message being spam. For example, Bayesian statistics can calculate that a message has an 8% chance of being spam.
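To make the Bayesian route concrete, the sketch below (in C++, the language of the thesis's test environment) computes the probability that a message containing a given token is spam directly from corpus frequencies via Bayes theorem (4). The function name, the raw maximum-likelihood estimates and the counts in main are illustrative assumptions; chapter 5 replaces the raw frequency ratios with smoothed estimators.

    #include <iostream>

    // P(spam | token) from training-corpus counts via Bayes theorem (4):
    //   P(S|t) = P(t|S)P(S) / ( P(t|S)P(S) + P(t|G)P(G) ).
    // The conditional probabilities are raw maximum-likelihood estimates
    // (frequency / total); chapter 5 discusses better (smoothed) estimators.
    double spamProbability(double tokenInSpam, double totalSpam,
                           double tokenInGood, double totalGood) {
        const double pS = totalSpam / (totalSpam + totalGood);  // prior P(spam)
        const double pG = 1.0 - pS;                             // prior P(good)
        const double pTgivenS = tokenInSpam / totalSpam;        // P(token | spam)
        const double pTgivenG = tokenInGood / totalGood;        // P(token | good)
        return pTgivenS * pS / (pTgivenS * pS + pTgivenG * pG);
    }

    int main() {
        // Invented counts: the token appears in 50 of 1000 spam messages
        // and in 2 of 1000 legitimate messages.
        std::cout << spamProbability(50, 1000, 2, 1000) << "\n";  // ~0.96
    }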

5 Naive Bayesian Spam Filtering

5.1 The model

A general Naive Bayesian spam filter can be conceptualized into the model presented in Figure 1. It consists of four major modules, each responsible for one of four different processes: message tokenization, probability estimation, feature selection and Naive Bayesian classification.

Figure 1. A model of Naive Bayesian spam filtering. An incoming message (e-mail) passes through message tokenization, probability estimation and feature selection into the Naive Bayesian classifier, which either classifies it as spam (remove message / tag subject) or as legitimate (process message as usual).

When a message arrives, it is first tokenized into a set of features (tokens), $F$. Every feature is assigned an estimated probability that indicates its spaminess. To reduce the dimensionality of the feature vector, a feature selection algorithm is applied to output a subset of the features, $F' \subseteq F$. The Naive Bayesian classifier combines the probabilities of every feature in $F'$ and estimates the probability of the message being spam.

In the following text, the process of Naive Bayesian classification is described first, followed by details concerning the measurement of performance. This order of explanation is necessary because the sections concerned with the first three modules require an understanding of the classification process and of the parameters used to evaluate its improvement.
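To make the data flow of Figure 1 concrete, the sketch below wires the four modules into a single pass over a message. Every step is a deliberately simplified stand-in, not the thesis's implementation: the tokenizer splits on white-space only, unseen tokens get a neutral probability, the "selection" step merely drops near-neutral tokens, and the combination step assumes equal class priors, in which case the full formula (9) of the next section reduces to a product of probability ratios. All names and numbers are invented.

    #include <cmath>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    // Figure 1 as one pass: tokenize -> estimate -> select -> classify.
    bool classify(const std::string& message,
                  const std::map<std::string, double>& spaminess,  // from training
                  double threshold) {
        // 1. Message tokenization (white-space delimiters only).
        std::istringstream in(message);
        std::vector<std::string> features;
        for (std::string t; in >> t;) features.push_back(t);

        // 2. Probability estimation: look up each token's spam probability;
        //    unseen tokens are given a neutral 0.5 here.
        // 3. Feature selection: keep only tokens far from neutral.
        // 4. Naive Bayesian combination assuming equal class priors, so the
        //    posterior reduces to a product of probability ratios.
        double odds = 1.0;
        for (const std::string& f : features) {
            auto it = spaminess.find(f);
            double p = (it == spaminess.end()) ? 0.5 : it->second;
            if (std::abs(p - 0.5) < 0.1) continue;   // crude selection step
            odds *= p / (1.0 - p);
        }
        double pSpam = odds / (1.0 + odds);          // P(spam | selected features)
        return pSpam > threshold;
    }

    int main() {
        const std::map<std::string, double> spaminess = {
            {"viagra", 0.95}, {"meeting", 0.05}};    // invented training output
        std::cout << classify("buy viagra now", spaminess, 0.9) << "\n";   // 1
        std::cout << classify("meeting at noon", spaminess, 0.9) << "\n";  // 0
    }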

5.2 Naive Bayesian Classifier

In terms of a spam classifier, Bayes theorem (4) can be expressed as

$$P(C \mid F) = \frac{P(F \mid C) \, P(C)}{P(F)} \qquad (5)$$

where $F = \{f_1, \ldots, f_n\}$ is a set of features and $C = \{good, spam\}$ are the two classes. When the number of features, $n$, is large, computing $P(F \mid C)$ can be time consuming. Alternatively, it can be assumed that the features, which are usually words appearing in the e-mail message, are independent of each other. This assumption is, however, not true, as words in e-mails are not independent. For example, an e-mail with the word Viagra is likely to co-occur with purchase. However, even though the independence assumption is not true, the classifier works well, at least in the spam domain. One argument (Domingos & Pazzani 1996) is that with the independence assumption the classifier produces poor probabilities, but the ratio between them is approximately the same as would occur using conditional probabilities. Using this somewhat naive independence assumption gave birth to its name: the Naive Bayesian classifier.

Using the assumption of independence, according to (1), the joint probability for all $n$ features can be obtained as the product of the individual probabilities:

$$P(F \mid C) = \prod_{i=1}^{n} P(f_i \mid C). \qquad (6)$$

Inserting (6) into (5) yields

$$P(C \mid F) = \frac{P(C) \prod_{i=1}^{n} P(f_i \mid C)}{P(F)}. \qquad (7)$$

The denominator $P(F)$ is the probability of observing the features in any message and can be expressed as

$$P(F) = \sum_{k=1}^{m} P(C_k) \prod_{i=1}^{n} P(f_i \mid C_k). \qquad (8)$$

Inserting (8) into (7), the formula used by the Naive Bayesian classifier is obtained:

$$P(C \mid F) = \frac{P(C) \prod_{i=1}^{n} P(f_i \mid C)}{\sum_{k=1}^{m} P(C_k) \prod_{i=1}^{n} P(f_i \mid C_k)}. \qquad (9)$$

The formal representation (9) may appear to be complicated. However, if $C = spam$ then (9) can basically be read as: the probability of a message being spam given its features equals the probability of any message being spam, multiplied by the probability of the features co-occurring in a spam, divided by the probability of observing the features in any message.

To determine whether or not a message is spam, the probability given by (9) is compared to a threshold value, $t = \frac{\lambda}{1 + \lambda}$. If $P(C = spam \mid F) > t$ then the message is classified as spam. For example, when $\lambda = 9$ then $t = 0.9$, meaning that blocking one legitimate message is of the same order as allowing 9 spam messages to pass.
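Expressed in code, (9) is a few lines per class. The sketch below is a minimal, illustrative implementation: the products in (9) are accumulated as sums of logarithms to avoid numerical underflow on long messages (a standard implementation trick not discussed in the text above), and the feature probabilities are assumed to come from the estimators of section 5.5.

    #include <algorithm>
    #include <cmath>
    #include <iostream>
    #include <vector>

    // One class of equation (9): the prior P(C_k) and the conditional
    // probabilities P(f_i | C_k) for the message's feature vector.
    struct ClassModel {
        double prior;
        std::vector<double> featureProb;
    };

    // Equation (9), computed in log space so that the long products over
    // P(f_i | C) do not underflow double precision on large messages.
    std::vector<double> posteriors(const std::vector<ClassModel>& classes) {
        std::vector<double> logJoint;
        for (const ClassModel& c : classes) {
            double lj = std::log(c.prior);                     // log P(C_k)
            for (double p : c.featureProb) lj += std::log(p);  // + sum log P(f_i|C_k)
            logJoint.push_back(lj);
        }
        // Normalize by P(F), i.e. the denominator (8) of equation (9).
        const double maxLog = *std::max_element(logJoint.begin(), logJoint.end());
        double z = 0.0;
        for (double lj : logJoint) z += std::exp(lj - maxLog);
        std::vector<double> post;
        for (double lj : logJoint) post.push_back(std::exp(lj - maxLog) / z);
        return post;
    }

    int main() {
        // Two classes {good, spam}, three features; all probabilities invented.
        const ClassModel good{0.5, {0.01, 0.20, 0.30}};
        const ClassModel spam{0.5, {0.09, 0.25, 0.40}};
        const double t = 0.9;                        // lambda = 9  =>  t = 0.9
        double pSpam = posteriors({good, spam})[1];
        std::cout << "P(spam|F) = " << pSpam
                  << (pSpam > t ? " -> spam\n" : " -> legitimate\n");
    }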

5.3 Measuring the performance

The meaning of a good classifier can vary depending on the domain in which it is used. For example, in spam classification it is very important not to classify legitimate messages as spam, as this can lead to e.g. economic or emotional suffering for the user.

5.3.1 Precision and recall

A well employed metric for performance measurement in information retrieval is precision and recall. These measures have been diligently used in the context of spam classification (Sahami et al. 1998). Recall is the proportion of relevant items that are retrieved, which in this case is the proportion of spam messages that are actually recognized. For example, if 9 out of 10 spam messages are correctly identified as spam, the recall rate is 0.9. Precision is defined as the proportion of items retrieved that are relevant. In the spam classification context, precision is the proportion of spam messages classified as spam over the total number of messages classified as spam. Thus, if only spam messages are classified as spam, then the precision is 1. As soon as a legitimate message is classified as spam, the precision will drop below 1. Formally:

Let $n_{gg}$ be the number of good messages classified as good (true negatives).
Let $n_{gs}$ be the number of good messages classified as spam (also known as false positives).
Let $n_{ss}$ be the number of spam messages classified as spam (also known as true positives).
Let $n_{sg}$ be the number of spam messages classified as good (false negatives).

Let n_ss be the number of spam messages classified as spam (also known as true positives).
Let n_sg be the number of spam messages classified as good (also known as false negatives).

The precision (p) and recall (r) are defined as

    p = \frac{n_{ss}}{n_{ss} + n_{gs}} = \frac{1}{1 + n_{gs}/n_{ss}}    (10)

    r = \frac{n_{ss}}{n_{ss} + n_{sg}} = \frac{1}{1 + n_{sg}/n_{ss}}    (11)

The precision reflects the occurrence of false positives, i.e. good messages classified as spam. When this happens, p drops below 1. Such misclassification can be a disaster for the user, whereas the only impact of a low recall is receiving spam messages in the inbox. Hence it is more important for the precision to be high than for the recall.

Precision and recall reveal little unless used together. Commercial spam filters sometimes claim an incredibly high precision, such as 0.9999, without mentioning the corresponding recall. This can appear very good to the untrained eye. A reasonably good spam classifier should have a precision very close to 1 and a recall above 0.8.

A problem when evaluating classifiers is finding a good balance between precision and recall, so a strategy is needed to obtain a combined score. One way to achieve this is weighted accuracy.

5.3.2 Weighted accuracy

To reflect the difference between misclassifying a good message and misclassifying a spam message, a cost-sensitive evaluation (Androutsopoulos et al. 2004) is used to measure the performance of the classifier. The weighted accuracy of a classifier is defined as

    W_{Acc} = \frac{\lambda \, n_{gg} + n_{ss}}{\lambda \, n_g + n_s}    (12)

where n_g is the total number of good messages and n_s is the total number of spam messages. λ is the weight of each good message: each misclassified good message counts as λ misclassified spam messages, and each correctly classified good message counts as λ successful classifications. Spam messages are treated as single messages. Weighted accuracy thus treats legitimate messages as λ times more important than spam.
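To make the metrics (10)-(12) concrete, here is a minimal sketch in C++ (the implementation language of this thesis). The struct, function names and example counts are illustrative assumptions, not the thesis code:

```cpp
#include <iostream>

// Confusion-matrix counts, named after the definitions above.
struct Counts {
    double n_gg; // good classified as good (true negatives)
    double n_gs; // good classified as spam (false positives)
    double n_ss; // spam classified as spam (true positives)
    double n_sg; // spam classified as good (false negatives)
};

// Precision (10) and recall (11).
double precision(const Counts& c) { return c.n_ss / (c.n_ss + c.n_gs); }
double recall(const Counts& c)    { return c.n_ss / (c.n_ss + c.n_sg); }

// Weighted accuracy (12): good messages are weighted lambda times.
double weightedAccuracy(const Counts& c, double lambda) {
    double n_g = c.n_gg + c.n_gs; // total good messages
    double n_s = c.n_ss + c.n_sg; // total spam messages
    return (lambda * c.n_gg + c.n_ss) / (lambda * n_g + n_s);
}

int main() {
    Counts c{990, 10, 850, 150}; // made-up counts for illustration
    std::cout << "p = " << precision(c)
              << ", r = " << recall(c)
              << ", WAcc(lambda=9) = " << weightedAccuracy(c, 9) << "\n";
}
```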

5.3.3 Cross validation

There are several ways to estimate how well the classifier works after training. The easiest and most straightforward is to split the corpus into two parts, using one part for training and the other for testing. This is called the holdout method. Its disadvantage is that the evaluation depends heavily on which samples end up in which set. A method that reduces the variance of the holdout method is k-fold cross-validation. In k-fold cross-validation (Kohavi 1995) the corpus, M, is split into k mutually exclusive parts, M_1, M_2, ..., M_k. The inducer is trained on M \ M_i and tested against M_i. This is repeated k times with different i such that i ∈ {1, 2, ..., k}. Finally, the performance is estimated as the mean over the k tests. For a k-fold test the precision p and the recall r are defined as

    p = \frac{1}{k} \sum_{i=1}^{k} p_i  \quad\text{and}\quad  r = \frac{1}{k} \sum_{i=1}^{k} r_i    (13 and 14)

where p_i and r_i are the precision (10) and recall (11) for each of the k tests. Research has shown that k = 10 is a satisfactory choice (Breiman & Spector 1992; Kohavi 1995); therefore 10-fold cross-validation was used throughout the experiments in this thesis.

5.3.4 Benchmark corpora

One problem when benchmarking a classifier is that results differ depending on the test corpus. It is easy to optimize a classifier's parameters to favor a given corpus and achieve impressive results; applied to another corpus, the same classifier may perform poorly. It is therefore important to use more than one corpus source when testing a particular classifier. It is also preferable that corpora are publicly available, in case another researcher wants to compare or reproduce the results.

5.4 Message tokenization

Message tokenization is the process of tokenizing an e-mail message using carefully selected delimiters. The selection of delimiters affects the classification accuracy. Traditionally, selection is performed manually by trial and error. This thesis examines the possibility of automatic delimiter selection. The problem of finding high-performance delimiters is an instance of the feature subset problem; in this case the features are the delimiters. Few attempts have been made to automatically find a good delimiter subset for spam classification. In fact, the author is only aware of one previous attempt (Bevilacqua-Linn 2003), which tried two different search methods, a hill-climber and a genetic algorithm. Both search methods proved too time-consuming and the experiment was abandoned.

Since the dimensionality tends to be astronomically large, exhaustive search is not feasible. In this work Sequential Forward Floating Selection (SFFS) is implemented to find a good delimiter subset. There are two common schemes for applying a search algorithm, called filters and wrapper induction; both approaches are examined and implemented in this thesis. For each method a fitness function is introduced. The fitness for the filter approach uses the discriminative power of the tokenized data. The wrapper fitness estimates the delimiter performance by running the inducer, i.e. it measures the performance of the delimiters by running the classifier on test data. This section begins by presenting some definitions, after which it explains how delimiters interact. This explanation is necessary for the choice of search algorithm. Finally, filters and wrappers are presented.

5.4.1 Definitions

The alphabet A = {a_1, a_2, ..., a_n} is a finite set of symbols. Each symbol is a character. For example, the English alphabet is E = {a, b, ..., z} and the binary alphabet is B = {0, 1}. E-mail messages use the ASCII character code. The alphabet used is A_ASCII = {char(0), char(1), ..., A, B, ..., char(255)}, where char(i), 0 ≤ i ≤ 255, is the character corresponding to integer i.

A string X over A is a sequence of symbols from the alphabet A. This is denoted by X_A = <s_1, s_2, ..., s_n> : s_i ∈ A. For example, let the alphabet A = {a, m, p, s}; then a string over A could be X_A = spam. Every e-mail message is a string over the alphabet A_ASCII. The cardinality of a string X is denoted by |X| = n. For example, if X = spam, then |X| = 4. The cardinality of the alphabet A_ASCII is |A_ASCII| = 256.

A delimiter set, D, for a string X over the alphabet A is a subset of A. For example, if the string is X = <Dear Friend how are you?>, then a delimiter set for X could be D = {space, question-mark}.

A tokenizer T is a function of a string X and a delimiter set D such that Q = T(X, D), where the production Q = {x_1, x_2, ..., x_n} : x_i is a substring between two delimiters in X. For example, with the string X = <Dear Friend how are you?> and the delimiter set D = {space}, T(X, D) = {Dear, Friend, how, are, you?}.

A label set L = {l_1, ..., l_m} is a finite set of labels l_i, 1 ≤ i ≤ m. The label set used for e-mail messages has only two labels, namely L = {good, spam}.

A classifier C is a function that maps a message X to a label l. This is defined as C(X) = l ∈ L.

Let M_l = {m_1, m_2, ..., m_n} : C(m_i) = l be a collection of previously labeled messages which all have the same label.
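A minimal sketch of the tokenizer T(X, D) defined above — a hypothetical helper for illustration, not the thesis implementation:

```cpp
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Tokenizer T(X, D): split the string X on any character in the delimiter
// set D and return the productions Q.
std::vector<std::string> tokenize(const std::string& X, const std::set<char>& D) {
    std::vector<std::string> Q;
    std::string token;
    for (char c : X) {
        if (D.count(c)) {                    // c is a delimiter: close token
            if (!token.empty()) Q.push_back(token);
            token.clear();
        } else {
            token += c;                      // c belongs to the current token
        }
    }
    if (!token.empty()) Q.push_back(token);
    return Q;
}

int main() {
    std::set<char> D{' '};                   // D = {space}
    for (const auto& t : tokenize("Dear Friend how are you?", D))
        std::cout << t << "\n";              // Dear, Friend, how, are, you?
}
```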

5.4.2 Delimiter Interaction

An intuitive model of delimiter interaction is a diagram in which the tokenizer productions are plotted. Let the y-axis be the probability of spam (spamminess) of each token (x-axis). The spamminess of a token is calculated from its frequencies in the training data. (This is called the probability estimator and is examined in the next chapter.) If no delimiters are used, all tokens are likely to occur in the middle of the diagram, where they have a 50% probability of being spam- or good-related. As new delimiters are introduced, the tokens start to separate in the diagram, as if attracted by magnets pulling them towards either the top (high spam probability) or the bottom (low spam probability). More separation in both directions usually means a better delimiter subset.

Clearly, using no delimiters at all allows for a very high precision and a very low recall. As new delimiters are added, the recall may increase up to a certain level before the precision starts to fall. The aim is to choose delimiters that maximize the recall while maintaining a high precision. One reason for the drop in precision is that information is removed from the messages as the number of delimiters increases. This is called information loss. Information loss contributes to the destruction of patterns that distinguish one class from another. For example, in unigram character tokenization every character is treated as a token. This separation of characters removes all patterns, the information loss is high, and the result is low classifier performance.

The interaction between delimiters for a message tokenizer is highly complex: the relation between delimiters is not transitive, and it also suffers from non-monotonicity. To demonstrate the non-transitivity and non-monotonicity of delimiter interaction, consider the following simple example. Let the alphabet be A = {a, b, c, d} and let the messages be X = abcd and Y = bcad, with labels L(X) = good and L(Y) = spam. Let D be the delimiter set used by the tokenizer T. Table 1 shows the productions Q_1 = T(X, D) and Q_2 = T(Y, D). Clearly, a tokenizer that produces different productions depending on whether the message is good or spam is more useful than one that does not. This intuition can be used to demonstrate the non-transitive property. A simple criterion (fitness) function J(D) = |Q_1 \ Q_2| + |Q_2 \ Q_1| is created by counting the number of elements that differ between Q_1 and Q_2.

Table 1. Illustration of the non-transitive relationship between delimiters.

    D         Q_1      Q_2        J(D)
    {a}       {bcd}    {bc, d}    3
    {a, b}    {cd}     {c, d}     3
    {b, c}    {a, d}   {ad}       3
    {a, c}    {b, d}   {b, d}     0
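The toy criterion function above can be reproduced directly. A minimal, self-contained sketch (the helper names are assumptions for illustration):

```cpp
#include <iostream>
#include <set>
#include <string>

// Tokenize s on the delimiter set D, collecting distinct productions.
std::set<std::string> tokens(const std::string& s, const std::set<char>& D) {
    std::set<std::string> Q;
    std::string t;
    for (char c : s) {
        if (D.count(c)) { if (!t.empty()) Q.insert(t); t.clear(); }
        else t += c;
    }
    if (!t.empty()) Q.insert(t);
    return Q;
}

// J(D) = |Q1 \ Q2| + |Q2 \ Q1| from Table 1: count productions unique
// to the good message X or to the spam message Y.
int J(const std::string& X, const std::string& Y, const std::set<char>& D) {
    auto Q1 = tokens(X, D), Q2 = tokens(Y, D);
    int diff = 0;
    for (const auto& t : Q1) if (!Q2.count(t)) ++diff; // |Q1 \ Q2|
    for (const auto& t : Q2) if (!Q1.count(t)) ++diff; // |Q2 \ Q1|
    return diff;
}

int main() {
    std::cout << J("abcd", "bcad", {'a'}) << "\n";      // 3, as in Table 1
    std::cout << J("abcd", "bcad", {'a', 'c'}) << "\n"; // 0, as in Table 1
}
```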

5.4.2.1 Non-transitivity

Let a beneficial relationship between two features be denoted by ⊕. A transitive relation would mean a ⊕ b ∧ b ⊕ c ⟹ a ⊕ c, read as: delimiter a is of benefit to the classification together with b, b is of benefit together with c, therefore a is of benefit together with c. As Table 1 shows, the delimiter sets {a, b} and {b, c} are beneficial to the classification task, since the productions of T are distinguishable. A transitive property would imply that the production of {a, c} is also beneficial, but this is not the case: a ⊕ b ∧ b ⊕ c, yet not a ⊕ c. This illustrates that delimiter interaction is non-transitive.

5.4.2.2 Non-monotonicity

Monotonicity is the property that whenever a new feature is added, the criterion function J(D) improves. In mathematical notation: given two sets D_1 and D_2, D_1 ⊂ D_2 ⟹ J(D_1) < J(D_2). However, this does not hold for most criterion functions. Considering the previous example, {a} ⊂ {a, c}, yet J({a}) = 3 > J({a, c}) = 0. The non-monotonicity of delimiter interaction has thus been shown for the simple J(D).

5.4.3 Dimensionality reduction in the search for a good delimiter subset

The order of delimiters is of no importance and neither are repetitions; hence the exhaustive search space for a delimiter subset of size s over an alphabet of cardinality n is \binom{n}{s}, and the dimensionality over all subsets is 2^n. The growth is combinatorial, so an exhaustive search would be far too slow for this purpose. Another choice is branch and bound search, which is much faster than exhaustive search but requires the monotonicity property. As shown above, delimiter subset selection suffers from non-monotonicity, which makes branch and bound algorithms inappropriate. Since delimiter interaction is non-transitive, it is also not possible to rely on transitive graphs to find maximum cliques. Other search-space reduction algorithms are genetic algorithms (Siedlecki & Sklansky 1988). Two other simple algorithms (Jain & Mao 1999), Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS), are very fast but sub-optimal. SFS begins with no features and repeatedly adds the one that maximizes the fitness function; it examines only n(n + 1)/2 subsets. SBS works in the same manner but starts with all features and removes the feature whose removal maximizes the fitness of the remainder.

To produce results closer to the optimal solution, (Pudil & Kittler 1994) improved SFS and SBS. Sequential Forward Floating Selection (SFFS) and Sequential Backward Floating Selection (SBFS) are the respective improved algorithms. SFFS works like SFS, but for every new subset it enters a backtracking loop that attempts to find a better subset than its predecessor by removing one feature at a time; this is repeated until no better subset is found, and the backtracking loop is then exited. These algorithms gave near-optimal results in

some experiments (Pudil & Kittler 1994, Jain & Zongker 1997, Ferri & Pudil 1994). The original SFFS and SBFS had one defect: they could dismiss superior subsets when switching out of the backtracking loop. An adaptive version of SFFS (Somol et al. 1999) was proposed to address this problem. The solution is to keep a record of all the best subsets; after backtracking, instead of presuming that the forward subsets are better than the previous ones, a check is performed against the record. Figure 2 shows the pseudo code for the modified SFFS. Due to these previous impressive results, the adapted version of SFFS was implemented to find a good subset of delimiters. There is a small tradeoff in speed compared with SFS: the running time may increase by a factor of 2 to 10.

Modified SFFS
Input:  A set of features, F = {f_1, f_2, ..., f_n}.
Input:  A criterion function J(F) that outputs higher values for better subsets.
Output: A feature subset S ⊆ F that maximizes J.

    k <- 0
    S_0 <- {}
    Forward loop:
    while (k < n)
        Find the feature f_i that makes arg max(J(S_k + f_i)) : f_i not in S_k
        if S_{k+1} = {} or J(S_k + f_i) > J(S_{k+1})
            S_{k+1} <- S_k + f_i
        k <- k + 1
        Backtrack loop:
            Find the feature f_i that makes arg max(J(S_k \ f_i)) : f_i in S_k
            if J(S_k \ f_i) > J(S_{k-1})
                S_{k-1} <- S_k \ f_i
                k <- k - 1
            else
                Leave backtrack loop
        End backtrack loop
    End forward loop

Figure 2. Pseudo code, inspired by (Spence & Sajda 1998), for the modified SFFS.
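The pseudo code of Figure 2 translates fairly directly into C++. The following is a minimal sketch under stated assumptions — an arbitrary caller-supplied criterion J, features as single characters, and termination relying on the strict-improvement check against the record — not the thesis implementation:

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <set>
#include <vector>

using Subset = std::set<char>;

// Modified SFFS: bestOf[k]/bestScore[k] keep the record of the best subset
// found so far for each size k, which is the adaptive modification above.
Subset sffs(const std::vector<char>& features,
            const std::function<double(const Subset&)>& J) {
    std::map<std::size_t, Subset> bestOf;
    std::map<std::size_t, double> bestScore;
    Subset S; // current subset; k = S.size()
    while (S.size() < features.size()) {
        // Forward step: add the feature that maximizes J.
        char addF = 0; double addJ = -1e300;
        for (char f : features)
            if (!S.count(f)) {
                Subset c = S; c.insert(f);
                if (double j = J(c); j > addJ) { addJ = j; addF = f; }
            }
        S.insert(addF);
        if (!bestOf.count(S.size()) || addJ > bestScore[S.size()]) {
            bestOf[S.size()] = S; bestScore[S.size()] = addJ;
        }
        // Backtrack loop: remove features while that strictly improves
        // the recorded best subset of the smaller size.
        while (S.size() > 1) {
            char dropF = 0; double dropJ = -1e300;
            for (char f : S) {
                Subset c = S; c.erase(f);
                if (double j = J(c); j > dropJ) { dropJ = j; dropF = f; }
            }
            if (dropJ > bestScore[S.size() - 1]) {
                S.erase(dropF);
                bestOf[S.size()] = S; bestScore[S.size()] = dropJ;
            } else break; // leave backtrack loop
        }
    }
    // Return the best subset recorded over all sizes.
    Subset result; double top = -1e300;
    for (const auto& [k, s] : bestOf)
        if (bestScore[k] > top) { top = bestScore[k]; result = s; }
    return result;
}
```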

5.4.4 Filters and wrappers

The two most common feature selection methods are filters and wrapper induction. The filter approach (Kohavi & John 1996) does not use the inducer to estimate the fitness of features; it only analyses the data distribution. Hence filters are free from bias and errors in the inducer and may be able to find a more general solution. Filter methods also generally run faster than wrappers. Two well-known filters are FOCUS (Almuallim & Dietterich 1991) and Relief (Kira & Rendell 1992). Figure 3 illustrates the filter procedure.

Figure 3. Illustration of the filter selection procedure: starting from the training set with all features, the search over delimiter subsets is guided by the discriminative power of each candidate feature subset on the training set; the finally selected feature subset is evaluated on the test set with the induction algorithm.

Wrapper induction (Kohavi & John 1996; Kushmerick 1997, 1999) uses the inducer as a part of the fitness function. Wrapper induction has successfully been used for information extraction from Internet resources such as stock market pages and product catalogs, which is closely related to the purpose of this thesis: their aim was to find delimiters in HTML code for better information retrieval, whereas this work attempts to find message delimiters for better spam classification. Figure 4 illustrates the wrapper model. Wrappers often produce better results than filters, but they may be less general.

Figure 4. Illustration of the wrapper selection procedure: starting from the training set with all features, the search over delimiter subsets is guided by the induction algorithm itself; the finally selected feature subset is evaluated on the test set with the induction algorithm.
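Since the two schemes differ only in how the criterion function J(D) is computed, they can share one search interface. A minimal sketch of this design, with placeholder fitness bodies standing in for the procedures of Figures 3 and 4 (not the thesis implementation):

```cpp
#include <functional>
#include <set>

using Subset = std::set<char>;
using Criterion = std::function<double(const Subset&)>;

// Filter: score a delimiter set from the data distribution alone, e.g.
// the divergence between the good and spam token distributions (see the
// KL-divergence filter in section 6.1.1).
double filterFitness(const Subset& D) {
    return 0.0; // placeholder: tokenize, estimate distributions, score
}

// Wrapper: score a delimiter set by actually running the inducer, i.e.
// training and testing the Naive Bayesian classifier under cross-validation.
double wrapperFitness(const Subset& D) {
    return 0.0; // placeholder: tokenize, train, return weighted accuracy
}
```

Either function can then be handed to the same search, e.g. `sffs(candidates, wrapperFitness)` with the SFFS sketch above.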

5.5 Probability estimation

The most straightforward approach to estimating the probability of an item is to divide its count by the total count. This estimate approaches the theoretical probability as the sample grows. If the entire population is available, an exact probability can be obtained, but in most cases, when the sample is finite, the Maximum Likelihood Estimate (MLE) will only be in the neighborhood of the true probability:

    P_{MLE}(x_i) = \frac{|x_i|}{N}    (15)

where |x_i| is the frequency of item x_i and N is the total number of training instances. The problem with this estimator is that it estimates the probability of unseen items to be zero. This is sometimes referred to as the zero-frequency problem. In a Naive Bayesian classifier, where estimates are combined from two corpora, a single zero estimate for a token could force the overall spam probability to 0. A related problem remains for rare words, those with low frequencies. For example, a probability estimated from a token occurring once in the good corpus and ten times in the spam corpus is less reliable than that of a token with 10 occurrences in the good corpus and 100 in the spam corpus: even though the combined probability is the same, the latter is more reliable.

Schemes exist that address the problems of Maximum Likelihood Estimation for sparse data; these schemes are usually said to perform smoothing. Smoothing is the method of moving probability mass from seen to unseen items. The following notation is used below.

N is the total number of training instances.
x_i is the item; |x_i| denotes its frequency.
N_j is the frequency of frequency, i.e. the number of distinct items seen exactly j times. For example, if there exist exactly three tokens which have each been seen five times, then N_5 = 3.
B is the number of bins (distinct items).

5.5.1 Absolute Estimate (p_abs)

The absolute estimate (Ney et al. 1994) subtracts a constant from the frequencies:

    P_{Abs}(x_i) = \frac{|x_i| - c}{N}    (16)

where c is a constant usually obtained from the frequencies of items seen once and twice, N_1 and N_2, such that

    c = \frac{N_1}{N_1 + 2 N_2}.    (17)

5.5.2 Laplace Estimate (p_lap)

Laplace (or add-one) is perhaps the simplest smoothing technique; it assumes that every event has been seen once before:

    P_{Lap}(x_i) = \frac{|x_i| + 1}{N + B}    (18)

where B is the number of bins (distinct items). The problem with the Laplace estimator is that it allocates too much probability mass to unseen objects (Gale & Church 1994).

5.5.3 Expected Likelihood Estimate (p_ELE)

This method assigns less probability mass to unseen objects than the Laplace estimator by adding 0.5 instead of 1 to every count:

    P_{ELE}(x_i) = \frac{|x_i| + 0.5}{N + 0.5 B}    (19)

This estimator is also referred to as the Jeffreys-Perks law.

5.5.4 Lidstone Estimate (p_Lid)

Laplace and ELE are both special cases of the Lidstone estimator, which adds δ to every count:

    P_{Lid}(x_i) = \frac{|x_i| + \delta}{N + \delta B}    (20)

where usually 0 ≤ δ ≤ 1. This estimate works in a similar manner to the Laplace estimate and ELE; the difference is that instead of a fixed constant it uses an adjustable variable, δ, to move probability mass to unseen items. Its weakness lies in finding a good δ.

5.5.5 Witten-Bell smoothing (p_WB)

Witten-Bell smoothing (Witten & Bell 1991) was originally developed for text compression. It uses the items seen once to estimate the unseen items. The total probability mass assigned to unseen items is

    P(x_i) = \frac{N_1}{N + N_1}  \quad\text{for } |x_i| = 0    (21)

where N_1 is the number of items that have been seen once. The mass is taken from the higher-frequency items such that

    p_{wb}(x_i) = \frac{|x_i|}{N + N_1}  \quad\text{for } |x_i| > 0    (22)

The probability for each unseen item is

    p_{wb}(x_i) = \frac{N_1}{Z (N + N_1)}  \quad\text{for } |x_i| = 0    (23)

where Z is the number of unseen items, i.e. B minus the number of distinct items observed in training.

5.5.6 Good-Turing Estimate (p_SGT)

Good-Turing estimators (Good 1953) use the following equations to calculate the probability of events that have been observed before:

    P_{GT}(x_i) = \frac{r^*}{N}    (24)

where

    r^* = (r + 1) \frac{E(N_{r+1})}{E(N_r)}    (25)

and E(n) is the expectation of the variable n; it estimates how many different words are seen n times. Here r is the frequency of the item, N_r is the frequency of that frequency, and r* is the new, smoothed frequency. There are many different Good-Turing estimators depending on how the E function is selected. For the experiments a version called Simple Good-Turing (Gale 1995), denoted p_SGT, was used.[6]

5.5.7 Bayesian smoothing (p_Rob)

Finally, an approach that smooths the maximum likelihood estimate by assuming a beta distribution for the prior was examined (Robinson 2003). This estimate was first adapted to smoothing spam probabilities by Gary Robinson. Since then it has been implemented in a number of spam filters[7] and has shown very promising results.

[6] Thanks to Geoffrey Sampson at the University of Sussex, who has made the source code for the Simple Good-Turing estimator freely available at his homepage, http://www.grs.u-net.com/.
[7] SpamBayes and BogoFilter both use Robinson's approach.

    P_{Rob}(x_i) = \frac{s \, p_0 + n \, P_{MLE}(x_i)}{s + n}    (26)

where p_0 is the expected probability when there is no data, n is the number of data points, and s is the strength of the smoother. Diagram 1 illustrates how this technique smooths P_MLE(x_i) = 0.8 for two different strengths, s = 1.0 and s = 0.5, as the number of data points increases. When there are no data points, a probability of 0.5 is assigned to the data (p_0 = 0.5).

Diagram 1. "Robinson Bayesian smoother for different s and n": P_MLE(x_i) = 0.8 is smoothed by p_Rob (s = 1.0 and s = 0.5) as the number of data points n increases; the smoothed value rises from p_0 = 0.5 towards 0.8.
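To make the estimators of this section concrete, the sketch below implements a few of them as plain functions of the counts they need. It is a minimal illustration under the notation above, not the thesis code; Simple Good-Turing is omitted because fitting its E-function is considerably more involved:

```cpp
#include <iostream>

// Laplace, equation (18): c = item count, N = training instances, B = bins.
double pLaplace(long c, long N, long B) {
    return (c + 1.0) / (N + B);
}

// Lidstone, equation (20): Laplace and ELE are the cases delta = 1 and 0.5.
double pLidstone(long c, long N, long B, double delta) {
    return (c + delta) / (N + delta * B);
}

// Witten-Bell, equations (22)-(23): N1 = items seen once, Z = unseen items.
double pWittenBell(long c, long N, long N1, long Z) {
    if (c > 0) return double(c) / (N + N1);
    return N1 / (double(Z) * (N + N1));
}

// Robinson, equation (26): shrink an MLE towards the prior expectation p0
// with strength s, given n observed data points.
double pRobinson(double pMle, long n, double s, double p0) {
    return (s * p0 + n * pMle) / (s + n);
}

int main() {
    // Robinson's smoother pulls an MLE of 0.8 up from p0 = 0.5 as data
    // grows, reproducing the behavior sketched in Diagram 1.
    for (long n : {0L, 1L, 5L, 20L})
        std::cout << "n=" << n << ": " << pRobinson(0.8, n, 1.0, 0.5) << "\n";
}
```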

5.6 Feature Selection

The primary purpose of feature selection is to reduce the dimensionality in order to decrease the computation time. This is particularly important in text categorization, where the high dimensionality of the feature space is a problem: in many cases the number of features is in the tens of thousands. It is then highly desirable to reduce this number, preferably without loss of accuracy. Several feature selection methods have been proposed, among them the χ² statistic, Information Gain, Mutual Information, Term Strength, Document Frequency, Probability Ratio and Odds Ratio. A number of studies have compared their performance. A comparative study (Yang & Pedersen 1997) showed that Information Gain, χ² and Document Frequency performed excellently on the Reuters corpus using a k-Nearest Neighbor classifier. Another study (Mladenic & Grobelnik 1999) showed that Odds Ratio outperformed Information Gain with a Naive Bayesian classifier, owing to the different domain and classifier. (Forman 2003) strengthened the results of (Yang & Pedersen 1997) and introduced a new feature selection method, BNS, which showed promising results.

Another purpose of feature selection is to improve performance over using all features. This is rarely achieved, but it is possible: for example, (Forman 2003) showed that bi-normal separation (BNS) did improve the results compared with using all features.

However, when dealing with spam, there are often only several hundred or perhaps a few thousand features, as opposed to hundreds of thousands in other areas of text categorization. Moreover, in comparison with classifiers such as neural networks, the computations in a Naive Bayesian classifier are rapid. This work will therefore also investigate whether there is any need for feature selection at all when a Naive Bayesian classifier is used for spam detection.

Three common feature selection methods will be investigated: Information Gain, χ² and the Probability Ratio (Odds Ratio and Probability Ratio are closely related). These three methods were chosen because they are commonly used in spam classification and have all shown good performance. A brief description of each is given in the following subsections.

5.6.1 Information Gain

Information gain measures the decrease in entropy, or the number of bits gained by knowing whether a feature is present or absent. The information gain, IG, for a term t is defined (Yang & Pedersen 1997) as

    IG(t) = \sum_{X \in \{t, \bar{t}\}} \sum_{Y \in \{c_i\}} P(X, Y) \log \frac{P(X, Y)}{P(X) P(Y)}    (27)

where {c_i}_{i=1}^{m} denotes the set of classes, in our case good and spam.

5.6.2 χ² statistics

The χ² test of independence compares patterns of observed and expected frequencies to determine whether they differ significantly from each other. As the χ² distribution is continuous and the observed frequencies are discrete, the χ² test is not reliable for small cell values; the usual recommendation is to keep the frequencies larger than five. The chi-square statistic is defined as

    \chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}    (28)

where O is the observed and E the expected frequency, calculated from the joint corpus. The degrees of freedom are defined by the size of the contingency table as df = (I − 1)(J − 1). Below is a two-way contingency table for a term t:

               c = good            c ≠ good
    t          A = freq(t, c)      B = freq(t, c̄)
    t̄          C = freq(t̄, c)      D = freq(t̄, c̄)

From this table the χ² value can be calculated as

    \chi^2 = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}    (29)

where N = A + B + C + D. The degrees of freedom are df = (2 − 1)(2 − 1) = 1. The χ² value is zero if t is completely independent of c. Following the work of (Yang & Pedersen 1997), this implementation uses the maximum χ² value between the feature and the classes as the weight:

    \chi^2_{max}(t) = \max_{i=1..m} \{ \chi^2(t, c_i) \}    (30)

5.6.3 Probability Ratio

The Probability Ratio encourages extreme probabilities and is highly dependent on the estimator. The feature weight is calculated as

    PR(t) = \log \frac{p(t \mid good)}{p(t \mid spam)}.    (31)

In an anti-spam context the Probability Ratio has previously been used in (Graham 2002), although there the ratio was measured as

    PR_{Gra}(t) = | p(spam \mid t) - 0.5 |.    (32)

Since (Graham 2002) was published, the Probability Ratio has become very popular in spam filter implementations due to its promising performance and simplicity. This work examines how well the Probability Ratio compares with the more sophisticated Information Gain and χ² feature selection methods.
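A minimal sketch of two of these feature weights, computed from the two-way contingency table of section 5.6.2; the example counts are made-up assumptions for illustration:

```cpp
#include <cmath>
#include <iostream>

// chi-square statistic, equation (29), from the contingency counts
// A = freq(t, good), B = freq(t, spam), C = freq(!t, good), D = freq(!t, spam).
double chiSquare(double A, double B, double C, double D) {
    double N = A + B + C + D;
    double num = N * (A * D - C * B) * (A * D - C * B);
    double den = (A + C) * (B + D) * (A + B) * (C + D);
    return den > 0 ? num / den : 0.0;
}

// Probability Ratio, equation (31), from smoothed class-conditional
// probabilities p(t|good) and p(t|spam).
double probabilityRatio(double pGood, double pSpam) {
    return std::log(pGood / pSpam);
}

int main() {
    // e.g. a term seen in 2 of 200 good messages and 80 of 200 spam messages
    std::cout << "chi2 = " << chiSquare(2, 80, 198, 120) << "\n";
    std::cout << "PR   = " << probabilityRatio(0.01, 0.40) << "\n";
}
```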

6 Experimental Results

C++ provided most of the tools necessary to carry out the experiments. Its object-oriented nature, together with a dynamically loaded XML file, allowed complex test configurations to be set up. All parameters and algorithms were specified in the XML, so no recompilation was required between tests. The only C++ shortcoming was its lack of high-precision data types. In many cases the native double (64-bit) caused underflow, meaning that the value is rounded down to zero. When a spam token probability becomes zero, (9) will also become zero, since the probabilities are multiplied. The solution was to handle the mantissa and the exponent separately (a sketch of this technique follows Table 2).

Throughout this work seven corpora are used for testing: the author's personal corpus, SpamAssassin[8], Ling-Spam (Androutsopoulos et al. 2000b), and PU1, PU2, PU3 and PUA[9] (Androutsopoulos et al. 2004). The personal corpus contains 930 good and 876 spam messages received by the author. All messages are intact, i.e. in their original state. SpamAssassin's publicly available corpus contains 3900 good and 1897 spam messages, published in their complete form. The Ling-Spam corpus consists of 2412 good messages received by its author and 481 spam messages from the public archives of the Linguist list.

Finding benchmark corpora with good messages has been problematic due to privacy aspects. The PU corpora are an attempt to bypass the privacy issues of publishing legitimate e-mails: each token is mapped to a number, and encoding all messages in this manner creates a corpus that involves no exploitation of privacy but still maintains the original frequency distribution. Since the PU corpora are pre-tokenized, they are not applicable to the tokenization experiments. However, one advantage of being pre-tokenized is that they form a good platform for benchmarking classifier performance. Table 2 presents the corpora, their acronyms, the number of good (n_g) and spam (n_s) messages, their ratio, and a description of the parts included in each corpus.

Table 2. Corpora used in the experiments.

    Corpus        Acronym   n_g    n_s    n_g : n_s   Description
    Personal      P         930    876    1.06        Full messages
    SpamAssassin  SA        3900   1897   2.05        Full messages
    Ling-Spam     L-S       2412   481    5.01        Body+subject
    PU1           PU1       618    481    1.28        Full messages
    PU2           PU2       579    142    4.08        Full messages
    PU3           PU3       2313   1826   1.27        Full messages
    PUA           PUA       571    571    1.00        Full messages

[8] SpamAssassin's public corpus is available at http://spamassassin.apache.org/publiccorpus/.
[9] The Ling-Spam, PU1, PU2, PU3 and PUA public corpora are available at http://www.aueb.gr/users/ion/ and are provided by Ion Androutsopoulos.
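As noted above, multiplying hundreds of token probabilities underflows a 64-bit double. A minimal sketch of the mantissa/exponent workaround applied to the posterior (9) — an illustration under assumed, untrained token probabilities, not the thesis code (keeping a sum of logarithms would be an equivalent remedy):

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// Product of many small probabilities kept as mantissa * 2^exponent so the
// running value never underflows a double.
struct BigProduct {
    double mantissa = 1.0;
    long exponent = 0; // base-2 exponent
    void multiply(double p) {
        int e;
        mantissa = std::frexp(mantissa * p, &e); // renormalize to [0.5, 1)
        exponent += e;
    }
};

// Two-class posterior of equation (9), P(spam | F), from per-token
// conditional probabilities.
double posteriorSpam(double priorSpam,
                     const std::vector<double>& pTokenGivenSpam,
                     const std::vector<double>& pTokenGivenGood) {
    BigProduct spam, good;
    spam.multiply(priorSpam);
    good.multiply(1.0 - priorSpam);
    for (double p : pTokenGivenSpam) spam.multiply(p);
    for (double p : pTokenGivenGood) good.multiply(p);
    // Rescale both terms by the larger exponent before forming the ratio.
    long shift = std::max(spam.exponent, good.exponent);
    double s = std::ldexp(spam.mantissa, int(spam.exponent - shift));
    double g = std::ldexp(good.mantissa, int(good.exponent - shift));
    return s / (s + g);
}

int main() {
    // 500 tokens: a plain running product of doubles would underflow to 0.
    std::vector<double> ps(500, 0.04), pg(500, 0.01);
    std::cout << posteriorSpam(0.5, ps, pg) << "\n"; // close to 1
}
```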

This chapter is divided into three sections, one for each experiment. The experiments are independent and can be read separately. The results in each section are followed by an analysis, a conclusion and some possibilities for future work.

6.1 Delimiter selection

The purpose of this experiment was to find an improved delimiter subset via filter- and wrapper-guided searches. To accomplish this it was necessary to build a fitness function for both approaches.

6.1.1 Filter for delimiter selection using KL divergence

A simple filter was created as a function of the probability distribution of the tokenized data. The output of the filter is a value representing the fitness of a delimiter set D. The criterion (or fitness) function is denoted by J(D), and the search algorithm attempts to maximize it over delimiter sets.

Let M_g = {g_1, g_2, ..., g_{n_g}} : L(g_i) = good and M_s = {s_1, s_2, ..., s_{n_s}} : L(s_i) = spam be the training data, and let T(m_i, D) = Q be the tokenizer. The two productions T(M_g, D) = Q_1 = {x_1, x_2, ..., x_n} and T(M_s, D) = Q_2 = {y_1, y_2, ..., y_n} contain the tokenized data (tokens) for M_g and M_s. The probability estimator assigns to each token in Q_1 and Q_2 a probability indicating its spamminess, based on the frequency distributions. The distance between the two resulting discrete probability distributions, q_1(x) and q_2(x), is measured and used as the criterion function for D.

This filter implements the Kullback-Leibler divergence (or relative entropy) to measure the distance between the two probability mass functions q_1(x) and q_2(x). The KL divergence KL(q_1 || q_2) is defined as

    KL(q_1 \| q_2) = \sum_x q_1(x) \log \frac{q_1(x)}{q_2(x)}.    (33)

KL(q_1 || q_2) is always non-negative, and it is zero if and only if q_1 = q_2. It is not symmetric, but can be made so by calculating the two-way divergence

    KL_s(q_1 \| q_2) = KL(q_1 \| q_2) + KL(q_2 \| q_1).    (34)

The greater the distance between q_1 and q_2, the better the delimiter set D; hence the criterion function is set to

    J_{KL}(D) = KL_s(q_1 \| q_2).    (35)
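A minimal sketch of the two-way divergence (34) over token distributions. Tokens missing from one distribution must be smoothed beforehand (see section 5.5); here a tiny floor probability is assumed for illustration:

```cpp
#include <cmath>
#include <iostream>
#include <map>
#include <string>

using Dist = std::map<std::string, double>; // token -> probability

// One-way KL divergence, equation (33).
double kl(const Dist& q1, const Dist& q2, double floor = 1e-9) {
    double d = 0.0;
    for (const auto& [token, p1] : q1) {
        auto it = q2.find(token);
        double p2 = (it != q2.end()) ? it->second : floor;
        d += p1 * std::log(p1 / p2);
    }
    return d;
}

// Two-way divergence, equation (34), used as the filter fitness J_KL(D).
double klSymmetric(const Dist& q1, const Dist& q2) {
    return kl(q1, q2) + kl(q2, q1);
}

int main() {
    Dist good{{"meeting", 0.7}, {"viagra", 0.3}}; // made-up distributions
    Dist spam{{"meeting", 0.1}, {"viagra", 0.9}};
    std::cout << klSymmetric(good, spam) << "\n";
}
```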

Even though it is independent of the inducer, the drawback of this filter is that it depends upon the probability estimator.

6.1.2 Wrapper approach for delimiter selection

The wrapper uses the inducer as a part of the criterion function. The inducer in this work uses stratified 10-fold cross-validation to establish more reliable results.

Let M_g = {g_1, g_2, ..., g_{n_g}} : L(g_i) = good and M_s = {s_1, s_2, ..., s_{n_s}} : L(s_i) = spam be the training data, and let I(M_g, M_s, D) be the inducer as a function of the training data and a delimiter set D. The criterion function J(D) returns a weight representing how well the delimiter set D contributes to the classification. For the experiments an extended version of the Weighted Accuracy W_Acc, defined in (12), with λ = 1 was used.

This heuristic can guide the search for a delimiter set only up to a certain point. Over time W_Acc stagnates: this happens when new subsets no longer affect the precision or recall. The problem arises when one or more D_{i+1} exist such that J(D_i) = J(D_{i+1}), where D_i is a delimiter subset of size i. The solution is to increase the granularity of W_Acc by introducing several new parameters, obtaining the Extended Weighted Accuracy, in which the mean probabilities of the already classified data are used. This is done in the following way. Let µ_g be the mean spam probability over all of M_g and µ_s the mean spam probability over all of M_s, such that

    \mu_g = \frac{1}{n_g} \sum_{i=1}^{n_g} P_s(g_i)    (36)

    \mu_s = \frac{1}{n_s} \sum_{i=1}^{n_s} P_s(s_i)    (37)

The two means are combined with the F-measure (Rijsbergen 1979), which is their harmonic mean, such that

    \mu_{gs} = \frac{2 (1 - \mu_g) \mu_s}{(1 - \mu_g) + \mu_s}    (38)

Finally, µ_gs is incorporated into W_Acc to obtain the Extended Weighted Accuracy:

    EW_{Acc} = \frac{\lambda \, n_{gg} + n_{ss}}{\lambda \, n_g + n_s} + \alpha \, \mu_{gs}    (39)

where α is a small value that makes µ_gs a fraction of the Extended Weighted Accuracy. In this work the influence of µ_gs is set to be less important than W_Acc, that is α µ_gs < 1/(λ n_g + n_s). Since 0 ≤ µ_gs ≤ 1, it is possible to set α = 1/(10 (λ n_g + n_s)). This means that EW_Acc is foremost guided by W_Acc; when W_Acc no longer changes, EW_Acc is instead guided by µ_gs. The Extended Weighted Accuracy was implemented as the criterion function for the wrapper:

    J_{EW_{Acc}}(D) = EW_{Acc}.    (40)

6.1.3 Experimental settings

The tests were carried out on three corpora: Ling-Spam, SpamAssassin and the author's personal corpus. Since PU1, PU2, PU3 and PUA are pre-tokenized, it was not possible to use them for this experiment. The SpamAssassin corpus was cut down to 5% of its size, since the wrapper algorithm would otherwise have taken too long.

For simplicity, only single characters are used as delimiters, even though ordered clusters of characters could be beneficial. Furthermore, low-frequency delimiters are pruned, since their effect on the outcome is minimal: it was found that delimiters occurring less than five times in each message could be removed without significantly affecting performance. A further restriction was to use only non-alphanumeric characters as delimiter candidates.

Both the body and the header of each message are tokenized. HTML and plain message bodies are tokenized in the same manner. Attachments are removed prior to tokenization. Case is preserved and no word stemming is applied. All tests were conducted with stratified 10-fold cross-validation. If there are several maximal solutions, Occam's Razor[10] is applied to choose the simplest; the simplest delimiter set is defined as the one with the lowest cardinality. Robinson's Bayesian smoother, described in 5.5.7, was used, since it is both fast and satisfactory.

Two base-lines were established, the first using only space as the delimiter and the second using all non-alphanumeric characters as delimiters, as presented in Table 3. As it was

[10] Occam's Razor states that one should not make more assumptions than the minimum needed. It admonishes us to choose, from a set of otherwise equivalent models of a given phenomenon, the simplest one.

anticipated that a better subset would be found somewhere between space alone and all non-alphanumeric characters, it was natural to choose these two as base-line delimiter sets. It also appears to be customary (Graham 2003) to use all non-alphanumeric characters as delimiters.

Table 3. Delimiter subsets used as base-lines.

    Base-line   Description
    D_S         Only space is used as delimiter, char(32).
    D_NAN       All non-alphanumeric characters are used as delimiters.

6.1.4 Results

Table 4 displays the delimiter subsets found by SFFS for the filter (J_KL) and wrapper (J_EWAcc) criteria. The subsets are denoted D_KL and D_WAcc respectively. Within each subset the delimiters are separated by spaces; the space character itself is written as "space".

Table 4. Delimiter subsets automatically found by SFFS for the different corpora.

    Corpus        Criterion   D_J = SFFS(D, J)
    Ling-Spam     J_KL        D_KL    = { space - / _ * = }
    Ling-Spam     J_EWAcc     D_WAcc  = { space ) = _ ( * }
    SpamAssassin  J_KL        D_KL    = { space . = / - @ % _ < , + ; ' }
    SpamAssassin  J_EWAcc     D_WAcc  = { space - }
    Personal      J_KL        D_KL    = { space / \r + . - = _ @ : , ( ; ) }
    Personal      J_EWAcc     D_WAcc  = { space . , }

Tables 5 to 7 display the performance of the delimiter subsets from Table 4 on the three corpora, together with the base-line subsets D_S and D_NAN. The performance is shown as precision (p), recall (r) and weighted accuracy (WAcc) for λ = 1, λ = 9 and λ = 99.

Table 5. Performance of the different delimiter subsets on Ling-Spam.

                    λ = 1                   λ = 9                   λ = 99
    Subset          p      r      WAcc     p      r      WAcc     p      r      WAcc
    D_S             9.7   98.96  98.34    9.6   98.96  98.44    93.87  98.75  98.7
    D_NAN           93.3  98.75  98.6     94.43  98.75  98.84    95.9   98.75  99.00
    D_KL            93.3  98.75  98.6     94.43  98.75  98.84    95.38  98.75  99.05
    D_EWAcc         94.6  98.96  99.8     95.0   98.96  99.00    96.5   98.75  99.

Table 6. Performance of the different delimiter subsets on SpamAssassin.

                    λ = 1                    λ = 9                    λ = 99
    Subset          p       r      WAcc     p       r      WAcc     p       r      WAcc
    D_S             100.00  9.83   97.65    100.00  9.4    99.6     100.00  9.98   99.96
    D_NAN           99.5    86.9   95.58    99.5    86.7   99.      99.5    86.9   99.73
    D_KL            99.30   90.08  96.55    99.30   90.08  99.0     99.30   89.4   99.64
    D_EWAcc         100.00  94.09  98.07    100.00  93.88  99.69    100.00  93.04  99.97

Table 7. Performance of the different delimiter subsets on Personal.

                    λ = 1                    λ = 9                    λ = 99
    Subset          p       r      WAcc     p       r      WAcc     p       r      WAcc
    D_S             98.0    99.00  98.50    98.0    99.00  98.0     98.0    99.00  98.0
    D_NAN           98.8    83.00  9.00     98.80   8.00   97.30    98.77   80.00  98.8
    D_KL            98.84   85.00  9.00     98.8    84.00  97.50    98.8    83.00  98.84
    D_EWAcc         100.00  99.00  99.50    100.00  99.00  99.90    100.00  99.00  99.50

6.1.5 Analysis

Using only space (D_S) proved to be more effective than using all non-alphanumeric characters (D_NAN) as delimiters, except on the Ling-Spam corpus (λ = 9). This may be due to the different characteristics of the corpora: the Ling-Spam corpus is stripped of its headers, while the messages in the Personal and SpamAssassin corpora are retained in their entirety. The decomposition of headers when using D_NAN may result in lower performance, which would explain the differences.

The wrapper approach found subsets of delimiters that significantly improved the overall classification performance on all corpora. The filter approach using the KL divergence did not improve performance over the base-line subsets. Both approaches (particularly the wrapper) found delimiter subsets of low cardinality for all corpora: for J_EWAcc, the subsets D_WAcc for the Personal and the SpamAssassin corpus consisted of only a few delimiters (see Table 4).

Even though D_NAN worked slightly better than D_S on Ling-Spam, the wrapper found an even better low-cardinality subset, D_WAcc. This suggests that a near-optimal solution is likely to consist of only a few delimiters. The reason is probably the information loss that occurs as the number of delimiters increases (see section 5.4.2). This is illustrated in Diagram 2, where EW_Acc is

plotted against s_k, where s_k is the near-optimal delimiter subset of size k. Diagram 2 shows how the performance decreases as the number of delimiters increases.

Diagram 2. Performance (EW_Acc) of near-optimal delimiter subsets of size k on the SpamAssassin, Ling-Spam and Personal corpora: performance falls as the number of delimiters k grows.

The subsets D_EWAcc had little in common, apart from the fact that space was always found to be the optimal subset for k = 1. On a 1.7 GHz Intel Pentium machine it took about 3 hours to run the wrapper on 5% of the SpamAssassin corpus; the filter ran about ten times faster. The code is not optimized and could probably be made to run twice as fast by using tokenization caching and other techniques.

6.1.6 Conclusion

In this chapter a near-optimal search algorithm called SFFS was applied to find a subset of delimiters for the tokenizer. A filter and a wrapper algorithm were proposed to determine how beneficial a group of delimiters is to the classification task. The filter approach ran about ten times faster than the wrapper, but did not produce significantly better subsets than the base-lines. The wrapper improved the performance on all corpora by finding small subsets of delimiters. This suggests a simple strategy for selecting a near-optimal delimiter set: start with space and then add a few more. Since the wrapper-generated subsets had nothing in common apart from space, the recommendation is to use only space as a delimiter. The wrapper itself was far too slow to use in spam filter software.

6.1.7 Future work

This work found that near-optimal solutions are likely to have low cardinality. Even though the wrapper was too slow to use at run time in spam filter software, it is believed that a near-optimal solution could be found by using SFS on small delimiter subsets. Such an algorithm could be included in the spam filtering software and executed by the user to tune the delimiter subset.

During testing an attempt was made to plug λ = 99 into J_EWAcc, which resulted in completely different delimiter subsets (without space), yielding a precision of 100% and a recall of approximately 60% on the test corpora. It would be interesting to explore this in more controlled experiments. Another idea is to use different delimiters for the header and the body of messages, due to their different characteristics. Also, in natural language processing it is common to use n-grams instead of unigrams for ambiguity resolution. Even though a recent study (Androutsopoulos et al. 2004) found no evidence of any benefit from using n-grams for spam classification, it pointed out that further research is necessary.

6.2 Probability estimation

In this section the probability estimators described in 5.5 are tested against all corpora.

6.2.1 Experimental settings

The performance was measured as the weighted accuracy with λ = 1, λ = 9 and λ = 99. All tests were conducted with stratified 10-fold cross-validation. In order to avoid biased feature selection, the classifier was fed all features. The same set of delimiters was used throughout all tests.

For the Lidstone estimate (p_Lid) a value of δ = 0.1 was found to work well for the given corpora. For the Bayesian smoother (p_Rob) a strength of s = 0.5 and an expected probability for unseen items of p_0 = 0.5 were used. These values were found by trial and error. Robinson describes (Robinson 2003) how these parameters can be set automatically; unfortunately, time constraints prevented the author from exploring this.

6.2.2 Results

The results are displayed in Tables 8 to 13. Each table displays the performance of the estimators on one corpus, shown as precision (p), recall (r) and weighted accuracy (WAcc). Each estimator is also ranked (R) by WAcc to simplify interpretation of the tables.

Table 8. Probability estimators tested on PU1.

             λ = 1                     λ = 9                     λ = 99
    Est.     p      r      WAcc  R    p      r      WAcc  R    p      r      WAcc  R
    p_abs    94.4   98.34  96.7  6    94.78  98.3   95.98 6    95.7   97.7   96.6  6
    p_lap    93.    98.54  96.8  7    93.86  98.54  95.7  7    94.4   98.3   95.49 7
    p_ELE    94.78  98.3   96.8  5    95.74  98.3   96.7  5    95.9   97.7   96.77 5
    p_Lid    96.3   97.9   97.45 4    97.    97.7   97.73 4    97.30  97.5   97.89 3
    p_WB     97.3   98.3   98.00 1    97.9   97.9   98.35 1    97.9   97.5   98.38 2
    p_SGT    97.3   98.34  98.00 1    97.3   98.3   97.9  2    98.    97.09  98.53 1
    p_Rob    96.53  98.34  97.73 3    97.    97.9   97.75 3    97.    97.7   97.73 4

Table 9. Probability estimators tested on PU2.

             λ = 1                     λ = 9                     λ = 99
    Est.     p      r      WAcc  R    p      r      WAcc  R    p      r      WAcc  R
    p_abs    88.36  90.85  95.84 6    89.44  89.44  97.   7    90.58  88.03  97.73 7
    p_lap    90.7   89.44  96.   5    9.65   88.73  98.0  6    94.57  85.9   98.76 5
    p_ELE    95.4   88.03  96.8  1    96.03  85.    98.77 3    96.77  84.5   99.7  3
    p_Lid    97.39  78.87  95.4  7    97.30  76.06  98.86 2    97.5   74.65  99.4  2
    p_WB     9.75   90.4   96.67 3    93.38  89.44  98.   5    94.07  89.44  98.60 6
    p_SGT    94.07  89.44  96.8  1    94.74  88.73  98.5  4    94.70  88.03  98.76 4
    p_Rob    98.35  83.8   96.53 4    98.33  83.0   99.   1    98.3   8.69   99.6  1

Table 10. Probability estimators tested on PU3.

             λ = 1                     λ = 9                     λ = 99
    Est.     p      r      WAcc  R    p      r      WAcc  R    p      r      WAcc  R
    p_abs    97.95  96.93  97.75 5    98.05  96.55  98.33 5    98.    96.    98.60 5
    p_lap    96.54  97.86  97.5  7    96.85  97.54  97.50 7    97.0   97.3   97.7  7
    p_ELE    97.43  97.65  97.83 3    97.69  97.6   98.   6    97.84  96.88  98.30 6
    p_Lid    98.    97.6   97.97 2    98.8   96.93  98.5  3    98.38  96.50  98.73 3
    p_WB     98.7   96.8   97.80 4    98.70  95.67  98.74 1    98.86  95.9   99.0  1
    p_SGT    98.60  96.7   97.70 6    98.65  95.73  98.70 2    98.8   95.8   99.06 2
    p_Rob    98.    97.37  98.0  1    98.7   96.99  98.45 4    98.3   96.39  98.68 4

Table 11. Probability estimators tested on PUA.

             λ = 1                     λ = 9                     λ = 99
    Est.     p      r      WAcc  R    p      r      WAcc  R    p      r      WAcc  R
    p_abs    98.70  93.33  96.05 6    98.69  9.8    98.8  6    98.68  9.93   98.70 7
    p_lap    98.54  94.56  96.58 1    98.88  93.33  98.39 2    98.87  9.     98.88 1
    p_ELE    98.7   94.    96.49 2    98.89  93.86  98.44 1    98.87  9.     98.88 1
    p_Lid    98.89  93.68  96.3  3    98.88  93.6   98.37 3    98.87  9.46   98.88 1
    p_WB     98.70  93.33  96.05 6    98.69  9.8    98.8  6    98.69  9.8    98.7  6
    p_SGT    98.89  93.5   96.3  4    98.88  9.98   98.35 4    98.87  9.8    98.88 1
    p_Rob    98.89  93.5   96.3  4    98.88  9.8    98.33 5    98.86  9.58   98.87 5

Table. Probability estimators tested on Ling-Sam. L-S λ = λ = 9 λ = 99 Est. r WAcc R r WAcc R r WAcc R abs 75,00 00,00 94,46 6 76,5 99,79 93,9 6 77,35 99,58 94,0 6 la 73,3 99,58 93,9 7 74, 99,58 93, 7 76, 99,58 93,79 7 ELE 85,03 99,38 96,99 5 86,3 99,7 96,90 5 87,8 99,7 97,0 5 Lid WB SGT Rob 93,4 98,96 98,6 94,43 98,96 98,84 94,99 98,75 98,96 9,94 99,79 98,5 9,46 99,58 98,4 4 94,07 99,7 98,76 3 9,08 99,38 98,48 3 9,97 99,7 98,5 94,6 99,7 98,80 9,07 99,7 98,44 4 9,6 99,7 98,44 3 93,5 99,7 98,63 4 Table 3. Probability estimators tested on SamAssassin. SA λ = λ = 9 λ = 99 Est. r WAcc R r WAcc R r WAcc R abs la ELE Lid WB SGT Rob 99,0 9,88 97,0 4 99,0 9,7 99,4 5 99,0 9,5 99,60 7 99,47 89,09 96,7 7 99,47 88,93 99, 7 99,47 88,67 99,7 4 99,4 89,98 96,55 6 99,4 89,77 99,3 6 99,4 89,67 99,69 6 99,48 90,6 96,77 5 99,48 90,46 99,9 4 99,48 90,30 99,7 3 99,50 94,94 98,9 99,50 94,94 99,5 99,50 94,89 99,75 99,34 94,94 98,4 99,39 94,89 99,47 99,39 94,83 99,69 5 99,49 9,30 97,33 3 99,49 9,99 99,37 3 99,49 9,99 99,73 To summarize the results, the estimators mean erformance over all coruses is taken and dislayed in the tables below. One table for each for each λ. The tables are sorted after the mean rank (R). Table 4. Mean results for λ = Overall Rank λ = Est. r Acc SGT WB Rob ELE Lid abs la W R 96,57 95,33 97,54,50 96,6 95,40 97,49 3,00 97,07 94,05 97,34 3,00 94,99 94,60 96,89 3,67 96,93 9,89 97,03 3,83 9,4 95, 96,8 5,50 9,7 94,85 96,0 5,67 43

Table 15. Mean results for λ = 9.

    Est.    p      r      WAcc   R
    SGT     96.87  94.94  98.55  2.67
    Lid     97.3   9.     98.54  2.83
    WB      96.60  95.03  98.53  3.00
    Rob     97.8   93.63  98.56  3.17
    ELE     95.57  93.94  98.00  4.33
    abs     9.59   94.74  97.09  5.83
    lap     9.55   94.44  96.90  6.00

Table 16. Mean results for λ = 99.

    Est.    p      r      WAcc   R
    Lid     97.47  9.73   98.88  2.17
    SGT     97.0   94.40  98.9   2.50
    Rob     97.54  93.05  98.86  3.17
    WB      96.88  94.76  98.8   3.33
    ELE     96.03  93.34  98.34  4.33
    lap     93.9   93.6   97.34  5.17
    abs     93.8   94.6   97.57  6.33

6.2.3 Analysis

As expected, the overall performance of the Laplace estimator was poor, except on PUA, where it had the highest rank for λ = 1. The reason could be that PUA includes legitimate non-English messages in Greek. Rare Greek words are unlikely to occur in spam, so moving a great deal of probability mass to rare events may increase the precision. The performance of the Absolute estimate was similar to that of Laplace, and together they ranked at the bottom of all ranking lists. ELE performed slightly better than Laplace and Absolute and was ranked in the middle in all tests.

There was no outright winner outperforming all the others; however, there are four candidates for the top spot. With regard to both high precision and recall, the Simple Good-Turing and Witten-Bell estimates performed very well on most corpora. What is surprising is the performance of Lidstone with δ = 0.1, as it maintains high precision over all corpora; it was expected to behave more like ELE and Laplace. In fact, it was ranked as the top smoother for λ = 99 (see Table 16).

Robinson's Bayesian estimate is very attractive, as it maintains high precision in all tests. In the ranking tables it achieves the highest average precision for λ = 1 and λ = 99.

6.2.4 Conclusion

Simple Good-Turing has previously shown impressive performance (Gale 1995), which our tests strengthen. Perhaps the main obstacle with Simple Good-Turing lies in its implementation complexity. Our test results show that Witten-Bell is a reasonable substitute for Good-Turing with only a small performance loss; Witten-Bell is simple to implement and well used in natural language processing. Even though Lidstone performed very well, it cannot be considered a robust estimate until it has undergone more evaluation: δ = 0.1 was found through manual testing of different values, and it may perform less well on other corpora. Robinson's estimate, based on a Bayesian smoother, showed excellent results; it is easy to implement and less computationally expensive than both Witten-Bell and Good-Turing.

Table 17. Overall performance of the tested estimators.

    Estimator   Performance   Easy to implement   Needs manual parameters
    SGT         Excellent     No                  No
    Rob         Excellent     Yes                 Yes
    Lid         Excellent?    Yes                 Yes
    WB          Good          Quite               No
    ELE         Average       Yes                 No
    lap         Poor          Yes                 No
    abs         Poor          Yes                 No

6.2.5 Future work

All experiments were performed on full corpora; further research on smaller corpus sizes is encouraged, to examine how the estimators scale. More importantly, this work has only experimented with a few estimators, and there are many other interesting smoothers, such as Kneser-Ney smoothing (Ney & Kneser 1995), Katz smoothing (Katz 1987) and Church-Gale smoothing (Church & Gale 1991).

6.3 Feature selection

This experiment explores feature selection performance as a function of the number of selected features. Since e-mails do not usually have many features, it was interesting to investigate whether feature selection is necessary at all when a Naive Bayesian spam classifier is used. The experiment compares the time gained by feature selection, and its influence on accuracy, against using all features.

6.3.1 Experimental settings

All tests were conducted with stratified 10-fold cross-validation. Due to its promising results (see section 6.2.4), the Simple Good-Turing estimator was used throughout the tests. The performance was measured by the weighted accuracy with λ = 9. The same set of delimiters was used for all tests.

6.3.2 Results

The results are displayed in Diagrams 3 to 8, one per corpus. Each diagram includes a base-line showing the performance with all features selected, and shows how the performance (WAcc) of the classifier varies as the number of selected features increases from 40 to 600, for the three tested feature selection methods: Probability Ratio, Information Gain and χ².

Diagram 3. Performance as a function of the number of features selected on PU1 (Probability Ratio, Information Gain, χ² and the all-features base-line).

Diagram 4. Performance as a function of the number of features selected on PU2 (Probability Ratio, Information Gain, χ² and the all-features base-line).

Diagram 5. Performance as a function of the number of features selected on PU3 (Probability Ratio, Information Gain, χ² and the all-features base-line).

Diagram 6. Performance as a function of the number of features selected on PUA (Probability Ratio, Information Gain, χ² and the all-features base-line).

Diagram 7. Performance as a function of the number of features selected on SA (Probability Ratio, Information Gain, χ² and the all-features base-line).

Diagram 8. Performance as a function of the number of features selected on L-S (Probability Ratio, Information Gain, χ² and the all-features base-line).

Diagram 9 shows how the classification time changes as the number of features increases.

Diagram 9. The average time (in seconds) to classify one message for the different feature selectors and for all features.

6.3.3 Analysis

The test results confirm earlier findings (Mladenic & Grobelnik 1999): the Probability Ratio (which is closely related to the Odds Ratio) works very well with a Naive Bayesian classifier and did, in fact, outperform both Information Gain and χ² in all tests apart from PU2, where Information Gain was slightly better. The reason may be that the Simple Good-Turing estimator is less accurate on PU2 (see Table 9) than on the other corpora. Information Gain and χ² both favor common terms, while the Probability Ratio favors terms with a high probability of being either good or spam. This emphasizes the importance of rare words as indicators for classification. Of course, a good estimator is required for the probability ratio to be reliable.

The diagrams also reveal that the features selected by the Probability Ratio roughly match the performance of using all features. On Ling-Spam it was even more accurate than using all features. The reason could be that Ling-Spam may produce over-optimistic results (Androutsopoulos et al. 2004) due to its characteristics.

Classifying a message using all features took on average (see Diagram 9)

    t_All = 0.023 s.

The largest time gain from feature selection was obtained with the Probability Ratio. The average time when the number of features was reduced to 40 was

    t_PR40 = 0.021 s.

This means that approximately

    t_All − t_PR40 = 0.023 − 0.021 = 0.002 s = 2 ms

is gained per message, i.e. a saving of about 10% of the total time per message.

6.3.4 Conclusion

The experiments with feature selection showed that the Probability Ratio is superior to both Information Gain and χ² when using a Naive Bayesian classifier in the spam domain. More importantly, this experiment attempted to answer the question of whether feature selection is required at all in this domain. It was shown that if the Probability Ratio is used for feature selection, it is possible to improve the classification time by about 10% without any significant accuracy loss. In this (non-optimized) test environment the time saved per message was about 2 ms; effectively, for every thousand e-mails, about 2 seconds are saved. This time gain was considered far too small to motivate the use of feature selection as a time saver. Furthermore, none of the tested feature selection methods consistently improved the performance over using all features. Since none of the tested methods improves performance or saves significant time, the recommendation is to use all features until a better solution is found.

6.3.5 Future work

This experiment concluded that the primary purpose of feature selection for Naive Bayesian spam classification is not to reduce the feature dimensionality; the real purpose would be to improve performance over using all features. Such a feature selection algorithm was presented in (Forman 2003) under the name BNS (bi-normal separation). The author conducted some preliminary testing with BNS on the PU corpora, and it performed approximately as well as the Probability Ratio. However, further and more controlled experiments should be carried out on BNS and other interesting feature selection methods.

7 Summary

This work has empirically tested the tokenizer, the probability estimator and feature selection for a Naive Bayesian spam classifier. It was shown that a good delimiter subset is likely to be of low cardinality, owing to the information loss of larger subsets. Further, it was found that Simple Good-Turing and Robinson's Bayesian smoother are good choices for estimating probabilities. Finally, this work showed that feature selection is not required to reduce the already low dimensionality of features in e-mail messages; a feature selection method should only be used if it benefits the performance. In these tests both Information Gain and χ², which are commonly used for selecting features for a Naive Bayesian classifier, degraded the performance of the classifier. The Probability Ratio outperformed both Information Gain and χ², but did not consistently increase classifier performance. The recommendation is thus to keep all features and only use feature selection if it improves performance over a majority of test corpora.

To describe the outcome of this work in one sentence: use space as delimiter (and possibly a few more), estimate the probabilities with Simple Good-Turing or Robinson's Bayesian smoother, and do not use any feature selection.

References

Almuallim, H. and Dietterich, T. (1991), Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence, pp. 547-552. Menlo Park, CA: AAAI Press/The MIT Press.

Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C. and Stamatopoulos, P. (2000a), Learning to filter spam e-mail: A comparison of a naive Bayesian and a memory-based approach. In Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2000).

Androutsopoulos, I., Koutsias, J., Chandrinos, K.V., Paliouras, G. and Spyropoulos, C.D. (2000b), An Evaluation of Naive Bayesian Anti-Spam Filtering. In Potamias, G., Moustakis, V. and van Someren, M. (eds.), Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (ECML 2000), Barcelona, Spain, pp. 9-17.

Androutsopoulos, I., Paliouras, G. and Michelakis, E. (2004), Learning to Filter Unsolicited Commercial E-Mail. Athens University of Economics and Business and National Centre for Scientific Research "Demokritos".

Bevilacqua-Linn, M. (2003), Machine Learning for Naive Bayesian Spam Filter Tokenization.

Breiman, L. and Spector, P. (1992), Submodel selection and evaluation in regression: The X-random case. International Statistical Review, 60, 291-319.

Bullard, F. (2001), A Brief Introduction to Bayesian Statistics, http://courses.ncssm.edu/math/talks/pdfs/BullardNCTM2001.pdf, accessed 2004-03-13.

Chen, D., Chen, T. and Ming, H., Spam Email Filter Using Naive Bayesian, Decision Tree, Neural Network, and AdaBoost.

Church, K.W. and Gale, W.A. (1991), A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5:19-54.

Domingos, P. and Pazzani, M. (1996), Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. In Proceedings of the 13th International Conference on Machine Learning, pp. 105-112, Bari, Italy.

Forman, G. (2003), An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research 3 (Mar): 1289-1305.

Ferri, F.J., Pudil, P., Hatef, M. and Kittler, J. (1994), Comparative Study of Techniques for Large-Scale Feature Selection. In Gelsema, E.S. and Kanal, L.N. (eds.), Pattern Recognition in Practice IV, pp. 403-413. Elsevier.

Gale, W.A. and Church, K.W. (1994), What is wrong with adding one? In Oostdijk, N. and de Haan, P. (eds.), Corpus-Based Research into Language, pp. 189-198. Amsterdam: Rodopi.

Gale, W.A. (1995), Good-Turing Smoothing without Tears. Journal of Quantitative Linguistics, 2, pp. 217-254.

Good, I.J. (1953), The population frequencies of species and the estimation of population parameters. Biometrika, 40(3 and 4), pp. 237-264.

Graham, P. (2002), A plan for spam. http://www.paulgraham.com/spam.html, accessed 2003.

Heckerman, D. (1995), A Tutorial on Learning Bayesian Networks, http://citeseer.ist.psu.edu/35897.html, accessed March 2004.

Hinton, R. (2004), Statistics Explained, 2nd Edition, Routledge.

Holden, S. (2003), Spam Filters, http://freshmeat.net/articles/view/964/, accessed March 2004.

Jain, A.K. and Zongker, D. (1997), Feature Selection: Evaluation, application, and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2), pp. 153-158.

Jain, A.K. and Mao, J. (2000), Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), pp. 4-37.

Katz, S.M. (1987), Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35(3), pp. 400-401, March.

Kira, K. and Rendell, L.A. (1992), The feature selection problem: Traditional methods and a new algorithm. In AAAI-92, Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 129-134. AAAI Press.

Kohavi, R. (1995), A study of cross-validation and bootstrap for accuracy estimation and model selection. In International Joint Conference on Artificial Intelligence (IJCAI).

Kohavi, R. and John, G.H. (1997), Wrappers for feature subset selection. Artificial Intelligence, 97(1-2), pp. 273-324.

Kushmerick, N. (1997), Wrapper induction for information extraction, Ph.D. Thesis, University of Washington, Seattle, WA.

Kushmerick, N. (2000), Wrapper induction: Efficiency and expressiveness, Artificial Intelligence, 118(1-2), pp. 15-68.

Lee, P. (2004), Bayesian Statistics: An Introduction, Third Edition, Provost of Wentworth College, University of York, UK. Arnold Publishers.

Mertz, D. (2002), Spam filtering techniques, http://www-106.ibm.com/developerworks/linux/library/l-spamf.html#author, accessed 2004.

Mladenic, D. and Grobelnik, M. (1999), Feature Selection for Unbalanced Class Distribution and Naive Bayes. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML), pp. 258-267.

Ney, H., Essen, U. and Kneser, R. (1994), On structuring probabilistic dependences in stochastic language modelling. Computer, Speech, and Language, 8, pp. 1-38.

Ney, H. and Kneser, R. (1995), Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pp. 181-184.

Ng, A.Y. and Jordan, M.I. (2001), On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Dietterich, T., Becker, S. and Ghahramani, Z. (Eds.), Advances in Neural Information Processing Systems (NIPS) 14.

Pudil, P. and Kittler, J. (1994), Floating search methods in feature selection, Pattern Recognition Letters, 15(11), pp. 1119-1125.

Rijsbergen, C.J. van (1979), Information Retrieval. Butterworths, London.

Robinson, G. (2003), Spam Detection, http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html, accessed 2004.

Sahami, M., Dumais, S., Heckerman, D. and Horvitz, E. (1998), A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the AAAI Workshop, Madison, Wisconsin, pp. 55-62. AAAI Technical Report WS-98-05.

Siedlecki, W. and Sklansky, J. (1988), On automatic feature selection, International Journal of Pattern Recognition and Artificial Intelligence, 2(2), pp. 197-220.

Somol, P., Pudil, P., Novovičová, J. and Paclík, P. (1999), Adaptive floating search methods in feature selection, Pattern Recognition Letters, 20(11-13), pp. 1157-1163.

Spence, C. and Sajda, P. (1998), The role of feature selection in building pattern recognizers for computer aided diagnosis. National Information Display Laboratory, Sarnoff Corporation, Princeton.

Chen, S.F. and Goodman, J. (1996), An Empirical Study of Smoothing Techniques for Language Modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pp. 310-318.

Wiener, E., Pedersen, J.O. and Weigend, A.S. (1995), A neural network approach to topic spotting. In Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval (SDAIR'95).

Vapnik, V. (1995), The Nature of Statistical Learning Theory. Springer, New York.

Witten, I.H. and Bell, T.C. (1991), The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4), pp. 1085-1094, July.

Yang, Y. and Liu, X. (1998), A re-examination of text categorization methods. School of Computer Science, Carnegie Mellon University.

Yang, Y. and Pedersen, J.O. (1997), A Comparative Study on Feature Selection in Text Categorization. In Proceedings of ICML-97, 14th International Conference on Machine Learning.