Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools

Spam Track, Wednesday 1 March 2006
APRICOT, Perth, Australia

James Carpinter & Ray Hunt
Dept. of Computer Science and Software Engineering
University of Canterbury, New Zealand

Outline
- Where are we at?
- Current state of the art
- Spam classification and filtering engines
- Filtering technologies
  - Machine learning and non-machine learning
- Corpus performance comparison
- Case study: heuristic, Bayesian, combined filters
- Conclusions

Where are we at?
- Overall, the industry has been unsuccessful in solving the spam problem
- Current tools are limited or ineffective
- Network providers cannot or do not want to address this issue
- Legislation has been a waste of time
- The problem of spam is getting worse
- We need new systems and solutions, as current products are largely ineffective

Where are we at?
- Spam ranges from a minor irritant to a major threat to productivity
- Stanford University study [36]: the average Internet user loses ten working days a year dealing with spam
- Also, 15% of emails contain viruses
- Estimated worldwide cost of spam in 2005, in terms of lost productivity and IT infrastructure investment: > US$10 billion [29, 52]

Where are we at?
- The effectiveness of spam filters in improving user productivity is ultimately limited by the extent to which users must manually:
  - Review filtered messages for false positives
  - Review incoming email for false negatives
- A 99% accuracy rate with 1% false negatives (and no false positives) is preferable to the same level of accuracy with 1% false positives (and no false negatives)

Where are we at?
- The business model of spammers is too attractive
- Commissions to spammers of 25-50% on products sold are not unusual [30]
- On a collection of 200 million email addresses, a response rate of 0.001% would yield a spammer a return of $25,000, given a $50 product
- Any solution to this problem must reduce the profitability of the underlying business model by:
  - substantially reducing the number of emails reaching valid recipients, or
  - increasing the expenses faced by the spammer
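
The $25,000 figure can be reproduced with a few lines of arithmetic. The sketch below assumes a 25% commission, the low end of the quoted range; the slide does not state the rate, but it is the one that yields the quoted return. The function and parameter names are illustrative only.

```python
from fractions import Fraction


def spammer_return(addresses, response_rate, product_price, commission):
    """Commission earned from a single mailing, computed with exact rationals."""
    sales = addresses * response_rate   # recipients who actually buy
    revenue = sales * product_price     # gross sales generated by the mailing
    return revenue * commission         # the spammer's share of those sales


# Figures from the slide; the 25% commission is an assumption.
earned = spammer_return(
    addresses=200_000_000,
    response_rate=Fraction(1, 100_000),  # 0.001%
    product_price=50,
    commission=Fraction(25, 100),
)
print(earned)  # 25000
```

Because the return is a product of the four factors, a filter that halves the number of messages reaching valid recipients also halves the return, which is the first of the two levers named on the slide.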

Current state of the art
- Interactive filters (challenge-response systems) intercept incoming emails from suspected spammers
- The email is held by the recipient's email server, which issues a challenge to the sender to establish that the email came from a human sender rather than a bulk mailer
- The belief is that spammers will be uninterested in completing the challenge
- If a fake email address is used by the sender, they will not receive the challenge
- Selective challenge-response (c/r) systems issue a challenge only when a (non-interactive) spam filter is unable to determine the class of a message

Current state of the art
- The current prime focus is automated, non-interactive filters
- Some are found in current commercial systems; others are confined to current research
- Two key current approaches:
  - Machine learning-based filters
  - Non-machine learning-based filters
- The current range of commercial systems is dominated by:
  - Heuristic filtering
  - Bayesian filtering
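
The selective challenge-response idea described above can be sketched as a three-way decision on a filter's spam score. This is a minimal illustration, not any vendor's implementation: the score range, the two thresholds, and all names are assumptions made for the example.

```python
from enum import Enum


class Action(Enum):
    DELIVER = "deliver"
    CHALLENGE = "challenge"
    DISCARD = "discard"


def selective_cr(spam_score, ham_below=0.2, spam_above=0.8):
    """Choose an action from a non-interactive filter's spam score in [0, 1].

    Both thresholds are hypothetical. A challenge is issued only when the
    filter cannot confidently place the message in either class.
    """
    if not 0.0 <= spam_score <= 1.0:
        raise ValueError("spam_score must lie in [0, 1]")
    if spam_score < ham_below:
        return Action.DELIVER      # confidently legitimate
    if spam_score > spam_above:
        return Action.DISCARD      # confidently spam
    return Action.CHALLENGE        # uncertain: hold the message and challenge the sender


for score in (0.05, 0.5, 0.95):
    print(score, selective_cr(score).value)
```

Narrowing the gap between the two thresholds reduces the number of challenges sent, at the price of relying more heavily on the automated filter.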

Spam Classification & Filtering Engines
- Non-machine learning: heuristics, signatures, blacklisting, hash-based, traffic analysis, etc.
- Machine learning techniques: Bayesian, sparse binary polynomial hashing, support vector machines, Markov models, pattern discovery, etc.
- Key developments in this area

Spam Classification & Filtering Engines
- Machine learning filtering techniques can be further categorised into:
  - Complementary solutions
  - Complete solutions
- Complementary solutions are designed to work as a component of a larger filtering system, offering support to the primary filter (ML or non-ML based)
- Complete solutions aim to construct a comprehensive knowledge base that allows them to classify all incoming messages independently
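
As an illustration of the Bayesian technique listed above, the following is a generic multinomial naive Bayes sketch with add-one smoothing. It is not the algorithm of any product named in this talk; the class name, the whitespace tokenisation, and the training sentences are all assumptions chosen to keep the example small.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy word-level naive Bayes classifier for spam versus ham."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Whitespace tokenisation is a simplification for the sketch.
        self.token_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_probability(self, text):
        if min(self.message_counts.values()) == 0:
            raise ValueError("train on at least one spam and one ham message first")
        vocabulary = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Log prior from the class frequencies seen in training.
            score = math.log(self.message_counts[label] / total_messages)
            total_tokens = sum(self.token_counts[label].values())
            for token in text.lower().split():
                # Add-one (Laplace) smoothing keeps unseen tokens from zeroing the score.
                count = self.token_counts[label][token] + 1
                score += math.log(count / (total_tokens + len(vocabulary)))
            log_score[label] = score
        # Turn the two log scores into P(spam | text).
        return 1.0 / (1.0 + math.exp(log_score["ham"] - log_score["spam"]))


nb = NaiveBayesFilter()
nb.train("buy cheap pills now", "spam")
nb.train("cheap pills online", "spam")
nb.train("meeting agenda for monday", "ham")
nb.train("lunch on monday", "ham")
print(nb.spam_probability("cheap pills"))
print(nb.spam_probability("monday meeting"))
```

Working in log space avoids underflow when many small token probabilities are multiplied; for very long messages the final exponential would also need guarding, which this sketch omits.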

Spam Classification & Filtering Engines
- Complete solutions come in a variety of flavours:
  - some aim to build a unified model
  - some compare incoming email to previous examples (previous likeness)
  - others use a collaborative approach, combining multiple classifiers to evaluate email (ensemble)

Spam Classification and Filtering Engines
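
The ensemble flavour of complete solution, in which several classifiers jointly evaluate an email, can be sketched as a simple majority vote. Majority voting is only one possible combination rule and is chosen here for brevity; the three toy classifiers and their cut-offs are hypothetical.

```python
def ensemble_is_spam(message, classifiers):
    """Return True when a strict majority of the classifiers vote spam."""
    if not classifiers:
        raise ValueError("at least one classifier is required")
    spam_votes = sum(1 for classify in classifiers if classify(message))
    return spam_votes * 2 > len(classifiers)


# Three toy classifiers; each takes the message text and returns a bool.
def has_spam_phrase(message):
    return "free money" in message.lower()


def is_shouting(message):
    letters = [c for c in message if c.isalpha()]
    # Flag messages whose letters are more than 70% upper case.
    return bool(letters) and sum(c.isupper() for c in letters) / len(letters) > 0.7


def has_many_links(message):
    return message.lower().count("http://") >= 3


voters = [has_spam_phrase, is_shouting, has_many_links]
print(ensemble_is_spam("FREE MONEY!!! CLICK NOW", voters))
print(ensemble_is_spam("Agenda for Monday's meeting", voters))
```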

Filtering Technologies
- Non-machine learning
  - Heuristics (rule-based analysis)
  - Signatures
  - Blacklisting
  - Traffic analysis
- Machine learning
  - Unified model filters (Bayesian filtering et al.)
  - Previous likeness based filters
  - Ensemble filters
  - Complementary filters

Corpus Performance Comparison
- Many of the techniques described are in various stages of research and development
- They are difficult to compare, as there is no single email benchmark database
  - SpamAssassin (spamassassin.apache.org) maintains a collection of legitimate and spam emails
  - Ling-Spam corpus [1]
  - Enron bankruptcy: 400 MB of realistic workplace email [11]
- Techniques used by spammers are continually evolving [27]
- Any static spam corpus would, over time, no longer resemble the makeup of current spam email

Case Study
- Two-stage email filtering: a DNS blacklisting system (eliminates 50,000 of 110,000 per day), followed by Process Software's Precise-Mail Anti-spam System (PMAS), which discards another 42% and quarantines 35% for review
- PMAS is based on a comprehensive heuristic rule collection, combining both server-level and user-level block and allow lists
- A Bayesian filtering option, which works in conjunction with the heuristic filter, was not active before the evaluation

Case Study
- Two benchmark corpora used:
  - SpamAssassin corpus (public)
  - SpamArchive corpus (internal)
- Training of the PMAS Bayesian filter took place over 2 weeks
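
The two-stage arrangement in the case study can be sketched as a blacklist check followed by a scored heuristic stage. This is an assumption-laden illustration, not a description of PMAS: the thresholds, the outcome labels, the zone name `dnsbl.example.org`, and the in-memory blacklist are all hypothetical. The one standard detail is the query name, which a DNS blacklist conventionally forms by reversing the IPv4 octets and appending the list's zone; a real deployment would then resolve that name over DNS.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address on a blacklist zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def two_stage_filter(sender_ip, message, blacklisted, score_message,
                     discard_at=10.0, quarantine_at=5.0):
    """Stage 1: blacklist check. Stage 2: heuristic score with two cut-offs.

    `blacklisted` is an in-memory set standing in for the DNS lookup, and
    both thresholds are hypothetical values chosen for the example.
    """
    if sender_ip in blacklisted:
        return "rejected"            # never reaches the heuristic stage
    score = score_message(message)
    if score >= discard_at:
        return "discarded"
    if score >= quarantine_at:
        return "quarantined"         # held for manual review
    return "delivered"


def toy_score(message):
    # Hypothetical heuristic: 6 points per occurrence of a suspicious word.
    words = message.lower().split()
    return 6.0 * sum(words.count(w) for w in ("viagra", "lottery"))


print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
print(two_stage_filter("192.0.2.99", "hello", {"192.0.2.99"}, toy_score))
print(two_stage_filter("198.51.100.7", "lottery winner", set(), toy_score))
```

Placing the cheap blacklist check first means the more expensive content analysis only runs on mail that survives it, which matches the ordering used in the case study.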

Case Study
- Overall results are consistent with those published by NetworkWorldFusion [51]
- They recorded 0.75% false positives and 96% accuracy, while we recorded 0.75% false positives (with the partial SpamAssassin corpus) and 97.67% accuracy
- Under both corpora, the combined filtering option surpasses the alternatives in the two key areas:
  - lower level of false positives
  - higher level of spam caught

Case Study
- Results indicate that filtering is best placed at the user level rather than the server level
- Consistent with Garcia et al. [19]
- We conclude two things from these experiments:
  - Use of a Bayesian filtering component improves overall filter performance; however, it is not a substitute for traditional heuristic filters
  - Time affects the validity of the corpora: older spam is more readily identified, suggesting changing techniques
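
The slides do not define how the false positive and accuracy figures were computed. The sketch below uses the conventional definitions from a confusion matrix, which is an assumption about the authors' method, and the counts in the usage example are invented purely to exercise the code; they are not results from the study.

```python
from dataclasses import dataclass


@dataclass
class FilterResults:
    true_positives: int    # spam correctly flagged
    false_positives: int   # legitimate mail wrongly flagged
    true_negatives: int    # legitimate mail correctly passed
    false_negatives: int   # spam that slipped through

    @property
    def false_positive_rate(self):
        """Share of legitimate mail that was wrongly flagged as spam."""
        return self.false_positives / (self.false_positives + self.true_negatives)

    @property
    def spam_caught(self):
        """Share of spam that the filter flagged."""
        return self.true_positives / (self.true_positives + self.false_negatives)

    @property
    def accuracy(self):
        """Share of all messages that were classified correctly."""
        correct = self.true_positives + self.true_negatives
        total = correct + self.false_positives + self.false_negatives
        return correct / total


# Invented counts, for illustration only.
example = FilterResults(true_positives=90, false_positives=1,
                        true_negatives=99, false_negatives=10)
print(example.false_positive_rate, example.spam_caught, example.accuracy)
```

Reporting the false positive rate separately from overall accuracy matters because, as noted earlier in the talk, a misfiled legitimate message costs the user more than a spam message that slips through.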

Spam Filtering Engines: performance of heuristic, Bayesian & combined filters

Conclusions
- Spam is a very serious problem for the Internet community
- It threatens both the integrity of networks and productivity
- Anti-spam vendors offer a wide array of products
- These can be implemented in various ways (software, hardware, service) and at various levels (server and user)
- The introduction of new technologies, such as Bayesian filtering, is improving filter accuracy
- The implementation of machine learning algorithms is likely to represent the next step in this ongoing fight