Bayesian spam filtering

Similar documents

Part III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing

Groundbreaking Technology Redefines Spam Prevention. Analysis of a New High-Accuracy Method for Catching Spam

Spam filtering. Peter Likarish Based on slides by EJ Jung 11/03/10

Analysis of Spam Filter Methods on SMTP Servers Category: Trends in Anti-Spam Development

Spam Filtering based on Naive Bayes Classification. Tianhao Sun

CSE 473: Artificial Intelligence Autumn 2010

Why Bayesian filtering is the most effective anti-spam technology

eprism Security Appliance 6.0 Intercept Anti-Spam Quick Start Guide

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

. Daniel Zappala. CS 460 Computer Networking Brigham Young University

Understanding Proactive vs. Reactive Methods for Fighting Spam. June 2003

Bayes and Naïve Bayes. cs534-machine Learning

Sender and Receiver Addresses as Cues for Anti-Spam Filtering Chih-Chien Wang

Anti Spamming Techniques

Adaption of Statistical Filtering Techniques

Savita Teli 1, Santoshkumar Biradar 2

Spam Filtering Methods for Filtering

Machine Learning. CS 188: Artificial Intelligence Naïve Bayes. Example: Digit Recognition. Other Classification Tasks

Multi-Protocol Content Filtering

Supervised Learning (Big Data Analytics)

Controlling Spam at the Routers

1 Maximum likelihood estimation

15-381: Artificial Intelligence. Probabilistic Reasoning and Inference

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007

Intent Based Filtering: A Proactive Approach Towards Fighting Spam

Combining Global and Personal Anti-Spam Filtering

Solutions IT Ltd Virus and Antispam filtering solutions

Marketing Do s and Don ts A Sprint Mail Whitepaper

SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2

An Overview of Spam Blocking Techniques

Machine Learning Final Project Spam Filtering

How to keep spam off your network

Antispam Security Best Practices

Detecting Spam in VoIP Networks. Ram Dantu Prakash Kolan

Spam Filters: Bayes vs. Chi-squared; Letters vs. Words

BARRACUDA. N e t w o r k s SPAM FIREWALL 600

Machine Learning.

Intercept Anti-Spam Quick Start Guide

Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools

Comparison of frequentist and Bayesian inference. Class 20, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

Penetration Testing for Spam Filters

The Basics of Graphical Models

Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information

Week 2: Conditional Probability and Bayes formula

Basics of Statistical Machine Learning

2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)

Bayesian Spam Detection

Machine learning for algo trading

Bayesian Spam Filtering

Prediction of Heart Disease Using Naïve Bayes Algorithm

International Journal of Research in Advent Technology Available Online at:

How to keep spam off your network

Achieve more with less

6367(Print), ISSN (Online) & TECHNOLOGY Volume 4, Issue 1, (IJCET) January- February (2013), IAEME

RedListing as a New Means to Combat Spam. MailFoundry December 7, 2006

Bayesian Filtering: Beyond Binary Classification Ben Kamens Fog Creek Software, Inc.

Reputation Metrics Troubleshooter. Share it!

K7 Mail Security FOR MICROSOFT EXCHANGE SERVERS. v.109

Spam Filtering and Removing Spam Content from Massage by Using Naive Bayesian

T : Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari :

Spam Filters: Bayes vs. Chi-squared; Letters vs. Words

REVIEW AND ANALYSIS OF SPAM BLOCKING APPLICATIONS

Question 2 Naïve Bayes (16 points)

Gibbs Sampling and Online Learning Introduction

A Game Theoretical Framework for Adversarial Learning

How To Block Ndr Spam

Anti Spam Best Practices

SPAM FILTER Service Data Sheet

Naive Bayes Spam Filtering Using Word-Position-Based Attributes

Discrete Structures for Computer Science

Rafael Witten Yuze Huang Haithem Turki. Playing Strong Poker. 1. Why Poker?

Why Content Filters Can t Eradicate spam

CS Computer and Network Security: Intrusion Detection

Less naive Bayes spam detection

The Mathematics of Gambling

Evolutionary Detection of Rules for Text Categorization. Application to Spam Filtering

Attribution. Modified from Stuart Russell s slides (Berkeley) Parts of the slides are inspired by Dan Klein s lecture material for CS 188 (Berkeley)

CSE/ISE 311: Systems Administra5on Administra5on

Transcription:

Bayesian spam filtering Keith Briggs Keith.Briggs@bt.com more.btexact.com/people/briggsk2 Pervasive ICT seminar 2004 Apr 29 1500 bayes-2004apr29.tex typeset 2004 May 6 10:48 in pdflatex on a linux system Bayesian spam filtering 1 of 19

Outline the problem some `solutions' probability theory Bayesian ideas... the real solution! The aim: to understand why Bayesian methods work so well Keith Briggs Bayesian spam filtering 2 of 19

The problem Total bt.com spam blocked per day Keith Briggs Bayesian spam filtering 3 of 19

RFC 706 Network Working Group Jon Postel (SRI-ARC) Request for Comments: 706 NIC #33861 On the Junk Mail Problem In the ARPA Network Host/IMP interface protocol there is no mechanism for the host to selectively refuse messages. This means that a Host which desires to receive some particular messages must read all messages addressed to it. Such a host could be sent many messages by a malfunctioning host. This would constitute a denial of service to the normal users of this host. Both the local users and the network communication could suffer. The services denied are the processor time consumed in examining the undesired messages and rejecting them, and the loss of network thruput or increased delay due to the unnecessary busyness of the network. Request for Comments: 706 Nov 1975 Keith Briggs Bayesian spam filtering 4 of 19

Solutions? Method version pro con legal CAN-SPAN just politics legal opt-in you re kidding! legal opt-out blacklist spamhaus etc. centralized whitelist new emailers must register financial penalty micropayment who administers? computational penalty new protocol tripoli [5]? new protocol challenge-response e.g. about.mailblocks.com lists attacks filter keyword it doesn t work filter probabilistic it works! honeypot beat them at admin load their own game commercial solutions??? Keith Briggs Bayesian spam filtering 5 of 19

Prediction 00000000000000000000000000000000000000000 01010101010101010101010101010101010101010 HHHTHHHHHTHHHTTHHHHHHTHHHHHHHTHHHHHHHHHHH GAATTCTAGGCTTTCTTTGAAGAGGTAGTAATCTGTAGCCC BT is Britain's greatest company Keith Briggs Bayesian spam filtering 6 of 19

#events P (event) lim n n Probability vs. degree of belief objective must be able to repeat the experiment indefinitely rate of convergence of limit unspecified strictly speaking, this rules out using this definition in the real world `degree of belief' B is more or less subjective meaningful for a single, non-repeatable event your B might not be the same as my B chance of rain tomorrow chance of horse winning a race spamminess of an email Cox's axioms are a set of reasonable assumptions under which there exists a function mapping beliefs to probabilities Keith Briggs Bayesian spam filtering 7 of 19

Probability theory conditional probability P (x = a y = b) P (x = a, y = b) P (y = b) product rule P (x, y H) = P (x y, H)P (y H) = P (y x, H)P (x H) marginalization P (x H) = y P (x, y H) = y P (x y, H)P (y H) Keith Briggs Bayesian spam filtering 8 of 19

Bayes' theorem Bayes' theorem - is just the product rule: P (y x, H) = P (x y, H)P (y H) P (x H)... with y interpreted as the data D, and x interpreted as parameters θ: P (θ D, H) = P (D θ, H)P (θ H) P (D H) Keith Briggs Bayesian spam filtering 9 of 19

Prior, likelihood and posterior we can think of Bayes' rule as: posterior likelihood*prior for example, a single bit s sent twice over a noisy channel, received as r 1 r 2 : P (s = 1 r 1 r 2 ) = P (r 1 r 2 s = 1)P (s = 1)/P (r 1 r 2 ) P (s = 0 r 1 r 2 ) = P (r 1 r 2 s = 0)P (s = 0)/P (r 1 r 2 ) that is, your prior (degree of belief before you observed that data r), is updated by the information the data provides about the value of s (the likelihood), to provide your posterior degree of belief Keith Briggs Bayesian spam filtering 10 of 19

Thomas Bayes (1702-1761) I now send you an essay which I have found among the papers of our deceased friend Mr Bayes, and which, in my opinion, has great merit... In an introduction which he has writ to this Essay, he says, that his design at first in thinking on the subject of it was, to find out a method by which we might judge concerning the probability that an event has to happen, in given circumstances, upon supposition that we know nothing concerning it but that, under the same circumstances, it has happened a certain number of times, and failed a certain other number of times. Richard Price. Keith Briggs Bayesian spam filtering 11 of 19

C biased? Friday 2002 January 04: When spun on edge 250 times, a Belgian C1 coin came up heads 140 times and tails 110. `It looks very suspicious to me', said Barry Blight, a statistics lecturer at the LSE `If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%' we compare the models H 0 (the coin is fair) and H 1 (the coin is biased), with uniform prior P (p H 1 ) = 1 the likelihood ratio is: 140!110! P (D H 1 ) P (D H 0 ) = 251! 0.48 1/2250 thus the data give weak evidence (2.08/1) in favour of H 0! Keith Briggs Bayesian spam filtering 12 of 19

Text classification theory could be based on various choices of features: words, or n-grams corpora C 1, C 2,..., C k priors π 1, π 2,..., π k models P C1, P C2,..., P Ck if x is an unknown document, the posterior probability that x belongs to C j is P (C j x) P Cj π j decision rule: choose j to maximize P (C j x) other uses sorting emails - work or personal forwarding emails in French or German etc. to the right person Keith Briggs Bayesian spam filtering 13 of 19

Digram measure word w = w 1 w 2... w k reference measure R C (w) p C (, w 1 )p C (w 1, w 2 )... p C (w k, $) this is naïve - it assumes adjacent digrams are statistically independent Dirichlet digram measure p C (u, v) = #(v u) + αµ(v) r #(r u) + α α is a hyperparameter, and the optimum α should be chosen from tests on various corpora for spam we use two corpora, spam and ham the latest softwares claim 99.9% accuracy false positives are much worse than false negatives, so we can adjust our algorithm to account for this Keith Briggs Bayesian spam filtering 14 of 19

Example digram measure Keith Briggs Bayesian spam filtering 15 of 19

Example using dbacl [1] # learn... kbriggs:~/bayes> dbacl -l ham ham/* kbriggs:~/bayes> dbacl -l spam spam/* # test... kbriggs:~/bayes> echo "meet for lunch?" dbacl -v -c ham -c spam ham kbriggs:~/bayes> echo "viagra" dbacl -v -c ham -c spam spam kbriggs:~/bayes> echo "mortgage" dbacl -v -c ham -c spam spam Keith Briggs Bayesian spam filtering 16 of 19

Can they beat us? the old trick of deliber@te spel1ing erors and füñny diàçrítics now works in our favour! dictionary salad: lots of random words - fails because the spammer doesn't know our model it's a long story: some genuine text - a possible danger habeas haiku: copyright poem, attempt at legal protection - now a strong spam indicator Keith Briggs Bayesian spam filtering 17 of 19

Honesty in signalling - some philosophy spammers cannot violate the rules of any machine protocol like SMTP... but they violate the rules of human netiquette all the time... so perhaps we have to write some software which emulates human behaviour to detect the breach normal human communication has evolved mechanisms to detect sincerity: tone of voice, body language... and in particular, complex rules of grammar which although they impose a cost on the speaker, the consequence is that the listener knows the speaker is making an effort, and is thus worth listening to does this imply that any solution to the spam problem (or any other network security problem) must involve some similar mechanism? in any case, I now understand why correct spelling and grammar are so important in written documents! Keith Briggs Bayesian spam filtering 18 of 19

References [1] L Breyer dbacl - a digramic Bayesian classifier http://dbacl.sourceforge.net/ [2] W Yerazunis The spam-filtering accuracy plateau at 99.9% and how to get past it http://crm114.sourceforge.net/plateau Paper.pdf [3] D J C MacKay & L C Peto A hierarchical Dirichlet language model http://www.inference.phy.cam.ac.uk/mackay/bayeslang.html [4] P Gburzynski & J Maitan Fighting the spam wars: a remailer approach with restrictive aliasing ACM Trans. Internet Tech. 4 1-30 (2004) [5] Tripoli http://www.pfir.org/tripoli-overview [6] J D M Rennie & T Jaakkola Automatic feature induction for text classification MIT AI Lab, September 2002 Keith Briggs Bayesian spam filtering 19 of 19