Spam filtering. Peter Likarish Based on slides by EJ Jung 11/03/10



Similar documents
Spam Filtering Methods for Filtering

How To Stop Spam From Being A Problem

An Overview of Spam Blocking Techniques

eprism Security Appliance 6.0 Intercept Anti-Spam Quick Start Guide

Groundbreaking Technology Redefines Spam Prevention. Analysis of a New High-Accuracy Method for Catching Spam

COMBATING SPAM. Best Practices OVERVIEW. White Paper. March 2007

ETH Zürich - Mail Filtering Service

Adaptive Filtering of SPAM

Spam, Spam and More Spam. Spammers: Cost to send

Evios. A Managed, Enterprise Appliance for Identifying and Eliminating Spam

Do you need to... Do you need to...

Commtouch RPD Technology. Network Based Protection Against -Borne Threats

A Monitor Tool for Anti-spam Mechanisms and Spammers Behavior

Intercept Anti-Spam Quick Start Guide

The spam economy: the convergent spam and virus threats

Being labeled as a spammer will drive your customers way, ruin your business, and can even get you a big fine or a jail sentence!

Government of Canada Managed Security Service (GCMSS) Annex A-5: Statement of Work - Antispam

How To Prevent Spam From Being Filtered Out Of Your Program

Who will win the battle - Spammers or Service Providers?

Bayesian Spam Filtering

Savita Teli 1, Santoshkumar Biradar 2

How to keep spam off your network

ModusMail Software Instructions.

Spamfilter Relay Mailserver

Anti Spam Best Practices

How to keep spam off your network

Ipswitch IMail Server with Integrated Technology

Recurrent Patterns Detection Technology. White Paper

Setting up Microsoft Outlook to reject unsolicited (UCE or Spam )

Migration Project Plan for Cisco Cloud Security

Handling Unsolicited Commercial (UCE) or spam using Microsoft Outlook at Staffordshire University

IMPROVING SPAM FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT

AntiSpam QuickStart Guide

Marketing 201. How a SPAM Filter Works. Craig Stouffer Pinpointe On-Demand cstouffer@pinpointe.com (408) x125

s and anti-spam Page 1

Antispam Security Best Practices

. Daniel Zappala. CS 460 Computer Networking Brigham Young University

SCORECARD MARKETING. Find Out How Much You Are Really Getting Out of Your Marketing

Achieve more with less

Spam Testing Methodology Opus One, Inc. March, 2007

The Network Box Anti-Spam Solution

Spam Classification Using Machine Learning Techniques - Sinespam

Panda Cloud Protection

Fighting Spam in an ISP Environment:

Emerging Trends in Fighting Spam

Spam DNA Filtering System

Marketing Glossary of Terms

T : Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari :

Comprehensive Filtering. Whitepaper

Fighting Spam: Tools, Tips, and Techniques

Enhanced Spam Defence

Stop Spam Now! By John Buckman. John Buckman is President of Lyris Technologies, Inc. and programming architect behind Lyris list server.

How to Use Red Condor Spam Filtering

A Case-Based Approach to Spam Filtering that Can Track Concept Drift

Global Reputation Monitoring The FortiGuard Security Intelligence Database WHITE PAPER

Get Started Guide - PC Tools Internet Security

Why Content Filters Can t Eradicate spam

How To Filter From A Spam Filter

Anti-SPAM Solutions as a Component of Digital Communications Management

K7 Mail Security FOR MICROSOFT EXCHANGE SERVERS. v.109

How To Protect Your From Spam On A Barracuda Spam And Virus Firewall

Combining Global and Personal Anti-Spam Filtering

Intelligent Word-Based Spam Filter Detection Using Multi-Neural Networks

How To Configure Forefront Threat Management Gateway (Forefront) For An Server

Improving the Performance of Heuristic Spam Detection using a Multi-Objective Genetic Algorithm. James Dudley

A White Paper. VerticalResponse, Delivery and You A Handy Guide. VerticalResponse,Inc nd Street, Suite 700 San Francisco, CA 94107

Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools

Transcription:

Spam filtering Peter Likarish Based on slides by EJ Jung 11/03/10

What is spam? An unsolicited email equivalent to Direct Mail in postal service UCE (unsolicited commercial email) UBE (unsolicited bulk email) 82% of US email in 2004 [MessageLabs 2004] Etymology Monty Python s spam episode ham = not spam

Historical events May 3 rd, 1978 Sent to Arpanet west coast users Invitation to see new models of DEC-20 Violated ARPA s user policy Vendors were chastised & ceased (for then) March 31 st, 1993 first UUCP spam Unix-to-Unix Copy April, 1994 Green Card Lottery spam Message posted to EVERY UUCP group

Who s the victim? Internet Service Provider (ISP) much higher bandwidth usage Email service providers need to support larger disk space need to deploy anti-spam measures Businesses time spent on spam damage by spams and viruses and worms Us?

Does spam bother you? Potentially malicious viruses, worms phishing emails Fill up the disk space Does it bother you more than DMs? if so, why?

How much email is spam? >97% filtered! Source: Microsoft Security Intelligence Report 09 (http://www.microsoft.com/downloads/details.aspx?familyid=037f3771-330e-4457-a52c- 5b085dc0a4cd&displaylang=en)

Other numbers Spamhaus >90% http://www.spamhaus.org/effective_filtering.html Postini (Google owned) >90%-95% http://googleenterprise.blogspot.com/search/label/spam%20and%20security%20trends Consensus: >90% of all email is spam

How many messages? Is (was) exponentially increasing

Cost of spam Very back of the envelope 100,000,000,000 messages/month Recipient $0.025 $2.5 billion/month for recipients Sender $0.00001 for sender to generate $1 million/month for sender Profit? 1:200,000 = 500,000 sales. $2 is break-even point

Legal actions CAN-SPAM Act of 2003 UBE must be labeled Must have opt-out option Some people went to jail Some ISPs have been shut down

Some ways to blocking spam Network-level DNS blackholes/blacklisting Edge filtering Content-level Rule-based checksum Machine learning Probabilistic SVM Cost-based approach

DNS filtering Blacklist = list of known spammers IP addresses/domains/senders Top five spamming ISPs account for > 80% of spam traffic! Whitelist = list of known good senders IP addresses/domains/senders Further reading: Zhang et al. Highly Predictive Blacklisting (http://www.cyber-ta.org/releases/hpb/highlypredictiveblacklists-sri-tr-format.pdf)

How much email is spam? >97% filtered! Source: Microsoft Security Intelligence Report 09 (http://www.microsoft.com/downloads/details.aspx?familyid=037f3771-330e-4457-a52c- 5b085dc0a4cd&displaylang=en)

Edge filtering Filtering at edges of network. Ingress: filtering traffic entering your network Egress: filtering traffic exiting your network IP Sender Attachments Mailstream Providers Content

Content filtering Rule-based regular expressions on headers, contents, blacklist Hash/checksum database Bayesian probability/networks probabilistic approach Support Vector Machine High dimension hyper-plane!

Goal Given new email message do I deliver or trash? Spam I ve seen New? S?? S? S

Rule-based filtering Rules are expressed as regular expressions (or SNORT traces) If contains refinancing & mortgage, trash If contains non-english alphabets, trash If attachment type = executable, trash Limitations Difficult to write rules general enough to catch spam but not legit email often miss obvious tricks misspelling on purpose

Checksums Fuzzy hashing Goal: resilient to spelling changes Does it look like spam I ve already seen?? S?? S? S

Fuzzy vs. non-fuzzy hashing Traditional hashing Same: Fixed length output Different: Unique input = (nearly) unique output 1 bit change in input = COMPLETELY different hash Fuzzy hashing Same: Fixed length output Different: mostly the same = SAME value

Fuzzy hashing example Input Viagra Vi4gr4 Fuzzy hash Output DEF8F32 DE28F3C V!aGra DGF8E32 V i a g r a DAF8W3B Output = fixed length. Look at distance between outputs. Shorter distance = more similar. Input = message

Bayesian Theorem Conditional probability P( A B) = P( A! B) P( B) Bayes Theorem P ( A B) = P( B A) P( A) P( B)

Bayesian in Spam [Graham 02, Sahami 98] Given a word stock in an email, what is the probability of this email being spam? we can compute these from the sample set of emails P ( spam stock) = P( stock spam) P( spam) P( stock)

More words Does more words mean higher probability? related words such as stock, jackpot, blue chip, not-related words such as stock, diet, n words require O(2 n ) probabilities to compute Naïve Bayesian filter assumes the independence between words O(n) probabilities to compute P ( spam / stock, jackpot) = P( stock, jackpot P( stock, spam) P( spam) jackpot)

Support Vector Machine Given records, plot them in n-dimension spaces and find a plane that divides the plots the best.

Feature selection How do we choose which tokens to examine? Words alone are not sufficient phrases, punctuation, sender s email address, IP address, domain name, URLs, images, tags, More sophisticated conditions attachment? contains images?

Hybrid approaches SpamAssassin open-source combination of rule-based and Bayesian commonly run at SMTP server can be applied for an individual user Vipul s Razor Collaborative, distributed, checksum (statistical & randomized signature-based) anti-spam solution Commercial version: Cloudmark, Inc

Charging for email If you can t win the game, change the rules Would you pay $0.0025 to send an email? If you sent 50 emails/day: ~$46/year Spammers: 100,000,000,000 * $.0025 = $250,000,000 dollars Assuming adoption rate of 1:200,000 Break even = $1,250