Design and Evaluation of a Real-Time URL Spam Filtering Service


Design and Evaluation of a Real-Time URL Spam Filtering Service
Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, Dawn Song
University of California, Berkeley
International Computer Science Institute

Motivation
  Spam:
    Social Networks (Facebook, Twitter)
    Blogs, Services (Blogger, Yelp)
    Web Mail (Gmail, Live Mail)

Motivation
  Existing solutions:
    Blacklists
    Service-specific, account heuristics
  Develop new spam filter service:
    Filter spam: scams, phishing, malware
    Real-time, fine-grained, generalizable

Overview
  Our system, Monarch:
    Accepts millions of URLs from web service
    Crawls, labels each URL in real-time
  Spam classification:
    Decision based on URL content, page behavior, hosting
  Large-scale; distributed collection, classification
  Implemented as a cloud service

Monarch in Action
  [Figure: diagram with the labels Spam Account, Social Network, URL, Monarch, Spam URL Content, and Message Recipients; step 3 is labeled "Fetch Content".]

Challenges
  Accuracy
  Real-Time
  Scalability
  Tolerant to Feature Evolution

Outline
  Architecture
  Results & Performance
  Limitations
  Conclusion

System Architecture


URL Aggregation
  Source                      Sample Size
  Spam email URLs             1.25 million
  Blacklisted Twitter URLs    567,000
  Non-spam Twitter URLs       9 million
  Collection period: 9/8/2010 to 10/29/2010

Feature Collection
  High Fidelity Browser
  Navigation:
    Lexical features of URLs (length, subdomains)
    Obfuscation (directory operations, nested encoding)
  Hosting:
    IP/ASN
    A, NS, MX records
    Country, city if available
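The lexical and obfuscation features named above can be illustrated with a minimal sketch. This is not Monarch's feature extractor: the function name `lexical_features`, the feature names, the example URL, and the specific checks ("..", "//", "%25") are assumptions chosen for illustration only.

```python
from urllib.parse import urlsplit

def lexical_features(url: str) -> dict:
    """Extract a few illustrative lexical features from a URL string."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = [label for label in host.split(".") if label]
    return {
        "url_length": len(url),
        # Labels left of the last two (naive: ignores multi-part suffixes like co.uk).
        "num_subdomains": max(len(labels) - 2, 0),
        # Directory operations such as "/../" or "//" inside the path.
        "has_directory_ops": ".." in parts.path or "//" in parts.path,
        # Nested encoding: an encoded percent sign ("%25") hides another escape.
        "has_nested_encoding": "%25" in url.lower(),
    }

# Hypothetical URLs, made up for this example.
suspicious = lexical_features("http://a.b.example.com/x/../login%252Ephp")
benign = lexical_features("http://example.com/index.html")
print(suspicious)
print(benign)
```

In a real pipeline such values would be only a small part of the feature vector, next to the hosting, content, and behavior features collected by the crawler.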

Feature Collection
  Content:
    Common HTML templates, keywords
    Search engine optimization
    Content of request, response headers
  Behavior:
    Prevent navigating away
    Pop-up windows
    Plugin, JavaScript redirects

Classification
  Distributed Logistic Regression
    Data overload for single machine
  L1-regularization
    Reduces feature space, over-fitting
    50 million features -> 100,000 features
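To show why L1 regularization shrinks the feature space, here is a small single-machine sketch. It uses proximal gradient descent (soft-thresholding) on made-up toy data, which is an assumption for illustration and not the distributed training procedure the slides refer to. The names `fit_l1_logistic`, `lam`, `lr`, and `steps` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink z toward zero, clamping small values to 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def fit_l1_logistic(X, y, lam, lr=0.5, steps=500):
    """Proximal gradient descent for mean log-loss + lam * ||w||_1 (no intercept)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        # Gradient step on the log-loss, then shrink every weight toward zero.
        w = [soft_threshold(w[j] - lr * grad[j], lr * lam) for j in range(d)]
    return w

# Toy data (made up): feature 0 determines the label,
# feature 1 agrees with the label only 75% of the time.
X = [[1, 1], [1, 1], [1, 1], [1, -1], [-1, -1], [-1, -1], [-1, -1], [-1, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

w_sparse = fit_l1_logistic(X, y, lam=0.3)  # L1 penalty on
w_dense = fit_l1_logistic(X, y, lam=0.0)   # no penalty
print("with L1:   ", w_sparse)
print("without L1:", w_dense)
```

With the penalty on, the weight of the weaker feature is clamped to exactly zero, so that feature can be dropped from the model; without it, both weights stay nonzero. The same effect at scale is what reduces 50 million candidate features to about 100,000.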

Implementation
  System implemented as a cloud service on Amazon EC2
  Aggregation: 1 machine
  Feature Collection: 20 machines
    Firefox, extension + modified source
  Classification & Feature Extraction: 50 machines
    Hadoop - Spark, Mesos
  Straightforward to scale the architecture

Result Overview
  High-level summary:
    Performance
    Overall accuracy
    Highlight important features
    Feature evolution
    Spam independence between services

Performance
  Rate: 638,000 URLs/day
  Cost: $1,600/mo
  Process time: 5.54 sec
  Network delay: 5.46 sec
  Can scale to 15 million URLs/day
    Estimated $22,000/mo
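The rate and cost figures can be turned into a per-URL cost with a short calculation. The sketch below assumes a 30-day month, which the slide does not state; the function name `cost_per_million_urls` is hypothetical.

```python
DAYS_PER_MONTH = 30  # assumption: the slide does not define a month length

def cost_per_million_urls(monthly_cost_usd, urls_per_day):
    """Monthly cost divided by the number of millions of URLs handled per month."""
    urls_per_month = urls_per_day * DAYS_PER_MONTH
    return monthly_cost_usd / (urls_per_month / 1_000_000)

current = cost_per_million_urls(1_600, 638_000)      # measured deployment
scaled = cost_per_million_urls(22_000, 15_000_000)   # estimated deployment
print(f"current: ${current:.2f} per million URLs")
print(f"scaled:  ${scaled:.2f} per million URLs")
```

Under that assumption the measured deployment costs roughly $84 per million URLs and the estimated 15 million URLs/day deployment roughly $49 per million, so the estimate implies a lower unit cost at larger scale.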

Measuring Accuracy
  Dataset: 12 million URLs (<2 million spam)
    Sample 500K spam (half tweets, half email)
    Sample 500K non-spam
  Training, Testing
    5-fold validation
    Vary training folds non-spam:spam ratio
    Test fold equal parts spam, non-spam
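One way to build such folds is sketched below, on a dataset scaled down to 500 spam and 500 non-spam identifiers. The slide does not say how the training ratio was varied; downsampling spam in the training folds is an assumption here, and `make_folds`, `split_for_fold`, and the placeholder identifiers are hypothetical.

```python
import random

def make_folds(items, k, rng):
    """Shuffle a copy of items and deal it into k folds of near-equal size."""
    items = items[:]
    rng.shuffle(items)
    return [items[i::k] for i in range(k)]

def split_for_fold(spam_folds, ham_folds, test_idx, ratio, rng):
    """Return (train_spam, train_ham, test_spam, test_ham) for one test fold."""
    test_spam, test_ham = spam_folds[test_idx], ham_folds[test_idx]
    n_test = min(len(test_spam), len(test_ham))
    train_spam = [u for i, f in enumerate(spam_folds) if i != test_idx for u in f]
    train_ham = [u for i, f in enumerate(ham_folds) if i != test_idx for u in f]
    # Assumption: downsample spam so that non-spam:spam is ratio:1 in training.
    n_spam = min(len(train_spam), len(train_ham) // ratio)
    return (rng.sample(train_spam, n_spam), train_ham[: n_spam * ratio],
            test_spam[:n_test], test_ham[:n_test])

rng = random.Random(0)
spam = [f"spam-{i}" for i in range(500)]    # placeholder identifiers
ham = [f"ham-{i}" for i in range(500)]
spam_folds = make_folds(spam, 5, rng)
ham_folds = make_folds(ham, 5, rng)

splits = {r: split_for_fold(spam_folds, ham_folds, 0, r, rng) for r in (1, 4, 10)}
for r, (tr_s, tr_h, te_s, te_h) in splits.items():
    print(f"{r}:1 -> train {len(tr_h)} non-spam / {len(tr_s)} spam, "
          f"test {len(te_h)} non-spam / {len(te_s)} spam")
```

The test fold stays balanced regardless of the training ratio, so the accuracy figures for different ratios are measured on comparable test sets.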

Overall Accuracy
  Training Ratio    Accuracy    False Positive Rate    False Negative Rate
  1:1               94%         4.23%                  7.5%
  4:1               91%         0.87%                  17.6%
  10:1              87%         0.29%                  26.5%
  Accuracy: correctly labeled samples
  False Positive Rate: non-spam labeled as spam
  False Negative Rate: spam labeled as non-spam
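Because the test fold holds equal parts spam and non-spam, the accuracy column should equal one minus the average of the two error rates. The sketch below checks that against the table; `balanced_accuracy` is a hypothetical helper name.

```python
def balanced_accuracy(fpr, fnr):
    """Accuracy on a test set with equal parts spam and non-spam."""
    return 1.0 - (fpr + fnr) / 2.0

# (false positive rate, false negative rate) per training ratio, from the table.
rows = {"1:1": (0.0423, 0.075), "4:1": (0.0087, 0.176), "10:1": (0.0029, 0.265)}
accuracy = {ratio: balanced_accuracy(fpr, fnr) for ratio, (fpr, fnr) in rows.items()}
for ratio, acc in accuracy.items():
    print(f"{ratio}: {acc:.1%}")
```

The computed values round to 94%, 91%, and 87%, matching the reported accuracies. They also show the trade-off: more non-spam in training lowers false positives but raises false negatives faster, so overall accuracy drops.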

Error by Feature
  [Figure: chart of Error (%) and False Positive Rate by feature, y-axis from 0 to 50.]
  Error = 1 - Accuracy

Feature Evolution
  Retraining Required
  [Figure: Accuracy (%), axis from 86 to 98, from 12-Sep to 24-Sep, with retraining vs. without retraining.]

Spam Independence
  Unexpected result: Twitter, email spam qualitatively different
  Training Set    Testing Set    Accuracy    False Negatives
  Twitter         Twitter        94%         22%
  Twitter         Email          81%         88%
  Email           Twitter        80%         99%
  Email           Email          99%         4%

Distinct Email, Twitter Features

Email Features Shorter Lived

Limitations
  Adversarial Machine Learning
    We provide oracle to spammers
    Can adversaries tweak content until passing?
  Time-based Evasion
    Change content after URL submitted for verification
  Crawler Fingerprinting
    Identify IP space of Monarch, fingerprint Monarch browser client
    Dual-personality DNS, page behavior

Related Work
  C. Whittaker, B. Ryner, and M. Nazif, Large-Scale Automatic Classification of Phishing Pages
  J. Ma, L. Saul, S. Savage, and G. Voelker, Identifying suspicious URLs: an application of large-scale online learning
  Y. Zhang, J. Hong, and L. Cranor, Cantina: a content-based approach to detecting phishing web sites
  M. Cova, C. Kruegel, and G. Vigna, Detection and analysis of drive-by-download attacks and malicious JavaScript code

Conclusion
  Monarch provides:
    Real-time scam, phishing, malware detection
    Experiments show 91% accuracy, 0.87% false positives
    Readily scalable cloud service
    Applicable to all URL-based spam
  Spam not guaranteed to overlap between web services
    Twitter, email qualitatively different
    Despite overlap, can still provide generalizable filtering
    Require training data from each service