How To Create A Spam Detector On A Web Browser



Similar documents
Design and Evaluation of a Real-Time URL Spam Filtering Service

Design and Evaluation of a Real-Time URL Spam Filtering Service

Design and Evalua.on of a Real- Time URL Spam Filtering Service

LASTLINE WHITEPAPER. Large-Scale Detection of Malicious Web Pages

EVILSEED: A Guided Approach to Finding Malicious Web Pages

Pwning Intranets with HTML5

Threat Spotlight: Angler Lurking in the Domain Shadows

Using big data analytics to identify malicious content: a case study on spam s

Recurrent Patterns Detection Technology. White Paper

REGULATORY OPTIONS TO FACILITATE THE ADOPTION OF INTERNET PARENTAL CONTROLS PUBLIC CONSULTATION RESPONSE FROM NETSWEEPER INC.

Flow Analysis Versus Packet Analysis. What Should You Choose?

Ipswitch IMail Server with Integrated Technology

Getting Started with WPM

Cross Site Scripting in Joomla Acajoom Component

FireEye Threat Prevention Cloud Evaluation

The Advanced Attack Challenge. Creating a Government Private Threat Intelligence Cloud

From Internet Data Centers to Data Centers in the Cloud

W3Perl A free logfile analyzer

Detection of Malicious URLs by Correlating the Chains of Redirection in an Online Social Network (Twitter)

Real Time Analytics for Big Data. NtiSh Nati

An Efficient Methodology for Detecting Spam Using Spot System

eprism Security Appliance 6.0 Intercept Anti-Spam Quick Start Guide

A Server and Browser-Transparent CSRF Defense for Web 2.0 Applications. Slides by Connor Schnaith

Why Mobile Performance is Hard

Cymon.io. Open Threat Intelligence. 29 October 2015 Copyright 2015 esentire, Inc. 1

SPAM, VIRUSES AND PHISHING, OH MY! Michael Starks, CISSP, CISA ISSA Fellow 10/08/2015

Computer Networks - CS132/EECS148 - Spring

The State of Spam A Monthly Report August Generated by Symantec Messaging and Web Security

Index. AdWords, 182 AJAX Cart, 129 Attribution, 174

LOAD BALANCING TECHNIQUES FOR RELEASE 11i AND RELEASE 12 E-BUSINESS ENVIRONMENTS

When Reputation is Not Enough: Barracuda Spam & Virus Firewall Predictive Sender Profiling

Computer Networking LAB 2 HTTP

When Reputation is Not Enough: Barracuda Spam Firewall Predictive Sender Profiling. White Paper

Accelerating Wordpress for Pagerank and Profit

How To Understand The Power Of A Content Delivery Network (Cdn)

SK International Journal of Multidisciplinary Research Hub

CS 558 Internet Systems and Technologies

Commtouch RPD Technology. Network Based Protection Against -Borne Threats

The Dirty Secret Behind the UTM: What Security Vendors Don t Want You to Know

IBM Advanced Threat Protection Solution

Cisco Cloud Web Security Key Functionality [NOTE: Place caption above figure.]

TREND MICRO. InterScan VirusWall 6. SMTP Configuration Guide. Integrated virus and spam protection for your Internet gateway.

Security Design.

Request Routing, Load-Balancing and Fault- Tolerance Solution - MediaDNS

GoToMyPC Corporate Advanced Firewall Support Features

INTERNET MARKETING. SEO Course Syllabus Modules includes: COURSE BROCHURE


Getting Started with AWS. Hosting a Static Website

Service Performance Management: Pragmatic Approach by Jim Lochran

GLOBAL SERVER LOAD BALANCING WITH SERVERIRON

+ web + DLP. Secure 1, 2, or all 3 with one powerful solution. The best security you can get for one or for all.

A Hybrid Approach to Detect Zero Day Phishing Websites

McAfee Web Gateway Administration Intel Security Education Services Administration Course Training

Content Delivery and the Natural Evolution of DNS

Websense Web Security Solutions. Websense Web Security Gateway Websense Web Security Websense Web Filter Websense Hosted Web Security

Web Traffic Capture Butler Street, Suite 200 Pittsburgh, PA (412)

YouServ: A Web Hosting and Content Sharing Tool for the Masses

LASTLINE WHITEPAPER. Using Passive DNS Analysis to Automatically Detect Malicious Domains

Analysis of Web Archives. Vinay Goel Senior Data Engineer

85.1. Review of google.com. Your website score. Generated on Introduction. Table of Contents SEO. Iconography Pass

Comprehensive Understanding of Malicious Overlay Networks

Configuration Information

A new fake Citibank phishing scam using advanced techniques to manipulate users into surrendering online banking access has emerged.

Marketing 201. How a SPAM Filter Works. Craig Stouffer Pinpointe On-Demand cstouffer@pinpointe.com (408) x125

WEB APPLICATION FIREWALLS: DO WE NEED THEM?

AKAMAI WHITE PAPER. Delivering Dynamic Web Content in Cloud Computing Applications: HTTP resource download performance modelling

Cyan Networks Secure Web vs. Websense Security Gateway Battle card

Big Data in Action: Behind the Scenes at Symantec with the World s Largest Threat Intelligence Data

Testing & Assuring Mobile End User Experience Before Production. Neotys

Domain Name Abuse Detection. Liming Wang

Anti-Phishing Best Practices for ISPs and Mailbox Providers

How To Integrate Hosted Security With Office 365 And Microsoft Mail Flow Security With Microsoft Security (Hes)

Best Practices for Evaluating Anti-spam Solutions

When Reputation is Not Enough. Barracuda Security Gateway s Predictive Sender Profiling. White Paper

EyjafjallajöKull Framework (aka: Exploit Kits Krawler Framework)

Global Reputation Monitoring The FortiGuard Security Intelligence Database WHITE PAPER

A Practical Attack to De Anonymize Social Network Users

THE OPEN UNIVERSITY OF TANZANIA

CiscoWorks Internetwork Performance Monitor 4.0

Chapter 4: Networking and the Internet

Advancements in Botnet Attacks and Malware Distribution

Enterprise Buyer Guide

Transcription:

Design and Evaluation of a Real-Time URL Spam Filtering Service Geraldo Franciscani 15 de Maio de 2012 Teacher: Ponnurangam K (PK)

Introduction Initial Presentation Monarch is a real-time system for filtering scam, phishing, and malware URLs as they are submitted to web services.

Introduction General Considerations An alternative for account-based system (inneficient because can incur delays between a fraudulent account s creatin and its subsequent detection). Development and evaluation of a real-time, scalable system for detecting spam content in web services. Exposure of fundamental differences between email and Twitter spam, showing that spam targeting one web service does not generalize to other web services.

Introduction General Considerations Presentation of a novel feature collection and classification architecture that employs an instrumented browser and a new distributed classifier that scales to tens of millions of features. Examination of the salience of each feature used for detecting spam and evaluate their performance over time.

Architecture Overview

Architecture Design Goals Real-time results. Readily scalable to required throughput (applied to big social networks like Twitter). Accurate decisions (emphasize low false positives). Tolerant to feature evolution (easily adapt to new features). Context-independent classification (applied to different types of web services).

Architecture System Flow

Architecture Design Goals URL Aggregation. Feature Collection (crawler). Feature Extraction (raw data into a sparse feature vector understood by the classification engine). Classification.

Feature Collection Overview Expand upon the idea of considering just lexical properties of URLs, page content, and hosting properties of domains. The system collects his own sources of features based on one of three components: web browser, DNS resolver and IP address analysis.

Web Browser Features Initial URL and Landing URL. Redirects (monitoring each redirected page). Sources and Frames (analysis of pages with spam content embedded within a non-spam page). HTML Content (similar layout and content across spam webpages). Page Links (search and analyze each link block - ex. href s).

Web Browser Features JavaScript Events (webpages that require user action). Pop-up Windows (saves the total number and associate them with the parent URL). Plugins (requests that leads to an outgoing HTTP request, causes a page to redirect). HTTP Headers (languages and versions of spam hosts).

DNS Resolver Overview The system collect features from the initial, final and redirect URLs submitted to a DNS resolver. Features: hostnames, nameservers, mailservers, and IP addresses associated with each domain. Each feature provides a means for potentially identifying common hosting infrastructure across spam.

IP Address Analysis Overview Extract geolocation and routing information. Identify portions of the Internet with a higher prevalence of spam.

Feature Extraction Overview Preparation for the classification. Transform the unprocessed features gathered during feature collection into a meninful feature vector. Provide processed sparse maps to the classifier for training and decision making.

Distributed Classifier Design Overview Design as a parallel online learner for quick training and memory fit. Combination of strategies of iterative parameter mixing and subgradient L1-regularization.

Implementation Overview The system operates on Amazon Web Services (AWS). Each of the four components of Monarch was implemented as an independent system.

Evaluation Overview Evaluation of the classifier s accuracy and its run-time performance. 90.78% accuracy (0.87% false positives). Median feature collection and classification time of 5.54 seconds. Overlap between email and tweet spam features, requiring the classifier to learn two distinct sets of rules.

Classifier Performance Overall Accuracy Trained with all the features. Non-spam to spam ratio. 4:1 ratio adoption because of its low false positives and reasonable false negatives.

Classifier Performance Accuracy of Individual Components Accuracy of classifier when trained on a single type of feature.

Classifier Performance Accuracy Over Time Determinate how often the classifier needs to be retrained and how long it takes to become out of date. Two types: one classifier trained every four days, and another trained just at the begining of the test. Conclusion: the classifier needs to be retrained on a continual basis to achieve a good classification.

Classifier Performance Accuracy Over Time - Chart

Classifier Performance Training Across Input Sources Effects of training and testing on matching and mismatching data sets. Email and tweet spam are independent. Low cross classification accuracy.

Classifier Performance Context vs. Context Free Training Omitting account and tweet properties from classification has no statistically significant effect on accuracy.

Run Time Performance Overview Analysis Latency: median time of 5.54 seconds per URL. Throughput: 638,000 URLs per day. Cost: $1, 600.00 on cloud machinery per month. Twitter example: total cost for filtering 15.3 million URLs per day to $22, 751 per month.

Comparing Email and Tweet Spam Overview Email: short-lived hosting infrastructure and compaigns. Twitter: longer lasting campaigns that push quite different content. Email and Twitter spam share only 10% of features in common. The classifier must learn two separate sets of rules to identify both spam types. Persistence: email spam is marked by much shorter lived features compared to tweet spam and non-spam samples.

Comparing Email and Tweet Spam Overview - Chart

Spam Infrastructure Redirecting to spam Both Twitter and email spammers use redirects to deliver victims to spam content. Twitter: 67% of spam URLs use redirects, with a median path length of 3. Email: only 20% of spam URLs use redirects, with a median path lenght of 2.

Spam Infrastructure Page Content Motivation: observation of popular news sites with non-spam content displaying ads that cause a variety of spam popups, sounds, and video to play. Conclusion: it is necessary to analyze all of a webpage s content to not overlook spam with dynamic page behavior or mash-up content that includes known spam domains.

Discussion Overview Attackers can tune features to fall below the spam classification (feature evasion), modify content after classification (Time Based Evasion) and block the crawler (Crawler Evasion). Future work: in depth study of each attack and potential solutions.

Conclusion Overview Presented a real-time system for filtering scam, phishing, and malware URLs as they are submitted to web services. Explored the distinctions between email and Twitter spam. Demonstrated that a modest deployment of Monarch on cloud infrastructure can achieve a throughput of 638,000 URLs per day with an overall accuracy of 91% with 0.81% false positives. Estimated it would cost $22, 751 a month to run a deployment of Monarch capable of processing 15 million URLs per day.