Multi-Protocol Content Filtering




Matthew Johnson <mwj99@doc.ic.ac.uk>
MEng Individual Project

Notes: Title, hello, etc.

Why filter content?
- Information overload
- Specific personal interests
- General signal-to-noise ratio, affected by unwanted content, usually commercial or advertisement-based

Notes:
Information overload: too much content, or too many content items to handle, but nothing we specifically don't want to know about. Think of message digests of mailing lists, kerneltraffic and friends.
Specific personal interest: lots of content, but much of it is about things we are not interested in, even though that content may interest other people. For example, on a developers' mailing list we may be interested in bugfixes and problems with a Linux version, but couldn't care less about VMS.
General signal-to-noise ratio (SNR): the ratio of content in which the userbase is interested to content in which the userbase is uninterested.

Why is spam such a problem?
- Not just email: Usenet / Netnews also suffers.
- Email: significant increase, 48% in the last 12 months.

Notes:
Half of all emails monitored by MessageLabs in May 2003 were spam. June seems to be lower so far, but we are not at the end of the month yet.

Email filtration options
- Killfiles / Blacklists: simplistic header-based filters. Spammers regularly spoof headers, so these are not much help.
- Precise hash matches (e.g. Vipul's Razor): spammers regularly insert hashbusters into their content. But collaborative filtering is not without merit...
- Regexp-based content matching and server blacklisting (e.g. SpamAssassin): very effective, but suffers due to static heuristic rules.

Notes:
Killfiles are still used because users understand them really well, despite their lack of effectiveness. They are still useful for deliberately blocking posts from contributors who rub you up the wrong way, but useless for spam.
Matching content is the right direction, but it is not foolproof. Spamware agents have the ability to insert hashbusters, which are deliberately designed to throw off trivial hash-collision detection methods. Note the benefits of collaborative filtering though; when it works, it's good.
Discuss SpamAssassin (SA) rules: body and header matching, e.g. Nigerian spam, mail-client spoofing. Effectiveness is excellent, but errors remain possible. Quite a lot of confirmation emails (e.g. Easyjet, Ryanair) get misclassified because they match the heuristics. Equally, spam that doesn't match the static rules is not detected.
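To make the hashbuster problem concrete, here is a minimal Python sketch of why precise hash matching is brittle. It is an illustration only: SHA-256 is an arbitrary choice for the example and is not claimed to be what Vipul's Razor uses, and the message text and the name exact_hash are made up.

```python
import hashlib


def exact_hash(text: str) -> str:
    """Return a hex digest of the message body (SHA-256 chosen arbitrarily)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical spam body, and the same body with a hashbuster appended.
spam = "Buy cheap pills now! Click here."
spam_with_buster = spam + " xqzv81"

# The set of hashes a precise-match filter already knows about.
known_spam_hashes = {exact_hash(spam)}

original_caught = exact_hash(spam) in known_spam_hashes
busted_caught = exact_hash(spam_with_buster) in known_spam_hashes

print(original_caught)  # True: the identical copy is caught
print(busted_caught)    # False: a few junk characters defeat the match
```

Any change to the body, however small, produces an unrelated digest, which is why a precise-match filter misses the second copy.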

The dynamic solution
- Static rules can make an educated guess as to what the user thinks may be spam...
- ...but the only way to find out precisely is to have the user tell us.
- The user's wishes are unlikely to be codifiable as a set of static rules; we must find a different way.

Project Objectives
- Implementation of a content filter for mail and news, controlled and influenced by the individual user.
- Content filtration by statistical classification and distribution of content hashes.
- Investigation of statistical classification as applied to news.

System Architecture
[Diagram. Labels: Incoming Mail, Incoming News, Mail Handler, News Handler, Mgmt Clients, Spam Handler, Collab Handler, Content Handler, Management Interface, Core, Bayesian Classifier, Collaborative Filter, Filtered Mail, Filtered News, Collab Messages.]

Statistical filtering
- Analyze a set of examples which the user tells us are either spam or non-spam.
- Calculate the prior probability of each word in the examples based on how often it appears in spam content.
- e.g. "Click" appears in 939 out of 2,355 spam examples and 113 out of 4,787 non-spam examples.

\[ p_{\text{spam}} = \frac{\frac{939}{2355}}{\frac{113}{4787} + \frac{939}{2355}} = 0.9441 \]
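The "Click" figure from the statistical filtering slide can be checked with a few lines of Python. This is a minimal sketch, not the project's code: the function name and its handling of empty or unseen inputs are assumptions made for the example.

```python
def word_spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Per-word spam probability from example counts.

    spam_hits / spam_total: spam examples containing the word / all spam examples.
    ham_hits / ham_total: the same counts for non-spam examples.
    """
    if spam_total == 0 or ham_total == 0:
        raise ValueError("need at least one spam and one non-spam example")
    spam_freq = spam_hits / spam_total
    ham_freq = ham_hits / ham_total
    if spam_freq + ham_freq == 0:
        # Assumption for this sketch: a word never seen is treated as neutral.
        return 0.5
    return spam_freq / (spam_freq + ham_freq)


# Counts for "Click" taken from the slide.
p_click = word_spam_probability(939, 2355, 113, 4787)
print(round(p_click, 4))  # 0.9441
```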

The Naïve Bayesian Classifier
- To test a content item, look up the probability of every word in the new content in the table we created.
- Find the n most extreme probabilities (those closest to 0 or 1).
- Use the word probabilities as likelihood indicators for the new content being spam.

\[ P_{\text{spam}} = \frac{\prod_{k=1}^{n} p_k}{\prod_{k=1}^{n} p_k + \prod_{k=1}^{n} (1 - p_k)} \]

Collaborative filtration
- Users are generally in some form of community.
- The same spam content may reach more than one member of the community.
- The time delay in mail handling works to our advantage.
- Can we share knowledge within communities to reduce the amount of spam a user sees?
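The combining step on the Naïve Bayesian Classifier slide (pick the n most extreme word probabilities, then combine them into one score) can be sketched as follows. The word probabilities are made-up values, the function names are hypothetical, and the clamping to [0.01, 0.99] is an assumption added so that a single probability of exactly 0 or 1 cannot cause a division by zero; the slides do not say how the project handles that case.

```python
from math import prod


def most_extreme(probabilities, n):
    """Return the n probabilities furthest from the neutral value 0.5."""
    return sorted(probabilities, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def combine(probabilities, floor=0.01, ceiling=0.99):
    """Combine per-word probabilities into one spam score.

    Computes prod(p) / (prod(p) + prod(1 - p)). Clamping to
    [floor, ceiling] is an assumption of this sketch.
    """
    clamped = [min(max(p, floor), ceiling) for p in probabilities]
    spam_product = prod(clamped)
    ham_product = prod(1.0 - p for p in clamped)
    return spam_product / (spam_product + ham_product)


# Hypothetical word probabilities looked up for a new message.
word_probs = [0.9441, 0.5, 0.2, 0.99, 0.45]

top = most_extreme(word_probs, 3)  # the three values furthest from 0.5
score = combine(top)
print(top, score)
```

Neutral words (probabilities near 0.5) are dropped by the selection step, so they do not dilute the score.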

Better content matching
- Current hash-detection systems fail too readily.
- Need a function such that: if content a and b are substantively similar, then their values α and β are arithmetically similar.
- A fuzzy hash: a hash where two hashes are quantitatively comparable.

Using fuzzy hashing in collaboration
- Alice receives an email, which is detected as spam.
- Alice's mail filter hashes the content, notes the hash, and sends it on to any interested collaborators.
- Bob's mail filter receives a collaborative message regarding the new spam. It notes the hash.
- Bob then receives an email. The email is hashed and compared with the hashes his filter knows about.
- Bob's mail filter discovers the new mail is a 98% match with the spam Alice told us about.
- Bob has set his hash match threshold to 70%, so the mail is detected as spam.
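The Alice and Bob exchange can be sketched in Python. The slides do not specify the fuzzy hash algorithm, so this example swaps in a simple stand-in: a set of overlapping character shingles compared by Jaccard similarity. A real implementation would send a compact digest rather than the shingle set itself; the names fingerprint and similarity and the message texts are hypothetical.

```python
def fingerprint(text, width=5):
    """Stand-in for a fuzzy hash: the set of overlapping character shingles."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= width:
        return frozenset({normalized})
    return frozenset(
        normalized[i:i + width] for i in range(len(normalized) - width + 1)
    )


def similarity(fp_a, fp_b):
    """Jaccard similarity of two fingerprints, between 0.0 and 1.0."""
    return len(fp_a & fp_b) / len(fp_a | fp_b)


# Alice's filter detects spam and shares its fingerprint with collaborators.
alice_spam = "Buy cheap pills now! Click here for a limited offer."
known_spam = [fingerprint(alice_spam)]

# Bob later receives a near-copy carrying a hashbuster, plus a genuine mail.
bob_spam = alice_spam + " xqzv81"
bob_ham = "Minutes of the kernel developers meeting are attached."
threshold = 0.70  # Bob's hash match threshold from the slide


def is_known_spam(text):
    """True if the text matches any shared fingerprint at or above the threshold."""
    fp = fingerprint(text)
    return any(similarity(fp, known) >= threshold for known in known_spam)


print(is_known_spam(bob_spam))  # True
print(is_known_spam(bob_ham))   # False
```

Unlike a precise hash, the appended junk only lowers the match score slightly, so the threshold setting decides how aggressive the filter is.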

Implementation Challenges
- Homogenization of content from various protocols: abstract message format
- PGP integration for trustworthy collaboration
- News protocol implementation

Results
- Like-for-like testing:
  - My filter: 75% accuracy with no false positives
  - SpamAssassin: 90% accuracy with no false positives
- Hard to test collaborative filtering
- Reasonable performance, but not really comparable with the bleeding edge
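As an illustration of the abstract message format mentioned under Implementation Challenges, here is one possible shape for it in Python. The ContentItem class, its fields, and the from_raw helper are assumptions made for this sketch, not the project's actual design; it relies only on the fact that mail messages and news articles share the same header-and-body layout, which the standard library email parser can read.

```python
from dataclasses import dataclass
from email import message_from_string


@dataclass
class ContentItem:
    """Hypothetical protocol-neutral view of one message."""
    protocol: str  # "mail" or "news"
    author: str
    subject: str
    body: str


def from_raw(raw: str, protocol: str) -> ContentItem:
    """Parse header-and-body text from either protocol into a ContentItem."""
    msg = message_from_string(raw)
    payload = msg.get_payload()
    # Multipart messages return a list; this sketch only keeps plain bodies.
    body = payload if isinstance(payload, str) else ""
    return ContentItem(
        protocol=protocol,
        author=msg.get("From", ""),
        subject=msg.get("Subject", ""),
        body=body,
    )


raw_news = (
    "From: alice@example.org\n"
    "Newsgroups: comp.os.linux\n"
    "Subject: Kernel bugfix\n"
    "\n"
    "Patch attached.\n"
)
item = from_raw(raw_news, "news")
print(item.subject, "|", item.body.strip())
```

With every item reduced to the same structure, the classifier and the hash matcher do not need to know which protocol delivered the content.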

Demonstration

Further Work
- Optimization of configuration variables: token thresholds, number of tokens used in testing.
- Optimization of fuzzy hash matching algorithm: slow due to attempted rolling window matches.
- Addition of other protocols: web-based bulletin boards?
- User interface extensions: provide a usable mail/news client.
- SpamAssassin for news, meta-filtration: infrastructure could apply SpamAssassin to news; refactor to allow multiple content testing methods.

Summary
- A content filter which functions acceptably
- Bayesian filtering and fuzzy hash matching are useful
- Sole use of these technologies may not be sufficient
- Combining filters is likely to be the best solution

Any further questions?