Spam Filtering using Naïve Bayesian Classification




Presented by: Samer Younes

Outline
  What is spam anyway?
  Some statistics
  Why is spam a problem?
  Major techniques for classifying spam
    Transport Level Filtering
    Fingerprint Analysis
    Lexical Analysis
    Artificial Intelligence
    Statistical Analysis
    Heuristics
  Naïve Bayesian Classification
    Some definitions
    Equations
  Evaluation measures
  Conclusions

So What is Spam?
  Spam = Unsolicited Bulk E-mail
  Used to promote:
    Mass advertising and marketing
    Scams and fraud
    Other less-than-desirable content
  Reasons for using e-mail:
    Cheap and cost-effective
    No widespread legislation banning its use for such purposes

Some Statistics
  Current spam traffic accounts for:
    60% of all e-mail traffic
    Approximately 10 billion messages / day
    Approximately 2.6 trillion messages / year
    Equivalent to 2000 TB / day

Some Statistics (ctnd.)
  Source: Brightmail Incorporated
  Loss of productivity adds up to more than $10 billion in the US alone!

Why is Spam a Problem?
  Financial costs
  Resource abuse
  Storage
  Employee productivity
  Over-burdened mail servers
  Fraud
  Social costs

So How to Stop Spam?
  Several techniques exist:
    Transport Level Filtering
    Fingerprint Analysis
    Lexical Analysis
    Artificial Intelligence
    Statistical Analysis
    Heuristics

Transport Level Filtering
  Secure SMTP
  Realtime Blackhole Lists (RBLs)
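To make the RBL idea concrete, here is a minimal sketch of how a mail server might query a DNS-based blackhole list. It assumes the common convention of reversing the IPv4 octets and appending the list's zone; the zone `rbl.example.org` and both function names are hypothetical and do not come from the slides.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets followed by the list's zone."""
    ipaddress.IPv4Address(ip)  # raises ValueError if ip is not a valid IPv4 address
    return ".".join(reversed(ip.split("."))) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the list has an entry for ip (needs network access)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No DNS record means the sender is not on this list.
        return False


# Hypothetical zone; 192.0.2.1 is a documentation-only address.
print(dnsbl_query_name("192.0.2.1", "rbl.example.org"))  # 1.2.0.192.rbl.example.org
```

Only the name construction is exercised here; `is_listed` is shown to indicate where the lookup would happen and is not called.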

Fingerprint Analysis
  Similar to anti-virus software
  Requires a large fingerprint database
  Not a viable solution considering subtle variations in messages
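The weakness noted above can be illustrated with a small sketch. It assumes fingerprints are exact cryptographic hashes of the message body; real products may use fuzzier signatures, and the sample messages are made up for illustration.

```python
import hashlib


def fingerprint(message: str) -> str:
    """Exact-match fingerprint: SHA-256 hex digest of the message text."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


# Hypothetical fingerprint database containing one known spam message.
known_spam = {fingerprint("Buy cheap watches now!")}


def is_known_spam(message: str) -> bool:
    """Check the message against the fingerprint database."""
    return fingerprint(message) in known_spam


print(is_known_spam("Buy cheap watches now!"))   # True: exact copy
print(is_known_spam("Buy cheap watches now!!"))  # False: one extra character
```

A single added character produces a different digest, so the database would need one entry per variation.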

Lexical Analysis
  Contextual examination of words or phrases
  Uses rule-based discrimination and weighted words to filter
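A weighted-word filter of this kind might look like the sketch below. The word weights and the threshold are invented for illustration; an actual rule set would be much larger and hand-tuned.

```python
import re

# Hypothetical weights: positive values suggest spam, negative values suggest legitimate mail.
WORD_WEIGHTS = {"free": 2.0, "winner": 3.0, "viagra": 5.0, "meeting": -2.0}
SPAM_SCORE_THRESHOLD = 4.0  # hypothetical cut-off


def lexical_score(message: str) -> float:
    """Sum the weights of all known words in the message."""
    words = re.findall(r"[a-z']+", message.lower())
    return sum(WORD_WEIGHTS.get(word, 0.0) for word in words)


def is_spam(message: str) -> bool:
    """Flag the message when its score reaches the threshold."""
    return lexical_score(message) >= SPAM_SCORE_THRESHOLD


print(lexical_score("You are a WINNER of a free prize"))  # 5.0
print(is_spam("Free lunch at the meeting"))               # False
```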

Artificial Intelligence
  Uses neural network techniques to filter out spam
  (Diagram: incoming e-mail is classified as Spam or NOT Spam)

Statistical Analysis
  Includes Bayesian classifiers
  Similar classification rates to neural networks (NN)
  Associates probabilities with words
  Will be covered in subsequent slides

Heuristics
  A combination of the methods listed above
  Used in large enterprise architectures

Naïve Bayesian Classification: Some Definitions
  Corpus = Body of words used in statistical classification and training.
  Vector = $v = (x_1, x_2, \ldots, x_n)$, an array of values with associated attributes (in spam filters, the attributes are words).
  Lemmatizer = Engine that returns the root of a word (e.g. Lem("going") = "go").
  Stop-List = List of frequently used words (e.g. "the", "is", etc.).
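The definitions above can be tied together in a short sketch that turns a message into a binary attribute vector. The stop-list, the lookup-table lemmatizer, and the attribute list are all toy assumptions; a real lemmatizer is considerably more involved.

```python
import re

STOP_LIST = {"the", "is", "a", "to", "of"}                 # tiny illustrative stop-list
TOY_LEMMAS = {"going": "go", "goes": "go", "went": "go"}   # hypothetical lemma table


def lem(word: str) -> str:
    """Toy lemmatizer: look the word up, fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)


def tokenize(message: str) -> list:
    """Lower-case, split into words, drop stop-list words, lemmatize the rest."""
    words = re.findall(r"[a-z]+", message.lower())
    return [lem(word) for word in words if word not in STOP_LIST]


def to_vector(message: str, attributes: list) -> list:
    """x_i = 1 if attribute i occurs in the message, else 0."""
    present = set(tokenize(message))
    return [1 if attribute in present else 0 for attribute in attributes]


attributes = ["go", "free", "money"]  # hypothetical attribute words
print(to_vector("The money is going to the bank", attributes))  # [1, 0, 1]
```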

What is Bayesian Classification?
  Given a document d, create a vector v of values and compute the mutual information (MI) of each candidate attribute using the following formula:

  $$MI(X; C) = \sum_{x \in \{0,1\},\; c \in \{\text{spam},\, \text{legitimate}\}} P(X = x, C = c) \cdot \log \frac{P(X = x, C = c)}{P(X = x) \cdot P(C = c)}$$

  The attributes with the highest MIs are selected, and they are compared to frequency ratios from the training corpus.
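As a rough illustration of the MI computation, the sketch below scores one binary attribute (word present or absent) against the class (spam or legitimate) by summing P(x, c) · log(P(x, c) / (P(x) · P(c))) over the four combinations. Probabilities are estimated as simple frequency ratios from document counts. The count arguments and the use of the natural logarithm are assumptions; the log base only rescales the scores and does not change the ranking of attributes.

```python
import math


def mutual_information(n11: int, n10: int, n01: int, n00: int) -> float:
    """MI(X; C) for a binary attribute X and a two-valued class C.

    n11: spam messages containing the word
    n10: legitimate messages containing the word
    n01: spam messages without the word
    n00: legitimate messages without the word
    """
    total = n11 + n10 + n01 + n00
    joint = {(1, "spam"): n11, (1, "legitimate"): n10,
             (0, "spam"): n01, (0, "legitimate"): n00}
    p_x = {1: (n11 + n10) / total, 0: (n01 + n00) / total}
    p_c = {"spam": (n11 + n01) / total, "legitimate": (n10 + n00) / total}

    mi = 0.0
    for (x, c), count in joint.items():
        p_xc = count / total
        if p_xc > 0:  # a zero joint probability contributes nothing to the sum
            mi += p_xc * math.log(p_xc / (p_x[x] * p_c[c]))
    return mi


print(mutual_information(25, 25, 25, 25))  # 0.0: word is independent of the class
print(mutual_information(50, 0, 0, 50))    # log(2): word perfectly predicts the class
```

Attributes would then be ranked by this score and the top ones kept.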

What is Bayesian Classification? (ctnd.)
  Having computed the MIs, a new vector v is constructed and fed to the following formula:

  $$P(C = c \mid X = x) = \frac{P(C = c) \cdot \prod_{i=1}^{n} P(X_i = x_i \mid C = c)}{\sum_{k \in \{\text{spam},\, \text{legitimate}\}} P(C = k) \cdot \prod_{i=1}^{n} P(X_i = x_i \mid C = k)}$$

  This assumes that the attributes of vector v are conditionally independent given the class.
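A minimal sketch of the posterior computation follows: for each class, multiply the prior by the per-attribute conditional probabilities, then normalize over both classes. It assumes binary attributes and that the per-class probabilities P(X_i = 1 | C = c) have already been estimated; the numbers in the example are made up.

```python
import math


def posterior(x: list, priors: dict, cond: dict) -> dict:
    """Return P(C = c | X = x) for every class c.

    x:      list of 0/1 attribute values
    priors: {class: P(C = class)}
    cond:   {class: [P(X_i = 1 | C = class) for each attribute i]}
    """
    joint = {}
    for c, prior in priors.items():
        # Conditional independence: the likelihood is a product over attributes.
        likelihood = math.prod(
            p_one if x_i == 1 else 1.0 - p_one
            for x_i, p_one in zip(x, cond[c])
        )
        joint[c] = prior * likelihood
    evidence = sum(joint.values())  # denominator: sum over both classes
    return {c: value / evidence for c, value in joint.items()}


# Hypothetical model with two attributes.
priors = {"spam": 0.5, "legitimate": 0.5}
cond = {"spam": [0.8, 0.1], "legitimate": [0.2, 0.6]}
result = posterior([1, 0], priors, cond)
print(round(result["spam"], 3))  # 0.9
```

With many attributes the product can underflow, so a practical version would typically sum logarithms instead.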

What is Bayesian Classification? (ctnd.)
  A message is classified as spam if the following criterion is satisfied:

  $$\frac{P(C = \text{spam} \mid X = x)}{P(C = \text{legitimate} \mid X = x)} > \lambda$$

  The threshold λ is usually set to 999 in order to avoid classifying legitimate e-mail as spam.
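To see what a threshold of 999 means in practice, the ratio criterion can be rewritten as a bound on the spam probability alone. This short derivation uses only the fact that the two posteriors sum to 1.

```latex
\begin{align*}
\frac{P(C = \text{spam} \mid X = x)}{P(C = \text{legitimate} \mid X = x)} > \lambda
  &\iff \frac{P(C = \text{spam} \mid X = x)}{1 - P(C = \text{spam} \mid X = x)} > \lambda \\
  &\iff P(C = \text{spam} \mid X = x) > \frac{\lambda}{1 + \lambda} \\
\lambda = 999 &\implies P(C = \text{spam} \mid X = x) > \frac{999}{1000} = 0.999
\end{align*}
```

In other words, a message is blocked only when the classifier is more than 99.9% confident that it is spam.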

How it all fits together

Results and Measures
  Results based on testing of the Bayesian classifier were as follows:

Results and Measures (ctnd.)
  Evaluation was also performed to assess the usefulness of lemmatizers and stop-lists in increasing the accuracy of the filter with a threshold of 999.

Results and Measures (ctnd.)
  The following table shows precision, recall and accuracy of the filter with various configurations.
  TCR = Total Cost Ratio
  Higher TCR = Better Results
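For readers unfamiliar with these measures, the sketch below computes them from the four outcome counts of a test run. The slides do not define TCR; the form used here, which counts each blocked legitimate message λ times as heavily as a missed spam message and compares the result with using no filter at all, is a common cost-sensitive definition and is an assumption. All counts in the example are invented.

```python
def evaluate(spam_as_spam: int, spam_as_legit: int,
             legit_as_spam: int, legit_as_legit: int,
             lam: float = 999) -> dict:
    """Compute spam precision, spam recall, accuracy and TCR from outcome counts.

    Assumes at least one message was flagged as spam and at least one error
    occurred; otherwise a denominator below would be zero.
    """
    n_spam = spam_as_spam + spam_as_legit
    total = n_spam + legit_as_spam + legit_as_legit
    return {
        "precision": spam_as_spam / (spam_as_spam + legit_as_spam),
        "recall": spam_as_spam / n_spam,
        "accuracy": (spam_as_spam + legit_as_legit) / total,
        # Assumed definition: cost of no filter / cost of using the filter.
        "tcr": n_spam / (lam * legit_as_spam + spam_as_legit),
    }


# Hypothetical run: no legitimate mail blocked, 10 of 100 spam messages missed.
print(evaluate(90, 10, 0, 200))
# Same run, but one legitimate message blocked: TCR drops below 1.
print(evaluate(90, 10, 1, 199)["tcr"] < 1)  # True
```

Under this definition, a TCR above 1 means the filter is better than no filter, and a single blocked legitimate message is very costly when λ = 999.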

Conclusions
  The authors believe naïve Bayesian classification is not viable.
  However, in vivo results of the technique have shown a high degree of accuracy in catching spam.
  The technique is used in various commercial products targeted mainly at the single user.