Machine Learning in Spam Filtering




Machine Learning in Spam Filtering: A Crash Course in ML
Konstantin Tretyakov (kt@ut.ee)
Institute of Computer Science, University of Tartu

Overview: Spam is Evil. ML for Spam Filtering: General Idea, Problems. Some Algorithms: Naïve Bayesian Classifier, k-Nearest Neighbors Classifier, The Perceptron, Support Vector Machine. Algorithms in Practice: Measures and Numbers. Improvement Ideas: Striving for the Ideal Filter. Machine Learning Techniques in Spam Filtering, p. 1/26

Spam is Evil. It is cheap to send, but expensive to receive: there is a large amount of bulk traffic between servers; dial-up users spend bandwidth downloading it; people spend time sorting through unwanted mail; important e-mails may get deleted by mistake; and pornographic spam is not meant for everyone.

Eliminating Spam. Social and political solutions: never send spam; never respond to spam; put all spammers in jail. Technical solutions: block the spammer's IP address; require authorization for sending e-mails (?); mail filtering, either by knowledge engineering (KE) or by machine learning (ML).

Knowledge Engineering. Create a set of classification rules by hand, e.g.: if the Subject of a message contains the text BUY NOW, then the message is spam (procmail, Message Rules in Outlook, etc.). Such a set of rules is difficult to maintain. A possible solution is to maintain it in a centralized manner, but then the spammer also has access to the rules.

Machine Learning. Classification rules are derived from a set of training samples. For example:

Training samples:
  Subject: "BUY NOW"    -> SPAM
  Subject: "BUY IT"     -> SPAM
  Subject: "A GOOD BUY" -> SPAM
  Subject: "A GOOD BOY" -> LEGITIMATE
  Subject: "A GOOD DAY" -> LEGITIMATE

Derived rule:
  Subject contains "BUY" -> SPAM

Machine Learning. A training set is required, and it has to be updated regularly. It is hard to guarantee that no misclassifications occur. On the other hand, there is no need to manage and understand the rules.

Machine Learning. Training set: {(m_1, c_1), (m_2, c_2), ..., (m_n, c_n)}, where the m_i ∈ M are training messages and each is assigned a class c_i ∈ {S, L}. Using the training set we construct a classification function f : M → {S, L}, which we afterwards use to classify (unseen) messages.

ML for Spam: Problem 1. Problem: we classify text, but most classification algorithms require numerical data (vectors in R^n), a distance metric between objects, or a scalar product. Solution: use a feature extractor φ : M → R^n to convert messages to vectors.
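A minimal bag-of-words extractor illustrates what such a φ might look like (the vocabulary and messages here are invented for the example, not taken from the corpus used later):

```python
# Sketch of a feature extractor phi: M -> R^n mapping a message to a
# binary "is word w present?" vector over a fixed vocabulary.
def phi(message, vocabulary):
    words = set(message.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

vocabulary = ["buy", "now", "good", "day"]
print(phi("BUY NOW", vocabulary))  # [1, 1, 0, 0]
```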

ML for Spam: Problem 2. Problem: a spam filter must not make mistakes. A false positive is a legitimate mail classified as spam; a false negative is spam classified as legitimate mail. False negatives are acceptable; false positives are very bad. Solution: ?

Algorithms: Naive Bayes. The Bayes rule:

  P(c|x) = P(x|c)P(c) / P(x) = P(x|c)P(c) / (P(x|S)P(S) + P(x|L)P(L))

Classification rule: if P(S|x) > P(L|x), classify as SPAM.

Algorithms: Naive Bayes. The Bayesian classifier is optimal, i.e. its average error rate is minimal over all possible classifiers. The problem is that we can never know the exact probabilities in practice.

Algorithms: Naive Bayes. How to calculate P(x|c)? It is simple if the feature vector is simple. Let the feature vector consist of a single binary attribute x_w, where x_w = 1 if a certain word w is present in the message and x_w = 0 otherwise. We may use more complex feature vectors if we assume that the presence of one word does not influence the probability of the presence of other words, i.e.

  P(x_w, x_v | c) = P(x_w | c) P(x_v | c)
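Under that independence assumption the classifier can be sketched as follows. Laplace (add-one) smoothing is an added assumption here, used to avoid zero probabilities; the training data is the toy example from the earlier slide:

```python
# Sketch of a Naive Bayes spam classifier over binary word features.
# Add-one smoothing is an assumption of this sketch, not of the slides.
from collections import defaultdict

def train_nb(samples, vocabulary):
    counts = {"S": 0, "L": 0}
    word_counts = {"S": defaultdict(int), "L": defaultdict(int)}
    for text, c in samples:
        counts[c] += 1
        present = set(text.lower().split())
        for w in vocabulary:
            if w in present:
                word_counts[c][w] += 1
    prior = {c: counts[c] / len(samples) for c in counts}
    cond = {c: {w: (word_counts[c][w] + 1) / (counts[c] + 2)
                for w in vocabulary} for c in counts}
    return prior, cond

def classify_nb(text, vocabulary, prior, cond):
    present = set(text.lower().split())
    score = {}
    for c in ("S", "L"):
        p = prior[c]
        for w in vocabulary:  # independence: multiply per-word terms
            p *= cond[c][w] if w in present else 1 - cond[c][w]
        score[c] = p
    return "S" if score["S"] > score["L"] else "L"

samples = [("buy now", "S"), ("buy it", "S"), ("a good buy", "S"),
           ("a good boy", "L"), ("a good day", "L")]
vocabulary = ["buy", "now", "good", "boy", "day"]
prior, cond = train_nb(samples, vocabulary)
print(classify_nb("buy cheap now", vocabulary, prior, cond))  # S
```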

Algorithms: k-nn Suppose we have a distance metric d defined for messages. To determine the class of a certain message m we find its k nearest neighbors in the training set. If there are more spam messages among the neighbors, classify m as spam, otherwise as legitimate mail. Machine Learning Techniques in Spam Filtering p.13/26

Algorithms: k-nn k-nn is one of the few universally consistent classification rules. Theorem (Stone): as the size of the training set n goes to infinity, if k, n k 0, then the average error of the k-nn classifier approaches its minimal possible value. Machine Learning Techniques in Spam Filtering p.14/26

Algorithms: The Perceptron. The idea is to find a linear function of the feature vector, f(x) = wᵀx + b, such that f(x) > 0 for vectors of one class and f(x) < 0 for vectors of the other class. Here w = (w_1, w_2, ..., w_m) is the vector of coefficients (weights) of the function, and b is the so-called bias. If we denote the classes by the numbers +1 and −1, we can state that we search for a decision function d(x) = sign(wᵀx + b).

Algorithms: The Perceptron. Start with arbitrarily chosen parameters (w_0, b_0) and update them iteratively. On the n-th iteration of the algorithm, choose a training sample (x, c) such that the current decision function does not classify it correctly (i.e. sign(w_nᵀx + b_n) ≠ c), and update the parameters (w_n, b_n) using the rule

  w_{n+1} = w_n + c·x
  b_{n+1} = b_n + c

The procedure is guaranteed to terminate if the training samples are linearly separable.

Algorithms: The Perceptron. Fast and simple. Easy to implement. Requires linearly separable data.
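The update rule above can be sketched directly; the toy data below is invented and linearly separable:

```python
# Sketch of perceptron training with classes coded as +1/-1, applying
# the update rule w <- w + c*x, b <- b + c on each misclassified sample.
def train_perceptron(samples, max_epochs=100):
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(max_epochs):
        updated = False
        for x, c in samples:
            # misclassified (or on the boundary): apply the update rule
            if c * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + c * xi for wi, xi in zip(w, x)]
                b += c
                updated = True
        if not updated:  # all samples classified correctly: converged
            break
    return w, b

# Invented toy data: the class depends on the first feature.
samples = [([1, 1], +1), ([1, 0], +1), ([0, 1], -1), ([0, 0], -1)]
w, b = train_perceptron(samples)
print(w, b)  # a separating (w, b), e.g. [2.0, 0.0] -1.0
```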

Algorithms: SVM. The same idea as with the Perceptron: find a separating hyperplane wᵀx + b = 0. This time, however, we are not interested in just any separating hyperplane, but in the maximal-margin separating hyperplane.

Algorithms: SVM. [Figure: the maximal-margin separating hyperplane.]

Algorithms: SVM. Finding the optimal hyperplane requires minimizing a quadratic function over a convex domain, a task known as a quadratic programme. Statistical Learning Theory by V. Vapnik guarantees good generalization for SVMs. There are lots of further options for SVMs (soft-margin classification, nonlinear kernels, regression). SVMs are currently among the most widely used ML classification techniques.
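A full quadratic-programming solver is beyond a slide, but a Pegasos-style subgradient descent on the regularized hinge loss gives a rough soft-margin sketch; the toy data and hyperparameters below are illustrative assumptions, not the setup evaluated later:

```python
# Sketch of a soft-margin linear SVM trained by subgradient descent on
# the regularized hinge loss (an approximation of the exact QP).
def train_svm(samples, lam=0.01, lr=0.1, epochs=200):
    # samples: list of (x, c) with c in {+1, -1}; returns (w, b)
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, c in samples:
            margin = c * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # inside the margin: hinge term is active
                w = [wi - lr * (lam * wi - c * xi) for wi, xi in zip(w, x)]
                b += lr * c
            else:           # only the regularizer shrinks w
                w = [wi * (1 - lr * lam) for wi in w]
    return w, b

samples = [([2, 2], +1), ([2, 1], +1), ([0, 1], -1), ([0, 0], -1)]
w, b = train_svm(samples)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
         for x, _ in samples]
print(preds)
```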

Practice: Measures. Denote by N_{S→L} the number of false negatives (spam classified as legitimate) and by N_{L→S} the number of false positives (legitimate mail classified as spam). The quantities of interest are then the error rate and precision,

  E = (N_{S→L} + N_{L→S}) / N,    P = 1 − E,

and the legitimate-mail fallout and spam fallout,

  F_L = N_{L→S} / N_L,    F_S = N_{S→L} / N_S.

Note that the error rate and precision must be considered relative to the case of no classifier.
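These measures are straightforward to compute from raw counts. The sketch below assumes the standard PU1 corpus composition of 481 spam and 618 legitimate messages, which reproduces the Naïve Bayes row of the results table:

```python
# Error rate, precision, and fallout rates from raw counts.
# n_sl: false negatives (spam -> legitimate), n_ls: false positives
# (legitimate -> spam), n_s / n_l: total spam / legitimate messages.
def measures(n_sl, n_ls, n_s, n_l):
    n = n_s + n_l
    error = (n_sl + n_ls) / n
    return error, 1 - error, n_ls / n_l, n_sl / n_s

# Naive Bayes on PU1: 0 false positives, 138 false negatives
# (481 spam + 618 legitimate messages assumed).
e, p, f_l, f_s = measures(138, 0, 481, 618)
print(round(p, 3), round(f_l, 3), round(f_s, 3))  # 0.874 0.0 0.287
```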

Practice: Numbers. Results of 10-fold cross-validation on the PU1 spam corpus:

  Algorithm     N_{L→S}   N_{S→L}   P       F_L     F_S
  Naïve Bayes   0         138       87.4%   0.0%    28.7%
  k-NN          68        33        90.8%   11.0%   6.9%
  Perceptron    8         8         98.5%   1.3%    1.7%
  SVM           10        11        98.1%   1.6%    2.3%

Eliminating False Positives. Results after tuning the parameters to eliminate false positives:

  Algorithm           N_{L→S}   N_{S→L}   P       F_L    F_S
  Naïve Bayes         0         140       87.3%   0.0%   29.1%
  l/k-NN              0         337       69.3%   0.0%   70.0%
  SVM (soft margin)   0         101       90.8%   0.0%   21.0%

Combining Classifiers. If we have two different classifiers f and g that both have a low probability of false positives, we may combine them to get a classifier with higher precision: classify message m as spam if f or g classifies it as spam. Denote the resulting classifier f ∨ g.

  Algorithm            N_{L→S}   N_{S→L}   P       F_L    F_S
  N.B. ∨ SVM (s.m.)    0         61        94.4%   0.0%   12.7%
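The f ∨ g construction can be sketched with stand-in classifiers; the word-based rules below are invented placeholders, not the actual Naïve Bayes and SVM filters:

```python
# Sketch of the f-or-g combination: flag a message as spam if either
# base classifier does. The base classifiers are placeholders.
def combine_or(f, g):
    return lambda m: "S" if f(m) == "S" or g(m) == "S" else "L"

f = lambda m: "S" if "buy" in m else "L"
g = lambda m: "S" if "free" in m else "L"
f_or_g = combine_or(f, g)
print(f_or_g("free offer"), f_or_g("buy now"), f_or_g("meeting at noon"))  # S S L
```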

Combining Classifiers. If we add to f and g another classifier h with high precision, we may use it to make f ∨ g even safer: if f(m) = g(m), classify message m as f(m); otherwise (if f and g give different answers), consult h instead of blindly marking m as spam. In other words, classify m into the class proposed by at least 2 of the three classifiers. Denote this classifier (f ∧ g) ∨ (g ∧ h) ∨ (f ∧ h).

  Algorithm   N_{L→S}   N_{S→L}   P       F_L    F_S
  2-of-3      0         62        94.4%   0.0%   12.9%
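The 2-of-3 vote can be sketched the same way, again with invented stand-in classifiers:

```python
# Sketch of the 2-of-3 rule: assign the class proposed by at least two
# of the three classifiers; h effectively breaks ties between f and g.
def two_of_three(f, g, h):
    def vote(m):
        a, b, c = f(m), g(m), h(m)
        # with two classes, a matching b or c means a has the majority;
        # otherwise b and c must agree with each other
        return a if a in (b, c) else b
    return vote

f = lambda m: "S" if "buy" in m else "L"
g = lambda m: "S" if "free" in m else "L"
h = lambda m: "S" if "now" in m else "L"
combined = two_of_three(f, g, h)
print(combined("buy now"), combined("free lunch today"))  # S L
```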

Questions?