
On the Inverse Classification Problem and its Applications. Based on Aggarwal, Chen & Han (2006) and Lowd & Meek (2005). Presented by Tianyu Cao

Outline. Section 1: Application Background; Problem Definition; Gini Index and Support Threshold; Inverted List & Inverse Classification Algorithm; Experimental Results; Conclusions & Discussions. Section 2: A Naive Idea for Inverse Classification. Section 3: Implications for Security Issues on Machine Learning

Section 1 Based on the paper On the Inverse Classification Problem and its Applications.

Application Background. Action-driven applications, in which the features can be used to define certain actions. How do we select feature values to drive the decision support system towards a desired end result? Examples: mass marketing, mail spammers, and possible attacks on a general classification system.

Problem Definition. Given a training data set D that contains N records X1, X2, ..., XN. The test dataset contains M records Y1, Y2, ..., YM. Each record Yi is not completely defined, and each record Yi is associated with a desired class label Ci. The inverse classification problem is defined as how to fill the unknown features fj for each record Yi such that the probability of Yi being classified into Ci is maximized.

Example

Customer ID  Decision variable 1  Decision variable 2  Response
0001         ?                    ?                    Yes
0002         ?                    ?                    Yes
...          ?                    ?                    Yes

How do we fill the missing values such that the customer will most likely respond to the mailer?

Ways to Fill the Unknown Features. Assume the missing features are i1, i2, ..., iq, and the numbers of possible values for these features are v_{i1}, v_{i2}, ..., v_{iq}. The number of ways to fill the missing values is the product ∏_{j=1}^{q} v_{i_j}.
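As a quick sanity check, the count of candidate fillings is just the product of the domain sizes (the sizes below are made up for illustration):

```python
from math import prod

# Hypothetical example: three missing features with 2, 3, and 2 possible
# values each give 2 * 3 * 2 = 12 candidate ways to fill them.
domain_sizes = [2, 3, 2]
n_fillings = prod(domain_sizes)
print(n_fillings)  # 12
```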

How to Search Efficiently in this Space. Some feature value combinations are not good and thus should be pruned out. Two pruning strategies: the Gini index and a support threshold. The Gini index is used to rank the candidate combinations.

Gini Index. Some notation. Let L(i,q) be the set of points in the training set such that the value of the ith feature is the qth possible value of that feature. Let f(i,q,s) be the fraction of points in the set L(i,q) whose class label is s.

Gini Index. The Gini index is defined as follows: G(L(i,q)) = Σ_{s=1}^{k} f(i,q,s)^2. The least discriminative case: each f(i,q,s) = 1/k, giving G = 1/k. The most discriminative case: one f(i,q,s) is 1 and all the others are 0, giving G = 1.
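The definition above can be sketched in a few lines of Python (gini_index is a hypothetical helper name; the label lists below mirror the two extreme cases on the slide):

```python
from collections import Counter

def gini_index(labels):
    """Gini index of a set of class labels: the sum over classes s of
    f(i,q,s)^2, where f is the fraction of points with label s.

    Higher is more discriminative: 1/k for a uniform mix over k classes,
    1.0 when every point carries the same label.
    """
    n = len(labels)
    counts = Counter(labels)
    return sum((c / n) ** 2 for c in counts.values())

# Least discriminative: two classes, half and half -> 0.5
print(gini_index(["No", "Yes", "No", "Yes"]))  # 0.5
# Most discriminative: a single class -> 1.0
print(gini_index(["Yes", "Yes"]))  # 1.0
```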

Example

Customer ID  Decision variable 1  Decision variable 2  Decision variable 3  Response
0001         1                    0                    0                    No
0002         1                    0                    1                    Yes
0003         1                    1                    0                    No
0004         1                    1                    1                    Yes
0005         0                    0                    0                    No
0006         0                    0                    1                    No
0007         0                    1                    0                    No
0008         0                    1                    1                    Yes

Example. L(1,1) is {0001, 0002, 0003, 0004}, so the Gini index of L(1,1) is (1/2)^2 + (1/2)^2 = 1/2, which is the least discriminative case. Consider the combination where feature 1 is 1 and feature 3 is 1. The set of points is {0002, 0004}, and the Gini index is 1^2 + 0^2 = 1, which is the most discriminative case.

Support Threshold. The number of points in the set L(i,q) must be greater than the support threshold. In the previous example, suppose the support threshold is 4: even though the combination of feature 1 = 1 and feature 3 = 1 satisfies the Gini index requirement, its two matching points do not satisfy the support threshold.

Inverted List. An efficient data structure that can be used to calculate the Gini index and support for a given combination of feature values. It is similar to the inverted index used in information retrieval.

Inverted List

Feature  Value  Record IDs
1        0      5, 6, 7, 8
1        1      1, 2, 3, 4
2        0      1, 2, 5, 6
2        1      3, 4, 7, 8
3        0      1, 3, 5, 7
3        1      2, 4, 6, 8

Calculating Combinations of L(i,q). To calculate the set of points where feature 1 is 1 and feature 2 is 0, we simply merge the inverted lists L(1,1) and L(2,0); this corresponds to the intersection L(1,1) ∩ L(2,0). For the convenience of calculating the Gini index, the class label is also stored with each entry.
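A minimal sketch of the merge, using the inverted lists from the eight-record example (points_matching is a hypothetical helper name):

```python
# Inverted lists for the eight-record example: feature -> value -> record IDs.
inverted = {
    1: {0: {5, 6, 7, 8}, 1: {1, 2, 3, 4}},
    2: {0: {1, 2, 5, 6}, 1: {3, 4, 7, 8}},
    3: {0: {1, 3, 5, 7}, 1: {2, 4, 6, 8}},
}

def points_matching(conditions):
    """Intersect inverted lists: records where feature f = v for every (f, v)."""
    sets = [inverted[f][v] for f, v in conditions]
    result = sets[0]
    for s in sets[1:]:
        result = result & s
    return result

print(points_matching([(1, 1), (2, 0)]))  # {1, 2} = L(1,1) ∩ L(2,0)
```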

The Inverse Classification Algorithm. Step 1: Compute all possible singleton sets L1 for test example T with support at least s and Gini index at least a, and prune out those singletons whose dominant class is not the target class. Set k = 1. Step 2: Join Lk with L1 to form Ck+1; prune Ck+1 according to the Gini index and support threshold. Step 3: Let Lk+1 be the remaining candidates in Ck+1; if Lk+1 is not empty, go to step 2.

The Inverse Classification Algorithm. Step 4: Let LF be the union of all Lk. Step 5: While there are missing values in T, go to step 6; otherwise terminate. Step 6: Pick the feature value combination H with the highest Gini index from LF, fill the missing values in T using the corresponding values in H, and remove any sets in LF which are inconsistent with the filled values in T. Go to step 5.
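Steps 1 through 6 can be sketched end to end on the eight-record example. This is an illustrative simplification, not the paper's implementation: the names TRAIN, stats, and inverse_classify are made up, and dominant-class ties are broken arbitrarily.

```python
from collections import Counter

# The eight training records from the example slide: (features, class label).
TRAIN = [
    ({1: 1, 2: 0, 3: 0}, "No"),  ({1: 1, 2: 0, 3: 1}, "Yes"),
    ({1: 1, 2: 1, 3: 0}, "No"),  ({1: 1, 2: 1, 3: 1}, "Yes"),
    ({1: 0, 2: 0, 3: 0}, "No"),  ({1: 0, 2: 0, 3: 1}, "No"),
    ({1: 0, 2: 1, 3: 0}, "No"),  ({1: 0, 2: 1, 3: 1}, "Yes"),
]

def stats(combo):
    """Support, Gini index, and dominant class of a feature-value combo."""
    labels = [c for x, c in TRAIN if all(x[f] == v for f, v in combo)]
    if not labels:
        return 0, 0.0, None
    n, counts = len(labels), Counter(labels)
    gini = sum((m / n) ** 2 for m in counts.values())
    return n, gini, counts.most_common(1)[0][0]

def inverse_classify(missing_domains, target, min_support, min_gini):
    def ok(combo):
        n, g, dom = stats(combo)
        return n >= min_support and g >= min_gini and dom == target

    # Step 1: qualifying singletons whose dominant class is the target.
    L1 = [((f, v),) for f, vals in missing_domains.items() for v in vals
          if ok(((f, v),))]
    # Steps 2-3: join level k with the singletons, prune, repeat.
    LF, Lk = list(L1), L1
    while Lk:
        nxt = []
        for combo in Lk:
            feats = {f for f, _ in combo}
            for ((f, v),) in L1:
                if f not in feats:
                    cand = tuple(sorted(combo + ((f, v),)))
                    if cand not in nxt and ok(cand):
                        nxt.append(cand)
        Lk = nxt
        LF += nxt
    # Steps 4-6: greedily fill from the highest-Gini combination, dropping
    # sets that become inconsistent with (or redundant to) the filled values.
    filled = {}
    while LF and len(filled) < len(missing_domains):
        best = max(LF, key=lambda c: stats(c)[1])
        filled.update(dict(best))
        LF = [c for c in LF
              if all(filled.get(f, v) == v for f, v in c)
              and any(f not in filled for f, _ in c)]
    return filled

# Fill features 2 and 3 of a test record so it is classified "Yes":
# only feature 3 = 1 qualifies, matching the discriminative combination above.
print(inverse_classify({2: [0, 1], 3: [0, 1]}, "Yes", 2, 0.5))
```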

Experimental Considerations and Settings. There are no benchmark datasets for this problem, so UCI datasets were used, dividing each dataset into three parts: two parts are used for training and the remaining part is used for testing. Some features in the testing set are set to be missing, and the inverse classification algorithm is used to fill in the missing values. In this case the desired class label is the test instance's actual class label.

Experimental Results (charts omitted)

Summary of Section 1. The accuracy before substitution of the missing attributes decreases with an increasing number of missing attributes. After substitution, the accuracy of the classifier showed a considerable boost over the original dataset. In general the default Gini index in the training set served as an effective default value, and the threshold value could be defined well by the average number of data points per nominal attribute. The algorithm is scalable in the dataset size and the number of missing attributes.

Discussions. The inverse classification algorithm does not make use of any classification models. Why not? A relaxation of the problem itself: the inverse classification problem is defined as how to fill the unknown features fj for each record Yi such that the probability of Yi being classified into Ci is maximized.

Discussions. Sometimes it is enough to fill the missing values so that the data point Yi is likely to be classified into the target class Ci, meaning that the probability of Yi being classified into Ci is reasonably high. Combining these two ideas, we can get a somewhat naive solution to the inverse classification problem.

Section 2 Based on the presenter's thoughts.

A Naive Idea. Step 1: Build a classifier C from the training data. Step 2: Consider the missing attributes of a test instance T as a subspace, and uniformly sample a random point Pi from this subspace. Step 3: Fill Pi into the missing attributes of the test instance T and send the test instance to the classifier C. If C classifies T into the desired class Ci, the process ends; otherwise go to step 2.
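The three steps amount to rejection sampling. A minimal sketch, assuming the classifier is exposed as a plain callable and using a toy model to stand in for C:

```python
import random

def naive_inverse_classify(classifier, known, missing_domains, target,
                           max_tries=500):
    """Rejection-sampling sketch: resample the missing attributes uniformly
    until the classifier assigns the desired class, or give up after
    max_tries samples."""
    for _ in range(max_tries):
        candidate = dict(known)
        for f, values in missing_domains.items():
            candidate[f] = random.choice(values)
        if classifier(candidate) == target:
            return candidate
    return None

# Toy classifier standing in for C: predicts "Yes" iff feature 3 equals 1.
clf = lambda x: "Yes" if x[3] == 1 else "No"
filled = naive_inverse_classify(clf, {1: 1}, {2: [0, 1], 3: [0, 1]}, "Yes")
print(filled)
```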

A Naive Idea. As long as the classifier C fits the training data set reasonably well, we should expect that the instance T with the missing values filled is likely to be classified into the target class by any classifier that fits the training data reasonably well.

A Naive Idea. Second problem: how many points do we need to sample in order to find one that will be classified into the target class by the classifier C? It turns out we only need to sample a few points.

A Naive Idea. Assume we sample n points in this subspace and we want a blue point (one classified into the target class). The probability that we get at least one blue point is 1 − (P_red)^n. This probability approaches 1 exponentially fast as long as the fraction of blue points is not 0. Consider an extreme case where the fraction of blue points is 1% and the fraction of red points is 99%: even then, the probability of getting a blue point when sampling 500 points is higher than 99%.
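The arithmetic for the extreme case checks out directly:

```python
# With 1% "blue" points, the chance of missing them in all n = 500 uniform
# samples is 0.99**500, so the success probability is:
p = 1 - 0.99 ** 500
print(round(p, 4))  # 0.9934 — above 99% even in this extreme case
```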

A Naive Idea. There are problems with the naive solution even for the relaxed inverse classification problem. It is not always possible to find a classifier that fits the training data well. And how do we justify that the point we find is likely to be classified into the desired target class? (We cannot use the same classifier to test this.)

Section 3 Based on the slides from Good Word Attacks on Statistical Spam Filters.

Implications for Security Issues. As such problems arise, they pose a potential threat to machine learning systems; adversarial learning deals with such issues. Consider a content-based spam filtering system: the success of inverse classification implies that a spammer can manipulate a spam mail so that it passes the spam filter.

Content-based Spam Filtering. 1. Message from spammer@example.com: "Cheap mortgage now!!!" 2. Feature weights: cheap = 1.0, mortgage = 1.5. 3. Total score = 2.5 > 1.0 (threshold), so the message is classified as spam.

Content-based Spam Filtering. 1. Message from spammer@example.com: "Cheap mortgage now!!! Corvallis OSU" 2. Feature weights: cheap = 1.0, mortgage = 1.5, Corvallis = −1.0, OSU = −1.0. 3. Total score = 0.5 < 1.0 (threshold), so the message is classified as OK.
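The two slides describe a linear score over word weights. A minimal sketch, with the weights assumed from the slides (the negative weights for the good words are inferred from the score dropping from 2.5 to 0.5):

```python
# Assumed feature weights from the slides; negative weights are good words.
WEIGHTS = {"cheap": 1.0, "mortgage": 1.5, "corvallis": -1.0, "osu": -1.0}
THRESHOLD = 1.0

def score(message):
    """Sum the weights of the known words that appear in the message."""
    words = set(message.lower().split())
    return sum(w for word, w in WEIGHTS.items() if word in words)

def is_spam(message):
    return score(message) > THRESHOLD

print(is_spam("Cheap mortgage now!!!"))                # True  (score 2.5)
print(is_spam("Cheap mortgage now!!! Corvallis OSU"))  # False (score 0.5)
```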

Implications for Security Issues. This is generally difficult because the spammer does not know the features the spam filter uses, and the spammer does not have the necessary training data. But assume the spammer somehow has both of these: how well can he do?

Good Word Attacks in Practice. Passive attacks. Active attacks. Experimental results.

Attacking Spam Filters. Can we efficiently find a list of good words? Types of attacks: passive attacks (no filter access) and active attacks (test emails allowed). Metrics: the expected number of words required to get the median (blocked) spam past the filter, and the number of query messages sent.

Filter Configuration. Models used: Naïve Bayes (generative) and Maximum Entropy (Maxent, discriminative). Training: 500,000 messages from the Hotmail feedback loop, with 276,000 features. Maxent let 30% less spam through.

Passive Attacks. Heuristics: select random dictionary words (Dictionary); select the most frequent English words (Freq. Word); select the words with the highest ratio of English frequency to spam frequency (Freq. Ratio). Spam corpus: spamarchive.org. English corpora: Reuters news articles, written English, spoken English, 1992 USENET.
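The Freq. Ratio heuristic can be sketched as follows; the add-one smoothing and the toy corpora are assumptions for illustration, not the paper's exact estimator:

```python
from collections import Counter

def freq_ratio_words(english_tokens, spam_tokens, n):
    """Rank words seen in the English corpus by English frequency divided
    by spam frequency (add-one smoothing on the spam count is assumed),
    and return the top n as candidate good words."""
    eng, spm = Counter(english_tokens), Counter(spam_tokens)
    return sorted(eng, key=lambda w: eng[w] / (spm[w] + 1), reverse=True)[:n]

# Toy corpora: "meeting" appears in spam too, so it ranks below words
# that never occur in spam.
english = "the quarterly meeting minutes were circulated".split()
spam = "cheap cheap mortgage now meeting".split()
print(freq_ratio_words(english, spam, 3))
```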

Passive Attack Results (chart omitted)

Active Attacks. Learn which words are best by sending test messages (queries) through the filter. First-N: find n good words using as few queries as possible. Best-N: find the best n words.

First-N Attack. Step 1: Find a barely-spam message. (Diagram: messages ordered from the original legitimate message, "Hi, mom!", through barely legitimate and barely spam, to the original spam, "Cheap mortgage now!!!", with the spam threshold between the barely legitimate and barely spam messages.)

First-N Attack. Step 2: Test each word by adding it to the barely-spam message. (Diagram: good words push the barely-spam message below the threshold into the legitimate region; less good words leave it in the spam region.)
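The two steps can be sketched against a toy linear filter; the weights and the first_n_attack interface are assumptions for illustration:

```python
# Toy filter standing in for the black box: spam iff the total weight of
# the message's distinct words exceeds 1.0 (assumed weights).
W = {"cheap": 1.0, "mortgage": 1.5, "corvallis": -1.0, "osu": -1.0}
spammy = lambda m: sum(W.get(t, 0.0) for t in set(m.lower().split())) > 1.0

def first_n_attack(is_spam, barely_spam, vocabulary, n):
    """First-N sketch: one query per candidate word, keeping the first n
    words that flip the barely-spam message past the threshold."""
    good = []
    for word in vocabulary:
        if not is_spam(barely_spam + " " + word):
            good.append(word)
        if len(good) == n:
            break
    return good

# "mortgage" alone is barely spam (score 1.5 vs. the 1.0 threshold);
# "osu" is the first word that flips it to legitimate.
print(first_n_attack(spammy, "mortgage", ["viagra", "osu", "corvallis"], 1))
```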

Best-N Attack. Key idea: use spammy words to sort the good words. (Diagram: words ordered from better to worse across the legitimate/spam threshold.)

Active Attack Results (n = 100). Best-N was twice as effective as First-N. Maxent was more vulnerable to active attacks. Active attacks were much more effective than passive attacks.

References
1. Charu C. Aggarwal, Chen Chen, Jiawei Han. On the Inverse Classification Problem and its Applications. ICDE 2006.
2. Daniel Lowd, Christopher Meek. Good Word Attacks on Statistical Spam Filters. In Proceedings of the 2nd Conference on Email and Anti-Spam, 2005.

Thanks