Active Learning with Boosting for Spam Detection



Nikhila Arkalgud
Last update: March 22, 2008

Outline
1 Spam Filters
2 Active Learning and Boosting
3 Algorithm
4 Sampling Methods
5 Weak Learner
6 Performance Analysis
7 Future Work
8 Conclusions
9 References


Spam Filters

Spam Filtering [figure]


Active Learning and Boosting

What is Active Learning?
Given data $X_1, \ldots, X_n$ ($n$ = number of examples) and labels $Y_1, \ldots, Y_t$ ($t$ = number of labels), with $t \ll n$: how do we build a good classifier?

Boosting
Given data $\langle X_1, Y_1 \rangle, \ldots, \langle X_n, Y_n \rangle$:
- A weak learner that does slightly better than a random classifier (that is, error $\epsilon < 0.5$) builds a set of hypotheses $h_1, \ldots, h_t$ over $t$ trials, and a confidence $\alpha_t$ is assigned to each hypothesis.
- After $T$ trials, a final strong classifier is constructed using a weighted majority vote of the obtained $T$ hypotheses.


Algorithm

Active Learning using confidence-based data sampling
Given data $S$, with labeled set $S_t$ and unlabeled set $S_u$, repeat:
1. Train a classifier using the current training data $S_t$.
2. Predict on $S_u$ using this classifier.
3. Compute confidence scores on $S_u$ and sort the scores.
4. Label the $k$ lowest-scored examples; call the new labeled set $S_i$.
5. Set $S_t = S_t \cup S_i$ and $S_u = S_u \setminus S_i$.
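As a minimal sketch, this loop can be written generically over any trainer and scorer; the function names and the callable interface below are my own, not from the slides:

```python
import numpy as np

def active_learn(train_fn, score_fn, X, y, labeled, k, iterations):
    """Confidence-based active learning loop sketched above.
    train_fn(X, y) -> model; score_fn(model, X) -> confidence per row.
    `labeled` holds the indices of S_t; the remaining rows form S_u."""
    unlabeled = np.setdiff1d(np.arange(len(X)), labeled)
    for _ in range(iterations):
        model = train_fn(X[labeled], y[labeled])      # train on S_t
        scores = score_fn(model, X[unlabeled])        # score S_u
        query = unlabeled[np.argsort(scores)[:k]]     # k lowest scores = S_i
        # An oracle would label `query` here; in this sketch y is known.
        labeled = np.concatenate([labeled, query])    # S_t = S_t U S_i
        unlabeled = np.setdiff1d(unlabeled, query)    # S_u = S_u \ S_i
    return model, labeled
```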

AdaBoost algorithm
Given $(x_1, y_1), \ldots, (x_n, y_n) \in S_t$, where $y_i \in \{0, 1\}$.
Initialize weights $W_{1,1}, \ldots, W_{1,n} = 1/n$, where $n$ is the number of training examples.

For $t = 1$ to $T$:
1. Normalize the weights: $W_{t,i} \leftarrow W_{t,i} / \sum_{i'} W_{t,i'}$.
2. For each feature $j$, train a classifier $h_j$ and compute its error $\varepsilon_j = \sum_i W_i\, |h_j(x_i) - y_i|$.
3. Choose the classifier $h_t$ with the lowest error $\varepsilon_t$.
4. Update the weights: $W_{t+1,i} = W_{t,i}\, \beta_t^{1 - e_i}$, where $e_i = 0$ if example $i$ is classified correctly, $e_i = 1$ otherwise, and $\beta_t = \dfrac{\varepsilon_t}{1 - \varepsilon_t}$.
5. Compute $\alpha_t = \log(1/\beta_t)$.

Final output: the strong classifier
$$h(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise.} \end{cases}$$
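The loop above maps directly to code. Below is a hedged sketch in Python of this AdaBoost variant with the single-feature weak learners defined in the Weak Learner section; names are illustrative, and the small clamp on the error is my own guard, not from the slides:

```python
import numpy as np

def adaboost(X, y, T=35):
    """Sketch of the slides' AdaBoost: per-example weights, one
    single-feature stump chosen per round, multiplicative weight update.
    X: (n, f) feature matrix; y: labels in {0, 1}."""
    n, f = X.shape
    W = np.full(n, 1.0 / n)                   # per-example weights
    hyps = []                                 # (feature j, polarity p, theta, alpha)
    for _ in range(T):
        W /= W.sum()                          # normalize the weights
        best = None
        for j in range(f):                    # train a stump per feature
            for p in (1, -1):                 # polarity p_j
                for theta in (0.5, -0.5):     # threshold theta_j
                    h = (p * X[:, j] < p * theta).astype(int)
                    eps = np.sum(W * np.abs(h - y))   # weighted error
                    if best is None or eps < best[0]:
                        best = (eps, j, p, theta, h)
        eps, j, p, theta, h = best
        eps = np.clip(eps, 1e-10, 1 - 1e-10)  # guard beta against eps = 0 or 1
        beta = eps / (1.0 - eps)
        W *= beta ** (1.0 - np.abs(h - y))    # shrink weights of correct examples
        hyps.append((j, p, theta, np.log(1.0 / beta)))
    return hyps

def strong_classify(hyps, X):
    """Weighted majority vote: 1 iff sum(alpha*h_t(x)) >= 0.5*sum(alpha)."""
    total = sum(a * (p * X[:, j] < p * t).astype(float) for j, p, t, a in hyps)
    return (total >= 0.5 * sum(a for *_, a in hyps)).astype(int)
```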


Sampling Methods

Confidence-based sampling
- Compute confidence scores on $S_u$.
- Sort the scores.
- Label the $k$ lowest-scored examples.
These $k$ examples are the ones closest to the classifier hyperplane.

Committee-based sampling
Boosting is inherently a committee-based decision maker: the final strong classifier is $h(x) = 1$ if $\sum_{t=1}^{T} \alpha_t h_t(x) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t$ and $0$ otherwise.
Note that not all hypotheses are equally weighted. The final confidence scores are low for examples on which multiple hypotheses disagree.

Scoring function
$$\text{score}(x_i) = \frac{\left| \sum_{t=1}^{T} \alpha_t\, h'_t(x_i) \right|}{\sum_{t=1}^{T} \alpha_t}, \qquad \text{where } h'_t(x_i) = \begin{cases} -1 & \text{if } h_t(x_i) = 0 \\ 1 & \text{if } h_t(x_i) = 1. \end{cases}$$
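As a hedged sketch, this score can be computed from the `hyps` list returned by the `adaboost` sketch above; the absolute value reflects that only the magnitude of the margin matters for confidence, consistent with the committee-disagreement reading:

```python
import numpy as np

def confidence_score(hyps, X):
    """Normalized |margin| of the boosted committee: weak outputs are
    mapped from {0, 1} to {-1, +1}, alpha-weighted, summed, and divided
    by the total alpha mass. Scores near 0 mean the committee disagrees."""
    signed = sum(a * (2.0 * (p * X[:, j] < p * t).astype(float) - 1.0)
                 for j, p, t, a in hyps)
    return np.abs(signed) / sum(a for *_, a in hyps)
```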


Weak Learner

Visualization of the data [figure]

Single-feature weak learner
$$h_j(x) = \begin{cases} 1 & \text{if } p_j f_j(x) < p_j \theta_j \\ 0 & \text{otherwise,} \end{cases}$$
where the polarity $p_j \in \{+1, -1\}$ and the threshold $\theta_j \in \{0.5, -0.5\}$.
Error: $\varepsilon_j = \sum_i W_i\, |h_j(x_i) - y_i|$.
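In code, the stump and its weighted error are one-liners; a minimal sketch with illustrative names:

```python
import numpy as np

def stump_predict(x_col, p, theta):
    """h_j(x) = 1 if p * f_j(x) < p * theta, else 0; the polarity p in
    {+1, -1} flips the direction of the inequality."""
    return (p * x_col < p * theta).astype(int)

def weighted_error(W, h, y):
    """eps_j = sum_i W_i * |h_j(x_i) - y_i| (weighted misclassification)."""
    return np.sum(W * np.abs(h - y))
```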


Performance Analysis

Testing and Analysis
- Used the SPAM data set provided in the class: 2000 examples, with 2000 features per example.
- Restricted the total number of labeled examples used in training to 250 of the 2000.
- Started with $S_t$ = 50 labeled examples.
- Labeled $k$ = 20 hard examples in each iteration.
- Ran 10 active learning iterations in total.
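Tying the pieces together, a hedged sketch of this protocol reusing the `active_learn`, `adaboost`, and `confidence_score` sketches above; the random arrays are stand-ins for the class SPAM data set, which is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((2000, 2000))            # stand-in for the 2000x2000 SPAM data
y = rng.integers(0, 2, size=2000)       # stand-in for the {0, 1} labels

seed = rng.choice(2000, size=50, replace=False)   # S_t starts with 50 labels
model, labeled = active_learn(
    train_fn=lambda Xl, yl: adaboost(Xl, yl, T=35),
    score_fn=confidence_score,
    X=X, y=y, labeled=seed,
    k=20, iterations=10,                # 50 + 10 * 20 = 250 labels in total
)
```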

Does active learning using confidence-based label sampling work?
- Do we see an improvement in the true prediction rate?
- Do we see a decrease in the false prediction rate?

TPR and FPR of the training set and test set [figure]

Confidence-based sampling vs. random sampling
Does it do better than random sampling? What are we measuring:
- True positive rate
- True prediction rate
- Misclassification rate

True positive rate [figure]

True prediction rate [figure]

Misclassification rate [figure]

Effect of boosting on active learning

AdaBoost performance on training data [figure]

True positive rate [figure]

False positive rate [figure]

AdaBoost training margin [figure]

Comparison of the AdaBoost algorithm with AdaBoost-ρ [figure]


Future Work
1 Implement other, more sophisticated boosting algorithms.
2 Compare active learning with boosting against active learning using SVMs.
3 Implement other types of weak learners.
4 Try to come up with an adaptive sampling technique for labeling.


Conclusions
- An 86% accuracy level was achieved while restricting the labeled training data to 10%.
- Active learning with confidence-based sampling performed much better than random sampling.
- Building a classifier using a weighted average of single-feature hypotheses performed much better than training on the best single feature.
- AdaBoost on this SPAM data set needs around 35 boosting iterations to classify the training data perfectly; the margin of the training data also converges after 35 iterations.
- Constraining the margin using AdaBoost-ρ did not improve the test error.
- More tests need to be performed to analyze the performance of soft-margin-based boosting for active learning.
- Boosting should be compared, as a classifier, against other classifiers such as SVMs that are commonly used for active learning.


References

Y. Abramson and Y. Freund. Active learning for visual object recognition. UCSD Report, 1, 2006.

Y. Freund and R.E. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, 1999.

D.Z. Hakkani-Tur, R.E. Schapire, and G. Tur. Active learning for spoken language understanding. US Patent 7,263,486, August 28, 2007.

G. Rätsch and M.K. Warmuth. Efficient Margin Maximizing with Boosting. The Journal of Machine Learning Research, 6:2131-2152, 2005.

R.E. Schapire. A brief introduction to boosting. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 2:1401-1406, 1999.

D. Sculley. Online Active Learning Methods for Fast Label-Efficient Spam Filtering.

P. Viola and M. Jones. Robust real-time object detection. International Journal of Computer Vision, 1(2), 2002.

M.K. Warmuth, K. Glocer, and G. Rätsch. Boosting Algorithms for Maximizing the Soft Margin.