Active Learning with Boosting for Spam Detection
Nikhila Arkalgud
Last update: March 22, 2008
Outline
1 Spam Filters
2 Active Learning and Boosting
3 Algorithm
4 Sampling Methods
5 Weak Learner
6 Performance Analysis
7 Future Work
8 Conclusions
9 References
Spam Filters
Spam Filtering
Active Learning and Boosting
What is Active Learning?
Given data X_1, ..., X_n (n = number of examples)
and labels Y_1, ..., Y_t (t = number of labels),
with t ≪ n:
how do we build a good classifier?
Boosting
Given data <X_1, Y_1>, ..., <X_n, Y_n> and a weak learner that does slightly better than a random classifier (that is, error ε < 0.5):
Builds a set of hypotheses h_1, ..., h_T over T trials and assigns a confidence α_t to each hypothesis h_t
After T trials, a final strong classifier is constructed as a weighted majority vote of the T hypotheses obtained
Algorithm
Active learning using confidence-based data sampling
Given data S, with labeled subset S_t and unlabeled subset S_u, repeat:
1 Train a classifier on the current labeled data S_t
2 Predict on S_u using this classifier
3 Compute confidence scores on S_u
4 Sort the scores
5 Label the k lowest-scored examples; call the newly labeled set S_i
6 Set S_t = S_t ∪ S_i and S_u = S_u \ S_i
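The loop above can be sketched in Python. Here `train_fn` and `score_fn` are hypothetical placeholders standing in for the boosted classifier and its confidence-scoring function (defined on later slides); they are not names from the slides themselves.

```python
import numpy as np

def active_learning_loop(X, y_all, initial_labeled, k, iterations, train_fn, score_fn):
    """Confidence-based active learning: repeatedly label the k least-confident examples.

    train_fn(X_l, y_l) -> model; score_fn(model, X_u) -> one confidence per row of X_u.
    y_all is used here as an oracle: querying a label just reads the known answer.
    """
    labeled = set(initial_labeled)
    for _ in range(iterations):
        idx_l = sorted(labeled)
        model = train_fn(X[idx_l], y_all[idx_l])          # S_t: current labeled set
        idx_u = [i for i in range(len(X)) if i not in labeled]  # S_u: unlabeled pool
        if not idx_u:
            break
        conf = score_fn(model, X[idx_u])
        # The lowest-scored examples lie closest to the decision boundary
        hardest = np.argsort(conf)[:k]
        labeled.update(idx_u[j] for j in hardest)          # S_t <- S_t U S_i
    return sorted(labeled)
```

With the deck's settings (50 seed labels, k = 20, 10 iterations) this uses 250 labels in total.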
AdaBoost algorithm
Given (x_1, y_1), ..., (x_n, y_n) ∈ S_t, where y_i ∈ {0, 1}
Initialize example weights W_1, ..., W_n = 1/n, n = number of training examples
for t = 1 to T do
    Normalize the weights: W_i ← W_i / Σ_i W_i
    For each feature j, train a classifier h_j
    Compute its error: ε_j = Σ_i W_i |h_j(x_i) − y_i|
    Choose the classifier h_t with the lowest error ε_t
    Update the weights: W_{t+1,i} = W_{t,i} β_t^{1−e_i}
        where e_i = 0 if x_i is classified correctly, 1 otherwise,
        and β_t = ε_t / (1 − ε_t)
    Compute α_t = log(1/β_t)
Final output, strong classifier:
h(x) = 1 if Σ_{t=1}^{T} α_t h_t(x) ≥ (1/2) Σ_{t=1}^{T} α_t, and 0 otherwise
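A minimal Python sketch of this AdaBoost variant, assuming the weak-hypothesis outputs have been precomputed as a 0/1 matrix (the function names and the precomputed-matrix interface are illustrative choices, not from the slides):

```python
import numpy as np

def adaboost(H, y, T):
    """Discrete AdaBoost over precomputed weak-hypothesis outputs.

    H: (n_hyps, n_examples) array of 0/1 predictions; y: 0/1 labels.
    Implements the slides' update W <- W * beta^(1-e) with beta = eps/(1-eps).
    Returns the confidences alpha_t and the indices of the chosen hypotheses.
    """
    n = len(y)
    W = np.full(n, 1.0 / n)                       # uniform example weights
    alphas, chosen = [], []
    for _ in range(T):
        W = W / W.sum()                           # normalize
        errs = (W * np.abs(H - y)).sum(axis=1)    # weighted error of each hypothesis
        t = int(np.argmin(errs))
        eps = errs[t]
        if eps == 0 or eps >= 0.5:                # weak-learner assumption violated
            break
        beta = eps / (1 - eps)
        e = np.abs(H[t] - y)                      # 0 if correct, 1 if wrong
        W = W * beta ** (1 - e)                   # down-weight correct examples
        alphas.append(np.log(1 / beta))
        chosen.append(t)
    return np.array(alphas), chosen

def predict(H, alphas, chosen):
    """Weighted majority vote: 1 iff sum(alpha*h) >= 0.5 * sum(alpha)."""
    votes = sum(a * H[t] for a, t in zip(alphas, chosen))
    return (votes >= 0.5 * np.sum(alphas)).astype(int)
```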
Sampling Methods
Confidence-based sampling
Compute confidence scores on S_u
Sort the scores
Label the k lowest-scored examples
These k examples are the ones closest to the classifier hyperplane.
Committee-Based Sampling
Boosting is inherently a committee-based decision maker:
final strong classifier h(x) = 1 if Σ_{t=1}^{T} α_t h_t(x) ≥ (1/2) Σ_{t=1}^{T} α_t, and 0 otherwise
Note that not all hypotheses are equally weighted
The final confidence scores are low for examples on which multiple hypotheses disagree
Scoring Function
confidence score: score(x_i) = Σ_{t=1}^{T} α_t h'_t(x_i) / Σ_{t=1}^{T} α_t
where h'_t(x_i) = −1 if h_t(x_i) = 0, and +1 if h_t(x_i) = 1
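The scoring function can be sketched as below; the slides sort by "lowest score", which here is taken as the magnitude of the signed committee vote (near 0 means the weighted hypotheses disagree), an interpretive assumption on my part:

```python
import numpy as np

def confidence_scores(H, alphas):
    """Committee confidence per example.

    H: (n_hyps, n_examples) 0/1 predictions of the chosen hypotheses;
    alphas: their confidences. Remaps 0/1 -> -1/+1, then returns the
    magnitude of the normalized weighted vote, in [0, 1].
    """
    Hs = 2.0 * np.asarray(H, dtype=float) - 1.0   # h'_t: 0 -> -1, 1 -> +1
    a = np.asarray(alphas, dtype=float)
    return np.abs(a @ Hs) / a.sum()               # low value = committee disagrees
```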
Weak Learner
Visualization of the data
Single Feature Weak Learner
h_j(x) = 1 if p_j f_j(x) < p_j θ_j, 0 otherwise
where p_j ∈ {+1, −1} and θ_j ∈ {+0.5, −0.5}
Error: ε_j = Σ_i W_i |h_j(x_i) − y_i|
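A sketch of this single-feature decision stump and its weighted error (function names are illustrative):

```python
import numpy as np

def stump_predict(x_j, p_j, theta_j):
    """Single-feature weak learner: h_j(x) = 1 iff p_j * f_j(x) < p_j * theta_j.

    x_j: the j-th feature value for each example; p_j in {+1, -1} flips
    the direction of the threshold theta_j.
    """
    return (p_j * x_j < p_j * theta_j).astype(int)

def weighted_error(pred, y, W):
    """eps_j = sum_i W_i |h_j(x_i) - y_i| as on the slides."""
    return float(np.sum(W * np.abs(pred - y)))
```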
Performance Analysis
Testing and Analysis
Used the SPAM data set provided in class: 2000 examples with 2000 features per example
Restricted the total number of labeled examples used in training to 250 of the 2000
Start with S_t = 50 labeled examples
Label k = 20 hard examples in each iteration
Total of 10 active learning iterations
Does active learning using confidence-based label sampling work?
Do we see an improvement in the true prediction rate?
Do we see a decrease in the false prediction rate?
TPR and FPR of the training set and test set
Confidence-based sampling vs. random sampling
Does it do better than random sampling?
What we are measuring: true positive rate, true prediction rate, misclassification rate
True positive rate
True prediction rate
Misclassification rate
Effect of boosting on active learning
AdaBoost performance on training data
True Positive Rate
False Positive Rate
AdaBoost Training Margin
Comparison of the AdaBoost algorithm with AdaBoost-ρ
Future Work
1 Implement other, more sophisticated boosting algorithms
2 Compare active learning with boosting against active learning using SVMs
3 Implement other types of weak learners
4 Develop an adaptive sampling technique for labeling
Conclusions
An 86% accuracy level was achieved while restricting the labeled training data to 10%
Active learning with confidence-based sampling performed much better than random sampling
Building a classifier from a weighted average of single-feature hypotheses performed much better than training on the best single feature alone
AdaBoost on this SPAM data set needs around 35 boosting iterations to build a perfect classifier; the margin of the training data also converges after 35 iterations
Constraining the margin using AdaBoost-ρ did not improve the test error
More tests need to be performed to analyze the performance of soft-margin boosting for active learning
Boosting should be compared as a classifier with other classifiers, such as SVMs, that are commonly used for active learning
References
Y. Abramson and Y. Freund. Active learning for visual object recognition. UCSD Report, 1, 2006.
Y. Freund and R.E. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, 1999.
D.Z. Hakkani-Tur, R.E. Schapire, and G. Tur. Active learning for spoken language understanding. US Patent 7,263,486, August 28, 2007.
G. Rätsch and M.K. Warmuth. Efficient margin maximizing with boosting. Journal of Machine Learning Research, 6:2131-2152, 2005.
R.E. Schapire. A brief introduction to boosting. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 2:1401-1406, 1999.
D. Sculley. Online active learning methods for fast label-efficient spam filtering.
P. Viola and M. Jones. Robust real-time object detection. International Journal of Computer Vision, 1(2), 2002.
M.K. Warmuth, K. Glocer, and G. Rätsch. Boosting algorithms for maximizing the soft margin.