Active Learning with Boosting for Spam Detection

Transcription

1 Active Learning with Boosting for Spam Detection
  Nikhila Arkalgud
  Last update: March 22, 2008

2 Outline
  1. Spam Filters
  2. Active Learning and Boosting
  3. Algorithm
  4. Sampling Methods
  5. Weak Learner
  6. Performance Analysis
  7. Future Work
  8. Conclusions
  9. References

3 Outline: Spam Filters

4 Spam Filters: Spam Filtering

5 Outline: Active Learning and Boosting

6 Active Learning and Boosting: What is Active Learning?
  - Given data $X_1, \dots, X_n$, where $n$ is the number of examples,
  - and labels $Y_1, \dots, Y_t$, where $t$ is the number of labels,
  - with $t \ll n$,
  - how do we build a good classifier?

7 Active Learning and Boosting: Boosting
  - Given data $\langle X_1, Y_1 \rangle, \dots, \langle X_n, Y_n \rangle$.
  - A weak learner, one that does slightly better than a random classifier (error $\varepsilon < 0.5$), builds a set of hypotheses $h_1, \dots, h_T$ over $T$ trials and assigns a confidence $\alpha_t$ to each hypothesis $h_t$.
  - After $T$ trials, a final strong classifier is constructed as a weighted majority vote of the $T$ hypotheses.

8 Outline: Algorithm

9 Algorithm: Active learning using confidence-based data sampling
  Given data $S$, with labeled data set $S_t$ and unlabeled data set $S_u$.
  Repeat:
    1. Train a classifier using the current training data $S_t$.
    2. Predict on $S_u$ using this classifier.
    3. Compute confidence scores on $S_u$.
    4. Sort the scores.
    5. Label the $k$ lowest-scored examples; call the newly labeled set $S_i$.
    6. Set $S_t = S_t \cup S_i$ and $S_u = S_u \setminus S_i$.
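The loop on this slide can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: `active_learning_loop`, `train_fn`, `score_fn`, and the toy one-dimensional classifier are hypothetical names introduced here, the array `oracle_labels` stands in for the human labeler, and the final retraining step is an added assumption so that the returned model has seen every queried label.

```python
import numpy as np


def active_learning_loop(X, oracle_labels, initial_idx, train_fn, score_fn, k, rounds):
    """Pool-based active learning with lowest-confidence sampling.

    oracle_labels stands in for the human labeler: it is only read at
    indices that have been selected for labeling.
    """
    labeled = list(initial_idx)                                      # S_t
    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(X)) if i not in labeled_set]   # S_u
    for _ in range(rounds):
        if not unlabeled:
            break
        model = train_fn(X[labeled], oracle_labels[labeled])
        scores = score_fn(model, X[unlabeled])           # confidence on S_u
        lowest = np.argsort(scores, kind="stable")[:k]   # k lowest scores
        picked = [unlabeled[i] for i in lowest]          # S_i
        labeled.extend(picked)                           # S_t = S_t union S_i
        picked_set = set(picked)
        unlabeled = [i for i in unlabeled if i not in picked_set]  # S_u = S_u minus S_i
    # Assumption: retrain once more so the returned model has seen every label.
    model = train_fn(X[labeled], oracle_labels[labeled])
    return model, labeled


def train_midpoint(X_l, y_l):
    # Toy classifier: threshold halfway between the two class means of feature 0.
    return (X_l[y_l == 1, 0].mean() + X_l[y_l == 0, 0].mean()) / 2.0


def distance_to_threshold(threshold, X_u):
    # Toy confidence: distance of feature 0 from the decision threshold.
    return np.abs(X_u[:, 0] - threshold)


X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = (X[:, 0] > 0.5).astype(int)
model, labeled = active_learning_loop(
    X, y, [0, 19], train_midpoint, distance_to_threshold, k=2, rounds=3
)
print(sorted(labeled))
```

In the toy run, the queried examples cluster around the class boundary, which is the behavior the confidence-based rule is meant to produce.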

10 Algorithm: AdaBoost algorithm
  Given $(x_1, y_1), \dots, (x_n, y_n) \in S_t$, where $y_i \in \{0, 1\}$.
  Initialize weights $W_1, \dots, W_f = 1/f$, where $f$ is the number of features.

11 Algorithm
  For $t = 1$ to $T$ do:
    1. Normalize the weights: $W_i \leftarrow W_i / \sum_k W_k$.
    2. For each feature $j$, train a classifier $h_j$ and compute its error $\varepsilon_j = \sum_i W_i \, |h_j(x_i) - y_i|$.
    3. Choose the classifier $h_t$ with the lowest error $\varepsilon_t$.
    4. Update the weights: $W_{t+1,i} = W_{t,i} \, \beta_t^{1 - e_i}$, where $e_i = 0$ if $x_i$ is classified correctly and $e_i = 1$ otherwise, and $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$.
    5. Compute $\alpha_t = \log(1/\beta_t)$.

12 Algorithm
  Final output, the strong classifier:
  $$h(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise} \end{cases}$$
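The boosting rounds and the weighted vote above can be sketched as follows. This is an illustration under stated assumptions rather than the author's code: the sketch keeps one weight per training example (the weight update and the error sum both index over examples $i$), it uses single-feature threshold classifiers with a fixed threshold of 0.5 and both parities as the weak learners, and the function names and the clipping of the error away from 0 and 1 are introduced here.

```python
import numpy as np


def stump_predict(X, j, p, theta=0.5):
    # h_j(x) = 1 if p * f_j(x) < p * theta, else 0
    return (p * X[:, j] < p * theta).astype(int)


def adaboost_train(X, y, T):
    n, f = X.shape
    w = np.full(n, 1.0 / n)      # assumption: uniform weights, one per example
    hypotheses, alphas = [], []
    for _ in range(T):
        w = w / w.sum()          # normalize the weights
        best = None
        for j in range(f):       # one candidate classifier per feature and parity
            for p in (+1, -1):
                err = np.sum(w * np.abs(stump_predict(X, j, p) - y))
                if best is None or err < best[0]:
                    best = (err, j, p)
        err, j, p = best
        err = min(max(err, 1e-10), 1.0 - 1e-10)   # guard against division by zero
        beta = err / (1.0 - err)
        e = (stump_predict(X, j, p) != y).astype(int)   # 0 if correct, 1 otherwise
        w = w * beta ** (1 - e)  # shrink the weights of correctly classified examples
        hypotheses.append((j, p))
        alphas.append(np.log(1.0 / beta))
    return hypotheses, np.array(alphas)


def adaboost_predict(X, hypotheses, alphas):
    votes = sum(a * stump_predict(X, j, p) for (j, p), a in zip(hypotheses, alphas))
    return (votes >= 0.5 * alphas.sum()).astype(int)


# Toy data: the label is the majority vote of three binary features.
X = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)])
y = (X.sum(axis=1) >= 2).astype(int)
hyps, alphas = adaboost_train(X, y, T=3)
pred = adaboost_predict(X, hyps, alphas)
print((pred == y).mean())
```

On this toy set no single feature classifies more than 6 of the 8 examples correctly, while the weighted vote of three rounds fits all of them.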

13 Outline: Sampling Methods

14 Sampling Methods: Confidence-based sampling
  - Compute confidence scores on $S_u$.
  - Sort the scores.
  - Label the $k$ lowest-scored examples.
  - These $k$ examples are the ones closest to the classifier hyperplane.

15 Sampling Methods: Committee-based sampling
  - Boosting is inherently a committee-based decision maker.
  - Final output, the strong classifier: $h(x) = 1$ if $\sum_{t=1}^{T} \alpha_t h_t(x) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t$, and $0$ otherwise.
  - Note that not all the hypotheses are equally weighted.
  - The final confidence scores are low for examples on which multiple hypotheses disagree.

16 Sampling Methods: Scoring function
  Confidence score:
  $$\mathrm{score}(x_i) = \frac{\left| \sum_{t=1}^{T} \alpha_t h'_t(x_i) \right|}{\sum_{t=1}^{T} \alpha_t}$$
  where
  $$h'_t(x_i) = \begin{cases} -1 & \text{if } h_t(x_i) = 0 \\ 1 & \text{if } h_t(x_i) = 1 \end{cases}$$
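A small sketch of the scoring step, assuming the score is the absolute value of the normalized weighted vote over hypotheses mapped from {0, 1} to {-1, +1}. The absolute value is an assumption made here so that low scores correspond to examples on which the committee disagrees; the function names and the toy numbers are illustrative only.

```python
import numpy as np


def confidence_scores(H, alphas):
    """H: (T, m) array of 0/1 predictions h_t(x_i); alphas: (T,) hypothesis weights."""
    H_signed = 2 * H - 1                      # map 0 -> -1 and 1 -> +1
    return np.abs(alphas @ H_signed) / alphas.sum()


def lowest_k(scores, k):
    # Indices of the k examples with the lowest confidence.
    return np.argsort(scores, kind="stable")[:k]


alphas = np.array([1.0, 1.0, 2.0])
H = np.array([
    [1, 1, 0],   # h_1 on three examples
    [1, 0, 0],   # h_2
    [1, 1, 1],   # h_3
])
scores = confidence_scores(H, alphas)
print(scores, lowest_k(scores, 1))
```

The first example, on which all hypotheses agree, gets the highest score; the third, where the weighted votes cancel, gets the lowest and would be queried first.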

17 Outline: Weak Learner

18 Weak Learner: Visualization of the data

19 Weak Learner: Single-feature weak learner
  $$h_j(x) = \begin{cases} 1 & \text{if } p_j f_j(x) < p_j \theta_j \\ 0 & \text{otherwise} \end{cases}$$
  where $p_j = +1, -1$ and $\theta_j = 0.5, 0.5$.
  Error: $\varepsilon_j = \sum_i W_i \, |h_j(x_i) - y_i|$
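As a hedged illustration of the weak learner and its weighted error, the sketch below evaluates both parities for one feature and keeps the better one. The helper names, the toy data, and the choice of 0.5 as the threshold in the example are assumptions made for illustration.

```python
import numpy as np


def weak_learner(X, j, p, theta):
    # h_j(x) = 1 if p * f_j(x) < p * theta, else 0
    return (p * X[:, j] < p * theta).astype(int)


def weighted_error(pred, y, w):
    # eps_j = sum_i W_i * |h_j(x_i) - y_i|
    return float(np.sum(w * np.abs(pred - y)))


def best_parity(X, y, w, j, theta=0.5):
    # Try p = +1 and p = -1 and return (error, parity) for the better one.
    candidates = [
        (weighted_error(weak_learner(X, j, p, theta), y, w), p) for p in (+1, -1)
    ]
    return min(candidates)


X = np.array([[0.0], [1.0], [1.0], [0.0]])
y = np.array([0, 1, 1, 1])
w = np.full(4, 0.25)
err, parity = best_parity(X, y, w, j=0)
print(err, parity)
```

With parity +1 the classifier fires when the feature is below the threshold, and with parity -1 when it is above, so trying both lets a single feature act as evidence for either class.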

20 Outline: Performance Analysis

21 Performance Analysis: Testing and analysis
  - I used the SPAM data set provided in the class. It has 2000 examples with 2000 features per example.
  - The total number of labeled examples used in training was restricted to 250 out of the 2000 examples.
  - Start with $|S_t| = 50$ labeled examples.
  - Label $k = 20$ hard examples in each iteration.
  - Run 10 active learning iterations in total.
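The labeling budget follows directly from these settings; the short sketch below just tabulates the size of the labeled set after each iteration (the variable names are illustrative).

```python
initial_labeled = 50   # starting size of S_t
k = 20                 # examples labeled per iteration
iterations = 10        # active learning iterations

# Size of the labeled set before the first iteration and after each one.
sizes = [initial_labeled + k * r for r in range(iterations + 1)]
total_labeled = sizes[-1]
print(sizes)
```

The final entry is 50 + 10 × 20 = 250, matching the cap on labeled training examples stated on the slide.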

22 Performance Analysis: Does active learning using confidence-based label sampling work?
  - Do we see an improvement in the true prediction rate?
  - Do we see a decrease in the false prediction rate?

23 Performance Analysis: TPR and FPR of the training set and test set

24 Performance Analysis: Confidence-based sampling vs. random sampling
  - Does it do better than random sampling?
  - What are we measuring:
      - True positive rate
      - True prediction rate
      - Misclassification rate
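For reference, the measured quantities can be computed from 0/1 labels and predictions as sketched below. The slides do not define "true prediction rate"; the sketch assumes it means overall accuracy (one minus the misclassification rate) and also reports the false positive rate used elsewhere in the deck. The function name and the toy labels are illustrative, and the code assumes both classes are present.

```python
import numpy as np


def rates(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # spam predicted as spam
    fn = np.sum((y_true == 1) & (y_pred == 0))   # spam predicted as non-spam
    fp = np.sum((y_true == 0) & (y_pred == 1))   # non-spam predicted as spam
    tn = np.sum((y_true == 0) & (y_pred == 0))   # non-spam predicted as non-spam
    n = len(y_true)
    return {
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "misclassification_rate": (fp + fn) / n,
        # Assumption: "true prediction rate" is read as overall accuracy.
        "true_prediction_rate": (tp + tn) / n,
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
result = rates(y_true, y_pred)
print(result)
```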

25 Performance Analysis: True positive rate

26 Performance Analysis: True prediction rate

27 Performance Analysis: Misclassification rate

28 Performance Analysis: Effect of boosting on active learning

29 Performance Analysis: AdaBoost performance on training data

30 Performance Analysis: True Positive Rate

31 Performance Analysis: False Positive Rate

32 Performance Analysis: AdaBoost Training Margin

33 Performance Analysis: Comparison of AdaBoost algorithm with AdaBoost ρ

34 Outline: Future Work

35 Future Work
  1. Implement other, more sophisticated boosting algorithms.
  2. Compare active learning with boosting against active learning using SVM.
  3. Implement other types of weak learners.
  4. Try to come up with an adaptive sampling technique for labeling.

36 Outline: Conclusions

37 Conclusions
  - An accuracy of 86% was achieved while restricting the labeled training data to 10%.
  - Active learning with confidence-based sampling performed much better than random sampling.
  - Building a classifier from a weighted average of single-feature hypotheses performed much better than training on the best single feature.
  - AdaBoost on this SPAM data set needs around 35 boosting iterations to build the perfect classifier. The margin of the training data also converges after 35 iterations.
  - Constraining the margin using AdaBoost ρ did not improve the test error.
  - More tests need to be performed to analyze the performance of soft-margin-based boosting for active learning.
  - Boosting as a classifier should be compared with other classifiers, such as SVMs, that are commonly used for active learning.
