
Machine Learning Final Project: Spam Email Filtering
March 2013
Shahar Yifrah, Guy Lev

Table of Contents

1. Overview
2. Dataset
   2.1 Source
   2.2 Creation of Training and Test Sets
   2.3 Feature Vectors Content
   2.4 Features (Words) Selection
       2.4.1 Preface
       2.4.2 Spamicity
       2.4.3 Selection of Words
   2.5 Shuffling
3. AdaBoost
   3.1 Weak-Learner Family
   3.2 Precision / Recall Tradeoff
   3.3 References
4. Naïve Bayes
   4.1 Results and Discussion
   4.2 References
5. Perceptron
   5.1 References
6. Winnow
   6.1 Reducing False Positive Ratio
   6.2 References
7. Support Vector Machine
   7.1 References
8. K Nearest Neighbors
   8.1 Testing Different Training Fractions
   8.2 Testing Different Numbers of Neighbors
   8.3 Testing Different Dimensionalities
   8.4 Reducing False Positive Ratio
   8.5 References
9. Conclusion
10. Project Website and Files

1. Overview

In this project we examine spam email filtering. We implemented six different algorithms introduced in class in Matlab and compared their performance:
- AdaBoost
- Naïve Bayes
- Perceptron
- Winnow
- SVM (Support Vector Machine)
- KNN (K Nearest Neighbors)

In this document we use the following terminology:
- Ham message: a legitimate (non-spam) message.
- False positive ratio: (number of false positives) / (number of ham messages). Note that this ratio may be higher than the error rate.

We examine the performance of each algorithm in two aspects: the error rate and the false positive ratio. The false positive ratio is of particular interest because filtering out a ham message is worse than letting a spam message get through.

2. Dataset

2.1 Source

We used part of the Enron Spam datasets, available at the CSMining website:
http://csmining.org/index.php/enron-spam-datasets.html
Note that this dataset contains preprocessed email messages (basic "cleaning"), as described at the above link. The dataset contains 5172 messages: 29% spam and 71% ham.

2.2 Creation of Training and Test Sets

We wrote a Python script which processes the messages and creates a feature vector out of each message (the content of the feature vectors is described below). The script divides the feature vectors into a training set and a test set while preserving the ham-spam ratio in each set. In fact, the script randomly creates 90 different pairs of training and test sets, as follows: we used 9 different "training fractions", i.e. the percentage of the dataset used for training: 0.1, 0.2, ..., 0.9. For each training fraction we randomly created 10 different pairs of training and test sets, so that performance can be reported as an average over 10 runs. Note that for AdaBoost and SVM we used only part of these pairs due to long running times.

The Python script's location is Project\Data\enron1\process.py. The last section (Project Website and Files) explains how to get there.

2.3 Feature Vectors Content

The script selects 100 words out of all the words appearing in the dataset (the selection procedure is described below). Each of these words yields one feature: the "within-document frequency" of the word, i.e. (number of appearances of the word in the email) / (total number of words in the email). A "word" in this context can also be one of the following characters: ; ( [ ! $ #
We chose the number of features (100) empirically.

2.4 Features (Words) Selection

2.4.1 Preface

Given a training set of spam and ham messages, we want to choose the 100 words according to which we will build the feature vectors.

2.4.2 Spamicity

Denote:
- Pr(w|S) = the probability that a specific word appears in spam messages
- Pr(w|H) = the probability that a specific word appears in ham messages

We define the spamicity of a word as:

spamicity(w) = Pr(w|S) / (Pr(w|S) + Pr(w|H))

Each probability is estimated from the training set: Pr(w|S) is estimated by the fraction of spam messages containing the word, and Pr(w|H) by the fraction of ham messages containing it.

2.4.3 Selection of Words

In this process we rank each word and finally choose the 100 words with the highest ranks. The intuitive practice is to assign a high rank to words whose spamicity is far from 0.5 (higher or lower): if the spamicity is close to 1 the word should be a good spam indicator, and if it is close to 0 it should be a good ham indicator. We found, however, that looking at spamicity alone is not sufficient: words for which both Pr(w|S) and Pr(w|H) are very small cannot serve as good indicators even if their spamicity is far from 0.5. Therefore we also look at the absolute difference |Pr(w|S) - Pr(w|H)|.

The word selection process is as follows (see the sketch below):
1. Filter out words with |spamicity(w) - 0.5| < 0.05.
2. Filter out rare words, i.e. words appearing (in ham and spam in total) in less than a given fraction (1%) of the messages.
3. For each remaining word, calculate |Pr(w|S) - Pr(w|H)|.
4. Choose the 100 words with the largest |Pr(w|S) - Pr(w|H)|.
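The actual preprocessing is done by the Python script mentioned in Section 2.2; the following is only an illustrative sketch of the selection procedure described above, not the project script. The function names (select_words, feature_vector) and the assumption that messages arrive already tokenized into word lists are ours.

```python
import numpy as np

def select_words(spam_docs, ham_docs, n_words=100,
                 spamicity_margin=0.05, min_doc_fraction=0.01):
    """Rank and select words as in Section 2.4 (illustrative sketch).
    spam_docs / ham_docs: lists of tokenized messages (lists of words)."""
    spam_sets = [set(d) for d in spam_docs]
    ham_sets = [set(d) for d in ham_docs]
    vocab = set().union(*spam_sets, *ham_sets)
    scores = {}
    for w in vocab:
        # Estimate Pr(w|S) and Pr(w|H) by the fraction of messages containing w.
        p_s = sum(w in d for d in spam_sets) / len(spam_sets)
        p_h = sum(w in d for d in ham_sets) / len(ham_sets)
        spamicity = p_s / (p_s + p_h)
        appearances = sum(w in d for d in spam_sets + ham_sets)
        # Filter out words whose spamicity is too close to 0.5, and rare words.
        if abs(spamicity - 0.5) < spamicity_margin:
            continue
        if appearances / (len(spam_sets) + len(ham_sets)) < min_doc_fraction:
            continue
        scores[w] = abs(p_s - p_h)
    # Keep the n_words words with the largest |Pr(w|S) - Pr(w|H)|.
    return sorted(scores, key=scores.get, reverse=True)[:n_words]

def feature_vector(doc, words):
    """Within-document frequency of each selected word (Section 2.3)."""
    total = max(len(doc), 1)
    return np.array([doc.count(w) / total for w in words])
```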

2.5 Shuffling

The ham-spam ratio is dynamic over time, and it would be interesting to cope with that, but we chose not to due to the limited scope of the project. The authors of [2006]¹ studied this issue and found that a Naïve Bayes classifier adapts quickly to a new ratio if trained gradually. We randomly shuffle both the training and the test set before use, so that the ham-spam ratio over time appears uniform to our algorithms.

¹ [2006] Spam Filtering with Naive Bayes -- Which Naive Bayes? V. Metsis, I. Androutsopoulos, G. Paliouras, CEAS 2006. http://www.aueb.gr/users/ion/docs/ceas2006_paper.pdf

3. AdaBoost

We implemented the standard AdaBoost algorithm; see [2] or [3] in the references section below.

3.1 Weak-Learner Family

Each classifier in the weak-learner family uses a threshold over the within-document frequency of a specific word. The learner tries all frequencies appearing in the training set and takes the one minimizing the current weighted misclassification error. The sample weights are then updated to emphasize misclassified examples. This is repeated for a few tens of iterations, and the final classifier is a linear combination of all the detected weak learners (a sketch appears after Section 3.2).

3.2 Precision / Recall Tradeoff

For a general introduction to precision and recall, see [1]. The standard AdaBoost algorithm minimizes the total classification error rate. In ham-spam filtering we prefer precision to recall (i.e. a lower false-positive rate), so we would like more control over the trade-off between the two error types. The standard AdaBoost algorithm, however, is symmetric by design. We looked for ways to make it penalize false positives more heavily than false negatives, perhaps by tweaking the weights in each iteration, but the discussion in [4], section 4 ("Asymmetric AdaBoost"), made us realize that this is not an easy task and is out of the scope of this project.
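As a rough illustration of the threshold stump weak learner and boosting loop described in Section 3.1 (the project implementation itself is in Matlab), a Python sketch might look like the following. The function names, the ±1 label encoding (+1 = spam), and the brute-force threshold search are assumptions of the sketch.

```python
import numpy as np

def train_adaboost(X, y, n_rounds=40):
    """Illustrative AdaBoost with threshold stumps over word frequencies.
    X: (n_samples, n_features) frequency matrix, y: labels in {+1, -1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)          # sample weights
    ensemble = []                    # list of (alpha, feature, threshold, polarity)
    for _ in range(n_rounds):
        best = None
        # Weak learner: best single-feature threshold rule under current weights.
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (+1, -1):
                    pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
        w *= np.exp(-alpha * y * pred)   # emphasize misclassified samples
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict_adaboost(ensemble, X):
    score = sum(a * np.where(p * (X[:, j] - t) > 0, 1, -1)
                for a, j, t, p in ensemble)
    return np.sign(score)
```

The stump search here is deliberately naive (every unique frequency of every feature is tried each round); a practical implementation would pre-sort each feature once.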

[Figure: AdaBoost test error, training error, and false positive ratio vs. training fraction. At training fraction 0.5 the test error is about 8.7%.]

[Figure: AdaBoost test error, training error, and false positive ratio vs. number of boosting iterations (up to 40), with 50% of the data used for training.]

3.3 References

1. Wikipedia: http://en.wikipedia.org/wiki/Precision_and_recall
2. Class lecture scribe: http://www.cs.tau.ac.il/~mansour/ml-course-12/scribe6_boosting.pdf
3. Wikipedia: http://en.wikipedia.org/wiki/AdaBoost
4. Paul Viola and Michael Jones, Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade. http://research.microsoft.com/en-us/um/people/viola/pubs/detect/violajones_nips2002.pdf

4. Naïve Bayes

We implemented a binary-domain Naïve Bayes classifier, as described in class; see [1]. The final classification formula is as follows. Denote:
- X_i = word #i in the message to be classified
- ws(X_i) = frequency of the word X_i in spam messages
- wh(X_i) = frequency of the word X_i in ham messages
- sp = proportion of spam messages in the training set, or the prior belief about it at test time

ratio = log(sp / (1 - sp)) + sum_i log(ws(X_i) / wh(X_i))

The classification is "spam" if this ratio exceeds a threshold of zero.

Note that, in order to avoid relying on Matlab's behavior for logarithm-of-zero and division-by-zero, we first replace zero frequencies in ws, wh with a small number:

1 - ((N - 1) / N)^100

This number comes from estimating that there are N = 250,000 English words and about 100 words in a message. The N for the spam dictionary is even larger, say 10 times the ordinary ham dictionary, because spammers use twisted spelling to avoid getting blocked.

If we take a higher threshold, we obtain a lower false positive ratio (0.5%), at the expense of a higher total error rate of 20% (specifically more false negatives, of course). A sketch of the classifier appears below.
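A minimal Python sketch of this classifier follows (the project implementation is in Matlab). The dictionary sizes mirror the estimates above; the function names and the dictionary-based data layout are assumptions of the sketch.

```python
import numpy as np

def train_nb(spam_docs, ham_docs):
    """Per-word appearance frequencies in spam and ham messages
    (binary-domain Naive Bayes, Section 4; illustrative sketch)."""
    def freq(docs):
        counts = {}
        for d in docs:
            for w in set(d):                 # binary domain: presence per message
                counts[w] = counts.get(w, 0) + 1
        return {w: c / len(docs) for w, c in counts.items()}
    return freq(spam_docs), freq(ham_docs)

def classify_nb(doc, ws, wh, spam_prior=0.5, threshold=0.0):
    # Zero-frequency replacement as in Section 4: roughly the probability that
    # a given dictionary word shows up among the ~100 words of a message.
    N_ham, N_spam = 250_000, 2_500_000       # ham dictionary, spam ~10x larger
    eps_h = 1 - ((N_ham - 1) / N_ham) ** 100
    eps_s = 1 - ((N_spam - 1) / N_spam) ** 100
    ratio = np.log(spam_prior / (1 - spam_prior))
    for w in set(doc):
        ratio += np.log(ws.get(w, eps_s) / wh.get(w, eps_h))
    return "spam" if ratio > threshold else "ham"
```

Raising `threshold` above zero trades false positives for false negatives, as discussed above.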

4.1 Results and Discussion

In the Bayes formula

Pr(S|W) = Pr(W|S) Pr(S) / (Pr(W|S) Pr(S) + Pr(W|H) Pr(H))

the best results were a 7.6% error rate and a 3% false positive ratio, obtained when assuming a 1:1 ham-spam ratio for the priors. If the actual ham-spam ratio of the data (71:29) is used for Pr(S) and Pr(H), the error is about 1.5% worse than when assuming a 1:1 ratio; the false positive ratio, on the other hand, is hurt by only 0.8% when assuming the 1:1 ratio:

[Figure: Naïve Bayes test error, training error, and false positive ratio vs. training fraction (average of 10 runs per training size), for the 1:1 prior and for the actual 71:29 prior.]

In order to get a near-zero false positive ratio, we tried a larger threshold, and then we got to

20%-25% error, as seen in the following graph:

[Figure: Naïve Bayes test error, training error, and false positive ratio vs. classification threshold (0 to 20), average of 10 runs per training size.]

4.2 References

1. Class scribe: http://www.cs.tau.ac.il/~mansour/ml-course-12/scribe2_bayes.pdf
2. Wikipedia: http://en.wikipedia.org/wiki/Bayesian_spam_filter (a different formulation, but equivalent to the one presented in class)
3. SpamProbe - Bayesian Spam Filtering Tweaks, by Brian Burton (addresses the use of within-document frequency of each word): http://spamprobe.sourceforge.net/paper.html
4. A Statistical Approach to the Spam Problem (addresses dealing with rare words): http://www.linuxjournal.com/article/6467
5. Spam Filtering with Naive Bayes -- Which Naive Bayes? V. Metsis, I. Androutsopoulos, G. Paliouras, CEAS 2006. http://www.aueb.gr/users/ion/docs/ceas2006_paper.pdf

5. Perceptron

We implemented the Perceptron algorithm as taught in class; see [1]. At the end of the training stage the algorithm outputs a weight vector w, which represents a hyperplane (hopefully a separating hyperplane, if one exists). We then classified the vectors in the test set using this vector w. It was also interesting to check how much better the results are if we keep updating w during classification of the test set as well. The following graph shows the results:

[Figure: Perceptron training error, test error, and false positive ratio vs. training fraction (average of 10 runs per training size), with and without continuing to update w on the test set.]

We see that, as expected, the results are better when w keeps being updated, though the difference is not dramatic. Perceptron actually had the worst error rate of all the algorithms we checked. We saw in class that if a separating hyperplane exists, then Perceptron has a mistake bound which depends on the hyperplane's margin. It is hard to check whether a separating hyperplane exists here and what its margin is; we can only conclude that either there is no such hyperplane, or the training set was too small for the algorithm to reach the mistake bound. A sketch of the training loop follows.
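A minimal sketch of the Perceptron training loop, in Python for illustration (the project code is Matlab). The ±1 label encoding (+1 = spam), the explicit bias term, and the fixed number of passes over the data are assumptions of the sketch.

```python
import numpy as np

def train_perceptron(X, y, n_epochs=10):
    """Classic Perceptron updates (Section 5, illustrative sketch).
    X: (n_samples, n_features), y: labels in {+1, -1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for x, t in zip(X, y):
            if t * (x @ w + b) <= 0:      # mistake: move the hyperplane toward x
                w += t * x
                b += t
    return w, b

def predict_perceptron(w, b, X):
    return np.sign(X @ w + b)
```

The "updated w" variant discussed above simply keeps applying the same mistake-driven update while classifying the test set.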

5.1 References

1. Class scribe: http://www.cs.tau.ac.il/~mansour/ml-course-12/scribe4_online.pdf

6. Winnow

In class we saw the Winnow algorithm, which learns the class of monotone disjunctions. In that setting the feature vectors contain binary values, and when the algorithm makes a mistake on a vector x, the "active features" (the indices whose weights should be modified) are those with x_i = 1. Our case is different: the feature vectors have real values rather than binary ones, so we had to define how the active features are chosen. We used the following method: given the training set, we calculate the average value of each feature, and the active features of a sample are those with x_i > 10 * AVG_i. The factor of 10 was chosen empirically; it was the one which gave the best results. This criterion gave better results than a criterion based on the standard deviation of each feature.

Another thing we had to choose was the threshold to which the dot product <x, w> is compared (the threshold used in class, the dimension, is irrelevant in our case). The threshold was chosen empirically to be 20; however, all thresholds far enough from 0 gave fairly similar results.

As in the Perceptron section, we also checked the performance when the algorithm keeps updating the weight vector during classification of the test set. A sketch of the resulting update rule is shown below.
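A sketch of this Winnow variant, in Python for illustration (the project code is Matlab). It assumes the dot product <x, w> is taken over the binarized active-feature vector, which is one reasonable reading of the description above; the function and parameter names are ours.

```python
import numpy as np

def train_winnow(X, y, promote=2.0, demote=0.5, threshold=20.0, factor=10.0):
    """Winnow adapted to real-valued features (Section 6, illustrative sketch).
    A feature is 'active' for a sample when it exceeds `factor` times its
    training-set average; y: labels in {+1, -1}, +1 = spam."""
    avg = X.mean(axis=0)
    active = (X > factor * avg)            # boolean (n_samples, n_features)
    w = np.ones(X.shape[1])
    for a, t in zip(active, y):
        pred = 1 if w[a].sum() > threshold else -1
        if pred == t:
            continue
        if t == 1:
            w[a] *= promote                # missed spam: promote active weights
        else:
            w[a] *= demote                 # false positive: demote active weights
    return w, avg

def predict_winnow(w, avg, X, threshold=20.0, factor=10.0):
    active = (X > factor * avg).astype(float)
    return np.where(active @ w > threshold, 1, -1)
```

Making `demote` smaller than 1/`promote` yields the asymmetric variant explored in Section 6.1.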

The following results were obtained (increase factor = 2, decrease factor = 0.5):

[Figure: Winnow training error, test error, and false positive ratio vs. training fraction (average of 10 runs per training size), with and without continuing to update w on the test set.]

We see that Winnow has an error rate of around 17%, a bit better than Perceptron, and its false positive ratio is also better. Continuing to update the weight vector during classification of the test set made almost no difference. Like Perceptron, Winnow has a mistake bound which depends on the separating hyperplane's margin; however, it is hard to tell whether such a hyperplane exists (and how large the margin is).

6.1 Reducing False Positive Ratio

So far we used symmetric increase and decrease factors: when a weight should be increased it is multiplied by 2, and when it should be decreased it is multiplied by 0.5. We tried to reduce the false positive ratio by using asymmetric factors: reducing the decrease factor causes a bias towards ham classification. We had the following result:

[Figure: Winnow error rate and false positive ratio vs. decrease factor (0.2 to 0.5), with increase factor = 2 and training fraction 0.5 (average of 10 runs), with and without continuing to update w on the test set.]

With decrease factor = 0.25, the false positive ratio is 4.4%, indeed less than the 6.8% obtained with decrease factor = 0.5. The penalty is 2.4% in error rate: 19.3% error compared to 16.9%.

6.2 References

1. Notes by John McCullock: http://mnemstudio.org/neural-networks-winnow-example-1.htm
2. Class scribe: http://www.cs.tau.ac.il/~mansour/ml-course-12/scribe4_online.pdf

7. Support Vector Machine

We used Matlab's built-in SVM function with the QP (quadratic programming) method. We were able to test training fractions of up to 0.4, as Matlab crashed with larger training sets. For large training sets the running time was long: with training fraction = 0.4 (about 2000 samples), training takes about 40 minutes. The results were very good though:

[Figure: SVM test error and false positive ratio vs. training fraction (up to 0.4), average of 10 runs per training size.]

With training fraction = 0.4 we obtained a 5% error rate and a 4.5% false positive ratio, i.e. similar ratios of false positives and false negatives. We also tried the kernel functions that Matlab provides, but got worse results with them.
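The experiment itself used Matlab's built-in SVM; purely as an illustration of an equivalent setup in another toolkit, a linear SVM over the same 100-dimensional word-frequency vectors could be run with scikit-learn roughly as follows. X_train, y_train, X_test, y_test are placeholder names for the feature matrices and label vectors (1 = spam, 0 = ham).

```python
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Linear SVM on the word-frequency vectors; other kernels (e.g. 'rbf')
# can be swapped in, though the report found them worse on this data.
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
error_rate = (fp + fn) / len(y_test)
false_pos_ratio = fp / (tn + fp)   # false positives / ham messages
```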

7.1 References

1. Class scribes:
   http://www.cs.tau.ac.il/~mansour/ml-course-12/scribe10_svm.pdf
   http://www.cs.tau.ac.il/~mansour/ml-course-12/scribe11_svm.pdf

8. K Nearest Neighbors

8.1 Testing Different Training Fractions

It was interesting to check another algorithm which is not based on finding a separating hyperplane. First, we set K (the number of neighbors) to 1 and checked the results for different training fractions. The following results were obtained:

[Figure: KNN with K=1: test error and false positive ratio vs. training fraction (average of 10 runs per training size).]

For training fractions >= 0.4 we get an error rate below 10%, which is not bad compared to the other algorithms.

8.2 Testing Different Numbers of Neighbors

The next stage was to test different values of K. We set the training fraction to 0.5 and checked the following values of K: 1, 2, 10, 20, 50, 100. The following results were obtained:

[Figure: KNN test error and false positive ratio vs. K (1 to 100), with training fraction 0.5 (average of 10 runs per K value).]

It was interesting to see that K=1 gave the best error rate. With K=2 the error rate was just a bit higher, but the false positive ratio was significantly lower. As K grew, both the error rate and the false positive ratio became significantly worse. A sketch of the KNN classification we used (including the supermajority variant of Section 8.4) appears below.
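A short illustrative sketch of the KNN classification in this section (the project code is Matlab). It includes the supermajority rule introduced later in Section 8.4; with factor 0.5 it reduces to the plain vote used here. The function and parameter names, the Euclidean distance, and the 0/1 label encoding are assumptions of the sketch.

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=1, supermajority=0.5):
    """Classify a single sample x by its k nearest neighbors.
    y_train: 1 = spam, 0 = ham.  The sample is called spam only if at least
    `supermajority * k` of its k nearest neighbors are labeled spam."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]
    spam_votes = y_train[nearest].sum()
    return 1 if spam_votes >= supermajority * k else 0
```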

8.3 Testing Different Dimensionalities

We wanted to check whether the behavior shown in the previous section is due to the "curse of dimensionality". We checked what happens if we reduce the dimensionality by using fewer features, testing dimensions of 20, 40, 70, and 100 (in each case choosing the d features with the highest rank). The following results were obtained:

[Figure: KNN error rate vs. K for dimensions 20, 40, 70, and 100, with training fraction 0.5 (average of 10 runs per K value).]

[Figure: KNN false positive ratio vs. K for dimensions 20, 40, 70, and 100, with training fraction 0.5 (average of 10 runs per K value).]

We see that indeed, in KNN it is better to use fewer than 100 features: for the larger K values the best dimension was 40, and for the smaller K values it was 70.

8.4 Reducing False Positive Ratio

The last stage was to try to reduce the false positive ratio. The idea was to classify a message as spam according to a supermajority rather than a regular majority: a sample is classified as spam if at least αK of its nearest neighbors are labeled as spam, where the "supermajority factor" α is between 0.5 and 1. We checked the supermajority factors 0.5, 0.6, 0.7, 0.8, and 0.9, with the dimensionality (number of features) set to 40. We had the following results:

[Figure: KNN error rate vs. K for supermajority factors 0.5 to 0.9, with dimension 40 and training fraction 0.5 (average of 10 runs per K value). With K=20 and factor 0.9 the error rate is about 20%.]

[Figure: KNN false positive ratio vs. K for supermajority factors 0.5 to 0.9, with dimension 40 and training fraction 0.5 (average of 10 runs per K value). With K=20 and factor 0.9 the false positive ratio is about 0.77%.]

We see that with K=20 and a supermajority factor of 0.9 we can achieve a 0.77% false positive ratio; the penalty is an error rate of 20%. The lowest false positive ratio obtained in this test was 0.4%, with K=100 and a supermajority factor of 0.9. However, it is not valuable, as the error rate in this case is 26.6%, very close to the 29% error rate of the trivial algorithm which classifies every message as ham (29% is the proportion of spam messages in the dataset).

8.5 References

1. Class scribe: http://www.cs.tau.ac.il/~mansour/ml-course-12/scribe7_nn.pdf

9. Conclusion

The following table shows the best results of each algorithm using its optimal settings. All results are with a training fraction of 0.5, except for SVM, for which the training fraction is 0.4 (the maximum we could test). All values are percentages.

                 Minimizing Error            Minimizing False Positive Ratio
                 Error Rate   FP Ratio       Error Rate   FP Ratio
  AdaBoost          8.7          4               NA           NA
  Naïve Bayes       7.6          3               20           0.5
  Perceptron       18.5         12.5             NA           NA
  Winnow           16.9          6.8             19.3         4.4
  SVM               5            4.5             NA           NA
  KNN               8.8          9.5             20           0.77

We see that the best two algorithms are Naïve Bayes and SVM. Considering the importance of a low false positive ratio in spam filtering, it is understandable why Naïve Bayes is so widely used in real life. Adding the facts that Naïve Bayes is much easier to implement and has a much lower running time, it becomes clear why Naïve Bayes is a natural choice. SVM and AdaBoost do not lend themselves to adjusting the balance between the two error types, so it is hard to reduce their false positive ratios.

10. Project Website and Files

The project website is at: http://www.cs.tau.ac.il/~shaharyi/ml_final_project_2013.html
A zip file containing all project files (data & code) is available for download there. In addition, these files are available on the TAU server, at ~shaharyi/ml/project. Under the "Project" directory there is a read-me file (Readme.txt) which explains the directory tree and all project files.