Machine Learning Final Project: Spam Filtering
March 2013
Shahar Yifrah, Guy Lev
Table of Contents
1. Overview
2. Dataset
   2.1 Source
   2.2 Creation of Training and Test Sets
   2.3 Feature Vectors Content
   2.4 Features (Words) Selection
       Preface; Spamicity; Selection of Words
   2.5 Shuffling
3. AdaBoost
   3.1 Weak-Learner Family
   3.2 Precision / Recall Tradeoff
   3.3 References
4. Naïve Bayes
   4.1 Results and Discussion
   4.2 References
5. Perceptron
   5.1 References
6. Winnow
   6.1 Reducing False Positive Ratio
   6.2 References
7. Support Vector Machine
   7.1 References
8. K Nearest Neighbors
   8.1 Testing Different Training Fractions
   8.2 Testing Different Numbers of Neighbors
   8.3 Testing Different Dimensionalities
   8.4 Reducing False Positive Ratio
   8.5 References
9. Conclusion
10. Project Website and Files
1. Overview

In this project we were interested in examining spam filtering. We used 6 different algorithms introduced in class to implement spam filtering in Matlab, and compared their performance. The algorithms we used were:
- AdaBoost
- Naïve Bayes
- Perceptron
- Winnow
- SVM (Support Vector Machine)
- KNN (K Nearest Neighbors)

In this document we use the following terminology:
- Ham message: a legitimate message (i.e. non-spam).
- False positive ratio: (number of false positives) / (number of ham messages). Note that this ratio may be higher than the error rate.

We examine the performance of each algorithm in two aspects: the error rate and the false positive ratio. The false positive ratio is interesting because filtering out a ham message is a bad thing, worse than letting a spam message get through.
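For concreteness, here is a small Python sketch of the two measures (our own illustration with made-up labels, not part of the project code; +1 marks spam and -1 marks ham):

    import numpy as np

    def error_rate(pred, truth):
        """Fraction of all messages that are misclassified."""
        return np.mean(pred != truth)

    def false_positive_ratio(pred, truth):
        """(number of false positives) / (number of ham messages).
        Can exceed the error rate when ham messages are few."""
        ham = truth == -1
        return np.sum((pred == 1) & ham) / np.sum(ham)

    # Example: 3 ham messages, 1 wrongly flagged as spam:
    # error rate = 0.25, false positive ratio = 1/3.
    truth = np.array([-1, -1, -1, 1])
    pred  = np.array([ 1, -1, -1, 1])
    print(error_rate(pred, truth), false_positive_ratio(pred, truth))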
2. Dataset

2.1 Source

We used part of the Enron Spam datasets, available at the CSMining website. Note that this dataset contains preprocessed messages (basic "cleaning"), as described at the above link. The dataset size is 5172 messages. The ratio between spam and ham messages is 29% spam, 71% ham.

2.2 Creation of Training and Test Sets

We wrote a Python script to process these messages and create a feature vector out of each message (the feature vectors are described below). The script divides the feature vectors into a training set and a test set, while preserving the ham-spam ratio in each set. More precisely, the script randomly creates 90 different pairs of training and test sets as follows: we used 9 different "training fractions", i.e. percentages of the training set size out of the entire dataset (0.1, 0.2, ..., 0.9), and for each training fraction we randomly created 10 different pairs of training and test sets, so that we can examine the performance as an average over 10 runs. Note that for AdaBoost and SVM we used only part of these pairs of sets due to long running times. The Python script's location is Project\Data\enron1\process.py; the last section (Project Website and Files) explains how to get there.

2.3 Feature Vectors Content

The script selects 100 words out of all the words which appear in messages in the dataset (how these 100 words are chosen is described below). Each of these words yields a feature: the "within-document frequency" of this word, i.e. (number of appearances of the word in the message) / (total number of words in the message). A "word" in this case can also be one of the following characters: ; ( [ ! $ #. We chose the number of features (100) empirically.
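For illustration, a minimal Python sketch of such a feature vector (our own code with a hypothetical helper name; the real preprocessing lives in process.py):

    from collections import Counter

    def feature_vector(message_tokens, selected_words):
        """Within-document frequency of each selected word:
        (# occurrences of the word in the message) / (total # of words in the message)."""
        counts = Counter(message_tokens)
        total = max(len(message_tokens), 1)   # avoid division by zero for empty messages
        return [counts[w] / total for w in selected_words]

    # Example:
    # feature_vector(["free", "viagra", "free", "meeting"], ["free", "meeting", "$"])
    # -> [0.5, 0.25, 0.0]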
2.4 Features (Words) Selection

Preface

Given a training set of spam and ham messages, we want to choose the 100 words according to which we will prepare the feature vectors.

Spamicity

Denote:
Pr(w|S) = the probability that a specific word appears in spam messages.
Pr(w|H) = the probability that a specific word appears in ham messages.

We define the spamicity of a word as:

    spamicity(w) = Pr(w|S) / (Pr(w|S) + Pr(w|H))

Each probability is estimated using the relevant proportion of messages in the training set: Pr(w|S) is estimated by the fraction of spam messages containing the word, and Pr(w|H) by the fraction of ham messages containing the word.

Selection of Words

In this process we rank each word, and finally choose the 100 words with the highest ranks. The intuitive practice is to assign a high rank to words whose spamicity is far away from 0.5 (higher or lower): if the spamicity is close to 1, the word should serve as a good spam indicator; if it is close to 0, it should be a good ham indicator. We found, however, that looking at spamicity alone is not sufficient: words for which both Pr(w|S) and Pr(w|H) are very small cannot serve as good indicators even if their spamicity is far from 0.5. Therefore we also look at how big the absolute difference |Pr(w|S) - Pr(w|H)| is.

The word selection process is as follows (a code sketch is given below):
- Filter out words with |spamicity(w) - 0.5| < 0.05.
- Filter out rare words, namely words appearing (in ham and spam in total) in less than a given fraction of messages (1%).
- For each word which was not filtered out, calculate |Pr(w|S) - Pr(w|H)|.
- Choose the 100 words with the biggest |Pr(w|S) - Pr(w|H)|.
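The selection procedure can be summarized by the following Python sketch (our own illustrative code, assuming the per-word document frequencies have already been estimated from the training set):

    def select_words(p_spam, p_ham, doc_freq, n_words=100,
                     spamicity_margin=0.05, rare_threshold=0.01):
        """p_spam[w]: fraction of spam messages containing w,
        p_ham[w]:  fraction of ham messages containing w,
        doc_freq[w]: fraction of all messages (ham + spam) containing w."""
        candidates = []
        for w in doc_freq:
            ps, ph = p_spam.get(w, 0.0), p_ham.get(w, 0.0)
            if ps + ph == 0:
                continue
            spamicity = ps / (ps + ph)
            if abs(spamicity - 0.5) < spamicity_margin:   # too close to 0.5: weak indicator
                continue
            if doc_freq[w] < rare_threshold:              # rare word: unreliable estimate
                continue
            candidates.append((abs(ps - ph), w))
        candidates.sort(reverse=True)
        return [w for _, w in candidates[:n_words]]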
2.5 Shuffling

The ham-spam ratio is dynamic over time, and it would be interesting to cope with that, but we chose not to due to the limited scope of the project. The authors of [2006] (Spam Filtering with Naive Bayes -- Which Naive Bayes?, Vangelis Metsis et al., 2006) studied this issue and found that a Naïve Bayes classifier adapts quickly to the new ratio if trained gradually. We randomly shuffle both the training and test sets before use, so that their ham-spam ratio over time appears uniform to our algorithms.

3. AdaBoost

We implemented the standard AdaBoost algorithm; see [1] or [2] in the references section below.

3.1 Weak-Learner Family

Each classifier in the weak-learner family uses a threshold over the within-document frequency of a specific word. It tries all frequencies appearing in the training set and takes the one minimizing the current weighted misclassification error. The weights are then updated to emphasize misclassified samples. This is done for a few tens of iterations. The final classifier is a linear combination of all the selected weak learners.

3.2 Precision / Recall Tradeoff

For a general introduction to precision and recall, see [1]. The standard AdaBoost algorithm minimizes the total classification error rate. Our domain is ham-spam filtering, so we prefer precision to recall (a lower false positive rate) and would like more control over the trade-off between the two error types. But the standard AdaBoost algorithm is symmetric by design. We looked for ways to make it penalize false positives more than false negatives, perhaps by tweaking the weights during the iterations, but the discussion in [4], section 4, "Asymmetric AdaBoost", made us realize that this is not an easy task and is out of the scope of this project.
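For reference, here is a minimal Python sketch of the standard (symmetric) AdaBoost loop over the word-frequency decision stumps described in Section 3.1 (our own illustration; the project's Matlab implementation may differ in its details):

    import numpy as np

    def train_adaboost(X, y, n_rounds=40):
        """Minimal AdaBoost over decision stumps, one stump per round.
        X: (n, d) within-document frequencies, y: (n,) labels in {+1 spam, -1 ham}."""
        n, d = X.shape
        w = np.full(n, 1.0 / n)               # sample weights
        ensemble = []                         # (alpha, feature, threshold, polarity)
        for _ in range(n_rounds):
            best = None
            for j in range(d):                # search all words, thresholds, polarities
                for theta in np.unique(X[:, j]):
                    for s in (+1, -1):
                        pred = s * np.where(X[:, j] >= theta, 1, -1)
                        err = w[pred != y].sum()
                        if best is None or err < best[0]:
                            best = (err, j, theta, s, pred)
            err, j, theta, s, pred = best
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
            w *= np.exp(-alpha * y * pred)    # emphasize misclassified samples
            w /= w.sum()
            ensemble.append((alpha, j, theta, s))
        return ensemble

    def predict_adaboost(ensemble, X):
        """Linear combination of the selected weak learners; +1 = spam, -1 = ham."""
        score = sum(a * s * np.where(X[:, j] >= t, 1, -1) for a, j, t, s in ensemble)
        return np.sign(score)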
[Figure: AdaBoost test and training error rates and false positive ratio vs. training fraction (averaged over runs per training size).]

[Figure: AdaBoost test and training error rates and false positive ratio vs. number of iterations (up to 40 iterations, with 50% of the data used for training).]
3.3 References
1. Wikipedia
2. Class lecture scribe
3. Wikipedia
4. Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade, Paul Viola and Michael Jones

4. Naïve Bayes

We implemented a binary-domain Naïve Bayes classifier, as described in class; see [1]. Following is the final formula for classification. Denote:

Xi = word #i in the message to be classified
ws = frequency of a given word in spam messages
wh = frequency of a given word in ham messages
sp = proportion of spam messages in the training set, or the prior belief about it in the test set

    ratio = log(sp / (1 - sp)) + Σ_i log(ws(Xi) / wh(Xi))

The classification is "spam" if this ratio exceeds a threshold of zero. Note that in order to avoid relying on Matlab behavior for logarithm-of-zero and division-by-zero, we first replace zero frequencies in ws, wh with a small number:

    1 - ((N - 1) / N)^100

This number comes from estimating that there are N = 250,000 English words, and about 100 words in a message. The N for the spam dictionary is even larger, say 10 times larger than the ordinary ham dictionary, because spammers use twisted spelling to avoid getting blocked. If we take a higher threshold, we obtain a lower false positive ratio (0.5%), at the expense of a higher total error rate of 20% (specifically, more false negatives, of course).
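A minimal Python sketch of this decision rule (our own illustration using the notation above, not the project's Matlab code; raising the threshold above zero trades a higher error rate for fewer false positives, as described):

    import math

    N = 250_000                                  # rough size of the English dictionary
    EPS = 1 - ((N - 1) / N) ** 100               # replacement for zero frequencies
    # (the project used a larger N for the spam dictionary; one value kept here for brevity)

    def classify(message_words, ws, wh, sp, threshold=0.0):
        """ws[w], wh[w]: frequency of word w in spam / ham training messages;
        sp: proportion of spam messages (or prior belief about it).
        Returns True if the message is classified as spam."""
        ratio = math.log(sp / (1 - sp))
        for w in message_words:
            ratio += math.log(ws.get(w, EPS) or EPS) - math.log(wh.get(w, EPS) or EPS)
        return ratio > threshold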
4.1 Results and Discussion

In the Bayes formula

    Pr(S|W) = Pr(W|S) Pr(S) / (Pr(W|S) Pr(S) + Pr(W|H) Pr(H))

the best results, a 7.6% error rate and a 3% false positive ratio, were obtained when assuming a 1:1 ham-spam ratio for Pr(S) and Pr(H). If we instead use the actual ham-spam ratio of the data (71:29), the error rate is about 1.5% worse than when assuming a 1:1 ratio, while the false positive ratio is hurt by only 0.8% when assuming the 1:1 ratio:

[Figure: error rate and false positive ratio vs. training fraction, for the 1:1 prior and for the actual 71:29 prior (average of 10 runs per training size).]

In order to get a near-zero false positive ratio, we tried a larger threshold, and then the error rate rose to 20%-25%, as seen in the following graph.

[Figure: error rate and false positive ratio vs. classification threshold (average of 10 runs per training size).]

4.2 References
1. Class scribe
2. Wikipedia (a different formulation, but equivalent to the one presented in class)
3. SpamProbe - Bayesian Spam Filtering Tweaks, by Brian Burton (addresses the use of within-document frequency of each word)
4. A Statistical Approach to the Spam Problem (addresses dealing with rare words)
5. Spam Filtering with Naive Bayes -- Which Naive Bayes? (2006), Vangelis Metsis et al.
5. Perceptron

We implemented the Perceptron algorithm as taught in class; see [1]. At the end of the training stage the algorithm outputs a weight vector w, which represents a hyperplane (hopefully a separating hyperplane, if one exists). Then we classified the vectors in the test set using this vector w. It was also interesting for us to check how much better the results are if we continue updating w during classification of the test set as well. The following graph shows the results:

[Figure: error rate and false positive ratio vs. training fraction, for constant w and for w updated during testing (average of 10 runs per training size).]

We see that, as expected, the results are better when we continue updating w, though the difference is not dramatic. Perceptron actually had the worst error rate of all the algorithms we checked. We saw in class that if a separating hyperplane exists, then Perceptron has a mistake bound which depends on the hyperplane's margin. It is hard to check whether a separating hyperplane exists and what its margin is; we can conclude that either there is no such hyperplane, or the training set size was too small for the algorithm to reach the mistake bound.
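For reference, a minimal Python sketch of the Perceptron training pass and the two test-time variants we compare (our own illustration, not the project's Matlab code):

    import numpy as np

    def train_perceptron(X, y):
        """One online pass over the training set.
        X: (n, d) feature vectors, y: (n,) labels in {+1 spam, -1 ham}."""
        w = np.zeros(X.shape[1])
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:   # mistake: update the hyperplane
                w += y_i * x_i
        return w

    # "Constant w" variant: classify test vectors with sign(<w, x>) and never touch w.
    # "Updated w" variant: keep applying the same mistake-driven update on test vectors.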
5.1 References
1. Class scribe

6. Winnow

In class we saw the Winnow algorithm, which learns the class of monotone disjunctions. In that setting the feature vectors contain binary values, and when the algorithm makes a mistake on a vector x, the "active features" (i.e. the indices where the weights should be modified) are chosen according to whether x_i = 1. Our case is different: the feature vectors have real values rather than binary ones, so we had to define how the active features are chosen. We used the following method: given the training set, we calculate the average value of each feature, and the active features are chosen according to whether x_i > 10 * AVG_i. The factor of 10 was chosen empirically; it was the one which gave the best results, and this criterion gave better results than a criterion which used the standard deviation of the feature. Another thing we had to choose was the threshold to which we compare the dot product <x, w> (the threshold we used in class, which was the dimension, is irrelevant in our case). The threshold was chosen empirically to be 20; however, all thresholds far enough from 0 gave pretty similar results. As discussed in the Perceptron section, we also checked the performance when we let the algorithm update the weight vector during classification of the test set.
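A minimal Python sketch of this Winnow variant (our own illustration; the single-pass structure and the all-ones initialization are assumptions, not the project's Matlab code):

    import numpy as np

    def train_winnow(X, y, factor_up=2.0, factor_down=0.5, threshold=20.0, active_mult=10.0):
        """Winnow adapted to real-valued features, as described above.
        X: (n, d) within-document frequencies, y: (n,) labels in {+1 spam, -1 ham}."""
        avg = X.mean(axis=0)                      # per-feature average over the training set
        w = np.ones(X.shape[1])
        for x_i, y_i in zip(X, y):
            pred = 1 if np.dot(w, x_i) > threshold else -1
            if pred != y_i:
                active = x_i > active_mult * avg  # "active features" for this vector
                if y_i == 1:                      # false negative: promote active weights
                    w[active] *= factor_up
                else:                             # false positive: demote active weights
                    w[active] *= factor_down
        return w

The asymmetric variant of Section 6.1 below corresponds to passing a decrease factor smaller than 0.5.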
The following results were obtained:

[Figure: error rate and false positive ratio vs. training fraction, with increase factor = 2 and decrease factor = 0.5, for constant w and for w updated during testing (average of 10 runs per training size).]

We see that Winnow has an error rate around 17%, a bit better than Perceptron; the false positive ratio is also better. Continuing to update the weight vector during classification of the test set made almost no difference. Like Perceptron, Winnow has a mistake bound which depends on the separating hyperplane's margin; however, it is hard to tell whether such a hyperplane exists (and how big its margin is).

6.1 Reducing False Positive Ratio

Above we used symmetric increase and decrease factors: when a weight should be increased, it is multiplied by 2, and when it should be decreased, it is multiplied by 0.5. We tried to reduce the false positive ratio by using asymmetric factors, reducing the decrease factor and thus causing a bias towards ham classification. We had the following result:
[Figure: error rate and false positive ratio vs. decrease factor, with increase factor = 2 and training set size 0.5 of all data (average of 10 runs).]

With a decrease factor of 0.25, the false positive ratio is 4.4%, indeed less than the 6.8% obtained with a decrease factor of 0.5. The penalty is 2.4% in error rate: 19.3% error compared to 16.9%.

6.2 References
1. Notes by John McCullock
2. Class scribe
7. Support Vector Machine

We used Matlab's built-in SVM function, with the QP (quadratic programming) method. We were able to test training fractions of up to 0.4, as Matlab crashed with larger training sets. For large training sets the running time was long: with a training fraction of 0.4 (about 2000 samples), training takes about 40 minutes. The results were very good though:

[Figure: error rate and false positive ratio vs. training fraction (average of 10 runs per training size).]

With a training fraction of 0.4, we obtained a 5% error rate and a 4.5% false positive ratio, i.e. similar ratios of false positives and false negatives. We also tried the kernel functions which Matlab provides, but got worse results with them.
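We did not implement SVM ourselves. For readers reproducing this outside Matlab, the following is a rough equivalent sketch in Python using scikit-learn's linear SVM on synthetic stand-in data (an assumption on our part, not the function the project used):

    import numpy as np
    from sklearn.svm import SVC

    # Toy stand-in for the within-document-frequency vectors (+1 = spam, -1 = ham).
    rng = np.random.default_rng(0)
    X_train, y_train = rng.random((200, 100)), rng.choice([-1, 1], size=200)
    X_test, y_test = rng.random((50, 100)), rng.choice([-1, 1], size=50)

    # Linear SVM, roughly analogous to the QP-based linear SVM we used in Matlab.
    clf = SVC(kernel="linear")
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)

    error_rate = np.mean(pred != y_test)
    false_pos_ratio = np.sum((pred == 1) & (y_test == -1)) / np.sum(y_test == -1)
    print(error_rate, false_pos_ratio)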
7.1 References
1. Class scribe

8. K Nearest Neighbors

8.1 Testing Different Training Fractions

It was interesting for us to check another algorithm which is not based on finding a separating hyperplane. First, we set K (the number of neighbors) to 1 and checked the results for different training fractions. The following results were obtained:

[Figure: error rate and false positive ratio vs. training fraction for K=1 (average of 10 runs per training size).]

For training fractions >= 0.4 we get less than a 10% error rate, which is not bad compared to the other algorithms.

8.2 Testing Different Numbers of Neighbors

The next stage was to test different values of K. We set the training fraction to 0.5 and checked the following values of K: 1, 2, 10, 20, 50, 100. The following results were obtained:
[Figure: error rate and false positive ratio vs. K, with training set size 0.5 of all data (average of 10 runs per K value).]

It was interesting to see that K=1 gave the best error rate. The error rate for K=2 was just a bit higher, but its false positive ratio was significantly lower. As K grew, both the error rate and the false positive ratio became significantly worse.

8.3 Testing Different Dimensionalities

We wanted to check whether the behavior shown in the previous section was due to the "curse of dimensionality" effect, so we checked what happens if we reduce the dimensionality by using fewer features. We tested the following values for the dimension d: 20, 40, 70, 100 (in each case we chose the d features with the highest rank). The following results were obtained:
[Figure: error rate vs. K for dimensions 20, 40, 70, 100, with training set size 0.5 of all data (average of 10 runs per K value).]

[Figure: false positive ratio vs. K for dimensions 20, 40, 70, 100, with training set size 0.5 of all data (average of 10 runs per K value).]
We see that indeed, in KNN it is better to use fewer than 100 features. For the larger K values the best dimension was 40; for the smaller K values a different dimension was best.

8.4 Reducing False Positive Ratio

The last stage was to try reducing the false positive ratio. The idea was to classify a message as spam according to a supermajority rather than a regular majority: a sample is classified as spam if at least αK of its nearest neighbors are labeled as spam, where the "supermajority factor" α is between 0.5 and 1. We checked the following supermajority factors: 0.5, 0.6, 0.7, 0.8, 0.9. In this test we set the dimensionality (number of features) to 40.
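A minimal Python sketch of the supermajority rule (our own illustration; the Euclidean distance metric is an assumption, as the distance measure is not stated above):

    import numpy as np

    def knn_classify(X_train, y_train, x, k=20, supermajority=0.9):
        """Classify x as spam (+1) only if at least supermajority * k of its
        k nearest neighbors are spam; otherwise classify as ham (-1)."""
        dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to all samples
        nearest = np.argsort(dists)[:k]
        spam_votes = np.sum(y_train[nearest] == 1)
        return 1 if spam_votes >= supermajority * k else -1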
We had the following results:

[Figure: error rate vs. K for supermajority factors 0.5-0.9, with dimension 40 and training set size 0.5 of all data (average of 10 runs per K value).]

[Figure: false positive ratio vs. K for supermajority factors 0.5-0.9, with dimension 40 and training set size 0.5 of all data (average of 10 runs per K value).]

We see that with K=20 and a supermajority factor of 0.9, we can achieve a 0.77% false positive ratio; the penalty is an error rate of 20%. The lowest false positive ratio obtained in this test was 0.4%, with K=100 and a supermajority factor of 0.9. However, it is not valuable, as the error rate in this case is 26.6%, very close to the 29% error rate of the trivial algorithm which classifies every message as ham (29% is the proportion of spam messages in the dataset).

8.5 References
1. Class scribe
9. Conclusion

The following table shows the best results of each algorithm using its optimal settings. All results are with a training fraction of 0.5, except for SVM, for which the training fraction is 0.4 (the maximum we could test).

                    Minimizing Error               Minimizing False Positive Ratio
                    Error Rate (%)  FP Ratio (%)   Error Rate (%)  FP Ratio (%)
    AdaBoost                                       NA              NA
    Naïve Bayes     7.6             3.0            20              0.5
    Perceptron                                     NA              NA
    Winnow          16.9            6.8            19.3            4.4
    SVM             5.0             4.5            NA              NA
    KNN                                            20              0.77

We see that the best two algorithms are Naïve Bayes and SVM. Considering the importance of a low false positive ratio in the spam filtering problem, it is understandable why Naïve Bayes is the most widely used in real life. Adding the facts that Naïve Bayes is much easier to implement and has a much lower running time, it becomes clear why Naïve Bayes is a natural choice. SVM and AdaBoost don't lend themselves to adjusting the balance of the error types, so it is hard to reduce their false positive ratio.

10. Project Website and Files

A zip file containing all project files (data & code) is available for download on the project website. In addition, these files are available on the TAU server, at ~shaharyi/ml/project. Under the "Project" directory there is a read-me file (Readme.txt) which explains the directory tree and all project files.