Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model"

Transcription

1 AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam filters are a well-known and powerful type of filters. We construct different filters using three types of classification, including,, and. We compare the pros and cons between these three types and use some approaches to improve them to get a better spam filter. Index Terms Spam filter,,, 1 INTRODUCTION M ass unsolicited electrontic mail, often known as spam, has recently increased enormously and has become a serious threat to not only the Internet but also to society. Over the past few years, many approaches have been provided to filter spam. Some of them often use Naïve Bayes method to implement it. 2 PROPOSED APPROACHES 2.1 Problem Definition Construct a filter which can prevent spam from getting into mailbox. Input: A mail message Output: Spam or Ham 2.2 Proposed Solution Figure.1 System Flowchart We use method of machine learning to train a model, and decide the input message belonges ham or spam. We use three methods to implement spam filter, including Yun-Nung Chen Author is in the National Taiwan Universty. Che-An Lu Author is in the National Taiwan Universty. Chau-Yu Huang Author is in the National Taiwan Universty AI TERMPROJECT,, and. We will compare these three methods first. After that we will modify each method independently to improve the accuracy and compare each method itself. Finally make a conclusion above all the improvement on these three methods. Following are our methods details. From Bayesian theorem of total probability, given the vector x = x,x,,x of a mail d, the probability that d belongs to category C is: P c=c X =x = where k spam,ham. We can decide that mail belongs to a model with a higher probability. This uses unigram language model to compute the probability of class. In order to get better feature (word with more information), we preprocess testing message to remove some noise word and reserve words which are more important. We remove words whose lenth is longer than 50, and we also delete words which appear for less 20 times. A word which appears in more documents represents that it carries less information for classification. We use SRI Language Model Toolkit: to generate unigram language model, and compute the Bayesian Probability according to the model. We can decide the classification of testing message. The support vector machines () is easy to use and powerful for data classification. When generating filter model, we create a vector for each data in tha training corpus and then will map these vectors into hyperplan. Then will learn which kinds of vector are close to which class. Here is a good approaching method in precisely finding best classification hyper-plans to maximize the margin so that we could classify a new mail into spam or ham. (1)

2 2 AI TERM PROJECT GROUP 14 Figure.2 Version1 First we will select top 1000 document frequency terms as we do in. Secondly we will create TF-IDF for each training data. Thirdly we will use libsvm (a tool from Chih-Jen Lin) to train a model. With each unclassified mail we will create a TF-IDF vector too, finally predict it with the svm model we trained before. The fomula of TF-IDF is described as below: log +1 (2) Where N is number of terms in whole corpus, f is frequency of term t, and df is data frequency of term t. If a term appears in many documents, it represents that this term doesn t have significance. So we use IDF to diminish the weight of such term inorder to select better features. Why we use data frequency to select the feature is because the following figure (a reference from course Web Mining ). Figure.3 We can see that the performance between all the feature selection methods except mutual information and term strength makes no much difference. For our coding convinence, we choose to implement data frequency. Version2 It is similar to version 1, we only turn the uppercase into lowercase (case-insensitive), so that for example free will be the same term with FREE or Free Version3 It is similar to version 2, the only difference is that when training svm model, we will set the parameters cost and gamma to the best condition. Version1 First we will create a binary vector which maintain the information of each feature (term) exist or not (1 or 0) for each training data. Secondly, for each unclassified mail we will create a binary vector too, then using cosine distance to find out the top K closest training data from the corpus then find out which class is the unclassified mail belongs to. cosine distance: Version2 It is similar to version 1, one big difference is that in version 2, we will not use binary vector as before, we will use TF-IDF respectively. Version3 It is similar to previous version, we only turn the uppercase into lowercase (case-insensitive), so that for example free will be see as the same term with FREE or Free. 3 CONTRIBUTIONS 3.1 Compare each method independently We compare three methods independently, and we also can observe the difference between sizes in 1200, 3000, 9000, and When we trains language model, computing the probabilities of words doesn t care the case (case-insensitive), and we also remove the word with too small probability for message preprocessing. When the size of traing data is smaller (1200), the result still has good performance. We can see that accuracy is %. But when training set becomes a little larger (3000), the result is not good as smaller one. We can see that the result is improved up slightly. Compared to size, the improvement of accuracy is relatively small, and we can just improve accuracy to 3% ( % -> ). We believe feature selection is important to Naïve Bayes, and we can use a better feature to improve result. But we must spend much time to testing data to see whether the result will be improved up, time and accuracy is tradeoff. When the training set is small (1200), the result of model is much poor than others. When we use case-insensitive to create tf-idf vector, the accuracy can improve up to 20% (60.854% -> %), which means that it is important to combine the information of uppercase and lowercase together to increase the concept for a specific term (ex: free, Free, FREE). The other reason is that if we see free and Free as the same term, then the data frequency of free will increase, so that we won t throw away such important feature (since we only select top 1000 ranked by data frequency). (3)

3 AI TERM PROJECT GROUP 14 3 After we find the best gamma and cost value for each corpus, the performance can also improve up to 20% ( % -> %). But the process of finding such parameters is very time consuming, so it will be a tradeoff. When using large training set, the performance won t improve accordingly. We think that it is because when the training set grows up, the noise will also increase, which means that there will be more ham similar to spam. The performance of is really out of our expectation. At begin, we think won t be better than. But after our experiment, seems work well on spam classification. The other interesting thing is that when using binary vector and TF-IDF vector, the performance makes little difference (notice that TF-IDF still a little better than binary vector). The improvement of using TF-IDF vector is not that significant as we expect. We think that it is because in spam classification, some important features in spam is quite different from ham, so the weight (TF- IDF) of such feature will not be so important as whether there exist such feature. And we also found that the case-sensitive and caseinsensitive makes little difference in, not like in. We think the main reason is that we didn t do feature selection in, instead, we keep all the features. So we won t throw out some important features (ex: Money) such as feature selection does. And this may also be the reason why will beat. 3.2 Compare all these three methods We think that it is why is better than that the feature we select when we implement Naïve Bayes is not good enough to train an excellent model. We think can beat is because that we throw some information when doing feature selection in. And the other reason might be that spam classification problem is a binary decide problem (spam / ham), so can easily close to one side, and we think if there exists more class, the performance of will be better than. When training corpus is small, the performance of is still well, not like, with only 60% accuracy. We think it is because is a machine learning method, we can t expect it learns well with a little training data. can still find the top K similar data in a small corpus. We also think that is good method to filt spam when traing set is small, and because the word appearing in ham isn t too many, we can compute probabilities of words to decide a category by using a training set with small size. 4 EXPERIMENTAL RESULTS 4.1 Corpus We use the corpus which is provided by trec06. There are messages (12910 ham, spam). We separate the whole set into training data and testing data. The testing data are 2910 ham and 4912 spam which are randomly select from the corpus. Remaining data (10000 ham, spam) are used to be training data. These two set (testing data and training data) is independent. In our experiments, we will create four different sizes of training data which are randomly select from the training corpus. The ratio of spam and ham in these four training datas are all 2:1, and the corresponding size of them are 1200, 3000, 9000, and (Following we will use 800:400, 2000:1000, 6000:3000, 14000:7000 to represent four different training sets) 4.2 Result of Evaluation Following are three different methods accuracy table with different sizes of training data. Accuracy 800: % 2000: % 6000: % 14000: % Table. 1 Version1 Version2 Version3 800: % % % 2000: % % % 6000: % % % 14000: % % % Table. 2 Version1 Version2 Version3 800: % % % 2000: % % % 6000: % % % 14000: % % % Table. 3 Fig.4 is the accuracy plot of, first version of, and first version of with different size of training data % 80.00% 75.00% 70.00% 65.00% 60.00% Figure.4 Accuracy of ver1 ver1 We use to be the baselin. We can see that in the first version of, all the accuracy are less than. And in the first version of, the performance is already better than Naïve bayes. Fig.5 is the accuracy plot after we improve the

4 4 AI TERM PROJECT GROUP 14 accuracy % 96.00% 94.00% 85.00% 80.00% 75.00% 70.00% ver1 ver2 ver % 88.00% ver3 ver % 60.00% Figure.5 Accuracy of We can see that there has a big improvement between version 1 and version 2, the difference between these two versions is that we use case-insensitive in version 2. In version 2 we will much emphasize some important features such as Free, and we won t throw too many information away due to feature selection as we mentioned before. The improvement between version 2 and version 3 is also significant. By well selecting the parameter gamma and cost, the performance can really improve a lot as we can see. Fig.6 is the accuracy plot after we improve the accuracy % 98.00% 97.00% 96.00% 94.00% 93.00% Figure.6 Accuracy of ver1 ver2 ver3 We can see that there is a little improvement from version 1 to version 2, the difference of them is that we use TF-IDF vector instead of binary vector in version 2. And when we modified version 2 from case-sensitive to case-insensitive, the difference between them is not as significant as (no more than 0.4%). We also mentioned this before, it is because we didn t do feature selection in. Fig.7 is the accuracy plot of, and the best version of and with different size of training data % Figure.7 Comparison of 3 methods As we can see, after we improve, two of the middle size of training set is better than. And the other two is much more close to Naïve bayes compare with the first version. After we improve, the result is again better than. 5 CONCLUSION After experimenting three different methods, we found that has higher accuracy than other two approaches. Because we think is more suitable for classification of less catergories than, accuracy of is higher. We think is a good method for spam filter, and the time costs little on training (about 1-2 seconds). Testing an input message requires much time using, but the result is good enough. The training time of is very fast, but it takes lots of time on testing. We think it is because we didn implement any indexing algorithm such as KD-tree, R-Tree or Quad-Tree when finding the nearest top K neighbors. In our future work, we can implement these indexing methods to improve the efficiency of. The training time of compared with and is much longer, espically when we want to find out the best gamma and cost parameters for the training process. But the testing time of is much faster than the other two methods. In future work, we can focus on different feature selection methods to improve the performance of and, and the results of them might become better than. 6 JOB RESPONSIBILITY Yun-nung Chen (B ) (Training and Testing), Report Writing. Che-an Lu (B ) (Training and Testing), (Training and Testing), Report Writing. Chau-yu Huang (B ) Message Preprocessing, Report Writing. ACKNOWLEDGMENT The report uses lots of toolkits, including MIME-tools,

5 AI TERM PROJECT GROUP 14 5 SRILM, and. So we want to thanks about it. REFERENCES [1] SRI Language Model Toolkit [2] CJ Lin s Home Page

Machine Learning in Spam Filtering

Machine Learning in Spam Filtering Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Mining a Corpus of Job Ads

Mining a Corpus of Job Ads Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department

More information

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing

More information

Dr. D. Y. Patil College of Engineering, Ambi,. University of Pune, M.S, India University of Pune, M.S, India

Dr. D. Y. Patil College of Engineering, Ambi,. University of Pune, M.S, India University of Pune, M.S, India Volume 4, Issue 6, June 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Effective Email

More information

A Proposed Algorithm for Spam Filtering Emails by Hash Table Approach

A Proposed Algorithm for Spam Filtering Emails by Hash Table Approach International Research Journal of Applied and Basic Sciences 2013 Available online at www.irjabs.com ISSN 2251-838X / Vol, 4 (9): 2436-2441 Science Explorer Publications A Proposed Algorithm for Spam Filtering

More information

Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

More information

A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters

A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters Wei-Lun Teng, Wei-Chung Teng

More information

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

More information

Detecting E-mail Spam Using Spam Word Associations

Detecting E-mail Spam Using Spam Word Associations Detecting E-mail Spam Using Spam Word Associations N.S. Kumar 1, D.P. Rana 2, R.G.Mehta 3 Sardar Vallabhbhai National Institute of Technology, Surat, India 1 p10co977@coed.svnit.ac.in 2 dpr@coed.svnit.ac.in

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

More information

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School

More information

Content-Based Recommendation

Content-Based Recommendation Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches

More information

Analysis of Tweets for Prediction of Indian Stock Markets

Analysis of Tweets for Prediction of Indian Stock Markets Analysis of Tweets for Prediction of Indian Stock Markets Phillip Tichaona Sumbureru Department of Computer Science and Engineering, JNTU College of Engineering Hyderabad, Kukatpally, Hyderabad-500 085,

More information

Three-Way Decisions Solution to Filter Spam Email: An Empirical Study

Three-Way Decisions Solution to Filter Spam Email: An Empirical Study Three-Way Decisions Solution to Filter Spam Email: An Empirical Study Xiuyi Jia 1,4, Kan Zheng 2,WeiweiLi 3, Tingting Liu 2, and Lin Shang 4 1 School of Computer Science and Technology, Nanjing University

More information

Bayesian Spam Filtering

Bayesian Spam Filtering Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary amaobied@ucalgary.ca http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating

More information

1 Maximum likelihood estimation

1 Maximum likelihood estimation COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

E-commerce Transaction Anomaly Classification

E-commerce Transaction Anomaly Classification E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce

More information

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577 T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

More information

Part III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing

Part III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing CS 188: Artificial Intelligence Lecture 20: Dynamic Bayes Nets, Naïve Bayes Pieter Abbeel UC Berkeley Slides adapted from Dan Klein. Part III: Machine Learning Up until now: how to reason in a model and

More information

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance Shen Wang, Bin Wang and Hao Lang, Xueqi Cheng Institute of Computing Technology, Chinese Academy of

More information

A Collaborative Approach to Anti-Spam

A Collaborative Approach to Anti-Spam A Collaborative Approach to Anti-Spam Chia-Mei Chen National Sun Yat-Sen University TWCERT/CC, Taiwan Agenda Introduction Proposed Approach System Demonstration Experiments Conclusion 1 Problems of Spam

More information

Projektgruppe. Categorization of text documents via classification

Projektgruppe. Categorization of text documents via classification Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction

More information

On Attacking Statistical Spam Filters

On Attacking Statistical Spam Filters On Attacking Statistical Spam Filters Gregory L. Wittel and S. Felix Wu Department of Computer Science University of California, Davis One Shields Avenue, Davis, CA 95616 USA Paper review by Deepak Chinavle

More information

Spam Filtering using Naïve Bayesian Classification

Spam Filtering using Naïve Bayesian Classification Spam Filtering using Naïve Bayesian Classification Presented by: Samer Younes Outline What is spam anyway? Some statistics Why is Spam a Problem Major Techniques for Classifying Spam Transport Level Filtering

More information

CSE 473: Artificial Intelligence Autumn 2010

CSE 473: Artificial Intelligence Autumn 2010 CSE 473: Artificial Intelligence Autumn 2010 Machine Learning: Naive Bayes and Perceptron Luke Zettlemoyer Many slides over the course adapted from Dan Klein. 1 Outline Learning: Naive Bayes and Perceptron

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

MHI3000 Big Data Analytics for Health Care Final Project Report

MHI3000 Big Data Analytics for Health Care Final Project Report MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given

More information

Less naive Bayes spam detection

Less naive Bayes spam detection Less naive Bayes spam detection Hongming Yang Eindhoven University of Technology Dept. EE, Rm PT 3.27, P.O.Box 53, 5600MB Eindhoven The Netherlands. E-mail:h.m.yang@tue.nl also CoSiNe Connectivity Systems

More information

Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

More information

Naïve Bayesian Anti-spam Filtering Technique for Malay Language

Naïve Bayesian Anti-spam Filtering Technique for Malay Language Naïve Bayesian Anti-spam Filtering Technique for Malay Language Thamarai Subramaniam 1, Hamid A. Jalab 2, Alaa Y. Taqa 3 1,2 Computer System and Technology Department, Faulty of Computer Science and Information

More information

Sentiment analysis using emoticons

Sentiment analysis using emoticons Sentiment analysis using emoticons Royden Kayhan Lewis Moharreri Steven Royden Ware Lewis Kayhan Steven Moharreri Ware Department of Computer Science, Ohio State University Problem definition Our aim was

More information

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences

More information

A LVQ-based neural network anti-spam email approach

A LVQ-based neural network anti-spam email approach A LVQ-based neural network anti-spam email approach Zhan Chuan Lu Xianliang Hou Mengshu Zhou Xu College of Computer Science and Engineering of UEST of China, Chengdu, China 610054 zhanchuan@uestc.edu.cn

More information

Homework 2. Page 154: Exercise 8.10. Page 145: Exercise 8.3 Page 150: Exercise 8.9

Homework 2. Page 154: Exercise 8.10. Page 145: Exercise 8.3 Page 150: Exercise 8.9 Homework 2 Page 110: Exercise 6.10; Exercise 6.12 Page 116: Exercise 6.15; Exercise 6.17 Page 121: Exercise 6.19 Page 122: Exercise 6.20; Exercise 6.23; Exercise 6.24 Page 131: Exercise 7.3; Exercise 7.5;

More information

Lasso-based Spam Filtering with Chinese Emails

Lasso-based Spam Filtering with Chinese Emails Journal of Computational Information Systems 8: 8 (2012) 3315 3322 Available at http://www.jofcis.com Lasso-based Spam Filtering with Chinese Emails Zunxiong LIU 1, Xianlong ZHANG 1,, Shujuan ZHENG 2 1

More information

Predicting outcome of soccer matches using machine learning

Predicting outcome of soccer matches using machine learning Saint-Petersburg State University Mathematics and Mechanics Faculty Albina Yezus Predicting outcome of soccer matches using machine learning Term paper Scientific adviser: Alexander Igoshkin, Yandex Mobile

More information

The Enron Corpus: A New Dataset for Email Classification Research

The Enron Corpus: A New Dataset for Email Classification Research The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007 Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007 Naïve Bayes Components ML vs. MAP Benefits Feature Preparation Filtering Decay Extended Examples

More information

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a

More information

International Journal of Advanced Computer Technology (IJACT) ISSN:2319-7900 PRIVACY PRESERVING DATA MINING IN HEALTH CARE APPLICATIONS

International Journal of Advanced Computer Technology (IJACT) ISSN:2319-7900 PRIVACY PRESERVING DATA MINING IN HEALTH CARE APPLICATIONS PRIVACY PRESERVING DATA MINING IN HEALTH CARE APPLICATIONS First A. Dr. D. Aruna Kumari, Ph.d, ; Second B. Ch.Mounika, Student, Department Of ECM, K L University, chittiprolumounika@gmail.com; Third C.

More information

A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2

A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2 UDC 004.75 A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2 I. Mashechkin, M. Petrovskiy, A. Rozinkin, S. Gerasimov Computer Science Department, Lomonosov Moscow State University,

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

Machine Learning. CS 188: Artificial Intelligence Naïve Bayes. Example: Digit Recognition. Other Classification Tasks

Machine Learning. CS 188: Artificial Intelligence Naïve Bayes. Example: Digit Recognition. Other Classification Tasks CS 188: Artificial Intelligence Naïve Bayes Machine Learning Up until now: how use a model to make optimal decisions Machine learning: how to acquire a model from data / experience Learning parameters

More information

Unmasking Spam in Email Messages

Unmasking Spam in Email Messages Unmasking Spam in Email Messages Anjali Sharma 1, Manisha 2, Dr. Manisha 3, Dr. Rekha Jain 4 Abstract: Today e-mails have become one of the most popular and economical forms of communication for Internet

More information

Feature Subset Selection in E-mail Spam Detection

Feature Subset Selection in E-mail Spam Detection Feature Subset Selection in E-mail Spam Detection Amir Rajabi Behjat, Universiti Technology MARA, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012 Feature

More information

Data Mining Techniques for Prognosis in Pancreatic Cancer

Data Mining Techniques for Prognosis in Pancreatic Cancer Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree

More information

Applying Machine Learning to Stock Market Trading Bryce Taylor

Applying Machine Learning to Stock Market Trading Bryce Taylor Applying Machine Learning to Stock Market Trading Bryce Taylor Abstract: In an effort to emulate human investors who read publicly available materials in order to make decisions about their investments,

More information

Online Active Learning Methods for Fast Label-Efficient Spam Filtering

Online Active Learning Methods for Fast Label-Efficient Spam Filtering Online Active Learning Methods for Fast Label-Efficient Spam Filtering D. Sculley Department of Computer Science Tufts University, Medford, MA USA dsculley@cs.tufts.edu ABSTRACT Active learning methods

More information

Data Pre-Processing in Spam Detection

Data Pre-Processing in Spam Detection IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 11 May 2015 ISSN (online): 2349-784X Data Pre-Processing in Spam Detection Anjali Sharma Dr. Manisha Manisha Dr. Rekha Jain

More information

Simple Language Models for Spam Detection

Simple Language Models for Spam Detection Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to

More information

An Approach to Detect Spam Emails by Using Majority Voting

An Approach to Detect Spam Emails by Using Majority Voting An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H-12 Islamabad, Pakistan Usman Qamar Faculty,

More information

Keywords data mining, prediction techniques, decision making.

Keywords data mining, prediction techniques, decision making. Volume 5, Issue 4, April 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analysis of Datamining

More information

Improving spam mail filtering using classification algorithms with discretization Filter

Improving spam mail filtering using classification algorithms with discretization Filter International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

lop Building Machine Learning Systems with Python en source

lop Building Machine Learning Systems with Python en source Building Machine Learning Systems with Python Master the art of machine learning with Python and build effective machine learning systems with this intensive handson guide Willi Richert Luis Pedro Coelho

More information

Naive Bayes Spam Filtering Using Word-Position-Based Attributes

Naive Bayes Spam Filtering Using Word-Position-Based Attributes Naive Bayes Spam Filtering Using Word-Position-Based Attributes Johan Hovold Department of Computer Science Lund University Box 118, 221 00 Lund, Sweden johan.hovold.363@student.lu.se Abstract This paper

More information

MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

Domain Classification of Technical Terms Using the Web

Domain Classification of Technical Terms Using the Web Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

More information

Supervised Learning Evaluation (via Sentiment Analysis)!

Supervised Learning Evaluation (via Sentiment Analysis)! Supervised Learning Evaluation (via Sentiment Analysis)! Why Analyze Sentiment? Sentiment Analysis (Opinion Mining) Automatically label documents with their sentiment Toward a topic Aggregated over documents

More information

A Logistic Regression Approach to Ad Click Prediction

A Logistic Regression Approach to Ad Click Prediction A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi kondakin@usc.edu Satakshi Rana satakshr@usc.edu Aswin Rajkumar aswinraj@usc.edu Sai Kaushik Ponnekanti ponnekan@usc.edu Vinit Parakh

More information

Towards better accuracy for Spam predictions

Towards better accuracy for Spam predictions Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 czhao@cs.toronto.edu Abstract Spam identification is crucial

More information

Towards Effective Recommendation of Social Data across Social Networking Sites

Towards Effective Recommendation of Social Data across Social Networking Sites Towards Effective Recommendation of Social Data across Social Networking Sites Yuan Wang 1,JieZhang 2, and Julita Vassileva 1 1 Department of Computer Science, University of Saskatchewan, Canada {yuw193,jiv}@cs.usask.ca

More information

A Hybrid ACO Based Feature Selection Method for Email Spam Classification

A Hybrid ACO Based Feature Selection Method for Email Spam Classification A Hybrid ACO Based Feature Selection Method for Email Spam Classification KARTHIKA RENUKA D 1, VISALAKSHI P 2 1 Department of Information Technology 2 Department of Electronics and Communication Engineering

More information

WE DEFINE spam as an e-mail message that is unwanted basically

WE DEFINE spam as an e-mail message that is unwanted basically 1048 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999 Support Vector Machines for Spam Categorization Harris Drucker, Senior Member, IEEE, Donghui Wu, Student Member, IEEE, and Vladimir

More information

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,

More information

Combining Global and Personal Anti-Spam Filtering

Combining Global and Personal Anti-Spam Filtering Combining Global and Personal Anti-Spam Filtering Richard Segal IBM Research Hawthorne, NY 10532 Abstract Many of the first successful applications of statistical learning to anti-spam filtering were personalized

More information

Predicting the Stock Market with News Articles

Predicting the Stock Market with News Articles Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is

More information

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

More information

Web Document Clustering

Web Document Clustering Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,

More information

Email Spam Detection Using Customized SimHash Function

Email Spam Detection Using Customized SimHash Function International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email

More information

Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015

Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015 Clustering Big Data Efficient Data Mining Technologies J Singh and Teresa Brooks June 4, 2015 Hello Bulgaria (http://hello.bg/) A website with thousands of pages... Some pages identical to other pages

More information

SVM Ensemble Model for Investment Prediction

SVM Ensemble Model for Investment Prediction 19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of

More information

Sentiment Analysis on Twitter with Stock Price and Significant Keyword Correlation. Abstract

Sentiment Analysis on Twitter with Stock Price and Significant Keyword Correlation. Abstract Sentiment Analysis on Twitter with Stock Price and Significant Keyword Correlation Linhao Zhang Department of Computer Science, The University of Texas at Austin (Dated: April 16, 2013) Abstract Though

More information

Predicting Student Performance by Using Data Mining Methods for Classification

Predicting Student Performance by Using Data Mining Methods for Classification BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, No 1 Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0006 Predicting Student Performance

More information

agoweder@yahoo.com ** The High Institute of Zahra for Comperhensive Professions, Zahra-Libya

agoweder@yahoo.com ** The High Institute of Zahra for Comperhensive Professions, Zahra-Libya AN ANTI-SPAM SYSTEM USING ARTIFICIAL NEURAL NETWORKS AND GENETIC ALGORITHMS ABDUELBASET M. GOWEDER *, TARIK RASHED **, ALI S. ELBEKAIE ***, and HUSIEN A. ALHAMMI **** * The High Institute of Surman for

More information

Machine learning for algo trading

Machine learning for algo trading Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with

More information

and Hung-Wen Chang 1 Department of Human Resource Development, Hsiuping University of Science and Technology, Taichung City 412, Taiwan 3

and Hung-Wen Chang 1 Department of Human Resource Development, Hsiuping University of Science and Technology, Taichung City 412, Taiwan 3 A study using Genetic Algorithm and Support Vector Machine to find out how the attitude of training personnel affects the performance of the introduction of Taiwan TrainQuali System in an enterprise Tung-Shou

More information

Machine Learning for Naive Bayesian Spam Filter Tokenization

Machine Learning for Naive Bayesian Spam Filter Tokenization Machine Learning for Naive Bayesian Spam Filter Tokenization Michael Bevilacqua-Linn December 20, 2003 Abstract Background Traditional client level spam filters rely on rule based heuristics. While these

More information

Mammoth Scale Machine Learning!

Mammoth Scale Machine Learning! Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes

More information

Active Learning with Boosting for Spam Detection

Active Learning with Boosting for Spam Detection Active Learning with Boosting for Spam Detection Nikhila Arkalgud Last update: March 22, 2008 Active Learning with Boosting for Spam Detection Last update: March 22, 2008 1 / 38 Outline 1 Spam Filters

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Scalable Developments for Big Data Analytics in Remote Sensing

Scalable Developments for Big Data Analytics in Remote Sensing Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,

More information

Cross-Validation. Synonyms Rotation estimation

Cross-Validation. Synonyms Rotation estimation Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical

More information

SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING

SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING I J I T E ISSN: 2229-7367 3(1-2), 2012, pp. 233-237 SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING K. SARULADHA 1 AND L. SASIREKA 2 1 Assistant Professor, Department of Computer Science and

More information

Employer Health Insurance Premium Prediction Elliott Lui

Employer Health Insurance Premium Prediction Elliott Lui Employer Health Insurance Premium Prediction Elliott Lui 1 Introduction The US spends 15.2% of its GDP on health care, more than any other country, and the cost of health insurance is rising faster than

More information

Email Classification Using Data Reduction Method

Email Classification Using Data Reduction Method Email Classification Using Data Reduction Method Rafiqul Islam and Yang Xiang, member IEEE School of Information Technology Deakin University, Burwood 3125, Victoria, Australia Abstract Classifying user

More information

Spam Filtering based on Naive Bayes Classification. Tianhao Sun

Spam Filtering based on Naive Bayes Classification. Tianhao Sun Spam Filtering based on Naive Bayes Classification Tianhao Sun May 1, 2009 Abstract This project discusses about the popular statistical spam filtering process: naive Bayes classification. A fairly famous

More information

Automated Content Analysis of Discussion Transcripts

Automated Content Analysis of Discussion Transcripts Automated Content Analysis of Discussion Transcripts Vitomir Kovanović v.kovanovic@ed.ac.uk Dragan Gašević dgasevic@acm.org School of Informatics, University of Edinburgh Edinburgh, United Kingdom v.kovanovic@ed.ac.uk

More information

SVM-Based Spam Filter with Active and Online Learning

SVM-Based Spam Filter with Active and Online Learning SVM-Based Spam Filter with Active and Online Learning Qiang Wang Yi Guan Xiaolong Wang School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China Email:{qwang, guanyi,

More information

Sentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies

Sentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies Sentiment analysis of Twitter microblogging posts Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies Introduction Popularity of microblogging services Twitter microblogging posts

More information

Learning to classify e-mail

Learning to classify e-mail Information Sciences 177 (2007) 2167 2187 www.elsevier.com/locate/ins Learning to classify e-mail Irena Koprinska *, Josiah Poon, James Clark, Jason Chan School of Information Technologies, The University

More information

Classification Problems

Classification Problems Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

More information

Crowdfunding Support Tools: Predicting Success & Failure

Crowdfunding Support Tools: Predicting Success & Failure Crowdfunding Support Tools: Predicting Success & Failure Michael D. Greenberg Bryan Pardo mdgreenb@u.northwestern.edu pardo@northwestern.edu Karthic Hariharan karthichariharan2012@u.northwes tern.edu Elizabeth

More information