A Bayesian Topic Model for Spam Filtering

Journal of Information & Computational Science 10:12 (2013) August 1, 2013 Available at

Zhiying Zhang, Xu Yu, Lixiang Shi, Li Peng, Zhixing Huang
School of Computer and Information Science, Southwest University, Chongqing 400715, China

Abstract

Spam is one of the major problems of today's Internet because it causes financial damage to companies and annoys individual users. Among the approaches developed to detect spam, content-based machine learning algorithms are important and popular. However, these algorithms are trained on statistical representations of the terms that usually appear in e-mails, and they are unable to account for the underlying semantics of terms within the messages. In this paper, we present a Bayesian topic model to address these limitations. We explore the use of semantics in spam filtering by representing e-mails as vectors of topics with a topic model, Latent Dirichlet Allocation (LDA). Based upon this representation, the relationship between the topics and spam can be discovered by using a Bayesian method. We test this model on the Enron-Spam datasets, and the results show that the proposed model performs better than the baseline and can detect the internal semantics of spam messages.

Keywords: Spam Detection; Latent Dirichlet Allocation; Bayesian Topic Model

1 Introduction

Electronic mail (e-mail) is one of the most important and powerful means of modern communication. However, in the past decade e-mail users have been plagued by spam, which is also known as junk e-mail or Unsolicited Bulk E-mail (UBE). Spam causes many problems: it makes users waste time looking through and sorting out additional e-mails [1], it causes financial loss to companies by misusing traffic, storage space and computational power [1], and it brings security and legal problems by spreading malicious software and advertising pornography, pyramid schemes, etc. [2].
Many techniques have been proposed to deal with spam. Content-based machine learning algorithms are important and popular, including algorithms that are considered top performers in text classification, such as Boosting [3], Support Vector Machines [4, 5, 6], and Bayesian methods [7, 8].

Project supported by the Natural Science Foundation Project of CQ CSTC (No. CSTC212JJB412), the Scientific Research Foundation for the Returned Overseas Chinese Scholars (No. 2911) and the Fundamental Research Funds for the Central Universities (No. SWU139265).
Corresponding author. E-mail address: huangzx@swu.edu.cn (Zhixing Huang).
Copyright 2013 Binary Information Press. DOI: /jics212279

Although e-mails are usually represented as a sequence of words, there are relationships between words on a semantic level that also affect e-mails [9]. However, content-based machine learning algorithms are trained on statistical representations of the terms that usually appear in e-mails and are unable to account for the underlying semantics of the e-mails. To address these limitations, Santos et al. [9] proposed representing e-mails with the enhanced Topic-based Vector Space Model (eTVSM) and achieved satisfactory results on the Ling-Spam dataset. However, eTVSM is an ontology-based method, which may limit its effectiveness when it encounters more complicated unseen messages. Furthermore, Ling-Spam has the disadvantage that its ham messages are more topic-specific, which could lead to over-optimistic estimates of the performance of learning-based spam filters.

In contrast, we present a Bayesian topic model that uses the topic model Latent Dirichlet Allocation (LDA) [10] to mine the semantics of e-mails. LDA is a generative probabilistic model of a corpus and is not limited by the weaknesses of an ontology. LDA models every document as a distribution over topics, and every topic as a distribution over words. These topics can reflect the semantics of a document better than terms can. The basic idea of our approach is to use a previously estimated LDA model to make inference on new unseen e-mails and obtain the topic distribution of each e-mail. Hence, each e-mail can be treated as a vector of topics rather than terms. As the topics have a deeper relationship with the content of an e-mail, we can then use a Bayesian method to discover the relationship between the topics and spam. More detailed descriptions are given in Section 3. Our model is similar to the method proposed by Bíró et al. [11] in that we also use LDA; however, the model we present is otherwise completely different from their method.
The remainder of this paper is organized as follows. Section 2 introduces the basic theory. Section 3 describes the proposed methodology. Section 4 details the experiments and presents the results. Finally, Section 5 concludes and outlines avenues for future work.

2 Basic Theory

The basic theory comprises the LDA topic model, which is used to obtain the topic distributions of e-mails, and a Bayesian method, which can discover the relationship between words and spam. In our approach, a modification of this Bayesian method is used to discover the relationship between topics and spam.

2.1 Latent Dirichlet Allocation

Suppose there are $D$ documents of arbitrary length. A document $d$ is a vector of $N_d$ words, $W_d$, where each word $w_{id}$ is chosen from a vocabulary of size $V$. The generation of a document collection in LDA is then modeled as a three-step process. First, for each document, a distribution over topics is sampled from a Dirichlet distribution. Second, for each word in the document, a single topic is chosen according to this distribution. Finally, each word is sampled from a multinomial distribution over words specific to the sampled topic. This generative process corresponds to the hierarchical Bayesian model shown (using plate notation) in Fig. 1.
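The three-step generative process can be made concrete with a short simulation. The following is a minimal sketch rather than the estimation code used in the paper: the function names, the fixed document length (the text allows documents of arbitrary length), and the use of normalised Gamma draws to sample from a symmetric Dirichlet are all assumptions made for illustration.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw one sample from a symmetric Dirichlet by normalising Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_length, num_topics, vocab_size,
                    alpha, beta, seed=0):
    """Generate documents of word ids following the three-step LDA process."""
    rng = random.Random(seed)
    # One distribution over the vocabulary per topic, drawn from Dirichlet(beta).
    topic_word = [sample_dirichlet(rng, beta, vocab_size)
                  for _ in range(num_topics)]
    corpus, doc_topics = [], []
    for _ in range(num_docs):
        # Step 1: sample this document's distribution over topics.
        mixture = sample_dirichlet(rng, alpha, num_topics)
        words = []
        for _ in range(doc_length):  # assumption: every document has the same length
            # Step 2: choose a single topic for this word position.
            z = rng.choices(range(num_topics), weights=mixture)[0]
            # Step 3: sample the word from the chosen topic's word distribution.
            w = rng.choices(range(vocab_size), weights=topic_word[z])[0]
            words.append(w)
        corpus.append(words)
        doc_topics.append(mixture)
    return corpus, doc_topics, topic_word
```

Estimation runs this process in reverse: given only the word ids in `corpus`, it recovers plausible values for `doc_topics` and `topic_word`.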

Fig. 1: The hierarchical Bayesian model for LDA

In this model, $\phi$ denotes the matrix of topic distributions, with a multinomial distribution over $V$ vocabulary items for each of $T$ topics being drawn independently from a symmetric Dirichlet($\beta$) prior. $\theta$ is the matrix of document-specific mixture weights for these $T$ topics, each being drawn independently from a symmetric Dirichlet($\alpha$) prior. For each word, $z$ denotes the topic responsible for generating that word, drawn from the $\theta$ distribution for that document, and $w$ is the word itself, drawn from the topic distribution $\phi$ corresponding to $z$. Estimating $\phi$ and $\theta$ provides information about the topics that participate in a corpus and the weights of those topics in each document, respectively. A variety of algorithms have been used to estimate these parameters, from basic expectation-maximization [12] to approximate inference methods such as variational EM [10], expectation propagation [13], and Gibbs sampling [14].

2.2 The Bayesian Method

The Bayesian method proposed by Paul Graham [7] is very different from any form of Naive Bayes classifier [15, 16, 17, 18] and can greatly reduce the false positive rate. In this paper, this method is referred to as the PG Bayesian classifier. The PG Bayesian classifier can discover the relationship between words and spam. Each word in the e-mail, or only the most interesting words, contributes to the e-mail's spam probability. The contribution of one word, which can also be called the spamicity of the word, is calculated using Bayes' theorem:

$$p(s \mid w) = \frac{p(w \mid s)\,p(s)}{p(w \mid s)\,p(s) + p(w \mid h)\,p(h)} \tag{1}$$

In Eq. (1), $p(s)$ is the overall probability that any given e-mail is spam, and $p(h)$ is the overall probability that any given e-mail is ham. $p(w \mid s)$ is the probability that the given word appears in spam training e-mails, which can be estimated by dividing the number of spam training e-mails that contain this word by the total number of spam training e-mails.
$p(w \mid h)$ is the probability that the given word appears in ham training e-mails, which can be estimated by dividing the number of ham training e-mails that contain this word by the total number of ham training e-mails. The PG Bayesian classifier assumes that there is no a priori reason for any incoming message to be spam rather than ham, and therefore takes $p(s) = p(h) = 0.5$. This assumption simplifies Eq. (1) to:

$$p(s \mid w) = \frac{p(w \mid s)}{p(w \mid s) + p(w \mid h)} \tag{2}$$
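As an illustration of Eqs. (1) and (2), the word spamicity can be computed directly from document frequencies. This is a minimal sketch under stated assumptions: each training e-mail is represented as a set of tokens, the function name is hypothetical, and returning 0.5 for a word that appears in neither class is a choice made here, not something the text specifies.

```python
def word_spamicity(word, spam_docs, ham_docs, p_spam=0.5):
    """Return p(s|w) as in Eq. (1); with p_spam = 0.5 this reduces to Eq. (2)."""
    # p(w|s): fraction of spam training e-mails that contain the word.
    p_w_s = sum(word in doc for doc in spam_docs) / len(spam_docs)
    # p(w|h): fraction of ham training e-mails that contain the word.
    p_w_h = sum(word in doc for doc in ham_docs) / len(ham_docs)
    numerator = p_w_s * p_spam
    denominator = numerator + p_w_h * (1.0 - p_spam)
    if denominator == 0.0:
        # Assumption: a word never seen in training is treated as neutral.
        return 0.5
    return numerator / denominator
```

For example, a word found in every spam training e-mail and in half of the ham training e-mails gets a spamicity of 1 / (1 + 0.5), about 0.67.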

The PG Bayesian classifier also assumes that the words present in an e-mail are independent events. With that assumption, one can derive another equation from Bayes' theorem to calculate the probability that the e-mail is spam by taking into consideration $N$ words of the e-mail:

$$p = \frac{p_1 p_2 \cdots p_N}{p_1 p_2 \cdots p_N + (1 - p_1)(1 - p_2) \cdots (1 - p_N)} \tag{3}$$

In Eq. (3), $p$ indicates how sure the filter is that the e-mail is spam, and $p_n$ ($n = 1, \ldots, N$) is the probability $p(s \mid w_n)$. The result $p$ is usually compared to a given threshold to decide whether the message is spam or not. If $p$ is lower than the threshold, the message is considered likely ham; otherwise it is considered likely spam.

3 Proposed Methodology

Assume that there is a training set consisting of $S$ spam e-mails and $H$ ham e-mails, and a test set consisting of $N$ new unseen e-mails. First, a Unified LDA model with $T$ topics is estimated on the overall training set using LDA. Second, this Unified LDA model is used to make inference separately for the spam training set, the ham training set and the new unseen e-mails. This yields three LDA models: the Spam LDA model with the topic distribution $\theta^{(s)}$ of the spam e-mails, the Ham LDA model with the topic distribution $\theta^{(h)}$ of the ham e-mails, and the New-e-mails LDA model with the topic distribution $\theta^{(n)}$ of the new e-mails. All three models have $T$ topics, which are consistent with the topics of the Unified LDA model. Third, each e-mail $e$ can be represented as a vector $e = \langle z_1, \ldots, z_T \rangle$. Each $z_i$ ($i = 1, \ldots, T$) has a value that is the probability that topic $z_i$ occurs in this e-mail, i.e. $p(z_i \mid e)$. This value can be read directly from the corresponding matrix $\theta$. It is natural to expect that some of the $T$ topics are more relevant to spam e-mails and some are more relevant to ham e-mails.
In other words, the topics that are more relevant to spam e-mails will have a higher probability in each spam training e-mail, and the topics that are more relevant to ham e-mails will have a higher probability in each ham training e-mail. That means each of the $T$ topics has a spamicity, just as words do. Following Eq. (2), the spamicity of a topic $z_i$ ($i = 1, \ldots, T$) is:

$$p(s \mid z_i) = \frac{p(z_i \mid s)}{p(z_i \mid s) + p(z_i \mid h)} \tag{4}$$

The challenge is how to calculate $p(z_i \mid s)$ and $p(z_i \mid h)$. The spam training set has $S$ spam e-mails in total. Given each spam training e-mail $e_j$ ($j = 1, \ldots, S$), the probability of each topic $z_i$ can be read from the matrix $\theta^{(s)}$, i.e. $p(z_i \mid e_j, s) = \theta^{(s)}_{j,i}$. We define the probability of each e-mail $e_j$ as $p(e_j \mid s) = \frac{1}{S}$. Hence we can compute $p(z_i \mid s)$ by using the law of total probability:

$$p(z_i \mid s) = \sum_{j=1}^{S} p(z_i \mid e_j, s)\,p(e_j \mid s) = \frac{1}{S} \sum_{j=1}^{S} \theta^{(s)}_{j,i} \tag{5}$$
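Eq. (5) is a column average of the document-topic matrix, and Eq. (4) combines two such averages. The sketch below illustrates this, assuming that each matrix is given as a list of rows (one row of $T$ topic probabilities per e-mail) and that the ham-side quantity $p(z_i \mid h)$ is obtained by the same averaging over $\theta^{(h)}$; the function names are hypothetical.

```python
def topic_likelihood(theta):
    """Eq. (5): average each topic's probability over all e-mails (rows) of theta."""
    num_docs = len(theta)
    num_topics = len(theta[0])
    return [sum(row[i] for row in theta) / num_docs for i in range(num_topics)]


def topic_spamicity(theta_spam, theta_ham):
    """Eq. (4): p(s|z_i) computed from the class-wise topic likelihoods."""
    p_z_given_spam = topic_likelihood(theta_spam)
    p_z_given_ham = topic_likelihood(theta_ham)
    # Smoothed LDA estimates are strictly positive, so the denominator is non-zero.
    return [ps / (ps + ph) for ps, ph in zip(p_z_given_spam, p_z_given_ham)]
```

With two topics, spam rows (0.8, 0.2) and (0.6, 0.4) and ham rows (0.1, 0.9) and (0.3, 0.7), the first topic averages 0.7 in spam and 0.2 in ham, so its spamicity is 0.7 / 0.9, about 0.78.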

The probability $p(z_i \mid h)$ can be calculated in the same way, and Eq. (4) then becomes:

$$p(s \mid z_i) = \frac{\frac{1}{S} \sum_{j=1}^{S} \theta^{(s)}_{j,i}}{\frac{1}{S} \sum_{j=1}^{S} \theta^{(s)}_{j,i} + \frac{1}{H} \sum_{j=1}^{H} \theta^{(h)}_{j,i}} \tag{6}$$

Each e-mail $e_j$ ($j = 1, \ldots, N$) of the $N$ new unseen e-mails is likewise represented as $e_j = \langle z_1, \ldots, z_T \rangle$, where the value of each $z_i$ is $p(z_i \mid e_j) = \theta^{(n)}_{j,i}$. We then need to select the top $k$ most representative topics of $e_j$ to calculate the probability that $e_j$ is spam. This is achieved by applying the following algorithm to each e-mail $e_j$:

(1) For each topic $z_i$ of the $T$ topics, if $0.45 < p(s \mid z_i) < 0.55$, add topic $z_i$ to the CandidateTopicSet (CTS).
(2) Sort the topics of $e_j$ in descending order of $p(z_i \mid e_j)$ and save the result as the TopicOrderList (TOL).
(3) Going through TOL in order, add each topic that is also in CTS to the AvailableTopicList (ATL).
(4) Select the top $k$ topics from ATL.

The probability that e-mail $e_j$ is spam can then be computed by taking into consideration all of the top $k$ topics. Following the form of Eq. (3), the final equation is:

$$p(s \mid e_j) = \frac{\prod_{i=1}^{k} p(s \mid z_i)}{\prod_{i=1}^{k} p(s \mid z_i) + \prod_{i=1}^{k} \bigl(1 - p(s \mid z_i)\bigr)} \tag{7}$$

4 Experiment and Evaluation

4.1 Datasets and Experimental Setup

We use six datasets collectively called the Enron-Spam datasets, which were developed by Metsis et al. [8] and, like Ling-Spam and SpamAssassin, are publicly available and non-encoded. Each of the six Enron-Spam datasets consists of a ham set and a spam set, and each message is in a separate text file. The ham collections of these six datasets were obtained from six Enron users, and each was paired with a spam collection. Hereafter, we refer to the six Enron-Spam datasets as Enron 1, Enron 2, ..., Enron 6. Phan's GibbsLDA++ [19] is used for estimation and inference on the datasets. The Dirichlet parameter β is chosen to be constant .1 throughout, while α = 5/T throughout.
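As a reference for how a test e-mail is scored in these experiments, steps (2) to (4) of the selection algorithm and Eq. (7) of Section 3 can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the candidate topic set of step (1) is taken as an input, the function and parameter names are hypothetical, and the empty-selection case (which yields 0.5) is a consequence of the empty product rather than something the text defines.

```python
from math import prod


def spam_probability(theta_row, spamicity, candidate_topics, k):
    """Score one e-mail with Eq. (7) from its top-k candidate topics."""
    # Step (2): order topic indices by descending p(z_i | e_j).
    ordered = sorted(range(len(theta_row)),
                     key=lambda i: theta_row[i], reverse=True)
    # Steps (3) and (4): keep candidate topics in that order, then take the first k.
    selected = [i for i in ordered if i in candidate_topics][:k]
    spam_side = prod(spamicity[i] for i in selected)
    ham_side = prod(1.0 - spamicity[i] for i in selected)
    return spam_side / (spam_side + ham_side)


def classify(theta_row, spamicity, candidate_topics, k, threshold=0.5):
    """Label the e-mail as spam unless its score is below the threshold."""
    score = spam_probability(theta_row, spamicity, candidate_topics, k)
    return "spam" if score >= threshold else "ham"
```

Here `theta_row` is one row of $\theta^{(n)}$ and `spamicity` holds $p(s \mid z_i)$ for all $T$ topics, so changing `k` only changes how many of the e-mail's most probable candidate topics enter the two products.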
T is the number of topics of the LDA model; we experiment with T = {1, 2, 5, 1, 2}. Gibbs sampling is stopped after 2 steps for estimation on the Unified training set, and after 1 steps for inference on the Ham training set, the Spam training set, and the Test set. In the testing phase, the top $k$ most representative topics are selected for each e-mail to calculate the probability that the test e-mail is spam. We experiment with $k = 1, \ldots, \mathrm{Length}(ATL) - 1$. Each learning model of each dataset is denoted as $M_{T,k}$. The threshold is 0.5. If the probability of the e-mail is lower than the threshold, it is considered likely ham; otherwise it is considered likely spam. 1-fold cross validation is applied in our experiments.

During the above experiments, the curves of the topic probability distributions in the different models of each dataset are learned. A group of curves with T = 2 is shown in Fig. 2. These curves clearly reveal that the probability with which the same topic occurs differs between categories, which directly supports the feasibility of our approach.

Fig. 2: The curves of the topic probabilities in the different models for each dataset, with T = 2. Panels (a) to (f): curves on Enron 1 to Enron 6.

4.2 Evaluation and Comparison

We first evaluate each model $M_{T,k}$. Because the datasets come from different persons and each has a different ham-spam ratio, the best performing learning model may also differ between datasets. To evaluate each model $M_{T,k}$ of each dataset, we present the evaluation results as curves. Because the $k$ of every best model $M_{T,k}$ is less than 7, the F-measure curves are drawn over $k = 1, \ldots, 7$ to ease comparison. By observing the F-measure curves shown in Fig. 3, the best performing model can be selected for each dataset. Table 1 shows these selected models and the corresponding F-measure values. Our method achieves its best result on Enron 4, which supports the view of Metsis et al. [8] that some datasets (e.g., Enron 4) are easier than others (e.g., Enron 1). We use just 1 topic for detection on Enron 4, whereas we use 5 topics for detection on Enron 1. We also find that the models with T = 1 and T = 2 never reach the best performance, which shows that a T that is too small or too large is not appropriate.

Fig. 3: The curves of each model for each dataset. Panels (a) to (f): curves of Enron 1 to Enron 6.

Table 1: Best model for each dataset

Dataset    Best model    F-measure (%)
Enron 1    M 1,
Enron 2    M 2,
Enron 3    M 5,
Enron 4    M 5,
Enron 5    M 5,
Enron 6    M 1,

The prediction results of the best models are taken as the best results of our method. To evaluate the filtering capability of our method, we compare it with two term-based spam filtering methods. One is the PG Bayesian classifier, which is used as a baseline, and the other is Multinomial Naive Bayes with Boolean attributes (MN Bool), which was shown to be the best Naive Bayes classifier in [8]. The best spam and ham recall of the PG Bayesian method are selected as the baseline. Metsis et al. also evaluated MN Bool on the Enron-Spam datasets; therefore, the experimental results in [8] are used directly as another reference. The threshold in all three methods is 0.5. Tables 2 and 3 list the spam and ham recall, respectively, of the three methods on the six datasets.

The tables show that both the PG Bayesian classifier and our model are better than the MN Bool method, which used 3 attributes, and that our model is the best of the three. Although the PG Bayesian classifier also achieves a better result than the MN Bool method, it does not explore semantic relationships. Our model, in contrast, not only explores the semantic relationships but also obtains a relatively better prediction result using a maximum of just 5 topics. These results have demonstrated the superiority of our model.
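The quantities compared here can be computed from true and predicted labels as in the sketch below. The text does not spell out the formulas, so two assumptions are made: spam (ham) recall is the fraction of spam (ham) e-mails that the filter labels correctly, and the F-measure is the harmonic mean of precision and recall for the spam class. The function names and the "spam"/"ham" label strings are illustrative.

```python
def class_recall(true_labels, predicted_labels, target):
    """Fraction of e-mails of class `target` that were predicted as `target`."""
    predictions = [p for t, p in zip(true_labels, predicted_labels) if t == target]
    return sum(p == target for p in predictions) / len(predictions)


def spam_f_measure(true_labels, predicted_labels):
    """Harmonic mean of spam precision and spam recall (assumed definition)."""
    pairs = list(zip(true_labels, predicted_labels))
    true_positives = sum(t == "spam" and p == "spam" for t, p in pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / sum(p == "spam" for _, p in pairs)
    recall = true_positives / sum(t == "spam" for t, _ in pairs)
    return 2 * precision * recall / (precision + recall)
```

Reporting spam recall and ham recall side by side, as Tables 2 and 3 do, separates missed spam from legitimate mail that was wrongly filtered.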

Table 2: Spam recall (%) comparisons

Method         Enron 1    Enron 2    Enron 3    Enron 4    Enron 5    Enron 6    Avg
MN Bool
PG Bayesian
Our Model

Table 3: Ham recall (%) comparisons

Method         Enron 1    Enron 2    Enron 3    Enron 4    Enron 5    Enron 6    Avg
MN Bool
PG Bayesian
Our Model

5 Conclusion and Future Work

In this paper, a Bayesian topic model is proposed for spam filtering. By using LDA, each e-mail is represented as a vector of topics, and based upon this representation a Bayesian method is used to discover the relationship between the topics and spam. By testing our method on the Enron-Spam datasets, we conclude that our model is better than the baseline and that it can detect the internal semantics of spam messages. In future work, we will test the Bayesian topic model in other application fields, such as document classification.

References

[1] Mikko T. Siponen, Carl Stucke, Effective anti-spam strategies in companies: An international study, In 39th Hawaii International Conference on Systems Science, Kauai, HI, USA, 2006
[2] Evangelos Moustakas, C. Ranganathan, Penny Duquenoy, Combating spam through legislation: A comparative analysis of US and European approaches, In CEAS 2005 - Second Conference on Email and Anti-Spam, July 21-22, 2005, Stanford University, California, USA, 2005
[3] Xavier Carreras, Lluís Màrquez, Boosting trees for anti-spam email filtering, In Proceedings of the 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, BG, 2001
[4] Harris Drucker, Donghui Wu, Vladimir Vapnik, Support vector machines for spam categorization, IEEE Transactions on Neural Networks, 10(5), 1999
[5] A. Kolcz, J. Alspector, SVM-based filtering of e-mail spam with content-specific misclassification costs, In Proceedings of the ICDM Workshop on Text Mining, 2001
[6] Yuewu Shen, Guanglu Sun, Haoliang Qi, Xiaoning He, Using feature selection to speed up online SVM based spam filtering, In International Conference on Asian Language Processing, IALP 2010, Harbin, Heilongjiang, China, 2010
[7] Paul Graham, A plan for spam, Available at: August 2003
[8] Vangelis Metsis, Ion Androutsopoulos, Georgios Paliouras, Spam filtering with naive Bayes - which naive Bayes? In CEAS 2006 - The Third Conference on Email and Anti-Spam, Mountain View, California, USA, July 27-28, 2006
[9] Igor Santos, Carlos Laorden, Borja Sanz, Pablo Garcia Bringas, Enhanced topic-based vector space model for semantics-aware spam filtering, Expert Syst. Appl., 39(1), 2012
[10] David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res., 3, March 2003
[11] István Bíró, Jácint Szabó, András A. Benczúr, Latent Dirichlet allocation in web spam filtering, In AIRWeb '08: Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, Beijing, China, 2008
[12] Thomas Hofmann, Probabilistic latent semantic indexing, In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval, 1999, 50-57
[13] Thomas Minka, John Lafferty, Expectation-propagation for the generative aspect model, In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, 2002
[14] T. L. Griffiths, M. Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences, 101 (Suppl. 1), April 2004
[15] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz, A Bayesian approach to filtering junk e-mail, In AAAI-98 Workshop on Learning for Text Categorization, 1998
[16] P. Pantel, D. Lin, SpamCop: A spam classification and organization program, In Proceedings of the AAAI Workshop on Learning for Text Categorization, 1998
[17] George H. John, Pat Langley, Estimating continuous distributions in Bayesian classifiers, In UAI, Morgan Kaufmann, 1995
[18] Karl-Michael Schneider, On word frequency information and negative evidence in naive Bayes text classification, In EsTAL, Lecture Notes in Computer Science, Vol. 3230, 2004
[19] Xuan-Hieu Phan, Cam-Tu Nguyen, GibbsLDA++: A C/C++ implementation of latent Dirichlet allocation (LDA) using Gibbs sampling for parameter estimation and inference, Available at:


More information

Spam or Not Spam That is the question

Spam or Not Spam That is the question Spam or Not Spam That is the question Ravi Kiran S S and Indriyati Atmosukarto {kiran,indria}@cs.washington.edu Abstract Unsolicited commercial email, commonly known as spam has been known to pollute the

More information

SVM-Based Spam Filter with Active and Online Learning

SVM-Based Spam Filter with Active and Online Learning SVM-Based Spam Filter with Active and Online Learning Qiang Wang Yi Guan Xiaolong Wang School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China Email:{qwang, guanyi,

More information

Spam Filtering using Naïve Bayesian Classification

Spam Filtering using Naïve Bayesian Classification Spam Filtering using Naïve Bayesian Classification Presented by: Samer Younes Outline What is spam anyway? Some statistics Why is Spam a Problem Major Techniques for Classifying Spam Transport Level Filtering

More information

Developing Methods and Heuristics with Low Time Complexities for Filtering Spam Messages

Developing Methods and Heuristics with Low Time Complexities for Filtering Spam Messages Developing Methods and Heuristics with Low Time Complexities for Filtering Spam Messages Tunga Güngör and Ali Çıltık Boğaziçi University, Computer Engineering Department, Bebek, 34342 İstanbul, Turkey

More information

Effectiveness and Limitations of Statistical Spam Filters

Effectiveness and Limitations of Statistical Spam Filters Effectiveness and Limitations of Statistical Spam Filters M. Tariq Banday, Lifetime Member, CSI P.G. Department of Electronics and Instrumentation Technology University of Kashmir, Srinagar, India Abstract

More information

On Attacking Statistical Spam Filters

On Attacking Statistical Spam Filters On Attacking Statistical Spam Filters Gregory L. Wittel and S. Felix Wu Department of Computer Science University of California, Davis One Shields Avenue, Davis, CA 95616 USA Paper review by Deepak Chinavle

More information

Filtering Junk Mail with A Maximum Entropy Model

Filtering Junk Mail with A Maximum Entropy Model Filtering Junk Mail with A Maximum Entropy Model ZHANG Le and YAO Tian-shun Institute of Computer Software & Theory. School of Information Science & Engineering, Northeastern University Shenyang, 110004

More information

Efficient Spam Email Filtering using Adaptive Ontology

Efficient Spam Email Filtering using Adaptive Ontology Efficient Spam Email Filtering using Adaptive Ontology Seongwook Youn and Dennis McLeod Computer Science Department, University of Southern California Los Angeles, CA 90089, USA {syoun, mcleod}@usc.edu

More information

Not So Naïve Online Bayesian Spam Filter

Not So Naïve Online Bayesian Spam Filter Not So Naïve Online Bayesian Spam Filter Baojun Su Institute of Artificial Intelligence College of Computer Science Zhejiang University Hangzhou 310027, China freizsu@gmail.com Congfu Xu Institute of Artificial

More information

Spam Detection System Combining Cellular Automata and Naive Bayes Classifier

Spam Detection System Combining Cellular Automata and Naive Bayes Classifier Spam Detection System Combining Cellular Automata and Naive Bayes Classifier F. Barigou*, N. Barigou**, B. Atmani*** Computer Science Department, Faculty of Sciences, University of Oran BP 1524, El M Naouer,

More information

Latent Dirichlet Markov Allocation for Sentiment Analysis

Latent Dirichlet Markov Allocation for Sentiment Analysis Latent Dirichlet Markov Allocation for Sentiment Analysis Ayoub Bagheri Isfahan University of Technology, Isfahan, Iran Intelligent Database, Data Mining and Bioinformatics Lab, Electrical and Computer

More information

Naive Bayes Spam Filtering Using Word-Position-Based Attributes

Naive Bayes Spam Filtering Using Word-Position-Based Attributes Naive Bayes Spam Filtering Using Word-Position-Based Attributes Johan Hovold Department of Computer Science Lund University Box 118, 221 00 Lund, Sweden johan.hovold.363@student.lu.se Abstract This paper

More information

An Efficient Spam Filtering Techniques for Email Account

An Efficient Spam Filtering Techniques for Email Account American Journal of Engineering Research (AJER) e-issn : 2320-0847 p-issn : 2320-0936 Volume-02, Issue-10, pp-63-73 www.ajer.org Research Paper Open Access An Efficient Spam Filtering Techniques for Email

More information

Journal of Information Technology Impact

Journal of Information Technology Impact Journal of Information Technology Impact Vol. 8, No., pp. -0, 2008 Probability Modeling for Improving Spam Filtering Parameters S. C. Chiemeke University of Benin Nigeria O. B. Longe 2 University of Ibadan

More information

Spam Filter Optimality Based on Signal Detection Theory

Spam Filter Optimality Based on Signal Detection Theory Spam Filter Optimality Based on Signal Detection Theory ABSTRACT Singh Kuldeep NTNU, Norway HUT, Finland kuldeep@unik.no Md. Sadek Ferdous NTNU, Norway University of Tartu, Estonia sadek@unik.no Unsolicited

More information

agoweder@yahoo.com ** The High Institute of Zahra for Comperhensive Professions, Zahra-Libya

agoweder@yahoo.com ** The High Institute of Zahra for Comperhensive Professions, Zahra-Libya AN ANTI-SPAM SYSTEM USING ARTIFICIAL NEURAL NETWORKS AND GENETIC ALGORITHMS ABDUELBASET M. GOWEDER *, TARIK RASHED **, ALI S. ELBEKAIE ***, and HUSIEN A. ALHAMMI **** * The High Institute of Surman for

More information

Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

More information

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing

More information

An Imbalanced Spam Mail Filtering Method

An Imbalanced Spam Mail Filtering Method , pp. 119-126 http://dx.doi.org/10.14257/ijmue.2015.10.3.12 An Imbalanced Spam Mail Filtering Method Zhiqiang Ma, Rui Yan, Donghong Yuan and Limin Liu (College of Information Engineering, Inner Mongolia

More information

Immunity from spam: an analysis of an artificial immune system for junk email detection

Immunity from spam: an analysis of an artificial immune system for junk email detection Immunity from spam: an analysis of an artificial immune system for junk email detection Terri Oda and Tony White Carleton University, Ottawa ON, Canada terri@zone12.com, arpwhite@scs.carleton.ca Abstract.

More information

How To Develop An Anti Spam Software For Information Attacks

How To Develop An Anti Spam Software For Information Attacks I.J. Intelligent Systems and Applications, 2012, 10, 25-34 Published Online September 2012 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijisa.2012.10.03 Anti-Spam Software for Detecting Information

More information

Filtering Spam E-Mail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques

Filtering Spam E-Mail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques 52 The International Arab Journal of Information Technology, Vol. 6, No. 1, January 2009 Filtering Spam E-Mail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques Alaa El-Halees

More information

Representation of Electronic Mail Filtering Profiles: A User Study

Representation of Electronic Mail Filtering Profiles: A User Study Representation of Electronic Mail Filtering Profiles: A User Study Michael J. Pazzani Department of Information and Computer Science University of California, Irvine Irvine, CA 92697 +1 949 824 5888 pazzani@ics.uci.edu

More information

Email Filters that use Spammy Words Only

Email Filters that use Spammy Words Only Email Filters that use Spammy Words Only Vasanth Elavarasan Department of Computer Science University of Texas at Austin Advisors: Mohamed Gouda Department of Computer Science University of Texas at Austin

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Dr. D. Y. Patil College of Engineering, Ambi,. University of Pune, M.S, India University of Pune, M.S, India

Dr. D. Y. Patil College of Engineering, Ambi,. University of Pune, M.S, India University of Pune, M.S, India Volume 4, Issue 6, June 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Effective Email

More information

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different

More information

An Efficient Three-phase Email Spam Filtering Technique

An Efficient Three-phase Email Spam Filtering Technique An Efficient Three-phase Email Filtering Technique Tarek M. Mahmoud 1 *, Alaa Ismail El-Nashar 2 *, Tarek Abd-El-Hafeez 3 *, Marwa Khairy 4 * 1, 2, 3 Faculty of science, Computer Sci. Dept., Minia University,

More information

A Three-Way Decision Approach to Email Spam Filtering

A Three-Way Decision Approach to Email Spam Filtering A Three-Way Decision Approach to Email Spam Filtering Bing Zhou, Yiyu Yao, and Jigang Luo Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2 {zhou200b,yyao,luo226}@cs.uregina.ca

More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

Constructing a User Preference Ontology for Anti-spam Mail Systems

Constructing a User Preference Ontology for Anti-spam Mail Systems Constructing a User Preference Ontology for Anti-spam Mail Systems Jongwan Kim 1,, Dejing Dou 2, Haishan Liu 2, and Donghwi Kwak 2 1 School of Computer and Information Technology, Daegu University Gyeonsan,

More information

A Game Theoretical Framework for Adversarial Learning

A Game Theoretical Framework for Adversarial Learning A Game Theoretical Framework for Adversarial Learning Murat Kantarcioglu University of Texas at Dallas Richardson, TX 75083, USA muratk@utdallas Chris Clifton Purdue University West Lafayette, IN 47907,

More information

Naïve Bayesian Anti-spam Filtering Technique for Malay Language

Naïve Bayesian Anti-spam Filtering Technique for Malay Language Naïve Bayesian Anti-spam Filtering Technique for Malay Language Thamarai Subramaniam 1, Hamid A. Jalab 2, Alaa Y. Taqa 3 1,2 Computer System and Technology Department, Faulty of Computer Science and Information

More information

Domain Classification of Technical Terms Using the Web

Domain Classification of Technical Terms Using the Web Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

More information

Savita Teli 1, Santoshkumar Biradar 2

Savita Teli 1, Santoshkumar Biradar 2 Effective Spam Detection Method for Email Savita Teli 1, Santoshkumar Biradar 2 1 (Student, Dept of Computer Engg, Dr. D. Y. Patil College of Engg, Ambi, University of Pune, M.S, India) 2 (Asst. Proff,

More information

Spam Mail Detection through Data Mining A Comparative Performance Analysis

Spam Mail Detection through Data Mining A Comparative Performance Analysis I.J. Modern Education and Computer Science, 2013, 12, 31-39 Published Online December 2013 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijmecs.2013.12.05 Spam Mail Detection through Data Mining A

More information

Increasing the Accuracy of a Spam-Detecting Artificial Immune System

Increasing the Accuracy of a Spam-Detecting Artificial Immune System Increasing the Accuracy of a Spam-Detecting Artificial Immune System Terri Oda Carleton University terri@zone12.com Tony White Carleton University arpwhite@scs.carleton.ca Abstract- Spam, the electronic

More information

A LVQ-based neural network anti-spam email approach

A LVQ-based neural network anti-spam email approach A LVQ-based neural network anti-spam email approach Zhan Chuan Lu Xianliang Hou Mengshu Zhou Xu College of Computer Science and Engineering of UEST of China, Chengdu, China 610054 zhanchuan@uestc.edu.cn

More information

IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION

IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION http:// IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION Harinder Kaur 1, Raveen Bajwa 2 1 PG Student., CSE., Baba Banda Singh Bahadur Engg. College, Fatehgarh Sahib, (India) 2 Asstt. Prof.,

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

Non-Parametric Spam Filtering based on knn and LSA

Non-Parametric Spam Filtering based on knn and LSA Non-Parametric Spam Filtering based on knn and LSA Preslav Ivanov Nakov Panayot Markov Dobrikov Abstract. The paper proposes a non-parametric approach to filtering of unsolicited commercial e-mail messages,

More information

AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM

AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM ISSN: 2229-6956(ONLINE) ICTACT JOURNAL ON SOFT COMPUTING, APRIL 212, VOLUME: 2, ISSUE: 3 AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM S. Arun Mozhi Selvi 1 and R.S. Rajesh 2 1 Department

More information

Filtering Spams using the Minimum Description Length Principle

Filtering Spams using the Minimum Description Length Principle Filtering Spams using the Minimum Description Length Principle ABSTRACT Tiago A. Almeida, Akebo Yamakami School of Electrical and Computer Engineering University of Campinas UNICAMP 13083 970, Campinas,

More information

Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools

Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools Spam Track Wednesday 1 March, 2006 APRICOT Perth, Australia James Carpinter & Ray Hunt Dept. of Computer Science and Software

More information

Shafzon@yahool.com. Keywords - Algorithm, Artificial immune system, E-mail Classification, Non-Spam, Spam

Shafzon@yahool.com. Keywords - Algorithm, Artificial immune system, E-mail Classification, Non-Spam, Spam An Improved AIS Based E-mail Classification Technique for Spam Detection Ismaila Idris Dept of Cyber Security Science, Fed. Uni. Of Tech. Minna, Niger State Idris.ismaila95@gmail.com Abdulhamid Shafi i

More information

Dirichlet Processes A gentle tutorial

Dirichlet Processes A gentle tutorial Dirichlet Processes A gentle tutorial SELECT Lab Meeting October 14, 2008 Khalid El-Arini Motivation We are given a data set, and are told that it was generated from a mixture of Gaussian distributions.

More information

Hoodwinking Spam Email Filters

Hoodwinking Spam Email Filters Proceedings of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, January 17-19, 2007 533 Hoodwinking Spam Email Filters WANLI MA, DAT TRAN, DHARMENDRA

More information

Spam Filter: VSM based Intelligent Fuzzy Decision Maker

Spam Filter: VSM based Intelligent Fuzzy Decision Maker IJCST Vo l. 1, Is s u e 1, Se p te m b e r 2010 ISSN : 0976-8491(Online Spam Filter: VSM based Intelligent Fuzzy Decision Maker Dr. Sonia YMCA University of Science and Technology, Faridabad, India E-mail

More information

ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift

ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift Sarah Jane Delany 1 and Pádraig Cunningham 2 and Barry Smyth 3 Abstract. While text classification has been identified for some time

More information

Partitioned Logistic Regression for Spam Filtering

Partitioned Logistic Regression for Spam Filtering Partitioned Logistic Regression for Spam Filtering Ming-wei Chang University of Illinois 201 N Goodwin Ave Urbana, IL, USA mchang21@uiuc.edu Wen-tau Yih Microsoft Research One Microsoft Way Redmond, WA,

More information

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type. Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada

More information

Machine Learning in Spam Filtering

Machine Learning in Spam Filtering Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.

More information

An Evaluation of Statistical Spam Filtering Techniques

An Evaluation of Statistical Spam Filtering Techniques An Evaluation of Statistical Spam Filtering Techniques Le Zhang, Jingbo Zhu, Tianshun Yao Natural Language Processing Laboratory Institute of Computer Software & Theory Northeastern University, China ejoy@xinhuanet.com,

More information

Facilitating Business Process Discovery using Email Analysis

Facilitating Business Process Discovery using Email Analysis Facilitating Business Process Discovery using Email Analysis Matin Mavaddat Matin.Mavaddat@live.uwe.ac.uk Stewart Green Stewart.Green Ian Beeson Ian.Beeson Jin Sa Jin.Sa Abstract Extracting business process

More information

Combining Global and Personal Anti-Spam Filtering

Combining Global and Personal Anti-Spam Filtering Combining Global and Personal Anti-Spam Filtering Richard Segal IBM Research Hawthorne, NY 10532 Abstract Many of the first successful applications of statistical learning to anti-spam filtering were personalized

More information

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,

More information

Bayes and Naïve Bayes. cs534-machine Learning

Bayes and Naïve Bayes. cs534-machine Learning Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule

More information

Topic models for Sentiment analysis: A Literature Survey

Topic models for Sentiment analysis: A Literature Survey Topic models for Sentiment analysis: A Literature Survey Nikhilkumar Jadhav 123050033 June 26, 2014 In this report, we present the work done so far in the field of sentiment analysis using topic models.

More information

A Novel Approach of Investigating Deceptive Activities of Developer for Ranking apps

A Novel Approach of Investigating Deceptive Activities of Developer for Ranking apps International Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-3 E-ISSN: 2347-2693 A Novel Approach of Investigating Deceptive Activities of Developer for Ranking apps

More information

AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM

AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM ABSTRACT Luis Alexandre Rodrigues and Nizam Omar Department of Electrical Engineering, Mackenzie Presbiterian University, Brazil, São Paulo 71251911@mackenzie.br,nizam.omar@mackenzie.br

More information

Twitter Content-based Spam Filtering

Twitter Content-based Spam Filtering Twitter Content-based Spam Filtering Igor Santos, Igor Miñambres-Marcos, Carlos Laorden, Patxi Galán-García, Aitor Santamaría-Ibirika, and Pablo G. Bringas DeustoTech-Computing, Deusto Institute of Technology

More information

Spam Filtering with Naive Bayesian Classification

Spam Filtering with Naive Bayesian Classification Spam Filtering with Naive Bayesian Classification Khuong An Nguyen Queens College University of Cambridge L101: Machine Learning for Language Processing MPhil in Advanced Computer Science 09-April-2011

More information

The Optimality of Naive Bayes

The Optimality of Naive Bayes The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most

More information

SpamNet Spam Detection Using PCA and Neural Networks

SpamNet Spam Detection Using PCA and Neural Networks SpamNet Spam Detection Using PCA and Neural Networks Abhimanyu Lad B.Tech. (I.T.) 4 th year student Indian Institute of Information Technology, Allahabad Deoghat, Jhalwa, Allahabad, India abhimanyulad@iiita.ac.in

More information