A Bayesian Topic Model for Spam Filtering
Journal of Information & Computational Science 10:12 (2013), August 1, 2013. Available at http://www.jofcis.com

Zhiying Zhang, Xu Yu, Lixiang Shi, Li Peng, Zhixing Huang
School of Computer and Information Science, Southwest University, Chongqing 400715, China

Abstract

Spam is one of the major problems of today's Internet: it causes financial damage to companies and annoys individual users. Among the approaches developed to detect spam, content-based machine learning algorithms are important and popular. However, these algorithms are trained on statistical representations of the terms that appear in e-mails and are unable to account for the underlying semantics of the terms within the messages. In this paper, we present a Bayesian topic model to address these limitations. We explore the use of semantics in spam filtering by representing e-mails as vectors of topics obtained with a topic model, Latent Dirichlet Allocation (LDA). Based upon this representation, the relationship between the topics and spam can be discovered with a Bayesian method. We test this model on the Enron-Spam datasets, and the results show that the proposed model performs better than the baseline and can detect the internal semantics of spam messages.

Keywords: Spam Detection; Latent Dirichlet Allocation; Bayesian Topic Model

1 Introduction

Electronic mail (e-mail) is one of the most important and powerful means of modern communication. However, over the past decade users have been plagued by spam, also known as junk e-mail or Unsolicited Bulk E-mail (UBE). Spam causes many problems: it makes users waste time looking through and sorting out additional e-mails [1], brings financial loss to companies through the misuse of traffic, storage space and computational power [1], and raises security and legal problems by spreading malicious software, advertising pornography, pyramid schemes, etc. [2].
Many techniques have been proposed to deal with spam. Content-based machine learning algorithms are important and popular among them, including algorithms considered top performers in text classification, such as Boosting [3], Support Vector Machines [4, 5, 6], and Bayesian methods [7, 8].

Project supported by the Natural Science Foundation Project of CQ CSTC (No. CSTC212JJB412), the Scientific Research Foundation for the Returned Overseas Chinese Scholars (No. 2911) and the Fundamental Research Funds for the Central Universities (No. SWU139265).
Corresponding author. E-mail address: huangzx@swu.edu.cn (Zhixing Huang)
Copyright 2013 Binary Information Press. DOI: /jics212279
Although e-mails are usually represented as sequences of words, there are relationships between words on a semantic level that also affect e-mails [9]. However, content-based machine learning algorithms are trained on statistical representations of the terms that appear in the e-mails and are unable to account for the underlying semantics of the e-mails. To address these limitations, Santos et al. [9] proposed representing e-mails with the enhanced Topic-based Vector Space Model (eTVSM) and achieved satisfactory results on the Ling-Spam dataset. However, eTVSM is an ontology-based method, which may limit its effectiveness when it encounters more complicated unseen messages. Furthermore, Ling-Spam has the disadvantage that its ham messages are topic-specific, which can lead to over-optimistic estimates of the performance of learning-based spam filters.

In contrast, we present a Bayesian topic model that introduces the topic model Latent Dirichlet Allocation (LDA) [10] to mine the semantics of e-mails. LDA is a generative probabilistic model of a corpus and is not limited by the weaknesses of an ontology. LDA models every document as a distribution over topics, and every topic as a distribution over words. These topics can reflect the semantics of a document better than terms. The basic idea of our approach is as follows: we use a previously estimated LDA model to make inference on new unseen e-mails to obtain the topic distribution of each e-mail. Each e-mail can then be treated as a vector of topics rather than terms. As topics have a deeper relationship with the content of an e-mail, we can use a Bayesian method to discover the relationship between the topics and spam. A more detailed description is given in Section 3. Our model may appear similar to the method proposed by Bíró et al. [11] because we also use LDA; however, the model we present is completely different from theirs.
The remainder of this paper is organized as follows. Section 2 introduces the basic theory. Section 3 describes the proposed methodology. Section 4 details the performed experiments and presents the results. Finally, Section 5 concludes and outlines avenues for future work.

2 Basic Theory

The basic theory includes the LDA topic model, which is used to obtain the topic distributions of e-mails, and the Bayesian method, which can discover the relationship between words and spam. A modification of this Bayesian method is used in our approach to discover the relationship between topics and spam.

2.1 Latent Dirichlet Allocation

There are D documents of arbitrary length. A document d is a vector of N_d words, W_d, where each word w_{id} is chosen from a vocabulary of size V. The generation of a document collection in LDA is modeled as a three-step process. First, for each document, a distribution over topics is sampled from a Dirichlet distribution. Second, for each word in the document, a single topic is chosen according to this distribution. Finally, each word is sampled from a multinomial distribution over words specific to the sampled topic. This generative process corresponds to the hierarchical Bayesian model shown (using plate notation) in Fig. 1.
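For concreteness, the three sampling steps can be sketched in code. The corpus sizes and document lengths below are arbitrary toy values; only alpha = 50/T and beta = 0.1 mirror the settings the paper uses later.

```python
import numpy as np

# Toy simulation of LDA's three-step generative process. D, V and the
# document lengths are arbitrary illustrative choices; alpha = 50/T and
# beta = 0.1 follow the paper's later experimental settings.
rng = np.random.default_rng(0)

D, T, V = 5, 3, 20            # documents, topics, vocabulary size
alpha, beta = 50.0 / T, 0.1   # symmetric Dirichlet hyperparameters

# One multinomial over the V words per topic: phi_t ~ Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=T)

corpus = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))           # 1) topic mix for document d
    n_words = rng.integers(10, 30)                     # arbitrary document length
    z = rng.choice(T, size=n_words, p=theta)           # 2) a topic per word position
    words = [int(rng.choice(V, p=phi[t])) for t in z]  # 3) a word from that topic
    corpus.append(words)
```

Estimation reverses this process: given only `corpus`, algorithms such as Gibbs sampling recover `phi` and the per-document `theta`.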
Fig. 1: The hierarchical Bayesian model for LDA (plate notation, with hyperparameters α and β, topic mixtures θ, topic-word distributions φ, topic assignments z and words w)

In this model, φ denotes the matrix of topic distributions, with a multinomial distribution over V vocabulary items for each of T topics being drawn independently from a symmetric Dirichlet(β) prior. θ is the matrix of document-specific mixture weights for these T topics, each being drawn independently from a symmetric Dirichlet(α) prior. For each word, z denotes the topic responsible for generating that word, drawn from the θ distribution for that document, and w is the word itself, drawn from the topic distribution φ corresponding to z. Estimating φ and θ provides information about the topics that participate in a corpus and the weights of those topics in each document, respectively. A variety of algorithms have been used to estimate these parameters, from basic expectation-maximization [12] to approximate inference methods like variational EM [10], expectation propagation [13], and Gibbs sampling [14].

2.2 The Bayesian Method

The Bayesian method proposed by Paul Graham [7] is very different from the usual Naive Bayes classifiers [15, 16, 17, 18] and is able to greatly improve the false positive rate. In this paper, this method is referred to as the PG Bayesian classifier. The PG Bayesian classifier discovers the relationship between words and spam. Each word in an e-mail (or only the most interesting words) contributes to the e-mail's spam probability. This contribution of one word, also called the spamicity of the word, is calculated using Bayes' theorem:

p(s|w) = p(w|s)p(s) / (p(w|s)p(s) + p(w|h)p(h))    (1)

In Eq. (1), p(s) is the overall probability that any given e-mail is spam, and p(h) is the overall probability that any given e-mail is ham. p(w|s) is the probability that the given word appears in spam training e-mails, estimated by dividing the number of spam training e-mails that contain this word by the total number of spam training e-mails.
p(w|h) is the probability that the given word appears in ham training e-mails, estimated by dividing the number of ham training e-mails that contain this word by the total number of ham training e-mails. The PG Bayesian classifier assumes that there is no a priori reason for an incoming message to be spam rather than ham, and therefore takes p(s) = p(h) = 0.5. This assumption simplifies Eq. (1) to:

p(s|w) = p(w|s) / (p(w|s) + p(w|h))    (2)
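A minimal sketch of this spamicity estimate under the p(s) = p(h) = 0.5 assumption of Eq. (2); the counts below are invented toy numbers, not figures from the paper:

```python
def spamicity(spam_with_word, n_spam, ham_with_word, n_ham):
    """Word spamicity p(s|w) under p(s) = p(h) = 0.5, as in Eq. (2)."""
    p_w_s = spam_with_word / n_spam  # p(w|s): fraction of spam e-mails containing w
    p_w_h = ham_with_word / n_ham    # p(w|h): fraction of ham e-mails containing w
    return p_w_s / (p_w_s + p_w_h)

# A word seen in 80 of 100 spam e-mails but only 5 of 100 ham e-mails:
print(round(spamicity(80, 100, 5, 100), 3))  # 0.941
```

A word that appears equally often in both classes gets a spamicity of exactly 0.5, i.e. it carries no evidence either way.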
The PG Bayesian classifier also assumes that the words present in an e-mail are independent events. With that assumption, another equation can be derived from Bayes' theorem to calculate the probability that the e-mail is spam, taking into consideration N words of the e-mail:

p = p_1 p_2 ... p_N / (p_1 p_2 ... p_N + (1 - p_1)(1 - p_2) ... (1 - p_N))    (3)

In Eq. (3), p indicates how sure the filter is that the e-mail is spam, and p_n (n = 1, ..., N) is the probability p(s|w_n). The result p is usually compared to a given threshold to decide whether the message is spam or not. If p is lower than the threshold, the message is considered likely ham; otherwise it is considered likely spam.

3 Proposed Methodology

Assume a training set consisting of S spam e-mails and H ham e-mails, and a test set consisting of N new unseen e-mails. First, a unified LDA model with T topics is estimated on the overall training set. Then, this unified LDA model is used to make inference separately on the spam training set, the ham training set, and the new unseen e-mails. This yields three LDA models: the Spam LDA model with the topic distributions θ^(s) of the spam e-mails, the Ham LDA model with the topic distributions θ^(h) of the ham e-mails, and the New-E-mails LDA model with the topic distributions θ^(n) of the new e-mails. All three models have T topics, consistent with the topics of the unified LDA model. In the third step, each e-mail e is represented as a vector e = (z_1, ..., z_T), where each z_i (i = 1, ..., T) carries the probability p(z_i|e) that topic z_i occurs in this e-mail. This value can be read directly from the corresponding matrix θ. Intuitively, some of the T topics are more relevant to spam e-mails and some are more relevant to ham e-mails.
In other words, the topics that are more relevant to spam will have a higher probability in each spam training e-mail, and the topics that are more relevant to ham will have a higher probability in each ham e-mail. That means each of the T topics has a spamicity, just like words do. Following Eq. (2), the spamicity of a topic z_i (i = 1, ..., T) can be calculated as:

p(s|z_i) = p(z_i|s) / (p(z_i|s) + p(z_i|h))    (4)

The challenge is how to calculate the probabilities p(z_i|s) and p(z_i|h). The spam training set contains S spam e-mails in total. Given each spam training e-mail e_j (j = 1, ..., S), the probability of each topic z_i can be obtained from the matrix θ^(s), i.e., p(z_i|e_j, s) = θ^(s)_{j,i}. We define the probability of each e-mail e_j as p(e_j|s) = 1/S. Hence we can compute p(z_i|s) using the law of total probability:

p(z_i|s) = Σ_{j=1}^{S} p(z_i|e_j, s) p(e_j|s) = (1/S) Σ_{j=1}^{S} θ^(s)_{j,i}    (5)
The probability p(z_i|h) can be calculated in the same way, and Eq. (4) then becomes:

p(s|z_i) = [(1/S) Σ_{j=1}^{S} θ^(s)_{j,i}] / [(1/S) Σ_{j=1}^{S} θ^(s)_{j,i} + (1/H) Σ_{j=1}^{H} θ^(h)_{j,i}]    (6)

Each e-mail e_j (j = 1, ..., N) of the N new unseen e-mails is likewise represented as e_j = (z_1, ..., z_T), where the value of each z_i is p(z_i|e_j) = θ^(n)_{j,i}. Then, the top k most representative topics of e_j are selected to calculate the probability that e_j is spam. This is achieved with the following algorithm for each e_j:

(1) For each topic z_i of the T topics, if p(s|z_i) lies outside the uninformative interval (0.45, 0.55), add z_i to the CandidateTopicSet (CTS).

(2) Rearrange the topics of e_j in descending order of the values p(z_i|e_j), and save the result as the TopicOrderList (TOL).

(3) For each topic in TOL, in order, add the topics that are also in CTS to the AvailableTopicList (ATL).

(4) Select the top k topics from ATL.

Then, the probability that e_j is spam can be computed by taking all of the top k topics into consideration. Following the form of Eq. (3), the final equation is:

p(s|e_j) = Π_{i=1}^{k} p(s|z_i) / (Π_{i=1}^{k} p(s|z_i) + Π_{i=1}^{k} (1 - p(s|z_i)))    (7)

4 Experiments and Evaluation

4.1 Datasets and Experimental Setup

We use six datasets collectively called the Enron-Spam datasets, which were developed by Metsis et al. [8] and are publicly available and non-encoded, just like Ling-Spam and SpamAssassin. Each of the six Enron-Spam datasets consists of a ham set and a spam set, and each message is in a separate text file. The ham collections of the six datasets come from six Enron users and were each paired with a spam collection. Hereafter, we refer to the six Enron-Spam datasets as Enron 1, Enron 2, ..., Enron 6. Phan's GibbsLDA++ [19] is used for estimation and inference on the datasets. The Dirichlet parameter β is fixed at 0.1 throughout, while α = 50/T throughout.
T is the number of topics of the LDA model; we experiment with T = {10, 20, 50, 100, 200}. Gibbs sampling is stopped after 2,000 steps for estimation on the unified training set, and after 1,000 steps for inference on the ham training set, spam training set, and test set. In the testing phase, the top k most representative topics are selected for each e-mail to calculate the probability that the test e-mail is spam. We experiment with k = 1, ..., Length(ATL) - 1. Each learning model of each dataset is denoted as M_{T,k}. The threshold is 0.5. If the probability of the e-mail
is lower than the threshold, it is considered likely ham; otherwise it is considered likely spam. 10-fold cross-validation is applied in our experiments.

During the above experiments, the curves of the topic probability distributions under the different models are obtained for each dataset. A group of curves with T = 20 is shown in Fig. 2. These curves clearly reveal that the probability with which the same topic occurs differs across categories, which is direct evidence of the correctness and feasibility of our approach.

Fig. 2: The probability curves of the topics under the different models for each dataset, with T = 20 (panels (a)-(f) correspond to Enron 1-6)

4.2 Evaluation and Comparison

We first evaluate each model M_{T,k}. Because each dataset belongs to a different user and has a different ham-spam ratio, the best-performing learning model may also differ across datasets. To evaluate each model M_{T,k} of each dataset, we present the evaluation results as curves. Since the k values of the best models M_{T,k} are all less than 7, the F-measure curves are drawn for k = 1, ..., 7 to facilitate comparison. By inspecting the F-measure curves shown in Fig. 3, the best-performing model can be selected for each dataset. Table 1 lists these selected models. Our method achieves its best result on Enron 4, which supports Metsis et al.'s [8] view that some datasets (e.g., Enron 4) are easier than others (e.g., Enron 1). We use just 1 topic for detection on Enron 4, in contrast to 5 topics on Enron 1. We also find that the models with T = 10 and T = 200 never perform best, which shows that too small or too large a T is not appropriate.
Fig. 3: The F-measure curves of each model for each dataset

Table 1: Best model for each dataset

    Dataset    Best model M_{T,k}
    Enron 1    T = 100, k = 5
    Enron 2    T = 20
    Enron 3    T = 50
    Enron 4    T = 50, k = 1
    Enron 5    T = 50
    Enron 6    T = 100

The prediction results of the best models are taken as the best results of our method. To evaluate its filtering capability, we compare it with two term-based spam filtering methods. One is the PG Bayesian classifier, which serves as the baseline; the other is Multinomial Naive Bayes with Boolean attributes (MN Bool), which was shown to be the best Naive Bayes classifier in [8]. The best spam and ham recall of the PG Bayesian method are selected as the baseline. Metsis et al. also evaluated MN Bool on the Enron-Spam datasets, so the experimental results from [8] are used directly as another reference. The threshold in all three methods is 0.5. Tables 2 and 3 list the spam and ham recall, respectively, of the three methods on the six datasets. The tables show that both PG Bayesian and our model outperform the MN Bool method, which uses 3,000 attributes, and that our model is the best of the three. Although PG Bayesian also achieves a better result than MN Bool, it does not exploit semantic relationships. Our model, in contrast, not only explores the semantic relationships but also obtains better predictions using a maximum of just 5 topics. These results demonstrate the superiority of our model.
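Putting the pieces of Section 3 together, the scoring procedure of Eqs. (6)-(7) can be sketched end-to-end. The θ matrices below are random toy stand-ins for the output of LDA inference, and the candidate filter assumes that topics with spamicity inside the uninformative 0.45-0.55 band are excluded:

```python
import numpy as np

# Sketch of the topic-spamicity scoring of Eqs. (6)-(7). The theta matrices
# are toy values standing in for LDA inference output; S = H = 6 training
# e-mails and T = 4 topics are arbitrary illustrative choices.
rng = np.random.default_rng(1)
T = 4
theta_s = rng.dirichlet(np.ones(T), size=6)  # topic mixes of spam training e-mails
theta_h = rng.dirichlet(np.ones(T), size=6)  # topic mixes of ham training e-mails

# Eq. (6): topic spamicity from the mean topic weight in each class.
mean_s, mean_h = theta_s.mean(axis=0), theta_h.mean(axis=0)
p_s_z = mean_s / (mean_s + mean_h)

def score(theta_new, p_s_z, k=2, band=(0.45, 0.55)):
    """Eq. (7): spam probability of a new e-mail from its top-k topics.

    Assumes topics whose spamicity falls inside the uninformative
    (0.45, 0.55) band are excluded from the candidate set (CTS)."""
    cts = {i for i, p in enumerate(p_s_z) if p < band[0] or p > band[1]}
    tol = np.argsort(-theta_new)            # TOL: topics by descending weight
    atl = [i for i in tol if i in cts][:k]  # ATL: TOL restricted to CTS, top k
    p = p_s_z[atl]
    prod_s, prod_h = np.prod(p), np.prod(1.0 - p)
    return prod_s / (prod_s + prod_h)

theta_new = rng.dirichlet(np.ones(T))    # topic mix of one unseen e-mail
is_spam = score(theta_new, p_s_z) > 0.5  # decision threshold of 0.5
```

When no topic survives the candidate filter, both products are empty and the score degenerates to 0.5, i.e. the filter abstains at the threshold.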
Table 2: Spam recall (%) of MN Bool, PG Bayesian, and our model on Enron 1-6, with averages

Table 3: Ham recall (%) of MN Bool, PG Bayesian, and our model on Enron 1-6, with averages

5 Conclusion and Future Work

In this paper, a Bayesian topic model is proposed for spam filtering. Using LDA, each e-mail is represented as a vector of topics, and based upon this representation a Bayesian method is used to discover the relationship between the topics and spam. Testing on the Enron-Spam datasets shows that our model performs better than the baseline and can detect the internal semantics of spam messages. In future work, we will test the Bayesian topic model in other application fields, such as document classification.

References

[1] Mikko T. Siponen, Carl Stucke, Effective anti-spam strategies in companies: An international study, In 39th Hawaii International Conference on System Sciences, Kauai, HI, USA, 2006
[2] Evangelos Moustakas, C. Ranganathan, Penny Duquenoy, Combating spam through legislation: A comparative analysis of US and European approaches, In CEAS 2005 - Second Conference on Email and Anti-Spam, July 21-22, 2005, Stanford University, California, USA
[3] Xavier Carreras, Lluís Màrquez, Boosting trees for anti-spam email filtering, In Proceedings of the 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, BG, 2001
[4] Harris Drucker, Donghui Wu, Vladimir Vapnik, Support vector machines for spam categorization, IEEE Transactions on Neural Networks, 10(5), 1999
[5] A. Kolcz, J. Alspector, SVM-based filtering of e-mail spam with content-specific misclassification costs, In Proceedings of the ICDM Workshop on Text Mining, 2001
[6] Yuewu Shen, Guanglu Sun, Haoliang Qi, Xiaoning He, Using feature selection to speed up online SVM based spam filtering, In International Conference on Asian Language Processing, IALP 2010, Harbin, Heilongjiang, China, 2010
[7] Paul Graham, A plan for spam, August 2003
[8] Vangelis Metsis, Ion Androutsopoulos, Georgios Paliouras, Spam filtering with naive Bayes - which naive Bayes? In CEAS 2006 - The Third Conference on Email and Anti-Spam, Mountain View, California, USA, July 27-28, 2006
[9] Igor Santos, Carlos Laorden, Borja Sanz, Pablo Garcia Bringas, Enhanced topic-based vector space model for semantics-aware spam filtering, Expert Syst. Appl., 39(1), 2012
[10] David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res., 3, March 2003
[11] István Bíró, Jácint Szabó, András A. Benczúr, Latent Dirichlet allocation in web spam filtering, In AIRWeb '08: Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, Beijing, China, 2008
[12] Thomas Hofmann, Probabilistic latent semantic indexing, In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval, 1999, 50-57
[13] Thomas Minka, John Lafferty, Expectation-propagation for the generative aspect model, In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, 2002
[14] T. L. Griffiths, M. Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences, 101 (Suppl. 1), April 2004
[15] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz, A Bayesian approach to filtering junk e-mail, In AAAI-98 Workshop on Learning for Text Categorization, 1998
[16] P. Pantel, D. Lin, SpamCop: A spam classification and organization program, In Proceedings of the AAAI Workshop on Learning for Text Categorization, 1998
[17] George H. John, Pat Langley, Estimating continuous distributions in Bayesian classifiers, In UAI, Morgan Kaufmann, 1995
[18] Karl-Michael Schneider, On word frequency information and negative evidence in naive Bayes text classification, In EsTAL, Lecture Notes in Computer Science, Vol. 3230, 2004
[19] Xuan-Hieu Phan, Cam-Tu Nguyen, GibbsLDA++: A C/C++ implementation of latent Dirichlet allocation (LDA) using Gibbs sampling for parameter estimation and inference
More informationSpam Detection System Combining Cellular Automata and Naive Bayes Classifier
Spam Detection System Combining Cellular Automata and Naive Bayes Classifier F. Barigou*, N. Barigou**, B. Atmani*** Computer Science Department, Faculty of Sciences, University of Oran BP 1524, El M Naouer,
More informationLatent Dirichlet Markov Allocation for Sentiment Analysis
Latent Dirichlet Markov Allocation for Sentiment Analysis Ayoub Bagheri Isfahan University of Technology, Isfahan, Iran Intelligent Database, Data Mining and Bioinformatics Lab, Electrical and Computer
More informationNaive Bayes Spam Filtering Using Word-Position-Based Attributes
Naive Bayes Spam Filtering Using Word-Position-Based Attributes Johan Hovold Department of Computer Science Lund University Box 118, 221 00 Lund, Sweden johan.hovold.363@student.lu.se Abstract This paper
More informationAn Efficient Spam Filtering Techniques for Email Account
American Journal of Engineering Research (AJER) e-issn : 2320-0847 p-issn : 2320-0936 Volume-02, Issue-10, pp-63-73 www.ajer.org Research Paper Open Access An Efficient Spam Filtering Techniques for Email
More informationJournal of Information Technology Impact
Journal of Information Technology Impact Vol. 8, No., pp. -0, 2008 Probability Modeling for Improving Spam Filtering Parameters S. C. Chiemeke University of Benin Nigeria O. B. Longe 2 University of Ibadan
More informationSpam Filter Optimality Based on Signal Detection Theory
Spam Filter Optimality Based on Signal Detection Theory ABSTRACT Singh Kuldeep NTNU, Norway HUT, Finland kuldeep@unik.no Md. Sadek Ferdous NTNU, Norway University of Tartu, Estonia sadek@unik.no Unsolicited
More informationagoweder@yahoo.com ** The High Institute of Zahra for Comperhensive Professions, Zahra-Libya
AN ANTI-SPAM SYSTEM USING ARTIFICIAL NEURAL NETWORKS AND GENETIC ALGORITHMS ABDUELBASET M. GOWEDER *, TARIK RASHED **, ALI S. ELBEKAIE ***, and HUSIEN A. ALHAMMI **** * The High Institute of Surman for
More informationEmail Spam Detection A Machine Learning Approach
Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn
More informationArtificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier
International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing
More informationAn Imbalanced Spam Mail Filtering Method
, pp. 119-126 http://dx.doi.org/10.14257/ijmue.2015.10.3.12 An Imbalanced Spam Mail Filtering Method Zhiqiang Ma, Rui Yan, Donghong Yuan and Limin Liu (College of Information Engineering, Inner Mongolia
More informationImmunity from spam: an analysis of an artificial immune system for junk email detection
Immunity from spam: an analysis of an artificial immune system for junk email detection Terri Oda and Tony White Carleton University, Ottawa ON, Canada terri@zone12.com, arpwhite@scs.carleton.ca Abstract.
More informationHow To Develop An Anti Spam Software For Information Attacks
I.J. Intelligent Systems and Applications, 2012, 10, 25-34 Published Online September 2012 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijisa.2012.10.03 Anti-Spam Software for Detecting Information
More informationFiltering Spam E-Mail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques
52 The International Arab Journal of Information Technology, Vol. 6, No. 1, January 2009 Filtering Spam E-Mail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques Alaa El-Halees
More informationRepresentation of Electronic Mail Filtering Profiles: A User Study
Representation of Electronic Mail Filtering Profiles: A User Study Michael J. Pazzani Department of Information and Computer Science University of California, Irvine Irvine, CA 92697 +1 949 824 5888 pazzani@ics.uci.edu
More informationEmail Filters that use Spammy Words Only
Email Filters that use Spammy Words Only Vasanth Elavarasan Department of Computer Science University of Texas at Austin Advisors: Mohamed Gouda Department of Computer Science University of Texas at Austin
More informationData Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationDr. D. Y. Patil College of Engineering, Ambi,. University of Pune, M.S, India University of Pune, M.S, India
Volume 4, Issue 6, June 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Effective Email
More informationAnti-Spam Filter Based on Naïve Bayes, SVM, and KNN model
AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different
More informationAn Efficient Three-phase Email Spam Filtering Technique
An Efficient Three-phase Email Filtering Technique Tarek M. Mahmoud 1 *, Alaa Ismail El-Nashar 2 *, Tarek Abd-El-Hafeez 3 *, Marwa Khairy 4 * 1, 2, 3 Faculty of science, Computer Sci. Dept., Minia University,
More informationA Three-Way Decision Approach to Email Spam Filtering
A Three-Way Decision Approach to Email Spam Filtering Bing Zhou, Yiyu Yao, and Jigang Luo Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2 {zhou200b,yyao,luo226}@cs.uregina.ca
More informationA Content based Spam Filtering Using Optical Back Propagation Technique
A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT
More informationConstructing a User Preference Ontology for Anti-spam Mail Systems
Constructing a User Preference Ontology for Anti-spam Mail Systems Jongwan Kim 1,, Dejing Dou 2, Haishan Liu 2, and Donghwi Kwak 2 1 School of Computer and Information Technology, Daegu University Gyeonsan,
More informationA Game Theoretical Framework for Adversarial Learning
A Game Theoretical Framework for Adversarial Learning Murat Kantarcioglu University of Texas at Dallas Richardson, TX 75083, USA muratk@utdallas Chris Clifton Purdue University West Lafayette, IN 47907,
More informationNaïve Bayesian Anti-spam Filtering Technique for Malay Language
Naïve Bayesian Anti-spam Filtering Technique for Malay Language Thamarai Subramaniam 1, Hamid A. Jalab 2, Alaa Y. Taqa 3 1,2 Computer System and Technology Department, Faulty of Computer Science and Information
More informationDomain Classification of Technical Terms Using the Web
Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using
More informationSavita Teli 1, Santoshkumar Biradar 2
Effective Spam Detection Method for Email Savita Teli 1, Santoshkumar Biradar 2 1 (Student, Dept of Computer Engg, Dr. D. Y. Patil College of Engg, Ambi, University of Pune, M.S, India) 2 (Asst. Proff,
More informationSpam Mail Detection through Data Mining A Comparative Performance Analysis
I.J. Modern Education and Computer Science, 2013, 12, 31-39 Published Online December 2013 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijmecs.2013.12.05 Spam Mail Detection through Data Mining A
More informationIncreasing the Accuracy of a Spam-Detecting Artificial Immune System
Increasing the Accuracy of a Spam-Detecting Artificial Immune System Terri Oda Carleton University terri@zone12.com Tony White Carleton University arpwhite@scs.carleton.ca Abstract- Spam, the electronic
More informationA LVQ-based neural network anti-spam email approach
A LVQ-based neural network anti-spam email approach Zhan Chuan Lu Xianliang Hou Mengshu Zhou Xu College of Computer Science and Engineering of UEST of China, Chengdu, China 610054 zhanchuan@uestc.edu.cn
More informationIDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION
http:// IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION Harinder Kaur 1, Raveen Bajwa 2 1 PG Student., CSE., Baba Banda Singh Bahadur Engg. College, Fatehgarh Sahib, (India) 2 Asstt. Prof.,
More informationCategorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
More informationNon-Parametric Spam Filtering based on knn and LSA
Non-Parametric Spam Filtering based on knn and LSA Preslav Ivanov Nakov Panayot Markov Dobrikov Abstract. The paper proposes a non-parametric approach to filtering of unsolicited commercial e-mail messages,
More informationAN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM
ISSN: 2229-6956(ONLINE) ICTACT JOURNAL ON SOFT COMPUTING, APRIL 212, VOLUME: 2, ISSUE: 3 AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM S. Arun Mozhi Selvi 1 and R.S. Rajesh 2 1 Department
More informationFiltering Spams using the Minimum Description Length Principle
Filtering Spams using the Minimum Description Length Principle ABSTRACT Tiago A. Almeida, Akebo Yamakami School of Electrical and Computer Engineering University of Campinas UNICAMP 13083 970, Campinas,
More informationTightening the Net: A Review of Current and Next Generation Spam Filtering Tools
Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools Spam Track Wednesday 1 March, 2006 APRICOT Perth, Australia James Carpinter & Ray Hunt Dept. of Computer Science and Software
More informationShafzon@yahool.com. Keywords - Algorithm, Artificial immune system, E-mail Classification, Non-Spam, Spam
An Improved AIS Based E-mail Classification Technique for Spam Detection Ismaila Idris Dept of Cyber Security Science, Fed. Uni. Of Tech. Minna, Niger State Idris.ismaila95@gmail.com Abdulhamid Shafi i
More informationDirichlet Processes A gentle tutorial
Dirichlet Processes A gentle tutorial SELECT Lab Meeting October 14, 2008 Khalid El-Arini Motivation We are given a data set, and are told that it was generated from a mixture of Gaussian distributions.
More informationHoodwinking Spam Email Filters
Proceedings of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, January 17-19, 2007 533 Hoodwinking Spam Email Filters WANLI MA, DAT TRAN, DHARMENDRA
More informationSpam Filter: VSM based Intelligent Fuzzy Decision Maker
IJCST Vo l. 1, Is s u e 1, Se p te m b e r 2010 ISSN : 0976-8491(Online Spam Filter: VSM based Intelligent Fuzzy Decision Maker Dr. Sonia YMCA University of Science and Technology, Faridabad, India E-mail
More informationECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift
ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift Sarah Jane Delany 1 and Pádraig Cunningham 2 and Barry Smyth 3 Abstract. While text classification has been identified for some time
More informationPartitioned Logistic Regression for Spam Filtering
Partitioned Logistic Regression for Spam Filtering Ming-wei Chang University of Illinois 201 N Goodwin Ave Urbana, IL, USA mchang21@uiuc.edu Wen-tau Yih Microsoft Research One Microsoft Way Redmond, WA,
More informationThree types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.
Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada
More informationMachine Learning in Spam Filtering
Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.
More informationAn Evaluation of Statistical Spam Filtering Techniques
An Evaluation of Statistical Spam Filtering Techniques Le Zhang, Jingbo Zhu, Tianshun Yao Natural Language Processing Laboratory Institute of Computer Software & Theory Northeastern University, China ejoy@xinhuanet.com,
More informationFacilitating Business Process Discovery using Email Analysis
Facilitating Business Process Discovery using Email Analysis Matin Mavaddat Matin.Mavaddat@live.uwe.ac.uk Stewart Green Stewart.Green Ian Beeson Ian.Beeson Jin Sa Jin.Sa Abstract Extracting business process
More informationCombining Global and Personal Anti-Spam Filtering
Combining Global and Personal Anti-Spam Filtering Richard Segal IBM Research Hawthorne, NY 10532 Abstract Many of the first successful applications of statistical learning to anti-spam filtering were personalized
More informationVCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,
More informationBayes and Naïve Bayes. cs534-machine Learning
Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule
More informationTopic models for Sentiment analysis: A Literature Survey
Topic models for Sentiment analysis: A Literature Survey Nikhilkumar Jadhav 123050033 June 26, 2014 In this report, we present the work done so far in the field of sentiment analysis using topic models.
More informationA Novel Approach of Investigating Deceptive Activities of Developer for Ranking apps
International Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-3 E-ISSN: 2347-2693 A Novel Approach of Investigating Deceptive Activities of Developer for Ranking apps
More informationAUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM
AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM ABSTRACT Luis Alexandre Rodrigues and Nizam Omar Department of Electrical Engineering, Mackenzie Presbiterian University, Brazil, São Paulo 71251911@mackenzie.br,nizam.omar@mackenzie.br
More informationTwitter Content-based Spam Filtering
Twitter Content-based Spam Filtering Igor Santos, Igor Miñambres-Marcos, Carlos Laorden, Patxi Galán-García, Aitor Santamaría-Ibirika, and Pablo G. Bringas DeustoTech-Computing, Deusto Institute of Technology
More informationSpam Filtering with Naive Bayesian Classification
Spam Filtering with Naive Bayesian Classification Khuong An Nguyen Queens College University of Cambridge L101: Machine Learning for Language Processing MPhil in Advanced Computer Science 09-April-2011
More informationThe Optimality of Naive Bayes
The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most
More informationSpamNet Spam Detection Using PCA and Neural Networks
SpamNet Spam Detection Using PCA and Neural Networks Abhimanyu Lad B.Tech. (I.T.) 4 th year student Indian Institute of Information Technology, Allahabad Deoghat, Jhalwa, Allahabad, India abhimanyulad@iiita.ac.in
More information