A Bayesian Topic Model for Spam Filtering



Journal of Information & Computational Science 10:12 (2013) 3719-3727
Available at http://www.joics.com

Zhiying Zhang, Xu Yu, Lixiang Shi, Li Peng, Zhixing Huang
School of Computer and Information Science, Southwest University, Chongqing 400715, China

Abstract

Spam is one of the major problems of today's Internet because it brings financial damage to companies and annoys individual users. Among the approaches developed to detect spam, content-based machine learning algorithms are important and popular. However, these algorithms are trained on statistical representations of the terms that usually appear in e-mails, and are therefore unable to account for the underlying semantics of terms within the messages. In this paper, we present a Bayesian topic model to address these limitations. We explore the use of semantics in spam filtering by representing e-mails as vectors of topics obtained with a topic model, Latent Dirichlet Allocation (LDA). Based on this representation, the relationship between topics and spam can be discovered with a Bayesian method. We test this model on the Enron-Spam datasets; the results show that the proposed model performs better than the baseline and can detect the internal semantics of spam messages.

Keywords: Spam Detection; Latent Dirichlet Allocation; Bayesian Topic Model

1 Introduction

Electronic mail (e-mail) is one of the most important and powerful means of modern communication. However, in the past decade e-mail users have been plagued by spam, also known as junk e-mail or Unsolicited Bulk E-mail (UBE).
Spam causes many problems: it makes users waste time looking through and sorting out additional e-mails [1], causes financial loss to companies through the misuse of traffic, storage space and computational power [1], and raises security and legal problems by spreading malicious software, advertising pornography, pyramid schemes, etc. [2]. Many techniques have been proposed to deal with spam. Content-based machine learning algorithms are important and popular among them, including algorithms considered top performers in text classification, such as Boosting [3], Support Vector Machines [4, 5, 6], and Bayesian methods [7, 8].

Project supported by the Natural Science Foundation Project of CQ CSTC (No. CSTC212JJB412), the Scientific Research Foundation for the Returned Overseas Chinese Scholars (No. 2911) and the Fundamental Research Funds for the Central Universities (No. SWU139265). Corresponding author. Email address: huangzx@swu.edu.cn (Zhixing Huang).

ISSN 1548-7741 / Copyright 2013 Binary Information Press. DOI: 10.12733/jics212279

Although e-mails are usually represented as sequences of words, there are also relationships between words on a semantic level that affect e-mails [9]. Content-based machine learning algorithms, however, are trained on statistical representations of the terms that usually appear in e-mails, and are unable to account for the underlying semantics of the messages. To address these limitations, Santos et al. [9] proposed representing e-mails with the enhanced Topic-based Vector Space Model (eTVSM) and achieved satisfactory results on the Ling-Spam dataset. However, eTVSM is an ontology-based method, which may limit its effectiveness when it encounters more complicated unseen messages. Furthermore, Ling-Spam has the disadvantage that its ham messages are rather topic-specific, which can lead to over-optimistic estimates of the performance of learning-based spam filters.

In contrast, we present a Bayesian topic model that introduces the topic model Latent Dirichlet Allocation (LDA) [10] to mine the semantics of e-mails. LDA is a generative probabilistic model of a corpus and is not limited by the weaknesses of an ontology. LDA models every document as a distribution over topics, and every topic as a distribution over words. These topics can reflect the semantics of a document better than raw terms. The basic idea of our approach is as follows: we use a previously estimated LDA model to infer the topic distribution of each new, unseen e-mail, so that each e-mail is treated as a vector of topics rather than of terms. As the topics have a deeper relationship with the content of an e-mail, we can then use a Bayesian method to discover the relationship between topics and spam. A more detailed description is given in Section 3.
Our model may appear similar in spirit to the method proposed by Bíró et al. [11], since we also use LDA; however, the model we present is completely different from theirs.

The remainder of this paper is organized as follows. Section 2 introduces the basic theory. Section 3 describes the proposed methodology. Section 4 details the performed experiments and presents the results. Finally, Section 5 concludes and outlines avenues for future work.

2 Basic Theory

The basic theory comprises the LDA topic model, which is used to obtain the topic distributions of e-mails, and a Bayesian method, which can discover the relationship between words and spam. In our approach, a modification of this Bayesian method is used to discover the relationship between topics and spam.

2.1 Latent Dirichlet Allocation

There are D documents of arbitrary length. A document d is a vector of N_d words, W_d, where each word w_id is chosen from a vocabulary of size V. The generation of a document collection in LDA is modeled as a three-step process. First, for each document, a distribution over topics is sampled from a Dirichlet distribution. Second, for each word in the document, a single topic is chosen according to this distribution. Finally, each word is sampled from a multinomial distribution over words specific to the sampled topic. This generative process corresponds to the hierarchical Bayesian model shown (using plate notation) in Fig. 1.
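For illustration, this three-step generative process can be sketched in a few lines of Python with NumPy. The sizes, α, and β below are toy values chosen for the example, not the settings used later in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy sizes: T topics, vocabulary of V words.
T, V, alpha, beta = 3, 8, 0.1, 0.01

# phi: one multinomial over the V vocabulary items per topic,
# each row drawn from a symmetric Dirichlet(beta) prior.
phi = rng.dirichlet(np.full(V, beta), size=T)

def generate_document(n_words):
    """Generate one document by the three-step LDA process."""
    # Step 1: sample this document's topic mixture from Dirichlet(alpha).
    theta = rng.dirichlet(np.full(T, alpha))
    words = []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)   # Step 2: choose a topic for this word.
        w = rng.choice(V, p=phi[z])  # Step 3: sample the word from that topic.
        words.append(int(w))
    return theta, words

theta, words = generate_document(20)
print(len(words), theta.round(2))
```

Estimating ϕ and θ from observed documents (rather than sampling them) is the inference problem discussed next.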

Fig. 1: The hierarchical Bayesian model for LDA (plate notation: hyperparameters α and β, topic mixtures θ, topic assignments z, topic distributions ϕ, words w; plates over the N_d words, D documents, and T topics)

In this model, ϕ denotes the matrix of topic distributions, with a multinomial distribution over the V vocabulary items for each of the T topics, drawn independently from a symmetric Dirichlet(β) prior. θ is the matrix of document-specific mixture weights for these T topics, each drawn independently from a symmetric Dirichlet(α) prior. For each word, z denotes the topic responsible for generating that word, drawn from the θ distribution for its document, and w is the word itself, drawn from the topic distribution ϕ corresponding to z. Estimating ϕ and θ provides information about the topics that participate in a corpus and the weights of those topics in each document, respectively. A variety of algorithms have been used to estimate these parameters, from basic expectation-maximization [12] to approximate inference methods such as variational EM [10], expectation propagation [13], and Gibbs sampling [14].

2.2 The Bayesian Method

The Bayesian method proposed by Paul Graham [7] is quite different from the usual Naive Bayes classifiers [15, 16, 17, 18] and is able to greatly improve the false positive rate. In this paper, this method is referred to as the PG Bayesian classifier. The PG Bayesian classifier discovers the relationship between words and spam: each word in an e-mail (or only the most interesting words) contributes to the e-mail's spam probability. The contribution of one word, also called the spamicity of the word, is calculated using Bayes' theorem:

p(s|w) = p(w|s)p(s) / [p(w|s)p(s) + p(w|h)p(h)]    (1)

In Eq. (1), p(s) is the overall probability that any given e-mail is spam, and p(h) is the overall probability that any given e-mail is ham.
p(w|s) is the probability that the given word appears in spam training e-mails, estimated by dividing the number of spam training e-mails that contain the word by the total number of spam training e-mails. p(w|h) is the probability that the given word appears in ham training e-mails, estimated analogously over the ham training e-mails. The PG Bayesian classifier assumes there is no a priori reason for an incoming message to be spam rather than ham, and so takes p(s) = p(h) = 0.5. This assumption simplifies Eq. (1) to:

p(s|w) = p(w|s) / [p(w|s) + p(w|h)]    (2)
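The per-word spamicity of Eq. (2) can be sketched as follows. This is a toy example: the tiny training sets and the neutral fallback of 0.5 for words seen in neither class are assumptions of the sketch, not part of the method above:

```python
def spamicity(word, spam_docs, ham_docs):
    """p(s|w) under the equal-prior simplification p(s) = p(h) = 0.5:
    p(s|w) = p(w|s) / (p(w|s) + p(w|h)), where p(w|s) is the fraction of
    spam training mails containing the word (and p(w|h) likewise for ham)."""
    p_w_s = sum(word in d for d in spam_docs) / len(spam_docs)
    p_w_h = sum(word in d for d in ham_docs) / len(ham_docs)
    if p_w_s + p_w_h == 0:
        return 0.5  # word unseen in training: treat as neutral (our choice)
    return p_w_s / (p_w_s + p_w_h)

# Toy training mails represented as sets of words.
spam = [{"free", "mortgage", "rate"}, {"free", "search"}]
ham = [{"meeting", "rate"}, {"project", "report"}]
print(spamicity("free", spam, ham))  # 1.0: appears only in spam mails
print(spamicity("rate", spam, ham))  # 0.5: equally frequent in both classes
```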

The PG Bayesian classifier also assumes that the words present in an e-mail are independent events. With this assumption, another equation can be derived from Bayes' theorem to calculate the probability that the e-mail is spam, taking N words of the e-mail into consideration:

p = (p_1 p_2 ... p_N) / [p_1 p_2 ... p_N + (1 - p_1)(1 - p_2) ... (1 - p_N)]    (3)

In Eq. (3), p indicates how confident the filter is that the e-mail is spam, and p_n (n = 1, ..., N) is the probability p(s|w_n). The result p is compared to a given threshold to decide whether the message is spam: if p is lower than the threshold, the message is considered likely ham; otherwise it is considered likely spam.

3 Proposed Methodology

Assume a training set consisting of S spam e-mails and H ham e-mails, and a test set consisting of N new, unseen e-mails. First, a Unified LDA model with T topics is estimated on the overall training set. Then, this Unified LDA model is used to run inference separately on the spam training set, the ham training set and the new, unseen e-mails. This yields three LDA models: the Spam LDA model with the topic distributions θ^(s) of the spam e-mails, the Ham LDA model with the topic distributions θ^(h) of the ham e-mails, and the New-emails LDA model with the topic distributions θ^(n) of the new e-mails. All three models share the T topics of the Unified LDA model. In the third step, each e-mail e is represented as a vector e = (z_1, ..., z_T), where each z_i (i = 1, ..., T) takes the value p(z_i|e), the probability that topic z_i occurs in the e-mail; this value can be read directly from the corresponding matrix θ. It is natural to expect that some of the T topics are more relevant to spam e-mails and some are more relevant to ham e-mails.
In other words, the topics more relevant to spam e-mails will have a higher probability in each spam training e-mail, and the topics more relevant to ham e-mails will have a higher probability in each ham training e-mail. That means each of the T topics has a spamicity, just like words. Following Eq. (2), the spamicity of a topic z_i (i = 1, ..., T) is naturally:

p(s|z_i) = p(z_i|s) / [p(z_i|s) + p(z_i|h)]    (4)

The challenge is how to calculate p(z_i|s) and p(z_i|h). The spam training set contains S spam e-mails in total; given each spam training e-mail e_j (j = 1, ..., S), the probability of each topic z_i can be read from the matrix θ^(s), i.e. p(z_i|e_j, s) = θ^(s)_{j,i}. We define the probability of each e-mail as p(e_j|s) = 1/S. Hence we can compute p(z_i|s) using the law of total probability:

p(z_i|s) = Σ_{j=1}^{S} p(z_i|e_j, s) p(e_j|s) = (1/S) Σ_{j=1}^{S} θ^(s)_{j,i}    (5)
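Under these definitions, the topic spamicity p(s|z_i) reduces to a ratio of column means of the θ matrices (Eq. (5) and its ham-side analogue, combined as in Eq. (4)). A hedged NumPy sketch with invented toy θ matrices:

```python
import numpy as np

def topic_spamicity(theta_spam, theta_ham):
    """p(s|z_i) for every topic i: the mean weight of topic i over the spam
    training mails, divided by that mean plus the ham-side mean."""
    mean_s = theta_spam.mean(axis=0)  # (1/S) * sum_j theta^(s)_{j,i}
    mean_h = theta_ham.mean(axis=0)   # (1/H) * sum_j theta^(h)_{j,i}
    return mean_s / (mean_s + mean_h)

# Toy document-topic matrices (rows: e-mails, columns: T = 3 topics);
# in practice these come from LDA inference, not hand-written values.
theta_s = np.array([[0.7, 0.2, 0.1],
                    [0.8, 0.1, 0.1]])
theta_h = np.array([[0.1, 0.2, 0.7],
                    [0.1, 0.3, 0.6]])
print(topic_spamicity(theta_s, theta_h))  # topic 0 spam-leaning, topic 2 ham-leaning
```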

The probability p(z_i|h) can be calculated in the same way, so Eq. (4) becomes:

p(s|z_i) = [(1/S) Σ_{j=1}^{S} θ^(s)_{j,i}] / [(1/S) Σ_{j=1}^{S} θ^(s)_{j,i} + (1/H) Σ_{j=1}^{H} θ^(h)_{j,i}]    (6)

Each e-mail e_j (j = 1, ..., N) of the N new, unseen e-mails is likewise represented as e_j = (z_1, ..., z_T), where the value of each z_i is p(z_i|e_j) = θ^(n)_{j,i}. We then need to select the top k most representative topics of e_j to calculate the probability that e-mail e_j is spam. This is achieved with the following algorithm for each e-mail e_j:

(1) For each topic z_i of the T topics, if 0.45 < p(s|z_i) < 0.55, add topic z_i to the CandidateTopicSet (CTS).

(2) Rearrange the topics of e_j in descending order of the values p(z_i|e_j); save the result as the TopicOrderList (TOL).

(3) For each topic in TOL, in order, add the topics that are also in CTS to the AvailableTopicList (ATL).

(4) Select the top k topics from ATL.

The probability that e-mail e_j is spam is then computed from all of the top k topics. Following the form of Eq. (3), the final equation is:

p(s|e_j) = Π_{i=1}^{k} p(s|z_i) / [Π_{i=1}^{k} p(s|z_i) + Π_{i=1}^{k} (1 - p(s|z_i))]    (7)

4 Experiment and Evaluation

4.1 Datasets and Experimental Setup

We use the six datasets collectively called the Enron-Spam datasets, developed by Metsis et al. [8]; like Ling-Spam and SpamAssassin, they are publicly available and non-encoded. Each of the six Enron-Spam datasets consists of a ham set and a spam set, and each message is in a separate text file. The ham collections of these six datasets were taken from six Enron users, and each was paired with a spam collection. Hereafter, we refer to the six Enron-Spam datasets as Enron 1, Enron 2, ..., Enron 6. Phan's GibbsLDA [19] is used for estimation and inference on the datasets.
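Before the experimental details, the selection steps (1)-(4) and the scoring of Eq. (7) from Section 3 can be sketched compactly. The spamicities and topic weights below are invented toy values, not estimates from any dataset:

```python
import numpy as np

def classify_email(theta_e, p_s_z, k, lo=0.45, hi=0.55):
    """Score one e-mail by the four selection steps plus Eq. (7).
    theta_e: topic weights p(z_i|e) of the e-mail; p_s_z: per-topic spamicities."""
    # (1) CandidateTopicSet: topics whose spamicity lies in (lo, hi),
    # following the text of step (1) literally.
    cts = {i for i, p in enumerate(p_s_z) if lo < p < hi}
    # (2) TopicOrderList: topics in descending order of p(z_i|e).
    tol = np.argsort(theta_e)[::-1]
    # (3)-(4) AvailableTopicList: keep the first k ordered topics also in CTS.
    atl = [i for i in tol if i in cts][:k]
    # Eq. (7): combine the selected topic spamicities.
    num = np.prod([p_s_z[i] for i in atl])
    return num / (num + np.prod([1 - p_s_z[i] for i in atl]))

# Toy example with T = 4 topics.
p_s_z = [0.9, 0.5, 0.48, 0.52]
theta_e = np.array([0.4, 0.3, 0.2, 0.1])
print(classify_email(theta_e, p_s_z, k=2))  # 0.48: below the 0.5 threshold, likely ham
```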
The Dirichlet parameter β is held constant at 0.1 throughout, while α = 50/T, where T is the number of topics of the LDA model; we experiment with T = {10, 20, 50, 100, 200}. The Gibbs sampling is stopped after 2000 steps for estimation on the Unified training set, and after 1000 steps for inference on the ham training set, spam training set, and test set. In the testing phase, the top k most representative topics are selected for each e-mail to calculate the probability that the test e-mail is spam; we experiment with k = 1, ..., Length(ATL) - 1. Each learning model of each dataset is denoted M_{T,k}. The threshold is 0.5. If the probability of the

e-mail is lower than the threshold, it is considered likely ham; otherwise it is considered likely spam. 10-fold cross validation is applied in our experiments.

During these experiments, the curves of the topic probability distributions under the different models of each dataset are obtained. A group of curves with T = 20 is shown in Fig. 2. These curves clearly reveal that the probability of the same topic occurring differs between the categories, which is direct evidence of the correctness and feasibility of our approach.

Fig. 2: The curves of topic probability under the different models for each dataset, with T = 20 (panels (a)-(f): Enron 1-6)

4.2 Evaluation and Comparison

We first evaluate each model M_{T,k}. Because the datasets belong to different users and each has a different ham-spam ratio, the best performing learning model may also differ. To evaluate each model M_{T,k} of each dataset, we present the evaluation results as curves. Since the k values of the best models are all less than 7, the F-measure curves are drawn within the range k = {1, ..., 7} to facilitate the comparison. By observing the F-measure curves shown in Fig. 3, the best performing model can be selected for each dataset. Table 1 shows these selected models as well as the corresponding F-measure values. Our method achieves its best result on Enron 4, which supports Metsis et al.'s [8] view that some datasets (e.g., Enron 4) are easier than others (e.g., Enron 1). We use only k = 1 topic for detection on Enron 4, compared with k = 5 topics on Enron 1.
We also find that the models with T = 10 and T = 200 never perform best, which shows that too small or too large a T is not appropriate.
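The quantities compared in this section, spam recall, ham recall, and the F-measure used to pick the best models, can be computed as in this small sketch; the counts are hypothetical, and the F-measure is taken as the usual harmonic mean of spam precision and spam recall:

```python
def recalls_and_f1(tp_spam, fn_spam, tp_ham, fn_ham):
    """Spam recall, ham recall and F-measure from per-class counts.
    fn_ham are ham mails classified as spam, i.e. false positives for spam."""
    spam_recall = tp_spam / (tp_spam + fn_spam)
    ham_recall = tp_ham / (tp_ham + fn_ham)
    spam_precision = tp_spam / (tp_spam + fn_ham)
    f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)
    return spam_recall, ham_recall, f1

# Hypothetical counts for one test fold.
print(recalls_and_f1(tp_spam=95, fn_spam=5, tp_ham=97, fn_ham=3))
```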

Fig. 3: The F-measure curves of each model for each dataset (panels (a)-(f): Enron 1-6, one curve per value of T, plotted over k = 1, ..., 7)

Table 1: Best model for each dataset

Datasets    Best model    F-measure (%)
Enron 1     M_{100,5}     97.52
Enron 2     M_{20,3}      98.10
Enron 3     M_{50,1}      98.59
Enron 4     M_{50,1}      99.47
Enron 5     M_{50,1}      98.53
Enron 6     M_{100,1}     98.46

The prediction results of the best models are taken as the best results of our method. To evaluate its filtering capability, we compare it with two term-based spam filtering methods: the PG Bayesian classifier, used as the baseline, and the Multinomial Naive Bayes with Boolean attributes (MN Bool), shown to be the best Naive Bayes classifier in [8]. The best spam and ham recall of the PG Bayesian method are selected as the baseline. Metsis et al. also evaluated MN Bool on the Enron-Spam datasets, so the experimental results from [8] are used directly as another reference. The threshold in all three methods is 0.5. Tables 2 and 3 list the spam and ham recall, respectively, of the three methods on the six datasets. The tables show that both PG Bayesian and our model outperform the MN Bool method, which uses 3000 attributes, and that our model is the best of the three. Although PG Bayesian also achieves better results than MN Bool, it fails to explore the semantic relationships. Our model not only explores the semantic relationships but also obtains better predictions using at most 5 topics. These results demonstrate the superiority of our model.

Table 2: Spam recall (%) comparisons

Method        Enron 1   Enron 2   Enron 3   Enron 4   Enron 5   Enron 6   Avg
MN Bool       96.00     96.68     96.94     97.79     99.69     98.10     97.53
PG Bayesian   96.53     95.01     96.82     98.91     99.02     98.83     97.52
Our Model     97.67     98.12     98.65     99.52     99.54     99.26     98.79

Table 3: Ham recall (%) comparisons

Method        Enron 1   Enron 2   Enron 3   Enron 4   Enron 5   Enron 6   Avg
MN Bool       95.25     97.83     98.88     99.05     95.65     96.88     97.26
PG Bayesian   96.83     97.16     98.24     99.19     96.46     96.23     97.35
Our Model     97.36     97.89     98.53     99.42     97.48     97.64     98.05

5 Conclusion and Future Work

In this paper, a Bayesian topic model is proposed for spam filtering. Using LDA, each e-mail is represented as a vector of topics, and based on this representation a Bayesian method is used to discover the relationship between the topics and spam. Testing on the Enron-Spam datasets shows that our model is better than the baseline and can detect the internal semantics of spam messages. In future work, we will test the Bayesian topic model in other application fields, such as document classification.

References

[1] Mikko T. Siponen, Carl Stucke, Effective anti-spam strategies in companies: An international study, In Proceedings of the 39th Hawaii International Conference on System Sciences, Kauai, HI, USA, 2006
[2] Evangelos Moustakas, C. Ranganathan, Penny Duquenoy, Combating spam through legislation: A comparative analysis of US and European approaches, In CEAS 2005 - Second Conference on Email and Anti-Spam, Stanford University, California, USA, 2005
[3] Xavier Carreras, Lluís Màrquez, Boosting trees for anti-spam email filtering, In Proceedings of the 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, BG, 2001
[4] Harris Drucker, Donghui Wu, Vladimir Vapnik, Support vector machines for spam categorization, IEEE Transactions on Neural Networks, 10(5), 1999, 1048-1054
[5] A.
Kolcz, J. Alspector, SVM-based filtering of e-mail spam with content-specific misclassification costs, In Proceedings of the ICDM Workshop on Text Mining, 2001
[6] Yuewu Shen, Guanglu Sun, Haoliang Qi, Xiaoning He, Using feature selection to speed up online SVM based spam filtering, In International Conference on Asian Language Processing, IALP 2010, Harbin, Heilongjiang, China, 2010, 142-145
[7] Paul Graham, A plan for spam, Available at: http://paulgraham.com/spam.html, August 2002

[8] Vangelis Metsis, Ion Androutsopoulos, Georgios Paliouras, Spam filtering with naive Bayes - which naive Bayes? In CEAS 2006 - Third Conference on Email and Anti-Spam, Mountain View, California, USA, 2006
[9] Igor Santos, Carlos Laorden, Borja Sanz, Pablo Garcia Bringas, Enhanced topic-based vector space model for semantics-aware spam filtering, Expert Systems with Applications, 39(1), 2012, 437-444
[10] David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research, 3, 2003, 993-1022
[11] István Bíró, Jácint Szabó, András A. Benczúr, Latent Dirichlet allocation in web spam filtering, In AIRWeb '08: Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, Beijing, China, 2008, 29-32
[12] Thomas Hofmann, Probabilistic latent semantic indexing, In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval, 1999, 50-57
[13] Thomas Minka, John Lafferty, Expectation-propagation for the generative aspect model, In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, 2002, 352-359
[14] T. L. Griffiths, M. Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences, 101 (Suppl. 1), 2004, 5228-5235
[15] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz, A Bayesian approach to filtering junk e-mail, In AAAI-98 Workshop on Learning for Text Categorization, 1998, 55-62
[16] P. Pantel, D. Lin, SpamCop: A spam classification and organization program, In Proceedings of the AAAI Workshop on Learning for Text Categorization, 1998
[17] George H.
John, Pat Langley, Estimating continuous distributions in Bayesian classifiers, In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, 1995, 338-345
[18] Karl-Michael Schneider, On word frequency information and negative evidence in naive Bayes text classification, In EsTAL, Lecture Notes in Computer Science, Vol. 3230, 2004, 474-486
[19] Xuan-Hieu Phan, Cam-Tu Nguyen, GibbsLDA++: A C/C++ implementation of latent Dirichlet allocation (LDA) using Gibbs sampling for parameter estimation and inference, Available at: http://gibbslda.sourceforge.net/