A Bayesian Topic Model for Spam Filtering
Journal of Information & Computational Science 10:12 (2013), August 1, 2013. Available at http://www.jofcis.com

Zhiying Zhang, Xu Yu, Lixiang Shi, Li Peng, Zhixing Huang
School of Computer and Information Science, Southwest University, Chongqing 400715, China

Abstract

Spam is one of the major problems of today's Internet: it causes financial damage to companies and annoys individual users. Among the approaches developed to detect spam, content-based machine learning algorithms are important and popular. However, these algorithms are trained on statistical representations of the terms that appear in e-mails and are unable to account for the underlying semantics of the terms within the messages. In this paper, we present a Bayesian topic model to address these limitations. We explore the use of semantics in spam filtering by representing e-mails as vectors of topics obtained with a topic model, Latent Dirichlet Allocation (LDA). Based upon this representation, the relationship between the topics and spam can be discovered with a Bayesian method. We test this model on the Enron-Spam datasets, and the results show that the proposed model performs better than the baseline and can detect the internal semantics of spam messages.

Keywords: Spam Detection; Latent Dirichlet Allocation; Bayesian Topic Model

1 Introduction

Electronic mail (e-mail) is one of the most important and powerful means of modern communication. However, over the past decade users have been plagued by spam, also known as junk e-mail or Unsolicited Bulk E-mail (UBE). Spam causes many problems: it makes users waste time looking through and sorting out additional e-mails [1], brings financial loss to companies through the misuse of traffic, storage space and computational power [1], and raises security and legal problems by spreading malicious software, advertising pornography, pyramid schemes, etc. [2].
Many techniques have been proposed to deal with spam. Content-based machine learning algorithms are important and popular among them, including algorithms considered top performers in text classification, such as Boosting [3], Support Vector Machines [4, 5, 6], and Bayesian methods [7, 8].

Project supported by the Natural Science Foundation Project of CQ CSTC (No. CSTC212JJB412), the Scientific Research Foundation for the Returned Overseas Chinese Scholars (No. 2911) and the Fundamental Research Funds for the Central Universities (No. SWU139265).
Corresponding author. E-mail address: huangzx@swu.edu.cn (Zhixing Huang)
Copyright 2013 Binary Information Press. DOI: /jics212279
Although e-mails are usually represented as sequences of words, there are relationships between words on a semantic level that also affect e-mails [9]. However, content-based machine learning algorithms are trained on statistical representations of the terms that appear in the e-mails and are unable to account for the underlying semantics of the e-mails. To address these limitations, Santos et al. [9] proposed representing e-mails with the enhanced Topic-based Vector Space Model (eTVSM) and achieved satisfactory results on the Ling-Spam dataset. However, eTVSM is an ontology-based method, which may limit its effectiveness when it encounters more complicated unseen messages. Furthermore, Ling-Spam has the disadvantage that its ham messages are topic-specific, which can lead to over-optimistic estimates of the performance of learning-based spam filters.

In contrast, we present a Bayesian topic model that introduces the topic model Latent Dirichlet Allocation (LDA) [10] to mine the semantics of e-mails. LDA is a generative probabilistic model of a corpus and is not limited by the weaknesses of an ontology. LDA models every document as a distribution over topics, and every topic as a distribution over words. These topics can reflect the semantics of a document better than terms. The basic idea of our approach is as follows: we use a previously estimated LDA model to make inference on new unseen e-mails to obtain the topic distribution of each e-mail. Each e-mail can then be treated as a vector of topics rather than terms. As topics have a deeper relationship with the content of an e-mail, we can use a Bayesian method to discover the relationship between the topics and spam. A more detailed description is given in Section 3. Our model may appear similar to the method proposed by Bíró et al. [11] because we also use LDA; however, the model we present is completely different from theirs.
The remainder of this paper is organized as follows. Section 2 introduces the basic theory. Section 3 describes the proposed methodology. Section 4 details the performed experiments and presents the results. Finally, Section 5 concludes and outlines avenues for future work.

2 Basic Theory

The basic theory includes the LDA topic model, which is used to obtain the topic distributions of e-mails, and the Bayesian method, which can discover the relationship between words and spam. A modification of this Bayesian method is used in our approach to discover the relationship between topics and spam.

2.1 Latent Dirichlet Allocation

There are D documents of arbitrary length. A document d is a vector of N_d words, W_d, where each word w_{id} is chosen from a vocabulary of size V. The generation of a document collection in LDA is modeled as a three-step process. First, for each document, a distribution over topics is sampled from a Dirichlet distribution. Second, for each word in the document, a single topic is chosen according to this distribution. Finally, each word is sampled from a multinomial distribution over words specific to the sampled topic. This generative process corresponds to the hierarchical Bayesian model shown (using plate notation) in Fig. 1.
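For concreteness, the three sampling steps can be sketched in code. The corpus sizes and document lengths below are arbitrary toy values; only alpha = 50/T and beta = 0.1 mirror the settings the paper uses later.

```python
import numpy as np

# Toy simulation of LDA's three-step generative process. D, V and the
# document lengths are arbitrary illustrative choices; alpha = 50/T and
# beta = 0.1 follow the paper's later experimental settings.
rng = np.random.default_rng(0)

D, T, V = 5, 3, 20            # documents, topics, vocabulary size
alpha, beta = 50.0 / T, 0.1   # symmetric Dirichlet hyperparameters

# One multinomial over the V words per topic: phi_t ~ Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=T)

corpus = []
for d in range(D):
    theta = rng.dirichlet(np.full(T, alpha))           # 1) topic mix for document d
    n_words = rng.integers(10, 30)                     # arbitrary document length
    z = rng.choice(T, size=n_words, p=theta)           # 2) a topic per word position
    words = [int(rng.choice(V, p=phi[t])) for t in z]  # 3) a word from that topic
    corpus.append(words)
```

Estimation reverses this process: given only `corpus`, algorithms such as Gibbs sampling recover `phi` and the per-document `theta`.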
Fig. 1: The hierarchical Bayesian model for LDA (plate notation, with hyperparameters α and β, topic mixtures θ, topic-word distributions φ, topic assignments z and words w)

In this model, φ denotes the matrix of topic distributions, with a multinomial distribution over V vocabulary items for each of T topics being drawn independently from a symmetric Dirichlet(β) prior. θ is the matrix of document-specific mixture weights for these T topics, each being drawn independently from a symmetric Dirichlet(α) prior. For each word, z denotes the topic responsible for generating that word, drawn from the θ distribution for that document, and w is the word itself, drawn from the topic distribution φ corresponding to z. Estimating φ and θ provides information about the topics that participate in a corpus and the weights of those topics in each document, respectively. A variety of algorithms have been used to estimate these parameters, from basic expectation-maximization [12] to approximate inference methods like variational EM [10], expectation propagation [13], and Gibbs sampling [14].

2.2 The Bayesian Method

The Bayesian method proposed by Paul Graham [7] is very different from the usual Naive Bayes classifiers [15, 16, 17, 18] and is able to greatly improve the false positive rate. In this paper, this method is referred to as the PG Bayesian classifier. The PG Bayesian classifier discovers the relationship between words and spam. Each word in an e-mail (or only the most interesting words) contributes to the e-mail's spam probability. This contribution of one word, also called the spamicity of the word, is calculated using Bayes' theorem:

p(s|w) = p(w|s)p(s) / (p(w|s)p(s) + p(w|h)p(h))    (1)

In Eq. (1), p(s) is the overall probability that any given e-mail is spam, and p(h) is the overall probability that any given e-mail is ham. p(w|s) is the probability that the given word appears in spam training e-mails, estimated by dividing the number of spam training e-mails that contain this word by the total number of spam training e-mails.
p(w|h) is the probability that the given word appears in ham training e-mails, estimated by dividing the number of ham training e-mails that contain this word by the total number of ham training e-mails. The PG Bayesian classifier assumes that there is no a priori reason for an incoming message to be spam rather than ham, and therefore takes p(s) = p(h) = 0.5. This assumption simplifies Eq. (1) to:

p(s|w) = p(w|s) / (p(w|s) + p(w|h))    (2)
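A minimal sketch of this spamicity estimate under the p(s) = p(h) = 0.5 assumption of Eq. (2); the counts below are invented toy numbers, not figures from the paper:

```python
def spamicity(spam_with_word, n_spam, ham_with_word, n_ham):
    """Word spamicity p(s|w) under p(s) = p(h) = 0.5, as in Eq. (2)."""
    p_w_s = spam_with_word / n_spam  # p(w|s): fraction of spam e-mails containing w
    p_w_h = ham_with_word / n_ham    # p(w|h): fraction of ham e-mails containing w
    return p_w_s / (p_w_s + p_w_h)

# A word seen in 80 of 100 spam e-mails but only 5 of 100 ham e-mails:
print(round(spamicity(80, 100, 5, 100), 3))  # 0.941
```

A word that appears equally often in both classes gets a spamicity of exactly 0.5, i.e. it carries no evidence either way.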
The PG Bayesian classifier also assumes that the words present in an e-mail are independent events. With that assumption, another equation can be derived from Bayes' theorem to calculate the probability that the e-mail is spam, taking into consideration N words of the e-mail:

p = p_1 p_2 ... p_N / (p_1 p_2 ... p_N + (1 - p_1)(1 - p_2) ... (1 - p_N))    (3)

In Eq. (3), p indicates how sure the filter is that the e-mail is spam, and p_n (n = 1, ..., N) is the probability p(s|w_n). The result p is usually compared to a given threshold to decide whether the message is spam or not. If p is lower than the threshold, the message is considered likely ham; otherwise it is considered likely spam.

3 Proposed Methodology

Assume a training set consisting of S spam e-mails and H ham e-mails, and a test set consisting of N new unseen e-mails. First, a unified LDA model with T topics is estimated on the overall training set. Then, this unified LDA model is used to make inference separately on the spam training set, the ham training set, and the new unseen e-mails. This yields three LDA models: the Spam LDA model with the topic distributions θ^(s) of the spam e-mails, the Ham LDA model with the topic distributions θ^(h) of the ham e-mails, and the New-E-mails LDA model with the topic distributions θ^(n) of the new e-mails. All three models have T topics, consistent with the topics of the unified LDA model. In the third step, each e-mail e is represented as a vector e = (z_1, ..., z_T), where each z_i (i = 1, ..., T) carries the probability p(z_i|e) that topic z_i occurs in this e-mail. This value can be read directly from the corresponding matrix θ. Intuitively, some of the T topics are more relevant to spam e-mails and some are more relevant to ham e-mails.
In other words, the topics that are more relevant to spam will have a higher probability in each spam training e-mail, and the topics that are more relevant to ham will have a higher probability in each ham e-mail. That means each of the T topics has a spamicity, just like words do. Following Eq. (2), the spamicity of a topic z_i (i = 1, ..., T) can be calculated as:

p(s|z_i) = p(z_i|s) / (p(z_i|s) + p(z_i|h))    (4)

The challenge is how to calculate the probabilities p(z_i|s) and p(z_i|h). The spam training set contains S spam e-mails in total. Given each spam training e-mail e_j (j = 1, ..., S), the probability of each topic z_i can be obtained from the matrix θ^(s), i.e., p(z_i|e_j, s) = θ^(s)_{j,i}. We define the probability of each e-mail e_j as p(e_j|s) = 1/S. Hence we can compute p(z_i|s) using the law of total probability:

p(z_i|s) = Σ_{j=1}^{S} p(z_i|e_j, s) p(e_j|s) = (1/S) Σ_{j=1}^{S} θ^(s)_{j,i}    (5)
The probability p(z_i|h) can be calculated in the same way, and Eq. (4) then becomes:

p(s|z_i) = [(1/S) Σ_{j=1}^{S} θ^(s)_{j,i}] / [(1/S) Σ_{j=1}^{S} θ^(s)_{j,i} + (1/H) Σ_{j=1}^{H} θ^(h)_{j,i}]    (6)

Each e-mail e_j (j = 1, ..., N) of the N new unseen e-mails is likewise represented as e_j = (z_1, ..., z_T), where the value of each z_i is p(z_i|e_j) = θ^(n)_{j,i}. Then, the top k most representative topics of e_j are selected to calculate the probability that e_j is spam. This is achieved with the following algorithm for each e_j:

(1) For each topic z_i of the T topics, if p(s|z_i) lies outside the uninformative interval (0.45, 0.55), add z_i to the CandidateTopicSet (CTS).

(2) Rearrange the topics of e_j in descending order of the values p(z_i|e_j), and save the result as the TopicOrderList (TOL).

(3) For each topic in TOL, in order, add the topics that are also in CTS to the AvailableTopicList (ATL).

(4) Select the top k topics from ATL.

Then, the probability that e_j is spam can be computed by taking all of the top k topics into consideration. Following the form of Eq. (3), the final equation is:

p(s|e_j) = Π_{i=1}^{k} p(s|z_i) / (Π_{i=1}^{k} p(s|z_i) + Π_{i=1}^{k} (1 - p(s|z_i)))    (7)

4 Experiments and Evaluation

4.1 Datasets and Experimental Setup

We use six datasets collectively called the Enron-Spam datasets, which were developed by Metsis et al. [8] and are publicly available and non-encoded, just like Ling-Spam and SpamAssassin. Each of the six Enron-Spam datasets consists of a ham set and a spam set, and each message is in a separate text file. The ham collections of the six datasets come from six Enron users and were each paired with a spam collection. Hereafter, we refer to the six Enron-Spam datasets as Enron 1, Enron 2, ..., Enron 6. Phan's GibbsLDA++ [19] is used for estimation and inference on the datasets. The Dirichlet parameter β is fixed at 0.1 throughout, while α = 50/T throughout.
T is the number of topics of the LDA model; we experiment with T = {10, 20, 50, 100, 200}. Gibbs sampling is stopped after 2,000 steps for estimation on the unified training set, and after 1,000 steps for inference on the ham training set, spam training set, and test set. In the testing phase, the top k most representative topics are selected for each e-mail to calculate the probability that the test e-mail is spam. We experiment with k = 1, ..., Length(ATL) - 1. Each learning model of each dataset is denoted as M_{T,k}. The threshold is 0.5. If the probability of the e-mail
is lower than the threshold, it is considered likely ham; otherwise it is considered likely spam. 10-fold cross-validation is applied in our experiments.

During the above experiments, the curves of the topic probability distributions under the different models are obtained for each dataset. A group of curves with T = 20 is shown in Fig. 2. These curves clearly reveal that the probability with which the same topic occurs differs across categories, which is direct evidence of the correctness and feasibility of our approach.

Fig. 2: The probability curves of the topics under the different models for each dataset, with T = 20 (panels (a)-(f) correspond to Enron 1-6)

4.2 Evaluation and Comparison

We first evaluate each model M_{T,k}. Because each dataset belongs to a different user and has a different ham-spam ratio, the best-performing learning model may also differ across datasets. To evaluate each model M_{T,k} of each dataset, we present the evaluation results as curves. Since the k values of the best models M_{T,k} are all less than 7, the F-measure curves are drawn for k = 1, ..., 7 to facilitate comparison. By inspecting the F-measure curves shown in Fig. 3, the best-performing model can be selected for each dataset. Table 1 lists these selected models. Our method achieves its best result on Enron 4, which supports Metsis et al.'s [8] view that some datasets (e.g., Enron 4) are easier than others (e.g., Enron 1). We use just 1 topic for detection on Enron 4, in contrast to 5 topics on Enron 1. We also find that the models with T = 10 and T = 200 never perform best, which shows that too small or too large a T is not appropriate.
Fig. 3: The F-measure curves of each model for each dataset

Table 1: Best model for each dataset

    Dataset    Best model M_{T,k}
    Enron 1    T = 100, k = 5
    Enron 2    T = 20
    Enron 3    T = 50
    Enron 4    T = 50, k = 1
    Enron 5    T = 50
    Enron 6    T = 100

The prediction results of the best models are taken as the best results of our method. To evaluate its filtering capability, we compare it with two term-based spam filtering methods. One is the PG Bayesian classifier, which serves as the baseline; the other is Multinomial Naive Bayes with Boolean attributes (MN Bool), which was shown to be the best Naive Bayes classifier in [8]. The best spam and ham recall of the PG Bayesian method are selected as the baseline. Metsis et al. also evaluated MN Bool on the Enron-Spam datasets, so the experimental results from [8] are used directly as another reference. The threshold in all three methods is 0.5. Tables 2 and 3 list the spam and ham recall, respectively, of the three methods on the six datasets. The tables show that both PG Bayesian and our model outperform the MN Bool method, which uses 3,000 attributes, and that our model is the best of the three. Although PG Bayesian also achieves a better result than MN Bool, it does not exploit semantic relationships. Our model, in contrast, not only explores the semantic relationships but also obtains better predictions using a maximum of just 5 topics. These results demonstrate the superiority of our model.
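Putting the pieces of Section 3 together, the scoring procedure of Eqs. (6)-(7) can be sketched end-to-end. The θ matrices below are random toy stand-ins for the output of LDA inference, and the candidate filter assumes that topics with spamicity inside the uninformative 0.45-0.55 band are excluded:

```python
import numpy as np

# Sketch of the topic-spamicity scoring of Eqs. (6)-(7). The theta matrices
# are toy values standing in for LDA inference output; S = H = 6 training
# e-mails and T = 4 topics are arbitrary illustrative choices.
rng = np.random.default_rng(1)
T = 4
theta_s = rng.dirichlet(np.ones(T), size=6)  # topic mixes of spam training e-mails
theta_h = rng.dirichlet(np.ones(T), size=6)  # topic mixes of ham training e-mails

# Eq. (6): topic spamicity from the mean topic weight in each class.
mean_s, mean_h = theta_s.mean(axis=0), theta_h.mean(axis=0)
p_s_z = mean_s / (mean_s + mean_h)

def score(theta_new, p_s_z, k=2, band=(0.45, 0.55)):
    """Eq. (7): spam probability of a new e-mail from its top-k topics.

    Assumes topics whose spamicity falls inside the uninformative
    (0.45, 0.55) band are excluded from the candidate set (CTS)."""
    cts = {i for i, p in enumerate(p_s_z) if p < band[0] or p > band[1]}
    tol = np.argsort(-theta_new)            # TOL: topics by descending weight
    atl = [i for i in tol if i in cts][:k]  # ATL: TOL restricted to CTS, top k
    p = p_s_z[atl]
    prod_s, prod_h = np.prod(p), np.prod(1.0 - p)
    return prod_s / (prod_s + prod_h)

theta_new = rng.dirichlet(np.ones(T))    # topic mix of one unseen e-mail
is_spam = score(theta_new, p_s_z) > 0.5  # decision threshold of 0.5
```

When no topic survives the candidate filter, both products are empty and the score degenerates to 0.5, i.e. the filter abstains at the threshold.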
Table 2: Spam recall (%) of MN Bool, PG Bayesian, and our model on Enron 1-6, with averages

Table 3: Ham recall (%) of MN Bool, PG Bayesian, and our model on Enron 1-6, with averages

5 Conclusion and Future Work

In this paper, a Bayesian topic model is proposed for spam filtering. Using LDA, each e-mail is represented as a vector of topics, and based upon this representation a Bayesian method is used to discover the relationship between the topics and spam. Testing on the Enron-Spam datasets shows that our model performs better than the baseline and can detect the internal semantics of spam messages. In future work, we will test the Bayesian topic model in other application fields, such as document classification.

References

[1] Mikko T. Siponen, Carl Stucke, Effective anti-spam strategies in companies: An international study, In 39th Hawaii International Conference on System Sciences, Kauai, HI, USA, 2006
[2] Evangelos Moustakas, C. Ranganathan, Penny Duquenoy, Combating spam through legislation: A comparative analysis of US and European approaches, In CEAS 2005 - Second Conference on Email and Anti-Spam, July 21-22, 2005, Stanford University, California, USA
[3] Xavier Carreras, Lluís Màrquez, Boosting trees for anti-spam email filtering, In Proceedings of the 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, BG, 2001
[4] Harris Drucker, Donghui Wu, Vladimir Vapnik, Support vector machines for spam categorization, IEEE Transactions on Neural Networks, 10(5), 1999
[5] A. Kolcz, J. Alspector, SVM-based filtering of e-mail spam with content-specific misclassification costs, In Proceedings of the ICDM Workshop on Text Mining, 2001
[6] Yuewu Shen, Guanglu Sun, Haoliang Qi, Xiaoning He, Using feature selection to speed up online SVM based spam filtering, In International Conference on Asian Language Processing, IALP 2010, Harbin, Heilongjiang, China, 2010
[7] Paul Graham, A plan for spam, August 2003
[8] Vangelis Metsis, Ion Androutsopoulos, Georgios Paliouras, Spam filtering with naive Bayes - which naive Bayes? In CEAS 2006 - The Third Conference on Email and Anti-Spam, Mountain View, California, USA, July 27-28, 2006
[9] Igor Santos, Carlos Laorden, Borja Sanz, Pablo Garcia Bringas, Enhanced topic-based vector space model for semantics-aware spam filtering, Expert Syst. Appl., 39(1), 2012
[10] David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res., 3, March 2003
[11] István Bíró, Jácint Szabó, András A. Benczúr, Latent Dirichlet allocation in web spam filtering, In AIRWeb '08: Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, Beijing, China, 2008
[12] Thomas Hofmann, Probabilistic latent semantic indexing, In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval, 1999, 50-57
[13] Thomas Minka, John Lafferty, Expectation-propagation for the generative aspect model, In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, 2002
[14] T. L. Griffiths, M. Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences, 101 (Suppl. 1), April 2004
[15] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz, A Bayesian approach to filtering junk e-mail, In AAAI-98 Workshop on Learning for Text Categorization, 1998
[16] P. Pantel, D. Lin, SpamCop: A spam classification and organization program, In Proceedings of the AAAI Workshop on Learning for Text Categorization, 1998
[17] George H. John, Pat Langley, Estimating continuous distributions in Bayesian classifiers, In UAI, Morgan Kaufmann, 1995
[18] Karl-Michael Schneider, On word frequency information and negative evidence in naive Bayes text classification, In EsTAL, Lecture Notes in Computer Science, Vol. 3230, 2004
[19] Xuan-Hieu Phan, Cam-Tu Nguyen, GibbsLDA++: A C/C++ implementation of latent Dirichlet allocation (LDA) using Gibbs sampling for parameter estimation and inference
More informationSpam Detection System Combining Cellular Automata and Naive Bayes Classifier
Spam Detection System Combining Cellular Automata and Naive Bayes Classifier F. Barigou*, N. Barigou**, B. Atmani*** Computer Science Department, Faculty of Sciences, University of Oran BP 1524, El M Naouer,
More informationLatent Dirichlet Markov Allocation for Sentiment Analysis
Latent Dirichlet Markov Allocation for Sentiment Analysis Ayoub Bagheri Isfahan University of Technology, Isfahan, Iran Intelligent Database, Data Mining and Bioinformatics Lab, Electrical and Computer
More informationNaive Bayes Spam Filtering Using Word-Position-Based Attributes
Naive Bayes Spam Filtering Using Word-Position-Based Attributes Johan Hovold Department of Computer Science Lund University Box 118, 221 00 Lund, Sweden johan.hovold.363@student.lu.se Abstract This paper
More informationAn Efficient Spam Filtering Techniques for Email Account
American Journal of Engineering Research (AJER) e-issn : 2320-0847 p-issn : 2320-0936 Volume-02, Issue-10, pp-63-73 www.ajer.org Research Paper Open Access An Efficient Spam Filtering Techniques for Email
More informationJournal of Information Technology Impact
Journal of Information Technology Impact Vol. 8, No., pp. -0, 2008 Probability Modeling for Improving Spam Filtering Parameters S. C. Chiemeke University of Benin Nigeria O. B. Longe 2 University of Ibadan
More informationSpam Filter Optimality Based on Signal Detection Theory
Spam Filter Optimality Based on Signal Detection Theory ABSTRACT Singh Kuldeep NTNU, Norway HUT, Finland kuldeep@unik.no Md. Sadek Ferdous NTNU, Norway University of Tartu, Estonia sadek@unik.no Unsolicited
More informationagoweder@yahoo.com ** The High Institute of Zahra for Comperhensive Professions, Zahra-Libya
AN ANTI-SPAM SYSTEM USING ARTIFICIAL NEURAL NETWORKS AND GENETIC ALGORITHMS ABDUELBASET M. GOWEDER *, TARIK RASHED **, ALI S. ELBEKAIE ***, and HUSIEN A. ALHAMMI **** * The High Institute of Surman for
More informationEmail Spam Detection A Machine Learning Approach
Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn
More informationArtificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier
International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing
More informationAn Imbalanced Spam Mail Filtering Method
, pp. 119-126 http://dx.doi.org/10.14257/ijmue.2015.10.3.12 An Imbalanced Spam Mail Filtering Method Zhiqiang Ma, Rui Yan, Donghong Yuan and Limin Liu (College of Information Engineering, Inner Mongolia
More informationImmunity from spam: an analysis of an artificial immune system for junk email detection
Immunity from spam: an analysis of an artificial immune system for junk email detection Terri Oda and Tony White Carleton University, Ottawa ON, Canada terri@zone12.com, arpwhite@scs.carleton.ca Abstract.
More informationHow To Develop An Anti Spam Software For Information Attacks
I.J. Intelligent Systems and Applications, 2012, 10, 25-34 Published Online September 2012 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijisa.2012.10.03 Anti-Spam Software for Detecting Information
More informationFiltering Spam E-Mail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques
52 The International Arab Journal of Information Technology, Vol. 6, No. 1, January 2009 Filtering Spam E-Mail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques Alaa El-Halees
More informationRepresentation of Electronic Mail Filtering Profiles: A User Study
Representation of Electronic Mail Filtering Profiles: A User Study Michael J. Pazzani Department of Information and Computer Science University of California, Irvine Irvine, CA 92697 +1 949 824 5888 pazzani@ics.uci.edu
More informationEmail Filters that use Spammy Words Only
Email Filters that use Spammy Words Only Vasanth Elavarasan Department of Computer Science University of Texas at Austin Advisors: Mohamed Gouda Department of Computer Science University of Texas at Austin
More informationData Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationDr. D. Y. Patil College of Engineering, Ambi,. University of Pune, M.S, India University of Pune, M.S, India
Volume 4, Issue 6, June 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Effective Email
More informationAnti-Spam Filter Based on Naïve Bayes, SVM, and KNN model
AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different
More informationAn Efficient Three-phase Email Spam Filtering Technique
An Efficient Three-phase Email Filtering Technique Tarek M. Mahmoud 1 *, Alaa Ismail El-Nashar 2 *, Tarek Abd-El-Hafeez 3 *, Marwa Khairy 4 * 1, 2, 3 Faculty of science, Computer Sci. Dept., Minia University,
More informationA Three-Way Decision Approach to Email Spam Filtering
A Three-Way Decision Approach to Email Spam Filtering Bing Zhou, Yiyu Yao, and Jigang Luo Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2 {zhou200b,yyao,luo226}@cs.uregina.ca
More informationA Content based Spam Filtering Using Optical Back Propagation Technique
A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT
More informationConstructing a User Preference Ontology for Anti-spam Mail Systems
Constructing a User Preference Ontology for Anti-spam Mail Systems Jongwan Kim 1,, Dejing Dou 2, Haishan Liu 2, and Donghwi Kwak 2 1 School of Computer and Information Technology, Daegu University Gyeonsan,
More informationA Game Theoretical Framework for Adversarial Learning
A Game Theoretical Framework for Adversarial Learning Murat Kantarcioglu University of Texas at Dallas Richardson, TX 75083, USA muratk@utdallas Chris Clifton Purdue University West Lafayette, IN 47907,
More informationNaïve Bayesian Anti-spam Filtering Technique for Malay Language
Naïve Bayesian Anti-spam Filtering Technique for Malay Language Thamarai Subramaniam 1, Hamid A. Jalab 2, Alaa Y. Taqa 3 1,2 Computer System and Technology Department, Faulty of Computer Science and Information
More informationDomain Classification of Technical Terms Using the Web
Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using
More informationSavita Teli 1, Santoshkumar Biradar 2
Effective Spam Detection Method for Email Savita Teli 1, Santoshkumar Biradar 2 1 (Student, Dept of Computer Engg, Dr. D. Y. Patil College of Engg, Ambi, University of Pune, M.S, India) 2 (Asst. Proff,
More informationSpam Mail Detection through Data Mining A Comparative Performance Analysis
I.J. Modern Education and Computer Science, 2013, 12, 31-39 Published Online December 2013 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijmecs.2013.12.05 Spam Mail Detection through Data Mining A
More informationIncreasing the Accuracy of a Spam-Detecting Artificial Immune System
Increasing the Accuracy of a Spam-Detecting Artificial Immune System Terri Oda Carleton University terri@zone12.com Tony White Carleton University arpwhite@scs.carleton.ca Abstract- Spam, the electronic
More informationA LVQ-based neural network anti-spam email approach
A LVQ-based neural network anti-spam email approach Zhan Chuan Lu Xianliang Hou Mengshu Zhou Xu College of Computer Science and Engineering of UEST of China, Chengdu, China 610054 zhanchuan@uestc.edu.cn
More informationIDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION
http:// IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION Harinder Kaur 1, Raveen Bajwa 2 1 PG Student., CSE., Baba Banda Singh Bahadur Engg. College, Fatehgarh Sahib, (India) 2 Asstt. Prof.,
More informationCategorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
More informationNon-Parametric Spam Filtering based on knn and LSA
Non-Parametric Spam Filtering based on knn and LSA Preslav Ivanov Nakov Panayot Markov Dobrikov Abstract. The paper proposes a non-parametric approach to filtering of unsolicited commercial e-mail messages,
More informationAN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM
ISSN: 2229-6956(ONLINE) ICTACT JOURNAL ON SOFT COMPUTING, APRIL 212, VOLUME: 2, ISSUE: 3 AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM S. Arun Mozhi Selvi 1 and R.S. Rajesh 2 1 Department
More informationFiltering Spams using the Minimum Description Length Principle
Filtering Spams using the Minimum Description Length Principle ABSTRACT Tiago A. Almeida, Akebo Yamakami School of Electrical and Computer Engineering University of Campinas UNICAMP 13083 970, Campinas,
More informationTightening the Net: A Review of Current and Next Generation Spam Filtering Tools
Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools Spam Track Wednesday 1 March, 2006 APRICOT Perth, Australia James Carpinter & Ray Hunt Dept. of Computer Science and Software
More informationShafzon@yahool.com. Keywords - Algorithm, Artificial immune system, E-mail Classification, Non-Spam, Spam
An Improved AIS Based E-mail Classification Technique for Spam Detection Ismaila Idris Dept of Cyber Security Science, Fed. Uni. Of Tech. Minna, Niger State Idris.ismaila95@gmail.com Abdulhamid Shafi i
More informationDirichlet Processes A gentle tutorial
Dirichlet Processes A gentle tutorial SELECT Lab Meeting October 14, 2008 Khalid El-Arini Motivation We are given a data set, and are told that it was generated from a mixture of Gaussian distributions.
More informationHoodwinking Spam Email Filters
Proceedings of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, January 17-19, 2007 533 Hoodwinking Spam Email Filters WANLI MA, DAT TRAN, DHARMENDRA
More informationSpam Filter: VSM based Intelligent Fuzzy Decision Maker
IJCST Vo l. 1, Is s u e 1, Se p te m b e r 2010 ISSN : 0976-8491(Online Spam Filter: VSM based Intelligent Fuzzy Decision Maker Dr. Sonia YMCA University of Science and Technology, Faridabad, India E-mail
More informationECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift
ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift Sarah Jane Delany 1 and Pádraig Cunningham 2 and Barry Smyth 3 Abstract. While text classification has been identified for some time
More informationPartitioned Logistic Regression for Spam Filtering
Partitioned Logistic Regression for Spam Filtering Ming-wei Chang University of Illinois 201 N Goodwin Ave Urbana, IL, USA mchang21@uiuc.edu Wen-tau Yih Microsoft Research One Microsoft Way Redmond, WA,
More informationThree types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.
Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada
More informationMachine Learning in Spam Filtering
Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.
More informationAn Evaluation of Statistical Spam Filtering Techniques
An Evaluation of Statistical Spam Filtering Techniques Le Zhang, Jingbo Zhu, Tianshun Yao Natural Language Processing Laboratory Institute of Computer Software & Theory Northeastern University, China ejoy@xinhuanet.com,
More informationFacilitating Business Process Discovery using Email Analysis
Facilitating Business Process Discovery using Email Analysis Matin Mavaddat Matin.Mavaddat@live.uwe.ac.uk Stewart Green Stewart.Green Ian Beeson Ian.Beeson Jin Sa Jin.Sa Abstract Extracting business process
More informationCombining Global and Personal Anti-Spam Filtering
Combining Global and Personal Anti-Spam Filtering Richard Segal IBM Research Hawthorne, NY 10532 Abstract Many of the first successful applications of statistical learning to anti-spam filtering were personalized
More informationVCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,
More informationBayes and Naïve Bayes. cs534-machine Learning
Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule
More informationTopic models for Sentiment analysis: A Literature Survey
Topic models for Sentiment analysis: A Literature Survey Nikhilkumar Jadhav 123050033 June 26, 2014 In this report, we present the work done so far in the field of sentiment analysis using topic models.
More informationA Novel Approach of Investigating Deceptive Activities of Developer for Ranking apps
International Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-3 E-ISSN: 2347-2693 A Novel Approach of Investigating Deceptive Activities of Developer for Ranking apps
More informationAUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM
AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM ABSTRACT Luis Alexandre Rodrigues and Nizam Omar Department of Electrical Engineering, Mackenzie Presbiterian University, Brazil, São Paulo 71251911@mackenzie.br,nizam.omar@mackenzie.br
More informationTwitter Content-based Spam Filtering
Twitter Content-based Spam Filtering Igor Santos, Igor Miñambres-Marcos, Carlos Laorden, Patxi Galán-García, Aitor Santamaría-Ibirika, and Pablo G. Bringas DeustoTech-Computing, Deusto Institute of Technology
More informationSpam Filtering with Naive Bayesian Classification
Spam Filtering with Naive Bayesian Classification Khuong An Nguyen Queens College University of Cambridge L101: Machine Learning for Language Processing MPhil in Advanced Computer Science 09-April-2011
More informationThe Optimality of Naive Bayes
The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most
More informationSpamNet Spam Detection Using PCA and Neural Networks
SpamNet Spam Detection Using PCA and Neural Networks Abhimanyu Lad B.Tech. (I.T.) 4 th year student Indian Institute of Information Technology, Allahabad Deoghat, Jhalwa, Allahabad, India abhimanyulad@iiita.ac.in
More information