Filtering Spams using the Minimum Description Length Principle

Transcription

1 Filtering Spams using the Minimum Description Length Principle ABSTRACT Tiago A. Almeida, Akebo Yamakami School of Electrical and Computer Engineering University of Campinas UNICAMP , Campinas, SP, Brazil {tiago, Spam has become an increasingly important problem with a big economic impact in society. Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. In this paper, we present a novel approach to spam filtering based on the minimum description length principle. The proposed model is fast to construct and incrementally updateable. Additionally, we offer an analysis concerning the measurements usually employed to evaluate the quality of the anti-spam classifiers. In this sense, we present a new measure in order to provide a fairer comparison. Furthermore, we conducted an empirical experiment using six well-known, large and public databases. Finally, the results indicate that our approach outperforms the state-of-the-art spam filters. Categories and Subject Descriptors I.5 [Pattern Recognition]: Applications; I.2.7 [Artificial Intelligence]: Natural Language Processing Text analysis General Terms Anti-spam Filtering Keywords Minimum description length, spam filter, machine learning 1. INTRODUCTION is one of the most popular, fastest and cheapest means of communication. It has become a part of everyday life for millions of people, changing the way we work and collaborate. is not only used to support conversation but also as a task manager, document delivery system and archive. The downside of this success is the constantly growing volume of spam we receive. The problem of spams can be quantified in economical terms since many Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and thefull citation on the firstpage. Tocopy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC 10 March 22-26, 2010, Sierre, Switzerland. Copyright 2010 ACM /10/03...$ Jurandy Almeida Institute of Computing University of Campinas UNICAMP , Campinas, SP, Brazil jurandy.almeida@ic.unicamp.br hours are wasted everyday by workers. It is not just the time they waste reading the spam but also the time they spend deleting those messages. Fortunately, many solutions are being proposed to avoid this plague and one of more promising is the use of machine learning techniques for automatically filtering messages [8]. These methods include approaches that are considered top-performers in text categorization like Rocchio [13, 18], Boosting [7], Support Vector Machines (SVM) [10, 14, 12], and naive Bayes classifiers [2, 16]. The two latter currently appear to be the best anti-spam filters presented in the literature [1, 2, 4, 6, 8, 9, 20]. A relatively recent method for inductive inference which is still rarely employed in text categorization tasks is the Minimum Description Length (MDL) principle. It holds that the best explanation, given a limited set of observed data, is the one that permits the greatest compression of the data. MDL methods are particularly well-suited for dealing with model selection, prediction, and estimation problems in situations where the models can be arbitrarily complex, and overfitting the data is a serious concern [3, 11, 17]. In this paper, we present a novel anti-spam classifier based on the minimum description length principle [4]. We conducted an empirical experiment using six well-known, large and public databases. The results indicate that our approach outperforms currently established spam filters. Furthermore, we investigate the most used performance measurements applied for comparing the quality of the antispam filters. In this way, we analyze the advantages of using the Matthews correlation coefficient. The remainder of this paper is organized as follows: Section 2 presents details of our proposed approach. In Section 3, we introduce a brief discussion about the benefits of using the Matthews correlation coefficient as a measure of quality of anti-spam classifiers. Experimental results are showed in Section 4. Finally, Section 5 offers conclusions and directions for future works. 2. SPAM FILTERING BASED ON MINIMUM DESCRIPTION LENGTH The MDL principle is a formalization of Occam s Razor in which the best hypothesis for a given set of data is the one that yields compact representations. The traditional MDL principle states that the preferred model results in the shortest description of the model and the data, given this model. In other words, the model that best compresses the data is selected. This model selection criterion naturally balances the complexity of the model and the degree to which this model fits the data. Let Z be a finite or countable set and let P be a probability distribution on Z. Then there exists a prefix code C for Z such that for all z Z, L C(z) = log 2P(z). 1854

2 C is called the code corresponding to P. Similarly, let C be a prefix code for Z. Then there exists a (possibly defective) probability distribution P such that for all z Z, log 2P (z) = L C (z). P is called the probability distribution corresponding to C. Thus, large probability according to P means small code length according to the code corresponding to P and vice versa [3, 11, 17]. The goal of statistical inference may be cast as trying to find regularity in the data. Regularity may be identified with ability to compress. MDL combines these two insights by viewing learning as data compression: it tells us that, for a given set of hypotheses H and data set D, we should try to find the hypothesis or combination of hypotheses in H that compresses D most [3, 11, 17]. This idea can be applied to all sorts of inductive inference problems, but it turns out to be most fruitful in problems of model selection and, more generally, dealing with overfitting [11]. An important property of MDL methods is that they provide automatically and inherently protect against overfitting and can be used to estimate both the parameters and the structure of a model. In contrast, to avoid overfitting when estimating the structure of a model, traditional methods such as maximum likelihood must be modified and extended with additional, typically adhoc principles [11]. Consider the following example. Suppose we flip a coin 1,000 times and we observe the numbers of heads and tails. We consider two model classes: the first consists of a code that represents each outcome with a 0 for heads or a 1 for tails. This code represents the hypothesis that the coin is fair. The code length according to this code is always exactly 1,000 bits. The second model class consists of all codes that are efficient for a coin with some specific bias, representing the hypothesis that the coin is not fair. Say that we observe 510 heads and 490 tails. Then the code length according to the best code in the second model class is shorter than 1,000 bits. For this reason a naive statistical method might put forward this second hypothesis as a better explanation for the data. However, in an MDL approach we would have to construct a single code based on the hypothesis, we can not just use the best one. A simple way to do it would be to use a two-part code, in which we first specify which element of the model class has the best performance, and then we specify the data using that code. We will need quite a lot of bits to specify which code to use; thus the total codelength based on the second model class would be larger than 1,000 bits. Thus if we follow an MDL approach the conclusion has to be that there is not enough evidence in support of the hypothesis that the coin is biased, even though the best element of the second model class provides better fit to the data. Consult Grunwald [11] for further details. In essence, compression algorithms can be applied to text categorization by building one compression model from the training documents of each class and using these models to evaluate the target document. The minimum description length principle was first employed in the anti-spam filtering task by Bratko et al. [4]. We extended their original work offering improvements in the classification criteria, training process, and especially, proposing a different manner to build the spam and legitimate models. Assuming that each message m is composed by a set of terms m = t 1,..., t n, where each term t k corresponds to a word ( adult, for example), a set of words ( to be removed ), or a single character ( $ ), we can represent each message by a vector x = x 1,..., x n, where x 1,..., x n are values of the attributes X 1,..., X n associated with the terms t 1,..., t n. In the simplest case, each term represents a single word and all attributes are Boolean: X i = 1 if the message contains t i or X i = 0, otherwise. Given a set of pre-classified training messages M, the task is to assign a target m with an unknown label to one of the classes c {spam,legitimate}. So, the method measures the increase of the description length of the data set as a result of the addition of the target document. Finally, it chooses the class for which the description length increase is minimal. Unlike the algorithm proposed by Bratko et al. [4], we consider in this work, each class (model) c as a sequence of terms extracted from the messages and inserted into the training set. Each term t from m has a code length L t based on the sequence of terms presented in the messages of the training set of c. The length of m when assigned to the class c corresponds to the sum of all code lengths associated with each term of m, Lm = P m i=1 Lt i. We calculate L ti = log 2P ti, where P is a probability distribution related with the terms of class. Let n c(t i) the number of times that t i appears in messages of class c, then the probability that any term belongs to c is given by the maximum likelihood estimation: P ti = nc(ti) + 1 n c + 1 where n c corresponds to the sum of n c(t i) for all terms which appear in messages that belongs to c and is the vocabulary size. In this work, we assume that = 2 32, that is, each term in an uncompress mode is a symbol with 32 bits. This estimation reserves a portion of probability to words which the classifier has never seen before [5]. Briefly, the proposed MDL anti-spam filter classify a message by following these steps: 1. Tokenization: the classifier extract all terms of the new message m = {t 1,..., t m }; 2. Compute the increase of the description length when m is assigned to each class c {spam,legitimate}: & m X L m(spam) = i=1 i=1! n spam(t i) + 1 log 2 n spam + 1! m X n legitimate (t i) + L m(legitimate) = & log 1 2 n legitimate if L m(spam) > L m(legitimate), then m is classified as spam; otherwise, m is labeled as legitimate. 4. Training method. In the following, we offer more details about the steps 1 and Preprocessing and Tokenization We did not perform language-specific preprocessing techniques such as word stemming, stop word removal, or case folding, since other researchers found that such techniques tend to hurt spam-filtering accuracy [2, 16, 20]. However, we use an -specific preprocessing before the classification task. In this way, we employ the Jaakko Hyvattis normalizemime 1. This program converts the character set to UTF-8, decoding Base64, Quoted-Printable and URL encoding and adding warn tokens in case of encoding errors. It also appends a copy of HTML/XML message bodies with 1 Available at

3 most tags removed, decodes HTML entities and limits the size of attached binary files. Tokenization is the first stage in the classification pipeline; it involves breaking the text stream into terms ( words ), usually by means of a regular expression. We consider in this work that terms start with a printable character; followed by any number of alphanumeric characters, excluding dots, commas and colons from the middle of the pattern. With this pattern, domain names and mail addresses will be split at dots, so the classifier can recognize a domain even if subdomains vary [19]. As proposed by Drucker et al. [10] and Metsis et al. [16], we do not consider the number of times a term appears in each message. In this way, each term is computed only one time per message it appears. 2.2 Training Method Anti-spam filters generally build their predicting models by learning from examples. A basic training method is to start with an empty model, classify each new sample and train it in the right class if the classification is wrong. This is known as train on error (TOE). An improvement to this method is to train also when the classification is right, but the score is near the boundary that is, train on or near error (TONE) [19]. The advantage of TONE over TOE is that it accelerates the learning process by exposing the filter to additional hardto-classify samples in the same training period. Therefore, we employ the TONE as training method used by the proposed MDL anti-spam filter. A good point of the MDL classifier is that we can start with an empty training set and according to the user feedback the classifier builds the models for each class. Moreover, it is not necessary to keep the messages used for training since the models are incrementally building by the term frequencies [5]. 3. PERFORMANCE MEASUREMENTS According to Cormack [8], the filters should be judged along four dimensions: autonomy, immediacy, spam identification, and non-spam identification. However, it is not obvious how to measure any of these dimensions separately, nor how to combine these measurements into a single one for the purpose of comparing filters. Reasonable standard measures are useful to facilitate comparison, given that the goal of optimizing them does not replace that of finding the most suitable filter for the purpose of spam filtering. Let S and L sets of spam and legitimate messages, the possible prediction results are: true positives (T P) corresponding to the set of spam messages correctly classified, true negatives (T N) the set of legitimate messages correctly classified, false negatives (FN) the set of spam messages incorrectly classified as legitimate, and false positives (FP) the set of legitimate messages incorrectly classified as spam. Some well-known evaluation measurements are: True positive rate (Tpr), True negative rate (Tnr), True negative rate (T nr), Spam precision (Spr), Spam recall (Sre), Legitimate precision (Lpr), Legitimate recall (Lre), area under the ROC curves (1-AUC), LAM [8], precision recall [16], Accuracy rate (Acc) and Total Cost Ratio (TCR) [2]. Nevertheless, failures to identify legitimate and spam messages have materially different consequences. Misclassified non-spam substantially increases the risk that the information contained in the message will be lost, or at least delayed. Exactly how much risk and delay are incurred is difficult to quantify, as are the consequences, which depend on the nature of the message. On the other hand, failures to identify spam also vary in importance, but are generally less important than failures to identify non-spam. Viruses, worms, and phishing messages may be an exception, as they pose 1856 significant risks to the user [8]. In order to take into consideration the asymmetry in the misclassification costs, Androutsopoulos et al. [2] proposed a refinement based on spam recall and precision, to allow the performance evaluation based on a single measure. They consider a false positive as being λ times more costly than false negatives, with λ equals to 1 or 9. Hence, each false positive is accounted as λ mistakes, with the weighted accuracy (Acc w) being given by Acc w = T P + λ T N. S + λ L The total cost ratio (TCR) can be calculated by TCR = S λ FP + FN. It offers an indication of the improvement provided by the filter. Greater T CR indicates better performance, and for TCR < 1, not using the filter is better. 3.1 Matthews correlation coefficient According to Carpinter and Hunt [6] and Cormack and Lynam [9] the value of λ is very difficult to be determined. In particular, it depends on the message once some messages are more important than others, as previously discussed. Furthermore, the problem of using TCR is that it does not return a value inside a predefined range. Consider, for example, two classifiers A and B employed to filter 600 messages (450 spams legitimates, λ = 1). Suppose that A attained a perfect prediction with FP A = FN A = 0, and B classified incorrectly only 3 spam messages as legitimate, thus FP B = 0 and FN B = 3. In this way, TCR A = + and TCR B = 150. Intuitively, we can observe that both classifiers achieved similar performance with a small advantage for A. However, if we analyze only the TCR, we would wrong conclude that A was much better than B. Furthermore, a TCR is not a representative value which we can make strong assumptions about the performance achieved by a single classifier, because it gives us only the information about the improvement provided by using the filter, but not provide an information about how much the classifier could be improved. In order to avoid these characteristics, we propose the use of the Matthews correlation coefficient (MCC) [15]. MCC is used in machine learning as a measure of the quality of binary classifications which provides much more information than TCR. It returns a real value between -1 and +1. A coefficient equals to +1 indicates a perfect prediction; 0, an average random prediction; and -1, an inverse prediction. MCC provides a more balanced evaluation of the prediction than other measures, such as the proportion of correct predictions, especially if the two classes are of different sizes. MCC = ( T P. T N ) ( FP. FN ) p ( T P + FP ).( T P + FN ).( T N + FP ).( T N + FN ), It provides a fairer evaluation since in a real situation the number of spams we receive is much higher than the number of legitimate messages, therefore MCC tends to automatically adjust the how much a false positive error is worst than a false negative one. As the proportionality between the number of spams and legitimate messages increases a false positive tends to be much worst than a false negative. Using the previous example, the classifier A would achieve MCC A = and MCC B = Thus, we can make

4 correct conclusions about the classifiers predictions as much as each performance achieved individually. Furthermore, we can combine MCC with other measures, as precision recall rates, for instance, in order to provide a fairer comparison [1]. 4. EXPERIMENTAL RESULTS We use the six well-known Enron corpora [16] in our experiments. All corpus are composed by real legitimate messages extracted from the mailboxes of six former employees of the Enron Corporation and selected spam messages from different sources. Enron corpora tries to keep the same characteristics of a real user mailbox. The composition of each dataset is shown in Table 1. Table 1: Enron datasets Dataset N o of Legitimate N o of Spam Total Enron 1 3,672 1,500 5,172 Enron 2 4,361 1,496 5,857 Enron 3 4,012 1,500 5,512 Enron 4 1,500 4,500 6,000 Enron 5 1,500 3,675 5,175 Enron 6 1,500 4,500 6,000 Total 16,545 17,171 33,716 Tables 2, 3, 4, 5, 6, and 7 present the performance achieved by each classifier for each Enron dataset. Bold values indicate the highest score. In order to provide a fairer evaluation, we consider the most important measures the spam recall rate (Spr), legitimate recall rate (Lre) and Matthews correlation coefficient (MCC) achieved by each filter. Additionally, we present other measures as legitimate spam precision rates, weighted accuracy (Acc w) and total cost ratio (TCR). In this way, we employed λ = 1 in order to calculate Acc w and TCR, as described in Section 3. The results achieved by the proposed MDL classifier are compared with the ones attained by methods considered the actual top-performers in anti-spam filtering: the Boolean naive Bayes (NB) classifier [2, 16] and linear support vector machine (SVM) with Boolean attributes [10, 12, 14]. Due to paper limitations, we present the results achieved by each evaluated classifier. A comprehensive set of results, including all tables and figures, is available at dt.fee.unicamp.br/~tiago/research/spam/spam.htm. Table 2: Enron 1 Results achieved by each filter Sre(%) Spr(%) Lre(%) Lpr(%) Acc w(%) T CR MCC Regarding the results achieved by the filters, the proposed MDL anti-spam classifier outperforms the other classifiers for the majority corpus used in our empirical evaluation. It is important to realize that in some situations the MDL performs much better than SVM or NB. For instance, for 1857 Table 3: Enron 2 Results achieved by each filter Sre(%) Spr(%) Lre(%) Lpr(%) Acc w(%) T CR MCC Table 4: Enron 3 Results achieved by each filter Sre(%) Spr(%) Lre(%) Lpr(%) Acc w(%) T CR MCC Table 5: Enron 4 Results achieved by each filter Sre(%) Spr(%) Lre(%) Lpr(%) Acc w(%) T CR MCC Table 6: Enron 5 Results achieved by each filter Sre(%) Spr(%) Lre(%) Lpr(%) Acc w(%) T CR MCC Table 7: Enron 6 Results achieved by each filter Sre(%) Spr(%) Lre(%) Lpr(%) Acc w(%) T CR MCC Enron 1 (Table 2) MDL achieved spam recall rate equal to 92% while SVM attained 83.33%, even thought MDL presented better legitimate recall. It means that for Enron 1 MDL was able to recognize more than 8% of spams than SVM, representing an improvement of 10.40%. In real applications this difference is extremely important. Note that,

5 the same result can be found for Enron 5 (Table 6) and Enron 6 (Table 7). Both methods, MDL and SVM, achieved similar performance with no significant statistical difference just for Enron 4 (Table 5). The results indicate that our approach is more efficient to distinguish messages as spams or legitimates. It attained an accuracy rate higher than 95% and high precision recall rates for all datasets indicating that the proposed filter makes few mistakes. We also verify that the MDL classifier achieved high MCC score ( 0.878) for all tested corpus. It indicates that the proposed filter almost accomplished a perfect prediction (MCC = 1.000) and it is much better than not use a filter (MCC = 0.000). Nevertheless, it is important to note that TCR is really not an informative measurement. For instance, for Enron 4 (Table 5), MDL and SVM achieved similar performances (MDL MCC = and SV M MCC = 0.978). However, their TCR are very different (MDL TCR = and SV M TCR = ), besides their precision recall rates are very close. 5. CONCLUSIONS AND FURTHER WORK In this paper, we have presented an anti-spam classifier based on the minimum description length principles. Particularly, we based our research on the filter proposed by Bratko et al. [4]. We have extended their original work offering several improvements mainly in the classification criteria, training process, and proposing a different manner to build the spam and legitimate models. We have conducted empirical experiments using six wellknown, large and public corpora. In addition, we have compared the results achieved by methods considered the actual top-performers in spam filtering and they indicate that our approach outperforms the state-of-the-art spam filters. Furthermore, we have proposed the use of Matthews correlation coefficient (MCC) as the evaluation measurement in order to provide a fairer comparison. We have showed that MCC provides a balanced evaluation of the prediction, especially if the two classes are of different sizes. Moreover, MCC returns a value inside a predefined range which provides more information about the classifiers performance than other measures. Actually, we are conducting more experiments using larger datasets as TREC05, TREC06 and TREC07 corpora [8] in order to reinforce the validation. We also aim to compare our approach with other commercial and open-source anti-spam filters, as Bogofilter, SpamAssassin, OSBF-Lua, among others. Future works should take into consideration that spam filtering is a coevolutionary problem, because while the filter tries to evolve its prediction capacity, the spammers try to evolve their spam messages in order to overreach the classifiers. Hence, an efficient approach should have an effective way to adjust its rules in order to detect the changes of spam features. In this way, collaborative filters could be used to assist the classifier by accelerating the adaptation of the rules and increasing the classifiers performance. 6. ACKNOWLEDGMENTS This work is supported by the Brazilian funding agencies CNPq, CAPES and FAPESP REFERENCES [1] T. Almeida, A. Yamakami, and J. Almeida. Evaluation of approaches for dimensionality reduction applied with naive bayes anti-spam filters. In Proc. of the 8th IEEE Int. Conf. on Machine Learning and Applications, pp. 1 6, Miami, FL, USA, [2] I. Androutsopoulos, G. Paliouras, and E. Michelakis. Learning to filter unsolicited commercial . Technical Report 2004/2, National Centre for Scientific Research, Athens, Greece, [3] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Trans. on Inf. Theory, 44(6): , [4] A. Bratko, G. Cormack, B. Filipic, T. Lynam, and B. Zupan. Spam filtering using statistical data compression models. JMLR, 7: , [5] I.A Braga, M Ladeira. Filtragem Adaptativa de Spam com o Principio Minimum Description Length. In Proc. of the 28th Brazilian Computer Society, pp , Belem, Brazil, 2008 (In Portuguese). [6] J. Carpinter and R. Hunt. Tightening the net: A review of current and next generation spam filtering tools. Computers and Security, 25(8): , [7] X. Carreras and L. Marquez. Boosting trees for anti-spam filtering. In Proc. of the 4th Int. Conf. on Recent Advances in Natural Language Processing, pages 58 64, Tzigov Chark, Bulgaria, [8] G. Cormack. spam filtering: A systematic review. Foundations and Trends in Information Retrieval, 1(4): , [9] G. Cormack and T. Lynam. Online supervised spam filter evaluation. ACM Trans. on Information Systems, 25(3):1 11, [10] H. Drucker, D. Wu, and V. Vapnik. Support vector machines for spam categorization. IEEE Trans. on Neural Networks, 10(5): , September [11] P. Grünwald. A tutorial introduction to the minimum description length principle. In P. Grünwald, I. Myung, and M. Pitt, editors, Advances in Minimum Description Length: Theory and Applications, pp MIT Press, [12] J. Hidalgo. Evaluating cost-sensitive unsolicited bulk categorization. In Proc. of the 17th ACM SAC, pp , Madrid, Spain, [13] T. Joachims. A probabilistic analysis of the rocchio algorithm with tfidf for text categorization. In Proc. of 14th ICML, pp , Nashville, TN, USA, [14] A. Kolcz and J. Alspector. Svm-based filtering of spam with content-specific misclassification costs. In Proc. of the 1st IEEE ICDM, pp. 1 14, San Jose, CA, USA, [15] B. Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta, 405(2): , [16] V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with naive bayes - which naive bayes? In Proc. of the 3rd CEAS, pp. 1 5, Mountain View, CA, USA, [17] J. Rissanen. Modeling by shortest data description. Automatica, 14: , [18] R. Schapire, Y. Singer, and A. Singhal. Boosting and rocchio applied to text filtering. In Proc. of the 21st ACM SIGIR, pp , Melbourne, [19] C. Siefkes, F. Assis, S. Chhabra, and W. Yerazunis. Combining winnow and orthogonal sparse bigrams for incremental spam filtering. In Proc. of the 8th PKDD, pp , Pisa, Italy, [20] L. Zhang, J. Zhu, and T. Yao. An evaluation of statistical spam filtering techniques. ACM Trans. on Asian Lang. Information Proc., 3(4): , 2004.