Filtering Spams using the Minimum Description Length Principle

Size: px
Start display at page:

Download "Filtering Spams using the Minimum Description Length Principle"

Transcription

1 Filtering Spams using the Minimum Description Length Principle ABSTRACT Tiago A. Almeida, Akebo Yamakami School of Electrical and Computer Engineering University of Campinas UNICAMP , Campinas, SP, Brazil {tiago, Spam has become an increasingly important problem with a big economic impact in society. Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. In this paper, we present a novel approach to spam filtering based on the minimum description length principle. The proposed model is fast to construct and incrementally updateable. Additionally, we offer an analysis concerning the measurements usually employed to evaluate the quality of the anti-spam classifiers. In this sense, we present a new measure in order to provide a fairer comparison. Furthermore, we conducted an empirical experiment using six well-known, large and public databases. Finally, the results indicate that our approach outperforms the state-of-the-art spam filters. Categories and Subject Descriptors I.5 [Pattern Recognition]: Applications; I.2.7 [Artificial Intelligence]: Natural Language Processing Text analysis General Terms Anti-spam Filtering Keywords Minimum description length, spam filter, machine learning 1. INTRODUCTION is one of the most popular, fastest and cheapest means of communication. It has become a part of everyday life for millions of people, changing the way we work and collaborate. is not only used to support conversation but also as a task manager, document delivery system and archive. The downside of this success is the constantly growing volume of spam we receive. The problem of spams can be quantified in economical terms since many Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and thefull citation on the firstpage. Tocopy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC 10 March 22-26, 2010, Sierre, Switzerland. Copyright 2010 ACM /10/03...$ Jurandy Almeida Institute of Computing University of Campinas UNICAMP , Campinas, SP, Brazil jurandy.almeida@ic.unicamp.br hours are wasted everyday by workers. It is not just the time they waste reading the spam but also the time they spend deleting those messages. Fortunately, many solutions are being proposed to avoid this plague and one of more promising is the use of machine learning techniques for automatically filtering messages [8]. These methods include approaches that are considered top-performers in text categorization like Rocchio [13, 18], Boosting [7], Support Vector Machines (SVM) [10, 14, 12], and naive Bayes classifiers [2, 16]. The two latter currently appear to be the best anti-spam filters presented in the literature [1, 2, 4, 6, 8, 9, 20]. A relatively recent method for inductive inference which is still rarely employed in text categorization tasks is the Minimum Description Length (MDL) principle. It holds that the best explanation, given a limited set of observed data, is the one that permits the greatest compression of the data. MDL methods are particularly well-suited for dealing with model selection, prediction, and estimation problems in situations where the models can be arbitrarily complex, and overfitting the data is a serious concern [3, 11, 17]. In this paper, we present a novel anti-spam classifier based on the minimum description length principle [4]. We conducted an empirical experiment using six well-known, large and public databases. The results indicate that our approach outperforms currently established spam filters. Furthermore, we investigate the most used performance measurements applied for comparing the quality of the antispam filters. In this way, we analyze the advantages of using the Matthews correlation coefficient. The remainder of this paper is organized as follows: Section 2 presents details of our proposed approach. In Section 3, we introduce a brief discussion about the benefits of using the Matthews correlation coefficient as a measure of quality of anti-spam classifiers. Experimental results are showed in Section 4. Finally, Section 5 offers conclusions and directions for future works. 2. SPAM FILTERING BASED ON MINIMUM DESCRIPTION LENGTH The MDL principle is a formalization of Occam s Razor in which the best hypothesis for a given set of data is the one that yields compact representations. The traditional MDL principle states that the preferred model results in the shortest description of the model and the data, given this model. In other words, the model that best compresses the data is selected. This model selection criterion naturally balances the complexity of the model and the degree to which this model fits the data. Let Z be a finite or countable set and let P be a probability distribution on Z. Then there exists a prefix code C for Z such that for all z Z, L C(z) = log 2P(z). 1854

2 C is called the code corresponding to P. Similarly, let C be a prefix code for Z. Then there exists a (possibly defective) probability distribution P such that for all z Z, log 2P (z) = L C (z). P is called the probability distribution corresponding to C. Thus, large probability according to P means small code length according to the code corresponding to P and vice versa [3, 11, 17]. The goal of statistical inference may be cast as trying to find regularity in the data. Regularity may be identified with ability to compress. MDL combines these two insights by viewing learning as data compression: it tells us that, for a given set of hypotheses H and data set D, we should try to find the hypothesis or combination of hypotheses in H that compresses D most [3, 11, 17]. This idea can be applied to all sorts of inductive inference problems, but it turns out to be most fruitful in problems of model selection and, more generally, dealing with overfitting [11]. An important property of MDL methods is that they provide automatically and inherently protect against overfitting and can be used to estimate both the parameters and the structure of a model. In contrast, to avoid overfitting when estimating the structure of a model, traditional methods such as maximum likelihood must be modified and extended with additional, typically adhoc principles [11]. Consider the following example. Suppose we flip a coin 1,000 times and we observe the numbers of heads and tails. We consider two model classes: the first consists of a code that represents each outcome with a 0 for heads or a 1 for tails. This code represents the hypothesis that the coin is fair. The code length according to this code is always exactly 1,000 bits. The second model class consists of all codes that are efficient for a coin with some specific bias, representing the hypothesis that the coin is not fair. Say that we observe 510 heads and 490 tails. Then the code length according to the best code in the second model class is shorter than 1,000 bits. For this reason a naive statistical method might put forward this second hypothesis as a better explanation for the data. However, in an MDL approach we would have to construct a single code based on the hypothesis, we can not just use the best one. A simple way to do it would be to use a two-part code, in which we first specify which element of the model class has the best performance, and then we specify the data using that code. We will need quite a lot of bits to specify which code to use; thus the total codelength based on the second model class would be larger than 1,000 bits. Thus if we follow an MDL approach the conclusion has to be that there is not enough evidence in support of the hypothesis that the coin is biased, even though the best element of the second model class provides better fit to the data. Consult Grunwald [11] for further details. In essence, compression algorithms can be applied to text categorization by building one compression model from the training documents of each class and using these models to evaluate the target document. The minimum description length principle was first employed in the anti-spam filtering task by Bratko et al. [4]. We extended their original work offering improvements in the classification criteria, training process, and especially, proposing a different manner to build the spam and legitimate models. Assuming that each message m is composed by a set of terms m = t 1,..., t n, where each term t k corresponds to a word ( adult, for example), a set of words ( to be removed ), or a single character ( $ ), we can represent each message by a vector x = x 1,..., x n, where x 1,..., x n are values of the attributes X 1,..., X n associated with the terms t 1,..., t n. In the simplest case, each term represents a single word and all attributes are Boolean: X i = 1 if the message contains t i or X i = 0, otherwise. Given a set of pre-classified training messages M, the task is to assign a target m with an unknown label to one of the classes c {spam,legitimate}. So, the method measures the increase of the description length of the data set as a result of the addition of the target document. Finally, it chooses the class for which the description length increase is minimal. Unlike the algorithm proposed by Bratko et al. [4], we consider in this work, each class (model) c as a sequence of terms extracted from the messages and inserted into the training set. Each term t from m has a code length L t based on the sequence of terms presented in the messages of the training set of c. The length of m when assigned to the class c corresponds to the sum of all code lengths associated with each term of m, Lm = P m i=1 Lt i. We calculate L ti = log 2P ti, where P is a probability distribution related with the terms of class. Let n c(t i) the number of times that t i appears in messages of class c, then the probability that any term belongs to c is given by the maximum likelihood estimation: P ti = nc(ti) + 1 n c + 1 where n c corresponds to the sum of n c(t i) for all terms which appear in messages that belongs to c and is the vocabulary size. In this work, we assume that = 2 32, that is, each term in an uncompress mode is a symbol with 32 bits. This estimation reserves a portion of probability to words which the classifier has never seen before [5]. Briefly, the proposed MDL anti-spam filter classify a message by following these steps: 1. Tokenization: the classifier extract all terms of the new message m = {t 1,..., t m }; 2. Compute the increase of the description length when m is assigned to each class c {spam,legitimate}: & m X L m(spam) = i=1 i=1! n spam(t i) + 1 log 2 n spam + 1! m X n legitimate (t i) + L m(legitimate) = & log 1 2 n legitimate if L m(spam) > L m(legitimate), then m is classified as spam; otherwise, m is labeled as legitimate. 4. Training method. In the following, we offer more details about the steps 1 and Preprocessing and Tokenization We did not perform language-specific preprocessing techniques such as word stemming, stop word removal, or case folding, since other researchers found that such techniques tend to hurt spam-filtering accuracy [2, 16, 20]. However, we use an -specific preprocessing before the classification task. In this way, we employ the Jaakko Hyvattis normalizemime 1. This program converts the character set to UTF-8, decoding Base64, Quoted-Printable and URL encoding and adding warn tokens in case of encoding errors. It also appends a copy of HTML/XML message bodies with 1 Available at

3 most tags removed, decodes HTML entities and limits the size of attached binary files. Tokenization is the first stage in the classification pipeline; it involves breaking the text stream into terms ( words ), usually by means of a regular expression. We consider in this work that terms start with a printable character; followed by any number of alphanumeric characters, excluding dots, commas and colons from the middle of the pattern. With this pattern, domain names and mail addresses will be split at dots, so the classifier can recognize a domain even if subdomains vary [19]. As proposed by Drucker et al. [10] and Metsis et al. [16], we do not consider the number of times a term appears in each message. In this way, each term is computed only one time per message it appears. 2.2 Training Method Anti-spam filters generally build their predicting models by learning from examples. A basic training method is to start with an empty model, classify each new sample and train it in the right class if the classification is wrong. This is known as train on error (TOE). An improvement to this method is to train also when the classification is right, but the score is near the boundary that is, train on or near error (TONE) [19]. The advantage of TONE over TOE is that it accelerates the learning process by exposing the filter to additional hardto-classify samples in the same training period. Therefore, we employ the TONE as training method used by the proposed MDL anti-spam filter. A good point of the MDL classifier is that we can start with an empty training set and according to the user feedback the classifier builds the models for each class. Moreover, it is not necessary to keep the messages used for training since the models are incrementally building by the term frequencies [5]. 3. PERFORMANCE MEASUREMENTS According to Cormack [8], the filters should be judged along four dimensions: autonomy, immediacy, spam identification, and non-spam identification. However, it is not obvious how to measure any of these dimensions separately, nor how to combine these measurements into a single one for the purpose of comparing filters. Reasonable standard measures are useful to facilitate comparison, given that the goal of optimizing them does not replace that of finding the most suitable filter for the purpose of spam filtering. Let S and L sets of spam and legitimate messages, the possible prediction results are: true positives (T P) corresponding to the set of spam messages correctly classified, true negatives (T N) the set of legitimate messages correctly classified, false negatives (FN) the set of spam messages incorrectly classified as legitimate, and false positives (FP) the set of legitimate messages incorrectly classified as spam. Some well-known evaluation measurements are: True positive rate (Tpr), True negative rate (Tnr), True negative rate (T nr), Spam precision (Spr), Spam recall (Sre), Legitimate precision (Lpr), Legitimate recall (Lre), area under the ROC curves (1-AUC), LAM [8], precision recall [16], Accuracy rate (Acc) and Total Cost Ratio (TCR) [2]. Nevertheless, failures to identify legitimate and spam messages have materially different consequences. Misclassified non-spam substantially increases the risk that the information contained in the message will be lost, or at least delayed. Exactly how much risk and delay are incurred is difficult to quantify, as are the consequences, which depend on the nature of the message. On the other hand, failures to identify spam also vary in importance, but are generally less important than failures to identify non-spam. Viruses, worms, and phishing messages may be an exception, as they pose 1856 significant risks to the user [8]. In order to take into consideration the asymmetry in the misclassification costs, Androutsopoulos et al. [2] proposed a refinement based on spam recall and precision, to allow the performance evaluation based on a single measure. They consider a false positive as being λ times more costly than false negatives, with λ equals to 1 or 9. Hence, each false positive is accounted as λ mistakes, with the weighted accuracy (Acc w) being given by Acc w = T P + λ T N. S + λ L The total cost ratio (TCR) can be calculated by TCR = S λ FP + FN. It offers an indication of the improvement provided by the filter. Greater T CR indicates better performance, and for TCR < 1, not using the filter is better. 3.1 Matthews correlation coefficient According to Carpinter and Hunt [6] and Cormack and Lynam [9] the value of λ is very difficult to be determined. In particular, it depends on the message once some messages are more important than others, as previously discussed. Furthermore, the problem of using TCR is that it does not return a value inside a predefined range. Consider, for example, two classifiers A and B employed to filter 600 messages (450 spams legitimates, λ = 1). Suppose that A attained a perfect prediction with FP A = FN A = 0, and B classified incorrectly only 3 spam messages as legitimate, thus FP B = 0 and FN B = 3. In this way, TCR A = + and TCR B = 150. Intuitively, we can observe that both classifiers achieved similar performance with a small advantage for A. However, if we analyze only the TCR, we would wrong conclude that A was much better than B. Furthermore, a TCR is not a representative value which we can make strong assumptions about the performance achieved by a single classifier, because it gives us only the information about the improvement provided by using the filter, but not provide an information about how much the classifier could be improved. In order to avoid these characteristics, we propose the use of the Matthews correlation coefficient (MCC) [15]. MCC is used in machine learning as a measure of the quality of binary classifications which provides much more information than TCR. It returns a real value between -1 and +1. A coefficient equals to +1 indicates a perfect prediction; 0, an average random prediction; and -1, an inverse prediction. MCC provides a more balanced evaluation of the prediction than other measures, such as the proportion of correct predictions, especially if the two classes are of different sizes. MCC = ( T P. T N ) ( FP. FN ) p ( T P + FP ).( T P + FN ).( T N + FP ).( T N + FN ), It provides a fairer evaluation since in a real situation the number of spams we receive is much higher than the number of legitimate messages, therefore MCC tends to automatically adjust the how much a false positive error is worst than a false negative one. As the proportionality between the number of spams and legitimate messages increases a false positive tends to be much worst than a false negative. Using the previous example, the classifier A would achieve MCC A = and MCC B = Thus, we can make

4 correct conclusions about the classifiers predictions as much as each performance achieved individually. Furthermore, we can combine MCC with other measures, as precision recall rates, for instance, in order to provide a fairer comparison [1]. 4. EXPERIMENTAL RESULTS We use the six well-known Enron corpora [16] in our experiments. All corpus are composed by real legitimate messages extracted from the mailboxes of six former employees of the Enron Corporation and selected spam messages from different sources. Enron corpora tries to keep the same characteristics of a real user mailbox. The composition of each dataset is shown in Table 1. Table 1: Enron datasets Dataset N o of Legitimate N o of Spam Total Enron 1 3,672 1,500 5,172 Enron 2 4,361 1,496 5,857 Enron 3 4,012 1,500 5,512 Enron 4 1,500 4,500 6,000 Enron 5 1,500 3,675 5,175 Enron 6 1,500 4,500 6,000 Total 16,545 17,171 33,716 Tables 2, 3, 4, 5, 6, and 7 present the performance achieved by each classifier for each Enron dataset. Bold values indicate the highest score. In order to provide a fairer evaluation, we consider the most important measures the spam recall rate (Spr), legitimate recall rate (Lre) and Matthews correlation coefficient (MCC) achieved by each filter. Additionally, we present other measures as legitimate spam precision rates, weighted accuracy (Acc w) and total cost ratio (TCR). In this way, we employed λ = 1 in order to calculate Acc w and TCR, as described in Section 3. The results achieved by the proposed MDL classifier are compared with the ones attained by methods considered the actual top-performers in anti-spam filtering: the Boolean naive Bayes (NB) classifier [2, 16] and linear support vector machine (SVM) with Boolean attributes [10, 12, 14]. Due to paper limitations, we present the results achieved by each evaluated classifier. A comprehensive set of results, including all tables and figures, is available at dt.fee.unicamp.br/~tiago/research/spam/spam.htm. Table 2: Enron 1 Results achieved by each filter Sre(%) Spr(%) Lre(%) Lpr(%) Acc w(%) T CR MCC Regarding the results achieved by the filters, the proposed MDL anti-spam classifier outperforms the other classifiers for the majority corpus used in our empirical evaluation. It is important to realize that in some situations the MDL performs much better than SVM or NB. For instance, for 1857 Table 3: Enron 2 Results achieved by each filter Sre(%) Spr(%) Lre(%) Lpr(%) Acc w(%) T CR MCC Table 4: Enron 3 Results achieved by each filter Sre(%) Spr(%) Lre(%) Lpr(%) Acc w(%) T CR MCC Table 5: Enron 4 Results achieved by each filter Sre(%) Spr(%) Lre(%) Lpr(%) Acc w(%) T CR MCC Table 6: Enron 5 Results achieved by each filter Sre(%) Spr(%) Lre(%) Lpr(%) Acc w(%) T CR MCC Table 7: Enron 6 Results achieved by each filter Sre(%) Spr(%) Lre(%) Lpr(%) Acc w(%) T CR MCC Enron 1 (Table 2) MDL achieved spam recall rate equal to 92% while SVM attained 83.33%, even thought MDL presented better legitimate recall. It means that for Enron 1 MDL was able to recognize more than 8% of spams than SVM, representing an improvement of 10.40%. In real applications this difference is extremely important. Note that,

5 the same result can be found for Enron 5 (Table 6) and Enron 6 (Table 7). Both methods, MDL and SVM, achieved similar performance with no significant statistical difference just for Enron 4 (Table 5). The results indicate that our approach is more efficient to distinguish messages as spams or legitimates. It attained an accuracy rate higher than 95% and high precision recall rates for all datasets indicating that the proposed filter makes few mistakes. We also verify that the MDL classifier achieved high MCC score ( 0.878) for all tested corpus. It indicates that the proposed filter almost accomplished a perfect prediction (MCC = 1.000) and it is much better than not use a filter (MCC = 0.000). Nevertheless, it is important to note that TCR is really not an informative measurement. For instance, for Enron 4 (Table 5), MDL and SVM achieved similar performances (MDL MCC = and SV M MCC = 0.978). However, their TCR are very different (MDL TCR = and SV M TCR = ), besides their precision recall rates are very close. 5. CONCLUSIONS AND FURTHER WORK In this paper, we have presented an anti-spam classifier based on the minimum description length principles. Particularly, we based our research on the filter proposed by Bratko et al. [4]. We have extended their original work offering several improvements mainly in the classification criteria, training process, and proposing a different manner to build the spam and legitimate models. We have conducted empirical experiments using six wellknown, large and public corpora. In addition, we have compared the results achieved by methods considered the actual top-performers in spam filtering and they indicate that our approach outperforms the state-of-the-art spam filters. Furthermore, we have proposed the use of Matthews correlation coefficient (MCC) as the evaluation measurement in order to provide a fairer comparison. We have showed that MCC provides a balanced evaluation of the prediction, especially if the two classes are of different sizes. Moreover, MCC returns a value inside a predefined range which provides more information about the classifiers performance than other measures. Actually, we are conducting more experiments using larger datasets as TREC05, TREC06 and TREC07 corpora [8] in order to reinforce the validation. We also aim to compare our approach with other commercial and open-source anti-spam filters, as Bogofilter, SpamAssassin, OSBF-Lua, among others. Future works should take into consideration that spam filtering is a coevolutionary problem, because while the filter tries to evolve its prediction capacity, the spammers try to evolve their spam messages in order to overreach the classifiers. Hence, an efficient approach should have an effective way to adjust its rules in order to detect the changes of spam features. In this way, collaborative filters could be used to assist the classifier by accelerating the adaptation of the rules and increasing the classifiers performance. 6. ACKNOWLEDGMENTS This work is supported by the Brazilian funding agencies CNPq, CAPES and FAPESP REFERENCES [1] T. Almeida, A. Yamakami, and J. Almeida. Evaluation of approaches for dimensionality reduction applied with naive bayes anti-spam filters. In Proc. of the 8th IEEE Int. Conf. on Machine Learning and Applications, pp. 1 6, Miami, FL, USA, [2] I. Androutsopoulos, G. Paliouras, and E. Michelakis. Learning to filter unsolicited commercial . Technical Report 2004/2, National Centre for Scientific Research, Athens, Greece, [3] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Trans. on Inf. Theory, 44(6): , [4] A. Bratko, G. Cormack, B. Filipic, T. Lynam, and B. Zupan. Spam filtering using statistical data compression models. JMLR, 7: , [5] I.A Braga, M Ladeira. Filtragem Adaptativa de Spam com o Principio Minimum Description Length. In Proc. of the 28th Brazilian Computer Society, pp , Belem, Brazil, 2008 (In Portuguese). [6] J. Carpinter and R. Hunt. Tightening the net: A review of current and next generation spam filtering tools. Computers and Security, 25(8): , [7] X. Carreras and L. Marquez. Boosting trees for anti-spam filtering. In Proc. of the 4th Int. Conf. on Recent Advances in Natural Language Processing, pages 58 64, Tzigov Chark, Bulgaria, [8] G. Cormack. spam filtering: A systematic review. Foundations and Trends in Information Retrieval, 1(4): , [9] G. Cormack and T. Lynam. Online supervised spam filter evaluation. ACM Trans. on Information Systems, 25(3):1 11, [10] H. Drucker, D. Wu, and V. Vapnik. Support vector machines for spam categorization. IEEE Trans. on Neural Networks, 10(5): , September [11] P. Grünwald. A tutorial introduction to the minimum description length principle. In P. Grünwald, I. Myung, and M. Pitt, editors, Advances in Minimum Description Length: Theory and Applications, pp MIT Press, [12] J. Hidalgo. Evaluating cost-sensitive unsolicited bulk categorization. In Proc. of the 17th ACM SAC, pp , Madrid, Spain, [13] T. Joachims. A probabilistic analysis of the rocchio algorithm with tfidf for text categorization. In Proc. of 14th ICML, pp , Nashville, TN, USA, [14] A. Kolcz and J. Alspector. Svm-based filtering of spam with content-specific misclassification costs. In Proc. of the 1st IEEE ICDM, pp. 1 14, San Jose, CA, USA, [15] B. Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta, 405(2): , [16] V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with naive bayes - which naive bayes? In Proc. of the 3rd CEAS, pp. 1 5, Mountain View, CA, USA, [17] J. Rissanen. Modeling by shortest data description. Automatica, 14: , [18] R. Schapire, Y. Singer, and A. Singhal. Boosting and rocchio applied to text filtering. In Proc. of the 21st ACM SIGIR, pp , Melbourne, [19] C. Siefkes, F. Assis, S. Chhabra, and W. Yerazunis. Combining winnow and orthogonal sparse bigrams for incremental spam filtering. In Proc. of the 8th PKDD, pp , Pisa, Italy, [20] L. Zhang, J. Zhu, and T. Yao. An evaluation of statistical spam filtering techniques. ACM Trans. on Asian Lang. Information Proc., 3(4): , 2004.

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences

More information

Bayesian Spam Filtering

Bayesian Spam Filtering Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary amaobied@ucalgary.ca http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating

More information

PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering

PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering 2007 IEEE/WIC/ACM International Conference on Web Intelligence PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering Khurum Nazir Juneo Dept. of Computer Science Lahore University

More information

Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information

Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information Technology : CIT 2005 : proceedings : 21-23 September, 2005,

More information

Spam Filtering Based On The Analysis Of Text Information Embedded Into Images

Spam Filtering Based On The Analysis Of Text Information Embedded Into Images Journal of Machine Learning Research 7 (2006) 2699-2720 Submitted 3/06; Revised 9/06; Published 12/06 Spam Filtering Based On The Analysis Of Text Information Embedded Into Images Giorgio Fumera Ignazio

More information

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance Shen Wang, Bin Wang and Hao Lang, Xueqi Cheng Institute of Computing Technology, Chinese Academy of

More information

Naive Bayes Spam Filtering Using Word-Position-Based Attributes

Naive Bayes Spam Filtering Using Word-Position-Based Attributes Naive Bayes Spam Filtering Using Word-Position-Based Attributes Johan Hovold Department of Computer Science Lund University Box 118, 221 00 Lund, Sweden johan.hovold.363@student.lu.se Abstract This paper

More information

Not So Naïve Online Bayesian Spam Filter

Not So Naïve Online Bayesian Spam Filter Not So Naïve Online Bayesian Spam Filter Baojun Su Institute of Artificial Intelligence College of Computer Science Zhejiang University Hangzhou 310027, China freizsu@gmail.com Congfu Xu Institute of Artificial

More information

IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT

IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT M.SHESHIKALA Assistant Professor, SREC Engineering College,Warangal Email: marthakala08@gmail.com, Abstract- Unethical

More information

Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

More information

A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters

A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters Wei-Lun Teng, Wei-Chung Teng

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Towards better accuracy for Spam predictions

Towards better accuracy for Spam predictions Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 czhao@cs.toronto.edu Abstract Spam identification is crucial

More information

Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Three-Way Decisions Solution to Filter Spam Email: An Empirical Study

Three-Way Decisions Solution to Filter Spam Email: An Empirical Study Three-Way Decisions Solution to Filter Spam Email: An Empirical Study Xiuyi Jia 1,4, Kan Zheng 2,WeiweiLi 3, Tingting Liu 2, and Lin Shang 4 1 School of Computer Science and Technology, Nanjing University

More information

Less naive Bayes spam detection

Less naive Bayes spam detection Less naive Bayes spam detection Hongming Yang Eindhoven University of Technology Dept. EE, Rm PT 3.27, P.O.Box 53, 5600MB Eindhoven The Netherlands. E-mail:h.m.yang@tue.nl also CoSiNe Connectivity Systems

More information

Simple Language Models for Spam Detection

Simple Language Models for Spam Detection Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to

More information

Volume 2, Issue 9, September 2014 International Journal of Advance Research in Computer Science and Management Studies

Volume 2, Issue 9, September 2014 International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 9, September 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com

More information

Feature Subset Selection in E-mail Spam Detection

Feature Subset Selection in E-mail Spam Detection Feature Subset Selection in E-mail Spam Detection Amir Rajabi Behjat, Universiti Technology MARA, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012 Feature

More information

Combining Global and Personal Anti-Spam Filtering

Combining Global and Personal Anti-Spam Filtering Combining Global and Personal Anti-Spam Filtering Richard Segal IBM Research Hawthorne, NY 10532 Abstract Many of the first successful applications of statistical learning to anti-spam filtering were personalized

More information

Cosdes: A Collaborative Spam Detection System with a Novel E-Mail Abstraction Scheme

Cosdes: A Collaborative Spam Detection System with a Novel E-Mail Abstraction Scheme IJCSET October 2012 Vol 2, Issue 10, 1447-1451 www.ijcset.net ISSN:2231-0711 Cosdes: A Collaborative Spam Detection System with a Novel E-Mail Abstraction Scheme I.Kalpana, B.Venkateswarlu Avanthi Institute

More information

Spam Detection System Combining Cellular Automata and Naive Bayes Classifier

Spam Detection System Combining Cellular Automata and Naive Bayes Classifier Spam Detection System Combining Cellular Automata and Naive Bayes Classifier F. Barigou*, N. Barigou**, B. Atmani*** Computer Science Department, Faculty of Sciences, University of Oran BP 1524, El M Naouer,

More information

An Approach to Detect Spam Emails by Using Majority Voting

An Approach to Detect Spam Emails by Using Majority Voting An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H-12 Islamabad, Pakistan Usman Qamar Faculty,

More information

Filtering Email Spam in the Presence of Noisy User Feedback

Filtering Email Spam in the Presence of Noisy User Feedback Filtering Email Spam in the Presence of Noisy User Feedback D. Sculley Department of Computer Science Tufts University 161 College Ave. Medford, MA 02155 USA dsculley@cs.tufts.edu Gordon V. Cormack School

More information

Highly Scalable Discriminative Spam Filtering

Highly Scalable Discriminative Spam Filtering Highly Scalable Discriminative Spam Filtering Michael Brückner, Peter Haider, and Tobias Scheffer Max Planck Institute for Computer Science, Saarbrücken, Germany Humboldt Universität zu Berlin, Germany

More information

Spam detection with data mining method:

Spam detection with data mining method: Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification Keywords: ensemble learning, SVM classifier,

More information

A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2

A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2 UDC 004.75 A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2 I. Mashechkin, M. Petrovskiy, A. Rozinkin, S. Gerasimov Computer Science Department, Lomonosov Moscow State University,

More information

Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering

Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering Christian Siefkes 1, Fidelis Assis 2, Shalendra Chhabra 3, and William S. Yerazunis 4 1 Berlin-Brandenburg Graduate School

More information

WE DEFINE spam as an e-mail message that is unwanted basically

WE DEFINE spam as an e-mail message that is unwanted basically 1048 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999 Support Vector Machines for Spam Categorization Harris Drucker, Senior Member, IEEE, Donghui Wu, Student Member, IEEE, and Vladimir

More information

Lasso-based Spam Filtering with Chinese Emails

Lasso-based Spam Filtering with Chinese Emails Journal of Computational Information Systems 8: 8 (2012) 3315 3322 Available at http://www.jofcis.com Lasso-based Spam Filtering with Chinese Emails Zunxiong LIU 1, Xianlong ZHANG 1,, Shujuan ZHENG 2 1

More information

SVM-Based Spam Filter with Active and Online Learning

SVM-Based Spam Filter with Active and Online Learning SVM-Based Spam Filter with Active and Online Learning Qiang Wang Yi Guan Xiaolong Wang School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China Email:{qwang, guanyi,

More information

Tweaking Naïve Bayes classifier for intelligent spam detection

Tweaking Naïve Bayes classifier for intelligent spam detection 682 Tweaking Naïve Bayes classifier for intelligent spam detection Ankita Raturi 1 and Sunil Pranit Lal 2 1 University of California, Irvine, CA 92697, USA. araturi@uci.edu 2 School of Computing, Information

More information

Cosdes: A Collaborative Spam Detection System with a Novel E- Mail Abstraction Scheme

Cosdes: A Collaborative Spam Detection System with a Novel E- Mail Abstraction Scheme IOSR Journal of Engineering (IOSRJEN) e-issn: 2250-3021, p-issn: 2278-8719, Volume 2, Issue 9 (September 2012), PP 55-60 Cosdes: A Collaborative Spam Detection System with a Novel E- Mail Abstraction Scheme

More information

Combining SVM classifiers for email anti-spam filtering

Combining SVM classifiers for email anti-spam filtering Combining SVM classifiers for email anti-spam filtering Ángela Blanco Manuel Martín-Merino Abstract Spam, also known as Unsolicited Commercial Email (UCE) is becoming a nightmare for Internet users and

More information

Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering

Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering Christian Siefkes 1, Fidelis Assis 2, Shalendra Chhabra 3, and William S. Yerazunis 4 1 Berlin-Brandenburg Graduate School

More information

How To Filter Spam Image From A Picture By Color Or Color

How To Filter Spam Image From A Picture By Color Or Color Image Content-Based Email Spam Image Filtering Jianyi Wang and Kazuki Katagishi Abstract With the population of Internet around the world, email has become one of the main methods of communication among

More information

Filtering Junk Mail with A Maximum Entropy Model

Filtering Junk Mail with A Maximum Entropy Model Filtering Junk Mail with A Maximum Entropy Model ZHANG Le and YAO Tian-shun Institute of Computer Software & Theory. School of Information Science & Engineering, Northeastern University Shenyang, 110004

More information

ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift

ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift Sarah Jane Delany 1 and Pádraig Cunningham 2 and Barry Smyth 3 Abstract. While text classification has been identified for some time

More information

Email Spam Detection Using Customized SimHash Function

Email Spam Detection Using Customized SimHash Function International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email

More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

Detecting E-mail Spam Using Spam Word Associations

Detecting E-mail Spam Using Spam Word Associations Detecting E-mail Spam Using Spam Word Associations N.S. Kumar 1, D.P. Rana 2, R.G.Mehta 3 Sardar Vallabhbhai National Institute of Technology, Surat, India 1 p10co977@coed.svnit.ac.in 2 dpr@coed.svnit.ac.in

More information

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

More information

Shafzon@yahool.com. Keywords - Algorithm, Artificial immune system, E-mail Classification, Non-Spam, Spam

Shafzon@yahool.com. Keywords - Algorithm, Artificial immune system, E-mail Classification, Non-Spam, Spam An Improved AIS Based E-mail Classification Technique for Spam Detection Ismaila Idris Dept of Cyber Security Science, Fed. Uni. Of Tech. Minna, Niger State Idris.ismaila95@gmail.com Abdulhamid Shafi i

More information

Web Document Clustering

Web Document Clustering Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,

More information

Some fitting of naive Bayesian spam filtering for Japanese environment

Some fitting of naive Bayesian spam filtering for Japanese environment Some fitting of naive Bayesian spam filtering for Japanese environment Manabu Iwanaga 1, Toshihiro Tabata 2, and Kouichi Sakurai 2 1 Graduate School of Information Science and Electrical Engineering, Kyushu

More information

Search and Information Retrieval

Search and Information Retrieval Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

More information

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,

More information

Email Classification Using Data Reduction Method

Email Classification Using Data Reduction Method Email Classification Using Data Reduction Method Rafiqul Islam and Yang Xiang, member IEEE School of Information Technology Deakin University, Burwood 3125, Victoria, Australia Abstract Classifying user

More information

Single-Class Learning for Spam Filtering: An Ensemble Approach

Single-Class Learning for Spam Filtering: An Ensemble Approach Single-Class Learning for Spam Filtering: An Ensemble Approach Tsang-Hsiang Cheng Department of Business Administration Southern Taiwan University of Technology Tainan, Taiwan, R.O.C. Chih-Ping Wei Institute

More information

Naïve Bayesian Anti-spam Filtering Technique for Malay Language

Naïve Bayesian Anti-spam Filtering Technique for Malay Language Naïve Bayesian Anti-spam Filtering Technique for Malay Language Thamarai Subramaniam 1, Hamid A. Jalab 2, Alaa Y. Taqa 3 1,2 Computer System and Technology Department, Faulty of Computer Science and Information

More information

ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift

ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift Sarah Jane Delany 1 and Pádraig Cunningham 2 Abstract. While text classification has been identified for some time as a promising application

More information

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577 T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

More information

Enrique Puertas Sánz Francisco Carrero García Universidad Europea de Madrid Villaviciosa de Odón 28670 Madrid, SPAIN 34 91 211 5611

Enrique Puertas Sánz Francisco Carrero García Universidad Europea de Madrid Villaviciosa de Odón 28670 Madrid, SPAIN 34 91 211 5611 José María Gómez Hidalgo Universidad Europea de Madrid Villaviciosa de Odón 28670 Madrid, SPAIN 34 91 211 5670 jmgomez@uem.es Content Based SMS Spam Filtering Guillermo Cajigas Bringas Group R&D - Vodafone

More information

Impact of Feature Selection Technique on Email Classification

Impact of Feature Selection Technique on Email Classification Impact of Feature Selection Technique on Email Classification Aakanksha Sharaff, Naresh Kumar Nagwani, and Kunal Swami Abstract Being one of the most powerful and fastest way of communication, the popularity

More information

Machine Learning in Spam Filtering

Machine Learning in Spam Filtering Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.

More information

Index Terms Domain name, Firewall, Packet, Phishing, URL.

Index Terms Domain name, Firewall, Packet, Phishing, URL. BDD for Implementation of Packet Filter Firewall and Detecting Phishing Websites Naresh Shende Vidyalankar Institute of Technology Prof. S. K. Shinde Lokmanya Tilak College of Engineering Abstract Packet

More information

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing

More information

An Efficient Spam Filtering Techniques for Email Account

An Efficient Spam Filtering Techniques for Email Account American Journal of Engineering Research (AJER) e-issn : 2320-0847 p-issn : 2320-0936 Volume-02, Issue-10, pp-63-73 www.ajer.org Research Paper Open Access An Efficient Spam Filtering Techniques for Email

More information

Projektgruppe. Categorization of text documents via classification

Projektgruppe. Categorization of text documents via classification Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction

More information

Effectiveness and Limitations of Statistical Spam Filters

Effectiveness and Limitations of Statistical Spam Filters Effectiveness and Limitations of Statistical Spam Filters M. Tariq Banday, Lifetime Member, CSI P.G. Department of Electronics and Instrumentation Technology University of Kashmir, Srinagar, India Abstract

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA

More information

On Attacking Statistical Spam Filters

On Attacking Statistical Spam Filters On Attacking Statistical Spam Filters Gregory L. Wittel and S. Felix Wu Department of Computer Science University of California, Davis One Shields Avenue, Davis, CA 95616 USA Paper review by Deepak Chinavle

More information

Mining the Software Change Repository of a Legacy Telephony System

Mining the Software Change Repository of a Legacy Telephony System Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,

More information

Batch and On-line Spam Filter Comparison

Batch and On-line Spam Filter Comparison Batch and On-line Spam Filter Comparison Gordon V. Cormack David R. Cheriton School of Computer Science University of Waterloo Waterloo, Ontario N2L 3G1, Canada gvcormac@uwaterloo.ca Andrej Bratko Department

More information

1 Maximum likelihood estimation

1 Maximum likelihood estimation COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

More information

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type. Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada

More information

Learning to classify e-mail

Learning to classify e-mail Information Sciences 177 (2007) 2167 2187 www.elsevier.com/locate/ins Learning to classify e-mail Irena Koprinska *, Josiah Poon, James Clark, Jason Chan School of Information Technologies, The University

More information

The Enron Corpus: A New Dataset for Email Classification Research

The Enron Corpus: A New Dataset for Email Classification Research The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu

More information

A Novel Technique of Email Classification for Spam Detection

A Novel Technique of Email Classification for Spam Detection A Novel Technique of Email Classification for Spam Detection Vinod Patidar Student (M. Tech.), CSE Department, BUIT Divakar singh HOD, CSE Department, BUIT Anju Singh Assistant Professor, IT Department,

More information

A Game Theoretical Framework for Adversarial Learning

A Game Theoretical Framework for Adversarial Learning A Game Theoretical Framework for Adversarial Learning Murat Kantarcioglu University of Texas at Dallas Richardson, TX 75083, USA muratk@utdallas Chris Clifton Purdue University West Lafayette, IN 47907,

More information

Investigation of Support Vector Machines for Email Classification

Investigation of Support Vector Machines for Email Classification Investigation of Support Vector Machines for Email Classification by Andrew Farrugia Thesis Submitted by Andrew Farrugia in partial fulfillment of the Requirements for the Degree of Bachelor of Software

More information

An Evaluation of Statistical Spam Filtering Techniques

An Evaluation of Statistical Spam Filtering Techniques An Evaluation of Statistical Spam Filtering Techniques Le Zhang, Jingbo Zhu, Tianshun Yao Natural Language Processing Laboratory Institute of Computer Software & Theory Northeastern University, China ejoy@xinhuanet.com,

More information

Forecasting stock markets with Twitter

Forecasting stock markets with Twitter Forecasting stock markets with Twitter Argimiro Arratia argimiro@lsi.upc.edu Joint work with Marta Arias and Ramón Xuriguera To appear in: ACM Transactions on Intelligent Systems and Technology, 2013,

More information

Relaxed Online SVMs for Spam Filtering

Relaxed Online SVMs for Spam Filtering Relaxed Online SVMs for Spam Filtering D. Sculley Tufts University Department of Computer Science 161 College Ave., Medford, MA USA dsculleycs.tufts.edu Gabriel M. Wachman Tufts University Department of

More information

2. NAIVE BAYES CLASSIFIERS

2. NAIVE BAYES CLASSIFIERS Spam Filtering with Naive Bayes Which Naive Bayes? Vangelis Metsis Institute of Informatics and Telecommunications, N.C.S.R. Demokritos, Athens, Greece Ion Androutsopoulos Department of Informatics, Athens

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

Sender and Receiver Addresses as Cues for Anti-Spam Filtering Chih-Chien Wang

Sender and Receiver Addresses as Cues for Anti-Spam Filtering Chih-Chien Wang Sender and Receiver Addresses as Cues for Anti-Spam Filtering Chih-Chien Wang Graduate Institute of Information Management National Taipei University 69, Sec. 2, JianGuo N. Rd., Taipei City 104-33, Taiwan

More information

How To Reduce Spam To A Legitimate Email From A Spam Message To A Spam Email

How To Reduce Spam To A Legitimate Email From A Spam Message To A Spam Email Enhancing Scalability in Anomaly-based Email Spam Filtering Carlos Laorden DeustoTech Computing - S 3 Lab, University of Deusto Avenida de las Universidades 24, 48007 Bilbao, Spain claorden@deusto.es Borja

More information

Hoodwinking Spam Email Filters

Hoodwinking Spam Email Filters Proceedings of the 2007 WSEAS International Conference on Computer Engineering and Applications, Gold Coast, Australia, January 17-19, 2007 533 Hoodwinking Spam Email Filters WANLI MA, DAT TRAN, DHARMENDRA

More information

Representation of Electronic Mail Filtering Profiles: A User Study

Representation of Electronic Mail Filtering Profiles: A User Study Representation of Electronic Mail Filtering Profiles: A User Study Michael J. Pazzani Department of Information and Computer Science University of California, Irvine Irvine, CA 92697 +1 949 824 5888 pazzani@ics.uci.edu

More information

On Attacking Statistical Spam Filters

On Attacking Statistical Spam Filters On Attacking Statistical Spam Filters Gregory L. Wittel and S. Felix Wu Department of Computer Science University of California, Davis One Shields Avenue, Davis, CA 95616 USA Abstract. The efforts of anti-spammers

More information

Introduction to Learning & Decision Trees

Introduction to Learning & Decision Trees Artificial Intelligence: Representation and Problem Solving 5-38 April 0, 2007 Introduction to Learning & Decision Trees Learning and Decision Trees to learning What is learning? - more than just memorizing

More information

A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS

A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS A MACHINE LEARNING APPROACH TO FILTER UNWANTED MESSAGES FROM ONLINE SOCIAL NETWORKS Charanma.P 1, P. Ganesh Kumar 2, 1 PG Scholar, 2 Assistant Professor,Department of Information Technology, Anna University

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

On-line Spam Filter Fusion

On-line Spam Filter Fusion On-line Spam Filter Fusion Thomas Lynam & Gordon Cormack originally presented at SIGIR 2006 On-line vs Batch Classification Batch Hard Classifier separate training and test data sets Given ham/spam classification

More information

SpamNet Spam Detection Using PCA and Neural Networks

SpamNet Spam Detection Using PCA and Neural Networks SpamNet Spam Detection Using PCA and Neural Networks Abhimanyu Lad B.Tech. (I.T.) 4 th year student Indian Institute of Information Technology, Allahabad Deoghat, Jhalwa, Allahabad, India abhimanyulad@iiita.ac.in

More information

A semi-supervised Spam mail detector

A semi-supervised Spam mail detector A semi-supervised Spam mail detector Bernhard Pfahringer Department of Computer Science, University of Waikato, Hamilton, New Zealand Abstract. This document describes a novel semi-supervised approach

More information

The Optimality of Naive Bayes

The Optimality of Naive Bayes The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most

More information

Bayesian Spam Detection

Bayesian Spam Detection Scholarly Horizons: University of Minnesota, Morris Undergraduate Journal Volume 2 Issue 1 Article 2 2015 Bayesian Spam Detection Jeremy J. Eberhardt University or Minnesota, Morris Follow this and additional

More information

Using Biased Discriminant Analysis for Email Filtering

Using Biased Discriminant Analysis for Email Filtering Using Biased Discriminant Analysis for Email Filtering Juan Carlos Gomez 1 and Marie-Francine Moens 2 1 ITESM, Eugenio Garza Sada 2501, Monterrey NL 64849, Mexico juancarlos.gomez@invitados.itesm.mx 2

More information

Keywords Phishing Attack, phishing Email, Fraud, Identity Theft

Keywords Phishing Attack, phishing Email, Fraud, Identity Theft Volume 3, Issue 7, July 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Detection Phishing

More information

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

More information

A Bayesian Topic Model for Spam Filtering

A Bayesian Topic Model for Spam Filtering Journal of Information & Computational Science 1:12 (213) 3719 3727 August 1, 213 Available at http://www.joics.com A Bayesian Topic Model for Spam Filtering Zhiying Zhang, Xu Yu, Lixiang Shi, Li Peng,

More information

Filtering Spam E-Mail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques

Filtering Spam E-Mail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques 52 The International Arab Journal of Information Technology, Vol. 6, No. 1, January 2009 Filtering Spam E-Mail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques Alaa El-Halees

More information

Mining a Corpus of Job Ads

Mining a Corpus of Job Ads Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department

More information

An Efficient Three-phase Email Spam Filtering Technique

An Efficient Three-phase Email Spam Filtering Technique An Efficient Three-phase Email Filtering Technique Tarek M. Mahmoud 1 *, Alaa Ismail El-Nashar 2 *, Tarek Abd-El-Hafeez 3 *, Marwa Khairy 4 * 1, 2, 3 Faculty of science, Computer Sci. Dept., Minia University,

More information