Combating Good Word Attacks on Statistical Spam Filters with Multiple Instance Learning
Yan Zhou, Zach Jorgensen, Meador Inge
School of Computer and Information Sciences, University of South Alabama, Mobile, AL, USA

Abstract
Statistical spam filters are known to be vulnerable to adversarial attacks. One such attack, known as the good word attack, thwarts spam filters by appending to spam messages sets of good words, words that are common in legitimate e-mail but rare in spam. We present a counter-attack strategy that first attempts to differentiate spam from legitimate e-mail in the input space, by transforming each e-mail into a bag of multiple segments, and subsequently applies multiple instance logistic regression on the bags. We treat each segment in a bag as an instance. An e-mail is classified as spam if at least one instance in the corresponding bag is spam, and as legitimate if all the instances in it are legitimate. We show that a spam filter using our multiple instance counter-attack strategy stands up better to good word attacks than its single instance counterpart and the commonly practiced Bayesian filters.

1. Introduction
It has been nearly thirty years since the debut of the first spam. Today, to most end users, spam does not seem to be a serious threat, owing to the effectiveness of spam filters. Behind the scenes, however, is a seemingly never-ending battle between spammers and spam fighters. With millions of e-mail users, profit-driven spammers have great incentives to spam. With as little as a 0.001% response rate, a spammer could potentially profit $25,000 on a $50 product [3]. Over the years, spammers have grown in sophistication with cutting-edge technologies and have become more evasive. The best evidence of their growing effectiveness is a recent estimate of over US$10 billion in worldwide spam-related costs [7]. In this paper, we target one of the attacking techniques spammers often use on existing spam filters.

Adversarial attacks on spam filters have become an increasing challenge to the anti-spam community. The good word attack [11] is one of the most popular techniques frequently employed by spammers. It involves appending to spam messages sets of good words, words that are common in legitimate e-mails (ham) but rare in spam. Spam messages inflated with good words are more likely to bypass spam filters. So far, relatively little research has been done on how spam filters can be trained to account for such attacks. This paper presents an effective defense strategy against spam disguised with good words, using multiple instance (MI) learning.

The concept of multiple instance learning was initially proposed by Dietterich et al. for predicting drug activities [5]. The challenge in identifying a drug molecule that binds strongly to a drug target is that a drug molecule can have multiple conformations. A molecule is positive if at least one of its conformations binds tightly to the target, and negative if none of its conformations bind well to the target. The problem was tackled with an MI model that aims to learn axis-parallel rectangles (APR). Later, learning APR in the multiple instance setting was further studied, and shown to be NP-complete, by several other researchers in the PAC-learning framework [1, 9, 2]. Several probabilistic models, Diverse Density (DD) [12] and its variation EM-DD [20], and multiple instance logistic regression (MILR) [14], employ maximum log-likelihood estimation to solve problems in the MI domain.
More recent work in this area has focused on reducing a multiple instance learning problem to a single instance learning problem [6, 16, 13, 21]. Our spam filtering strategy adopts the classical MI assumption, i.e., a bag is positive if at least one of its instances is positive, and negative if all of its instances are negative. We treat each e-mail as a bag of instances. An e-mail is classified as spam if at least one instance in the corresponding bag is spam, and as legitimate if all the instances in it are legitimate. We split each e-mail into multiple instances so that, even when the spam is inflated with good words, a multiple instance learner is still able to recognize the spam part of the message. Our experimental results show that a multiple instance learner stands up better to good word attacks than its single instance counterpart and the Bayesian filters.
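As a minimal illustration of this classical MI assumption (the segment classifier below is a stand-in; the paper learns it with MILR, described in Section 5):

```python
# A minimal sketch of the classical MI assumption: a bag (an e-mail split
# into segments) is spam if at least one instance is spam, and ham only if
# every instance is ham. Names and the toy classifier are ours.

def classify_bag(instances, classify_instance):
    """Return 1 (spam) if any instance is classified spam, else 0 (ham)."""
    return int(any(classify_instance(x) == 1 for x in instances))

# Toy instance classifier: flag a segment containing a known spammy token.
spam_tokens = {"mortgage", "viagra", "rate"}
toy_classifier = lambda segment: int(any(t in spam_tokens for t in segment))

bag = [["meeting", "agenda"], ["low", "mortgage", "rate"]]
print(classify_bag(bag, toy_classifier))  # -> 1: one spammy segment suffices
```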
In the remainder of the paper, we first discuss recent research that has motivated our work. Next, we formalize the spam filtering problem as a multiple instance learning problem. We then present experimental results that demonstrate the effectiveness of our filtering strategy. Finally, we conclude and discuss future directions.

2. Related Work
Our work is primarily motivated by recent research on adversarial learning [4, 10, 8]. Dalvi et al. consider classification as a game between classifiers and adversaries in problem domains where adversarial attacks are expected [4]. They model the computation of the adversary's optimal strategy as a constrained optimization problem and approximate its solution with dynamic programming. Subsequently, an optimal classifier is produced against the optimal adversarial strategy. Their experimental results demonstrate that their game-theoretic approach outperforms traditional classifiers in the spam filtering domain. However, their adversarial classification framework assumes that the classifier and the adversary have perfect knowledge of each other, which is unrealistic in practice.

Instead of assuming the adversary has perfect knowledge of the classifier, Lowd and Meek formalize the task of adversarial learning as the process of reverse engineering the classifier [10]. In their adversarial classifier reverse engineering (ACRE) framework, the adversary aims to identify difficult spam instances, the ones that are hard for the classifier to detect, through membership queries. The goal is to find a set of negative instances with minimum adversarial cost within a polynomial number of membership queries.

A practical aspect of adversarial learning is the set of techniques used to initiate attacks on classifiers. One of the most common is the good word attack. Lowd and Meek [11] present and evaluate several variations of the good word attack, which involves appending good words, ones that are common in legitimate e-mails, to spam messages. There are two different ways to carry out the attack: passively and actively. Active good word attacks use feedback obtained by sending test messages to a spam filter in order to determine which words are good. The active attacks were found to be more effective than the passive attacks. However, active attacks are generally more difficult to mount than passive attacks because they require user-level access to the spam filter, which is not always possible. Passive good word attacks, on the other hand, do not involve any feedback from the spam filter; instead, guesses are made as to which words are good. Lowd and Meek identify three common ways good words can be chosen for passive attacks. First, dictionary attacks select random words from a large collection of words, such as a dictionary. In testing, this method proved ineffective; in fact, it actually increased the chances that the e-mail would be classified as spam. Next, frequent word attacks select words that occur most often in legitimate messages, such as news articles. This method was more effective than the previous one, but it still required as many as 1,000 words, on average, to be added to the original message. Finally, frequency ratio attacks select words that occur very often in legitimate messages but rarely in spam messages.
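As a concrete illustration (our own sketch, not code from [11]), a passive frequency ratio attacker could rank candidate good words as follows:

```python
from collections import Counter

def rank_good_words(ham_docs, spam_docs, top_k=1000):
    """Rank words by their ham-to-spam frequency ratio, as in a passive
    frequency ratio attack. ham_docs and spam_docs are lists of tokenized
    messages. Words that never occur in spam are skipped, mirroring the
    good word list construction described later in Section 6."""
    ham = Counter(w for doc in ham_docs for w in doc)
    spam = Counter(w for doc in spam_docs for w in doc)
    ham_total, spam_total = sum(ham.values()), sum(spam.values())
    ratio = {
        w: (ham[w] / ham_total) / (spam[w] / spam_total)
        for w in ham if spam[w] > 0
    }
    return sorted(ratio, key=ratio.get, reverse=True)[:top_k]
```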
The authors' tests showed that this technique was quite effective: the average spam message could be passed off as legitimate by adding as few as 150 words to the e-mail. In this paper, we demonstrate that an effective counter-attack strategy can be developed in the framework of multiple instance learning, provided that a single e-mail can be properly transformed into a bag of instances. Our experiments also verify earlier observations, discussed in [11, 17], that re-training on e-mails modified during adversarial attacks may improve the performance of the filters.

3. Problem Definition
Consider a standard supervised learning problem with a set of training data $D = \{\langle X_1, Y_1 \rangle, \ldots, \langle X_m, Y_m \rangle\}$, where $X_i$ is an instance represented as a single feature vector, and $Y_i = C(X_i)$ is the target value of $X_i$, with $C$ the target function. Normally, the task is to learn $C$ given $D$. The learning task becomes more difficult when there are adversaries who can alter an instance $X$ into $X'$ such that $X \neq X'$ and $C(X) \neq C(X')$. Let $\Delta X$ be the difference between $X$ and $X'$, i.e., $X' = X + \Delta X$. There are two cases that need to be studied separately:

1. the training set is clean: no training instances are altered, and adversaries can only modify examples in the test set;
2. both the training set and the test set are compromised.

In the first case, the classifier is trained on a clean training set, and predictions made for altered instances are highly unreliable. In the second case, the classifier may capture some adversarial patterns, as long as the adversaries consistently follow a particular pattern. In both cases, the problem becomes trivial if we know exactly how the instances are altered: we could recover the original data and solve the problem as if no instances had been altered. In reality, knowing exactly how the instances are altered is impossible. Instead, we seek to approximately separate $X$ and $\Delta X$ and treat them as separate instances in a bag. We then apply multiple instance learning to learn a hypothesis defined over a set of bags.
4. Multiple Instance Bag Creation
We formulate the spam filtering problem as a multiple instance binary classification problem in the context of adversarial attacks. Note that the adversary is only interested in altering positive instances, i.e., spam, by appending sets of good words that are commonly encountered in negative instances, i.e., legitimate e-mails, often referred to as ham in practice. We propose three different approaches to creating multiple instance bags: split-half (split-h), split-term (split-t), and split-projection (split-p).

Split-H
In our first splitting method, split-half, we split every e-mail right down the middle into two approximately equal halves. Formally, let $B = \{B_1, \ldots, B_i, \ldots, B_m\}$ be a set of bags (e-mails), where $B_i = \{X_{i1}, X_{i2}\}$ is the $i$-th bag, and $X_{i1}$ and $X_{i2}$ are the two instances in the $i$-th bag, created from the upper half and the lower half of the e-mail, respectively. This splitting approach is reasonable in practice because spammers usually append a section of good words to either the beginning or the end of an e-mail, to preserve the legibility of the spam message. There are two ways for spammers to attack spam filters that use splitting methods of this nature. One is to create a visual pattern with good words so that the original spam message is still legible after the attack, but the spam is fragmented in such a way that spammy words are well separated and become less indicative of spam in the presence of a number of good words. The following example demonstrates how this idea can be used to deceive this type of filter:

From: [email protected]
To: foo-foo@ .org
Subject: meeting agenda
goodwords... low... goodwords
goodwords... mortgage... goodwords
goodwords... rate... goodwords

The second way to defeat the split-h method is to append a block of good words so large that, after the split, good words outweigh spam-indicative words in all instances.

Split-T
The second splitting method, split-term, partitions a message into three groups of terms (words) according to whether each word is an indicator of spam, an indicator of legitimate e-mail, or neutral, i.e., $B_i = \{X_{is}, X_{in}, X_{il}\}$, where $X_{is}$ is the spam-likely instance, $X_{in}$ is the neutral instance, and $X_{il}$ is the legit-likely instance. The determination for each word is based on a weight generated for it during preprocessing. These weights are calculated using word frequencies obtained from the spam and legitimate messages. More specifically, the weight of a term $W$ is given as follows:

$$\mathrm{weight}(W) = \frac{s(W)}{s(W) + h(W)},$$

where

$$s(W) = \frac{\text{number of spam e-mails containing } W}{\text{total number of spam e-mails}}, \qquad h(W) = \frac{\text{number of ham e-mails containing } W}{\text{total number of ham e-mails}}.$$

We considered any word with a weight less than $thresh_s$ to be spammy, any word with a weight greater than $thresh_l$ to be legitimate, and any word with a weight in between to be neutral. $thresh_s$ is selected such that 20% of the terms have a weight less than or equal to it; $thresh_l$ is selected so that 50% of the terms have a weight greater than or equal to it. Since most legitimate messages have at least a few terms that are more or less spam-indicative, the threshold values are chosen asymmetrically, so that relatively fewer words are considered spammy, in order to reduce the risk of false positives.
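A minimal sketch of split-t bag creation, assuming precomputed document-frequency tables and the two percentile thresholds described above (the function and variable names are ours, not the paper's):

```python
def term_weight(word, spam_df, ham_df, n_spam, n_ham):
    """weight(W) = s(W) / (s(W) + h(W)), computed from document frequencies."""
    s = spam_df.get(word, 0) / n_spam
    h = ham_df.get(word, 0) / n_ham
    return s / (s + h) if (s + h) > 0 else 0.5  # unseen words treated as neutral

def split_t(tokens, spam_df, ham_df, n_spam, n_ham, thresh_s, thresh_l):
    """Partition a tokenized message into a 3-instance bag, following the
    thresholding rule stated in the text: weight < thresh_s -> spammy,
    weight > thresh_l -> legitimate, otherwise neutral."""
    bag = {"spam": [], "neutral": [], "legit": []}
    for w in tokens:
        wt = term_weight(w, spam_df, ham_df, n_spam, n_ham)
        if wt < thresh_s:
            bag["spam"].append(w)
        elif wt > thresh_l:
            bag["legit"].append(w)
        else:
            bag["neutral"].append(w)
    return bag
```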
Split-P
The third splitting method, split-projection, transforms each message into a bag of two instances by projecting the message vector onto spam and ham prototype vectors. The prototype vectors are computed from the spam cluster and the ham cluster in the training set. More specifically, let $C_s$ be the set of training instances that are spam and $C_l$ be the set that are legitimate. The prototypes are computed using Rocchio's algorithm [15] as follows:

$$P_s = \beta \cdot \frac{1}{|C_s|} \sum_{C_{si} \in C_s} C_{si} \;-\; \gamma \cdot \frac{1}{|C_l|} \sum_{C_{li} \in C_l} C_{li},$$

$$P_l = \beta \cdot \frac{1}{|C_l|} \sum_{C_{li} \in C_l} C_{li} \;-\; \gamma \cdot \frac{1}{|C_s|} \sum_{C_{si} \in C_s} C_{si},$$

where $C_{si}$ is the $i$-th spam instance in $C_s$, $C_{li}$ is the $i$-th ham instance in $C_l$, $\beta$ is a fixed constant suggested to be 16, and $\gamma$ is a fixed constant suggested to be 4. Given a message $M$, the two instances are formed by projecting $M$ onto $P_s$ and $P_l$, respectively. The rationale of this splitting approach rests on the assumption that a message is close to the spam prototype, in terms of cosine similarity, if it is indeed spam, and that a ham is close to the ham prototype.
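The text does not spell out the exact form of the projection, so the sketch below assumes the standard vector projection of the message onto each prototype; everything else follows the Rocchio formulas above:

```python
import numpy as np

def rocchio_prototypes(spam_vecs, ham_vecs, beta=16.0, gamma=4.0):
    """Compute spam/ham prototype vectors with Rocchio's formula,
    using the suggested constants beta = 16 and gamma = 4."""
    spam_mean = np.mean(spam_vecs, axis=0)
    ham_mean = np.mean(ham_vecs, axis=0)
    p_spam = beta * spam_mean - gamma * ham_mean
    p_ham = beta * ham_mean - gamma * spam_mean
    return p_spam, p_ham

def split_p(message_vec, p_spam, p_ham):
    """Form a 2-instance bag by projecting the message onto each prototype
    (standard vector projection; an assumption, as the paper leaves the
    projection unspecified)."""
    def project(m, p):
        return (np.dot(m, p) / np.dot(p, p)) * p
    return [project(message_vec, p_spam), project(message_vec, p_ham)]
```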
To attack split-t and split-p, adversaries would need to initiate active attacks through trial and error, which is only feasible if they have user-level access to the spam filter. Once we create bags of instances, we can transform a standard supervised learning problem into a multiple instance learning problem under the standard MI assumption. In this paper, we adopt the multiple instance logistic regression model to train a spam filter that is more robust to good word attacks than traditional spam filters.

5. Multiple Instance Logistic Regression
Given a set of training bags $B = \{\langle B_1, Y_1 \rangle, \ldots, \langle B_i, Y_i \rangle, \ldots, \langle B_m, Y_m \rangle\}$, let $\Pr(Y_i = 1 \mid B_i)$ be the probability that the $i$-th bag is positive, and $\Pr(Y_i = 0 \mid B_i)$ the probability that it is negative. The bag-level binomial log-likelihood function is

$$L = \sum_{i=1}^{m} \left[ Y_i \log \Pr(Y_i = 1 \mid B_i) + (1 - Y_i) \log \Pr(Y_i = 0 \mid B_i) \right].$$

We do not have a direct measure of the bag-level probabilities in the log-likelihood function. However, we can compute $\Pr(Y_i = 0 \mid B_i)$ from the instance-level class probabilities

$$\Pr(y_j = 0 \mid X_{ij}) = \frac{1}{1 + \exp(p \cdot X_{ij})},$$

where $X_{ij}$ is the $j$-th instance in the $i$-th bag and $p$ is the parameter vector that needs to be estimated. A bag is negative if and only if all the instances in the bag are negative. Thus,

$$\Pr(Y_i = 0 \mid B_i) = \prod_{j=1}^{n} \Pr(y_j = 0 \mid X_{ij}) = \exp\left( -\sum_{j=1}^{n} \log\left(1 + \exp(p \cdot X_{ij})\right) \right),$$

where $n$ is the number of instances in the $i$-th bag, and $\Pr(Y_i = 1 \mid B_i) = 1 - \Pr(Y_i = 0 \mid B_i)$. We can now apply maximum likelihood estimation (MLE) to maximize the bag-level log-likelihood function and estimate the parameter vector $p$ that maximizes the probability of observing the bags in $B$.
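As a sanity check, the bag-level log-likelihood can be written in a few lines (a minimal sketch, assuming dense NumPy feature vectors; this is not the MILK implementation used in the experiments):

```python
import numpy as np

def bag_log_likelihood(p, bags, labels):
    """Bag-level binomial log-likelihood for MILR.

    p      : (d,) parameter vector
    bags   : list of (n_i, d) arrays, one row per instance
    labels : list of 0/1 bag labels
    """
    ll = 0.0
    for B, y in zip(bags, labels):
        # log Pr(bag negative) = -sum_j log(1 + exp(p . x_ij))
        log_pr_neg = -np.sum(np.logaddexp(0.0, B @ p))
        pr_neg = np.exp(log_pr_neg)
        ll += y * np.log(1.0 - pr_neg + 1e-12) + (1 - y) * log_pr_neg
    return ll
```

Maximizing this function (e.g., by handing its negation to a gradient-based optimizer) yields the estimate of $p$; the small constant guards against $\log(0)$ for bags whose probability of being negative rounds to 1.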
6. Experimental Results
We test our multiple instance learning counter-attack strategy on e-mails from the 2006 TREC Public Spam Corpus (gvcormac/treccorpus06/). We simulate good word attacks by generating a list of good words from the corpus and randomly selecting some to add to the spam messages in the data sets. We compare our multiple instance logistic regression classifiers to their single instance learning counterparts: logistic regression (LR), the support vector machine (SVM), and the multinomial naïve Bayes (MNB) classifier.

Experimental Data
Our experimental data consists of 36,674 spam and legitimate messages from the 2006 TREC spam corpus. We preprocessed the entire corpus by stripping HTML and attachments, and applying stemming and a stop-list to all terms. Messages that became empty after preprocessing were discarded.

Experiment Setup
For our experiments, we sorted the messages in the corpus by their receiving date and evenly divided them into 11 subsets {D_1, ..., D_11}. In other words, the messages in subset i come chronologically before the messages in subset i+1. Experiments were run in an online fashion, i.e., training on subset i and testing on subset i+1, for i = 1, ..., 10. Each subset contains approximately 3,300 messages. The percentage of spam messages in each subset varies, as in operational settings (see Figure 1). We used the Multiple Instance Learning Tool Kit (MILK) [19] and Weka 3.4.7, a well-known open-source machine learning tool [18], in our experiments. We reduced the feature space to the top 500 words ranked using information gain. Attribute values for each message are calculated using the tf-idf (term frequency-inverse document frequency) weighting scheme.
Figure 1. Percentage of spam in each dataset.

Good Word List
To simulate the good word attacks, we generated a list of good words using all the messages in the TREC spam corpus. We ranked each word according to the ratio of its frequency in the legitimate messages over its frequency in the spam messages. Any words that appeared only in the legitimate messages and not in the spam messages were not included in the ranking, as described by Lowd and Meek [11]. We then chose the top 1,000 words from this ranking to use as the good word list in our experiments. In the real world, it is highly unlikely that a spammer would have access to the set of e-mails from which the training set for the spam filter was obtained; however, we generated our good word list in this way so that the good words would have an observable effect on the accuracy of the classifiers.

Experiment 1
In the first experiment, we tested the MILR-based classifiers on e-mails injected with varied amounts of good words. We also tested the standard logistic regression, support vector machine, and multinomial naïve Bayes classifiers for comparison. The classifiers were each tested on the eleven chronologically sorted data sets in an online fashion, i.e., all of the classifiers were trained on the same unaltered data set D_i and then tested on the data set D_{i+1}, for all i from 1 to 10. Six variations of each test set were created to test the susceptibility of the classifiers to good word attacks of varying strength. The first version of each test set was left unmodified, i.e., no good words were added to the test set. With each successive version of the test set, the number of good words added to half of the spam messages increased in increments of 50 words, up to 250. The good words added to each message were randomly selected from our good word list, without replacement or regard to rank, on a message-by-message basis. The precision and recall on each version of the test set for all 10 tests were averaged and recorded for each classifier. MILR was tested using the split-h, split-t, and split-p methods to split the messages into multiple instances. In our results, we use MILRH, MILRT, and MILRP when split-h, split-t, and split-p are used with MILR, respectively.

Figure 2 shows how the precision and recall values are affected by the number of good words added to the test set. MILRT is barely affected by the attacks. MILRH and MILRP suffer noticeably less precision and recall loss than LR and SVM, and suffer much less recall loss than MNB. More specifically, the average recall value for MILRH, MILRT, and MILRP dropped 17.9% (from 0.98 to 0.80), 1.1% (from 0.93 to 0.92), and 24.4% (from 0.95 to 0.72), respectively, as a result of the good word attack, while the average recall value of LR, SVM, and MNB dropped by 34.2% (from 0.97 to 0.64), 36.6% (from 0.97 to 0.62), and 45.1% (from 0.91 to 0.50), respectively. In the same experiment, the drop in precision for MILRH, MILRT, and MILRP was 4.0%, 0.3%, and 4.3%, while the drop in precision for LR, SVM, and MNB was 6.2%, 8.0%, and 2.4%, respectively.
Figure 2. Precision & Recall loss when good words are added to 50% of the test set, and no good words are added to the training set.

Figure 3 shows the change in F-measure, which is the harmonic mean of the precision and recall values, as the number of good words added to the test set increases. As can be observed, the MILR-based classifiers stand up much better to the good word attack than do LR, SVM, and MNB. Among MILRH, MILRT, and MILRP, MILRT appears to be the most effective against the attacks. Figure 4 shows the success rate of MILRH, MILRT, MILRP, LR, SVM, and MNB as the number of good words added to the test set increases. Again, the MILR classifiers outperform LR, SVM, and MNB as more and more good words are added.
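For reference, the reported metrics reduce to a few lines (a generic sketch, treating spam as the positive class):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F-measure (the harmonic mean of the two),
    computed from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```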
Experiment 2
In the second experiment, our goal was to observe the effect that training on messages injected with good words has on the susceptibility of the classifiers to attacks on the test set. We again tested each of the classifiers on the 11 chronologically sorted data sets in an online fashion. This time, we also created five versions of the training set. We injected 10 good words into the first version of the training set and then increased the number of injected good words by 10 for each successive version, up to 50 good words for the fifth version. We also tried injecting larger numbers of good words, but after exceeding 50 words the effect was minimal; therefore, the corresponding results are not reported here. For each version of the training set, we tested the classifiers on 10 versions of the corresponding test set. The first version of each test set was left unmodified. With each successive version of the test set, half of the spam messages were injected with an increasingly larger number of good words, first in increments of 10 words for the first five versions and then in increments of 50 words for the next four versions. As before, good words were selected from the good word list randomly and without replacement or regard to rank, on a message-by-message basis. For all ten tests on the chronologically sorted data sets, the precision and recall on each version of the test set were averaged and recorded for each classifier.
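The injection protocol shared by both experiments can be sketched as follows (a hypothetical sketch; the function names and data layout are our assumptions):

```python
import random

def inject_good_words(spam_tokens, good_words, k, rng=random):
    """Append k good words, drawn without replacement and without regard
    to rank, from the good word list to one spam message."""
    return spam_tokens + rng.sample(good_words, k)

def attack_half_of_spam(messages, labels, good_words, k, seed=0):
    """Return a copy of the data set in which half of the spam messages
    (chosen at random) have k good words appended."""
    rng = random.Random(seed)
    spam_idx = [i for i, y in enumerate(labels) if y == 1]
    attacked = set(rng.sample(spam_idx, len(spam_idx) // 2))
    return [inject_good_words(m, good_words, k, rng) if i in attacked else m
            for i, m in enumerate(messages)]
```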
Figures 5, 6, 7, 8, and 9 show the success rate and the F-measure of the six algorithms as the number of good words added to 50% of the spam messages in the training set increases from 10 to 50.
Figure 3. F-measure when good words are added to 50% of the test set.
Figure 4. Success rate when good words are added to 50% of the test set.
Figure 5. Success rate and F-measure when 10 good words are added to 50% of the spam in the training set.

The general observations are as follows: when some of the spam messages in the training set are modified with a small number of good words, the performance of all classifiers except MILRT worsens as the number of good words added to the spam in the test set increases, as shown in Figure 5; but as the training spam is inflated with more good words, the attacks become less effective, as shown in Figures 6, 7, 8, and 9. In our experiments, we observed that after the number of good words added to the training spam exceeds 50, the added good words actually help all classifiers correctly identify spam, which is in line with the results reported by other researchers [11, 17]. Due to space limits, we do not show all the plots with more than 50 words added to the training spam.
Figure 6. Success rate and F-measure when 20 good words are added to 50% of the spam in the training set.
Figure 7. Success rate and F-measure when 30 good words are added to 50% of the spam in the training set.
Figure 8. Success rate and F-measure when 40 good words are added to 50% of the spam in the training set.
Figure 9. Success rate and F-measure when 50 good words are added to 50% of the spam in the training set.

7. Conclusions and Future Work
We propose a multiple instance counter-attack strategy for combating good word attacks on statistical spam filters. We treat each e-mail as a bag of multiple instances. A logistic model at the instance level is learned indirectly by maximizing the bag-level binomial log-likelihood function. We have demonstrated that a multiple instance learner stands up better to good word attacks than the standard single instance learners: logistic regression, the support vector machine, and the often practiced naïve Bayes model. Our preliminary results demonstrate the great potential of the proposed multiple instance learning strategy against good word attacks on spam filters. However, due to the complex nature of the problem, we need to carefully investigate the influence of various parameters, such as the distribution of spam in the corpus, the size of the training set, the percentage of spam in either the training or the test set being inflated with good words, and the number of good words appended to spam messages. In addition, we plan to extend the proposed multiple instance learning strategy to handle adversarial attacks in general. Since it is an arms race between spammers and filter designers, we also plan to make our MI strategy adaptive as new spam techniques are devised, and online as the concept of spam drifts over time.

References
[1] P. Auer. On learning from multi-instance examples: Empirical evaluation of a theoretical approach. In Proceedings of the 14th International Conference on Machine Learning, pages 21-29. Morgan Kaufmann, San Francisco, CA, 1997.
[2] A. Blum and A. Kalai. A note on learning from multiple-instance examples. Machine Learning, 30(1):23-30, 1998.
[3] J. Carpinter and R. Hunt. Tightening the net: A review of current and next generation spam filtering tools. Computers and Security, 25(8), 2006.
[4] N. Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma. Adversarial classification. In Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 2004.
[5] T. Dietterich, R. Lathrop, and T. Lozano-Pérez. Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence Journal, 89(1-2):31-71, 1997.
[6] T. Gärtner, P. Flach, A. Kowalczyk, and A. Smola. Multi-instance kernels. In Proceedings of the 19th International
Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2002.
[7] R. Jennings. The global economic impact of spam. Technical report, Ferris Research.
[8] J. Z. Kolter and M. A. Maloof. Using additive expert ensembles to cope with concept drift. In Proceedings of the Twenty-second International Conference on Machine Learning. ACM Press, New York, NY, 2005.
[9] P. Long and L. Tan. PAC learning axis-aligned rectangles with respect to product distributions from multiple-instance examples. Machine Learning, 30(1):7-21, 1998.
[10] D. Lowd and C. Meek. Adversarial learning. In Proceedings of the 2005 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 2005.
[11] D. Lowd and C. Meek. Good word attacks on statistical spam filters. In Proceedings of the 2nd Conference on Email and Anti-Spam (CEAS), 2005.
[12] O. Maron and T. Lozano-Pérez. A framework for multiple-instance learning. Advances in Neural Information Processing Systems, 10, 1998.
[13] J. Ramon and L. De Raedt. Multi instance neural networks. In Proceedings of the ICML-2000 Workshop on Attribute-Value and Relational Learning, 2000.
[14] S. Ray and M. Craven. Supervised versus multiple instance learning: An empirical comparison. In Proceedings of the 22nd International Conference on Machine Learning. ACM Press, New York, NY, 2005.
[15] J. Rocchio Jr. Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, 1971.
[16] J. Wang and J. Zucker. Solving the multiple-instance learning problem: A lazy learning approach. In Proceedings of the 17th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA, 2000.
[17] S. Webb, S. Chitti, and C. Pu. An experimental evaluation of spam filter performance and robustness against attack. In The 1st International Conference on Collaborative Computing: Networking, Applications and Worksharing, 2005.
[18] I. Witten and E. Frank. Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco, CA.
[19] X. Xu. Statistical learning in multiple instance problems. Master's thesis, University of Waikato.
[20] Q. Zhang and S. Goldman. EM-DD: An improved multiple-instance learning technique. In Proceedings of the 2001 Neural Information Processing Systems (NIPS) Conference. MIT Press, Cambridge, MA.
[21] Z. H. Zhou, K. Jiang, and M. Li. Multi-instance learning based web mining. Applied Intelligence, 22(2), 2005.