Combining Global and Personal Anti-Spam Filtering
|
|
|
- Carmel Bridges
- 10 years ago
- Views:
Transcription
1 Combining Global and Personal Anti-Spam Filtering Richard Segal IBM Research Hawthorne, NY Abstract Many of the first successful applications of statistical learning to anti-spam filtering were personalized classifiers that were trained on an individual user s spam and ham . Proponents of personalized filters argue that statistical text learning is effective because it can identify the unique aspects of each individual s . On the other hand, a single classifier learned for a large population of users can leverage the data provided by each individual user across hundreds or even thousands of users. This paper investigates the trade-off between globally- and personallytrained anti-spam classifiers. We find that globally-trained text classification easily outperforms personally-trained classification under realistic settings. This result does not imply that personalization is not valuable. We show that the two techniques can be combined to produce a modest improvement in overall performance. 1 Introduction Statistical text classification is at the core of many commercial and open-source anti-spam solutions. Statistical classifiers can either be trained globally with one classifier learned for all users, or personally where a separate classifier is learned for each user. Personallytrained classifiers have the advantage of allowing each user to provide their own personal definition of spam. A user actively refinancing their home can train a personal filter to delete unsolicited stock advice as spam but deliver unsolicited refinancing offers to their Inbox. Another user might train a personal filter to block all unsolicited offers. Personal classifiers can quickly identify terms that are unique to an individual and use them as strong indicators of ham. Phrases such as pastry-powered, swiftfile, and spamguru appear frequently in the author s ham but probably have never appeared in any spam, let alone spam directed towards the author s Inbox. A globally-trained anti-spam filter is trained on spam and ham from a large collection of users. There are several types of globally-trained anti-spam solutions commonly in use. Anti-Spam services such as BrightMail and IBM Security use a wide variety of data sources (including spam-trap data, ham-trap data, and customer-labeled data) to create a globallylearned anti-spam model that is shared by all customers. The primary advantage of globally-trained anti-spam solutions is its ability to leverage data from multiple individuals. In a personal system, every training message must be provided by the end user. If a new spam message bypasses several users personallytrained filter, each user must separately update their classifier. In a globally-trained system, only a few users need to label a message for the classifier to be updated for everyone. Globally-trained systems are much easier to manage and deploy in large organizations. They often take the form of a single SMTP server that can be placed in an organizations SMTP chain and used to filter for hundreds if not thousands of users. Managing a single server in a central location is vastly easier and less costly than managing and supporting one thousand personal classifiers deployed on the desktop. It is often assumed that the convenience of globallytrained solutions comes at a cost. Globally-trained solutions cannot take into account individual definitions of spam, cannot easily profile the terms used to each user, and may be easier to attack as a spammer only needs to defeat a single system to reach a broad range of users. This paper challenges this assumption by empirically comparing personal and global anti-spam filters. We find that globally-trained classifiers substantially out-
2 perform personally-trained classifiers for both small and large user communities. We demonstrate that one of the reasons for the success of globally-trained classifiers is that not all personal data gets averaged out when training across a large community. In fact, much of the personal data is retained allowing the globallytrained classifier to benefit from individual preferences. However, some personal data is indeed lost by a globally-trained classifier. Globally-trained classifiers cannot retain personal preferences when users differ on whether a feature should be treated as an indicator of spam or ham. This loss of personal data can hurt the performance of globally-trained classifiers on diverse datasets. We propose a new algorithm for combining personal and globally-trained data that makes use of both global and personal information to classify each incoming message. We show that this new algorithm offers a substantial improvement in classifier accuracy when the user-base is sufficiently diverse. In the next section, we describe the Naïve Bayes text classifier that we use as a basis for our research and detail the globally-trained and personally-trained variants used in the experiments to follow. The second section describes Dynamic Personalization, our algorithm for combining personally- and globally-trained text classifiers. We then present the results of our empirical evaluation, discuss related work, and then conclude. 2 Naïve Bayes Spam Filtering We compare personally-trained and globally-trained anti-spam filtering by analyzing their performance in the specific case of a Naïve Bayes text classifier (Lewis, 1998). We believe most of the results and observations in this paper apply equally well to other bag-of-word text classifiers (e.g. linear regression, SVM), but we leave proving this hypothesis to future work. Let D denote a document containing words, w 1...w n. Let S denote the event that document D is spam. And, let θ denote the set of labeled training data. A Naïve Bayesian text classifier computes the probability a document D is spam using the formula: P(S D,θ) = αp(s θ) w i D P(w i S,θ) (1) where α is a normalization constant chosen to ensure that P(S D,θ) + P( S D,θ) = 1. We estimate P(w i S,θ) using a variation of Laplacean smoothing that has terms added to reduce the strength of the smoothing for long words. The difference between globally- and personallytrained Naïve Bayes is the training set θ used to build the classifier. Let θ u denote a training set consisting of only spam and ham samples from user u. Using this notation, P(S D,θ u ) denotes the personallytrained classifier for user u. A globally-trained classifier is trained on all labeled messages. Let U denote the set of all users. Let θ denote the training set consisting of all labeled data regardless of recipient. Then, the globally-trained classifier P(S D,θ ) is the classifier that results from training on θ. For the purposes of this comparison, we make the simplifying assumption that all training data comes from individual users. That is, θ = θ u. (2) u U This assumption accurately models systems where all labeled data originates from user-provided spam and ham samples. It does not accurately characterize labeled data from exogenous sources such as spam traps or the auto-voting of destined for invalid recipients. We believe this assumption is reasonable as it limits globally-trained classifiers to the same information that is usually available to personally-trained classifiers. We will compare personally- and globally-trained classifiers on large corpora that contain data from multiple users. This methodology accurately models domain-level anti-spam solutions that can employ either globally-trained or personally-trained classifiers. Let R : D 2 U denote the mapping from messages to message recipients. Let P = (D,u) u R(D) be the set of all message-recipient pairs. The personallytrained Naïve-Bayes classifier NBP labels each incoming message-recipient pair using the model learned for the specified recipient: NBP(D,u,τ) = [P(S D,θ u ) τ]. where τ is a threshold used to convert the classifier s probability model into a decision function. Note, NBP applies the same classification formula for user u that user u would expect if he or she was using a personallytrained classifier in isolation. The performance of NBP is the same as would result from having each user install his or her own personally-trained filter. The globally-trained Naïve Bayes classifier NBG labels each incoming message-recipient pair based on the global model, ignoring the message s intended recipient: NBG(D,u,τ) = [P(S D,θ ) τ].
3 We define classification in terms of message-recipient pairs to give each user the chance to benefit from their own personal filter. This introduces the question of how to evaluate the accuracy of the classifier in terms of message-recipient pairs. One option would be to treat each message-recipient pair as a separate event. This would imply that incorrectly classifying a single message for two different users would be treated as two distinct errors. However, this option is not ideal for a comparative study as it tends to produce highly correlated errors. Instead, we divide the weight of each message equally among its recipients. If a message is classified incorrectly for all the recipients in a message, we count it as one full error. If is classified incorrectly for half the recipients, then it is counted as half an error. 3 Dynamic Personalization The distinction between globally-trained and personally-trained classifiers does not need to be absolute. Varying degrees of personalization can be added to globally-trained systems or vice versa. Personal classification is probably ideal if there is sufficient data to make a judgment. In contrast, global data is best when good personal data is not available. One method for combining globally-trained and personally-trained classifiers would be to treat them as separate classifiers and combine them using standard classifier-aggregation techniques (Segal, 2005; Dietterich, 2000; Larkey & Croft, 1996). However, this method takes an all-or-nothing approach. Users with insufficient data to train a good personal classifier will have their personalization data ignored as the aggregate system would rely solely on the global classifier. We seek a method that allows even small amounts of personalization data to have an impact when appropriate. Instead, we propose that the trade-off between global and personal data be made at the word or feature level. The basic idea is to apply personal data whenever it provides a better estimate for P(w i S) (or P(w i S)), and to use global data otherwise. We accomplish this trade-off using the following equation: P(w i S,θ,θ u ) F(w i,s,θ u ) + P(w i S,θ )K F(S,θ u ) + K (3) where F(w i,s,θ u ) denotes the number of spam documents in θ u that contain word w i, and F(S,θ u ) denotes the total number of spam documents in θ u. This equation works by applying a Beta prior with a mean of P(w i S,θ ) to the estimate of P(w i S,θ u ). Using this equation, as the amount of personalization data F(S,θ u ) grows, the value of P(w i S,θ,θ u ) asymptotically approaches P(w i S,θ u ). When the amount of personalization data is small, the same equation asymptotes to P(w i S,θ ). Therefore, equation 3 provides a linear transition from a globally-trained classifier to a personally-trained classifier as the amount of personalization data increases. The parameter K determines the relative significance given to personalization data. We use the value of K = 100 for our experiments. We define Naïve-Bayes with dynamic personalization NBDP as the classifier that uses equation 3 to classify NBDP(D,u,τ) = [P(S D,θ,θ u ) τ] 4 Experimental Setup We use three datasets for our analysis. The first two are the TREC 2005 and TREC 2006 public corpora (Cormack & Lynam, 2006; Cormack, 2007). The third data set is the private SpamGuru 2004 corpus (Segal et al., 2004). The evaluation of personally-trained classifiers requires that we can identify the intended recipients of each message. As the original SMTP delivery information is not available in any of the corpora, the recipients for each message were extracted from the To: and CC: headers. Recipients for domains outside of the domains served by each dataset were filtered out. We also removed any address that did not receive at least ten spam and ten ham messages to help ensure that only valid addresses were selected. Many messages in each corpora either cannot be attributed to an appropriate recipient or are destined for users with less than ten ham or less than ten spam messages. We remove these message from each corpora to ensure that all messages in our test data can be attributed to at least one selected address. This is important for comparative purposes as personallytrained classifiers cannot classify messages for which no recipient can be identified. Table 1 describes our test corpora, including the number of valid addresses extracted from each data set. Our testing methodology is based on the TREC 2005 methodology (Cormack & Lynam, 2006), but adapted for personally-trained classification. We train and test each classifier in an incremental fashion. The dataset is processed in chronological order. Each message is presented to the classifier to be labeled. Once the label is returned, the classifier is given the label of the current message for training. Unlike the original TREC 2005 setup, the classifier is asked to label the message
4 SpamGuru TREC TREC Original Spam 130,455 52,790 24,912 Original Ham 42,557 39,399 12,910 Selected Recipients Selected Spam 120,368 40,805 16,499 Selected Ham 33,886 21,549 9,121 Table 1: Test corpora statistics. separately for each of its intended recipients. The correct labels for each recipient is separately passed to the classifier. In personally-trained anti-spam filters, spam is usually defined operationally based on user-supplied spam and ham samples. Each recipient can assign its own label to each message. A message labeled spam by one recipient can be labeled ham by another. Ideally, each message in our test corpora would be assigned a separate label for each user. However, there are no user-specific labels available for any of our test corpora. As we are not aware of any public corpora with judgments stored on a per user basis, we make the unrealistic assumption that each user assigns the same label to each message and assign each user the same judgment, the message s correct label as indicated by the test corpora. As a result of this assumption, the experiments below may underestimate the value of personalized classification. However, we believe the results below are still valid as the large performance differences reported are unlikely to be reversed by the small number of messages that would be labeled differently by individual users. We perform several evaluations with only a subset of all recipients. For these evaluations, we only train and test on those message-recipient pairs that contain the target recipients. 5 Empirical Results Figure 1 compares the performance of globally-trained and personally-trained anti-spam filters on each of our test corpora. For this experiment, we learned separate personal classifiers for each of the recipients in the database that had received ten on more spam and ten or more ham examples. Messages that could not be attributed to recipients with enough examples were excluded from the experiment. The results for the personally-trained classifier were computed by classifying each message-recipient pair with the personallytrained classifier for the pair s recipient. We also evaluated a single, global classifier that was trained on all messages and used to classify every message-recipient pair. The results reveal the limitations of personally-trained classifiers. The personally-trained classifiers performed very poorly, providing at best twice the falsenegative rate of a globally-trained system at a false positive rate of 1%, and missing as much as 10 times as much spam at a false positive rate of 0.1%. The reasons for this failure in retrospect are obvious. Many of the recipients in these experiments have fewer than 100 training examples. For instance, in the TREC 2005 corpus, 7.6% of all ham comes from users with 100 or fewer ham examples. The personally-trained classifiers for these users are unlikely to achieve good performance at a 0.1% false positives as the data is just not there. The poor performance on users with insufficient data makes it virtually impossible for personallytrained classifiers to perform well in this experiment. It is unclear whether personally-trained classifiers should be held accountable for users that cannot provide sufficient training data. Personally-trained classifiers can certainly be effective for users with sufficient data. But, there will always be users which either do not have sufficient messages or time to train the system appropriately. For these users, a globally-trained classifier is likely to be a better alternative. We can get a better idea of the potential of personallytrained classifiers by limiting our evaluation to users with sufficient data. Figure 2 shows the same comparison applied to the four highest ranked users from each corpora, where users are ranked based on the number of spam messages they receive or the number of ham messages, whichever is smaller. We use this ranking as the data set contains several virtual spam traps and ham traps that receive a large number of one type of message, but almost none of the other. All the recipients in this experiment have at least 500 spam and 500 ham examples available for training. It is unlikely that all but the most active of users with have much more than this easily available to train a personal classifier. The results of this second experiment are surprising. The personally-trained classifier performs reasonably well, catching 99.7% of spam on the TREC 2006 database with at a false positive rate of 0.1%. For as good as this is, the globally-trained classifier is even better, catching close to 99.85% at the same false positive rate. Overall, the globally-trained classifier let through about half as many spam messages as did the personally-trained classifier across most of each ROC curve. These results suggest there is little benefit to be gained from personalized classification. But even the results for the complete corpora are for a relatively small set of users. The results cannot be directly applied to large
5 TREC 2006 Corpus 70% 60% 50% 40% 30% 20% 70% 60% 50% 40% 30% 20% 10% 88% Figure 1: Comparison of globally- and personally-trained classifiers % TREC 2006 Corpus 85% 85% 99.9% 99.8% 75% 70% 75% 99.7% Figure 2: Comparison of globally- and personally-trained classifiers for the four largest users in each dataset. organization and ISP s where the user base can include thousands or even hundreds of thousands. This is important as there is still a legitimate concern that aggregating across too many users may degrade the effectiveness of globally-trained classifiers. Figure 3 shows how the performance of globallytrained Naïve Bayes scales as we increase the number of users. Rather than show the complete ROC curve for each data point, we summarize the results of each experiment using one minus the area under the ROC curve (Fawcett, 2003). The results for the SpamGuru 2004 corpus, which is the larger and possible more realistic corpus, show globally-trained classification only improving with the number of included recipients. The results on the TREC 2006 corpus is similar, showing a strict improvement in overall performance as the number of included recipients increase. The results on these two corpora are suggestive of globally-trained classifiers scaling well to hundreds if not thousands of users. However, the results on the TREC 2005 corpus show performance only improving for the first 75 users, and shows performance decreasing as the number of users increase from 75 to 100. The reason for this drop in performance is unclear and warrants further study. Figure 4 compares the performance of dynamic personalization to a strictly globally-trained classifier. The results are mixed. Dynamic personalization offers a modest improvement on the SpamGuru 2004 corpus for false positive rates above 0.1%, a modest improvement on the TREC 2006 corpus for false positive rates above 0.4%, and offers almost identical performance to a globally-trained classifier on the TREC 2005 corpus. On all three datasets the performance of dynamic personalization drops below the globally-trained classifier for the smallest false-positive rates we measure. The modest gains shown for dynamic personalization on two of the three corpora at a false positive rate of 1% is suggestive that it may offer real advantages on very large datasets. But, the consistent drop in performance at low false positive rates is definitely of concern. Further analysis suggests that Dynamic Personalization did not offer a large benefit on these corpora partially because of the ability of globally-trained classifiers to retain a large amount of personalization information. In a globally-trained classifier, the statistics for each term is based on a sum of the statistics for
6 TREC 2006 Corpus 1 - ROC Area (log scale) ROC Area (log scale) 1 - ROC Area (log scale) Number of Recipients Number of Recipients Number of Recipients Figure 3: Scaling of a globally-trained classifier. TREC 2006 Corpus 93% Dynamic Personalization 91% Dynamic Personalization Dynamic Personalization Figure 4: Comparison of a globally-trained classifier and dynamic-personalization. each personal classifier. That is, F(w i,s,θ ) = u U F(w i,s,θ u ). If a term only occurs for a single user, then the global statistics for that term will be the same as the personal statistics. As a result, one of the key advantages of personalization the ability to identify unique aspects of each user s is largely retained in a globallytrained classifier. We can test this hypothesis by modifying our globallytrained classifier to ignore the data unique to each individual. That is, we classify each user s with a classifier trained on θ u = θ θ u. If the globallytrained classifier did not benefit from personalization effects, we would expect this classifier to perform similarly to the classifier learned on θ. Figure 5 shows the results of this experiment for the SpamGuru 2004 and the TREC 2005 corpora. We omit the TREC 2006 dataset as the two largest users represent over of the dataset. The results show that eliminating the personal information from a globally-trained classifier doubles the error rate across the ROC-curve. This is strong evidence that globally-trained classifiers make effective use of personal data. The above analysis is also suggestive of when Dynamic Personalization would work best. Globallytrained classifiers cannot benefit from personalization data when users differ on how each term should be categorized. The term mortgage in a globally-trained classifier will either increase the spam score for everybody, decrease the spam score for everybody, or have little effect. If every user agrees that this term is indicative of spam, then a globally-trained classifier will treat the term as in indicator of spam for all users. But if half the users consider mortgage-related as spam and half the users consider it ham, then a globally-trained classifier will average these opinions and will treat mortgage as neither an indicator of spam or ham. Dynamic personalization will therefore be most useful for disparate user communities that have differing opinions of what is spam. To test this hypothesis, we created a fourth dataset that is a combination of our three test corpora. We used roughly equal number of examples from each corpora to maximize the diversity within the combined corpus. We use the entire TREC 2006 corpora as it is the smallest. We sub-sampled the SpamGuru 2004 and TREC 2005 datasets by selecting the for the largest M users in each data set, where M was
7 93% 93% 91% Personal Data Included Personal Data Excluded 91% Personal Data Included Personal Data Excluded Figure 5: Comparison of a globally-trained classifier with and without personalization data. 88% 86% 84% 82% 78% 76% Combined Corpus Dynamic Personalization Figure 6: Comparison of a globally-trained classifier, a personally-trained classifier, and dynamic personalization on the combined SpamGuru 2004, TREC 2005, and TREC 2006 corpus. chosen such that the number of selected messages was as close as possible to the number of messages in the TREC 2006 corpora. This process produced a new combined corpora with 77,935 examples and 58 users. Figure 6 compares the performance of dynamic personalization and global classification on the combined dataset. The results show that dynamic personalization can be very effective for diverse user communities. Dynamic personalization achieved a 20% reduction in the number of missed spams across the entire ROC curve. The figure also shows the performance of a personally-trained classifier on this more diverse dataset. The personally-trained classifier cannot match the performance of the globally-trained classifier, let alone the performance of dynamic personalization. Interestingly, dynamic personalization makes effective use of personal classification despite the overall poor performance of personal classification on this dataset. 6 Related Work Many personally-trained anti-spam filters offer the ability to leverage globally-trained data sources such as real-time blacklists and spam signature databases. For instance, SpamAssassin supports a variety of DNSRBL, URIBL, and signature databases (SpamAssassin, 2006). SpamAssassin combines the predictions from each external source with its own classification data using a fixed weighting scheme. Dynamic personalization differs in that it uses the specific structure of bag-of-word classifiers to dynamically choose between global and personal data at the level of individual tokens. Cosoi combines globally- and personally-trained text classifiers by dynamically learning the optimal weights to combine the outputs of each classifier (Cosoi, 2007). Cosoi focuses on a filtering model in which the personally-trained classifier is continuously updated, but the user only infrequently retrieves the latest globally-trained classifier, say once a month. As a result, Cosoi focuses on the problem of how the weight assigned to the globally-trained classifier should be adapted over time to adjust to the timeliness of the global model. Yerazunis discusses the importance of inoculating a personally-trained text classifier by sharing spam examples across users (Yerazunis, 2004). Inoculation works by training each user s personal classifier on the spam examples forwarded by trusted colleagues. Our results confirm the value of sharing data across multiple users, and even across large collections of dissimilar users. Graham has argued that personally-trained statistical filters are the key to an effective solution to the spam problem (Graham, 2003b; Graham, 2003a). He reasons that spammers will find it difficult to tune their content to bypass each individual user s classifier. Our
8 results suggest that globally-trained anti-spam filters may actually be a more effective solution due to the larger performance gains afforded by sharing all training data across a large user community. Dynamic Personalization may be better yet as it both leverages training data across all users and creates user-specific classifiers that may be more difficult for spammers to target. 7 Conclusion We compared the performance of globally-trained and personally-trained text classifiers on three separate corpora. Our results show that globally-trained classifiers easily outperform personalized text classifiers for both small and large user communities. Furthermore, we demonstrate that globally-trained classifiers appear to scale well with performance improving as we increase the number of accounts being aggregated. We show that the ability of globally-trained classifiers to scale to large, diverse user communities is partially due to its ability to retain and use a substantial amount of personal data. However, globally-trained classifiers cannot make use of all available personal data. A globally-trained classifier cannot honor the preferences of two users if the preferences of those users contradict. Dynamic personalization combines globally-trained and personallytrained text classifiers at the level of individual words to allow globally-trained text classifiers to take advantage of any available user-specific training data. Our results demonstrate that Dynamic Personalization offers a modest performance improvement on test corpora that includes diverse user communities. Graham, P. (2003a). Better bayesian filtering. Proceedings of the MIT Spam Conference. Graham, P. (2003b). A plan for spam. Proceedings of the MIT Spam Conference. Larkey, L., & Croft, W. (1996). Combining classifiers in text categorization. SIGIR-96: 19th ACM International Conference on Research and Development in Information Retrieval (pp ). Zurich. Lewis, D. D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval. Proceedings of ECML-98, 10th European Conference on Machine Learning. Segal, R. (2005). Combining multiple classifiers. Virus Bulletin. February, Segal, R. B., Crawford, J., Kephart, J. O., & Leiba, B. (2004). SpamGuru: An enterprise anti-spam filtering system. Proceedings of the First Conference on and Anti-Spam. SpamAssassin (2006). The spamassassin open-source spam filter. Yerazunis, W. (2004). The spam filtering plateau at 99.9% accuracy and how to get past it. Proceedings of the MIT Spam Conference. References Cormack, G. (2007). TREC 2006 Spam Track overview. The Fifteenth Text REtrieval Conference (TREC2006) Notebook. Cormack, G., & Lynam, T. (2006). TREC 2005 Spam Track overview. The Fourteenth Text REtrieval Conference (TREC 2005) Notebook. Cosoi, C. (2007). Combining antispam filters. Proceedings of the MIT Spam Conference. Dietterich, T. G. (2000). Ensemble methods in machine learning. MCS 00: Proceedings of the First International Workshop on Multiple Classifier Systems (pp. 1 15). Springer-Verlag. Fawcett, T. (2003). Roc graphs: Notes and practical considerations for data mining researchers.
Simple Language Models for Spam Detection
Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to
Anti Spamming Techniques
Anti Spamming Techniques Written by Sumit Siddharth In this article will we first look at some of the existing methods to identify an email as a spam? We look at the pros and cons of the existing methods
Bayesian Spam Filtering
Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary [email protected] http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating
A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering
A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences
Groundbreaking Technology Redefines Spam Prevention. Analysis of a New High-Accuracy Method for Catching Spam
Groundbreaking Technology Redefines Spam Prevention Analysis of a New High-Accuracy Method for Catching Spam October 2007 Introduction Today, numerous companies offer anti-spam solutions. Most techniques
PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering
2007 IEEE/WIC/ACM International Conference on Web Intelligence PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering Khurum Nazir Juneo Dept. of Computer Science Lahore University
An Overview of Spam Blocking Techniques
An Overview of Spam Blocking Techniques Recent analyst estimates indicate that over 60 percent of the world s email is unsolicited email, or spam. Spam is no longer just a simple annoyance. Spam has now
Spam detection with data mining method:
Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification Keywords: ensemble learning, SVM classifier,
eprism Email Security Appliance 6.0 Intercept Anti-Spam Quick Start Guide
eprism Email Security Appliance 6.0 Intercept Anti-Spam Quick Start Guide This guide is designed to help the administrator configure the eprism Intercept Anti-Spam engine to provide a strong spam protection
The Enron Corpus: A New Dataset for Email Classification Research
The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu
Not So Naïve Online Bayesian Spam Filter
Not So Naïve Online Bayesian Spam Filter Baojun Su Institute of Artificial Intelligence College of Computer Science Zhejiang University Hangzhou 310027, China [email protected] Congfu Xu Institute of Artificial
Spam Filtering based on Naive Bayes Classification. Tianhao Sun
Spam Filtering based on Naive Bayes Classification Tianhao Sun May 1, 2009 Abstract This project discusses about the popular statistical spam filtering process: naive Bayes classification. A fairly famous
Adaption of Statistical Email Filtering Techniques
Adaption of Statistical Email Filtering Techniques David Kohlbrenner IT.com Thomas Jefferson High School for Science and Technology January 25, 2007 Abstract With the rise of the levels of spam, new techniques
Introduction. How does email filtering work? What is the Quarantine? What is an End User Digest?
Introduction The purpose of this memo is to explain how the email that originates from outside this organization is processed, and to describe the tools that you can use to manage your personal spam quarantine.
Email Spam Detection A Machine Learning Approach
Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn
A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2
UDC 004.75 A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2 I. Mashechkin, M. Petrovskiy, A. Rozinkin, S. Gerasimov Computer Science Department, Lomonosov Moscow State University,
Analysis of Spam Filter Methods on SMTP Servers Category: Trends in Anti-Spam Development
Analysis of Spam Filter Methods on SMTP Servers Category: Trends in Anti-Spam Development Author André Tschentscher Address Fachhochschule Erfurt - University of Applied Sciences Applied Computer Science
Solutions IT Ltd Virus and Antispam filtering solutions 01324 877183 [email protected]
Contents Reduce Spam & Viruses... 2 Start a free 14 day free trial to separate the wheat from the chaff... 2 Emails with Viruses... 2 Spam Bourne Emails... 3 Legitimate Emails... 3 Filtering Options...
Machine Learning Final Project Spam Email Filtering
Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE
Achieve more with less
Energy reduction Bayesian Filtering: the essentials - A Must-take approach in any organization s Anti-Spam Strategy - Whitepaper Achieve more with less What is Bayesian Filtering How Bayesian Filtering
Journal of Information Technology Impact
Journal of Information Technology Impact Vol. 8, No., pp. -0, 2008 Probability Modeling for Improving Spam Filtering Parameters S. C. Chiemeke University of Benin Nigeria O. B. Longe 2 University of Ibadan
Savita Teli 1, Santoshkumar Biradar 2
Effective Spam Detection Method for Email Savita Teli 1, Santoshkumar Biradar 2 1 (Student, Dept of Computer Engg, Dr. D. Y. Patil College of Engg, Ambi, University of Pune, M.S, India) 2 (Asst. Proff,
Antispam Security Best Practices
Antispam Security Best Practices First, the bad news. In the war between spammers and legitimate mail users, spammers are winning, and will continue to do so for the foreseeable future. The cost for spammers
Spam filtering. Peter Likarish Based on slides by EJ Jung 11/03/10
Spam filtering Peter Likarish Based on slides by EJ Jung 11/03/10 What is spam? An unsolicited email equivalent to Direct Mail in postal service UCE (unsolicited commercial email) UBE (unsolicited bulk
PROOFPOINT - EMAIL SPAM FILTER
416 Morrill Hall of Agriculture Hall Michigan State University 517-355-3776 http://support.anr.msu.edu [email protected] PROOFPOINT - EMAIL SPAM FILTER Contents PROOFPOINT - EMAIL SPAM FILTER... 1 INTRODUCTION...
1 Introductory Comments. 2 Bayesian Probability
Introductory Comments First, I would like to point out that I got this material from two sources: The first was a page from Paul Graham s website at www.paulgraham.com/ffb.html, and the second was a paper
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
Intercept Anti-Spam Quick Start Guide
Intercept Anti-Spam Quick Start Guide Software Version: 6.5.2 Date: 5/24/07 PREFACE...3 PRODUCT DOCUMENTATION...3 CONVENTIONS...3 CONTACTING TECHNICAL SUPPORT...4 COPYRIGHT INFORMATION...4 OVERVIEW...5
CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance
CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance Shen Wang, Bin Wang and Hao Lang, Xueqi Cheng Institute of Computing Technology, Chinese Academy of
Spam Filtering Based on Latent Semantic Indexing
Spam Filtering Based on Latent Semantic Indexing Wilfried N. Gansterer Andreas G. K. Janecek Robert Neumayer Abstract In this paper, a study on the classification performance of a vector space model (VSM)
Spam Filtering Based On The Analysis Of Text Information Embedded Into Images
Journal of Machine Learning Research 7 (2006) 2699-2720 Submitted 3/06; Revised 9/06; Published 12/06 Spam Filtering Based On The Analysis Of Text Information Embedded Into Images Giorgio Fumera Ignazio
About this documentation
Wilkes University, Staff, and Students have a new email spam filter to protect against unwanted email messages. Barracuda SPAM Firewall will filter email for all campus email accounts before it gets to
A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters
2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters Wei-Lun Teng, Wei-Chung Teng
Why Bayesian filtering is the most effective anti-spam technology
Why Bayesian filtering is the most effective anti-spam technology Achieving a 98%+ spam detection rate using a mathematical approach This white paper describes how Bayesian filtering works and explains
Abstract. Find out if your mortgage rate is too high, NOW. Free Search
Statistics and The War on Spam David Madigan Rutgers University Abstract Text categorization algorithms assign texts to predefined categories. The study of such algorithms has a rich history dating back
Adaptive Filtering of SPAM
Adaptive Filtering of SPAM L. Pelletier, J. Almhana, V. Choulakian GRETI, University of Moncton Moncton, N.B.,Canada E1A 3E9 {elp6880, almhanaj, choulav}@umoncton.ca Abstract In this paper, we present
Predicting the Stock Market with News Articles
Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is
A Game Theoretical Framework for Adversarial Learning
A Game Theoretical Framework for Adversarial Learning Murat Kantarcioglu University of Texas at Dallas Richardson, TX 75083, USA muratk@utdallas Chris Clifton Purdue University West Lafayette, IN 47907,
Using News Articles to Predict Stock Price Movements
Using News Articles to Predict Stock Price Movements Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 9237 [email protected] 21, June 15,
Bayesian Filtering. Scoring
9 Bayesian Filtering The Bayesian filter in SpamAssassin is one of the most effective techniques for filtering spam. Although Bayesian statistical analysis is a branch of mathematics, one doesn't necessarily
Spam Filtering using Naïve Bayesian Classification
Spam Filtering using Naïve Bayesian Classification Presented by: Samer Younes Outline What is spam anyway? Some statistics Why is Spam a Problem Major Techniques for Classifying Spam Transport Level Filtering
IBM Express Managed Security Services for Email Security. Anti-Spam Administrator s Guide. Version 5.32
IBM Express Managed Security Services for Email Security Anti-Spam Administrator s Guide Version 5.32 Table of Contents 1. Service overview... 3 1.1 Welcome... 3 1.2 Anti-Spam (AS) features... 3 1.3 How
Tweaking Naïve Bayes classifier for intelligent spam detection
682 Tweaking Naïve Bayes classifier for intelligent spam detection Ankita Raturi 1 and Sunil Pranit Lal 2 1 University of California, Irvine, CA 92697, USA. [email protected] 2 School of Computing, Information
A semi-supervised Spam mail detector
A semi-supervised Spam mail detector Bernhard Pfahringer Department of Computer Science, University of Waikato, Hamilton, New Zealand Abstract. This document describes a novel semi-supervised approach
Configuring MDaemon for Centralized Spam Blocking and Filtering
Configuring MDaemon for Centralized Spam Blocking and Filtering Alt-N Technologies, Ltd 2201 East Lamar Blvd, Suite 270 Arlington, TX 76006 (817) 525-2005 http://www.altn.com July 26, 2004 Contents A Centralized
Understanding Proactive vs. Reactive Methods for Fighting Spam. June 2003
Understanding Proactive vs. Reactive Methods for Fighting Spam June 2003 Introduction Intent-Based Filtering represents a true technological breakthrough in the proper identification of unwanted junk email,
[email protected] ** The High Institute of Zahra for Comperhensive Professions, Zahra-Libya
AN ANTI-SPAM SYSTEM USING ARTIFICIAL NEURAL NETWORKS AND GENETIC ALGORITHMS ABDUELBASET M. GOWEDER *, TARIK RASHED **, ALI S. ELBEKAIE ***, and HUSIEN A. ALHAMMI **** * The High Institute of Surman for
T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577
T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or
Spam Testing Methodology Opus One, Inc. March, 2007
Spam Testing Methodology Opus One, Inc. March, 2007 This document describes Opus One s testing methodology for anti-spam products. This methodology has been used, largely unchanged, for four tests published
IMF Tune Opens Exchange to Any Anti-Spam Filter
Page 1 of 8 IMF Tune Opens Exchange to Any Anti-Spam Filter September 23, 2005 10 th July 2007 Update Include updates for configuration steps in IMF Tune v3.0. IMF Tune enables any anti-spam filter to
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools
Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools Spam Track Wednesday 1 March, 2006 APRICOT Perth, Australia James Carpinter & Ray Hunt Dept. of Computer Science and Software
ContentCatcher. Voyant Strategies. Best Practice for E-Mail Gateway Security and Enterprise-class Spam Filtering
Voyant Strategies ContentCatcher Best Practice for E-Mail Gateway Security and Enterprise-class Spam Filtering tm No one can argue that E-mail has become one of the most important tools for the successful
Partitioned Logistic Regression for Spam Filtering
Partitioned Logistic Regression for Spam Filtering Ming-wei Chang University of Illinois 201 N Goodwin Ave Urbana, IL, USA [email protected] Wen-tau Yih Microsoft Research One Microsoft Way Redmond, WA,
Why Content Filters Can t Eradicate spam
WHITEPAPER Why Content Filters Can t Eradicate spam About Mimecast Mimecast () delivers cloud-based email management for Microsoft Exchange, including archiving, continuity and security. By unifying disparate
Email Getting Started Guide Unix Platform
Edition/Issue Email Getting Started Guide Unix Platform One of the most important features of your new Web Hosting account is access to a personalized Email solution that includes individual Email addresses
Opus One PAGE 1 1 COMPARING INDUSTRY-LEADING ANTI-SPAM SERVICES RESULTS FROM TWELVE MONTHS OF TESTING INTRODUCTION TEST METHODOLOGY
Joel Snyder Opus One February, 2015 COMPARING RESULTS FROM TWELVE MONTHS OF TESTING INTRODUCTION The following analysis summarizes the spam catch and false positive rates of the leading anti-spam vendors.
Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information
Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information Technology : CIT 2005 : proceedings : 21-23 September, 2005,
Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model
AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different
AntiSpam QuickStart Guide
IceWarp Server AntiSpam QuickStart Guide Version 10 Printed on 28 September, 2009 i Contents IceWarp Server AntiSpam Quick Start 3 Introduction... 3 How it works... 3 AntiSpam Templates... 4 General...
III. DATA SETS. Training the Matching Model
A Machine-Learning Approach to Discovering Company Home Pages Wojciech Gryc Oxford Internet Institute University of Oxford Oxford, UK OX1 3JS Email: [email protected] Prem Melville IBM T.J. Watson
Deposit Identification Utility and Visualization Tool
Deposit Identification Utility and Visualization Tool Colorado School of Mines Field Session Summer 2014 David Alexander Jeremy Kerr Luke McPherson Introduction Newmont Mining Corporation was founded in
How To Protect Your Email From Spam On A Barracuda Spam And Virus Firewall
Comprehensive Email Filtering: Barracuda Spam & Virus Firewall Safeguards Legitimate Email Email has undoubtedly become a valued communications tool among organizations worldwide. With frequent virus attacks
Increasing the Accuracy of a Spam-Detecting Artificial Immune System
Increasing the Accuracy of a Spam-Detecting Artificial Immune System Terri Oda Carleton University [email protected] Tony White Carleton University [email protected] Abstract- Spam, the electronic
An Approach to Detect Spam Emails by Using Majority Voting
An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H-12 Islamabad, Pakistan Usman Qamar Faculty,
AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM
ISSN: 2229-6956(ONLINE) ICTACT JOURNAL ON SOFT COMPUTING, APRIL 212, VOLUME: 2, ISSUE: 3 AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM S. Arun Mozhi Selvi 1 and R.S. Rajesh 2 1 Department
REVIEW AND ANALYSIS OF SPAM BLOCKING APPLICATIONS
REVIEW AND ANALYSIS OF SPAM BLOCKING APPLICATIONS Rami Khasawneh, Acting Dean, College of Business, Lewis University, [email protected] Shamsuddin Ahmed, College of Business and Economics, United Arab
How to keep spam off your network
GFI White Paper How to keep spam off your network What features to look for in anti-spam technology A buyer s guide to anti-spam software, this white paper highlights the key features to look for in anti-spam
large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)
large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven
1 Accessing E-mail accounts on the Axxess Mail Server
1 Accessing E-mail accounts on the Axxess Mail Server The Axxess Mail Server provides users with access to their e-mail folders through POP3, and IMAP protocols, or OpenWebMail browser interface. The server
Improving the Performance of Heuristic Spam Detection using a Multi-Objective Genetic Algorithm. James Dudley
Improving the Performance of Heuristic Spam Detection using a Multi-Objective Genetic Algorithm James Dudley This report is submitted as partial fulfilment of the requirements for the Honours Programme
The Basics of Graphical Models
The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Categorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
Filtering Spam Using Search Engines
Filtering Spam Using Search Engines Oleg Kolesnikov, Wenke Lee, and Richard Lipton ok,wenke,rjl @cc.gatech.edu College of Computing Georgia Institute of Technology Atlanta, GA 30332 Abstract Spam filtering
Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.
Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada
Bayesian Spam Detection
Scholarly Horizons: University of Minnesota, Morris Undergraduate Journal Volume 2 Issue 1 Article 2 2015 Bayesian Spam Detection Jeremy J. Eberhardt University or Minnesota, Morris Follow this and additional
