AN EVALUATION OF FILTERING TECHNIQUES IN A NAÏVE BAYESIAN ANTI-SPAM FILTER. Vikas P. Deshpande


AN EVALUATION OF FILTERING TECHNIQUES IN A NAÏVE BAYESIAN ANTI-SPAM FILTER

by Vikas P. Deshpande

A report submitted in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in Computer Science

Approved:
Dr. Robert F. Erbacher, Major Professor
Dr. Nicholas Flann, Committee Member
Dr. Hugo de Garis, Committee Member

UTAH STATE UNIVERSITY
Logan, Utah
2004

ABSTRACT

An Evaluation of Filtering Techniques in a Naïve Bayesian Anti-spam Filter

by Vikas P. Deshpande, Master of Science
Utah State University, 2004

Major Advisor: Dr. Robert F. Erbacher
Department: Computer Science

An efficient anti-spam filter that blocks all unsolicited messages (i.e., spam) without blocking any legitimate messages is a growing need. To address this problem, this report takes a statistically based approach, employing a Bayesian anti-spam filter, because it is content-based and self-learning (adaptive) in nature. We train the filter using a large corpus of legitimate messages and spam, and we test the filter using new incoming personal messages. We evaluate four effective filtering techniques available for a Bayesian filter. We look at the effectiveness of each technique, and we evaluate different configurations of the most effective one for different threshold values in order to find an optimal anti-spam filter configuration. Based on cost-sensitive measures, we conclude that additional safety precautions are needed before a Bayesian anti-spam filter can be put into practice.

(74 pages)

ACKNOWLEDGMENTS

I would like to thank my major professor, Dr. Robert F. Erbacher, for his guidance. I would also like to thank Dr. Nick Flann and Dr. Hugo de Garis for being members of my committee. I am grateful to my parents for their encouragement and moral support.

Vikas P. Deshpande

CONTENTS

ABSTRACT
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
2 ANTI-SPAM FILTERS
2.1 Rule-based Approach
2.2 Blacklist Approach
2.3 Whitelists / Effective Filters Approach
2.4 Signature-based Approach
2.5 Filters Fight Back
3 NAÏVE BAYESIAN BASED CLASSIFICATION
3.1 Bayesian Networks
3.2 Experimental Bayesian Filter
3.3 Cost Evaluation Measures
3.4 Independence Factor of Bayesian Networks
4 FILTERING TECHNIQUES IN NAÏVE BAYESIAN FILTER
4.1 Use of All Tokens
4.2 Use of a Fixed Number of Tokens
4.3 Use of Standard Deviation
4.4 Use of a Relative Number of Tokens
5 EXPERIMENTS
6 EVALUATION RESULTS
7 CONCLUSIONS
8 FUTURE WORK
REFERENCES
APPENDIX

LIST OF TABLES

Table
6.1 The results (false positives, false negatives and correct classifications) of four filtering techniques for different values of λ
6.2 The results (TCR) of four filtering techniques for different values of λ
6.3 The results (false positives, false negatives and correct classifications) of four configurations (5, 15, 20 and 25) of the fixed token approach for different values of λ
6.4 The results (TCR) of four configurations (5, 15, 20 and 25) of the fixed token approach for different values of λ
6.5 The results (false positives, false negatives and correct classifications) of four configurations (3, 7, 10 and 12) of the fixed token approach for different values of λ
6.6 The results (TCR) of four configurations (3, 7, 10 and 12) of the fixed token approach for different values of λ

LIST OF FIGURES

Figure
6.1 TCR for effective configurations of the fixed token approach at t = 0.999 (λ = 999)
6.2 TCR for effective configurations of the fixed token approach at t = 0.9 (λ = 9)
6.3 TCR for effective configurations of the fixed token approach at t = 0.5 (λ = 1)

CHAPTER 1
INTRODUCTION

Email service is one of the advantages of the Internet. In recent times, however, this service has faced the serious problem of spam. Spam can be defined as unsolicited automated email. It need not be sent solely for commercial purposes; spam can even be used for political and social purposes. Direct marketers exploit the low cost of email service to mass mail their ideas to thousands of recipients. Spam causes frustration for email users because these messages take up a lot of their mailbox space. Further, users waste time deleting these emails. Spam clutter costs email service providers millions of dollars. Spam wastes bandwidth for dial-up connected Internet users and may involve minors in some illegal businesses (e.g., pay XXX $ to get rich instantly) [4].

The basic problem in eliminating spam lies in differentiating spam from a legitimate message. For example, if person A is looking for a car, and his neighbor, person B (who is planning to sell his own car), happens to know this, then person B might send person A an email offering his own car at some price. This message is unsolicited and commercial, and, thus, it can easily be mistaken for spam. One of the features that differentiates spam from legitimate messages is that spam is mass mailed by automation. Therefore, even legitimate messages can be categorized as spam if blindly mass mailed. However, the message content of spam typically forms a distinct category rarely observed in legitimate messages, making it possible for text classifiers to be used for anti-spam filtering.

A naïve Bayesian anti-spam filter is a text categorization technique based on a machine-learning algorithm. Proposed by Sahami et al. [10], the technique shows some impressive results on new, unseen incoming messages. The filter requires training, which can be provided by a previous set of spam and legitimate messages. It keeps track of each word that occurs only in spam, only in legitimate messages, and in both. Based on these occurrence statistics for words (also called tokens), new, incoming, unseen messages are processed and classified accordingly.

There are many filtering techniques available with a naïve Bayesian approach. This report evaluates four effective filtering techniques currently being used in practical naïve Bayesian anti-spam filters. Of these four, our evaluation found the fixed token approach to be the most successful. This report further evaluates the fixed token approach with different configurations to find an optimal approach. The results indicate that a content-based filter based on a naïve Bayesian approach gives peak performance when 5 to 12 tokens are used to process a new message, irrespective of the size of the message. Hence, we believe that our filter can make a positive contribution in first-pass filters.

This report is organized as follows. Chapter 2 discusses five significant approaches to anti-spam filtering along with their advantages and disadvantages. Chapters 3 and 4, respectively, describe in detail the Bayesian classifier and the four filtering techniques required for this study. Our experimental set-up, along with the procedure followed, is presented in Chapter 5. Evaluation results and conclusions are presented in Chapters 6 and 7. In the last chapter, Chapter 8, we describe possible future work to extend our study.

CHAPTER 2
ANTI-SPAM FILTERS

Many different spam-filtering approaches have been tried. Most of these have a degree of effectiveness, but they have not attained global popularity because of their drawbacks. The five most significant spam-filtering approaches are discussed below, along with their strengths and weaknesses.

2.1 Rule-based Approach

As the name suggests, in a rule-based approach, each email is compared with a set of rules to determine whether it is spam or not. A rule set contains rules, with a weight given to each rule. Initially, each incoming email message has a score of zero. The email is then parsed to detect matches against any rule. If a rule matches the message, its weight is added to the final score of the email. In the end, if the final score is found to be above some threshold value, the email is declared to be spam [10].

Rules are nothing but observations of features that are found more frequently in spam than in legitimate messages. Some examples include a missing sender's address, use of the color red above some threshold value, etc. Further, there are some spam features that remain constant over a period of time, for example, forged headers and auto-executing JavaScript.

Advantages: This approach can be very effective with a given set of rules. It can achieve 90 to 95 percent efficiency. The filter is easy to install, as it merely requires copying the rule set. It

requires neither training nor any sort of personal tuning. Further, the rule set can be updated by copying an additional set of rules to counter the current trend of spam.

Disadvantages: The disadvantage of the rule-based approach is that it is rigid. There is no self-learning facility available for the filter. Spammers with knowledge of the rule set can design spam to deceive the method. For example, if there is a rule that classifies a message as spam when it contains the word "Viagra" more than five times, the spammer can easily circumvent the rule by using the term "V*i*a*g*r*a" instead of "Viagra." Rules cannot be kept secret. The best option is to go through every spam message and update the rule set by manually adding newfound rules. Unfortunately, this updating process is never ending, as spammers continually devise new procedures to deceive spam filters. This process requires personal effort, time, and some level of expertise, qualities not found in every user. The rule-based approach could be used in an integrated spam filter together with some other approach.

In a rule-based approach, decisions as to whether to classify an email as spam or not are binary. This classification process on its own does not give continuous confidence. Such confidence is critical because the cost of a false positive classification (classifying a legitimate message as spam) is very high. There is a need for a classification scheme based on probability, wherein all messages near a threshold value can be categorized as legitimate to avoid the danger of false positives. The rule-based approach is faster than the use of blacklists, but it is slower than statistically based approaches. SpamAssassin is the most successful spam filter available in the market that

uses this approach. The ReadMe file of SpamAssassin states that SpamAssassin differentiates between spam and non-spam mail correctly in [...] percent of the cases.

2.2 Blacklist Approach

In this approach, servers that are found to be the sources of spam are blacklisted. Emails coming from blacklisted servers are marked as spam and deleted at the server level. The blacklist can also be maintained at the personal level.

Advantages: The blacklist approach is helpful in cases in which servers are compromised and used for sending spam to hundreds of thousands of users. This is a better and cheaper option to use at the ISP level along with some other effective filtering technique. Tools like Razor and Pyzor can be used for this purpose.

Disadvantages: The criterion for any spam filter is not only efficiency in filtering spam but also doing the job with the minimum number of false positives. Marking a legitimate message as spam is a great mistake and is much more costly than marking spam as legitimate. The blacklist approach generates a large number of false positives. This being the case, its generalized approach of shunning a culprit server forever is not a good idea. A legitimate message arriving from a blacklisted server would always be considered spam. MAPS RBL, probably the best-known blacklist, catches only 24 percent of all spam, with a false positive rate of 34 percent. Moreover, there are many ethical issues involved in blacklisting a server. Probably the worst scenario is blacklisting a server without knowing whether that server is a source of spam or not. Moreover, a spammer is

a moving target. While a spammer might use a compromised computer to send spam, as soon as he learns that the computer has been detected, he can use a different computer until that one is detected. This can go on and on. The end result is that while servers are shunned, the spammer still keeps spamming.

The solution to this problem has been the use of Distributed Adaptive Blacklists. Their basic working is to detect a spam message and inform all the recipients of that message (which may run into millions) about its status. Digests of spam are maintained at the server level. So, whenever a new message is received at the MTA, adaptive blacklists are consulted to detect whether the message is spam. There are tools that ensure that messages that are different versions of the same spam do not get identified as legitimate. In addition, maintainers of distributed blacklists create honey-pot addresses, i.e., addresses never used for legitimate purposes. The basic disadvantage of this approach is that it generates a considerable number of false negatives. Thus, it is recommended that this approach be used in conjunction with another effective filtering technique.

2.3 Whitelist / Effective Filters Approach

Whitelists contain legitimate email addresses. A whitelist filter is configured with an MTA. Messages arriving from any of these addresses are allowed to pass into the recipient's mailbox. Messages with sources that are not whitelisted are considered to be spam. It is difficult to maintain an exhaustive list of all legitimate addresses. A better option would be to share whitelists among friends and relatives. However, this, too, can be an easy route for a spammer to get a big list of legitimate email addresses. The challenge-response approach has been integrated with the whitelist approach to avoid such a

problem. A sender who is not whitelisted will receive a challenge from the recipient for authentication purposes. The challenge might contain an image to decipher or a word to recognize and spell. Such a task would be simple for a human, but a machine would not be able to reply to the challenge. Once the sender replies to the challenge, his address would be added to the whitelist, and all his mails would directly reach the recipient's mailbox in the future.

Advantages: Once all legitimate addresses are recorded, the whitelist approach coupled with the challenge-response approach has a 100 percent efficiency rate. The challenge-response component ensures that spammers do not reply to millions of challenges and get registered on any whitelists integrated with anti-spam filters. The challenge-response approach requires human intervention for this very purpose. Spammers who try to respond to such challenges expose themselves to users seeking legal remedies against them.

Disadvantages: The main disadvantage of the challenge-response approach is that it generates a significant number of false positives, the reason being that some legitimate addresses are not listed on the whitelist. There are many reasons for senders not replying to the challenge-response system. Doing so means extra effort for the senders. They may have unreliable ISPs or multiple email addresses, or they may not care to reply to the challenge. Such senders would not be whitelisted, and, hence, their mails would be classified as spam. There are cases in which users receive mails from automatic reply machines, for example for online purchases, online registrations, web list sign-ups, etc. Such systems would not be able to reply to challenges.

In addition, if a user were to add an incorrect address or to forget to add an address to his whitelist, he would generate false positives. Further, some people consider the challenge-response system rude.

To avoid false positives, the strict action taken by the above approach can be toned down to a milder one. Emails from unknown sources can be placed in some folder (a low-priority mailbox) other than the inbox. This box could then be checked weekly. All unknown senders would receive replies stating that their emails would not be read for a week. If they wanted their message to be read immediately, they would have to respond to the challenge sent. Thus, instead of using the whitelist approach as a single tool of defense, it is more effective to use it in conjunction with some other anti-spam tool.

2.4 Signature-based Approach

The signature-based approach compares every new incoming email with the known set of spam. Spam messages are collected from honey pots and deliberately created fake email addresses. When any new spam message comes to light, all other email services are alerted. The signature-based approach works in this way. Each character in an email carries a weight. The summation over all characters gives a final score that is used as the signature of that email. The signature of every new message is then compared with the signatures of known spam. If the signatures match, the new email is classified as spam.
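The signature scheme just described can be illustrated with a minimal sketch. The text does not say how character weights are assigned, so the sketch assumes each character's weight is its Unicode code point; the function names `signature` and `is_known_spam` are hypothetical and serve only as illustration.

```python
def signature(message: str) -> int:
    """Sum the per-character weights of a message.

    Assumption: the weight of a character is its Unicode code point.
    """
    return sum(ord(ch) for ch in message)


def is_known_spam(message: str, spam_signatures: set) -> bool:
    """Classify as spam when the signature matches a known spam signature."""
    return signature(message) in spam_signatures


# Signatures of spam collected elsewhere (e.g., from honey-pot addresses).
known_spam = ["Get rich instantly, visit our site"]
spam_signatures = {signature(m) for m in known_spam}

# An exact copy of a known spam message matches its stored signature.
print(is_known_spam("Get rich instantly, visit our site", spam_signatures))
```

Under this assumed weighting, any change to the characters of a message changes the sum, so only copies whose characters add up to the same total are matched.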

Advantages: The signature-based approach rarely generates false positives. Even the false negatives generated by this approach are few compared to other approaches. BrightMail is a successful spam filter that follows this approach.

Disadvantages: These filters are easy to defeat. Since they are backward looking, they take action only after they become aware of a spam message. By the time the honey pot has attracted a new spam message, a signature has been assigned to it, and the updates have been sent and installed at all ISPs, the spammer has already sent millions of spam messages. Even a small change in the emails might make the filter useless. Just by adding some random characters to each spam message, the signature of each will differ from that of the original spam. Thus, all such spam messages will pass for legitimate messages. Research on the logic behind adding random characters to messages is being actively pursued, but, even so, spammers are always ahead of these filters. The efficiency of these filters is found to be 50 to 70 percent. In addition, these filters can only be used at the ISP level as first-pass filters.

2.5 Filters Fight Back

The filters-fight-back approach is the most aggressive among all the approaches adopted for filtering spam. It employs the policy that "attack is the best self-defense." A spam message usually includes URLs for the readers to visit a site. The purpose may be commercial or social. The filters-fight-back approach works in this way. Once a message is detected as spam, these filters send a number of requests to the sites specified by those URLs. A user can personally configure the number of requests. If any spam is sent to

thousands of users, there is a high possibility that the server hosting that site would receive millions of requests, increasing its cost and bandwidth use and effectively shutting down all its services. Such filters are also known as auto-retrieving filters [6].

Advantages: Since the spam itself is the reason for the spammer's loss, spammers would hesitate to send spam to unknown users. More recipients for the spam would create more loss for the spammer's web server.

Disadvantages: The job prior to fighting back is to detect spam. Any URL sent to thousands of users mainly indicates spam. However, at the bottom of every email message there are many advertisements, such as those of Yahoo, MSN, etc., many of which contain legitimate URLs. If the site turns out to be legitimate, negatively affecting it might lead to legal proceedings. To avoid such confusion, auto-retrieval filters should refer to blacklists for servers that are banned. Further, the servers need to be blacklisted by human intervention, thus ensuring that the auto-retrieval filters send requests only to web servers that are blacklisted.

With this approach, there is an easy way out for spammers. They need only include active unsubscribe links in their messages. In that way, the users with auto-retrieval filters will be unsubscribed from the program, which is good news. However, the spam is not reduced globally. There is also the possibility that spammers might include their contact information and their image for marketing purposes instead of their URLs. Doing so would wholly eliminate the danger of auto-retrieval filters.

To make this filter more effective, one needs to fine-tune the filter to each user's incoming messages. Fine-tuning a filter requires time and expertise, both of which are often hard to come by. Thus, one needs a filter that is adaptive in nature, one that self-learns from the given legitimate messages and spam [2].

CHAPTER 3
BAYESIAN CLASSIFIERS

To understand the workings of Bayesian classifiers, one needs to know the underlying concept of Bayesian networks. A Bayesian classifier is simply the application of a Bayesian network to the task of text classification.

3.1 Bayesian Networks

Bayesian networks are probabilistic networks. They are used as problem-solving models in different fields. In a Bayesian network, nodes indicate the variables of the problem, and the directed edges between the nodes indicate the relationships between the variables. A Bayesian network, in our case, is used to represent a probability distribution. In such a graph, a node represents a random variable, and a directed edge indicates a probabilistic dependency from the variable denoted by the parent node to that of the child. Hence, it is implied that any node in the network is conditionally independent of its non-descendants, given its parents. Each node is associated with a conditional probability table that gives the distribution over that node for every possible assignment of values to its parents [0, 3].

Let us formulate the Bayesian network to solve our classification problem. Let C be the class variable that indicates to which class (legitimate / spam) a message belongs, and let node X_i denote any attribute (a token, in our case) in the message. We need to determine the class of a message. For our purposes, c_k denotes the class to be inferred given the specific values of the required attributes. In our case, the specific values would be either 0 or 1 depending

on their presence in the message. The problem of determining the class can be solved using Bayes' theorem:

$$P(C = c_k \mid X = x) = \frac{P(X = x \mid C = c_k)\, P(C = c_k)}{P(X = x)}$$

P(X = x | C = c_k) is difficult to calculate, as there is a high chance that an attribute might be dependent on some other set of attributes. The easy solution is to assume that the attributes are conditionally independent of each other. Consequently, the probability becomes:

$$P(X = x \mid C = c_k) = \prod_i P(X_i = x_i \mid C = c_k)$$

If an email message is considered to be a set of attributes (i.e., words), then using a Bayesian network, we can calculate the probability that a message belongs to a specific class, namely, legitimate messages or spam.

3.2 Experimental Bayesian Filter

The experimental Bayesian filter is a content-based approach. This attribute gives this approach an advantage over other approaches. Spammers cannot modify the content to deceive the filters, as content is the only reason to send spam in the first place. Content in this case includes headers and the message itself.
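As a rough illustration of Bayes' theorem combined with the conditional-independence assumption, the following sketch computes P(C = spam | X = x) for binary token attributes. The function name, the dictionary layout, and the default prior of 0.5 are assumptions made for this example only; the sums of logarithms are used purely for numerical stability and are equivalent to the products in the formulas.

```python
import math


def posterior_spam(present, p_token_given_spam, p_token_given_legit, p_spam=0.5):
    """Return P(C = spam | X = x) under the naive independence assumption.

    present: dict mapping token -> 0 or 1 (the value x_i of attribute X_i)
    p_token_given_spam / p_token_given_legit: dict mapping token ->
        P(X_i = 1 | C = class); values must lie strictly between 0 and 1
    p_spam: prior P(C = spam); 0.5 is an assumed default
    """
    log_spam = math.log(p_spam)
    log_legit = math.log(1.0 - p_spam)
    for token, x_i in present.items():
        ps = p_token_given_spam[token]
        pl = p_token_given_legit[token]
        # P(X_i = x_i | C) is p when the token is present, 1 - p otherwise.
        log_spam += math.log(ps if x_i else 1.0 - ps)
        log_legit += math.log(pl if x_i else 1.0 - pl)
    # Bayes' theorem: the denominator P(X = x) sums over both classes.
    m = max(log_spam, log_legit)
    numerator = math.exp(log_spam - m)
    return numerator / (numerator + math.exp(log_legit - m))


# Hypothetical conditional probabilities for two tokens.
p_spam_tbl = {"free": 0.8, "meeting": 0.1}
p_legit_tbl = {"free": 0.2, "meeting": 0.6}
print(posterior_spam({"free": 1, "meeting": 0}, p_spam_tbl, p_legit_tbl))
```

The probabilities in the two tables are invented for the example; in practice they would be estimated from the training corpus.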

First, the filter should be trained. A considerable number of good mails and spam would be required to train the filter. Two tables would be maintained, one each for legitimate email and spam. Let us call them the good table and the bad table, respectively. The good table would contain the tokens that occur in the good emails, along with their number of occurrences. Similarly, the tokens of bad emails would be maintained in the bad table. Based on these two tables, another table will be built using the Bayesian formula of probability [4, 5]:

$$P(\text{bad} \mid \text{token}) = \frac{A}{A + B}, \qquad A = P(\text{token} \mid \text{bad}) \cdot P(\text{bad}), \qquad B = P(\text{token} \mid \text{good}) \cdot P(\text{good})$$

P(token | bad) = probability of the token given that the email is spam.
P(token | good) = probability of the token given that the email is good.
P(bad | token) = probability of an email being spam given that the specific token is present.

Let us call this table the spam probability table. This table will contain all tokens that occur in all mails, along with the probability that a mail containing that token is spam.

Ideally, the probability of a mail being spam should be calculated given the presence of the combination of tokens in the email. But this probability is difficult to calculate, as the number of tokens is huge, giving rise to a lot of

combinations. To make the matter simpler, we will assume that the tokens are independent of each other. Thus, the probability for an email is merely the combined probability of its tokens. Hence, this implementation of the Bayesian formula is known as the naïve Bayesian rule. For every new email, a fixed number of effective tokens would be collected to calculate the combined probability. This number can vary from 5 to 25 depending on the success of the filter on one's personal messages. For two tokens a and b with spam probabilities p(a) and p(b):

$$A = p(a) \cdot p(b), \qquad B = (1 - p(a)) \cdot (1 - p(b)), \qquad \text{Score} = \frac{A}{A + B}$$

If the score rises above a threshold value, the email would be declared spam; otherwise, it would be declared a good email. The selection of only 5 tokens is one of the filtering techniques used in the case of Bayesian filters. Effective tokens are those whose probabilities differ the most, on either side, from the threshold value. These tokens are either significantly good tokens or significantly bad tokens, and they are responsible for deciding the overall status of the message.

The challenges present are speed, efficiency, database size, and the need for training data. The larger the set of tokens, the greater the size of the database and the longer the training time. So, there is a need to consider only those tokens that make an impact in deciding the status of an email. Since training and classification will

occur during the same phase of time, special care must be taken to make both operations as independent as possible. If the token is present only in the good table, its probability in the spam probability table would be recorded as 0.1. If the token is present only in the bad table, its probability in the spam probability table would be recorded as [...].

3.3 Cost Evaluation Measures

A false positive is mistakenly classifying a legitimate email as spam, and a false negative is mistakenly classifying spam as a legitimate email. The cost of a false positive is much higher than that of a false negative. The existence of false positives destroys the faith of the user in his spam filter because users tend to delete spam from a bulk folder without reading it, and deleting legitimate messages (due to spam filters) is unacceptable. In that case, it is acceptable to allow some false negatives rather than having any false positives.

Let L→S be the false positive error type and S→L be the false negative error type. Assuming that L→S is λ times costlier than S→L, we classify a message as spam if:

$$\frac{P(C = \text{spam} \mid X = x)}{P(C = \text{legitimate} \mid X = x)} > \lambda$$

In our case, we take the independence assumption of the naïve Bayesian filter to hold. Since there are only two classes, P(C = spam | X = x) = 1 − P(C = legitimate | X = x), which leads to the following criterion:

$$P(C = \text{spam} \mid X = x) > t, \quad \text{where } t \text{ is the threshold value}$$

Thus

$$t = \frac{\lambda}{1 + \lambda}, \quad \text{as } \lambda = \frac{t}{1 - t}$$

Depending on the action taken on a spam folder, the threshold value can be altered. If spam messages are deleted directly once they are classified, then t is held as high as 0.999 (λ = 999), i.e., blocking a legitimate message is as bad as letting 999 spam messages pass the filter. Lower values of λ are acceptable depending on the different configurations made available for the spam folder. If the configuration is set up to send the mail back to the sender, asking him to resend it to a private unfiltered address of the recipient, then λ = 9 (t = 0.9) seems reasonable. Even λ = 1 (t = 0.5) is acceptable if the recipient happens to go through every email in the bulk folder before manually deleting them.

Two measures can be used in this context to assess the performance of a filter, namely, spam precision and spam recall. Let n(L→S) and n(S→L) be the numbers of L→S and S→L errors, and let n(L→L) and n(S→S) count the correctly treated legitimate and spam messages, respectively. Spam recall (SR) and spam precision (SP) are defined as follows:

$$SR = \frac{n_{S \to S}}{n_{S \to L} + n_{S \to S}}, \qquad SP = \frac{n_{S \to S}}{n_{L \to S} + n_{S \to S}}$$

3.4 Independence Factor of a Bayesian Network

Using a Bayesian network, we can model the complex dependencies between features to infer the solution class. As the number of features increases, it becomes increasingly difficult for a message to be classified with all its dependencies. As a result, spam filters

implement a naïve Bayesian concept wherein features are assumed to be independent of each other. This assumption is balanced by setting the threshold to a higher value.

$$P(X = x \mid C = c_k) = \prod_i P(X_i = x_i \mid C = c_k)$$

A naïve Bayesian model is the most restrictive form in the feature dependence spectrum. Research has been done on the performance of spam filters when some degree of dependence between features is allowed. This study can be formalized by introducing the notion of k-dependence Bayesian classifiers. A k-dependence Bayesian classifier is a Bayesian network wherein each feature is allowed to have a maximum of k parents. Based on this definition, we can say that a naïve Bayesian filter is a 0-dependence Bayesian classifier. We can also state that an ideal Bayesian filter (i.e., a full Bayesian filter with no independence assumptions) is an (N−1)-dependence Bayesian classifier, where N is the number of domain features. By varying the value of k, one can move step by step through the feature dependence spectrum and analyze the performance of the spam filter at every step.

It is also worth noting that as k grows, there are more conditioning variables for the same amount of data. This implies a larger probability space to estimate from the same data, causing inaccuracy in the probability estimates and leading to an overall decrease in performance. This performance problem has been observed in many domains while going from k=2 to k=3.
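Returning to the cost evaluation measures described earlier in this chapter, the relationship between λ and the threshold t, together with spam recall and spam precision, can be sketched as follows. The function and parameter names are hypothetical, and the counts in the usage lines are made-up numbers, not results from this report.

```python
def threshold_from_lambda(lam: float) -> float:
    """t = lambda / (1 + lambda)."""
    return lam / (1.0 + lam)


def is_spam(p_spam_given_x: float, lam: float) -> bool:
    """Classify as spam when P(C = spam | X = x) exceeds the threshold t."""
    return p_spam_given_x > threshold_from_lambda(lam)


def spam_recall(n_ss: int, n_sl: int) -> float:
    """SR = n(S->S) / (n(S->L) + n(S->S))."""
    return n_ss / (n_sl + n_ss)


def spam_precision(n_ss: int, n_ls: int) -> float:
    """SP = n(S->S) / (n(L->S) + n(S->S))."""
    return n_ss / (n_ls + n_ss)


# The three cost scenarios discussed in the text.
for lam in (999, 9, 1):
    print(lam, threshold_from_lambda(lam))

# Illustrative counts only: 90 spam caught, 10 spam missed, 30 legitimate blocked.
print(spam_recall(90, 10), spam_precision(90, 30))
```

A message scored at 0.95 would be treated as spam when λ = 9 but as legitimate when λ = 999, which is the intended effect of a higher cost for false positives.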

CHAPTER 4
CLASSIFICATION TECHNIQUES IN A NAÏVE BAYESIAN FILTER

Once the naïve Bayesian filter has been trained using huge datasets of spam and non-spam messages, it is ready to perform its basic function of filtering, i.e., classifying new, incoming, unseen messages. Currently, there are many classification techniques used with the naïve Bayesian filters available on the market. We discuss four significant techniques in detail in this chapter.

4.1 Use of All Tokens

This technique uses all tokens from a new email for classification. As each token is associated with a probability that determines the chances of the email being spam, the tokens from each new email are used to calculate a combined probability that assigns a final score to the email. A new token in an email (i.e., one with no record in the database) would be assigned a probability of 0.4. This assumption has been implemented in practice and found successful in naïve Bayesian filters. It implies that a new token is considered more likely to be a good token than a part of spam. It also reflects the positive approach adopted by spam filters, since the cost of a false positive is much higher than that of a false negative. However, we turn off this feature for the purposes of our evaluation, because we do not want to favor one technique (by taking a positive approach) over others. This global technique makes sense, as we parsed all tokens from the training datasets to build the database used for classification. Hence, it is logical to use the same technique for classification.
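A minimal sketch of the all-tokens technique is given below. It combines the per-token spam probabilities with the score formula A / (A + B) from Chapter 3. The tokenizer pattern, the function names, and the choice to count each distinct token once are assumptions made for this illustration; the default of 0.4 for unseen tokens follows the text, and passing `unknown=None` skips unseen tokens instead, which is one possible reading of turning that feature off.

```python
import re

UNKNOWN_TOKEN_PROBABILITY = 0.4  # value the text assigns to unseen tokens


def combined_probability(probabilities):
    """Score = A / (A + B), with A = prod(p) and B = prod(1 - p).

    Probabilities are assumed to lie strictly between 0 and 1.
    """
    a = 1.0
    b = 1.0
    for p in probabilities:
        a *= p
        b *= 1.0 - p
    return a / (a + b)


def score_all_tokens(message, spam_probability_table,
                     unknown=UNKNOWN_TOKEN_PROBABILITY):
    """Score a message using every distinct token it contains."""
    # Assumed tokenizer: lower-cased runs of letters, digits, $, ' and -.
    tokens = set(re.findall(r"[a-z0-9$'-]+", message.lower()))
    probabilities = []
    for token in tokens:
        if token in spam_probability_table:
            probabilities.append(spam_probability_table[token])
        elif unknown is not None:
            probabilities.append(unknown)
    return combined_probability(probabilities)


# Hypothetical spam probability table.
table = {"rich": 0.95, "instantly": 0.9, "meeting": 0.05}
print(score_all_tokens("Get rich instantly", table))
```

For very long messages the plain products can underflow to zero; a practical implementation would likely work with sums of logarithms instead.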

It should be noted that the classification phase is critical due to the heavy cost of a false positive, as compared to the training phase, wherein we know exactly whether an email is spam or not. This technique might be deceived by an email in which there is a long story of how a person got rich instantly, followed by a link to a spam site. Such emails would contain a large number of good tokens compared to typical spam. There is a high possibility that such emails would deceive spam filters and be categorized as good email. But it is equally true that spammers avoid writing a long story, as it is very likely that readers would rather delete than read a long article from some unknown source. Thus, the all-tokens method is found to be effective in practical filters. For example, Bill Yerazunis has used this technique in his Controllable Regex Mutilator (CRM114).

4.2 Use of a Fixed Number of Tokens

The fixed-number-of-tokens technique, successfully implemented by Paul Graham, takes only a fixed number of tokens from a new email into consideration for assigning a final score to it. The number can vary from 5 to 20 to 25, but these tokens are assumed to be the most effective ones in the given email. An effective token is one whose probability deviates the most from 0.5 on either side, i.e., it can be a good token or a bad one. The combined probability of these tokens assigns a final score to the given new email. In that way, the most effective tokens are emphasized for the task. The technique directly targets those words that are found most of the time in either legitimate emails or spam. As a result, the final score would most probably end up near 1 if the

28 2 is a spam or near 0, otherwise. Thus, this technique eliminates the doubt of classification if the final score ends up near 0.5. This method of effectiveness was proposed by Sahami et al. who calculated its effectiveness with the help of the mathematical formula of mutual information. It is recommended that the same token should not be counted more than once while calculating a final score. In that way, the filter makes an unbiased decision with no interference from any specific token even if it had occurred a few times in the message. The number of tokens (5/20/25) is a personal decision, based on the success of the spam filter on personal s. If the number of tokens in a new happens to be less than a fixed number, say 0, then the use of all tokens is the logical back-up technique to be used for classification. This technique has some advantages over other techniques. ) To avoid the problem of false positives, the threshold value can be raised to any value near to 0.9 from ) In the case of huge s, the classification would be much faster. 4.3 Use of a Standard Deviation This technique, like the previous technique, considers only the effective tokens. However, it also emphasizes the spam probability of tokens rather than the number of tokens. If a standard deviation (i.e. stddev) is of the value x, then all tokens with a spam probability in the range of 0.5-x to 0.5+x would be discarded. The remaining tokens would be the effective ones used to calculate the combined probability and assign a final score to the new . BogoFilter, a spam filter that is currently available on the market,

29 22 has adopted this approach. The value of the stddev can be varied based on the filter s success on one s personal messages. The value, which is found successful and recommended, is 0.4. Thus, tokens under consideration would be the ones with probabilities ( ) 0. and lower and ( ) 0.9 and higher. The specialty of the technique is that it assigns the score to the independent of its size. Based on the content of an , there might be only ten effective tokens, or sometimes there may be even more than 00. But for every classification, only effective tokens with probabilities 0.9 and above and 0. and lower would be considered. Like the previous technique, the score in this case would be near (if spam) or near to 0, otherwise. Thus, it is less likely that the score would end up near 0.5, and, thus, giving rise to the possibility of false positives. The same token should not be considered more than once to avoid the interference from any specific token if it had occurred a few times in the message. The threshold, like the previous technique, can be raised to 0.9 to reduce the possibilities of false positives. The processing time for classification would vary according to the size of the Use of a Relative Number of Tokens We would like to propose a technique and evaluate it along with other real-world successful techniques. Since the naïve Bayesian filter is trained with the contents of messages, it is logical to apply the same content-based approach for classification as well. In this technique, we select some percentage (say 30 percent) of effective tokens out of the total tokens of an message. These tokens will be used to calculate the combined

30 23 probability and assign a final score to the message. The percentage value can be tuned, based on the success of the filter on personal messages. This approach is the combination of both the above techniques: the use of a fixed number of tokens and the use of a standard deviation. It values both the effectiveness and number of tokens while classifying a message. So, if an contains 00 tokens, then the 30 most effective tokens among them will be used for classification. There is a high possibility that most of these 30 odd tokens would fall in the stddev of 0.4. In that way, we utilize the advantages of both the above techniques. As it is a content-based approach, there are chances that the final score of an might fall near 0.5. To avoid false possibilities, the threshold value can be raised to a higher value. The process time for classification depends on the size of the message.
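The three selective techniques of Sections 4.2 through 4.4 differ only in how they pick the effective tokens, which the following Python sketch illustrates. It is a hedged illustration, not the report's implementation: the function names, the default parameter values, the small floating-point tolerance, and the choice to compute the percentage over distinct known tokens are all assumptions. Each token is counted once, as the text recommends.

```python
def ranked_tokens(tokens, spam_prob):
    """Return (token, p) pairs for distinct known tokens, most effective first.

    Effectiveness is the distance of the spam probability from the neutral 0.5.
    """
    distinct = {tok: spam_prob[tok] for tok in tokens if tok in spam_prob}
    return sorted(distinct.items(), key=lambda kv: abs(kv[1] - 0.5), reverse=True)


def select_fixed(tokens, spam_prob, n=15):
    """Fixed number of tokens: keep the n most effective tokens.

    When the e-mail has fewer than n tokens, the slice returns all of them,
    which matches the all-tokens back-up described in the text.
    """
    return ranked_tokens(tokens, spam_prob)[:n]


def select_by_deviation(tokens, spam_prob, x=0.4):
    """Standard deviation: drop tokens with p strictly inside (0.5-x, 0.5+x)."""
    eps = 1e-12  # assumed tolerance so that p = 0.1 and p = 0.9 are kept
    return [(t, p) for t, p in ranked_tokens(tokens, spam_prob)
            if abs(p - 0.5) >= x - eps]


def select_relative(tokens, spam_prob, percent=30):
    """Relative number: keep the top `percent` percent of the ranked tokens."""
    ranked = ranked_tokens(tokens, spam_prob)
    if not ranked:
        return []
    k = max(1, (percent * len(ranked)) // 100)  # integer arithmetic, at least 1
    return ranked[:k]
```

The probabilities of the selected tokens would then be passed to whatever combining rule the filter uses. Ranking once and slicing keeps the three selectors directly comparable, since they see exactly the same ordering of tokens.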

CHAPTER 5
EXPERIMENTS

Our experiment comprises two phases: the training phase and the classification phase. In the training phase, the filter is trained using a known corpus of spam and good e-mails. The tokens appearing in each corpus and their total occurrences are maintained in a database. Based on its occurrences in the sets of spam and good e-mails, each token is assigned a probability that an e-mail is spam given the token's presence. Then, using this knowledge of tokens, the filter classifies every new incoming mail in the classification phase. Once the status of a new mail is confirmed, all its tokens are also recorded, thus updating the database. This self-learning function of our filter makes it unique among the other available spam filters. Even if the filter misclassifies a message, the user can rectify it, and the spam filter updates its database accordingly. Thus, the filter learns from its mistakes, too.

We used 250 legitimate messages and 350 spam messages. The legitimate messages belong to my student webmail account assigned to me by Utah State University. The spam messages were collected from an archive provided by Nik Martin, available at the site hosted by Paul Graham [6]. The spam was collected over the last four to five years. The proportion of spam to legitimate messages is quite large, making it more likely that legitimate messages are misclassified as spam. This makes the situation more challenging, as the cost of false positives is much higher than that of false negatives. We feel that by minimizing the false positives in such a situation, we have achieved an efficient Bayesian spam filter. Moreover, by recording tokens from such a large amount of spam, we have covered almost all the topics for spam and are in a good position to classify new incoming mails.

Each word in each message is considered to be a token. The whole message, including the header, is parsed for tokens. The token separator is a blank space. Words quoted in double and single quotes, numbers, and all words separated by blank spaces are also considered tokens. The tokens under study and used for classification are [...]. Since we are not using any type of lemmatizer, we consider different forms of the same word to be different tokens. For example, "run", "running" and "runner" would all be considered different tokens even though they stem from the single word "run". There are studies [7] that show the positive effect of a lemmatizer on a filter's performance. The implementation of a lemmatizer is one of the topics of our future study (see Chapter 8).

Only the message content is used for classification purposes. Doing so eliminates the interference of tokens present in headers in determining the status of a message. In that way, there is no bias among the classification techniques considered for evaluation, as some techniques consider only a few (or a percentage of) tokens when assigning a final score to the message.

Our evaluation was conducted in the classification phase. We evaluated four effective filtering techniques of the Bayesian spam filter for their classification performance. We evaluated these techniques using cost-sensitive measures, as we believe that the cost of a false positive is much higher than that of a false negative. Eighty new incoming messages were tested in two batches (first batch: 50; second batch: 30) to get significant evaluation results. These tested messages belong to the same account (i.e. my webmail account) previously used in the training phase. Thus, we avoided any type of erratic behavior from the anti-spam filter. The effective configuration of each technique was used for evaluation purposes. In the standard deviation technique, the value of the standard deviation was set to 0.4, and in the percentage technique, 30 percent of the total tokens were used to calculate the final score. The tabulated results and related plotted graphs are explained in the next chapter.
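The training phase described in this chapter (blank-separated tokens, per-corpus occurrence counts, and a per-token spam probability) can be sketched as follows. This is a minimal sketch under stated assumptions: the chapter does not give the probability formula, so the relative-frequency estimate below, the clamp to [0.01, 0.99], and all function names are hypothetical choices made for illustration, and the filter evaluated in the report may compute its probabilities differently.

```python
from collections import Counter


def tokenize(message):
    """Split a message on whitespace.

    No lemmatizer is applied, so "run", "running" and "runner" remain
    three different tokens, as in the text.
    """
    return message.split()


def train(spam_messages, good_messages):
    """Count token occurrences separately for the spam and the good corpus."""
    spam_counts, good_counts = Counter(), Counter()
    for msg in spam_messages:
        spam_counts.update(tokenize(msg))
    for msg in good_messages:
        good_counts.update(tokenize(msg))
    return spam_counts, good_counts


def token_spam_probability(token, spam_counts, good_counts, n_spam, n_good):
    """Assumed estimate of P(spam | token) from relative frequencies.

    n_spam and n_good are the numbers of messages in each corpus.
    Returns None for a token that appears in neither corpus.
    """
    b = spam_counts[token] / n_spam if n_spam else 0.0
    g = good_counts[token] / n_good if n_good else 0.0
    if b + g == 0.0:
        return None
    # Clamp so that no single token yields a certain verdict (assumption).
    return min(max(b / (b + g), 0.01), 0.99)
```

Self-learning then amounts to calling `update` on the appropriate counter once the status of a new message is confirmed, and recomputing the probabilities of the affected tokens.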

CHAPTER 6
COST-SENSITIVE EVALUATION

The evaluation measures frequently used in classification are accuracy (Acc) and the error rate (Err = 1 - Acc). Accuracy is the number of correct classifications, i.e. spam correctly classified as spam and legitimate messages classified as legitimate, out of the total number of messages. The error rate is the ratio of the sum of false positives and false negatives to the total number of messages.

\[ Acc = \frac{n_{L \to L} + n_{S \to S}}{N_L + N_S} \qquad Err = \frac{n_{L \to S} + n_{S \to L}}{N_L + N_S} \]

where $N_L$ and $N_S$ are the numbers of legitimate and spam messages, respectively, and $n_{X \to Y}$ is the number of messages of class X that the filter classified as class Y.

In our cost-sensitive evaluation, we assume that the cost of a false positive is much higher than that of a false negative. However, the above formulae for accuracy and error rate do not take this cost-sensitive factor into account. Let us assume that a false positive is λ times more costly than a false negative, the implication being that we treat each legitimate message as being worth λ messages. So, if a legitimate message is misclassified, it counts as λ errors, and if it is classified correctly, it counts as λ successes. This assumption can be formulated as a weighted accuracy (WAcc) and a weighted error rate (WErr = 1 - WAcc):

\[ WAcc = \frac{\lambda n_{L \to L} + n_{S \to S}}{\lambda N_L + N_S} \qquad WErr = \frac{\lambda n_{L \to S} + n_{S \to L}}{\lambda N_L + N_S} \]
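The measures defined above translate directly into code. The sketch below is illustrative only: the function and parameter names are hypothetical, and the counts in the usage note are made-up values chosen to exercise the formulae, not results quoted from the report.

```python
def accuracy(n_ll, n_ss, n_ls, n_sl):
    """Acc = (n_LL + n_SS) / (N_L + N_S)."""
    total = n_ll + n_ls + n_ss + n_sl  # N_L + N_S
    return (n_ll + n_ss) / total


def weighted_accuracy(n_ll, n_ss, n_ls, n_sl, lam):
    """WAcc: every legitimate message is counted as lam messages."""
    n_legit = n_ll + n_ls  # N_L
    n_spam = n_ss + n_sl   # N_S
    return (lam * n_ll + n_ss) / (lam * n_legit + n_spam)


def weighted_error(n_ll, n_ss, n_ls, n_sl, lam):
    """WErr = (lam * n_LS + n_SL) / (lam * N_L + N_S) = 1 - WAcc."""
    n_legit = n_ll + n_ls
    n_spam = n_ss + n_sl
    return (lam * n_ls + n_sl) / (lam * n_legit + n_spam)
```

With hypothetical counts of 14 legitimate messages kept, 11 legitimate messages blocked, 25 spam messages caught and none missed, the plain accuracy is 39/50 = 0.78, while the weighted accuracy at λ = 9 drops to 151/250 = 0.604, because each of the 11 false positives now counts nine times.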

To get a better idea of the filter's performance in terms of accuracy and error rate, we must compare these measures with a baseline approach [7]. In the baseline approach, we assume that no filter is active, i.e. all spam passes through, and legitimate messages are never blocked. The weighted accuracy and error rate of the baseline are:

\[ WAcc^b = \frac{\lambda N_L}{\lambda N_L + N_S} \qquad WErr^b = \frac{N_S}{\lambda N_L + N_S} \]

We calculate the total cost ratio (TCR) to compare with the baseline approach [7]:

\[ TCR = \frac{WErr^b}{WErr} = \frac{N_S}{\lambda n_{L \to S} + n_{S \to L}} \]

The higher the value of TCR, the better the performance. With TCR < 1, the baseline approach is the better option, implying that the absence of a filter gives better results than the use of the filter. If cost is relative to wasted time, then TCR compares the time wasted in manually deleting all spam messages (when no filter is used) with the sum of the time wasted in manually deleting the spam messages misclassified as legitimate ($n_{S \to L}$) and the time wasted in recovering the legitimate messages mistakenly classified as spam ($\lambda n_{L \to S}$).

Table 6.1 lists the false positives, false negatives, and correct classifications of all four techniques with different configurations for the threshold. Table 6.2 lists spam recall, spam precision, weighted accuracy, baseline weighted accuracy, and total cost ratio (TCR) for the same. TCR is calculated for all techniques for different values of the threshold. In a cost-sensitive evaluation, TCR can be used as a scale of performance. Table 6.1 shows the fall in the number of false positives and the rise in the number of false negatives as the threshold is raised for all four techniques. Table 6.2 indicates that the fixed token technique outperforms the others for every value of λ. Unlike the other techniques, the fixed token approach gives excellent results for λ = 999. The all token approach is the worst of them all. Our percentage approach performs better than the standard deviation approach for λ = 1 and λ = 999.

Based on both tables, we can say that by lowering the threshold value to 0.5, we have risked an increase in the number of false positives. But at the same time, the evaluation has shown an increase in TCR values, indicating that an increase in false positives does not prove costly to us. However, in practice, no user would like to use a threshold of 0.5, as it implies that he has to go through every spam mail before deleting it. The filter would just be helping the user locate the spam. An ideal filter would be one wherein spam messages are deleted without the supervision of the user and no legitimate message is deleted in the process.

One can observe that $n_{S \to L}$ is much smaller than $n_{L \to S}$. This is due to the fact that the number of spam messages used in the training phase is far greater than the number of legitimate messages. Our filter, being a self-learner, would improve its performance in the future and keep $n_{S \to L}$ to a minimum. We believe that, after a period of time, our filter would reach its peak performance and remain constant thereafter. The ideal filter should give a spam precision of 100 percent, a spam recall of 100 percent, and a positive value of TCR for all values of λ.
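The baseline comparison can be made concrete with a short Python sketch. The TCR function follows the formula given in this chapter; spam recall and spam precision are not defined in the text, so the standard definitions (the share of spam that is caught, and the share of blocked messages that really are spam) are assumed here. All function names, and the counts used in the usage note, are hypothetical.

```python
import math


def total_cost_ratio(n_spam, n_ls, n_sl, lam):
    """TCR = N_S / (lam * n_LS + n_SL).

    Returns infinity when the filter makes no errors at all.
    """
    cost = lam * n_ls + n_sl
    return math.inf if cost == 0 else n_spam / cost


def spam_recall(n_ss, n_sl):
    """Assumed definition: share of all spam that the filter caught."""
    total = n_ss + n_sl
    return n_ss / total if total else 0.0


def spam_precision(n_ss, n_ls):
    """Assumed definition: share of blocked messages that really were spam."""
    blocked = n_ss + n_ls
    return n_ss / blocked if blocked else 0.0
```

As a usage example with made-up counts, a filter that blocks 11 legitimate messages and misses no spam out of 25 has a TCR of 25 / (9 * 11) at λ = 9, about 0.25, so under that cost assumption it does worse than using no filter, even though its spam recall is 100 percent.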

Table 6.1. The results (false positives, false negatives and correct classifications) of four filtering techniques for different values of λ.

[Table body: rows a) All Token tech, b) Fixed Token tech, c) Std Deviation tech and d) Percentage tech for each of three values of λ; columns λ, n L->S, n S->L, n L->L, n S->S. The numeric entries are not recoverable.]

Table 6.2. The results (TCR) of four filtering techniques for different values of λ.

Filter Technique        λ     Spam Recall   Spam Precision   Weighted Accuracy   Baseline W.Acc.
a) All Token tech       1     100%          69.44%           78%                 50%
b) Fixed Token tech     1     100%          80.65%           88%                 50%
c) Std Deviation tech   1     96%           72.73%           80%                 50%
d) Percentage tech      1     100%          73.53%           82%                 50%
a) All Token tech       9     100%          69.44%           60.4%               90%
b) Fixed Token tech     9     100%          83.33%           82%                 90%
c) Std Deviation tech   9     96%           80%              78%                 90%
d) Percentage tech      9     100%          73.53%           67.6%               90%
a) All Token tech       999   88%           70.97%           64.02%              99.9%
b) Fixed Token tech     999   80%           100%             99.98%              99.9%
c) Std Deviation tech   999   84%           84%              84%                 99.9%
d) Percentage tech      999   76%           86.36%           87.99%              99.9%

[The entries of the TCR column are not recoverable.]

An evaluation of the fixed token approach was conducted with 5 tokens. To get an optimal anti-spam filter, we further evaluated the fixed token approach with a different number of fixed tokens, i.e. 5, 15, 20, and 25. Tables 6.3 and 6.4 list their results.

Table 6.3. The results (false positives, false negatives and correct classifications) of four configurations (5, 15, 20 and 25) of the fixed token approach for different values of λ.

[Table body: rows a) Fixed - 5, b) Fixed - 15, c) Fixed - 20 and d) Fixed - 25 for each of three values of λ; columns λ, n L->S, n S->L, n L->L, n S->S. The numeric entries are not recoverable.]

Table 6.4. The results (TCR) of four configurations (5, 15, 20 and 25) of the fixed token approach for different values of λ.

Filter Technique        λ     Spam Recall   Spam Precision   Weighted Accuracy   Baseline W.Acc.
(Fixed Token tech)
a) Fixed - 5            1     96%           85.71%           90%                 50%
b) Fixed - 15           1     96%           77.42%           84%                 50%
c) Fixed - 20           1     100%          75.76%           84%                 50%
d) Fixed - 25           1     100%          71.43%           80%                 50%
a) Fixed - 5            9     92%           85.19%           84.8%               90%
b) Fixed - 15           9     96%           80%              78%                 90%
c) Fixed - 20           9     100%          75.76%           71.2%               90%
d) Fixed - 25           9     100%          71.43%           64%                 90%
a) Fixed - 5            999   88%           95.65%           95.99%              99.9%
b) Fixed - 15           999   96%           85.71%           84.01%              99.9%
c) Fixed - 20           999   100%          78.13%           72.03%              99.9%
d) Fixed - 25           999   100%          73.53%           64.04%              99.9%

[The entries of the TCR column are not recoverable.]

The values of TCR and weighted accuracy show the better performance of 5 tokens over the others for each value of λ. The performance degrades as we consider a higher number of tokens for the classification. However, even the effective configuration cannot be used as a stand-alone first-pass filter for λ = 999 and λ = 9. It needs the help of other techniques, such as blacklists and whitelists, for effective spam filtering.

To find an optimal number of tokens, we further evaluated the approach by covering the range of 5 to 15 tokens. Tables 6.5 and 6.6 list the results. The results for 5, 7, and 10 tokens were identical to each other and remained constant for the different values of λ. However, the results for 3 fixed tokens were the worst, and the results for 12 fixed tokens were close to those for 5, 7, and 10 fixed tokens. It can be said that, in the case of the fixed token approach, the filter reaches optimal performance in the range of 5 to 12 tokens and degrades thereafter. This observation is confirmed by the plotted graphs (Figures 1 through 3). They show the maximum peak (i.e. TCR value) in the range of 5 to 12.

Table 6.5. The results (false positives, false negatives and correct classifications) of four configurations (3, 7, 10 and 12) of the fixed token approach for different values of λ.

[Table body: rows a) Fixed - 3, b) Fixed - 7, c) Fixed - 10 and d) Fixed - 12 for each of three values of λ; columns λ, n L->S, n S->L, n L->L, n S->S. The numeric entries are not recoverable.]


More information

Filtering Spam Using Search Engines

Filtering Spam Using Search Engines Filtering Spam Using Search Engines Oleg Kolesnikov, Wenke Lee, and Richard Lipton ok,wenke,rjl @cc.gatech.edu College of Computing Georgia Institute of Technology Atlanta, GA 30332 Abstract Spam filtering

More information

Tufts Technology Services (TTS) Proofpoint Frequently Asked Questions (FAQ)

Tufts Technology Services (TTS) Proofpoint Frequently Asked Questions (FAQ) Tufts Technology Services (TTS) Proofpoint Frequently Asked Questions (FAQ) What is Proofpoint?... 2 What is an End User Digest?... 2 In my End User Digest I see an email that is not spam. What are my

More information

6367(Print), ISSN 0976 6375(Online) & TECHNOLOGY Volume 4, Issue 1, (IJCET) January- February (2013), IAEME

6367(Print), ISSN 0976 6375(Online) & TECHNOLOGY Volume 4, Issue 1, (IJCET) January- February (2013), IAEME INTERNATIONAL International Journal of Computer JOURNAL Engineering OF COMPUTER and Technology ENGINEERING (IJCET), ISSN 0976-6367(Print), ISSN 0976 6375(Online) & TECHNOLOGY Volume 4, Issue 1, (IJCET)

More information

Track-able Bulk Management System

Track-able Bulk Management System Track-able Bulk Management System Table of Contents TABLE OF CONTENTS... 2 1. INTRODUCTION... 3 2. WHAT IS TRACK-ABLE BULK MANAGEMENT SYSTEM?... 3 3. TRACK-ABLE BULK MANAGEMENT SYSTEM... 4 4. CONCLUSION...13

More information

Objective This howto demonstrates and explains the different mechanisms for fending off unwanted spam e-mail.

Objective This howto demonstrates and explains the different mechanisms for fending off unwanted spam e-mail. Collax Spam Filter Howto This howto describes the configuration of the spam filter on a Collax server. Requirements Collax Business Server Collax Groupware Suite Collax Security Gateway Collax Platform

More information

PROOFPOINT - EMAIL SPAM FILTER

PROOFPOINT - EMAIL SPAM FILTER 416 Morrill Hall of Agriculture Hall Michigan State University 517-355-3776 http://support.anr.msu.edu support@anr.msu.edu PROOFPOINT - EMAIL SPAM FILTER Contents PROOFPOINT - EMAIL SPAM FILTER... 1 INTRODUCTION...

More information

Solutions IT Ltd Virus and Antispam filtering solutions 01324 877183 Info@solutions-it.co.uk

Solutions IT Ltd Virus and Antispam filtering solutions 01324 877183 Info@solutions-it.co.uk Contents Reduce Spam & Viruses... 2 Start a free 14 day free trial to separate the wheat from the chaff... 2 Emails with Viruses... 2 Spam Bourne Emails... 3 Legitimate Emails... 3 Filtering Options...

More information

Do you need to... Do you need to...

Do you need to... Do you need to... TM Guards your Email. Kills Spam and Viruses. Do you need to... Do you need to... Scan your e-mail traffic for Viruses? Scan your e-mail traffic for Viruses? Reduce time wasted dealing with Spam? Reduce

More information

Why Content Filters Can t Eradicate spam

Why Content Filters Can t Eradicate spam WHITEPAPER Why Content Filters Can t Eradicate spam About Mimecast Mimecast () delivers cloud-based email management for Microsoft Exchange, including archiving, continuity and security. By unifying disparate

More information
