AN EVALUATION OF FILTERING TECHNIQUES IN A NAÏVE BAYESIAN ANTI-SPAM FILTER. Vikas P. Deshpande


AN EVALUATION OF FILTERING TECHNIQUES IN A NAÏVE BAYESIAN ANTI-SPAM FILTER
by
Vikas P. Deshpande

A report submitted in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in Computer Science

Approved:
Dr. Robert F. Erbacher, Major Professor
Dr. Nicholas Flann, Committee Member
Dr. Hugo de Garis, Committee Member

UTAH STATE UNIVERSITY
Logan, Utah
2004

ABSTRACT

An Evaluation of Filtering Techniques in a Naïve Bayesian Anti-spam Filter
by
Vikas P. Deshpande, Master of Science
Utah State University, 2004

Major Advisor: Dr. Robert F. Erbacher
Department: Computer Science

An efficient anti-spam filter that blocks all unsolicited messages (i.e., spam) without blocking any legitimate messages is a growing need. To address this problem, this report takes a statistically-based approach, employing a Bayesian anti-spam filter, because it is content-based and self-learning (adaptive) in nature. We train the filter using a large corpus of legitimate messages and spam, and we test the filter using new incoming personal messages. We evaluate four filtering techniques available for a Bayesian filter. We look at the effectiveness of each technique, and we evaluate different configurations of the most effective one for different threshold values in order to find an optimal anti-spam filter configuration. Based on cost-sensitive measures, we conclude that additional safety precautions are needed before a Bayesian anti-spam filter can be put into practice.

(74 pages)

ACKNOWLEDGMENTS

I would like to thank my major professor, Dr. Robert F. Erbacher, for his guidance. I would also like to thank Dr. Nick Flann and Dr. Hugo de Garis for being members of my committee. I am grateful to my parents for their encouragement and moral support.

Vikas P. Deshpande

CONTENTS

ABSTRACT
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTER
1 INTRODUCTION
2 ANTI-SPAM FILTERS
  2.1 Rule-based Approach
  2.2 Blacklist Approach
  2.3 Whitelists / Effective Filters Approach
  2.4 Signature-based Approach
  2.5 Filters Fight Back
3 NAÏVE BAYESIAN BASED CLASSIFICATION
  3.1 Bayesian Networks
  3.2 Experimental Bayesian Filter
  3.3 Cost Evaluation Measures
  3.4 Independence Factor of Bayesian Networks
4 FILTERING TECHNIQUES IN NAÏVE BAYESIAN FILTER
  4.1 Use of All Tokens
  4.2 Use of a Fixed Number of Tokens
  4.3 Use of Standard Deviation
  4.4 Use of a Relative Number of Tokens
5 EXPERIMENTS
6 EVALUATION RESULTS
7 CONCLUSIONS
8 FUTURE WORK

REFERENCES
APPENDIX

LIST OF TABLES

Table
6.1 The results (false positives, false negatives and correct classifications) of four filtering techniques for different values of λ
6.2 The results (TCR) of four filtering techniques for different values of λ
6.3 The results (false positives, false negatives and correct classifications) of four configurations (5, 15, 20 and 25) of the fixed token approach for different values of λ
6.4 The results (TCR) of four configurations (5, 15, 20 and 25) of the fixed token approach for different values of λ
6.5 The results (false positives, false negatives and correct classifications) of four configurations (3, 7, 10 and 12) of the fixed token approach for different values of λ
6.6 The results (TCR) of four configurations (3, 7, 10 and 12) of the fixed token approach for different values of λ

LIST OF FIGURES

Figure
6.1 TCR for effective configurations of the fixed token approach at t = 0.999 (λ = 999)
6.2 TCR for effective configurations of the fixed token approach at t = 0.9 (λ = 9)
6.3 TCR for effective configurations of the fixed token approach at t = 0.5 (λ = 1)

CHAPTER 1
INTRODUCTION

E-mail service is one of the advantages of the Internet. In recent times, however, this service has faced the serious problem of spam. Spam can be defined as unsolicited automated e-mail. It need not be sent solely for commercial purposes; spam can even be used for political and social purposes. Direct marketers exploit the low cost of e-mail service to mass mail their ideas to thousands of recipients. Spam causes frustration for users because these messages take up a lot of their mailbox space. Further, users waste time deleting these e-mails. Spam clutter costs service providers millions of dollars. Spam wastes bandwidth for Internet users on dial-up connections and may involve minors in some illegal businesses (e.g. pay XXX $ to get rich instantly) [4].

The basic problem in eliminating spam lies in differentiating spam from a legitimate message. For example, if person A is looking for a car, and his neighbor, person B (who is planning to sell his own car), happens to know this, then person B might send person A an e-mail offering his own car at some price. This message is unsolicited and commercial, and, thus, it can easily be mistaken for spam. One of the features that differentiate spam from legitimate messages is that spam is mass mailed by automation. Therefore, even legitimate messages can be categorized as spam if blindly mass mailed. However, the message content of spam typically forms a distinct category rarely observed in legitimate messages, making it possible for text classifiers to be used for anti-spam filtering.

A naïve Bayesian anti-spam filter is a text categorization technique based on a machine-learning algorithm. Proposed by Sahami et al. [0], the text categorization technique shows some impressive results on new, unseen incoming messages. The filter requires training that can be provided by a previous set of spam and legitimate messages. It keeps track of each word (also called a token) that occurs only in spam, only in legitimate messages, and in both. Based on these word occurrence statistics, new, incoming, unseen messages are processed and classified accordingly.

There are many filtering techniques available with a naïve Bayesian approach. This report evaluates four effective filtering techniques currently being used in practical naïve Bayesian anti-spam filters. Of these four, our evaluation found the fixed token approach to be the most successful. This report further evaluates the fixed token approach with different configurations to find an optimal approach. The results indicate that a content-based filter based on a naïve Bayesian approach gives peak performance when 5 to 12 tokens are used to process a new message, irrespective of the size of the message. Hence, we believe that our filter can make a positive contribution in first-pass filters.

This report is organized as follows. Chapter 2 discusses five significant approaches to anti-spam filtering along with their advantages and disadvantages. Chapters 3 and 4, respectively, describe in detail the Bayesian classifier and the four filtering techniques required for this study. Our experimental set-up, along with the procedure followed, is presented in Chapter 5. Evaluation results and conclusions are presented in Chapters 6 and 7. In the last chapter, Chapter 8, we describe possible future work to extend our study.

CHAPTER 2
ANTI-SPAM FILTERS

Many different spam-filtering approaches have been tried. Most of these have a degree of effectiveness, but they have not attained global popularity because of their drawbacks. The five most significant spam-filtering approaches are discussed below, along with their strengths and weaknesses.

2.1 Rule-based Approach

As the name suggests, in a rule-based approach, each e-mail is compared with a set of rules to determine whether it is spam or not. A rule set contains rules, each of which is given a weight. Initially, each incoming message has a score of zero. The e-mail is then parsed to detect whether any rule matches. If a rule matches the message, its weight is added to the score of the e-mail. In the end, if the final score is found to be above some threshold value, the e-mail is declared to be spam [0]. Rules are nothing but observations of features that are found more frequently in spam than in legitimate messages. Some examples include a missing sender's address, use of the color red above some threshold value, etc. Further, there are some spam features that remain constant over a period of time, for example, forged headers and auto-executing JavaScript.

Advantages:
This approach can be very effective with a given set of rules. It can achieve 90 to 95 percent efficiency. The filter is easy to install, as it merely requires copying the rule set. It

requires neither training nor any sort of personal tuning. Further, the rule set can be updated by copying an additional set of rules to counter the current trend of spam.

Disadvantages:
The disadvantage of the rule-based approach is that it is rigid. There is no self-learning facility available for the filter. Spammers with knowledge of the rule set can design spam to deceive the method. For example, if there is a rule for classifying a message as spam when the message contains the word "Viagra" more than five times, the spammer can easily circumvent the rule by using the term "V*i*a*g*r*a" instead of "Viagra". Rules cannot be kept secret. The best option is to go through every spam and update the rule set by manually adding newfound rules. Unfortunately, this updating process is never ending, as the spammers continually devise new procedures to deceive the spam filters. This process requires personal effort, time, and some level of expertise, qualities not found in every user. The rule-based approach could be used in an integrated spam filter with some other approach.

In a rule-based approach, decisions as to whether to classify an e-mail as spam or not are binary. This classification process on its own does not give a continuous measure of confidence. Such confidence is critical because the cost of a false positive classification (classifying a legitimate message as spam) is very high. There is a need for a classification scheme based on probability, wherein all messages near a threshold value can be categorized as legitimate to avoid the danger of false positives. The rule-based approach is faster than the use of blacklists, but it is slower than statistically-based approaches. SpamAssassin is the most successful spam filter available in the market that

uses this approach. The ReadMe file of SpamAssassin reports the percentage of cases in which SpamAssassin differentiates correctly between spam and non-spam mail.

2.2 Blacklist Approach

In this approach, servers that are found to be the sources of spam are blacklisted. The e-mails coming from blacklisted servers are marked as spam and deleted at the server level. The blacklist can also be maintained at the personal level.

Advantages:
The blacklist approach is helpful in cases in which servers are compromised and used for sending spam to hundreds of thousands of users. It is a better and cheaper option to use at the ISP level along with some other effective filtering technique. Tools like Razor and Pyzor can be used for this purpose.

Disadvantages:
The criterion for any spam filter is not only efficiency in filtering spam but also doing the job with the minimum number of false positives. Marking a legitimate message as spam is a serious mistake and is much more costly than marking a spam as legitimate. The blacklist approach generates a large number of false positives. This being the case, its generalized approach of shunning a culprit server forever is not a good idea. A legitimate message arriving from a blacklisted server would always be considered spam. MAPS RBL, probably the best-known blacklist, catches only 24 percent of all spam with 34 percent false positives. Moreover, there are many ethical issues involved in blacklisting a server. Probably the worst scenario is blacklisting a server without knowing whether that server is a source of spam or not. Moreover, a spammer is

a moving target. While a spammer might use a compromised computer to send spam, as soon as he learns that this computer has been detected, he can use a different computer until that one is detected as well. This can go on and on. The end result is that while servers are shunned, the spammer still keeps spamming.

The answer to this problem has been the use of Distributed Adaptive Blacklists. Their basic working is to detect a spam message and inform all the recipients of that message (who may run into millions) about its status. Digests of spam are maintained at the server level. So, whenever a new message is received at the MTA, adaptive blacklists are consulted to detect whether the message is spam. There are tools that ensure that messages that are different versions of the same spam do not get identified as legitimate. In addition, maintainers of distributed blacklists create honey-pot addresses, addresses never used for legitimate purposes. The basic disadvantage of this approach is that it generates a considerable number of false negatives. Thus, it is recommended that this approach be used in conjunction with another effective filtering technique.

2.3 Whitelist / Effective Filters Approach

Whitelists contain legitimate addresses. A whitelist filter is configured with an MTA. The messages arriving from any of these addresses are allowed to pass into the recipient's mailbox. The messages with sources that are not whitelisted are considered to be spam. It is difficult to maintain an exhaustive list of all legitimate addresses. A better option would be to share whitelists among friends and relatives. However, this, too, can be an easy route for a spammer to get a big list of legitimate addresses. The challenge-response approach has been integrated with the whitelist approach to avoid such a

problem. A sender who is not whitelisted will receive a challenge from the recipient for authentication purposes. The challenge might contain an image to decipher or a word to recognize and spell. Such a task would be simple for a human, but a machine would not be able to reply to the challenge. Once the sender replies to the challenge, his address would be added to the whitelist, and all his mails would directly reach the recipient's mailbox in the future.

Advantages:
Once all legitimate addresses are recorded, the whitelist approach coupled with the challenge-response approach has a 100 percent efficiency rate. The challenge-response component ensures that spammers cannot reply to millions of challenges and get registered on the whitelists integrated with anti-spam filters. The challenge-response approach requires human intervention for this very purpose. Spammers who try to respond to such challenges expose themselves to users seeking legal remedies against them.

Disadvantages:
The main disadvantage of the challenge-response approach is that it generates a significant number of false positives, the reason being that some legitimate addresses are not listed on the whitelist. There are many reasons for senders not replying to the challenge-response system. Doing so means extra effort for the senders. They may have unreliable ISPs or multiple addresses, or they may not care to reply to the challenge. Such senders would not be whitelisted, and, hence, their mails would be classified as spam. There are also cases in which users receive mails from automatic reply machines, for example for online purchases, online registration, web list sign-ups, etc. Such systems would not be able to reply to challenges.

In addition, if a user were to add an incorrect address or to forget to add an address to his whitelist, he would generate false positives. Further, some people consider the challenge-response system rude.

To avoid false positives, the strict action taken by the above approach can be toned down to a milder one. The e-mails from unknown sources can be placed in some folder (a low-priority mailbox) other than the inbox. This box could then be checked weekly. All unknown senders would receive replies stating that their e-mails would not be read for a week. If they wanted their message to be read immediately, they would have to respond to the challenge sent. Thus, instead of using the whitelist approach as a single tool of defense, it is more effective to use it in conjunction with some other anti-spam tool.

2.4 Signature-based Approach

The signature-based approach compares every new incoming e-mail with the known set of spam. Spam messages are collected from honey pots and deliberately created fake addresses. When any new spam message comes to light, all other services are alerted. The signature-based approach works in this way. Each character in an e-mail carries a weight. The summation of the weights of all characters gives a final score that is used as the signature of that e-mail. Every new message's signature is then compared with the signatures of known spam. If the signatures match, the new e-mail is classified as spam.
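The character-sum signature described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the text does not say what weight each character carries, so the example assumes the weight is the character's Unicode code point, and the names `signature` and `is_known_spam` are hypothetical rather than taken from any real filter.

```python
def signature(message):
    """Toy signature: the sum of the character weights of the message.

    Assumption: the weight of a character is its Unicode code point.
    """
    return sum(ord(ch) for ch in message)


def is_known_spam(message, spam_signatures):
    """Classify a message as spam if its signature matches a known one."""
    return signature(message) in spam_signatures


# Signatures of spam collected earlier (for example, from a honey pot).
known_signatures = {signature("Get rich instantly!")}

print(is_known_spam("Get rich instantly!", known_signatures))     # exact copy
print(is_known_spam("Get rich instantly! xq", known_signatures))  # altered copy
```

Because the signature in this sketch is a plain sum, any added characters change it, so an altered copy of a known spam no longer matches the stored signature.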

Advantages:
The signature-based approach rarely generates false positives. Even the false negatives generated by this approach are few compared to other approaches. BrightMail is a successful spam filter that follows this approach.

Disadvantages:
These filters are easy to defeat. Since they are backward looking, they take action only after they become aware of a spam. By the time the honey pot has attracted a new spam, a signature has been assigned to it, and the updates have been sent and installed at all ISPs, the spammer has already sent millions of spam messages. Even a small change in the e-mails might make the filter useless. Just by adding some random characters to each spam, the signature of each copy will differ from that of the original spam. Thus, all such spam messages will pass for legitimate messages. Research on the logic behind adding random characters to messages is actively being pursued, but, even so, spammers are always ahead of these filters. The efficiency of these filters is found to be 50 to 70 percent. In addition, these filters can only be used at the ISP level as first-pass filters.

2.5 Filters Fight Back

The filters fight back approach is the most aggressive among all the approaches adopted for filtering spam. It employs the policy that attack is the best self-defense. A spam message usually includes URLs for the readers to visit a site. The purpose may be commercial or social. The filters fight back approach works in this way. Once a message is detected as spam, these filters send a number of requests to the sites specified by those URLs. A user can personally configure the number of requests. If any spam is sent to

thousands of users, there is a high possibility that the server hosting that site would receive millions of requests, increasing its cost and bandwidth use and effectively shutting down all its services. Such filters are also known as auto-retrieving filters [6].

Advantages:
Since the spam itself becomes the reason for the spammer's loss, spammers would hesitate to send spam to unknown users. More recipients for the spam would create more loss for the spammer's web server.

Disadvantages:
The job prior to fighting back is to detect a spam. Any URL sent to thousands of users mainly indicates a spam. However, at the bottom of messages there are often advertisements, such as those of Yahoo, MSN, etc., many of which contain legitimate URLs. If the site turns out to be legitimate, negatively affecting the site might lead to legal proceedings. To avoid such confusion, auto-retrieval filters should refer to blacklists of servers that are banned. Further, the servers need to be blacklisted by human intervention, thus ensuring that the auto-retrieval filters send requests only to web servers that are blacklisted.

With this approach, there is an easy way out for spammers. They need only include active unsubscribe links in their messages. In that way, the users with auto-retrieval filters will be unsubscribed from the program, which is good news for them. However, the spam is not reduced globally. There is also the possibility that spammers might include their contact information and an image for marketing purposes instead of their URLs. Doing so would wholly eliminate the danger posed by auto-retrieval filters.

To make this filter more effective, one needs to fine-tune the filter to each user's incoming messages. Fine-tuning a filter requires time and expertise, both of which are often hard to come by. Thus, one needs a filter that is adaptive in nature, one that self-learns from the given legitimate messages and spam [2].

CHAPTER 3
BAYESIAN CLASSIFIERS

To understand the workings of Bayesian classifiers, one needs to know the underlying concept of Bayesian networks. A Bayesian classifier is nothing but the application of a Bayesian network to the process of text classification.

3.1 Bayesian Networks

Bayesian networks are probabilistic networks. They are used as problem-solving models in different fields. In a Bayesian network, nodes indicate the variables of the problem, and the directed edges between the nodes indicate the relationships between the variables. A Bayesian network, in our case, is used to represent a probability distribution. In such a graph, a node represents a random variable, and a directed edge indicates a probabilistic dependency from the variable denoted by the parent node to that of the child. Hence, it is implied that any node in the network is conditionally independent of its non-descendants, given its parents. Each node is associated with a conditional probability table that indicates the distribution over that node for any possible assignment of values to its parents [0, 3].

Let's formulate the Bayesian network to solve our classification problem. Let C be the class variable that indicates to which class (legitimate / spam) a message belongs, and let node X_i denote any attribute (a token, in our case) in the message. We need to determine the class of a message, that is, the probability that C takes the value c_k given the specific values of the attributes. In our case, the specific values would be either 0 or 1, depending

on their presence in the message. The problem of determining the class can be solved using Bayes' theorem:

\[ P(C = c_k \mid X = x) = \frac{P(C = c_k)\, P(X = x \mid C = c_k)}{P(X = x)} \]

P(X = x | C = c_k) is difficult to calculate, as there is a high chance that an attribute might be dependent on some other set of attributes. The easy solution is to assume that the attributes are conditionally independent of each other. Consequently, the probability becomes:

\[ P(X = x \mid C = c_k) = \prod_i P(X_i = x_i \mid C = c_k) \]

If an e-mail message is considered to be a set of attributes (i.e., words), then, using a Bayesian network, we can calculate the probability that a message belongs to a specific class, namely, legitimate messages or spam.

3.2 Experimental Bayesian Filter

The experimental Bayesian filter is a content-based approach. This attribute gives the approach an advantage over other approaches. Spammers cannot freely modify the content to deceive the filters, as the content is the only reason to send spam in the first place. Content in this case includes the headers and the message itself.

First, the filter should be trained. A considerable number of good mails and spam would be required to train the filter. Two tables would be maintained, one each for legitimate e-mail and spam. Let us call them the good table and the bad table, respectively. The good table would contain the tokens that occur in the good e-mails, along with their number of occurrences. Similarly, the tokens of the bad e-mails would be maintained in the bad table. Based on these two tables, another table will be built using the Bayesian formula of probability [4, 5]:

\[ P(\text{bad} \mid \text{token}) = \frac{A}{A + B} \]
\[ A = P(\text{token} \mid \text{bad}) \cdot P(\text{bad}) \]
\[ B = P(\text{token} \mid \text{good}) \cdot P(\text{good}) \]

P(token | bad) = probability of the token being present in a spam e-mail.
P(token | good) = probability of the token being present in a good e-mail.
P(bad | token) = probability of an e-mail being spam given that the specific token is present.

Let's call this table the spam probability table. This table will contain all tokens that occur in all mails, along with the probability that a mail is spam when that token is present. Ideally, the probability of a mail being spam should be calculated given the presence of the combination of tokens in the e-mail. But this probability is difficult to calculate, as the number of tokens is huge, giving rise to a lot of

combinations. To make the matter simpler, we will assume that the tokens are independent of each other. Thus, the probability of an e-mail being spam is merely the combined probability of its tokens. Hence, this implementation of the Bayesian formula is known as the naïve Bayesian rule. For every new e-mail, a fixed number of effective tokens would be collected to calculate the combined probability. This number can vary from 5 to 25 depending on the success of the filter on one's personal messages. For two tokens a and b with spam probabilities p(a) and p(b):

\[ A = p(a) \cdot p(b) \]
\[ B = (1 - p(a)) \cdot (1 - p(b)) \]
\[ \text{Score} = \frac{A}{A + B} \]

If the score rises above a threshold value, the e-mail would be declared spam; otherwise, it would be declared a good e-mail. The selection of only 5 tokens is one of the filtering techniques used in the case of Bayesian filters. Effective tokens are those whose probabilities differ the most, on either side, from the threshold value. These tokens are either significantly good tokens or significantly bad tokens, and they are responsible for deciding the overall status of the message.

The challenges present are speed, efficiency, database size, and the need for training data. The larger the set of tokens, the greater the size of the database and the longer the training time. So, there is a need to consider only those tokens that make an impact in deciding the status of an e-mail. Since training and classification will

occur during the same phase of time, special care must be taken to make both operations as independent as possible. If the token is present only in the good table, its probability in the spam probability table would be recorded as a fixed value close to 0. If the token is present only in the bad table, its probability in the spam probability table would be recorded as a fixed value close to 1.

3.3 Cost Evaluation Measures

A false positive is mistakenly classifying a legitimate e-mail as spam, and a false negative is mistakenly classifying a spam as a legitimate e-mail. The cost of a false positive is much higher than that of a false negative. The existence of false positives destroys the faith of the user in his spam filter, because users tend to delete spam from a bulk folder without reading it, and deleting legitimate messages (because of the spam filter) is unacceptable. It is therefore acceptable to allow some false negatives rather than to have any false positives. Let L→S be the false positive error type and S→L be the false negative error type. Assuming that L→S is λ times costlier than S→L, we classify a message as spam if:

\[ \frac{P(C = \text{spam} \mid X = x)}{P(C = \text{legitimate} \mid X = x)} > \lambda \]

In our case, in which we rely on the independence assumption of the naïve Bayesian filter, the assumption holds. Therefore, P(C = spam | X = x) = 1 - P(C = legitimate | X = x), which leads to the following criteria:

\[ P(C = \text{spam} \mid X = x) > t, \quad \text{where } t \text{ is the threshold value} \]

Thus,

\[ t = \frac{\lambda}{1 + \lambda}, \quad \text{as} \quad \lambda = \frac{t}{1 - t} \]

Depending on the action taken on the spam folder, the threshold value can be altered. If spam messages are deleted directly once they are classified, then t is held as high as 0.999 (λ = 999), i.e. blocking a legitimate message is as bad as letting 999 spam messages pass the filter. Lower values of λ are acceptable depending on the different configurations made available for the spam folder. If the configuration is set up to send the mail back to the sender, asking him to resend it to a private unfiltered address of the recipient, then λ = 9 (t = 0.9) seems to be reasonable. Even λ = 1 (t = 0.5) is acceptable if the recipient happens to go through every e-mail in the bulk folder before manually deleting them.

Two factors can be used in this context to measure the performance of a filter, namely, spam precision and spam recall. Let n(L→S) and n(S→L) be the numbers of L→S and S→L errors, and let n(L→L) and n(S→S) count the correctly treated legitimate and spam messages, respectively. Spam recall (SR) and spam precision (SP) are defined as follows:

\[ SR = \frac{n(S \to S)}{n(S \to L) + n(S \to S)} \]
\[ SP = \frac{n(S \to S)}{n(L \to S) + n(S \to S)} \]

3.4 Independence Factor of a Bayesian Network

Using a Bayesian network, we can model the complex dependencies between features to infer the solution class. As the number of features increases, it becomes increasingly difficult for a message to be classified with all its dependencies. As a result, spam filters

implement a naïve Bayesian concept wherein features are assumed to be independent of each other. This assumption is balanced by setting a higher value for the threshold.

\[ P(X = x \mid C = c_k) = \prod_i P(X_i = x_i \mid C = c_k) \]

A naïve Bayesian model is the most restrictive form in the feature dependence spectrum. Research has been done regarding the performance of spam filters when some degree of dependence between features is allowed. This study can be formalized by introducing the notion of k-dependence Bayesian classifiers. A k-dependence Bayesian classifier is a Bayesian network wherein each feature is allowed to have a maximum of k parents. Based on this definition, we can say that a naïve Bayesian filter is a 0-dependence Bayesian classifier. We can also state that an ideal Bayesian filter (i.e. a full Bayesian filter with no independence) is an (N-1)-dependence Bayesian classifier, where N is the number of domain features. By varying the value of k, one can move step by step through the feature dependence spectrum and analyze the performance of the spam filter at every step.

It is also worth noting that as k grows, there are more conditioning variables with the same amount of data. This implies a larger probability space for estimation with the same data, causing inaccuracy in the probability estimates and leading to an overall decrease in performance. This performance problem has been observed in many domains while going from k=2 to k=3.
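The training tables, the spam probability table and the combined score of Section 3.2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the filter used in this report: tokens are taken to be lower-cased whitespace-separated words, each token is counted once per message, and the `floor` parameter (an assumed value of 0.01) keeps tokens seen in only one table away from exactly 0 or 1. The function names are hypothetical.

```python
from collections import Counter


def train(messages):
    """Build a token table: token -> number of training messages containing it."""
    table = Counter()
    for text in messages:
        table.update(set(text.lower().split()))  # count each token once per message
    return table


def spam_probability_table(good, bad, n_good, n_bad, floor=0.01):
    """Return token -> P(bad | token) = A / (A + B).

    A = P(token | bad) * P(bad), B = P(token | good) * P(good).
    floor is an assumed clamp so that no token gets exactly 0 or 1.
    """
    p_bad = n_bad / (n_good + n_bad)
    p_good = 1.0 - p_bad
    probs = {}
    for token in set(good) | set(bad):
        a = (bad[token] / n_bad) * p_bad
        b = (good[token] / n_good) * p_good
        probs[token] = min(max(a / (a + b), floor), 1.0 - floor)
    return probs


def combined_score(token_probs):
    """Score = A / (A + B), with A the product of p and B the product of (1 - p)."""
    a = b = 1.0
    for p in token_probs:
        a *= p
        b *= 1.0 - p
    return a / (a + b)


# Tiny made-up corpus, for illustration only.
good_table = train(["meeting at noon", "lunch at noon tomorrow"])
bad_table = train(["win money now", "win a free prize now"])
table = spam_probability_table(good_table, bad_table, n_good=2, n_bad=2)

print(table["win"], table["noon"])
print(combined_score([table["win"], table["now"]]))
```

The plain products in `combined_score` are adequate for a handful of tokens; for very long token lists they can underflow, in which case a log-domain formulation would be the safer choice.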

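The cost-sensitive measures of Section 3.3 reduce to a few one-line formulas, sketched below. The function names are hypothetical, and the error counts in the usage lines are made-up numbers chosen only to exercise the formulas; they are not results from this report.

```python
def threshold_from_lambda(lam):
    """t = lambda / (1 + lambda)."""
    return lam / (1.0 + lam)


def lambda_from_threshold(t):
    """lambda = t / (1 - t)."""
    return t / (1.0 - t)


def is_spam(p_spam, lam):
    """Classify as spam when P(C = spam | X = x) exceeds the threshold t."""
    return p_spam > threshold_from_lambda(lam)


def spam_recall(n_ss, n_sl):
    """SR = n(S->S) / (n(S->L) + n(S->S))."""
    return n_ss / (n_sl + n_ss)


def spam_precision(n_ss, n_ls):
    """SP = n(S->S) / (n(L->S) + n(S->S))."""
    return n_ss / (n_ls + n_ss)


for lam in (999, 9, 1):
    print(lam, threshold_from_lambda(lam))

# Hypothetical counts: 90 spam caught, 10 spam missed, 2 legitimate blocked.
print(spam_recall(n_ss=90, n_sl=10), spam_precision(n_ss=90, n_ls=2))
```

A message with a spam probability of 0.95 would therefore be blocked at λ = 9 but passed at λ = 999, which is the trade-off the threshold is meant to control.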
CHAPTER 4
CLASSIFICATION TECHNIQUES IN A NAÏVE BAYESIAN FILTER

Once the naïve Bayesian filter has been trained using huge datasets of spam and non-spam messages, it is ready to perform its basic function of filtering, i.e. classifying new, unseen incoming messages. Currently, there are many classification techniques used with the naïve Bayesian filters available on the market. We discuss four significant techniques in detail in this chapter.

4.1 Use of All Tokens

This technique uses all tokens from a new e-mail for classification. As each token is associated with a probability that determines the chances of the e-mail being spam, the tokens from each new e-mail would be used to calculate a combined probability that assigns a final score to the e-mail. A new token in an e-mail (i.e. one with no record in the database) would be assigned a probability of 0.4. This assumption has been implemented in practice and found successful in naïve Bayesian filters. It implies that a new token is considered more likely to be a good token than a part of a spam. It also reflects the cautious approach adopted by spam filters, since the cost of a false positive is much higher than that of a false negative. However, we turn off this feature for the purposes of our evaluation, because we do not want to favor one technique (by taking this cautious approach) over the others. This global technique makes sense, as we parsed all tokens from the training datasets to build the database used for classification. Hence, it is logical to use the same technique for classification.
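A minimal sketch of the all-tokens technique is given below. It assumes a spam probability table whose values lie strictly between 0 and 1 and tokens that are lower-cased whitespace-separated words; the function name and the option of passing `unknown=None` to skip unseen tokens (one possible way of turning the 0.4 default off) are assumptions made for illustration, not details taken from this report.

```python
import math


def score_all_tokens(text, spam_probs, unknown=0.4):
    """Combine the spam probabilities of every token in the message.

    spam_probs maps token -> P(bad | token), assumed strictly between 0 and 1.
    unknown is the probability given to unseen tokens; pass None to skip them.
    """
    log_odds = 0.0
    for token in text.lower().split():
        p = spam_probs.get(token, unknown)
        if p is None:
            continue  # unseen token and the default is switched off
        log_odds += math.log(p) - math.log(1.0 - p)
    # A / (A + B) equals the logistic function of the summed log-odds;
    # the tanh form avoids overflow and underflow for long messages.
    return 0.5 * (1.0 + math.tanh(log_odds / 2.0))


# Hypothetical table values, for illustration only.
probs = {"free": 0.9, "meeting": 0.1}
print(score_all_tokens("free meeting", probs))
print(score_all_tokens("free hello", probs))
print(score_all_tokens("free hello", probs, unknown=None))
```

Summing log-odds gives the same score as multiplying the probabilities directly, but it stays numerically stable when a message contains hundreds of tokens, which matters for a technique that uses all of them.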

It should be noted that the classification phase is critical because of the heavy cost of a false positive, in contrast to the training phase, wherein we know exactly whether an e-mail is spam or not. This technique might be deceived by an e-mail in which there is a long story of how a person got rich instantly, followed by a link to a spam site. Such e-mails would contain a large number of good tokens compared to a typical spam. There is a high possibility that such e-mails would deceive spam filters and be categorized as good e-mails. But it is equally true that spammers avoid writing a long story, as it is very likely that readers would rather delete than read a long article from some unknown source. Thus, the all-tokens method is found to be effective in practical filters. For example, Bill Yerazunis has used this technique in his Controllable Regex Mutilator (CRM114).

4.2 Use of a Fixed Number of Tokens

The fixed number of tokens technique, successfully implemented by Paul Graham, takes only a fixed number of tokens from a new e-mail into consideration for assigning a final score to it. The number can vary from 5 to 20 to 25, but these tokens are assumed to be the most effective ones in the given e-mail. An effective token is one whose probability deviates the most from 0.5 on either side, i.e. it can be a good token or a bad one. The combined probability of these tokens assigns a final score to the given new e-mail. In that way, the most effective tokens are emphasized for the task. The technique directly targets those words that are found most of the time in either legitimate e-mails or spam. As a result, the final score would most probably end up near 1 if the

28 2 is a spam or near 0, otherwise. Thus, this technique eliminates the doubt of classification if the final score ends up near 0.5. This method of effectiveness was proposed by Sahami et al. who calculated its effectiveness with the help of the mathematical formula of mutual information. It is recommended that the same token should not be counted more than once while calculating a final score. In that way, the filter makes an unbiased decision with no interference from any specific token even if it had occurred a few times in the message. The number of tokens (5/20/25) is a personal decision, based on the success of the spam filter on personal s. If the number of tokens in a new happens to be less than a fixed number, say 0, then the use of all tokens is the logical back-up technique to be used for classification. This technique has some advantages over other techniques. ) To avoid the problem of false positives, the threshold value can be raised to any value near to 0.9 from ) In the case of huge s, the classification would be much faster. 4.3 Use of a Standard Deviation This technique, like the previous technique, considers only the effective tokens. However, it also emphasizes the spam probability of tokens rather than the number of tokens. If a standard deviation (i.e. stddev) is of the value x, then all tokens with a spam probability in the range of 0.5-x to 0.5+x would be discarded. The remaining tokens would be the effective ones used to calculate the combined probability and assign a final score to the new . BogoFilter, a spam filter that is currently available on the market,

29 22 has adopted this approach. The value of the stddev can be varied based on the filter s success on one s personal messages. The value, which is found successful and recommended, is 0.4. Thus, tokens under consideration would be the ones with probabilities ( ) 0. and lower and ( ) 0.9 and higher. The specialty of the technique is that it assigns the score to the independent of its size. Based on the content of an , there might be only ten effective tokens, or sometimes there may be even more than 00. But for every classification, only effective tokens with probabilities 0.9 and above and 0. and lower would be considered. Like the previous technique, the score in this case would be near (if spam) or near to 0, otherwise. Thus, it is less likely that the score would end up near 0.5, and, thus, giving rise to the possibility of false positives. The same token should not be considered more than once to avoid the interference from any specific token if it had occurred a few times in the message. The threshold, like the previous technique, can be raised to 0.9 to reduce the possibilities of false positives. The processing time for classification would vary according to the size of the Use of a Relative Number of Tokens We would like to propose a technique and evaluate it along with other real-world successful techniques. Since the naïve Bayesian filter is trained with the contents of messages, it is logical to apply the same content-based approach for classification as well. In this technique, we select some percentage (say 30 percent) of effective tokens out of the total tokens of an message. These tokens will be used to calculate the combined

30 23 probability and assign a final score to the message. The percentage value can be tuned, based on the success of the filter on personal messages. This approach is the combination of both the above techniques: the use of a fixed number of tokens and the use of a standard deviation. It values both the effectiveness and number of tokens while classifying a message. So, if an contains 00 tokens, then the 30 most effective tokens among them will be used for classification. There is a high possibility that most of these 30 odd tokens would fall in the stddev of 0.4. In that way, we utilize the advantages of both the above techniques. As it is a content-based approach, there are chances that the final score of an might fall near 0.5. To avoid false possibilities, the threshold value can be raised to a higher value. The process time for classification depends on the size of the message.
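The three token-selection rules of Sections 4.2 through 4.4 can be compared side by side in a short sketch. It assumes the per-token spam probabilities are already available as a dictionary, which also ensures that each token is counted only once. The function names, the tie-breaking rule, and the small tolerance `eps` are hypothetical and introduced only for illustration.

```python
def effectiveness(p):
    """Distance of a spam probability from the neutral value 0.5."""
    return abs(p - 0.5)

def _ranked(token_probs):
    # Most effective first; ties are broken alphabetically (an assumption).
    return sorted(token_probs, key=lambda t: (-effectiveness(token_probs[t]), t))

def select_fixed(token_probs, n):
    """Fixed number: the n most effective tokens.

    If the message has fewer than n tokens, all of them are returned,
    which matches the fall-back to the all tokens technique.
    """
    return _ranked(token_probs)[:n]

def select_stddev(token_probs, x=0.4, eps=1e-12):
    """Standard deviation: discard tokens between 0.5 - x and 0.5 + x.

    eps guards against floating-point rounding at the boundary values.
    """
    return [t for t in _ranked(token_probs)
            if effectiveness(token_probs[t]) >= x - eps]

def select_relative(token_probs, percent=30):
    """Relative number: the most effective `percent` percent of all tokens."""
    k = max(1, len(token_probs) * percent // 100)   # integer arithmetic
    return _ranked(token_probs)[:k]
```

For a message with 100 distinct tokens, `select_relative(token_probs, 30)` keeps the 30 most effective tokens, whereas `select_stddev(token_probs, 0.4)` keeps however many tokens have probabilities of 0.1 and lower or 0.9 and higher.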

CHAPTER 5
EXPERIMENTS

Our experiment comprises two phases: the training phase and the classification phase. In the training phase, the filter is trained using a known corpus of spam and good emails. A database of the tokens appearing in each corpus and their total occurrences is maintained. Based on its occurrences in the spam and good email sets, each token is assigned a probability that an email is spam given the token's presence. Then, using this knowledge of tokens, the filter classifies every new incoming mail in the classification phase. Once the status of a new mail is confirmed, all its tokens are also recorded, thus updating the database. This self-learning function sets our filter apart from the other available spam filters. Even if the filter misclassifies a message, the user can rectify it, and the spam filter updates its database accordingly. Thus, the filter learns from its mistakes, too.

We used 250 legitimate messages and 350 spam messages. The legitimate messages belong to my student webmail account assigned to me by Utah State University. The spam messages were collected from an archive provided by Nik Martin, available at the site hosted by Paul Graham [6]. The spam was collected over the last four to five years. The proportion of spam to legitimate messages is quite large, making it more likely that legitimate messages are misclassified as spam. This makes the situation more challenging, as the cost of false positives is much higher than that of false negatives. We feel that by minimizing the false positives in such a situation, we have achieved an efficient Bayesian spam filter. Moreover, by recording tokens from such a large number of spam messages, we have covered almost all the topics of spam and are in a good position to classify new incoming mail.

Each word in each message is considered to be a token. The whole message including the header is parsed for tokens. The token separator is a blank space. Words quoted in double and single quotes, numbers, and all words separated by blank spaces are also considered as tokens. The tokens under study and used for classification are [...] Since we are not using any type of lemmatizer, we consider different forms of the same word as different tokens. For example, run, running and runner would all be considered different tokens even though they stem from the single word run. There are studies [7] that show the positive effect of a lemmatizer on a filter's performance. The implementation of a lemmatizer is one of the topics of our future study. See Chapter 8.

Only the message content is used for classification purposes. Doing so eliminates the interference of tokens present in headers in determining the status of a message. In that way, there is no bias among the classification techniques under evaluation, as some techniques consider only a few (or a percentage of) tokens for assigning a final score to the message.

Our evaluation was conducted in the classification phase. We evaluated four effective filtering techniques of the Bayesian spam filter for their classification performance. We evaluated these techniques using cost-sensitive measures, as we believe that the cost of a false positive is much higher than that of a false negative. Eighty new incoming messages were tested in two batches (first batch: 50; second batch: 30) to obtain significant evaluation results. These test messages belong to the same account (i.e. my webmail account) previously used in the training phase. Thus, we avoided any type of erratic behavior from the anti-spam filter. The effective configuration of each technique was used for evaluation purposes. In the standard deviation technique, the value of the standard deviation was set to 0.4, and in the percentage technique, 30 percent of the total tokens were used to calculate the final score. The tabulated results and related plotted graphs are explained in the next chapter.
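The training and self-learning steps described in this chapter can be sketched as a small token database. This chapter does not give the exact formula for a token's probability, so the sketch assumes a simple ratio of per-message token frequencies in the two corpora. The class name `TokenDatabase` and its methods are hypothetical, and this is not the implementation used in the experiments.

```python
from collections import Counter

class TokenDatabase:
    """Token occurrence counts for a spam corpus and a good-mail corpus."""

    def __init__(self):
        self.spam_counts = Counter()
        self.good_counts = Counter()
        self.n_spam = 0   # number of spam messages learned
        self.n_good = 0   # number of good messages learned

    @staticmethod
    def tokenize(message):
        # Tokens are separated by blank space; no lemmatizer is applied,
        # so "run" and "running" remain different tokens.
        return message.split()

    def learn(self, message, is_spam):
        """Record every token of a message whose status is confirmed."""
        tokens = self.tokenize(message)
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.good_counts.update(tokens)
            self.n_good += 1

    def spam_probability(self, token):
        """Assumed estimate: b / (b + g), with b and g the token's frequency
        per spam message and per good message. None if there is no data."""
        if self.n_spam == 0 or self.n_good == 0:
            return None
        b = self.spam_counts[token] / self.n_spam
        g = self.good_counts[token] / self.n_good
        if b + g == 0:
            return None
        return b / (b + g)
```

Calling `learn` again after each confirmed classification keeps the counts up to date, which is the self-learning behavior described above.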

CHAPTER 6
COST-SENSITIVE EVALUATION

The evaluation measures frequently used for classification are accuracy (Acc) and the error rate (Err = 1 - Acc). Accuracy is the fraction of all messages that are classified correctly, i.e. spam classified as spam and legitimate messages classified as legitimate. The error rate is the fraction of all messages that are false positives or false negatives.

$$\mathrm{Acc} = \frac{n_{L \to L} + n_{S \to S}}{N_L + N_S} \qquad \mathrm{Err} = \frac{n_{L \to S} + n_{S \to L}}{N_L + N_S}$$

Here $N_L$ and $N_S$ are the numbers of legitimate and spam messages, respectively, and $n_{X \to Y}$ is the number of messages of class X that are classified as class Y.

In our cost-sensitive evaluation, we assume that the cost of a false positive is much higher than that of a false negative. The above formulae for accuracy and error rate, however, do not consider this cost-sensitive factor. Let us assume that a false positive is λ times more costly than a false negative, the implication being that we treat each legitimate message as being worth λ messages. So, if a legitimate message is misclassified, it counts as λ errors, and if it is classified correctly, it counts as λ successes. This assumption can be formulated as a weighted accuracy (WAcc) and a weighted error rate (WErr = 1 - WAcc):

$$\mathrm{WAcc} = \frac{\lambda\, n_{L \to L} + n_{S \to S}}{\lambda N_L + N_S} \qquad \mathrm{WErr} = \frac{\lambda\, n_{L \to S} + n_{S \to L}}{\lambda N_L + N_S}$$
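The weighted measures above can be checked with a few lines of code. The sketch below is only an illustration: the function names are hypothetical, and the counts in the usage note are made-up values rather than results taken from the tables.

```python
def weighted_accuracy(n_ll, n_ls, n_ss, n_sl, lam=1):
    """WAcc: each legitimate message counts as lam messages.

    n_ll, n_ls: legitimate messages classified as legitimate / as spam
    n_ss, n_sl: spam messages classified as spam / as legitimate
    """
    total = lam * (n_ll + n_ls) + (n_ss + n_sl)   # lam * N_L + N_S
    return (lam * n_ll + n_ss) / total

def weighted_error(n_ll, n_ls, n_ss, n_sl, lam=1):
    """WErr = 1 - WAcc: a false positive counts as lam errors."""
    total = lam * (n_ll + n_ls) + (n_ss + n_sl)
    return (lam * n_ls + n_sl) / total
```

With 25 legitimate and 25 spam messages, 11 false positives and no false negatives, `weighted_accuracy(14, 11, 25, 0, lam=1)` is 39/50 = 0.78, while `lam=9` lowers it to 151/250 = 0.604. The same set of errors is penalized more heavily as λ grows.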

To get a better idea of the filter's performance in terms of accuracy and error rate, we must compare these measures with a baseline approach [7]. In the baseline approach, we assume that no filter is active, i.e. all spam passes through, and legitimate messages are never blocked. The weighted accuracy and error rate of the baseline are:

$$\mathrm{WAcc}^{b} = \frac{\lambda N_L}{\lambda N_L + N_S} \qquad \mathrm{WErr}^{b} = \frac{N_S}{\lambda N_L + N_S}$$

We calculate the TCR (Total Cost Ratio) to compare with the baseline approach [7]:

$$\mathrm{TCR} = \frac{\mathrm{WErr}^{b}}{\mathrm{WErr}} = \frac{N_S}{\lambda\, n_{L \to S} + n_{S \to L}}$$

The higher the value of TCR, the better the performance. With TCR < 1, the baseline approach is the better option, implying that the absence of a filter gives better results than the use of a filter. If cost is proportional to wasted time, then TCR compares the time wasted deleting all spam messages manually with the time wasted when the filter is used: the time to delete manually all spam messages misclassified as legitimate ($n_{S \to L}$) plus the time to recover all legitimate messages mistakenly classified as spam ($\lambda\, n_{L \to S}$).

Table 1 lists false positives, false negatives, and correct classifications of all four techniques with different configurations for the threshold. Table 2 lists spam recall, spam precision, weighted accuracy, baseline weighted accuracy, and total cost ratio (TCR) for the same. TCR is calculated for all techniques for different values of the threshold. In a cost-sensitive evaluation, TCR can be used as a scale of better performance. Table 1 indicates the fall in the number of false positives and the rise in the number of false negatives as the threshold is raised for all four techniques. Table 2 indicates that the fixed token technique outperforms the others for every value of λ. Unlike the other techniques, the fixed token approach gives excellent results for λ = 999. The all token approach is the worst of them all. Our percentage approach performs better than the standard deviation approach for λ = 1 and λ = 999.

Based on both tables, we can say that by lowering the threshold value to 0.5, we risk an increase in the number of false positives. But at the same time, the evaluation shows an increase in TCR values, indicating that an increase in false positives does not prove costly to us. However, in practice, no user would like to use a threshold of 0.5, which implies that he has to go through every spam mail before deleting it. The filter would just be helping the user to locate the spam. An ideal filter would be one wherein spam messages are deleted without the supervision of the user and no legitimate message is deleted in the process.

One can observe that $n_{S \to L}$ is much smaller than $n_{L \to S}$. This is due to the fact that the number of spam messages used in the training phase is far greater than the number of legitimate messages. Our filter, being a self-learner, would improve its performance in the future and would keep $n_{S \to L}$ as low as possible. We believe that, after a period of time, our filter would reach its peak performance and remain constant thereafter. The ideal filter should give a spam precision of 100 percent, a spam recall of 100 percent and a positive value for TCR for all the values of λ.
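The measures reported in Table 2 can be computed from the four counts of Table 1. The sketch below implements the TCR formula given above. Spam recall and spam precision are not defined in this chapter, so their standard definitions are assumed here. The function names are hypothetical, and the numbers in the usage note are illustrative rather than taken from the tables.

```python
import math

def spam_recall(n_ss, n_sl):
    """Assumed standard definition: share of spam that the filter catches."""
    return n_ss / (n_ss + n_sl)

def spam_precision(n_ss, n_ls):
    """Assumed standard definition: share of blocked messages that are spam."""
    return n_ss / (n_ss + n_ls)

def total_cost_ratio(n_ls, n_sl, n_spam, lam=1):
    """TCR = N_S / (lam * n_ls + n_sl); values below 1 mean that using
    no filter at all (the baseline) would have been cheaper."""
    cost = lam * n_ls + n_sl
    if cost == 0:
        return math.inf   # a perfect filter makes no costly mistakes
    return n_spam / cost
```

For example, with 25 spam messages all caught and 11 legitimate messages blocked, recall is 100 percent and precision is 25/36, about 69.44 percent. `total_cost_ratio(11, 0, 25, lam=1)` is 25/11, about 2.27, but with `lam=9` it drops to 25/99, about 0.25, so at that cost level the baseline would be preferable.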

Table 6.1. The results (false positives, false negatives and correct classifications) of four filtering techniques for different values of λ.

Filter Technique        λ      n L->S   n S->L   n L->L   n S->S
[Rows a) All Token tech, b) Fixed Token tech, c) Std Deviation tech and d) Percentage tech for each of the three values of λ; the cell values are not recoverable.]

Table 6.2. The results (TCR) of four filtering techniques for different values of λ.

Filter Technique        λ      Spam Recall   Spam Precision   Weighted Accuracy   Baseline W.Acc.   TCR
a) All Token tech       1      100%          69.44%           78%                 50%               [...]
b) Fixed Token tech     1      100%          80.65%           88%                 50%               [...]
c) Std Deviation tech   1      96%           72.73%           80%                 50%               [...]
d) Percentage tech      1      100%          73.53%           82%                 50%               [...]
a) All Token tech       9      100%          69.44%           60.4%               90%               [...]
b) Fixed Token tech     9      100%          83.33%           82%                 90%               [...]
c) Std Deviation tech   9      96%           80%              78%                 90%               [...]
d) Percentage tech      9      100%          73.53%           67.6%               90%               [...]
a) All Token tech       999    88%           70.97%           64.02%              99.9%             [...]
b) Fixed Token tech     999    80%           100%             99.98%              99.9%             [...]
c) Std Deviation tech   999    84%           84%              84%                 99.9%             [...]
d) Percentage tech      999    76%           86.36%           87.99%              99.9%             [...]

The evaluation of the fixed token approach was conducted with 5 tokens. To get an optimal anti-spam filter, we further evaluated the fixed token approach with different numbers of fixed tokens, i.e. 5, 15, 20, and 25. Tables 3 and 4 list their results.

Table 6.3. The results (false positives, false negatives and correct classifications) of four configurations (5, 15, 20 and 25) of the fixed token approach for different values of λ.

Filter Technique        λ      n L->S   n S->L   n L->L   n S->S
[Rows a) Fixed - 5, b) Fixed - 15, c) Fixed - 20 and d) Fixed - 25 for each of the three values of λ; the cell values are not recoverable.]

Table 6.4. The results (TCR) of four configurations (5, 15, 20 and 25) of the fixed token approach for different values of λ.

Filter Technique
(Fixed Token tech)      λ      Spam Recall   Spam Precision   Weighted Accuracy   Baseline W.Acc.   TCR
a) Fixed - 5            1      96%           85.71%           90%                 50%               [...]
b) Fixed - 15           1      96%           77.42%           84%                 50%               [...]
c) Fixed - 20           1      100%          75.76%           84%                 50%               [...]
d) Fixed - 25           1      100%          71.43%           80%                 50%               [...]
a) Fixed - 5            9      92%           85.19%           84.8%               90%               [...]
b) Fixed - 15           9      96%           80%              78%                 90%               [...]
c) Fixed - 20           9      100%          75.76%           71.2%               90%               [...]
d) Fixed - 25           9      100%          71.43%           64%                 90%               [...]
a) Fixed - 5            999    88%           95.65%           95.99%              99.9%             [...]
b) Fixed - 15           999    96%           85.71%           84.01%              99.9%             [...]
c) Fixed - 20           999    100%          78.13%           72.03%              99.9%             [...]
d) Fixed - 25           999    100%          73.53%           64.04%              99.9%             [...]

The values of TCR and weighted accuracy show the better performance of 5 tokens over the others for each value of λ. The performance degrades as we consider a higher number of tokens for the classification. However, even the most effective configuration still cannot be used as a stand-alone first-pass filter for λ = 999 and λ = 9. It needs the help of other techniques, such as blacklists and whitelists, for effective spam filtering.

To find an optimal number of tokens, we evaluated further, covering the range of 5 to 15 tokens. Tables 5 and 6 list their results. The results for 5, 7, and 10 tokens were identical to each other and remained constant for the different values of λ. However, the results for 3 fixed tokens were the worst, and the results for 12 fixed tokens were close to those for 5, 7, and 10 fixed tokens. It can be said that, in the case of the fixed token approach, the filter reaches optimal performance in the range of 5 to 12 tokens and degrades thereafter. This observation is confirmed by the plotted graphs (Figures 1 through 3). They show the maximum peak (i.e. TCR value) in the range of 5 to 12.

Table 6.5. The results (false positives, false negatives and correct classifications) of four configurations (3, 7, 10 and 12) of the fixed token approach for different values of λ.

Filter Technique        λ      n L->S   n S->L   n L->L   n S->S
[Rows a) Fixed - 3, b) Fixed - 7, c) Fixed - 10 and d) Fixed - 12 for each of the three values of λ; the cell values are not recoverable.]