Spam Filtering using Spam Mail Communities


Deepak P (1), Jyothi John (1), Sandeep Parameswaran (2)
(1) Model Engg. College, Kochi, Kerala, India
(2) IBM Global Services India Pvt. Ltd., Bangalore, India
deepak-p@eth.net, jyothijohn@mec.ac.in, sandeep_potty@yahoo.com

Abstract

Many of us have heard people say, on seeing new mails in their inboxes, "Oh! That spam again." People who observe the kinds of spam messages they receive would perhaps be able to classify similar spam mails into communities. Such properties of spam messages can be used to filter spam. This paper describes an approach to spam filtering that exploits the properties of spam messages that allow them to be classified into different communities. The working of a possible implementation of the approach is described in detail. The new approach does not rest on any preconceptions about spam and can also be used to block non-spam nuisance mails. It can also support users who want selective blocking of spam mails based on their interests. The approach is inherently user-centric, flexible and user-friendly. The results of some tests done to check the feasibility of such an approach are evaluated as well.

1. Introduction

Spam mail can be described as unsolicited e-mail or unsolicited commercial bulk e-mail. Spam is becoming a great problem today, and survey reports show that in most cases more than 25% of received e-mail is spam [1]. Spam is considered a serious problem because it causes huge losses to organizations through bandwidth consumption, mail server processing load, and the productive time users spend responding to, deleting or forwarding it [1]. The same study estimates that the cost incurred for each spam message received amounts to nearly $1. Spam mail is thus an increasing concern, and the need to prevent it from clogging mailboxes is assuming greater significance.
Spam mails are sent to addresses that spammers find by means of spiders that harvest addresses published on web pages, by means of references from other people, or by guessing. People use different techniques to prevent spam; one example is publishing mail addresses on web pages in forms that are not easily recognizable, such as user(a)domain(.)com for the mail address user@domain.com. The focus of this study is to filter spam mail, that is, to shield users from spam so that the time spent on detecting and dealing with spam mails can be eliminated (or at least reduced). The losses due to bandwidth consumption and mail server processing load are not considered here.

Section 2 enumerates the different quality of service parameters for spam filters. Section 3 describes some of the current approaches to spam filtering. Section 4 evaluates the current approaches and how much consideration they give to spam communities. Section 5 describes and evaluates a new approach to spam filtering which is based on spam communities. Section 6 describes some experiments conducted to evaluate core concepts of the new approach. Section 7 lists some conclusions and possible future work, and Section 8 lists the references.

2. Considerations for spam filters

Spam filters are judged by certain quality parameters. Spam precision is the percentage of messages classified as spam that truly are spam. Spam recall is the proportion of actual spam messages that are classified as spam. Non-spam messages are usually called solicited messages or legitimate messages. Legitimate precision, analogously, is the percentage of messages classified as legitimate that truly are legitimate.
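As an illustration, the quality parameters of this section (together with legitimate recall, the legitimate-mail counterpart of spam recall) can be computed from the four possible outcomes of a filtering decision. The sketch below is only a minimal example: the names `Counts` and `filter_quality` are hypothetical, and the counts are illustrative values chosen to be consistent with the figures reported for Set 2 in Section 6 (40 spam and 10 legitimate test messages, 3 false positives, 15 false negatives).

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Outcome counts of a filter run (hypothetical helper type)."""
    spam_as_spam: int    # spam correctly flagged as spam
    legit_as_spam: int   # legitimate mail flagged as spam (false positives)
    spam_as_legit: int   # spam delivered as legitimate (false negatives)
    legit_as_legit: int  # legitimate mail correctly delivered


def filter_quality(c: Counts) -> dict:
    """Return the four quality parameters as fractions between 0 and 1."""
    return {
        "spam_precision": c.spam_as_spam / (c.spam_as_spam + c.legit_as_spam),
        "spam_recall": c.spam_as_spam / (c.spam_as_spam + c.spam_as_legit),
        "legit_precision": c.legit_as_legit / (c.legit_as_legit + c.spam_as_legit),
        "legit_recall": c.legit_as_legit / (c.legit_as_legit + c.legit_as_spam),
    }


# Illustrative counts: 40 spam (25 caught, 15 missed), 10 legitimate (3 misflagged).
example = Counts(spam_as_spam=25, legit_as_spam=3, spam_as_legit=15, legit_as_legit=7)
quality = filter_quality(example)
for name, value in quality.items():
    print(f"{name}: {value:.1%}")
```

Note that a single false positive lowers both spam precision and legitimate recall, which is why these two measures move together when a filter is tuned to avoid misclassifying legitimate mail.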

Legitimate recall is the proportion of actual legitimate messages that are classified as legitimate [2]. A bit of thought reveals that spam precision is the parameter to be maximized. We do not want any legitimate messages to be classified as spam, even if some errors occur the other way round. More plainly, the number of false positives should be reduced to a minimum. Paul Graham opines [3] that a filter that yields false positives is like an acne cure that carries the risk of death to the patient.

3. Approaches to filter spam

The current techniques filter spam mail by classifying a message as either spam or non-spam (legitimate). Most of them do statistical filtering using methods such as identifying keywords, phrases etc. Some of the different approaches are reviewed in the subsections below.

3.1 Naïve Bayesian Filtering [2]

Here, a message is classified into one of two categories, spam or legitimate, based mostly on the message content. The message is represented as a vector using the set of words occurring in it. The probability of a message being spam given its vector is calculated by Bayes' theorem. The phrases in each message are also examined, as are various domain-specific details. The approach leads to the classification of a message as spam or legitimate. Further studies conclude that additional safety nets are needed for the naïve Bayesian anti-spam filter to be viable in practice [4]. Bayesian approaches have also been described by Paul Graham ([3],[5]).

3.2 Memory-based approach

Different memory-based approaches have been tried, and one system, TiMBL (Tilburg Memory Based Learner) [6], is based on memory-based learning.
Studies on memory-based spam filtering ([7],[8]) usually represent a mail as a vector based not on words but on various features of spam mails, such as the presence of words like "adult" and "sex", phrases like "be over 21" and "online pharmacy", and characters like "$" and "!". Training is done, and a set of vectors is built and stored for the classes of spam and legitimate mails. Given a new mail, the k-nearest neighbor algorithm is used to obtain a set of nearby vectors (based on Hamming distance etc.). Either the mails of that set vote on the new mail, weighted by their similarity to it, or the new mail is put into the class to which the majority of the set belongs. The major advantage is that the features of each and every mail of the training set can be used without storing the mails themselves. Moreover, the system can be made to learn during mail filtering as well, leading to the automatic framing of implicit rules that are user-specific. The major challenge is to decide on the number of elements in the vector and on what each vector element represents. Example rules may include "add 1 to the 4th element for every occurrence of $".

3.3 Neural-network based approach

A neural network based approach, which focuses on building vocabularies for spam e-mails, has also been tried [9], but the results of the study seem to be much less generalizable. Such vocabulary-based approaches tend to be much more vulnerable to false positives in cases such as automatic newsletters.

3.4 Blacklist and whitelist approaches

Certain spam filters contain just a pattern matcher, whereby a mail containing more than k words from a given set, or coming from an address resembling one in a list, is considered to be junk. The whitelist approach uses lists for the opposite purpose, such as storing words that are common in legitimate mails (for example, a salutation by name) or storing addresses that are known.
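The pattern-matching rule just described can be sketched in a few lines. This is a hedged illustration rather than the logic of any particular filter: the function name `is_junk`, its parameters and the example lists are all hypothetical, and "an address resembling one from a list" is simplified here to an exact, case-insensitive match.

```python
import re


def is_junk(sender, body, blacklist_words, blacklist_senders, whitelist_senders, k):
    """Blacklist/whitelist check (illustrative sketch).

    A whitelisted sender is always delivered. Otherwise the mail is junk if
    the sender is blacklisted or the body contains more than k distinct
    blacklisted words.
    """
    sender = sender.lower()
    if sender in whitelist_senders:
        return False
    if sender in blacklist_senders:
        return True
    # Distinct lower-cased words of the body (assumed tokenization).
    words = set(re.findall(r"[a-z0-9]+", body.lower()))
    return len(words & blacklist_words) > k


# Hypothetical lists for demonstration only.
bad_words = {"pharmacy", "mortgage", "refinance"}
bad_senders = {"offers@bulk.example"}
known_senders = {"friend@home.example"}

print(is_junk("offers@bulk.example", "hello", bad_words, bad_senders, known_senders, 1))
print(is_junk("new@site.example", "online pharmacy and mortgage deals",
              bad_words, bad_senders, known_senders, 1))
```

The sketch makes the weakness of such rules visible: the decision depends entirely on fixed lists and on the threshold k, so a sender who varies the address or avoids the listed words passes unnoticed.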
The blacklist approach can assure no false positives and the whitelist approach can assure zero false negatives, but the precision would be very low. Nowadays spam mails increasingly include salutations such as "Dear xyz" for a mail directed to xyz@domain.com, which reduces the precision of such approaches to unacceptably low levels.

3.5 Miscellaneous approaches

There are various other approaches for filtering spam at various levels. The ISP of which the author is a client uses a crude filtering technique whereby mails directed to more than 10 addresses are deleted as spam. Currently, a lot of spam that contains exactly 10 addresses in the To field gets through. It is evident that many spammers use learning algorithms that detect the behavior of such filters quite quickly. Several approaches ask the user to maintain more than one address, all of which forward mails to the same address, where duplicates are deemed to be spam and are deleted or flagged. Another approach, using extended addresses, has also been described [10]. Many such techniques require the user to perform certain activities periodically, so that the filtering is no longer transparent to the user.

3.6 The best, optimal, always-win, impossible approach

The best approach to filtering would be collaborative filtering. People receiving similar e-mails are asked to judge whether an e-mail is spam or not, and the results are used for spam filtering. This is obviously next to impossible. Another approach to collaborative filtering would be to present e-mails to different people (in porn-related sites, we could periodically show a mail to the user and require him to judge whether the mail is spam before going to the next page or photograph) and use their judgments. This should succeed in all cases, but it is impossible because secrecy is lost completely, a concern that makes the entire discussion in this paragraph futile. But we have to realize that no machine-based approach would obtain accuracies anywhere near those obtained with human intelligence.

4. Spam mail communities and current approaches

It is a common observation that spam mails can be classified into various communities, such as online pharmacies, mortgages, vacation offers etc. Such communities are obvious and identifiable on visual inspection, but there might be many not-so-explicit communities that are machine-identifiable, such as porn mails bearing links to xyz.com. None of the current approaches classify mails to such an extent.
Some classify mails only as spam and legitimate, whereas others classify spam mails as porn-spam and other-spam. Memory-based approaches lend themselves naturally to such classifications: each element of the vector can be used to indicate a class of spam, so that the first element may indicate the probability of a message being porn-spam, the second the probability of it being a get-rich spam, and so on. But clearly, the number of classes that can be imposed by such techniques is limited to the number of elements in the vector. The other methods, which are mostly based on statistical clustering, cannot easily be extended with such community identification techniques.

The communities need not be hardwired into the system; a spam filter may be given the capability of identifying such spam communities automatically. If the system is built into the client end, the communities can even be highly user-specific: a system filtering mails for a person who receives only online-prescription spam may build communities such as weight-loss, anti-aging, sexual enhancement, hair loss etc. A person who wants to receive anti-aging advertisements may mark that community as non-spam. Identification of such communities can thus be used to make the filter more flexible and more user-centric.

5. A community-based approach

5.1 Underlying concepts

The main assumption, or foundation, of this approach is that spam mails can be classified into many communities. A rudiment of this approach has been used in some studies in which a mail is classified as legitimate, porn-spam or other-spam, and mails mapping to the latter two communities are labeled as spam.
Communities of mails may be as precise as "mails sent from mail addresses starting with abc and containing the word aging at least two times in bold capitals" (such descriptions would be implicit, as the communities are identified by the algorithm) or as general as just "porn-spam". The former kind of definition may be appropriate in cases where the user receives spam from just two or three mailing lists.

Another factor addressed by such an approach is that of making the spam filter as user-centric as possible. This approach is most appropriately implemented on the mail client, and in whatever manner it is implemented, separate lists and tables have to be kept for each user.

Yet another advantage of this approach is its flexibility. Nuisance mails (constant requests for help from a distant friend) can also be identified, as a system implementing this approach does not come hard coded with a set of rules such as "a mail having the word sex would be spam 99% of the time". Thus a person who would like to receive porn-spam but not other spam can also be accommodated. The system need not have any preconceptions; it can learn from the user over time. This property allows it to evolve and to follow the changing nature of spam.

5.2 The approach and how it works

The general working model of an application using this approach (and thus the approach itself) is presented below. The different phases and the working of the algorithm are presented under separate subheadings, with possible implementations listed as well. The algorithms used in our test implementation are described in detail in the appropriate places.

5.2.1 The phase of ignorance

Upon installation of the application, the system is ignorant of what spam is. The user has to mark the spam mails among the incoming ones and thus tell the system, "hey, this is spam". The system records the entire message. This continues until about 50 messages are accumulated by the system. Even during this time, it can automatically filter and accumulate mails using trivial heuristics such as "this is spam, as the user had earlier marked a mail from this address as spam".

5.2.2 The message similarity computation

One of the main algorithms to be used here is the computation of similarity between two messages. It may use heuristics such as "add one to the similarity score if both have at least two common names in their To addresses". Another efficient heuristic would be to represent each message as a vector of the words occurring in it and to take the dot product of the vectors of the two messages. Here we can also include heuristics such as the similarity between the images in the messages, which is not possible in approaches such as statistical filtering. Spam mail is becoming increasingly image-centric; a lot of the spam that the author receives has only a salutation and a remove link apart from the image(s).
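The word-vector heuristic mentioned above can be illustrated with a short sketch. It assumes a simple lower-cased alphanumeric tokenization and raw word counts as vector elements; neither choice is specified in the text, and the function names `word_vector` and `dot_similarity` are hypothetical.

```python
import re
from collections import Counter


def word_vector(message: str) -> Counter:
    """Represent a message as a sparse vector of word counts."""
    return Counter(re.findall(r"[a-z0-9]+", message.lower()))


def dot_similarity(m1: str, m2: str) -> int:
    """Dot product of the word-count vectors of two messages."""
    v1, v2 = word_vector(m1), word_vector(m2)
    # Words absent from v2 contribute 0, since Counter returns 0 for missing keys.
    return sum(count * v2[word] for word, count in v1.items())


print(dot_similarity("cheap pills cheap", "cheap pills online"))   # shared: cheap, pills
print(dot_similarity("meeting agenda", "cheap pills online"))      # nothing shared
```

Because raw counts are used, a word repeated many times in both messages dominates the score; whether that is desirable for spam (where repetition is common) is a design choice that would need experimentation.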
Framing such mail similarity heuristics would require a lot of research into the general nature of spam mail. In our test implementation, we used a naïve similarity computation algorithm, which can be described as follows.

Algorithm Similarity-Score(Messages M1 and M2)
    Remove the repeated words in both messages to get messages N1 and N2;
    The number of words common to N1 and N2 is calculated and output as the similarity score;

5.2.3 The identification of communities

After accumulating close to 50 spam messages on the advice of the user, the system can proceed to identify communities of similar messages. It can build a graph with the messages as nodes, each undirected edge connecting two messages being labeled by the similarity weight between them. The system should now find strongly connected communities of mails based on some threshold. This computation of densely connected communities is an NP-complete problem, so suitable approximation algorithms can be used for it. The following algorithm was used in our test implementation.

Algorithm Community-Identification()
    Build a graph with the 50-odd messages as nodes and undirected edges between them labeled by the similarity scores of the messages in question;
    Prune all edges which have a label value below a threshold T, resulting possibly in a disconnected graph;
    The connected components of the graph are enumerated as a set of communities N;
    For each pair of communities in N
        If each similarity score between a message in one community and a message in the other community bears a label not less than a threshold T1, merge the communities;
    The merger in the previous step results in a set of communities N1;
    Output N1 as the set of communities of messages;

The initial threshold T may be set to a higher value than T1. This is because we do not want any unrelated messages to be falsely included in a community in N. Thus we expect N to consist of highly coherent communities. But our urge to avoid false communities may well have split logically coherent communities (communities that are coherent at the level of detail we expect). The second step, the refinement of N to build the set N1, is a step towards merging such communities. We merge communities that are coherent enough that each message in one community bears at least some relationship or similarity (enforced by T1) to each message in the other community. This step may be avoided if T is set to a low value, but the risk involved in such an approach is obvious.

5.2.4 Community Cohesion Scores and Signatures

We have to compute a score for each community which indicates the cohesion within the community. (Such a score could also be used in the identification of communities in Section 5.2.3.) It can be computed on the basis of some heuristic, such as the sum of the weights of all edges within the community divided by the number of nodes in the community. Evidently, the aim should be to give high scores to communities of high cohesion.

We can also assign signatures to communities. A signature may consist of a set of words which occur very frequently in the community. The signature could also be a set of messages from the community; that set of messages should be as varied as possible. If a community consists of 3 sets of 10 identical messages each, the signature should contain at least one representative from each set. The emphasis is that the signature set should not be computed as the densest connected subset (connected with strong edges) of the community, but perhaps as one among the sparsely connected subsets (connected with weak edges) in the community.

Although computing community cohesion scores would definitely improve the precision, we chose not to implement it in our test implementation, given that our aim was to demonstrate the feasibility of the approach rather than to build a workable prototype.
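The Similarity-Score and Community-Identification algorithms of Section 5.2.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' test implementation: words are taken to be lower-cased alphanumeric tokens, the merge step is repeated until no further pair qualifies (the text describes a pass over pairs without saying whether it repeats), and the names `similarity_score` and `identify_communities` are hypothetical.

```python
import re
from itertools import combinations


def similarity_score(m1: str, m2: str) -> int:
    """Number of distinct words common to both messages."""
    n1 = set(re.findall(r"[a-z0-9]+", m1.lower()))
    n2 = set(re.findall(r"[a-z0-9]+", m2.lower()))
    return len(n1 & n2)


def identify_communities(messages, t, t1):
    """Return communities as sets of message indices."""
    n = len(messages)
    score = {(i, j): similarity_score(messages[i], messages[j])
             for i, j in combinations(range(n), 2)}

    def s(i, j):
        return score[(min(i, j), max(i, j))]

    # Keep only edges whose label is not below T (pruning step).
    neighbours = {i: set() for i in range(n)}
    for (i, j), weight in score.items():
        if weight >= t:
            neighbours[i].add(j)
            neighbours[j].add(i)

    # Enumerate connected components of the pruned graph (set N).
    seen, communities = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        communities.append(component)

    # Merge pairs in which every cross pair of messages scores at least T1 (set N1).
    merged = True
    while merged:
        merged = False
        for a, b in combinations(communities, 2):
            if all(s(i, j) >= t1 for i in a for j in b):
                communities.remove(a)
                communities.remove(b)
                communities.append(a | b)
                merged = True
                break
    return communities


# Hypothetical messages and thresholds, for demonstration only.
mails = [
    "cheap pills online pharmacy discount",
    "cheap pills online pharmacy offer",
    "pills pharmacy weight loss",
    "mortgage rates low refinance today",
    "mortgage rates low refinance now",
]
found = identify_communities(mails, t=3, t1=2)
print(sorted(sorted(c) for c in found))
```

With these example thresholds, the third message is first left as a singleton by the stricter threshold T and is then merged into the pharmacy community by the looser threshold T1, which is the effect the two-threshold design is meant to achieve.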
Our test implementation does refine the communities obtained from community identification, by eliminating copies of nearly identical messages. The algorithm used is outlined below.

Algorithm Refine(Set N1)
    while(1)
        For each message pair, P and Q
            Eliminate duplicate words in each message to form P1 and Q1, the sets of words in each message;
            If ((the cardinality of P1 intersection Q1) > (the cardinality of the symmetric difference between P1 and Q1))
                Choose P or Q arbitrarily and eliminate it from the community;
        If no message could be eliminated in a complete pass, break out of the loop;
    Return the newly formed set of messages N2, whose cardinality is less than or equal to that of N1;

Copies of nearly identical messages are eliminated as they would not be of much use in the actual spam filtering process. Many users consistently receive messages that are nearly identical, the sole difference being the random string that occurs at the beginning and/or end of most spam messages. Such messages can be readily identified when they arrive during the actual filtering process, as they would have a very high similarity score with a message (or messages) in a community. This elimination of nearly identical messages saves space in the spam filter database and reduces the amount of computation to be done.

5.2.5 Spam Identification

Each incoming message is tested against the signatures of each spam community, and if it is found worthy of being included in a community, it is tested whether its inclusion would enhance the cohesion within the community. It can be added to the community and marked as spam if it either increases the cohesion of the community or has a very high similarity score with one or more of the community members. If not, it is marked as legitimate and passed to the user. Our test implementation used the following algorithm for the actual spam filtering process.

Algorithm Test(Message K)
    For each community C in N2
        worthy-of-inclusion score = the mean similarity score between K and a message in C;
    If (the maximum worthy-of-inclusion score obtained exceeds a threshold T2)
        include K in the community with which the maximum worthy-of-inclusion score was obtained and flag K as spam;
    else
        Flag K as legitimate;
    If (K was included in a community)
        perform the refine algorithm on N2 (or more specifically, on the community in which K was included) and assign the new set of communities to N2;
        perform the merge algorithm on N2 and assign the new set of communities to N2;

The merge algorithm used is the same as the merging procedure in the community identification algorithm. However, we reproduce the algorithm here once again.

Algorithm Merge(N2)
    For each pair of communities in N2
        If each similarity score between a message in one community and a message in the other community bears a label not less than a threshold T1, merge the communities;
    The merger in the previous step results in a set of communities N3;
    Output N3;

5.2.6 Maintenance

If the user opines that a message delivered to him as legitimate was actually spam (a false negative), it can be added to the community to which it best fits or kept as a single-member community. Periodically, if there is a proliferation of small communities, those can be gathered and processed just like the initial set of 50-odd spam messages to identify larger communities. If the user opines that a message marked as spam was legitimate (the dreaded false positive), the system can inspect the communities to find messages of very high similarity with the one in question, and they can be deleted from the database of spam messages. Further, it can show the user the community in which the false positive was placed and ask whether he feels that the community is actually of interest to him. As more and more messages are identified as spam, they are added to the database. Periodically we have to clean the database.
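The Refine and Test algorithms described above can be sketched together as follows. This is a hedged illustration, not the authors' test implementation: it assumes lower-cased alphanumeric tokens as words, always drops the later message of a near-duplicate pair (the text allows an arbitrary choice), assumes every community is non-empty, and omits the Merge step that follows Refine. The names `words`, `refine` and `classify` are hypothetical.

```python
import re
from statistics import mean


def words(message: str) -> frozenset:
    """Set of distinct words in a message (assumed tokenization)."""
    return frozenset(re.findall(r"[a-z0-9]+", message.lower()))


def refine(community):
    """Drop near-duplicates: messages sharing more words than they differ by."""
    kept = list(community)
    changed = True
    while changed:
        changed = False
        for i in range(len(kept)):
            for j in range(i + 1, len(kept)):
                p, q = words(kept[i]), words(kept[j])
                if len(p & q) > len(p ^ q):
                    del kept[j]  # arbitrary choice: keep the earlier message
                    changed = True
                    break
            if changed:
                break
    return kept


def classify(message, communities, t2):
    """Return True (spam) and file the message if its best mean similarity exceeds T2."""
    if not communities:
        return False
    k = words(message)
    scores = [mean(len(k & words(m)) for m in c) for c in communities]
    best = max(range(len(communities)), key=scores.__getitem__)
    if scores[best] > t2:
        communities[best].append(message)
        communities[best][:] = refine(communities[best])
        return True
    return False


# Hypothetical communities and threshold, for demonstration only.
spam_db = [
    ["cheap pills online pharmacy", "pills pharmacy discount offer today"],
    ["mortgage rates low refinance"],
]
print(classify("cheap pills pharmacy discount", spam_db, t2=2))
print(classify("meeting agenda for monday", spam_db, t2=2))
```

In this example the first message is flagged as spam but is then removed again by `refine`, because it is a near-duplicate of a stored message; the database therefore does not grow, which matches the stated purpose of the refinement step.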
This cleaning can be done by considering communities, finding very dense subsets within them, and deleting some of the messages which are connected to the communities by dense edges. This is extremely useful in purging identical messages from the set (which obviously is not dangerous). Periodically, the system can do a warm reboot, by dissolving all communities and identifying them afresh from the entire set of messages, using the techniques used to process the initial set of 50-odd messages. A cold reboot would, obviously, be to empty the database.

Our test implementation worked in an environment with no interaction from the user. It was supplied with a set of 50 known spam messages and then with a set of messages to be identified as either spam or legitimate. The proliferation of messages in spam communities was avoided by the periodic application of the merge and refine algorithms presented in the previous section. But in a workable prototype, more specialized algorithms for handling user input may have to be implemented.

5.2.7 Adaptation

Adaptability to the changing nature of spam has to be taken care of. The system can do so by identifying and deleting communities that have had no admissions for a long time. Perhaps the user has been taken off the list, or the nature of the spam sent by the spammer has changed. In either case, holding the community in the database would be of no use. Further, the user could be given options to manually clean up or delete communities. Although handling adaptation would not be too difficult, we did not handle it in our implementation, as the tests were performed on spam messages that arrived within a short duration, during which significant changes in the nature of spam would not have occurred.

5.3 Advantages

The system starts with an empty memory and learns what spam is from the user. The user is free to point to some nuisance mail (such as mail from a former lover who is no longer of interest) and mark it as spam. If the heuristics used for similarity computation give high weightage to the sender's address (or perhaps even content), the user stands a good chance of not being troubled by the nuisance mail in the future.

The initially empty memory of the system provides some more advantages. A person who enjoys some special spam category, e.g., porn-spam, can continue to receive it by not marking such mails as spam during the ignorance phase. The system provides little help in the phase of ignorance, but more importantly it does not get in the way. Further, even after the ignorance phase, he can view the communities and mark one that he is interested in as non-spam.

In cases where spam comes to a user from only a few spammers, each community might map precisely to a single spammer. In such cases, small changes made by the spammer in his mails would not cause them to slip through as false negatives, thus providing increased precision over conventional statistical filters. Further, as the system is implemented per user, the implicit rules may be more user-specific, thus providing more flexibility to the user.

5.4 Disadvantages

The user is provided with little or no support during the ignorance phase. The mails themselves are stored in the database, thus increasing storage requirements. Bandwidth wastage is not prevented. Initially, the user has to mark the spam, so there is no indication of the presence of a filter, at least in the early stages. The system might take a lot of time to start filtering mails very effectively.

6. Experiments and results

The main aim of the experiment was to test the feasibility of applying the concept of community clustering of spam mails to spam filtering.
The implementation was tested in a non-interactive environment, with no user input possible during the process. The testing was done on 2 sets of 100 mails each, referred to hereafter as Set 1 and Set 2. In each set, 50 of the mails were marked as spam and used as the initial set; the rest of the messages were a collection of both spam and legitimate messages, henceforth referred to as the test set. The values of T and T1 were set to 12 and 6 respectively (Section 5.2.3). The value of T2 was set to 13 (Section 5.2.5). The isolated nodes were considered as singleton communities in N. Singleton communities which could not be merged with any other community were discarded in N1. The rest of the algorithms are not parameterized and were included as such. Each message apart from the initial set of 50 messages was subjected to the algorithm Test and the results were logged. The values in the results table below were obtained from the log file. The number of communities does not change in the course of the algorithm, as no user input is sought in real time. Thus this test just demonstrates the feasibility of the approach.

Table 1. Test results for Set 1 and Set 2

                                            Set 1    Set 2
Number of communities in N1                    10        9
Total messages in N1 initially                 42       39
Total messages in N1 after Refine              37       35
Proportion of initial set clustered           74%      70%
Number of spam messages in test set            35       40
Number of legitimate messages in test set      15       10
Spam precision                              84.0%    89.3%
Legitimate precision                        44.0%    31.8%
Spam recall                                 60.0%    62.5%
Legitimate recall                           73.3%    70.0%
False positives                                 4        3
False negatives                                10       15

We consider the spam precision results very good, considering the fact that no hard-coded rules were used.
The very low legitimate precision is in fact of not too much concern, as false negatives would not have disastrous consequences. The legitimate recall is a bit lower than expected, and the number of false positives is a cause for concern and calls for fine-tuning of the algorithm to reduce false positives. The spam precision testifies that the approach is feasible in the real world. Further, in the real world, the database could well be tuned based on user input to provide better results. Further, these experiments considered only the texts of the messages; image similarity measures and subject line similarity computations may well enhance the performance.

The next experiment was conducted to test whether the inclusion of an unrelated message into a community would decrease its cohesion. The test was conducted on a community of 5 messages taken from community 1 in the above table. A matrix was formed in which element (i,j) holds a measure of similarity between the i-th and j-th messages. Obviously, the matrix is symmetric and the values of the principal diagonal elements are meaningless. The measure of similarity used was the number of common words in the messages, which, although crude, gives a rough idea of the situation. The matrix formed by the community of 5 messages is given below.

[Table 2. Similarity matrix of the community; matrix entries not recoverable]

The row sums (which are equal to the column sums), expressed as a tuple, would be a justifiable estimate of the cohesion within the community. The tuple for this matrix is <177, 156, 112, 155, 146>. Then the first message was replaced by an unrelated message and the similarity matrix changed to:

[Table 3. Similarity matrix after replacement of a message by a non-community message; matrix entries not recoverable]

The cohesion indicator tuple has evidently changed to <44, 122, 88, 111, 115>. This has much weaker values, with the first element of the tuple having a very low value, indicative of the fact that the first message does not deserve to be a member of the community.
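The cohesion tuple used in this experiment, and the cohesion score heuristic of Section 5.2.4, can be computed from a similarity matrix as sketched below. The matrices here are small hypothetical examples invented for illustration (they are not the matrices of Tables 2 and 3), and the function names are assumptions.

```python
def cohesion_tuple(matrix):
    """Row sums of a symmetric similarity matrix, ignoring the diagonal."""
    return tuple(
        sum(value for j, value in enumerate(row) if j != i)
        for i, row in enumerate(matrix)
    )


def cohesion_score(matrix):
    """Sum of the weights of all edges divided by the number of nodes."""
    n = len(matrix)
    edge_total = sum(matrix[i][j] for i in range(n) for j in range(i + 1, n))
    return edge_total / n


def weakest_member(matrix):
    """Index of the message with the lowest row sum."""
    sums = cohesion_tuple(matrix)
    return min(range(len(sums)), key=sums.__getitem__)


# Hypothetical 3-message community (diagonal entries are ignored).
community = [
    [0, 40, 35],
    [40, 0, 30],
    [35, 30, 0],
]
# The same community with the first message replaced by an unrelated one.
with_outsider = [
    [0, 5, 4],
    [5, 0, 30],
    [4, 30, 0],
]
print(cohesion_tuple(community), cohesion_score(community))
print(cohesion_tuple(with_outsider), weakest_member(with_outsider))
```

As in the experiment, the row sum belonging to the unrelated message falls far below the others, and the sums of the remaining members drop as well because each loses one strong edge.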
Such experiments were performed on a number of communities, and each of them demonstrated such sharp deviations due to the inclusion of unrelated messages.

7. Conclusions and future work

As indicated by the experiments, it can be concluded that community-based detection of spam can prove to be a useful technique. It can be implemented as a mail client add-on, whereby the complex matching algorithms can be run on the client machine (implementing such computationally intensive algorithms on the server might not be attractive). The experiments above indicate that the approach explained in Section 5.2 would perhaps be feasible. Future work may be directed towards developing better algorithms for spam message similarity computation, for selecting victims to be purged to limit database size, for enabling the system to adapt itself to the changing nature of spam mails, and for approximating the identification of communities from a corpus.

This approach treats spam and legitimate mails asymmetrically, in that it clusters spam mails into communities but does not deal with legitimate mails in any sophisticated manner. Studies have to be performed as to whether legitimate mails can be dealt with in the same manner (by building communities). The feasibility of such an approach depends on the clusterability of legitimate mails which, even if it does exist, is not obvious.

8. References

[1] Surf-Control's Anti-Spam Prevalence Study 2002, URL: Spam_Study_v2.pdf
[2] A Bayesian approach to filtering junk e-mail, Sahami, Dumais, Heckerman & Horvitz, Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin
[3] A Plan for Spam, Paul Graham, August 2002
[4] An evaluation of naïve Bayesian anti-spam filtering, Androutsopoulos et al., Proc. of the workshop on Machine Learning in the New Information Age, 2000
[5] Better Bayesian Filtering, Paul Graham, January 2003
[6] TiMBL: Tilburg Memory Based Learner version 4.0 Reference Guide, Daelemans et al. (2001)
[7] Learning to filter spam e-mail: A comparison of a naïve Bayesian and a memory-based approach, Androutsopoulos et al., in Workshop on Machine Learning & Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-2000)
[8] A learning content-based spam filter, Tim Hemel
[9] Junk Detection using neural networks, Michael Vinther, URL: n.pdf
[10] Curbing junk mail via secure classification, Bleichenbacher et al., Financial Cryptography, 1998, pp