An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages
Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos and Constantine D. Spyropoulos
Software and Knowledge Engineering Laboratory
Institute of Informatics and Telecommunications
National Centre for Scientific Research "Demokritos"
153 10 Ag. Paraskevi, Athens, Greece
e-mail: {ionandr, jkoutsi, kostel,

Abstract

The growing problem of unsolicited bulk e-mail, also known as "spam", has generated a need for reliable anti-spam filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in "encrypted" form, contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures, investigating at the same time the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a filter that uses keyword patterns, and which is part of a widely used e-mail reader.

Keywords: filtering/routing; text categorization; machine learning and IR; evaluation (general); test collections

1. Introduction

In recent years, the increasing popularity and low cost of e-mail have attracted the attention of direct marketers. Using readily available bulk-mailing software and large lists of e-mail addresses, typically harvested from web pages and newsgroup archives, it is now possible to send blindly unsolicited messages to thousands of recipients at essentially no cost.
As a result, it is becoming increasingly common for users to receive daily large quantities of unsolicited bulk e-mail, known as spam, advertising anything from vacations to get-rich schemes. The term Unsolicited Commercial E-mail (UCE) is also used in the literature. We use "spam" with a broader meaning, one that does not exclude unsolicited bulk e-mail sent for non-commercial purposes (e.g. to communicate a message from a sectarian group).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR 2000, Athens, Greece. © 2000 ACM.

Spam messages are annoying to most users, as they waste their time and clutter their mailboxes. They also cost money to users with dial-up connections, waste bandwidth, and may expose minors to unsuitable content (e.g. when advertising pornographic sites). A 1997 study [3] found that spam messages constituted approximately 10% of the incoming messages to a corporate network. The situation seems to be worsening, and without appropriate counter-measures, spam messages could eventually undermine the usability of e-mail. Anti-spam legal measures are gradually being adopted, but they have had a very limited effect so far. Of more direct value are anti-spam filters, software tools that attempt to block spam messages automatically. Apart from blacklists of frequent spammers and lists of trusted users, which can be incorporated into any anti-spam strategy, these filters have so far relied mostly on manually constructed keyword patterns (e.g. blocking messages whose bodies contain "be over 21").
To be most effective and to avoid the risk of accidentally deleting non-spam messages, hereafter called legitimate messages, these patterns need to be tuned manually to the incoming e-mail of each user. Fine-tuning the patterns, however, requires time and expertise that are not always available. Even worse, the characteristics of spam messages (e.g. topics, frequent terms) change over time, requiring the keyword patterns to be updated periodically [3]. It is, therefore, desirable to develop anti-spam filters that will learn automatically how to block spam messages by processing previously received spam and legitimate messages.

Sahami et al. [23] recently proposed using a machine learning algorithm to construct a filter of this type. They trained a Naive Bayesian classifier [7], [20] on manually categorized legitimate and spam messages, reporting impressive performance on unseen messages. Although machine learning algorithms have been applied to several text categorization tasks (e.g. [1], [5], [17], [18], [19]), including applications where the goal is to thread e-mail messages [18], to classify e-mail into folders [2], or to identify interesting news articles ([14]; see also [11]), to the best of our knowledge Sahami et al.'s experiments constitute the only previous attempt to apply machine learning to anti-spam filtering.

It may come as a surprise that text categorization techniques can be effective in anti-spam filtering. Unlike most text categorization tasks, it is the act of blindly mass-mailing an unsolicited message that makes it spam, not its actual content:
any otherwise legitimate message becomes spam if blindly mass-mailed. Nevertheless, it seems that the language of current spam messages constitutes a distinctive genre, and that the topics of most current spam messages are rarely mentioned in legitimate messages, making it possible to train a text classifier successfully for anti-spam filtering.

Past experience in text categorization has proven the beneficial role of publicly available manually categorized document collections, like the Reuters corpus [16], that can be used as benchmarks to compare alternative techniques. For the purposes of anti-spam filtering, such benchmark collections would ideally consist of manually categorized legitimate and spam messages as received by real users. Making a collection of this sort publicly available in raw form, however, would violate the privacy of the recipients and senders of the legitimate messages. We show, however, that it is possible to "encrypt" the collection's messages in a way that respects privacy, while still leading to useful public collections.3

We test Sahami et al.'s approach on an encrypted collection of legitimate and spam messages, dubbed PU corpus, which we make publicly available, contributing towards standard benchmarks.4 Unlike previous experiments, we employ ten-fold cross-validation, which makes our results less prone to random variation. Our investigation also examines the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists, issues that were not explored in Sahami et al.'s experiments. Legitimate messages that can be easily identified using lists of trusted users have not been included in PU, leading to a "tougher" corpus than the one used by Sahami et al.; nevertheless, our results confirm Sahami et al.'s high precision and recall figures.
We argue, however, that anti-spam filters must be evaluated using measures that incorporate a decision-theoretic notion of cost; and a cost-sensitive evaluation reveals that additional safety nets are necessary for anti-spam filtering to be viable in practice. Further evaluation shows that the Naive Bayesian filter is by far superior to a keyword-based anti-spam filter that is included in a widely used e-mail reader.

The remainder of this paper is organized as follows: section 2 introduces the Naive Bayesian classifier and Sahami et al.'s results; section 3 discusses how the PU corpus was assembled, and presents the results we obtained with the Naive Bayesian classifier and the keyword-based filter; section 4 introduces cost-sensitive evaluation measures; and section 5 concludes.

2. The Naive Bayesian classifier

The Naive Bayesian classifier [7], [20] assumes that each document (in our case, each message) is represented by a vector \vec{x} = (x_1, x_2, x_3, \dots, x_n), where x_1, \dots, x_n are the values of the attributes X_1, \dots, X_n, much as in the vector space model [24]. Following Sahami et al., we use binary attributes, i.e. X_i = 1 if the message has the property represented by X_i, and X_i = 0 otherwise. Our experiments were conducted using only word attributes, i.e. each attribute shows whether or not a particular word (e.g. "adult") is present in the message. It is possible, however, to use additional attributes corresponding to phrases (e.g. "be over 21") or non-textual properties (e.g. whether or not the message contains attachments); we return to this point below.

Following Sahami et al., we use mutual information (MI) to select among all possible attributes (in our case, all possible word attributes). We compute the mutual information MI(X; C) of each candidate attribute X with the category-denoting variable C as:

MI(X; C) = \sum_{x \in \{0,1\},\ c \in \{spam, legitimate\}} P(X = x, C = c) \cdot \log \frac{P(X = x, C = c)}{P(X = x) \cdot P(C = c)}

We then select the attributes with the highest mutual information values. The probabilities P(X, C), P(X), and P(C) are estimated from a training corpus as frequency ratios.5

From Bayes' theorem and the theorem of total probability, it follows that the probability that a document with vector \vec{x} = (x_1, \dots, x_n) belongs to category c is:

P(C = c \mid \vec{X} = \vec{x}) = \frac{P(C = c) \cdot P(\vec{X} = \vec{x} \mid C = c)}{\sum_{k \in \{spam, legitimate\}} P(C = k) \cdot P(\vec{X} = \vec{x} \mid C = k)}

In practice, the probabilities P(\vec{X} \mid C) are impossible to estimate without simplifying assumptions, because the possible values of \vec{x} are too many and there are also data sparseness problems. The Naive Bayesian classifier assumes that X_1, \dots, X_n are conditionally independent given the category C, which allows us to compute P(C = c \mid \vec{X} = \vec{x}) as:

P(C = c \mid \vec{X} = \vec{x}) = \frac{P(C = c) \cdot \prod_{i=1}^{n} P(X_i = x_i \mid C = c)}{\sum_{k \in \{spam, legitimate\}} P(C = k) \cdot \prod_{i=1}^{n} P(X_i = x_i \mid C = k)}

P(X_i \mid C) and P(C) are easy to estimate from the frequencies of the training corpus. A large number of empirical studies have found the Naive Bayesian classifier to be surprisingly effective [6], [15], despite the fact that the assumption that X_1, \dots, X_n are conditionally independent is usually (including our word-attributes case) overly simplistic.6

3 Another approach is to use a collection consisting of spam messages, and legitimate messages extracted from publicly available archives of mailing lists. We explored this direction in previous work to be reported elsewhere.
4 PU is available from the publications section of http://
5 Consult [20] for more elaborate estimates that we hope to investigate in future work.
6 Consult [10] for forms of Bayesian classifiers with less restrictive independence assumptions.
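The MI-based attribute selection described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' implementation: the function names, the representation of messages as sets of words, and the toy corpus in the usage note are our own assumptions; probabilities are estimated as frequency ratios, as in the text.

```python
import math

def mutual_information(messages, labels, word):
    """MI(X;C) for a binary word-presence attribute X and the
    category-denoting variable C, with all probabilities estimated
    from the training corpus as frequency ratios."""
    n = len(messages)
    mi = 0.0
    for x in (0, 1):
        for c in ("spam", "legitimate"):
            # joint frequency of (X = x, C = c)
            joint = sum(1 for m, l in zip(messages, labels)
                        if l == c and (word in m) == bool(x))
            n_x = sum(1 for m in messages if (word in m) == bool(x))
            n_c = labels.count(c)
            if joint == 0 or n_x == 0 or n_c == 0:
                continue  # zero-probability terms contribute nothing
            p_joint, p_x, p_c = joint / n, n_x / n, n_c / n
            mi += p_joint * math.log(p_joint / (p_x * p_c))
    return mi

def select_attributes(messages, labels, k):
    """Retain the k candidate word-attributes with the highest MI."""
    vocabulary = {w for m in messages for w in m}
    return sorted(vocabulary,
                  key=lambda w: mutual_information(messages, labels, w),
                  reverse=True)[:k]
```

On a toy corpus where "cash" occurs only in spam messages while "the" occurs in every message, the selection keeps "cash" and discards "the", whose MI is zero.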
Table 1: Results of Sahami et al. (500 attributes, t = 0.999, λ = 999). Columns: attributes used (words only; words + phrases; words + phrases + non-textual, on two collections); total messages; testing messages; spam messages (%); spam precision (%); spam recall (%).

In anti-spam filtering, mistakenly blocking a legitimate message (classifying a legitimate message as spam) is generally a more severe error than letting a spam message pass the filter (classifying a spam message as legitimate). We use L→S (legitimate to spam) and S→L (spam to legitimate) to denote the two error types, respectively, and, invoking a decision-theoretic notion of cost, we assume that L→S is λ times more costly than S→L. We classify a message as spam if the following classification criterion is met:

\frac{P(C = spam \mid \vec{X} = \vec{x})}{P(C = legitimate \mid \vec{X} = \vec{x})} > \lambda

To the extent that the independence assumption holds and the probability estimates are accurate, a classifier based on this criterion achieves optimal results [7]. In our case, P(C = spam \mid \vec{X} = \vec{x}) = 1 - P(C = legitimate \mid \vec{X} = \vec{x}), and the criterion above is equivalent to:

P(C = spam \mid \vec{X} = \vec{x}) > t, \quad with \quad t = \frac{\lambda}{1 + \lambda}, \quad \lambda = \frac{t}{1 - t}

In the experiments of Sahami et al., the threshold t was set to 0.999, which corresponds to λ = 999. This means that mistakenly blocking a legitimate message was taken to be as bad as letting 999 spam messages pass the filter. When blocked messages are discarded without further processing, setting λ to such a high value is reasonable, because in that case most users would consider losing a legitimate message unacceptable. Configurations with additional safety nets are possible, however, and lower λ values would be reasonable in those cases. For example, rather than being deleted, a blocked message could be returned to the sender, along with an automatically inserted apology paragraph.
The extra paragraph would explain that the message was blocked by an anti-spam filter, and it would ask the sender to forward the message to a different, private, unfiltered address of the recipient (see also [12]). The private address would never be advertised (e.g. on web pages or newsgroups), making it unlikely to receive spam mail directly. The apology paragraph could also include a frequently changing riddle (e.g. "Include in the subject the capital of France.") to ensure that spam messages are not forwarded automatically to the private address by robots that scan returned messages for new addresses. Messages sent to the private address without the correct riddle answer would be deleted automatically. (Spammers cannot afford the time to answer thousands of riddles by hand.)

In the scenario of the previous paragraph, λ = 9 (t = 0.9) seems more reasonable: blocking a legitimate message is penalized mildly more than letting a spam message pass, to account for the fact that recovering from a blocked legitimate message requires overall more work (counting the sender's extra work to repost it) than recovering from a spam message that passed the filter (deleting it manually). If the recipient does not care about the extra work imposed on the sender, then even λ = 1 (t = 0.5) can be acceptable.

Table 1 shows the results that Sahami et al. reported, using λ = 999 (t = 0.999). (We omit an experiment on detecting spam subcategories, which did not show promising results.) Assuming that n_{L→S} and n_{S→L} are the numbers of L→S and S→L errors, and that n_{L→L} and n_{S→S} count the correctly treated legitimate and spam messages respectively, spam recall (SR) and spam precision (SP) are defined as follows:

SR = \frac{n_{S \to S}}{n_{S \to S} + n_{S \to L}} \qquad SP = \frac{n_{S \to S}}{n_{S \to S} + n_{L \to S}}

The first three experiments of table 1 were performed with a collection of 1789 messages, consisting of 211 legitimate messages that users had saved and 1578 spam messages. In the first experiment only word attributes were used.
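The λ-dependent classification criterion above is easy to express in code. The sketch below is ours, not the authors' implementation; the toy prior and conditional probabilities are invented for illustration, and the posterior is computed in log-space for numerical stability.

```python
import math

CATEGORIES = ("spam", "legitimate")

def posterior_spam(x, prior, cond):
    """P(C = spam | X = x) under the Naive Bayes independence assumption.
    x: dict mapping attribute name -> 0/1;
    prior: dict mapping category -> P(C = category);
    cond: dict mapping (category, attribute) -> P(X_attr = 1 | C = category)."""
    log_score = {}
    for c in CATEGORIES:
        s = math.log(prior[c])
        for attr, value in x.items():
            p1 = cond[(c, attr)]
            s += math.log(p1 if value else 1.0 - p1)
        log_score[c] = s
    # normalise via the theorem of total probability
    m = max(log_score.values())
    z = sum(math.exp(s - m) for s in log_score.values())
    return math.exp(log_score["spam"] - m) / z

def classify(x, prior, cond, lam):
    """Block as spam iff P(spam | x) > t, with t = lam / (1 + lam)."""
    t = lam / (1.0 + lam)
    return "spam" if posterior_spam(x, prior, cond) > t else "legitimate"
```

With λ = 1 (t = 0.5) a message whose posterior is 0.9 is blocked, while with λ = 999 (t = 0.999) the same message passes, reflecting the cost asymmetry discussed above.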
In the second experiment, phrasal candidate attributes were added (e.g. corresponding to the phrases "be over 21", "only $"). In the third experiment, additional non-textual candidate attributes were included (e.g. whether or not the message contains attachments, or a high proportion of non-alphanumeric characters). The phrasal and non-textual candidate attributes were constructed by hand. All candidate attributes (word, phrasal, non-textual) were subjected to a single MI-based selection. The fourth experiment was similar to the third one, but it was performed with a different collection, containing all the messages a user had received over an entire year. The 500 attributes with the highest MI were used in all cases, but no rigorous testing was conducted to select this number of attributes.

While phrasal and non-textual attributes were found to improve results, they introduce a manual configuration phase, since one has to select manually the phrases and non-textual properties to be treated as candidate attributes. Our target was to explore fully automatic anti-spam filtering, and hence we limited ourselves to word attributes. We hope to incorporate in
future work automatic techniques, similar to those used in term extraction [9], to locate candidate phrasal attributes automatically.

3. Experiments with the PU corpus

We now turn to the corpus that we used, PU, and our experiments. The PU corpus consists of 1099 messages:

481 spam messages. These are all the spam messages that the first author received over a period of 22 months, excluding non-English messages (so far, these are very rare). Duplicates of spam messages sent on the same day were also not included (these are common, but they are very easy to detect with conventional technology).

618 legitimate messages. These were derived as follows. First, all the English legitimate messages that the first author had received and saved (excluding self-addressed messages) over a period of 36 months were collected (1182 messages). Many of the collected messages were from people with whom the first author has (or had) regular correspondence, mostly colleagues and friends, who are unlikely to send spam messages. To ensure that the anti-spam filter never misclassifies messages from those senders, the filter's users can be instructed to insert into their address books the people they find they correspond with regularly (this is very easy in modern e-mail readers). Messages received from senders in a user's address book can then be classified blindly as legitimate, without applying the anti-spam filter to them. To emulate this mechanism, we deleted from the 1182 messages all but the earliest five messages of each sender (leaving 618). We assume that by the time the sixth legitimate message of a particular sender arrives, the sender will have been inserted into the address book, and the anti-spam filter will not have to examine messages from this sender any more.

For our experiments, we implemented the Naive Bayesian classifier on the GATE platform [4]. Our implementation includes an English lemmatizer that converts each word to its base form (e.g.
"earned" becomes "earn"), and a stop-list module that removes from each message the 100 most frequent words of the British National Corpus (BNC).7 These two modules can be enabled or disabled to measure their effect. Attachments and HTML tags were removed from all messages.

To respect privacy, in the publicly available version of PU, fields other than "Subject:" were removed, and each token (word, number, punctuation symbol, etc.) in the bodies or subjects of the messages was replaced by a unique number, the same number throughout all the messages. For example:

From: [email protected]
To: [email protected]
Subject: Get rich now! Click here to get rich! Try it now!

becomes a "Subject:"-only message in which every token has been replaced by its corresponding number. There are actually four "encrypted" versions of the publicly available PU corpus, one for each combination of enabled/disabled stop-list and lemmatizer. The correspondence between tokens and numbers is not released, making it very difficult to figure out what the messages say.8

This encryption scheme places some limitations on the resources that can be exploited when experimenting with PU. For example, thesauri or part-of-speech taggers cannot be used, since the exact words of the messages are unknown. PU, however, can still be used to experiment with any classification technique that relies only on frequency and co-occurrence statistics (the Naive Bayesian classifier and most of the machine learning techniques cited in section 1 fall into this category).

In all our experiments with the Naive Bayesian classifier, we used ten-fold cross-validation to reduce random variation.

7 GATE, including the lemmatizer we used, is available from the GATE web site; BNC frequencies are available from ftp://ftp.itri.bton.ac.uk/pub/bnc. We also experimented with stop-lists containing frequent words of particular parts of speech (e.g. verbs only), but results showed no significant difference.
That is, PU was partitioned randomly into ten parts, and each experiment was repeated ten times, each time reserving a different part for testing and using the remaining nine parts for training. Results were then averaged over the ten runs.

In our first series of experiments, we varied the number of retained attributes (the attributes with the highest MI) from 50 to 700 in steps of 50, for all four combinations of enabled/disabled lemmatizer and stop-list, and for three thresholds: t = 0.999 (λ = 999), t = 0.9 (λ = 9), and t = 0.5 (λ = 1). As discussed in section 2, these thresholds are taken to represent three usage scenarios of the anti-spam filter: deleting blocked messages immediately; asking the sender to re-post them to a private address and accounting for the sender's extra work; and asking the sender to re-post them to a private address ignoring the sender's extra work. Figures 1-3 show the spam recall and spam precision that the Naive Bayesian classifier achieved. There are different points for different numbers of retained attributes.

We also implemented a simpler filter that uses the (presumably manually constructed) keyword-based patterns of the anti-spam filter of Microsoft Outlook.9 These are 58 patterns, looking for particular keywords in the body or header fields of the messages (e.g. "Body contains ',000' AND Body contains '!' AND Body contains '$'"). The keyword-based filter was applied to an unencrypted version of the PU corpus, achieving spam precision 95.5% and spam recall 53.0%. (These results were obtained using a case-insensitive version of the patterns. The original case-sensitive patterns achieved spam precision 95.45% and spam recall 39.29%.)

8 Decryption experts may be able to recover chunks of the original text based on frequency and co-occurrence patterns. Given the uninteresting nature of the messages in PU, however, it is difficult to imagine why one would want to waste time on this exercise.
9 "Microsoft Outlook" is a trademark of Microsoft Corporation. Outlook's on-line documentation points to a file that contains the patterns of its anti-spam filter.
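The token-to-number "encryption" applied to the public version of PU amounts to a consistent renaming of the vocabulary. A minimal sketch, with a function name of our own choosing, could look like this; the key property is that frequency and co-occurrence statistics survive, which is exactly what classifiers like the Naive Bayesian one need:

```python
def encrypt_corpus(messages):
    """Replace each distinct token by a unique number, using the same
    number for that token throughout all messages, so that frequency
    and co-occurrence statistics survive while the text itself does not."""
    ids = {}

    def token_id(token):
        if token not in ids:
            ids[token] = len(ids) + 1  # first unseen token gets 1, then 2, ...
        return ids[token]

    return [[token_id(t) for t in message] for message in messages]
```

For example, `encrypt_corpus([["get", "rich", "now"], ["rich", "now"]])` yields `[[1, 2, 3], [2, 3]]`: "rich" and "now" map to the same numbers in both messages.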
Figure 1: Spam precision and recall of the Naive Bayesian filter for t = 0.5 (λ = 1). Different points correspond to different numbers of retained attributes, for the four combinations of enabled/disabled lemmatizer and top-100 stop-list.

Figure 2: Spam precision and recall of the Naive Bayesian filter for t = 0.9 (λ = 9).

Figure 3: Spam precision and recall of the Naive Bayesian filter for t = 0.999 (λ = 999).

Figures 1-3 show a gradual increase of spam precision and a decrease of spam recall as λ increases. This is due to the fact that for higher λ values the classifier needs to be more "certain" before blocking a message as spam, which increases its precision but reduces the number of spam messages it blocks. Overall, the Naive Bayesian filter achieves impressive spam recall and precision combinations at all three thresholds, verifying similar findings by Sahami et al. The Naive Bayesian filter also outperforms the keyword-based filter at most numbers of retained attributes (most of the points in figures 1-3 are above 95.5% precision and 53.0% recall). The spread of the points in the figures, however, shows that the number of retained attributes has an important influence on spam recall and precision, and hence it cannot be chosen without careful experimentation. Without a single evaluation measure to be used instead of spam recall and precision, it is difficult to decide which number of attributes (or, equivalently, which recall-precision combination) leads to the best results, and whether or not the lemmatizer and stop-list have statistically significant effects. The single evaluation measure must also be sensitive to the fact that the L→S and S→L errors are not assigned equal costs. We introduce appropriate cost-sensitive evaluation measures in the next section.

4. Cost-sensitive evaluation

In classification tasks, performance is often measured in terms of accuracy (Acc) or error rate (Err = 1 - Acc). If N_L and N_S are the total numbers of legitimate and spam messages to be classified by the filter, respectively, and n_{L→L}, n_{L→S}, n_{S→L}, n_{S→S} are as in section 2, then:

Acc = \frac{n_{L \to L} + n_{S \to S}}{N_L + N_S} \qquad Err = \frac{n_{L \to S} + n_{S \to L}}{N_L + N_S}

Accuracy and error rate assign equal weights to the two error types (L→S and S→L). When formulating the classification criterion (section 2), however, we assumed that L→S is λ times more costly than S→L. To make accuracy and error rate sensitive to this cost difference, we treat each legitimate message as if it were λ messages. That is, when a legitimate message is blocked, this counts as λ errors; and when it passes the filter, this counts as λ successes. This leads to the following definitions of weighted accuracy (WAcc) and weighted error rate (WErr = 1 - WAcc):

WAcc = \frac{\lambda \cdot n_{L \to L} + n_{S \to S}}{\lambda \cdot N_L + N_S} \qquad WErr = \frac{\lambda \cdot n_{L \to S} + n_{S \to L}}{\lambda \cdot N_L + N_S}

The values of accuracy and error rate (or our weighted versions of them) are often misleadingly high. To get a clear picture of the performance of a classifier, it is common to compare its accuracy or error rate to those of a simplistic "baseline" approach. We take the case where no filter is present to be our baseline: legitimate messages are (correctly) never blocked, and spam messages (mistakenly) always pass the filter. The weighted accuracy and weighted error rate of the baseline are:

WAcc_b = \frac{\lambda \cdot N_L}{\lambda \cdot N_L + N_S} \qquad WErr_b = \frac{N_S}{\lambda \cdot N_L + N_S}

We also introduce the total cost ratio (TCR), defined below, which allows the performance of an anti-spam filter to be compared easily to that of the baseline:10

TCR = \frac{WErr_b}{WErr} = \frac{N_S}{\lambda \cdot n_{L \to S} + n_{S \to L}}

Greater TCR values indicate better performance. For TCR < 1, the baseline (not using the filter) is better. If cost is proportional to wasted time, an intuitive meaning for TCR is the following: it measures how much time is wasted to delete manually all spam messages when no filter is used (N_S), compared to the time wasted to delete manually any spam messages that passed the filter (n_{S→L}) plus the time needed to recover from mistakenly blocked legitimate messages (λ · n_{L→S}).

Table 2: Results on the PU corpus using the best number of attributes (1099 total messages, 43.8% spam, ten-fold cross-validation, number of attributes ranging from 50 to 700 by 50). For each λ ∈ {1, 9, 999}, the rows show the Naive Bayesian (NB) filter bare (a, e, i), with stop-list (b, f, j), with lemmatizer (c, g, k), and with lemmatizer and stop-list (d, h, l), followed by the keyword patterns (case-insensitive) and the baseline (no filter). Columns: λ; number of attributes; spam recall (%); spam precision (%); weighted accuracy (%); TCR.

Table 2 summarizes the results we obtained with the Naive Bayesian filter on PU for λ = 1, 9, 999, with the lemmatizer and stop-list enabled or disabled, and using the best numbers of attributes (those that lead to the highest TCR scores in each case). It also shows the corresponding results of the keyword-based filter and the baseline.11 Figures 4-6 show the TCR scores for the three λ values at different numbers of attributes.

10 The F-measure, often used in information retrieval and extraction to combine recall and precision (e.g. [22]), cannot be used here, because its weighting factor cannot be easily related to our notion of cost.
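The weighted error rate and TCR are straightforward to compute from the four counts. The following sketch (function names and example counts are our own) mirrors the formulas in this section:

```python
def weighted_error(n_ls, n_sl, n_l, n_s, lam):
    """WErr = (lam * n_{L->S} + n_{S->L}) / (lam * N_L + N_S)."""
    return (lam * n_ls + n_sl) / (lam * n_l + n_s)

def baseline_weighted_error(n_l, n_s, lam):
    """WErr_b: with no filter, all N_S spam messages pass."""
    return n_s / (lam * n_l + n_s)

def tcr(n_ls, n_sl, n_s, lam):
    """TCR = N_S / (lam * n_{L->S} + n_{S->L}); values > 1 beat the baseline."""
    return n_s / (lam * n_ls + n_sl)
```

For instance, with the PU class sizes (N_L = 618, N_S = 481), a filter that lets 50 spam messages through and blocks no legitimate ones gets TCR = 481/50 ≈ 9.6 at any λ, while a single L→S error at λ = 999 already pushes TCR below 1.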
In all the experiments with the Naive Bayesian (NB) filter, ten-fold cross-validation was used, and the average WAcc is reported. In ten-fold cross-validation experiments, TCR is computed as WErr_b divided by the average WErr.12

For λ = 1, the Naive Bayesian filter achieves TCR > 1 at all numbers of attributes (figure 4). The keyword-based filter also scores TCR > 1, but the Naive Bayesian filter is better up to 550 attributes. Increasing the number of attributes of the Naive Bayesian filter beyond a certain point degrades its performance, because attributes with low MI do not discriminate well between spam and legitimate messages. Paired single-tailed t-tests on WAcc confirm at p < 0.001 that configurations (a), (b), (c), and (d) of the Naive Bayesian filter (table 2) are all significantly better than the keyword-based filter and the baseline. Enabling the lemmatizer or the stop-list does not seem to have any significant effect. Indeed, paired single-tailed t-tests at p < 0.05 do not confirm the hypotheses of table 2 about (a), (b), (c), (d) themselves (e.g. that (b) is better than (c)); and at p < 0.1, the only one of these hypotheses that is confirmed is that (d) is better than (c).

For λ = 9, again both filters achieve constantly TCR > 1, with the Naive Bayesian filter being always better (figure 5). The stop-list has hardly any effect, while the lemmatizer seems to lead to a noticeable improvement. The improvement of the lemmatizer, however, is not significant from a statistical point of view: the t-tests confirm the hypothesis of table 2 that (g) is better than (e) and (f) (and also (h)) only at p < 0.1, not at p < 0.05. No other hypothesis of table 2 about (e), (f), (g), (h) themselves (e.g. that (h) is better than (e)) is confirmed at p < 0.1. The t-tests, however, confirm at p < 0.001 that all the configurations

11 The TCR scores with the case-sensitive keyword patterns were 1.60, 1.29, and 0.05, respectively.
12 It is important not to compute the overall TCR as the average of the TCRs of the individual folds (repetitions), as this effectively ignores folds with TCR << 1.
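The aggregation rule of footnote 12 can be made concrete with a small sketch of our own: the cross-validated TCR is the baseline WErr divided by the mean of the per-fold WErr values, not the mean of the per-fold TCRs, which would let folds with TCR >> 1 mask folds where the filter loses to the baseline.

```python
def overall_tcr(baseline_werr, fold_werrs):
    """Cross-validated TCR: WErr_b divided by the mean fold WErr."""
    return baseline_werr * len(fold_werrs) / sum(fold_werrs)
```

With a baseline WErr of 0.2 and fold WErrs of 0.1 and 0.4, the overall TCR is 0.2 / 0.25 = 0.8 (worse than the baseline), whereas averaging the per-fold TCRs (2.0 and 0.5) would misleadingly suggest 1.25.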
(e), (f), (g), and (h) of the Naive Bayesian filter are better than the keyword-based filter and the baseline.

Figure 4: TCR of the filters for t = 0.5 (λ = 1). The curves correspond to the NB filter with the four combinations of enabled/disabled lemmatizer and top-100 stop-list, and to the keyword patterns, at different numbers of retained attributes.

Figure 5: TCR of the filters for t = 0.9 (λ = 9).

Figure 6: TCR of the filters for t = 0.999 (λ = 999).

For λ = 999, the t-tests again confirm at p < 0.001 that configurations (i), (j), (k), and (l) of the Naive Bayesian filter (table 2) are all significantly better than the keyword-based filter, though none of the other hypotheses of table 2 is confirmed at p < 0.1. Both filters, however, score constantly TCR < 1 (figure 6; notice the different scale from figures 4 and 5). This is due to the very high penalty on the L→S errors, and the fact that neither of the two filters manages to eliminate these errors completely. We conclude that for λ = 999, not using either of the two filters is actually better, despite their high spam precision and recall.

A remaining question is how much training data the Naive Bayesian filter needs to be effective for λ = 1 and λ = 9. To answer this question, we conducted a second series of experiments with the Naive Bayesian filter, this time varying the size of the training corpus. The corpus was again divided into ten parts, and one part was reserved for testing at every run. From each of the remaining nine parts, only x% was used for training, with x ranging from 10 to 100 in steps of 10.
The experiments were performed with the best configurations of table 2, i.e. (b) and (g). Figure 7 shows the resulting TCR scores for λ = 1 and λ = 9. It also shows the TCR scores that the keyword-based filter achieved for λ = 1 and λ = 9 on the entire corpus. It can be seen that the Naive Bayesian (NB) filter achieves TCR > 1 and outperforms the keyword-based filter even with very small training corpora.

Figure 7: TCR for variable size of training corpus (100% is 989 messages).

5. Conclusions

Our cost-sensitive evaluation suggests that neither the Naive Bayesian nor the keyword-based filter performs well enough to be used when λ = 999, a λ value we employed to model the scenario where blocked messages are deleted without further processing. The performance of both filters, however, is satisfactory for λ = 1 and λ = 9. We used the latter values to model the case where a mechanism is available to re-post blocked messages to private addresses. The Naive Bayesian filter outperforms by far the keyword-based filter, even with very small training corpora. For λ = 9 and λ = 999, its performance showed signs of improvement when a lemmatizer was added, but the improvements were not statistically significant. Adding a stop-list does not seem to have any noticeable effect, presumably because the MI-based attribute selection rarely picks words as common as those of the stop-list.
We interpret these results as implying that, with additional safety nets, automatically trainable anti-spam filters are practically viable and can have a significant effect. We plan to implement alternative anti-spam filters, based on other machine learning algorithms, and to evaluate them on publicly available benchmark corpora. We expect anti-spam filtering to become an important member of an emerging family of junk-filtering tools for the Internet, which will include, among others, tools to remove unwanted advertisements [13], and to block hostile or pornographic material [8] [25].

Acknowledgments

The authors wish to thank George Paliouras for comments on an earlier version of this paper, and Stavros Perantonis for his help with the statistical significance tests. The authors are also grateful to Antonis Argyros and Lena Gaga, who contributed a collection of spam messages that was used in earlier experiments.

References

[1] C. Apte and F. Damerau. Automated Learning of Decision Rules for Text Categorization. ACM Transactions on Information Systems, 12(3):233-251, 1994.
[2] W.W. Cohen. Learning Rules that Classify E-mail. In Proc. of the AAAI Spring Symposium on Machine Learning in Information Access, Stanford, California, 1996.
[3] L.F. Cranor and B.A. LaMacchia. Spam! Communications of the ACM, 41(8):74-83, 1998.
[4] H. Cunningham, Y. Wilks and R. Gaizauskas. GATE - a General Architecture for Text Engineering. In Proc. of the 16th International Conference on Computational Linguistics, Copenhagen, Denmark, 1996.
[5] I. Dagan, Y. Karov and D. Roth. Mistake-Driven Learning in Text Categorization. In Proc. of the 2nd Conference on Empirical Methods in Natural Language Processing, Providence, Rhode Island, 1997.
[6] P. Domingos and M. Pazzani. Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. In Proc. of the 13th International Conference on Machine Learning, pp. 105-112, Bari, Italy, 1996.
[7] R.O. Duda and P.E. Hart. Bayes Decision Theory.
Chapter 2 in Pattern Classification and Scene Analysis, John Wiley, 1973.
[8] D. Forsyth. Finding Naked People. In Proc. of the 4th European Conference on Computer Vision, Cambridge, England, 1996.
[9] K.T. Frantzi. Automatic Recognition of Multi-Word Terms. PhD Thesis, Manchester Metropolitan University, England, 1998.
[10] N. Friedman, D. Geiger and M. Goldszmidt. Bayesian Network Classifiers. Machine Learning, 29(2/3):131-163, 1997.
[11] C.L. Green and P. Edwards. Using Machine Learning to Enhance Software Tools for Internet Information Management. In Proc. of the AAAI Workshop on Internet-Based Information Systems, Portland, Oregon, 1996.
[12] R.J. Hall. How to Avoid Unwanted Email. Communications of the ACM, 41(3):88-95, 1998.
[13] N. Kushmerick. Learning to Remove Internet Advertisements. In Proc. of the 3rd International Conference on Autonomous Agents, pp. 175-181, Seattle, Washington, 1999.
[14] K. Lang. Newsweeder: Learning to Filter Netnews. In Proc. of the 12th International Conference on Machine Learning, Stanford, California, 1995.
[15] P. Langley, W. Iba and K. Thompson. An Analysis of Bayesian Classifiers. In Proc. of the 10th National Conference on Artificial Intelligence, San Jose, California, 1992.
[16] D. Lewis. Feature Selection and Feature Extraction for Text Categorization. In Proc. of the DARPA Workshop on Speech and Natural Language, Harriman, New York, 1992.
[17] D. Lewis. Training Algorithms for Linear Text Classifiers. In Proc. of the 19th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Konstanz, Germany, 1996.
[18] D. Lewis and K.A. Knowles. Threading Electronic Mail: A Preliminary Study. Information Processing and Management, 33(2):209-217, 1997.
[19] H. Li and K. Yamanishi. Document Classification Using a Finite Mixture Model. In Proc. of the 35th Annual Meeting of the ACL and the 8th Conference of the EACL, Madrid, Spain, 1997.
[20] T.M. Mitchell. Bayesian Learning.
Chapter 6 in Machine Learning, McGraw-Hill, 1997.
[21] T.R. Payne and P. Edwards. Interface Agents that Learn: An Investigation of Learning Issues in a Mail Agent Interface. Applied Artificial Intelligence, 11(1):1-32, 1997.
[22] E. Riloff and W. Lehnert. Information Extraction as a Basis for High-Precision Text Classification. ACM Transactions on Information Systems, 12(3), 1994.
[23] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian Approach to Filtering Junk E-mail. In Learning for Text Categorization - Papers from the AAAI Workshop, Madison, Wisconsin. AAAI Technical Report WS-98-05, 1998.
[24] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[25] E. Spertus. Smokey: Automatic Recognition of Hostile Messages. In Proc. of the 14th National Conference on AI and the 9th Conference on Innovative Applications of AI, Providence, Rhode Island, 1997.
