Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering
Christian Siefkes (1), Fidelis Assis (2), Shalendra Chhabra (3), and William S. Yerazunis (4)

(1) Berlin-Brandenburg Graduate School in Distributed Information Systems, Database and Information Systems Group, Freie Universität Berlin, Berlin, Germany
(2) Empresa Brasileira de Telecomunicações (Embratel), Rio de Janeiro, RJ, Brazil
(3) Computer Science and Engineering, University of California, Riverside, California, USA
(4) Mitsubishi Electric Research Laboratories, Cambridge, MA, USA

The work of the first author is supported by the German Research Society (DFG grant no. GRK 316).

Abstract. Spam filtering is a text categorization task that has attracted significant attention due to the increasingly huge amounts of junk e-mail on the Internet. While current best-practice systems use Naive Bayes filtering and other probabilistic methods, we propose using a statistical, but non-probabilistic classifier based on the Winnow algorithm. The feature space considered by most current methods is either limited in expressivity or imposes a large computational cost. We introduce orthogonal sparse bigrams (OSB) as a feature combination technique that overcomes both these weaknesses. By combining Winnow and OSB with refined preprocessing and tokenization techniques we are able to reach an accuracy of 99.68% on a difficult test corpus, compared to 98.88% previously reported by the CRM114 classifier on the same test corpus.

Keywords: Classification, Text Classification, Spam Filtering, E-Mail, Incremental Learning, Online Learning, Feature Generation, Feature Representation, Winnow, Bigrams, Orthogonal Sparse Bigrams.

1 Introduction

Spam filtering can be viewed as a classic example of a text categorization task with a strong practical application. While keyword, fingerprint, whitelist/blacklist, and heuristic-based filters such as SpamAssassin [11] have been successfully deployed, these filters have experienced a decrease in accuracy as spammers introduce specific countermeasures. The current best-of-breed antispam filters are all probabilistic systems. Most of them are based on Naive Bayes as described by Graham [6] and implemented in SpamBayes [12]; others, such as
the CRM114 Discriminator, can be modeled by a Markov Random Field [15]. Other approaches such as Maximum Entropy Modeling [16] lack a property that is important for spam filtering: they are not incremental, i.e. they cannot adapt their classification model in a single pass over the data. As a statistical, but non-probabilistic alternative we examine the incremental Winnow algorithm. Our experiments show that Winnow reduces the error rate by 75% compared to Naive Bayes and by more than 50% compared to CRM114.

The feature space considered by most current methods is limited to individual tokens (unigrams) or bigrams. The sparse binary polynomial hashing (SBPH) technique (cf. Sec. 4.1) introduced by CRM114 is more expressive, but imposes a large runtime and memory overhead. We propose orthogonal sparse bigrams (OSB) as an alternative that retains the expressivity of SBPH but avoids most of the cost. Experimentally, OSB leads to equal or slightly better filtering than SBPH. We also analyze the preprocessing and tokenization steps and find that further improvements are possible there.

In the next section we present the Winnow algorithm. The following two sections are dedicated to feature generation and combination. In Section 5 we detail our experimental results. Finally, we discuss related methods and future work.

2 The Winnow Classification Algorithm

The Winnow algorithm introduced by [7] is a statistical, but not a probabilistic algorithm, i.e. it does not directly calculate probabilities for classes. Instead it calculates a score for each class. (There are ways to convert the scores calculated by Winnow into confidence estimates, but these are not discussed here since they are not of direct relevance for this paper.) Our variant of Winnow is suitable for both binary (two-class) and multi-class (three or more classes) classification. It keeps an n-dimensional weight vector w^c = (w_1^c, w_2^c, ..., w_n^c) for each class c, where w_i^c is the weight of the i-th feature. The algorithm returns 1 for a class iff the summed weights of all active features (called the score Ω) surpass a predefined threshold θ:

    Ω = Σ_{j=1}^{n_a} w_j^c > θ.

Otherwise (Ω ≤ θ) the algorithm returns 0. Here n_a ≤ n is the number of active (present) features in the instance to classify. The goal of the algorithm is to learn a linear separator over the feature space that returns 1 for the true class of each instance and 0 for all other classes on this instance.

The initial weight of each feature is 1.0. The weights of a class are updated whenever the value returned for this class is wrong. If 0 is returned instead of 1, the weights of all active features are increased by multiplying them with a promotion factor α, α > 1: w_j^c ← α · w_j^c. If 1 is returned instead of 0, the active weights are multiplied with a demotion factor β, 0 < β < 1: w_j^c ← β · w_j^c.
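To make the scheme concrete, the following is a minimal sketch of this classifier in Python; the class and parameter names are illustrative and this is not the actual implementation. The thick-threshold refinement of Section 2.1 is omitted for brevity.

```python
class Winnow:
    """Minimal sparse Winnow sketch: one weight vector per class,
    default weight 1.0, threshold theta = number of active features."""

    def __init__(self, classes, promotion=1.23, demotion=0.83):
        self.classes = list(classes)
        self.promotion = promotion                      # alpha > 1
        self.demotion = demotion                        # 0 < beta < 1
        self.weights = {c: {} for c in self.classes}    # sparse: only updated features are stored

    def score(self, cls, features):
        # Omega: summed weights of the active features (default weight 1.0)
        return sum(self.weights[cls].get(f, 1.0) for f in features)

    def classify(self, features):
        # winner-takes-all: predict the class with the highest score
        return max(self.classes, key=lambda c: self.score(c, features))

    def train(self, features, true_cls):
        theta = len(features)                           # theta = n_a (features is a set)
        for c in self.classes:
            above = self.score(c, features) > theta
            if c == true_cls and not above:             # returned 0 instead of 1: promote
                factor = self.promotion
            elif c != true_cls and above:               # returned 1 instead of 0: demote
                factor = self.demotion
            else:
                continue
            w = self.weights[c]
            for f in features:
                w[f] = w.get(f, 1.0) * factor           # feature allocated on first update
```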
In text classification, the number of features depends on the length of the text, so it varies enormously from instance to instance. Thus instead of using a fixed threshold we set the threshold to the number n_a of features that are active in the given instance: θ = n_a. Initial scores are thus equal to θ, since the initial weight of each feature is 1.0.

In multi-label classification, where an instance can belong to several classes at once, the algorithm would predict all classes whose score is higher than the threshold. But for the task at hand there is exactly one correct class for each instance, so we employ a winner-takes-all approach where the class with the highest score is predicted. This means that there are situations where the algorithm will be trained even though it did not make a mistake. This happens whenever the scores of both classes (resp. of two or more classes in tasks involving more than two classes) are on the same side of the threshold and the score of the true class is higher than the other one: in this case the prediction of Winnow will be correct, but it will still promote/demote the weights of the class that was on the wrong side of the threshold.

The complexity of processing an instance depends only on the number of active features n_a, not on the total number of features n_t. Similar to SNoW [1], a sparse architecture is used where features are allocated whenever the need to promote/demote them arises for the first time. In sparse Winnow, the number of instances required to learn a linear separator (if one exists) depends linearly on the number of relevant features n_r and only logarithmically on the number of active features, i.e. it scales with O(n_r log n_a) (cf. [8, Sec. 2]).

Winnow is a non-parametric approach; it does not assume a particular probabilistic model underlying the training data. Winnow is a linear separator in the Perceptron sense, but by providing a feature space that itself allows conjunction and disjunction, complex non-linear features may be recognized by the composite feature-extractor + Winnow system.

2.1 Thick Threshold

In our implementation of Winnow, we use a thick threshold for learning (cf. [4, Sec. 4.2]): training instances are re-trained even if the classification was correct, provided the determined score was near the threshold. Two additional thresholds θ+ and θ− with θ− < θ < θ+ are defined, and each instance whose score falls in the range [θ−, θ+] is considered a mistake. In this way, a large-margin classifier will be trained that is more robust when classifying borderline instances.

2.2 Feature Pruning

The feature combination methods discussed in Section 4 generate enormous numbers of features. To keep the feature space tractable, features are stored in an LRU (least recently used) cache. The feature store is limited to a configurable number of elements; whenever it is full, the least recently seen feature is deleted. When a deleted feature is encountered again, it will be considered as a new feature whose weights are still at their default values.
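A sketch of such an LRU-pruned feature store, using Python's OrderedDict; the capacity and the stored value type (e.g. the per-class weights of a feature) are illustrative:

```python
from collections import OrderedDict

class LRUFeatureStore:
    """Feature store with LRU pruning: when the configured capacity is
    reached, the least recently seen feature is discarded; a feature
    seen again after pruning starts over with default weights."""

    def __init__(self, capacity=600_000):
        self.capacity = capacity
        self.store = OrderedDict()            # feature -> per-class weights

    def get(self, feature, default=None):
        if feature in self.store:
            self.store.move_to_end(feature)   # mark as most recently seen
            return self.store[feature]
        return default                        # unseen, or pruned away earlier

    def put(self, feature, value):
        if feature not in self.store and len(self.store) >= self.capacity:
            self.store.popitem(last=False)    # evict the least recently used feature
        self.store[feature] = value
        self.store.move_to_end(feature)
```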
3 Feature Generation

3.1 Preprocessing

We did not perform language-specific preprocessing techniques such as word stemming, stop word removal, or case folding, since other researchers found that such techniques tend to hurt spam-filtering accuracy [6, 16]. We did compare three types of e-mail-specific preprocessing:

- Preprocessing via mimedecode, a utility for decoding typical mail encodings (Base64, Quoted-Printable etc.).
- Preprocessing via Jaakko Hyvatti's normalizemime [9]. This program converts the character set to UTF-8, decoding Base64, Quoted-Printable and URL encoding and adding warn tokens in case of encoding errors. It also appends a copy of HTML/XML message bodies with most tags removed, decodes HTML entities, and limits the size of attached binary files.
- No preprocessing: use the raw mail, including large blocks of Base64 data in encoded form.

Except for the comparison of these alternatives, all experiments were performed on normalizemime-preprocessed mails.

3.2 Tokenization

Tokenization is the first stage in the classification pipeline; it involves breaking the text stream into tokens ("words"), usually by means of a regular expression. We tested four different tokenization schemas:

- P (Plain): Tokens contain any sequences of printable characters; they are separated by non-printable characters (whitespace and control characters).
- C (CRM114): The current default pattern of CRM114. Tokens start with a printable character, followed by any number of alphanumeric characters plus dashes, dots, commas and colons, optionally ended by a printable character.
- S (Simplified): A modification of the CRM114 pattern that excludes dots, commas and colons from the middle of the pattern. With this pattern, domain names and mail addresses will be split at dots, so the classifier can recognize a domain even if subdomains vary.
- X (XML/HTML+header-aware): A modification of the S schema that allows matching typical XML/HTML markup (start/end/empty tags: <tag>, </tag>, <br/>; doctype declarations: <!DOCTYPE; processing instructions: <?xml-stylesheet; entity and character references; attributes terminated by "="; attribute values surrounded by quotes), mail headers (terminated by ":"), and protocols such as "http://" in a token. Punctuation marks such as "." and "," are not allowed at the end of tokens, so normal words will be recognized no matter where in a sentence they occur, without being contaminated by trailing punctuation.

The X schema was used for all tests unless explicitly stated otherwise. The actual tokenization schemas are defined as the regular expressions given in Table 1. These patterns use Unicode categories: [^\p{Z}\p{C}] means everything except whitespace and control characters (POSIX class [:graph:]); \p{L}\p{M}\p{N} collectively match all alphanumerical characters ([:alnum:] in POSIX).

Table 1. Tokenization Patterns

Name | Regular Expression
P | [^\p{Z}\p{C}]+
C | [^\p{Z}\p{C}][-.,:\p{L}\p{M}\p{N}]*[^\p{Z}\p{C}]?
S | [^\p{Z}\p{C}][-\p{L}\p{M}\p{N}]*[^\p{Z}\p{C}]?
X | [^\p{Z}\p{C}][/!?#]?[-\p{L}\p{M}\p{N}]*(?:["'=;]|/?>|:/*)?
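The \p{...} classes are Unicode properties, which in Python requires the third-party regex module rather than the standard re module. A small illustrative example with the P and S patterns:

```python
import regex  # third-party 'regex' module; unlike re, it supports \p{...} properties

PATTERNS = {
    "P": r"[^\p{Z}\p{C}]+",
    "S": r"[^\p{Z}\p{C}][-\p{L}\p{M}\p{N}]*[^\p{Z}\p{C}]?",
}

def tokenize(text, schema="S"):
    return regex.findall(PATTERNS[schema], text)

print(tokenize("Visit www.example.com now!"))
# ['Visit', 'www.', 'example.', 'com', 'now!']
# The domain splits at the dots, as described for the S schema; the X
# schema additionally keeps trailing punctuation off the tokens.
```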
4 Feature Combination

4.1 Sparse Binary Polynomial Hashing

Sparse binary polynomial hashing (SBPH) is a feature combination technique introduced by the CRM114 Discriminator [3, 14]. SBPH slides a window of length N over the tokenized text. For each window position, all of the possible in-order combinations of the N tokens are generated; those combinations that contain at least the newest element of the window are retained. For a window of length N, this generates 2^(N-1) features. Each of these joint features can be mapped to one of the odd binary numbers from 1 to 2^N - 1, where original features at 1 positions are visible, while original features at 0 positions are hidden and marked as skipped. It should be noted that the features generated by SBPH are not linearly independent and that even a compact representation of the feature stream generated by SBPH may be significantly longer than the original text.
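An illustrative sketch of this windowed feature generation (not the actual CRM114 code, which additionally hashes the generated combinations):

```python
def sbph_features(window):
    """SBPH: every in-order combination of the window's tokens that
    includes the newest (last) token; 2**(N-1) features for N tokens."""
    n = len(window)
    feats = []
    for mask in range(1, 2 ** n, 2):     # odd masks: bit 0 = the newest token
        toks = [window[n - 1 - i] if mask & (1 << i) else "<skip>"
                for i in range(n)]       # built newest-first
        toks.reverse()                   # restore textual order
        while toks[0] == "<skip>":       # drop skips before the oldest visible token
            toks.pop(0)
        feats.append(" ".join(toks))
    return feats

# sbph_features("Do you feel lucky today?".split()) reproduces the 16
# entries of the SBPH column of Table 2 below, from "today?" up to
# "Do you feel lucky today?".
```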
4.2 Orthogonal Sparse Bigrams

Since the expressivity of SBPH is sufficient for many applications, we now consider whether it is possible to use a smaller feature set and thereby increase speed and decrease memory requirements. For this, we consider only word pairs inside the window, requiring the newest member of the window to be one of the two words in the pair. The idea behind this approach is to gain speed by working only with an orthogonal feature set inside the window, rather than the prolific and probably redundant features generated by SBPH.

Instead of all odd numbers, only those with two bits set to 1 in their binary representation are used: 2^n + 1, for n = 1 to N - 1. With this restriction, only N - 1 combinations with exactly two words are produced. We call them orthogonal sparse bigrams (OSB); sparse because most combinations have skipped words, and only the first one is a conventional bigram. With a sequence of five words, w_1, ..., w_5, OSB produces four combined features:

    w4 w5
    w3 <skip> w5
    w2 <skip> <skip> w5
    w1 <skip> <skip> <skip> w5

Because of the reduced number of combined features (N - 1 in OSB versus 2^(N-1) in SBPH), text classification with OSB can be considerably faster than with SBPH. Table 2 shows an example of the features generated by SBPH and OSB side by side.

Table 2. Features Generated by SBPH and OSB

Number | SBPH | OSB
1 (1) | today? |
3 (11) | lucky today? | lucky today?
5 (101) | feel <skip> today? | feel <skip> today?
7 (111) | feel lucky today? |
9 (1001) | you <skip> <skip> today? | you <skip> <skip> today?
11 (1011) | you <skip> lucky today? |
13 (1101) | you feel <skip> today? |
15 (1111) | you feel lucky today? |
17 (10001) | Do <skip> <skip> <skip> today? | Do <skip> <skip> <skip> today?
19 (10011) | Do <skip> <skip> lucky today? |
21 (10101) | Do <skip> feel <skip> today? |
23 (10111) | Do <skip> feel lucky today? |
25 (11001) | Do you <skip> <skip> today? |
27 (11011) | Do you <skip> lucky today? |
29 (11101) | Do you feel <skip> today? |
31 (11111) | Do you feel lucky today? |

Note that the orthogonal sparse bigrams form an almost complete basis set: by ORing features in the OSB set, any feature in the SBPH feature set can be obtained, except for the unigram (the single-word feature). However, there is no such redundancy within the OSB feature set itself; it is not possible to obtain any OSB feature by adding, ORing, or subtracting pairs of other OSB features. All of the OSB features are unique and not redundant.

Since the first term, the unigram w_5, cannot be obtained by ORing OSB features, it seems reasonable to add it as an extra feature. However, the experiments reported in Section 5.4 show that adding unigrams does not increase accuracy; in fact, it sometimes decreased accuracy.
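The OSB generator then reduces to a single loop; an illustrative sketch under the same conventions as the SBPH sketch above:

```python
def osb_features(window):
    """OSB: pair the newest token with each older token in the window,
    marking intervening positions; N-1 features for a window of N."""
    newest = window[-1]
    feats = []
    for dist in range(1, len(window)):   # 1 .. N-1 tokens back from the newest
        older = window[-1 - dist]
        feats.append(" ".join([older] + ["<skip>"] * (dist - 1) + [newest]))
    return feats

# osb_features("Do you feel lucky today?".split()) yields exactly the
# four OSB entries of Table 2:
#   ['lucky today?',
#    'feel <skip> today?',
#    'you <skip> <skip> today?',
#    'Do <skip> <skip> <skip> today?']
```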
5 Experimental Results

5.1 Testing Procedure

In order to test our multiple hypotheses, we used a standardized spam/nonspam test corpus from SpamAssassin [11]. This test corpus is extraordinarily difficult to classify, even for humans. It consists of 1397 spam messages, 250 hard nonspams, and 2500 easy nonspams, for a total of 4147 messages. These 4147 messages were shuffled into ten different standard sequences; results were averaged over these ten runs. We re-used the corpus and the standard sequences from [15].

Each test run begins with initializing all memory in the learning system to zero. The learning system was then presented with each member of a standard sequence, in the order specified for that sequence, and required to classify the message. After each classification the true class of the message was revealed, and the classifier had the possibility to update its prediction model accordingly prior to classifying the next message. (In actual usage, training will not be quite as incremental, since mail is read in batches.) The training system then moved on to the next message in the standard sequence.

The final 500 messages of each standard sequence were the test set used for final accuracy evaluation; we also report results on an extended test set containing the last 1000 messages of each run and on all (4147) messages. Systems were permitted to train on any messages, including those in the test set, after classifying them; at no time did a system ever have the opportunity to learn on a message before predicting the class of this message.

For evaluation we calculated the error rate

    E = (number of misclassifications) / (number of all classifications);

occasionally we mention the accuracy A = 1 - E. The average number of errors per test run is given in parentheses in the result tables. This process was repeated for each of the ten standard sequences. Each complete set of ten standard sequences (41470 messages) required approximately … minutes of processor time on a 1266 MHz Pentium III for OSB-5. (For SBPH-5 it was about two hours, which is not surprising, since SBPH-5 generates four times as many features as OSB-5.)

5.2 Parameter Tuning

We used a slightly different setup for tuning the Winnow parameters, since it would have been unfair to tune the parameters on the test set. The last 500 messages of each run were reserved as test set for evaluation, while the preceding 1000 messages were used as development set for determining the best parameter values. The S tokenization was used for the tests in this section.

Best performance was found with Winnow using 1.23 as promotion factor, 0.83 as demotion factor, and a threshold thickness of 5% (in either direction, i.e. θ− = 0.95 θ, θ+ = 1.05 θ). These parameter values turned out to be best for both OSB and SBPH; the results reported in Tables 3 and 4 are for OSB.
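In code, the tuned thick-threshold setting corresponds to a mistake test like the following sketch (an illustrative helper, not the actual implementation):

```python
def needs_training(score, theta, is_true_class, thickness=0.05):
    """Thick-threshold mistake test (Sec. 2.1): any score inside
    [theta_minus, theta_plus] counts as a mistake, so borderline
    instances are retrained even when classified correctly."""
    theta_minus = (1 - thickness) * theta   # 0.95 * theta at 5% thickness
    theta_plus = (1 + thickness) * theta    # 1.05 * theta
    if is_true_class:
        return score <= theta_plus          # true class must score clearly above
    return score >= theta_minus             # other classes must stay clearly below
```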
Table 3. Promotion and Demotion Factors

Promotion | … | … | … | 1.23 | … | …
Demotion | … | … | … | 0.83 | … | …
Test Set | 0.44% (2.2) | 0.36% (1.8) | 0.44% (2.2) | 0.32% (1.6) | 0.44% (2.2) | 0.48% (2.4)
Devel. Set | 0.52% (5.2) | 0.51% (5.1) | 0.52% (5.2) | 0.49% (4.9) | 0.51% (5.1) | 0.62% (6.2)
All | 1.26% (52.4) | 1.31% (54.3) | 1.33% (55.1) | 1.32% (54.7) | 1.34% (55.4) | 1.50% (62.2)

Table 4. Threshold Thickness

Threshold Thickness | 0% | 5% | 10%
Test Set | 0.68% (3.4) | 0.32% (1.6) | 0.44% (2.2)
Development Set | 0.88% (8.8) | 0.49% (4.9) | 0.56% (5.6)
All | 1.77% (73.5) | 1.32% (54.7) | 1.38% (57.1)

5.3 Feature Store Size and Comparison With SBPH

Table 5 compares orthogonal sparse bigrams and SBPH for different sizes of the feature store. OSB reached best results with 600,000 features (with an error rate of 0.32%), while SBPH peaked at 1,600,000 features (with a slightly higher error rate of 0.36%). Further increasing the number of features permitted in the store negatively affects accuracy. This indicates that the LRU pruning mechanism is efficient at discarding irrelevant features that are mostly noise.

Table 5. Comparison of SBPH and OSB with Different Feature Storage Sizes

OSB
Store Size | … | … | 600,000 | … | …
Last 500 | 0.36% (1.8) | 0.38% (1.9) | 0.32% (1.6) | 0.44% (2.2) | 0.44% (2.2)
Last 1000 | 0.37% (3.7) | 0.37% (3.7) | 0.33% (3.3) | 0.37% (3.7) | 0.37% (3.7)
All | 1.26% (52.3) | 1.29% (53.4) | 1.24% (51.4) | 1.26% (52.2) | 1.27% (52.5)

SBPH
Store Size | 2^21 | 1,600,000 | … | … | …
Last 500 | 0.38% (1.9) | 0.36% (1.8) | 0.42% (2.1) | 0.44% (2.2) | 0.42% (2.1)
Last 1000 | 0.37% (3.7) | 0.34% (3.4) | 0.38% (3.8) | 0.39% (3.9) | 0.38% (3.8)
All | 1.35% (55.8) | 1.28% (53.1) | 1.30% (54.0) | 1.30% (54.0) | 1.31% (54.2)

5.4 Unigram Inclusion

The inclusion of individual tokens (unigrams) in addition to orthogonal sparse bigrams does not generally increase accuracy, as can be seen in Table 6: OSB without unigrams peaks at a 0.32% error rate, while adding unigrams pushes the error rate up to 0.38%.

Table 6. Utility of Single Tokens (Unigrams)

| OSB only | OSB + Unigrams | OSB + Unigrams
Store Size | 600,000 | … | …
Last 500 | 0.32% (1.6) | 0.38% (1.9) | 0.42% (2.1)
Last 1000 | 0.33% (3.3) | 0.33% (3.3) | 0.36% (3.6)
All | 1.24% (51.4) | 1.22% (50.6) | 1.24% (51.4)
5.5 Window Sizes

The results of varying the window size as a system parameter are shown in Table 7. Again, we note that the optimal combination for the test set uses a window size of five tokens (our default setting, yielding a 0.32% error rate), with both shorter and longer windows producing worse error rates.

Table 7. Sliding Window Size

Window Size | Unigrams | 2 (Bigrams) | 3 | 4 | 5 | 6 | 7
Store Size | All (ca. 55,000) | … | … | … | … | … | …
Last 500 | 0.46% (2.3) | 0.48% (2.4) | 0.42% (2.1) | 0.44% (2.2) | 0.32% (1.6) | 0.38% (1.9) | 0.42% (2.1)
Last 1000 | 0.50% (5) | 0.43% (4.3) | 0.39% (3.9) | 0.40% (4) | 0.33% (3.3) | 0.38% (3.8) | 0.37% (3.7)
All | 1.43% (59.2) | 1.23% (51.2) | 1.24% (51.4) | 1.26% (52.2) | 1.24% (51.4) | 1.28% (53) | 1.22% (50.8)

Window Size | 3 | 4 | 5 | 6 | 7
Store Size | All (ca. …) | All (ca. …) | All (ca. …) | All (ca. …) | All (ca. …)
Last 500 | 0.48% (2.4) | 0.42% (2.1) | 0.42% (2.1) | 0.40% (2) | 0.46% (2.3)
Last 1000 | 0.43% (4.3) | 0.38% (3.8) | 0.38% (3.8) | 0.38% (3.8) | 0.40% (4)
All | 1.24% (51.3) | 1.22% (50.6) | 1.25% (51.8) | 1.27% (52.5) | 1.25% (51.7)

This U curve is not unexpected on an information-theoretic basis. English text has a typical entropy of around one bit per character and around five characters per word. If we assume that a text contains mainly letters, digits, and some punctuation symbols, most characters can be represented in six bits, yielding a word content of 30 bits. Therefore, at one bit per character, English text becomes uncorrelated at a window length of six words or longer, and features obtained at these window lengths are not significant.

These results also show that using OSB-5 is significantly better than using only single tokens (error rate of 0.46%) or conventional bigrams (0.48%).
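Spelled out, the back-of-the-envelope arithmetic behind the six-word estimate above is:

```latex
6\,\tfrac{\text{bits}}{\text{char}} \times 5\,\tfrac{\text{chars}}{\text{word}}
  = 30\ \text{bits of raw word content},
\qquad
1\,\tfrac{\text{bit}}{\text{char}} \times 5\,\tfrac{\text{chars}}{\text{word}}
  = 5\ \text{bits of entropy per word},
\qquad
\frac{30\ \text{bits}}{5\ \text{bits/word}} = 6\ \text{words}.
```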
5.6 Preprocessing and Tokenization

Results with normalizemime were generally better than with the other two options, reducing the error rate by up to 25% (Table 8). Accuracy on raw and mimedecode-decoded mails was roughly comparable.

Table 8. Preprocessing

Preprocessing | none | mimedecode | normalizemime
Last 500 | 0.42% (2.1) | 0.46% (2.3) | 0.32% (1.6)
Last 1000 | 0.37% (3.7) | 0.35% (3.5) | 0.33% (3.3)
All | 1.27% (52.5) | 1.26% (52.1) | 1.24% (51.4)

The S tokenization schema initially learns more slowly (the overall error rate is somewhat higher) but is finally just as good as the X schema (Table 9). P and C both result in lower accuracy, even though they initially learn quickly.

Table 9. Tokenization Schemas

Schema | X | S | C | P
Last 500 | 0.32% (1.6) | 0.32% (1.6) | 0.44% (2.2) | 0.42% (2.1)
Last 1000 | 0.33% (3.3) | 0.33% (3.3) | 0.39% (3.9) | 0.38% (3.8)
All | 1.24% (51.4) | 1.32% (54.7) | 1.28% (52.9) | 1.23% (51.1)

5.7 Comparison With CRM114 and Naive Bayes

The results for CRM114 and Naive Bayes on the last 500 mails are the best results reported in [15] for incremental (single-pass) training. For a fair comparison, these tests were all run using the C tokenization schema on raw mails without preprocessing. The best reported CRM114 weighting model is based on empirically derived weightings and is a rough approximation of a Markov Random Field; this model reduces to a Naive Bayes model when the window size is set to 1. To prevent the different pruning mechanisms (CRM114 uses a random-discard algorithm) from distorting the comparison, we disabled LRU pruning for Winnow and also reran the CRM114 tests using all features (Table 10).

Table 10. Comparison With Naive Bayes and CRM114

| Naive Bayes | CRM114 | CRM114 | Winnow+OSB
Store Size | All (…) | … | All | All
Last 500 | 1.84% (9.2) | 1.12% (5.6) | 1.16% (5.8) | 0.46% (2.3)
All | 3.44% (142.8) | 2.71% (112.5) | 2.73% (113.2) | 1.30% (53.9)

5.8 Speed of Learning

The learning rate for the Winnow classifier combined with the OSB feature generator is shown in Fig. 1. Note that the rightmost column shows the incremental error rate on new messages. After having classified 1000 messages, Winnow+OSB achieves error rates below 1% on new mails.

Mails | Error Rate (Avg. Errors) | New Error Rate (Avg. New Errors)
25 | 30.80% (7.7) | 30.80% (7.7)
50 | 21.40% (10.7) | 12.00% (3)
100 | 14.00% (14) | 6.60% (3.3)
200 | 9.75% (19.5) | 5.50% (5.5)
400 | 6.38% (25.5) | 3.00% (6)
600 | 4.97% (29.8) | 2.15% (4.3)
800 | 4.09% (32.7) | 1.45% (2.9)
1000 | 3.50% (35) | 1.15% (2.3)
1200 | 3.04% (36.5) | 0.75% (1.5)
1600 | 2.48% (39.7) | 0.80% (3.2)
2000 | 2.12% (42.3) | 0.65% (2.6)
2400 | 1.85% (44.4) | 0.53% (2.1)
2800 | 1.65% (46.2) | 0.45% (1.8)
3200 | 1.51% (48.2) | 0.50% (2)
3600 | 1.38% (49.7) | 0.38% (1.5)
4000 | 1.28% (51.1) | 0.35% (1.4)
4147 | 1.24% (51.4) | 0.20% (0.3)

Fig. 1. Learning curve for the best setting (Winnow 1.23, 0.83, 5% with 600,000 features, OSB-5, X tokenization).
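The evaluation procedure of Section 5.1 thus amounts to a simple progressive loop; an illustrative sketch using the interface of the Winnow sketch from Section 2:

```python
def progressive_run(classifier, messages):
    """Single-pass evaluation as in Sec. 5.1: classify each message
    first, then reveal its true class and train before moving on."""
    errors = 0
    for features, true_cls in messages:     # messages in standard-sequence order
        if classifier.classify(features) != true_cls:
            errors += 1
        classifier.train(features, true_cls)
    return errors / len(messages)           # error rate E
```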
6 Related Work

Cohen and Singer [2] use a Winnow-like multiplicative weight update algorithm called sleeping experts with a feature combination technique called sparse phrases, which seems to be essentially equivalent to SBPH. (Thanks to an anonymous reviewer for pointing us to this work.) Bigrams and n-grams are a classical technique; SBPH was introduced in [14] and sparse phrases in [2]. We propose orthogonal sparse bigrams as a minimalistic alternative that is, to the best of our knowledge, new.

An LRU mechanism for feature set pruning was employed by the first author in [10]. We suppose that others have done the same, since the idea seems to suggest itself, but we are currently not aware of such usage.

7 Conclusion and Future Work

We have introduced orthogonal sparse bigrams (OSB) as a new feature combination technique for text classification that combines high expressivity with a relatively low computational load. By combining OSB with the Winnow algorithm we more than halved the error rate compared to a state-of-the-art spam filter, while still retaining the property of incrementality. By refining the preprocessing and tokenization steps we were able to further reduce the error rate by 30%. (Our algorithm is freely available as part of the TiEs system [13].)

In this study we have measured the accuracy without taking the different costs of misclassifications into account (it can be tolerated to let a few spam mails through, but it is bad to classify a regular mail as spam). This could be
addressed by using a cost metric as discussed in [5]. Winnow could be biased in favor of classifying a borderline mail as nonspam by multiplying the spam score by a factor < 1 (e.g. 99%) when classifying.

Currently our Winnow implementation supports only binary features; how often a feature (sparse bigram) appears in a text is not taken into account. We plan to address this by introducing a strength for each feature (cf. [4, Sec. 4.3]).

Also of interest is the difference in performance between the LRU (least recently used) pruning algorithm used here and the random-discard algorithm used in CRM114 [15]. When the random-discard algorithm in CRM114 triggered, it almost always resulted in a decrease in accuracy; here we found that an LRU algorithm could act to provide an increase in accuracy. Analysis and determination of the magnitude of this effect will be a concern of future work.

Acknowledgments

We thank the anonymous reviewers for their helpful comments and suggestions.

References

[1] A. J. Carlson, C. M. Cumby, N. D. Rizzolo, J. L. Rosen, and D. Roth. SNoW User Manual. Technical report, UIUC.
[2] W. W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems, 17(2), 1999.
[3] CRM114: The Controllable Regex Mutilator. http://crm114.sourceforge.net/.
[4] I. Dagan, Y. Karov, and D. Roth. Mistake-driven learning in text categorization. In EMNLP-97, 1997.
[5] J. M. Gómez Hidalgo, E. Puertas Sanz, and M. J. Maña López. Evaluating cost-sensitive unsolicited bulk e-mail categorization. In JADT-02, Madrid, Spain, 2002.
[6] P. Graham. Better Bayesian filtering. In MIT Spam Conference, 2003.
[7] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318, 1988.
[8] M. Munoz, V. Punyakanok, D. Roth, and D. Zimak. A learning approach to shallow parsing. Technical Report UIUCDCS-R, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois.
[9] normalizemime.
[10] C. Siefkes. A toolkit for caching and prefetching in the context of Web application platforms. Diplomarbeit, TU Berlin.
[11] SpamAssassin.
[12] SpamBayes.
[13] TiEs: Trainable Incremental Extraction System. ag-db/software/ties/.
[14] W. S. Yerazunis. Sparse binary polynomial hashing and the CRM114 discriminator. In MIT Spam Conference, Cambridge, MA, 2003.
[15] W. S. Yerazunis. The spam-filtering accuracy plateau at 99.9% accuracy and how to get past it. In MIT Spam Conference, Cambridge, MA, 2004.
[16] L. Zhang and T. Yao. Filtering junk mail with a maximum entropy model. In 20th International Conference on Computer Processing of Oriental Languages, 2003.