FEATURE 1. HASHING-TRICK
|
|
|
- Gabriel Bridges
- 10 years ago
- Views:
Transcription
1 FEATURE COLLABORATIVE SPAM FILTERING WITH THE HASHING TRICK Josh Attenberg Polytechnic Institute of NYU, USA Kilian Weinberger, Alex Smola, Anirban Dasgupta, Martin Zinkevich Yahoo! Research, USA User feedback is vital to the quality of the collaborative spam filters frequently used in open membership systems such as Yahoo Mail or Gmail. Users occasionally designate s as spam or non-spam (often termed as ham), and these labels are subsequently used to train the spam filter. Although the majority of users provide very little data, as a collective the amount of training data is very large (many millions of s per day). Unfortunately, there is substantial deviation in users notions of what constitutes spam and ham. Additionally, the open membership policy of these systems makes it vulnerable to users with malicious intent spammers who wish to see their s accepted by any spam filtration system can create accounts and use these to give malicious feedback to train the spam filter in giving their s a free pass. When combined, these realities make it extremely difficult to assemble a single, global spam classifier. The aforementioned problems could be avoided entirely if we could create a completely separate classifier for each user based solely on that user s feedback. Unfortunately, few users provide the magnitude of feedback required for this approach (many not providing any feedback at all). The number of s labelled by an individual user approximates a power law distribution. Purely individualized classifiers offer the possibility of excellent performance to a few users with many labelled s, at the expense of the great many users whose classifiers will become unreliable due to a lack of training data. This article illustrates a simple and effective technique that is able to balance the wide coverage provided by a global spam filter with the flexibility provided by personalized filters. By using ideas from multi-task learning, we build a hybrid method that combines both global and personalized filters. By training both the collection of personal and global classifiers simultaneously we are able to accommodate the idiosyncrasies of each user, as well as provide a global classifier for users that label few s. In fact, as is well known in multi-task learning [2], in addition to improving the experience of users who label many examples, this multi-task learning approach actually mitigates the impact of malicious users on the global filter. By offering specific consideration to the intents of the most active and unusual users, a global classifier is created that focuses on the truly common aspects of the classification problem. The end result is improved classifier performance for everyone, including users who label relatively few s. With large-scale open membership systems such as Yahoo Mail, one of the main hurdles to a hybrid personal/ global spam filter is the enormous amount of memory required to store individual classifiers for every user. We circumvent this obstacle with the use of the hashing trick [1, 5]. The hashing trick allows a fixed amount of memory to store all of the parameters for all the personal classifiers and a global classifier by mapping all personal and global features into a single low-dimensional feature space, in a way which bounds the required memory independently of the input. In this space, a single parameter vector, w, is trained which captures both global spam activity and the individual aspects of all active users. Feature hashing provides an extremely simple means of dimensionality reduction, eliminating the large word-to-dimension dictionary data structure typically needed for text-based classification, providing substantial savings in both complexity and available system memory. 1. HASHING-TRICK The standard way to represent instances (i.e. s) in text classification is the so-called bag-of-words approach. This method assumes the existence of a dictionary that contains all possible words and represents an as a very large vector,, with as many entries as there are words in the dictionary. For a specific , the i th entry in the vector contains the number of occurrences of word i in the . Naturally, this method lends itself to very sparse representations of instances and examples as the great majority of words do not appear in any specific text, almost all entries in the data vectors are zero. However, when building a classifier one often has to maintain information on all words in the entire corpus (e.g. in a weight vector), and this can become unmanageable in large corpora. The hashing trick is a dimensionality reduction technique used to give traditional learning algorithms a foothold in high dimensional input spaces (i.e. in settings with large dictionaries), by reducing the memory footprint of learning, and reducing the influence of noisy features. The main idea behind the hashing trick is simple and intuitive: instead of generating bag-of-word feature vectors through a dictionary that maps tokens to word indices, one uses a hash function that hashes words directly into a feature vector of size b. The hash function 18 NOVEMBER 2009
2 h : {Strings} [1..b] operates directly on strings and should be approximately uniform 1. In [5] we propose using a second independent hash function ξ : {Strings} {-1, 1}, that determines whether the particular hashed dimension of a token should be incremented or decremented. This causes the hashed feature vectors to be unbiased, since the expectation of the noise for any entry is zero. The algorithm below shows a pseudo-code implementation of the hashing trick that generates a hashed bag-of-words feature vector for an hashingtrick([string] ) = for word in do i = h(word) = + ξ(word) end for return The key point behind this hashing is that every hashed feature effectively represents an infinite number of unhashed features. It is the mathematical equivalent of a group of homographs (e.g. lie and lie) or homophones (e.g. you re and your) words with different meanings that look or sound alike. It is important to realize that having two meanings of the same feature is no more and no less of an issue than a homograph or homophone: if a computer can guess the meaning of the feature, or more importantly, the impact of the feature on the label of the message, it will change its decision based upon the feature. If not, then it will try to make its decision based on the rest of the . The wonderful thing about hashing is that instead of trying to cram a lot of different conflicting meanings into short words as humans do, we are trying to randomly spread the meanings evenly into over a million different features in our hashed language. So, although a word like antidisestablishmentarianism might accidentally run into the, our hashing function is a lot less likely to make two meaningful words homographs in our hashed language than those already put there by human beings. Of course there are so many features in our hashed language, that in the context of spam detection most features won t mean anything at all. In the context of spam filtering, the hashing trick by itself has several great advantages over the traditional dictionary-based bag-of-words method: 1. It considers even low-frequency tokens that might traditionally be ignored to keep the dictionary manageable this is especially useful in view of attacks by spammers using rare variants of words 1 For the experiments in this paper we used a public domain implementation of a hash function from hash/doobs.html. (e.g. misspellings like viogra ). 2. Hashing the terms makes a classifier agnostic to changes in the set of terms used, and if the spam classifier is used in an online setting, the hashing trick equips the classifier with a dictionary of effectively infinite size, which helps it adapt naturally to changes in the language of spam and ham. 3. By associating many raw words in its infinite dictionary (most of which never occur) with a single parameter, the meaning of this parameter changes depending upon which words are common, rare, or absent from the corpus. So, how large a language can this hashing trick handle? As we show in the next section, if it allows us to square the number of unhashed features we may possibly see, it could help us handle personalization. 2. PERSONALIZATION As the hashing trick frees up a lot of memory, the number of parameters a spam classifier can manage increases. In fact, we can train multiple classifiers which share the same parameter space [5]. For a set of users, U, and a dictionary size d, our goal is to train one global classifier,, that is shared amongst all users and one local classifier,, for each user u U. In a system with U users, we need U + 1 classifiers. When an arrives, it is classified by the combination of the recipient s local classifier and the global classifier + we call this the hybrid classifier. Traditionally, this goal would be very hard to achieve, as each classifier has d parameters, and hence the total number of parameters we need to store becomes ( U + 1)d. Systems like Yahoo Mail handle billions of s for hundreds of millions of users per day. With millions of users and millions of words, storing all vectors would require hundreds of terabytes of parameters. Further, to load the appropriate classifier for any given user in time when an arrives would be prohibitively expensive. The hashing trick provides a convenient solution to the aforementioned complexity, allowing us to perform personalized and global spam filtration in a single hashed bag-of-words representation. Instead of training U + 1 classifiers, we train a single classifier with a very large feature space. For each , we create a personalized bag of words by concatenating the recipient s user id to each word of the 2, and add to this the traditional global bag of words. All the elements in these bags are hashed into one of b buckets to form a b-dimensional representation of the , which is then fed into the classifier. Effectively, this process allows U +1 classifiers to share a b-dimensional parameter space nicely [5]. It is important to point out that 2 We use the º symbol to indicate string concatenation. NOVEMBER
3 the one classifier over b hashed features is trained after hashing. Because b will be much smaller than d x U, there will be many hash collisions. However, because of the sparsity and high redundancy of each , we can show that the theoretical number of possible collisions does not really matter for most of the s. Moreover, because the classifier is aware of any collisions before the weights are learned, the classifier is not likely to put weights of high magnitude on features with an ambiguous meaning. Figure 1: Global/personal hybrid spam filtering with feature hashing. Intuitively, the weights on the individualized tokens (i.e. those that are concatenated with the recipient s id) indicate the personal eccentricities of the particular users. Imagine for example that user barney likes s containing the word viagra, whereas the majority of users do not. The personalized hashing trick will learn that viagra itself is a spam indicative word, whereas viagra_barney is not. The entire process is illustrated in Figure 1. See the algorithm below for details on a pseudo-code implementation of the personalized hashing trick. Note that with the personalized hashing trick, using a hash function h : {Strings} [1..b], we only require b parameters independent of how many users or words appear in our system. personalized_hashingtrick(string userid, [string] ) = for word in do i = h(word) = + ξ(word) j = h(word userid) = + ξ(word userid) end for return 3. EXPERIMENTAL SET-UP AND RESULTS To assess the validity of our proposed techniques, we conducted a series of experiments on the freely distributed Figure 2: The results of the global and hybrid classifi ers applied to a large-scale real-world data set of 3.2 million s. trec07p benchmark data set, and on a large-scale proprietary data set representing the realities of an open-membership system. The trec data set contains 75,419 labelled and chronologically ordered s taken from a single server over four months in 2007 and compiled for trec spam filtering competitions [3]. Our proprietary data was collected over 14 days and contains n = 3.2 million anonymized s from U = 400,000 anonymized users. Here the first ten days are used for training, and the last four days are used for experimental validation. s are either spam (positive) or ham (non-spam, negative). All spam filter experiments utilize the Vowpal Wabbit (VW) [4] linear classifier trained with stochastic gradient descent on a squared loss. Note that the hashing trick is independent of the classification scheme used; the hashing trick could apply equally well with many learning-based spam filtration solutions. To analyse the performance of our classification scheme we evaluate the spam catch rate (SCR, the percentage of spam s detected) of our classifier at a fixed 1% ham misclassification rate (HMR, the percentage of good s erroneously labelled as spam). We note that the proprietary nature of the latter data set precludes publishing of exact performance numbers. Instead we compare the performance to a baseline classifier, a global classifier hashed onto b = 2 26 dimensions. Since 2 26 is far larger than the actual number of terms used, d = 40M, we believe this is representative of full-text classification without feature hashing. 4. THE VALIDITY OF HASHING IN SPAM FILTERING To measure the performance of the hashing trick and the influence of aggressive dimensionality reduction on classifier quality, we compare global classifier performance to that of our baseline classifier when hashing onto spaces of dimension b = {2 18, 2 20, 2 22, 2 24, 2 26 } on our proprietary data 20 NOVEMBER 2009
4 Figure 3: The amount of spam left in users inboxes, relative to the baseline. The users are binned by the amount of training data they provide. set. The results of this experiment are displayed as the blue line in Figure 2. Note that using 2 18 bins results in only an 8% reduction in classifier performance, despite large numbers of hash collisions. Increasing b to 2 20 improves the performance to within 3% of the baseline. Given that our data set has 40M unique tokens, this means that using a weight vector of 0.6% of the size of the full data results in approximately the same performance as a classifier using all dimensions. Previously, we have proposed using hybrid global/personal spam filtering via feature hashing as a means for effectively mitigating the effects of differing opinions of spam and ham amongst a population of users. We now seek to verify the efficacy of these techniques in a realistic setting. On our proprietary data set, we examine the techniques illustrated in Section 2 and display the results as the red line in Figure 2. Considering that our hybrid technique results from the cross product of U = 400K users and d = 40M tokens, a total of 16 trillion possible features, it is understandable that noise induced by collisions in the hash table adversely affects classifier performance when b is small. As the number of hash bins grows to 2 22, personalization already offers a 30% spam reduction over the baseline, despite aggressive hashing. In any open system, the number of s labelled as either spam or non-spam varies greatly among users. Overall, the labelling distribution approximates a power law distribution. With this in mind, one possible explanation for the improved performance of the hybrid classifier in Figure 2 could be that we are heavily benefiting those few users with a rich set of personally labelled examples, while the masses of users those with few labelled examples actually suffer. In fact, many users do not appear at all during training time and are only present in our test set. For these users, personalized features are mapped into hash buckets with weights set exclusively by other examples, resulting in some interference being added to the global spam prediction. In Section 2, we hypothesized that using a hybrid spam classifier could mitigate the idiosyncrasies of the most active spam labellers, thereby creating a more general classifier for the remaining users, benefiting everyone. To validate this claim, we segregate users according to the number of training labels provided in our proprietary data. As before, a hybrid classifier is trained with b = {2 18, 2 20, 2 22, 2 24, 2 26 } bins. The results of this experiment are seen in Figure 3. Note that for small b, it does indeed appear that the most active users benefit at the expense of those with few labelled examples. However, as b increases, therefore reducing the noise due to hash collisions, users with no or very few examples in the training set also benefit from the added personalization. This improvement can be explained if we recall the subjective nature of spam and ham users do not always agree, especially in the case of business s or newsletters. Additionally, spammers may have infiltrated the data set with malicious labels. The hybrid classifier absorbs these peculiarities with the personal component, freeing the global component to truly reflect a common definition of spam and ham and leading to better overall generalization, which benefits all users. 5. MITIGATING THE ACTIONS OF MALICIOUS USERS WITH HYBRID HASHING In order to simulate the influence of deliberate noise in a controlled setting we performed additional experiments on the trec data set. We chose some percentage, mal, of malicious users uniformly at random from the pool of receivers, and set their labels at random. Note that having malicious users label randomly is actually a harder case than having them label them adversarially in a consistent fashion as then the personalized spam filter could potentially learn and invert their preferences. Figure 4 presents a comparison of global and hybrid spam filters under varying loads of malicious activity and different sized hash tables. Here we set mal {0%, 20%, 40%}. Note that malicious activity does indeed harm the overall spam filter performance for a fixed classifier configuration. The random nature of our induced malicious activity leads to a background noise occurring in many bins of our hash table, increasing the harmful nature of collisions. Both global and hybrid classifiers can mitigate this impact somewhat if the number of hash bins b is increased. In short, with malicious users, both global (dashed line) and hybrid (solid line) classifiers require more hash bins to achieve near-optimum performance. Since the hybrid classifier has more tokens, the number of hash collisions is also correspondingly larger. Given a large enough number of hash bins, the hybrid classifier clearly outperforms the single global classifier under the malicious NOVEMBER
5 Figure 4: The infl uence of the number of hash bins on global and hybrid classifi er performance with varying percentages of malicious users. settings. We do not include the results of a pure local approach, as the performance is abysmal for many users due to a lack of training data. 6. CONCLUSION This work demonstrates the hashing trick as an effective method for collaborative spam filtering. It allows spam filtering without the necessity of a memory-consuming dictionary and strictly bounds the overall memory required by the classifier. Further, the hashing trick allows the compression of many (thousands of) classifiers into a single, finite-sized weight vector. This allows us to run personalized and global classification together with very little additional computational overhead. We provide strong empirical evidence that the resulting classifier is more robust against noise and absorbs individual preferences that are common in the context of open-membership spam classification. REFERENCES [1] Attenberg, J.; Weinberger, K.; Dasgupta, A.; Smola. A.; Zinkevich, M. Collaborative -spam filtering with the hashing trick. Proceedings of the Sixth Conference on and Spam, CEAS 2009, [2] Caruana, R. Algorithms and applications for multitask learning. Proc. Intl. Conf. Machine Learning, pp Morgan Kaufmann, [3] Cormack, G. TREC 2007 spam track overview. The Sixteenth Text REtrieval Conference (TREC 2007) Proceedings, [4] Langford, J.; Li, L.; Strehl, A. Vowpal Wabbit online learning project [5] Weinberger, K.; Dasgupta, A.; Attenberg, J.; Langford, J.; Smola, A. Feature hashing for large scale multitask learning. ICML, NOVEMBER 2009
CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance
CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance Shen Wang, Bin Wang and Hao Lang, Xueqi Cheng Institute of Computing Technology, Chinese Academy of
Email Spam Detection Using Customized SimHash Function
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email
Scalable Machine Learning - or what to do with all that Big Data infrastructure
- or what to do with all that Big Data infrastructure TU Berlin blog.mikiobraun.de Strata+Hadoop World London, 2015 1 Complex Data Analysis at Scale Click-through prediction Personalized Spam Detection
Achieve more with less
Energy reduction Bayesian Filtering: the essentials - A Must-take approach in any organization s Anti-Spam Strategy - Whitepaper Achieve more with less What is Bayesian Filtering How Bayesian Filtering
Combining Global and Personal Anti-Spam Filtering
Combining Global and Personal Anti-Spam Filtering Richard Segal IBM Research Hawthorne, NY 10532 Abstract Many of the first successful applications of statistical learning to anti-spam filtering were personalized
Spam detection with data mining method:
Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification Keywords: ensemble learning, SVM classifier,
Bayesian Spam Filtering
Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary [email protected] http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating
Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.
Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada
II. RELATED WORK. Sentiment Mining
Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
Machine Learning Final Project Spam Email Filtering
Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE
A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering
A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences
The Enron Corpus: A New Dataset for Email Classification Research
The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu
Email Spam Detection A Machine Learning Approach
Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn
Making Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University [email protected] [email protected] I. Introduction III. Model The goal of our research
PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering
2007 IEEE/WIC/ACM International Conference on Web Intelligence PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering Khurum Nazir Juneo Dept. of Computer Science Lahore University
Why Bayesian filtering is the most effective anti-spam technology
Why Bayesian filtering is the most effective anti-spam technology Achieving a 98%+ spam detection rate using a mathematical approach This white paper describes how Bayesian filtering works and explains
Anti Spamming Techniques
Anti Spamming Techniques Written by Sumit Siddharth In this article will we first look at some of the existing methods to identify an email as a spam? We look at the pros and cons of the existing methods
Application of Neural Network in User Authentication for Smart Home System
Application of Neural Network in User Authentication for Smart Home System A. Joseph, D.B.L. Bong, D.A.A. Mat Abstract Security has been an important issue and concern in the smart home systems. Smart
An Overview of Spam Blocking Techniques
An Overview of Spam Blocking Techniques Recent analyst estimates indicate that over 60 percent of the world s email is unsolicited email, or spam. Spam is no longer just a simple annoyance. Spam has now
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577
T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or
How To Choose the Right Vendor Information you need to select the IT Security Testing vendor that is right for you.
Information you need to select the IT Security Testing vendor that is right for you. Netragard, Inc Main: 617-934- 0269 Email: [email protected] Website: http://www.netragard.com Blog: http://pentest.netragard.com
Utilizing Multi-Field Text Features for Efficient Email Spam Filtering
International Journal of Computational Intelligence Systems, Vol. 5, No. 3 (June, 2012), 505-518 Utilizing Multi-Field Text Features for Efficient Email Spam Filtering Wuying Liu College of Computer, National
Simple and efficient online algorithms for real world applications
Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Spam Filtering based on Naive Bayes Classification. Tianhao Sun
Spam Filtering based on Naive Bayes Classification Tianhao Sun May 1, 2009 Abstract This project discusses about the popular statistical spam filtering process: naive Bayes classification. A fairly famous
How To Filter Email From A Spam Filter
Spam Filtering A WORD TO THE WISE WHITE PAPER BY LAURA ATKINS, CO- FOUNDER 2 Introduction Spam filtering is a catch- all term that describes the steps that happen to an email between a sender and a receiver
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic
Less naive Bayes spam detection
Less naive Bayes spam detection Hongming Yang Eindhoven University of Technology Dept. EE, Rm PT 3.27, P.O.Box 53, 5600MB Eindhoven The Netherlands. E-mail:[email protected] also CoSiNe Connectivity Systems
Storage Management for Files of Dynamic Records
Storage Management for Files of Dynamic Records Justin Zobel Department of Computer Science, RMIT, GPO Box 2476V, Melbourne 3001, Australia. [email protected] Alistair Moffat Department of Computer Science
Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski [email protected]
Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski [email protected] Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training
Intrusion Detection via Machine Learning for SCADA System Protection
Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. [email protected] J. Jiang Department
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
Junk Email Filtering System. User Manual. Copyright Corvigo, Inc. 2002-03. All Rights Reserved. 509-8282-00 Rev. C
Junk Email Filtering System User Manual Copyright Corvigo, Inc. 2002-03. All Rights Reserved 509-8282-00 Rev. C The Corvigo MailGate User Manual This user manual will assist you in initial configuration
Cosdes: A Collaborative Spam Detection System with a Novel E-Mail Abstraction Scheme
IJCSET October 2012 Vol 2, Issue 10, 1447-1451 www.ijcset.net ISSN:2231-0711 Cosdes: A Collaborative Spam Detection System with a Novel E-Mail Abstraction Scheme I.Kalpana, B.Venkateswarlu Avanthi Institute
Antispam Evaluation Guide. White Paper
Antispam Evaluation Guide White Paper Table of Contents 1 Testing antispam products within an organization: 10 steps...3 2 What is spam?...4 3 What is a detection rate?...4 4 What is a false positive rate?...4
Bayesian Spam Detection
Scholarly Horizons: University of Minnesota, Morris Undergraduate Journal Volume 2 Issue 1 Article 2 2015 Bayesian Spam Detection Jeremy J. Eberhardt University or Minnesota, Morris Follow this and additional
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
VBSim. Symantec Computer Virus/Worm Simulation System. Version 1.2. Copyright 1999, Symantec Corporation
Symantec Computer Virus/Worm Simulation System Version 1.2 Copyright 1999, Symantec Corporation Contents About VBSim... 3 Simulating the spread of malware... 3 Understanding the VBSim interface... 4 Demonstrating
A security analysis of the SecLookOn authentication system
Institut f. Statistik u. Wahrscheinlichkeitstheorie 1040 Wien, Wiedner Hauptstr. 8-10/107 AUSTRIA http://www.statistik.tuwien.ac.at A security analysis of the SecLookOn authentication system K. Grill Forschungsbericht
Lecture 10: CPA Encryption, MACs, Hash Functions. 2 Recap of last lecture - PRGs for one time pads
CS 7880 Graduate Cryptography October 15, 2015 Lecture 10: CPA Encryption, MACs, Hash Functions Lecturer: Daniel Wichs Scribe: Matthew Dippel 1 Topic Covered Chosen plaintext attack model of security MACs
Lasso-based Spam Filtering with Chinese Emails
Journal of Computational Information Systems 8: 8 (2012) 3315 3322 Available at http://www.jofcis.com Lasso-based Spam Filtering with Chinese Emails Zunxiong LIU 1, Xianlong ZHANG 1,, Shujuan ZHENG 2 1
Predicting the Stock Market with News Articles
Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is
Building a Question Classifier for a TREC-Style Question Answering System
Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given
MACHINE LEARNING & INTRUSION DETECTION: HYPE OR REALITY?
MACHINE LEARNING & INTRUSION DETECTION: 1 SUMMARY The potential use of machine learning techniques for intrusion detection is widely discussed amongst security experts. At Kudelski Security, we looked
Machine learning challenges for big data
Machine learning challenges for big data Francis Bach SIERRA Project-team, INRIA - Ecole Normale Supérieure Joint work with R. Jenatton, J. Mairal, G. Obozinski, N. Le Roux, M. Schmidt - December 2012
Bayes and Naïve Bayes. cs534-machine Learning
Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule
Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
A Practical Application of Differential Privacy to Personalized Online Advertising
A Practical Application of Differential Privacy to Personalized Online Advertising Yehuda Lindell Eran Omri Department of Computer Science Bar-Ilan University, Israel. [email protected],[email protected]
Basic E- mail Skills. Google s Gmail. www.netliteracy.org
Email it s convenient, free and easy. Today, it is the most rapidly growing means of communication. This is a basic introduction to email and we use a conversational non- technical style to explain how
Emerging Trends in Fighting Spam
An Osterman Research White Paper sponsored by Published June 2007 SPONSORED BY sponsored by Osterman Research, Inc. P.O. Box 1058 Black Diamond, Washington 98010-1058 Phone: +1 253 630 5839 Fax: +1 866
Image Based Spam: White Paper
The Rise of Image-Based Spam No matter how you slice it - the spam problem is getting worse. In 2004, it was sufficient to use simple scoring mechanisms to determine whether email was spam or not because
Evolutionary Detection of Rules for Text Categorization. Application to Spam Filtering
Advances in Intelligent Systems and Technologies Proceedings ECIT2004 - Third European Conference on Intelligent Systems and Technologies Iasi, Romania, July 21-23, 2004 Evolutionary Detection of Rules
Highly Scalable Discriminative Spam Filtering
Highly Scalable Discriminative Spam Filtering Michael Brückner, Peter Haider, and Tobias Scheffer Max Planck Institute for Computer Science, Saarbrücken, Germany Humboldt Universität zu Berlin, Germany
Groundbreaking Technology Redefines Spam Prevention. Analysis of a New High-Accuracy Method for Catching Spam
Groundbreaking Technology Redefines Spam Prevention Analysis of a New High-Accuracy Method for Catching Spam October 2007 Introduction Today, numerous companies offer anti-spam solutions. Most techniques
IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH
IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria
BARRACUDA. N e t w o r k s SPAM FIREWALL 600
BARRACUDA N e t w o r k s SPAM FIREWALL 600 Contents: I. What is Barracuda?...1 II. III. IV. How does Barracuda Work?...1 Quarantine Summary Notification...2 Quarantine Inbox...4 V. Sort the Quarantine
Introduction. How does email filtering work? What is the Quarantine? What is an End User Digest?
Introduction The purpose of this memo is to explain how the email that originates from outside this organization is processed, and to describe the tools that you can use to manage your personal spam quarantine.
Index Terms Cloud Storage Services, data integrity, dependable distributed storage, data dynamics, Cloud Computing.
Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Privacy - Preserving
Not So Naïve Online Bayesian Spam Filter
Not So Naïve Online Bayesian Spam Filter Baojun Su Institute of Artificial Intelligence College of Computer Science Zhejiang University Hangzhou 310027, China [email protected] Congfu Xu Institute of Artificial
Prevention of Spam over IP Telephony (SPIT)
General Papers Prevention of Spam over IP Telephony (SPIT) Juergen QUITTEK, Saverio NICCOLINI, Sandra TARTARELLI, Roman SCHLEGEL Abstract Spam over IP Telephony (SPIT) is expected to become a serious problem
Threat Modeling. Categorizing the nature and severity of system vulnerabilities. John B. Dickson, CISSP
Threat Modeling Categorizing the nature and severity of system vulnerabilities John B. Dickson, CISSP What is Threat Modeling? Structured approach to identifying, quantifying, and addressing threats. Threat
Spam Filtering with Naive Bayesian Classification
Spam Filtering with Naive Bayesian Classification Khuong An Nguyen Queens College University of Cambridge L101: Machine Learning for Language Processing MPhil in Advanced Computer Science 09-April-2011
SPAM FILTER Service Data Sheet
Content 1 Spam detection problem 1.1 What is spam? 1.2 How is spam detected? 2 Infomail 3 EveryCloud Spam Filter features 3.1 Cloud architecture 3.2 Incoming email traffic protection 3.2.1 Mail traffic
The What, Why, and How of Email Authentication
The What, Why, and How of Email Authentication by Ellen Siegel: Director of Technology and Standards, Constant Contact There has been much discussion lately in the media, in blogs, and at trade conferences
Adaption of Statistical Email Filtering Techniques
Adaption of Statistical Email Filtering Techniques David Kohlbrenner IT.com Thomas Jefferson High School for Science and Technology January 25, 2007 Abstract With the rise of the levels of spam, new techniques
Search and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
4.5 Symbol Table Applications
Set ADT 4.5 Symbol Table Applications Set ADT: unordered collection of distinct keys. Insert a key. Check if set contains a given key. Delete a key. SET interface. addkey) insert the key containskey) is
A Game Theoretical Framework for Adversarial Learning
A Game Theoretical Framework for Adversarial Learning Murat Kantarcioglu University of Texas at Dallas Richardson, TX 75083, USA muratk@utdallas Chris Clifton Purdue University West Lafayette, IN 47907,
Performance Workload Design
Performance Workload Design The goal of this paper is to show the basic principles involved in designing a workload for performance and scalability testing. We will understand how to achieve these principles
Simple Language Models for Spam Detection
Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to
16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
Ensuring Data Storage Security in Cloud Computing
Ensuring Data Storage Security in Cloud Computing Cong Wang 1, Qian Wang 1, Kui Ren 1, and Wenjing Lou 2 1 ECE Department, Illinois Institute of Technology 2 ECE Department, Worcester Polytechnic Institute
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
How To Create A Text Classification System For Spam Filtering
Term Discrimination Based Robust Text Classification with Application to Email Spam Filtering PhD Thesis Khurum Nazir Junejo 2004-03-0018 Advisor: Dr. Asim Karim Department of Computer Science Syed Babar
MapReduce/Bigtable for Distributed Optimization
MapReduce/Bigtable for Distributed Optimization Keith B. Hall Google Inc. [email protected] Scott Gilpin Google Inc. [email protected] Gideon Mann Google Inc. [email protected] Abstract With large data
How to Keep Marketing Email Out of the Spam Folder A guide for marketing managers and developers
How to Keep Marketing Email Out of the Spam Folder A guide for marketing managers and developers Sarah Longfors Web Developer marketing + technology 701.235.5525 888.9.sundog fax: 701.235.8941 2000 44th
A Logistic Regression Approach to Ad Click Prediction
A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi [email protected] Satakshi Rana [email protected] Aswin Rajkumar [email protected] Sai Kaushik Ponnekanti [email protected] Vinit Parakh
Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014
Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview
Opus One PAGE 1 1 COMPARING INDUSTRY-LEADING ANTI-SPAM SERVICES RESULTS FROM TWELVE MONTHS OF TESTING INTRODUCTION TEST METHODOLOGY
Joel Snyder Opus One February, 2015 COMPARING RESULTS FROM TWELVE MONTHS OF TESTING INTRODUCTION The following analysis summarizes the spam catch and false positive rates of the leading anti-spam vendors.
Detecting Email Spam. MGS 8040, Data Mining. Audrey Gies Matt Labbe Tatiana Restrepo
Detecting Email Spam MGS 8040, Data Mining Audrey Gies Matt Labbe Tatiana Restrepo 5 December 2011 INTRODUCTION This report describes a model that may be used to improve likelihood of recognizing undesirable
