Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.
|
|
|
- Clemence Heath
- 10 years ago
- Views:
Transcription
1 Chronological Sampling for Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada Abstract. User models for filtering should be developed from appropriate training and test sets. A k-fold cross-validation is commonly presented in the literature as a method of mixing old and new messages to produce these data sets. We show that this results in overly optimistic estimates of the filter s accuracy in classifying future messages because the test set has a higher probability of containing messages that are similar to those in the training set. We propose the k-fold chronological cross-validation method that preserves the chronology of the messages in the test set. 1 Introduction Our research into spam filtering began with an investigation of existing filtering systems that are based on learned users models. Often, we found the reported accuracy of these systems to be overly optimistic [1]. We found it difficult to create models with the same high level of accuracy as published when using the same or independent datasets. Other authors have made similar observations [2]. Although much of the difference between the earlier evaluations and ours can be attributed to differences in the mix of legitimate and spam s in the datasets, we speculated that another important factor is the method of testing that is employed. Our work with machine learning models for spam filtering has shown that time is an important factor for valid testing; i.e. the order of incoming messages must be preserved so as to properly test a model. Unfortunately, many results reported in the literature are based on a k-fold cross-validation methodology that does not preserve the temporal nature of messages. It is common practice with cross-validation to mix the available data prior to sampling training and test sets so as to ensure a fair distribution of legitimate and spam messages [3]. However that practice, by mixing older messages with more recent ones, generates training sets that unfairly contain future messages. The mixing ignores an important dynamic feature of spam , namely that the generator (spammer) is an adversary who incorporates information about filtering techniques into their next generation spam [4]. If the temporal aspect is not considered, the performance of the model on future predictions may be significantly less than that estimated. A fair test set should contain only messages received after those used to develop the model. Commonly used data sets available on the web, such as the Ling-Spam corpus [5], do not even contain a time stamp in the data. We present a comparison of a spam filtering system tested using cross-validation and a temporal preserving approach. Corresponding author ([email protected]) 9
2 2 Background The k-fold cross-validation method is a standard method for comparing different models [3]. In k-fold cross-validation the dataset is divided into k subsets of approximately equal size. Model generation and testing is repeated k times. Each time a different subset is selected as a test set and the remaining subsets are selected for training. In some machine learning algorithms (e.g. inductive decision trees and artificial neural networks) it is necessary to select a part of the training set as a validation set to reduce the likelihood of over fitting. Each subset can be in the test set exactly once and in the training set (k 1) times. Before splitting the dataset, it is common to randomly sort all examples in the dataset, to ensure that they are evenly distributed, before creating the k subsets. The intention of cross-validation is that it will better estimate the true accuracy of the resulting models, based on the mean accuracy calculated over the k evaluations. In the addition, the standard deviation around the mean can be used to produce confidence intervals and to determine the statistical significance between different models, or machine learning algorithms. Consider the effect of randomly mixing the legitimate and spam messages prior to undertaking a k-fold standard cross-validation (SCV). When the datasets are mixed, possible future examples are injected into the training set thereby providing the model with an unfair look at future features. Figure 1 illustrates the problem of mixing old and new examples under SCV: If A, B, and C are three main types of examples in the dataset, let A be the oldest and C the newest. In SCV, all three types of examples are evenly distributed in the training set, validation set, and test set. This distribution provides the best opportunity for a model to perform well on the test set. However, in reality, type A and B have the highest probability of being available in the data set during the time of model development. C may not have appeared until after A and B. Thus, the mixing of examples in SCV provides an unrealistic set of future examples in the training set. We claim that this is one of the major reasons why many of the reported results on spam filtering are overly optimistic. 3 Chronological Cross-Validation We propose k-fold Chronological Cross-Validation (CCV) as a more realistic evaluation method of data for any temporally-sensitive model (including ) that selects the training and test sets while the data is in chronological order. The test set will then simulate the classification of future messages based on a model produced from prior messages and, therefore, the test set accuracy will better estimate the true accuracy of the filter. CCV maintains the chronology of the messages as the evaluation process moves along the chronological order of the data set. A ten-fold CCV is depicted in Fig. 2. We propose that this new cross-validation approach will reduce the probability of over-estimating the effectiveness of the model. CCV method is as follows: 1. The data set is sorted chronologically; 2. The data set is divided into 2k 1 blocks; 3. k folds (or repetitions) are undertaken as follows: 10
3 Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type. In Cross-Validation, the examples are randomly mixed. All 3 types of messages could be evenly distributed as following: ABCCBBABBCBABBC BA. BACBCACCA ABCCBA Training Set Validation Set Test Set In reality, the data is likely to have the following distribution: ABABBBBAAABABABBBBBBAAAABABABBBBAA Data used for model development ABCCCACBCC New Examples Time Fig. 1. A problem with the random mix of examples in standard Cross-Validation. (a) In the first repetition, blocks 1 to k are selected for evaluation; (b) The first k (v + 1) blocks are selected as the training set, the following v blocks are selected as validation set, and the last block is reserved for the test set; (c) Each later fold advances one data block in chronological order and the oldest data block is abandoned (for example in Fig. 2, in the second repetition, block 1 is abandoned and blocks 2 to k + 1 are selected for evaluation); 4. The procedure is repeated k times, until data block 2k 1 has been tested. 4 Empirical Studies Two studies were undertaken using one dataset called AcadiaSpam, collected from a single individual from January to May 2003, working at Acadia University. The set consisted of 1454 spam messages and 1431 legitimate messages. The initial study was conducted with a subset of these s using one experimental design and the second was conducted with all of the data using a slightly different design. 4.1 Experiment 1 The initial study was undertaken during the development of a prototype intelligent client. A small subset of the data was chosen so as to quickly determine if the proposal had merit for larger scale testing. The objective was to determine the severity of SCV over-estimates the true accuracy of hypotheses as compared to CCV. 11
4 10-Fold Chronological Cross-Validation Data Set: The data set is chronologically sorted and divided into 19 blocks OLD Chronological Order NEW Run #1 Run #2 Run #3 Run #4 Data in training set, 7 blocks are included in the training set in each run. Data in validation set, 2 blocks are included in the validation set in each run. Evaluation process keeps moving forward along the chronological line until the last data block is included for test set. Data in test set, the last one block is the test set. Fig. 2. A 10-fold Chronological Cross-Validation. Method. The first 500 legitimate messages and 500 spam messages were selected from the AcadiaSpam dataset and stored in chronological order. A k = 6 was chosen for this initial experiment; therefore, 6 models were repeatedly developed and tested against their respective test sets under each method. A block of 500 messages was chosen for each repetition starting at the earliest message. On each repetition, the block was moved 100 messages forward in the chronology. From each block of messages, 300 were selected for a training set, 100 for a tuning set, and 100 for a test set. Two data sampling methods were used to create the sets. For the SCV method, the message data for the three sets were randomly chosen. For the CCV method, the most recent messages were selected for the test set and the remaining messages randomly distributed to the training and tuning sets. The tuning set was used to prevent over-fitting of the neural network to the training set. Prior to model development, 200 features were extracted from each message using traditional information retrieval methods (removing stop words, performing word stemming, and collecting word statistics). A standard back-propagation neural network with a momentum term was used to develop the spam filter models [3]. The network had 200 inputs, 20 hidden nodes and 1 output node. A learning rate of 0.1 and a momentum factor of 0.8 were used to train the networks to a maximum of 10, 000 iterations. Since the output of the network ranges from 0 to 1, messages with output greater than 0.5 were classified as legitimate, and all others were classified as spam. The models were evaluated based on their accuracy of classification, precision and recall of spam messages. The calculations of accuracy, recall and precision follow [2]. 12
5 1.00 CCV SCV Accuracy Precision Recall Fig. 3. Comparison of SCV and CCV on the 1000 AcadiaSpam dataset with k = 6. Results and Discussion. Figure 3 shows that the SCV method estimates the true accuracy of the models to be That figure is, on average, 2.5% higher than the CCV method s estimate (p = 0.238, based on a paired t-test). Similarly, the SCV method consistently produces the higher precision and recall models. The difference in the precision values is most significant at 5% (p = ). Our conjecture is that the SCV method unrealistically allows the modelling system to identify features of future spam s. The SCV method over-estimates its performance on the test messages because the training set has a higher probability of containing messages that are similar to those in the test set. The CCV method generates a less accurate but more realistic model because the testing procedure simulates the classification of future incoming messages. A potential flaw in this preliminary study is that it does not use a standard crossvalidation method, as not all data was used in every repetition of SCV. This was done to keep the number of examples used by the two systems the same during each repetition. Although we suspect that a more standard SCV would further increase the performance gap between SCV and CCV we undertook a second study to investigate this potential concern about the validity of our results. 4.2 Experiment 2 The second study used all of the available data in a more traditional SCV approach in which all data is used in every repetition. As in the initial study, the objective was to show that SCV over-estimates the true accuracy of hypotheses as compared to CCV. Method All messages in the AcadiaSpam dataset were used in this experiment (1454 spam messages and 1431 legitimate messages). For SCV, the messages were randomly ordered and divided into k = 10 blocks each consisting of 143 legitimate and 145 spam messages. Each repetition used 7 blocks as the training set, two blocks for a validation set, 1 block as a test set. 13
6 1.00 CCV SCV Accuracy Precision Recall Fig. 4. Comparison of SCV and CCV on the 2880 AcadiaSpam dataset with k = 10. For CCV, the AcadiaSpam dataset was chronologically sorted. Legitimate messages and spam messages were divided into 19 (2k 1, where k = 10) blocks. Each block has 151 messages consisting of approximately 75 legitimate messages and 76 spam messages. Each repetition of the experiment used a window of k = 10 blocks of messages starting with the oldest block. From these 10 blocks, the 7 oldest blocks are used as the training set, the next 2 blocks for the validation set, and the most recent block for the test set. For each new repetition of the experiment, the oldest block was removed from the window of 10 blocks and the next chronological block was added. Note that every message was used by both methods, however fewer examples were used in each repetition by CCV than by SCV. All other aspects of the method were the same as in the first experiment. Results and Discussion. The results of this larger experiment, shown in Fig. 4, support the findings of the initial experiment. The SCV method produces hypotheses with superior performance in all 3 measurements (accuracy, recall, precision) as compared with CCV. The difference in mean accuracy between the two methods was found to be 3.1% (p = , based on a paired t-test) up from 2.5% in the initial study. As in the initial study, the SCV method produces the highest precision and recall statistics. In this case, no significant difference was detected in the precision statistics (p = 0.38) but the recall statistics differed substantially (p = ). Although the difference in the statistics for these methods could be attributed solely to the smaller training sets used to develop the CCV models in this study, the results of the initial study do not support that conclusion. When both methods used the same size training sets, the SCV method was still found to over-estimate the model s performance. We conclude that the difference in the statistics is caused by unrealistic mix of old and new examples in the training sets used to develop the SCV models. 14
7 5 Related Work We have recently discovered work by Crawford et al [6] that agrees that the k-fold SCV method is unrealistic given the temporal nature of . They describe a model development approach that accumulates the older messages in the training set and selects only the most recent data for the test set. This testing approach is in accord with how a real filter must perform; therefore, it should provide a fair evaluation of the model s effectiveness. Beyond this the research emphasis and approach differs from our work. Crawford et al focus on model development over time whereas we are interested in a cross-validation method for estimating the true error of a model at any one point in time. Our CCV method purposely abandons older blocks of examples as it moves through its repetitions so as to better estimate model performance on a variety of examples. 6 Summary and Conclusion We have considered the importance of maintaining the temporal nature of incoming messages when developing a user model for filtering. Although the k-fold SCV is commonly presented in the literature as a method of randomly mixing examples to produce training and test datasets, we have demonstrated that the method results in overly optimistic estimates of an spam filter s accuracy in classifying future messages. Our conjecture is that the SCV method is inappropriate because it allows the modelling system to unfairly identify features of the test set spam s. The SCV method over-estimates model performance on future messages because the training set has a higher probability of containing messages that are similar to those in the test set. We propose the k-fold Chronological Cross-Validation (CCV) method as a step towards more realistic estimates of model performance. CCV generates less accurate but more realistic models because the testing procedure more properly simulates the classification of future messages. The CCV method can be used to more properly evaluate any complex user model that will change over time. Thus, it can better estimating a models ability to deal with concept drift: the change in a user model over time due to subtle changes in the user, their environment, or both [7]. More broadly, the CCV method can be applied to any learning task where the order of examples must be preserved. The CCV method highlights the fact that more examples are needed to properly evaluate a user model when the preservation of example order is a requirement. The size of the time window, which depends on k, must be large enough to develop good models but small enough to allow sufficient blocks for cross-validation. Window size must also be sensitive to the mix of training examples and is likely to be different for each individual. These are a couple of the open problems that we would like to investigate in future research. References 1. Clark, J., Koprinska, I., Poon, J.: A Neural Network Based Approach to Automated Classification. In: Proceedings IEEE/WIC International Conference on Web Intelligence (WI2003), Halifax, Canada (2003)
8 2. Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C., Stamatopoulos, P.: Learning to Filter Spam A Comparison of a Naïve Bayesian and a Memory- Based Approach. In: Proc. of the Workshop on Machine Learning and Textual Information Access, 4th European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD-2000), Lyon, France (2000) Mitchell, T.: Machine Learning. McGraw Hill, New York, USA (1997) 4. Stern, H., Mason, J., Shepherd, M.: A linguistics-based attack on personalised statistical e- mail classifiers. Technical Report CS , Dalhousie Univ. (2004) 5. Androutsopoulos, I.: Ling-spam corpus (2000) data/lingspam_public.tar.gz. 6. Crawford, E., Kay, J., McCreath, E.: Automatic induction of rules for classification. In: The Sixth Australian Document Computing Symposium, Coffs Harbour, Australia (2001) 7. Widmer, G.: Combining Robustness and Flexibility in Learning Drifting Concepts. In: Proceedings of the 11th European Conference on Artifcial Intelligence (ECAI-94), Wiley, Chichester, UK (1994)
A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering
A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set
Overview Evaluation Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification
A Content based Spam Filtering Using Optical Back Propagation Technique
A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT
Lecture 13: Validation
Lecture 3: Validation g Motivation g The Holdout g Re-sampling techniques g Three-way data splits Motivation g Validation techniques are motivated by two fundamental problems in pattern recognition: model
Journal of Information Technology Impact
Journal of Information Technology Impact Vol. 8, No., pp. -0, 2008 Probability Modeling for Improving Spam Filtering Parameters S. C. Chiemeke University of Benin Nigeria O. B. Longe 2 University of Ibadan
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
A Proposed Algorithm for Spam Filtering Emails by Hash Table Approach
International Research Journal of Applied and Basic Sciences 2013 Available online at www.irjabs.com ISSN 2251-838X / Vol, 4 (9): 2436-2441 Science Explorer Publications A Proposed Algorithm for Spam Filtering
Predicting the Stock Market with News Articles
Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is
Efficient Spam Email Filtering using Adaptive Ontology
Efficient Spam Email Filtering using Adaptive Ontology Seongwook Youn and Dennis McLeod Computer Science Department, University of Southern California Los Angeles, CA 90089, USA {syoun, mcleod}@usc.edu
IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT
IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT M.SHESHIKALA Assistant Professor, SREC Engineering College,Warangal Email: [email protected], Abstract- Unethical
Differential Voting in Case Based Spam Filtering
Differential Voting in Case Based Spam Filtering Deepak P, Delip Rao, Deepak Khemani Department of Computer Science and Engineering Indian Institute of Technology Madras, India [email protected],
A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2
UDC 004.75 A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2 I. Mashechkin, M. Petrovskiy, A. Rozinkin, S. Gerasimov Computer Science Department, Lomonosov Moscow State University,
Bayesian Spam Filtering
Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary [email protected] http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
Three-Way Decisions Solution to Filter Spam Email: An Empirical Study
Three-Way Decisions Solution to Filter Spam Email: An Empirical Study Xiuyi Jia 1,4, Kan Zheng 2,WeiweiLi 3, Tingting Liu 2, and Lin Shang 4 1 School of Computer Science and Technology, Nanjing University
A Case-Based Approach to Spam Filtering that Can Track Concept Drift
A Case-Based Approach to Spam Filtering that Can Track Concept Drift Pádraig Cunningham 1, Niamh Nowlan 1, Sarah Jane Delany 2, Mads Haahr 1 1 Department of Computer Science, Trinity College Dublin 2 School
Evaluation & Validation: Credibility: Evaluating what has been learned
Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model
L13: cross-validation
Resampling methods Cross validation Bootstrap L13: cross-validation Bias and variance estimation with the Bootstrap Three-way data partitioning CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU
Filtering Junk Mail with A Maximum Entropy Model
Filtering Junk Mail with A Maximum Entropy Model ZHANG Le and YAO Tian-shun Institute of Computer Software & Theory. School of Information Science & Engineering, Northeastern University Shenyang, 110004
Impact of Feature Selection Technique on Email Classification
Impact of Feature Selection Technique on Email Classification Aakanksha Sharaff, Naresh Kumar Nagwani, and Kunal Swami Abstract Being one of the most powerful and fastest way of communication, the popularity
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
Abstract. Find out if your mortgage rate is too high, NOW. Free Search
Statistics and The War on Spam David Madigan Rutgers University Abstract Text categorization algorithms assign texts to predefined categories. The study of such algorithms has a rich history dating back
Detecting E-mail Spam Using Spam Word Associations
Detecting E-mail Spam Using Spam Word Associations N.S. Kumar 1, D.P. Rana 2, R.G.Mehta 3 Sardar Vallabhbhai National Institute of Technology, Surat, India 1 [email protected] 2 [email protected]
Cross-Validation. Synonyms Rotation estimation
Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical
W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set
http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer
Decision Tree Learning on Very Large Data Sets
Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa
Learning is a very general term denoting the way in which agents:
What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
Feature Subset Selection in E-mail Spam Detection
Feature Subset Selection in E-mail Spam Detection Amir Rajabi Behjat, Universiti Technology MARA, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012 Feature
PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering
2007 IEEE/WIC/ACM International Conference on Web Intelligence PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering Khurum Nazir Juneo Dept. of Computer Science Lahore University
Towards better accuracy for Spam predictions
Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 [email protected] Abstract Spam identification is crucial
The Enron Corpus: A New Dataset for Email Classification Research
The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,
Event driven trading new studies on innovative way. of trading in Forex market. Michał Osmoła INIME live 23 February 2016
Event driven trading new studies on innovative way of trading in Forex market Michał Osmoła INIME live 23 February 2016 Forex market From Wikipedia: The foreign exchange market (Forex, FX, or currency
Evaluating Data Mining Models: A Pattern Language
Evaluating Data Mining Models: A Pattern Language Jerffeson Souza Stan Matwin Nathalie Japkowicz School of Information Technology and Engineering University of Ottawa K1N 6N5, Canada {jsouza,stan,nat}@site.uottawa.ca
Spam Filtering using Naïve Bayesian Classification
Spam Filtering using Naïve Bayesian Classification Presented by: Samer Younes Outline What is spam anyway? Some statistics Why is Spam a Problem Major Techniques for Classifying Spam Transport Level Filtering
ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift
ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift Sarah Jane Delany 1 and Pádraig Cunningham 2 and Barry Smyth 3 Abstract. While text classification has been identified for some time
Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer
Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next
Web Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS
FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS Breno C. Costa, Bruno. L. A. Alberto, André M. Portela, W. Maduro, Esdras O. Eler PDITec, Belo Horizonte,
Sender and Receiver Addresses as Cues for Anti-Spam Filtering Chih-Chien Wang
Sender and Receiver Addresses as Cues for Anti-Spam Filtering Chih-Chien Wang Graduate Institute of Information Management National Taipei University 69, Sec. 2, JianGuo N. Rd., Taipei City 104-33, Taiwan
CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance
CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance Shen Wang, Bin Wang and Hao Lang, Xueqi Cheng Institute of Computing Technology, Chinese Academy of
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
Email Spam Detection Using Customized SimHash Function
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email
Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools
Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools Spam Track Wednesday 1 March, 2006 APRICOT Perth, Australia James Carpinter & Ray Hunt Dept. of Computer Science and Software
An Approach to Detect Spam Emails by Using Majority Voting
An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H-12 Islamabad, Pakistan Usman Qamar Faculty,
Simple Language Models for Spam Detection
Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
How To Filter Spam With A Poa
A Multiobjective Evolutionary Algorithm for Spam E-mail Filtering A.G. López-Herrera 1, E. Herrera-Viedma 2, F. Herrera 2 1.Dept. of Computer Sciences, University of Jaén, E-23071, Jaén (Spain), [email protected]
Equity forecast: Predicting long term stock price movement using machine learning
Equity forecast: Predicting long term stock price movement using machine learning Nikola Milosevic School of Computer Science, University of Manchester, UK [email protected] Abstract Long
ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift
ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift Sarah Jane Delany 1 and Pádraig Cunningham 2 Abstract. While text classification has been identified for some time as a promising application
T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577
T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or
Email Spam Detection A Machine Learning Approach
Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn
Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier
International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Learning to classify e-mail
Information Sciences 177 (2007) 2167 2187 www.elsevier.com/locate/ins Learning to classify e-mail Irena Koprinska *, Josiah Poon, James Clark, Jason Chan School of Information Technologies, The University
AN E-MAIL SERVER-BASED SPAM FILTERING APPROACH
AN E-MAIL SERVER-BASED SPAM FILTERING APPROACH MUMTAZ MOHAMMED ALI AL-MUKHTAR College of Information Engineering, AL-Nahrain University IRAQ ABSTRACT The spam has now become a significant security issue
Cross Validation. Dr. Thomas Jensen Expedia.com
Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract
How To Use Neural Networks In Data Mining
International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and
Gerard Mc Nulty Systems Optimisation Ltd [email protected]/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I
Gerard Mc Nulty Systems Optimisation Ltd [email protected]/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Data is Important because it: Helps in Corporate Aims Basis of Business Decisions Engineering Decisions Energy
Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA
Welcome Xindong Wu Data Mining: Updates in Technologies Dept of Math and Computer Science Colorado School of Mines Golden, Colorado 80401, USA Email: xwu@ mines.edu Home Page: http://kais.mines.edu/~xwu/
Combining Global and Personal Anti-Spam Filtering
Combining Global and Personal Anti-Spam Filtering Richard Segal IBM Research Hawthorne, NY 10532 Abstract Many of the first successful applications of statistical learning to anti-spam filtering were personalized
Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems
2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Impact of Feature Selection on the Performance of ireless Intrusion Detection Systems
International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
REVIEW OF HEART DISEASE PREDICTION SYSTEM USING DATA MINING AND HYBRID INTELLIGENT TECHNIQUES
REVIEW OF HEART DISEASE PREDICTION SYSTEM USING DATA MINING AND HYBRID INTELLIGENT TECHNIQUES R. Chitra 1 and V. Seenivasagam 2 1 Department of Computer Science and Engineering, Noorul Islam Centre for
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Comparison of K-means and Backpropagation Data Mining Algorithms
Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and
Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100
Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Erkan Er Abstract In this paper, a model for predicting students performance levels is proposed which employs three
Predicting Student Performance by Using Data Mining Methods for Classification
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, No 1 Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0006 Predicting Student Performance
Logistic Regression for Spam Filtering
Logistic Regression for Spam Filtering Nikhila Arkalgud February 14, 28 Abstract The goal of the spam filtering problem is to identify an email as a spam or not spam. One of the classic techniques used
Lecture 6. Artificial Neural Networks
Lecture 6 Artificial Neural Networks 1 1 Artificial Neural Networks In this note we provide an overview of the key concepts that have led to the emergence of Artificial Neural Networks as a major paradigm
Intrusion Detection via Machine Learning for SCADA System Protection
Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. [email protected] J. Jiang Department
Machine Learning in Spam Filtering
Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov [email protected] Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.
Analecta Vol. 8, No. 2 ISSN 2064-7964
EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,
Detecting client-side e-banking fraud using a heuristic model
Detecting client-side e-banking fraud using a heuristic model Tim Timmermans [email protected] Jurgen Kloosterman [email protected] University of Amsterdam July 4, 2013 Tim Timmermans, Jurgen
Chapter 20: Data Analysis
Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati [email protected], [email protected]
Applying Machine Learning to Stock Market Trading Bryce Taylor
Applying Machine Learning to Stock Market Trading Bryce Taylor Abstract: In an effort to emulate human investors who read publicly available materials in order to make decisions about their investments,
Enhancing Quality of Data using Data Mining Method
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 9, SEPTEMBER 2, ISSN 25-967 WWW.JOURNALOFCOMPUTING.ORG 9 Enhancing Quality of Data using Data Mining Method Fatemeh Ghorbanpour A., Mir M. Pedram, Kambiz Badie, Mohammad
E-mail Spam Classification With Artificial Neural Network and Negative Selection Algorithm
E-mail Spam Classification With Artificial Neural Network and Negative Selection Algorithm Ismaila Idris Dept of Cyber Security Science, Federal University of Technology, Minna, Nigeria. [email protected]
Email Classification Using Data Reduction Method
Email Classification Using Data Reduction Method Rafiqul Islam and Yang Xiang, member IEEE School of Information Technology Deakin University, Burwood 3125, Victoria, Australia Abstract Classifying user
A HYBRID RULE BASED FUZZY-NEURAL EXPERT SYSTEM FOR PASSIVE NETWORK MONITORING
A HYBRID RULE BASED FUZZY-NEURAL EXPERT SYSTEM FOR PASSIVE NETWORK MONITORING AZRUDDIN AHMAD, GOBITHASAN RUDRUSAMY, RAHMAT BUDIARTO, AZMAN SAMSUDIN, SURESRAWAN RAMADASS. Network Research Group School of
[email protected]. Keywords - Algorithm, Artificial immune system, E-mail Classification, Non-Spam, Spam
An Improved AIS Based E-mail Classification Technique for Spam Detection Ismaila Idris Dept of Cyber Security Science, Fed. Uni. Of Tech. Minna, Niger State [email protected] Abdulhamid Shafi i
Effective Data Mining Using Neural Networks
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 8, NO. 6, DECEMBER 1996 957 Effective Data Mining Using Neural Networks Hongjun Lu, Member, IEEE Computer Society, Rudy Setiono, and Huan Liu,
Automatic Resolver Group Assignment of IT Service Desk Outsourcing
Automatic Resolver Group Assignment of IT Service Desk Outsourcing in Banking Business Padej Phomasakha Na Sakolnakorn*, Phayung Meesad ** and Gareth Clayton*** Abstract This paper proposes a framework
Non-Parametric Spam Filtering based on knn and LSA
Non-Parametric Spam Filtering based on knn and LSA Preslav Ivanov Nakov Panayot Markov Dobrikov Abstract. The paper proposes a non-parametric approach to filtering of unsolicited commercial e-mail messages,
Hong Kong Stock Index Forecasting
Hong Kong Stock Index Forecasting Tong Fu Shuo Chen Chuanqi Wei [email protected] [email protected] [email protected] Abstract Prediction of the movement of stock market is a long-time attractive topic
