Detecting E-mail Spam Using Spam Word Associations

N.S. Kumar 1, D.P. Rana 2, R.G. Mehta 3
Sardar Vallabhbhai National Institute of Technology, Surat, India
1 p10co977@coed.svnit.ac.in 2 dpr@coed.svnit.ac.in 3 rgm@coed.svnit.ac.in

Abstract: Nowadays, mailbox management has become a big task. A large proportion of the e-mails we receive are spam. These unwanted e-mails clog the inbox and are ubiquitous. Here, a new technique for spam detection is presented that makes use of clustering and of association rules generated by the Apriori algorithm. Vector space notation is used to represent the e-mails. The results obtained from experiments conducted on the ling-spam dataset demonstrate the effectiveness of the proposed technique.

Keywords: Association rules, Content based spam, detection, spam, Text clustering, Vector space model

I. INTRODUCTION

Spam is an unfortunate problem on the internet. Spam e-mails are the e-mails that we get without our consent. They are typically sent to millions of users at the same time. Spam can be defined as unsolicited (unwanted, junk) e-mail for a recipient, or any e-mail that the user does not want to have in his inbox. It has also been defined as follows: "Internet spam is one or more unsolicited messages, sent or posted as a part of a larger collection of messages, all having substantially identical content." [1]

Most spam is sent to sell products and services, and spam works because a small number of people choose to respond to it. It costs the sender almost nothing to send millions of spam e-mails. Spam is a big problem because of the large amount of shared resources it consumes. Spam increases the load on the servers and the bandwidth of the ISPs, and the added cost of handling this load must be borne by the customers. In addition, the time people spend reading and deleting spam e-mails is wasted. According to the 2010 statistics [2], 89.1% of all e-mails were spam. This amounts to about 262 billion spam e-mails per day. These are truly large numbers.
This paper is organized as follows. A brief introduction to spam was given above. Related work in this field is discussed in Section II. The work proposed in this paper is explained in Section III. The results and inferences are presented in Section IV. Lastly, the conclusions are presented in Section V.

II. RELATED WORK

Several solutions to the spam problem involve detection and filtering of the spam e-mails on the client side. Machine learning approaches have been used in the past for this purpose. Examples include Bayesian classifiers such as Naive Bayes [3], [4], [5], [6], C4.5 [7], Ripper [8], Support Vector Machines (SVM) [9] and others. In many of these approaches, Bayesian classifiers were observed to give good results, and so they have been widely used in spam filtering software. A number of techniques use clustering as a part of their spam detection approach: clustering followed by KNN classification [10], [11], clustering followed by KNN or BIRCH classification [1], and clustering followed by SVM classification [12]. To the best of the authors' knowledge, clustering with association rules has not been used for spam detection in the past.

The vector space model [13] is an algebraic model for representing text documents as vectors of identifiers. Each dimension corresponds to a separate term. If the term occurs in the document, it has a non-zero value in the vector. This is shown in Figure 1. The simplest scheme is to set this value to the number of times the word occurs in that document. The drawback of this approach is that some terms occur with very high frequency in most documents and so cannot be used to discriminate between documents. The tf-idf weighting scheme is an improvement over the simple scheme. In this scheme, the value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus.
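As a rough illustration of the weighting just described, the following Python sketch computes tf-idf vectors for a few tokenised documents. It assumes raw counts for tf and the common logarithmic form idf = log(N / df_t); the function name `tf_idf` and the toy documents are hypothetical and are not part of the proposed system.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumption: tf is the raw count of the term in the document and
    idf(t, D) = log(N / df_t), with N the number of documents.
    """
    n_docs = len(docs)
    # df[t] = number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


# Toy corpus (hypothetical): three already-tokenised messages.
docs = [
    ["free", "lottery", "free", "money"],
    ["meeting", "agenda", "money"],
    ["free", "meeting", "notes"],
]
vectors = tf_idf(docs)
```

In this toy corpus "lottery" occurs in only one message, so it receives a larger weight than "money", which occurs in two. A term that occurs in every document gets weight zero.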
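According to this scheme, the value is computed as:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$$

where tf is the term frequency as discussed above and idf(t, D) is the inverse document frequency, given as:

$$\mathrm{idf}(t, D) = \log \frac{N}{df_t}$$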

where N is the number of documents and df_t is the number of documents that contain the term t. The tf-idf value is high when the term frequency within the document is high and the document frequency of the term across the corpus is low. The effect is that common terms are filtered out.

Figure 1. Vector space notation

Clustering is the assignment of a set of objects to groups (called clusters) so that the objects in the same cluster are more similar to each other than to those in other clusters. It is one of the most important unsupervised learning problems. Document (or text) clustering is a subset of the larger field of data clustering. In our system, clustering is a data reduction step: after the clusters of documents are formed, we select only the 'spammy' clusters, and the subsequent steps are applied only to the selected clusters. This reduces the time and effort needed to perform the entire operation.

K-means [14] is one of the simplest clustering algorithms. It attempts to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. The main idea is to define k centroids, one for each cluster. The choice of the initial centroids affects the final outcome, and ideally they should be far away from each other. The steps in the algorithm are as given below:

Arbitrarily choose k documents as the initial centroids.
Repeat
    (Re)assign each document to the cluster to which it is the most similar, based on the chosen similarity measure.
    Obtain new centroids for each cluster.
Until no change

In distance-based clustering, the similarity criterion is distance: two or more objects belong to the same cluster if they are close according to a given distance. In our case, the measure that we use is the cosine measure between documents. It is the most popular similarity measure applied to text documents. The cosine similarity of two documents is defined by the angle between their feature vectors:

$$\cos \theta = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where "·"
denotes the dot product of the two frequency vectors A and B, and $\|A\|$ denotes the length (or norm) of a vector. Document similarity is based on the amount of overlapping content between documents. In general, the resulting similarity ranges from -1, meaning exactly opposite, to 1, meaning exactly the same, with 0 usually indicating independence and in-between values indicating intermediate similarity or dissimilarity. Because tf-idf weights are non-negative, the similarity between two document vectors in practice lies between 0 and 1.

The advantages of the K-means algorithm are that, apart from being simple, it is efficient on large data sets. Its disadvantages are that the initial choice of the centroids can give varied outcomes, that it is sensitive to noise and outliers, and that it tends to terminate at local optima. The time complexity of the k-means algorithm is O(nkl), where n is the number of objects, k is the number of clusters, and l is the number of iterations [15].

Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a relational database or other information repository. For example, the rule {milk, bread} → {butter} found in the sales data of a supermarket would indicate that if a customer buys milk and bread together, he or she is also likely to buy butter. Two important terms that go along with association rules are support and confidence. Support indicates how frequently the items appear in the database. Confidence indicates how often the if/then statement has been found to be true, i.e. the fraction of the transactions containing the "if" part that also contain the "then" part. The minimum support and confidence values can be changed to control the number of rules that are generated.

Apriori [16] is a classic algorithm for learning association rules. It attempts to find subsets which are common to at least a minimum number C of the item sets. Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.
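To make support, confidence and the bottom-up search concrete, here is a minimal Python sketch. It is a simplified, Apriori-style illustration: it extends frequent item sets one item at a time but omits the subset-pruning step of the full algorithm. The helper names and the toy transactions are hypothetical and are not taken from the implementation used in this work.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def frequent_itemsets(transactions, min_support):
    """Bottom-up search: extend frequent sets one item at a time."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support([i], transactions) >= min_support]
    frequent = list(level)
    while level:
        # Join frequent k-sets into (k+1)-set candidates, then test them
        # against the data. (Full Apriori also prunes candidates that
        # have an infrequent subset; that step is omitted here.)
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates
                 if support(c, transactions) >= min_support]
        frequent.extend(level)
    return frequent


# Toy supermarket data (hypothetical).
transactions = [frozenset(t) for t in (
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "jam"},
)]
rule_confidence = confidence({"milk", "bread"}, {"butter"}, transactions)
found = frequent_itemsets(transactions, min_support=0.5)
```

Here {milk, bread} appears in three of the four transactions and {milk, bread, butter} in two, so the rule {milk, bread} → {butter} has confidence 2/3. Raising `min_support` shrinks the set of frequent item sets and therefore the number of rules.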
The purpose of the Apriori algorithm is to find associations between different sets of data. In our technique, the associations that we are interested in are those between the spam words that occur in e-mails.

True positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) are the four possible outcomes when an e-mail is classified by the system. A false positive occurs when an e-mail is incorrectly identified as spam when it is in fact non-spam. A false negative occurs when an e-mail is incorrectly identified as non-spam when it is in fact spam. True positives and true negatives are correct detections.
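The four outcomes can be counted directly from the true and predicted labels. The sketch below is only an illustration: the function name `outcome_counts`, the label strings and the sample data are assumptions, with "spam" treated as the positive class.

```python
def outcome_counts(actual, predicted, positive="spam"):
    """Count TP, TN, FP and FN, treating `positive` as the positive class."""
    names = {
        (True, True): "TP",    # spam, flagged as spam
        (False, False): "TN",  # non-spam, left alone
        (False, True): "FP",   # non-spam, wrongly flagged as spam
        (True, False): "FN",   # spam, missed by the system
    }
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        counts[names[(a == positive, p == positive)]] += 1
    return counts


# Hypothetical labels for five messages.
actual = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham", "spam", "ham", "ham"]
counts = outcome_counts(actual, predicted)
```

For these five messages the counts are one true positive, one false negative, one false positive and two true negatives.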

TABLE I FOUR POSSIBLE OUTCOMES OF ASSOCIATION

Support and confidence are used to control the number of rules generated; they are not the evaluation criteria. Instead, four measures (precision, recall, specificity and accuracy) were used in the evaluation process. They are defined in Table II [1].

TABLE II EVALUATION MEASURES USED IN THE SYSTEM

Measure             Defined as                         What it means
Precision           TP / (TP + FP)                     Percentage of positive predictions that are correct
Recall/Sensitivity  TP / (TP + FN)                     Percentage of positive labeled instances that were predicted as positive
Specificity         TN / (TN + FP)                     Percentage of negative labeled instances that were predicted as negative
Accuracy            (TP + TN) / (TP + TN + FP + FN)    Percentage of predictions that are correct

Accuracy alone is not sufficient to gauge the effectiveness of the system, because we want both the spam and the non-spam e-mails to be correctly labeled. The accuracy value might be high even though the system labels only one of the two classes correctly. If the precision is high, there are few false positives. If the recall is high, the system recognizes most of the spam messages. A high value of specificity means that very few non-spam messages are classified as spam. A perfect predictor would have 100% sensitivity and 100% specificity.

III. PROPOSED WORK

The purpose of this paper is to present a technique to identify e-mail messages as spam or non-spam. Once that is done, the accuracy of the method is tested to see how many e-mails were correctly categorized. E-mail messages are represented as vectors. Clustering is then applied to group together spam and non-spam e-mails. Then, the Apriori algorithm is applied to generate the association rules. These can then be applied to new e-mails to see whether they are spam or not. The steps in the proposed system are as indicated in Figure 2.

Apache Mahout [17] contains free implementations of scalable machine learning algorithms.
It contains several algorithms for clustering, classification, pattern mining and frequent item mining, amongst others. We use it to convert the set of documents to the vector space notation and then to cluster the e-mails. Christian Borgelt's [18] implementation of the Apriori algorithm is used for generating frequent item sets. By varying the values used for support and confidence, the number of rules generated can be controlled. The proposed system is set up on a machine with the following hardware: Intel Centrino Duo 1.66 GHz, 3 GB RAM, 160 GB HDD.

Figure 2. Flowchart of the proposed system
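The clustering stage relies on k-means with the cosine measure, as outlined in Section II. The following pure-Python sketch illustrates that idea on a handful of small vectors. It is not Mahout's implementation: the function names are hypothetical, the first k vectors are used as initial centroids purely to keep the example deterministic, and unit-length normalization is applied up front.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (|A| |B|); defined as 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def kmeans(vectors, k, max_iter=100):
    """Return a cluster index for each vector."""
    vectors = [l2_normalize(v) for v in vectors]
    # Assumption: take the first k documents as the initial centroids.
    centroids = [list(v) for v in vectors[:k]]
    assignment = None
    for _ in range(max_iter):
        # (Re)assign each document to its most similar centroid.
        new_assignment = [
            max(range(k), key=lambda c: cosine_similarity(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:  # until no change
            break
        assignment = new_assignment
        # Obtain new centroids as the mean of each cluster's members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Four toy document vectors over three terms (hypothetical weights).
vectors = [[3, 0, 0], [0, 2, 1], [4, 1, 0], [0, 1, 2]]
labels = kmeans(vectors, k=2)
```

On this toy input the first and third vectors, which are dominated by the first term, end up in one cluster and the other two in the second.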

The set of documents is converted into the vector space notation. Entries in the vectors are the tf-idf values. Some e-mails in the set may be longer than others. In such e-mails, the same terms are likely to appear more often, i.e. they may have higher term frequencies. Such e-mails may also contain more terms that can be considered spam. To compensate for this difference in length, some form of normalization is needed. L2 normalization [19] is used for this purpose.

Clusters of similar e-mails are then formed. The K-means algorithm is used for clustering. The clusters may not be distinct spam and non-spam clusters; each may consist of a mixture of spam and non-spam e-mails. Of these clusters, those in which at least 50% of the e-mails are spam are selected to generate the association rules. This ensures that the set of e-mails obtained after this step contains a substantial number of spam e-mails, which improves the accuracy of the system. As the size of the data to be clustered increases, the number of clusters to be formed (k) should also be increased accordingly to obtain better clusters.

A list of commonly occurring spam words was created. The e-mails obtained once we have selected the spammy clusters are compared against this list. This forms a filtering step: only the words that occur in the list are retained, and the rest of the text is deleted. At the end of this operation, we get the various combinations in which the spam words occur in the set of e-mails. The list can be updated periodically to include new words that can be considered spam words.

Figure 3. Obtaining spam word combinations from an e-mail document

At this point we can go ahead and generate the association rules using the Apriori algorithm. Once we have the rules, they can be matched against new e-mails to decide whether each one is spam or not.

To understand how new e-mails will be processed, let's take an example. Assume that the words lottery and gambling occur in the list of spam words. So there may be a rule of the form lottery → gambling.
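The filtering step and the rule match in this example can be sketched in a few lines of Python. The word list, the single rule and the function names below are hypothetical placeholders chosen for illustration, and the tokenisation is deliberately naive.

```python
import re

# Hypothetical spam-word list and rule set (rule: lottery -> gambling).
SPAM_WORDS = {"lottery", "gambling", "winner", "prize"}
RULES = [({"lottery"}, {"gambling"})]


def spam_words_in(text, spam_words=SPAM_WORDS):
    """Filtering step: keep only the listed spam words, drop everything else."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t in spam_words}


def is_spam(text, rules=RULES):
    """Flag the e-mail if every word of at least one rule occurs in it."""
    words = spam_words_in(text)
    return any((antecedent | consequent) <= words
               for antecedent, consequent in rules)


flagged = is_spam("Win the lottery today, no gambling experience needed")
partial = is_spam("The lottery of birth is discussed in the seminar")
clean = is_spam("Agenda for the Monday meeting")
```

Only the first message contains both words of the rule, so only it is flagged. The second contains just one of the two words and the third contains no listed word at all.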
Any new e-mail that has both these words in its content will be treated as spam. Likewise, several other rules may match for a given test e-mail. E-mails which are not spam will not contain any words from the list of spam words, or will not contain all the words that form a rule. Such e-mails will be identified as non-spam by the system.

IV. EXPERIMENTAL RESULT

The corpus used for training and testing is the ling-spam corpus [20]. In ling-spam, there are four subdirectories, corresponding to 4 versions of the corpus:

bare: lemmatiser disabled, stop-list disabled
lemm: lemmatiser enabled, stop-list disabled
lemm_stop: lemmatiser enabled, stop-list enabled
stop: lemmatiser disabled, stop-list enabled

Lemmatizing is similar to stemming, and the stop-list setting indicates whether stop words were removed from the content of the messages. We use the lemm_stop subdirectory in our approach.

Each of these 4 directories contains 10 subdirectories (part1, ..., part10). These correspond to the 10 partitions of the corpus that were used in the 10-fold experiments. In every experiment, 9 partitions were used for training and the 10th partition was used for testing. Each of the 10 subdirectories contains both spam and legitimate messages, one message per file. Files whose names have the form spmsg*.txt are spam messages. All other files are legitimate messages. The total number of spam messages is 481 and that of legitimate messages is 2412. The percentage of spam in this corpus is 16.6%. Final results are obtained by taking the average of the scores obtained in each of the 10 experiments.

One advantage of the ling-spam corpus is its focus on the textual component of the e-mail, as needed in our system. In a real e-mail filtering system, some part of the message header may be used to improve the classification performance, e.g. the sender's address could be looked up in the address book. Such strategies do not require machine learning and are not the focus of our work here.
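The corpus conventions described above (spam files named spmsg*.txt, ten parts with one held out per experiment) can be expressed as two small helpers. This is only a sketch: the helper names and the sample file names are assumptions, and no files are actually read.

```python
from fnmatch import fnmatchcase


def label_for(filename):
    """Files named spmsg*.txt are spam; every other file is legitimate."""
    return "spam" if fnmatchcase(filename, "spmsg*.txt") else "legitimate"


def ten_fold_splits(parts):
    """Yield (training_parts, test_part), holding out each part once."""
    for i, test_part in enumerate(parts):
        yield parts[:i] + parts[i + 1:], test_part


# Assumed directory names for the ten partitions.
parts = [f"part{n}" for n in range(1, 11)]
splits = list(ten_fold_splits(parts))

# Hypothetical file names, used only to exercise the labelling rule.
labels = [label_for(name) for name in ("spmsga1.txt", "3-1msg1.txt")]
```

Each of the ten splits trains on nine parts and tests on the remaining one, and the reported scores are the averages over those ten runs.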
Results of conducting the experiments on the ling-spam data set are shown in the table below. Values are the average of 10 experiments, each using 1 of the 10 parts for testing and the other 9 for training.

TABLE III RESULTS OF APPLYING THE PROPOSED APPROACH ON THE LING-SPAM DATASET

Measure       Average Value
Precision     60.10%
Recall        71.60%
Specificity   92.28%
Accuracy      89.31%
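Each measure in Table III is a ratio of the four outcome counts TP, TN, FP and FN, following the definitions in Table II. The sketch below shows the arithmetic. The counts passed in are invented for illustration and are not the counts behind Table III, and returning 0.0 for an empty denominator is a convention assumed here.

```python
def _ratio(numerator, denominator):
    """Safe division: an empty denominator yields 0.0 (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def evaluation_measures(tp, tn, fp, fn):
    """Precision, recall, specificity and accuracy from outcome counts."""
    return {
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
    }


# Hypothetical counts for one test partition.
example = evaluation_measures(tp=30, tn=220, fp=20, fn=18)

# A filter that labels every message as non-spam: accuracy looks
# respectable, yet recall is zero because no spam is caught.
always_non_spam = evaluation_measures(tp=0, tn=241, fp=0, fn=48)
```

The second call illustrates why accuracy alone is not sufficient: with mostly legitimate mail, a filter that never flags anything still scores above 80% accuracy while detecting no spam.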

As seen in the table above, about 89.31% of all e-mails were correctly identified. Precision is on the lower side, which means that a considerable number of false positives are generated. Recall is also on the lower side, which indicates that not all spam messages are recognized. Since the specificity is high, most non-spam e-mails are recognized as non-spam.

V. CONCLUSION

In this paper, a new technique to detect spam e-mails using clustering and association rules was suggested. Clustering is used as a data reduction step, to find the spammy clusters among all the e-mails. After these doubtful clusters are identified, association rules can be generated for them. Using these rules, we can then classify an incoming e-mail as spam or non-spam. As part of future work, the system can be made truly dynamic by automating the entire process on the server side.

REFERENCES

[1] Prabhakar, R. and Basavaraju, M. A Novel Method of Spam Mail Detection Using Text Based Clustering Approach. Phil. Trans. Roy. Soc. London, vol. A247, pp
[2] Internet 2010 in Numbers. Internet: [Mar. 23, 2012].
[3] Androutsopoulos, I., Chandrinos, K., Koutsias, J., Paliouras, G. and Spyropoulos, C. An Evaluation of Naive Bayesian Anti-spam Filtering, in Proceedings of the Workshop on Machine Learning in the New Information Age: 11th European Conference on Machine Learning (ECML 2000), pp
[4] Bogofilter. Internet: [Mar. 23, 2012].
[5] Graham, P. Better Bayesian Filtering. Internet: [Mar. 23, 2012].
[6] Dumais, S., Heckerman, D., Horvitz, E. and Sahami, M. A Bayesian Approach to Filtering Junk E-mail, in Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, pp
[7] Quinlan, J.R. C4.5: Programs for Machine Learning, 1st ed., San Mateo, CA: Morgan Kaufmann,
[8] Cohen, W.W. Learning Rules that Classify E-mail, in Proceedings of the 1996 AAAI Spring Symposium on Machine Learning in Information Access, pp
[9] Drucker, H. Support Vector Machines for Spam Categorization, in IEEE Transactions on Neural Networks, pp
[10] Firte, L., Lemnaru, C.
and Potolea, R. Spam Detection Filter Using KNN Algorithm and Resampling, in Intelligent Computer Communication and Processing (ICCP), 2010 IEEE International Conference, pp
[11] Alguliev, R.M., Aliguliyev, R.M. and Nazirova, S.A. Classification of Textual E-mail Spam Using Data Mining Techniques. Applied Computational Intelligence and Soft Computing, vol. 2011, Article ID , 8 pages. doi: /2011/
[12] Kyriakopoulou, A. and Kalamboukis, T. Text Classification Using Clustering, in ECML-PKDD Discovery Challenge Workshop Proceedings.
[13] Vector Space Model. Internet: Vector_space_model [Mar. 23, 2012].
[14] MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations. Proc. Fifth Berkeley Sympos. Math. Statist. and Probability (Berkeley, Calif., 1965/66) Vol. I: Statistics, pp
[15] Manning, C.D., Raghavan, P. and Schutze, H. Introduction to Information Retrieval. 1st ed., Cambridge, England: Cambridge University Press, pp ,
[16] Agrawal, R. and Srikant, R. Fast Algorithms for Mining Association Rules in Large Databases, in Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, Santiago, Chile, pp
[17] Apache Mahout: Scalable Machine Learning and Data Mining. Internet: [Mar. 23, 2012].
[18] Apriori - Association Rule Induction / Frequent Item Set Mining. Internet: [Mar. 23, 2012].
[19] Lp space. Internet: [Mar. 23, 2012].
[20] Ling-Spam data set. Internet: [Mar. 23, 2012].

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences

More information

A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters

A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters Wei-Lun Teng, Wei-Chung Teng

More information

A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2

A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2 UDC 004.75 A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2 I. Mashechkin, M. Petrovskiy, A. Rozinkin, S. Gerasimov Computer Science Department, Lomonosov Moscow State University,

More information

Impact of Feature Selection Technique on Email Classification

Impact of Feature Selection Technique on Email Classification Impact of Feature Selection Technique on Email Classification Aakanksha Sharaff, Naresh Kumar Nagwani, and Kunal Swami Abstract Being one of the most powerful and fastest way of communication, the popularity

More information

Bayesian Spam Filtering

Bayesian Spam Filtering Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary amaobied@ucalgary.ca http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating

More information

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance Shen Wang, Bin Wang and Hao Lang, Xueqi Cheng Institute of Computing Technology, Chinese Academy of

More information

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report. Document Clustering. Meryem Uzun-Per Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

More information

Naïve Bayesian Anti-spam Filtering Technique for Malay Language

Naïve Bayesian Anti-spam Filtering Technique for Malay Language Naïve Bayesian Anti-spam Filtering Technique for Malay Language Thamarai Subramaniam 1, Hamid A. Jalab 2, Alaa Y. Taqa 3 1,2 Computer System and Technology Department, Faulty of Computer Science and Information

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

A Proposed Algorithm for Spam Filtering Emails by Hash Table Approach

A Proposed Algorithm for Spam Filtering Emails by Hash Table Approach International Research Journal of Applied and Basic Sciences 2013 Available online at www.irjabs.com ISSN 2251-838X / Vol, 4 (9): 2436-2441 Science Explorer Publications A Proposed Algorithm for Spam Filtering

More information

Email Filters that use Spammy Words Only

Email Filters that use Spammy Words Only Email Filters that use Spammy Words Only Vasanth Elavarasan Department of Computer Science University of Texas at Austin Advisors: Mohamed Gouda Department of Computer Science University of Texas at Austin

More information

An Efficient Spam Filtering Techniques for Email Account

An Efficient Spam Filtering Techniques for Email Account American Journal of Engineering Research (AJER) e-issn : 2320-0847 p-issn : 2320-0936 Volume-02, Issue-10, pp-63-73 www.ajer.org Research Paper Open Access An Efficient Spam Filtering Techniques for Email

More information

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT

IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT M.SHESHIKALA Assistant Professor, SREC Engineering College,Warangal Email: marthakala08@gmail.com, Abstract- Unethical

More information

WE DEFINE spam as an e-mail message that is unwanted basically

WE DEFINE spam as an e-mail message that is unwanted basically 1048 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999 Support Vector Machines for Spam Categorization Harris Drucker, Senior Member, IEEE, Donghui Wu, Student Member, IEEE, and Vladimir

More information

Feature Subset Selection in E-mail Spam Detection

Feature Subset Selection in E-mail Spam Detection Feature Subset Selection in E-mail Spam Detection Amir Rajabi Behjat, Universiti Technology MARA, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012 Feature

More information

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different

More information

A Case-Based Approach to Spam Filtering that Can Track Concept Drift

A Case-Based Approach to Spam Filtering that Can Track Concept Drift A Case-Based Approach to Spam Filtering that Can Track Concept Drift Pádraig Cunningham 1, Niamh Nowlan 1, Sarah Jane Delany 2, Mads Haahr 1 1 Department of Computer Science, Trinity College Dublin 2 School

More information

Filtering Junk Mail with A Maximum Entropy Model

Filtering Junk Mail with A Maximum Entropy Model Filtering Junk Mail with A Maximum Entropy Model ZHANG Le and YAO Tian-shun Institute of Computer Software & Theory. School of Information Science & Engineering, Northeastern University Shenyang, 110004

More information

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577 T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Non-Parametric Spam Filtering based on knn and LSA

Non-Parametric Spam Filtering based on knn and LSA Non-Parametric Spam Filtering based on knn and LSA Preslav Ivanov Nakov Panayot Markov Dobrikov Abstract. The paper proposes a non-parametric approach to filtering of unsolicited commercial e-mail messages,

More information

Differential Voting in Case Based Spam Filtering

Differential Voting in Case Based Spam Filtering Differential Voting in Case Based Spam Filtering Deepak P, Delip Rao, Deepak Khemani Department of Computer Science and Engineering Indian Institute of Technology Madras, India deepakswallet@gmail.com,

More information

Representation of Electronic Mail Filtering Profiles: A User Study

Representation of Electronic Mail Filtering Profiles: A User Study Representation of Electronic Mail Filtering Profiles: A User Study Michael J. Pazzani Department of Information and Computer Science University of California, Irvine Irvine, CA 92697 +1 949 824 5888 pazzani@ics.uci.edu

More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

Towards better accuracy for Spam predictions

Towards better accuracy for Spam predictions Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 czhao@cs.toronto.edu Abstract Spam identification is crucial

More information

AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM

AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM ISSN: 2229-6956(ONLINE) ICTACT JOURNAL ON SOFT COMPUTING, APRIL 212, VOLUME: 2, ISSUE: 3 AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM S. Arun Mozhi Selvi 1 and R.S. Rajesh 2 1 Department

More information

SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2

SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2 International Journal of Computer Engineering and Applications, Volume IX, Issue I, January 15 SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2

More information

Email Classification Using Data Reduction Method

Email Classification Using Data Reduction Method Email Classification Using Data Reduction Method Rafiqul Islam and Yang Xiang, member IEEE School of Information Technology Deakin University, Burwood 3125, Victoria, Australia Abstract Classifying user

More information

An Approach to Detect Spam Emails by Using Majority Voting

An Approach to Detect Spam Emails by Using Majority Voting An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H-12 Islamabad, Pakistan Usman Qamar Faculty,

More information

Three-Way Decisions Solution to Filter Spam Email: An Empirical Study
