# Dr. D. Y. Patil College of Engineering, Ambi,. University of Pune, M.S, India University of Pune, M.S, India

Save this PDF as:

Size: px
Start display at page:

Download "Dr. D. Y. Patil College of Engineering, Ambi,. University of Pune, M.S, India University of Pune, M.S, India"

## Transcription

2 A. Bayesian Theorem An attempt to Spam filtering based on Naive Bayes classifier is done by S.Roy, A.Patra, S.Sau, K.Mandal and S.Kunar in [7]. Naive Bayes classifier uses the Baye s theorem of conditioned probability to recognize an to be spam or not. Conditioned Probability is given as P (cj d) = P (d cj) P (cj) / P (d) (1) Considering each attribute and class label as a random variable and given a record with attributes (A1, A2, An), the goal is to predict class C. Specifically, we want to find the value of C that maximizes P(C A1, A2, An). The approach taken is to compute the posterior probability P(C A1, A2, An) for all values of C using the Bayes theorem. P(C A1A2 An) = P(A1A2.An C) P(C) P(A1A2...An) So you choose the value of C that maximizes P(C A1,A2, An). This is equivalent to choosing the value of C that maximizes P(A1,A2, An C) P(C). Naïve Bayesian prediction requires each conditional probability be non zero. Otherwise, the predicted probability will be zero. (2) In order to overcome this, we use probability estimation from one of the following: (3) The complete process of classification can be divided into two phases: (4) 1) Training Phase: In this each is first individually categorized to a category (spam/ham). Remove html tags, stop words, special characters, articles, proverbs. Extract keywords calculate the frequency of keywords and then save it to database for the selected category. 2) Classification Phase: In this phase the newly arrived mail is first converting to lower case and stop words, html tags, special characters, articles, proverbs are removed. Extracts the keywords from mails and then calculate the probability of these words from the learning dataset. If the spam probability is higher than the mail is spam otherwise its no spam/legitimate mail. It is observed that in order to classify, predict a spam from a non spam one and to increase the accuracy of the above algorithm, the following techniques and assumptions are used [7]: Sorting spam or non spam according to the language, words and then count. If a word does not exist, consider to approximate P (word class) using Laplacian. Threshold value. The Dataset contains each word that filtering is used to determine if a message is spam or not. Beside each word, there are two numbers. The first number is the number of times that the word has occurred in legitimate s. The second number is the number of times that the word has occurred in spam s. B. Support Vector Machine Support vector machine model had been the most successful algorithm in the field of Text classification. It is mainly popular because of its ease of implementation and high accurate results. Originally it was presented to classify data into two fixed classes making it supervised non-probabilistic binary classifier. But with time it has been used by researchers for classifying data into N categories. One of the works in classification using SVM discussed in this paper is by Fagbola Temitayo, Olabiyisi Stephen and Adigun Abimbola [8]. They have combined the SVM with the Genetic algorithm to enhance the performance of SVM. In its simplest form SVM can be used to represent a document in vector space where each feature (word) represents one dimension. Identical feature denotes same dimension. Two of the parameters namely Term Frequency (TF) and TF-Inverse Document Frequency (TF-IDF) add value to these vectors. TF the number of times a word occurs in a document. Harris Drucker, Donghui Wu, and Vladimir N. Vapnik proposes a word is a feature only if it occurs in three or more documents which prevents misspelled words and words used rarely. TF-IDF uses the above TF multiplied by the IDF (inverse document frequency). The document frequency (DF) is the number of times that word occurs in all the documents. The inverse document frequency (IDF) is defined as 2014, IJARCSSE All Rights Reserved Page 274

3 (5) TF counts the number of times w i feature occurs in a document D, where as DF counts the number of documents in which feature w i occurred at least once. Using these two parameters we can assign priorities to the features used in creating vector in space. Performing the dot product of new vector with the vectors representing different classes in space a similarity factor of the new item to be classified is calculated and compared. SVM searches for the hyper plane which can separate the documents as category A and B (Binary SVM) with maximal margin. Fig 1 presents a rough sketch how SVM performs the classification. Fig. 1 SVM classification using two planes P1 and P2 Given some document D, comprising a set of n points of the form D {(x i, y i) xi R, y { 1,1}} (6) Where the yi is either 1 or 1, indicating the class A or B to which the point xi belongs. There can be multiple hyper plane which can divide a given set of points. Here we have given example of two such linear planes P1 and P2, where P2 divides the points with maximum margin. The equation of the plane can be given by the equation, w.x b 0 (7) Where denotes the dot product and w the normal vector to the hyper plane, and parameter b plays a role to find out the offset of plane along the vector w from origin. In the work [8], along with the SVM the authors introduce the Genetic Algorithm GA, which mainly serves the purpose of optimization of the feature set obtain in plain SVM using GA fitness function. Thus their approach is based on GA-SVM algorithm. A brief algorithm can be presented as follows [8] Convert each mail from training set to an xlsx format. Each row contains a mail in form of features stored in columns. A label is associated with each feature as (1,-1) to indicate a spam or not. Around 7000 total most frequent words are recognised with a unique id in keyword set. Dataset is passed to GA-SVM algorithm for classification. The GA-SVM algorithms show an improvement from the SVM algorithm for spam detection as obtained from the results. Both in terms of accuracy and computational time the GA-SVM has shown an improvement from SVM [8]. Table 1. SVM V/s GA-SVM results [8] Classifier Accuracy (%) Computational time(s) SVM GA-SVM C. K- Nearest Neighbour (K-NN) Another important contribution to classification using machine learning approach is done by M. Chang and C.K. Poon in [9]. They have compared the performance of three algorithms namely K-NN, K-NN with TF-IDF and cosine measure and Naive Bayes classifier of which our focus of discussion is K-NN. K-NN is a classification algorithm which classifies objects based on K objects having closest pattern in the training sample. To find the closest patter objects a number of similarity measures are used among which the most popular is Euclidean distance calculated as: D(p1, p2) (x2 x1)2 (y2 y1)2 (8) Where p1 and p2 represents the points or objects in space having coordinates (x1, y1) and (x2, y2) respectively. Using such a model a training set can be generated for different classes associated with each point. Consider a New Object N is to be classified as one of the predefined classes A or B. The Euclidean distance of that objects will be calculated with the other objects in space and the class of K nearest neighbor is assigned to N. Fig represents the 4-NN model for classification where N will be assigned class A since out of 4 nearest objects 3 have Class A. In the paper [9], the authors 2014, IJARCSSE All Rights Reserved Page 275

4 have introduced the use of phrases from as a feature for K-NN classification. Unlike earlier approaches where most algorithms uses words as features for classification. Phrases represent a sequence of words separated by white space if it occurs at least a minimum number of times. Such phrases are called Shingle by the authors. Considered a document D containing the text, Believe and act as if it were impossible to fail. If such a document is used to collect features in terms of shingles of length 3 then the set of features S is given as: S3 (D) = { Believe and act, and act as, act as if, as if it, if it were, it were impossible, were impossible to, impossible to fail.}the total number of shingles N of length 3 in D is 8, given by formula: N = n-w+1 (9) Where n is the total number of words in D and w is the length of each shingle. The authors used the concept of resemblance as their similarity measure defined as in [9], Sw(A) Sw(B) rw(a, B) (10) Sw(A) Sw(B) Where S w (A) and S w (B) are w shingles of document A and B. Using the above resemblance formula K nearest neighbor can be find out for a new , thus classifying it to the class shared by the most number of mails in K neighbours. Fig. 2 K-NN algorithm for 4 nearest neighbour Table 2 shows the summary of above mentioned algorithms. Table 2. Summary of Algorithms Algorithm Dataset Type of Method Result Bayesian Theorem Support Vector machine K Nearest Neighbour s of Author Account 6000 mails from Spam Assassin dataset s grouped into 27 folders. Supervised Supervised Supervised Extracted Keywords Bag of words optimized by GA algorithm Shinglesphrases of words 94.2% 2014, IJARCSSE All Rights Reserved Page % 94% precision for spam detection In our Dissertation, we focus on Naive Bayesian Theorem for analysis and enchantment in classification. III. IMPLEMENTATION DETAILS The consists of a message header and message body. Header includes sender and receiver address, subject, date and server address etc. Message body includes text data. In our implementation of the classification we will focus on message body only. The Naïve Bayes classification is based on Baye s rule of condition probability. Entire classification process is divided into three phases or steps. A. First phase In First phase the user creates the rule for classification. Rules are nothing, but the keywords/phrases that occur in mails for respective legitimate or spam mails. Then this rule is stored in database as a set of tokens for classification. B. Second Phase In Second phase can be called as training phase. Here the classifier will be trained using a spam and legitimate s manually by the user. Then with the help of algorithm the keywords are extracted from classified mails, and this keyword is stored as the set of token in database.

5 C. Third Phase One the first and second phase are completed, we now move to next phase for classifying the s by given algorithm, using this knowledge of tokens, the filter classifies every new incoming . Here the probability of maximum keyword match is calculated and the status of a new is confirmed as spam or ham mail, all its tokens are also calculated and then updated in the database. This self-learning functionality of our filter makes it unique among all other available spam filters. Even if the filter misclassifies any message, the user can rectify it and the spam filter would update its database accordingly. Below Figure 3 shows the block diagram. Fig. 3 Proposed spam filtering system with three phase technique. D. Platform To implement a system, we have used Visual Studio 2010 as front end, SQL Server 2008 as database for storing data and supported Operating System are WINDOWS XP & its above versions. IV. RESULTS A. Data Set For our project we used total s out of which 1250 legitimate messages and spam messages. The dataset message collection is, available on B. Result The accuracy of classifier dependent on several parameters such as number of mails considered during training phase, threshold value of each spam/legitimate s, size of etc. The evaluation measures which used for testing process in our research work could be defined as follows: True Positive (TP): no. of spam s correctly classified True Negative (TN): no. of ham s correctly classified. False Positive (FP): no. of spam s classified as ham. False Negative (FN): no. of ham classified as spam. Overall Accuracy (TP+TN) / (TP+TN+FP+FN): is percentage of prediction that is correct. Recall Rate TP/ (TP+FN): is percentage of positive label instance is predicted as positive. Precision TP/ (TP+FP): is percentage of positive prediction is correct. Table 3. Performance results of classifier based on proposed algorithm Dataset Recall Precision Accuracy Spam Ham The dataset is the number of spam and ham/ legitimate manually classified during the learning phase. Note the accuracy is also dependent on properly training the s during learning phase as the spam and legitimate mails may differ from organization to organization or person to person. We also observed that when two different categories (spam/ legitimate) have many keywords in common, the classifier accuracy is low. With distinct keywords for each categories (spam/ legitimate) the overlapping is minimum which lead to increase in classifier accuracy above 0.9. The results also show that, if we have large training data the accuracy increases. V. CONCLUSIONS In this Dissertation, we consider the requirement of improving the efficiency of filtering techniques based on Naïve Bayesian, which is a good machine learning algorithm. The project we concentrate on text words from subject and 2014, IJARCSSE All Rights Reserved Page 277

6 message body. The results show that our approach to classify s is a reasonable and effective one. However, there are lots of enhancements to be done in future. The future enhancement includes working with other languages than English, classifying the s depending on their header, the type of legitimate s such as important, social network, personal etc, depending on attachments, check for malicious code in , understanding the text and so on. ACKNOWLEDGMENT The Author Savita Pundalik Teli would like to acknowledge the contribution of Mr. Prof. Santoshkumar Birdar for his valuable suggestions and guidance. His extremely successful professional life has been a strong motivating factors in pursuit my PG. He is not only a great advisor but also a caring mentor during my PG. His strong dedication, amazing energy, and generosity will continue to be the source of inspiration to me. REFERENCES [1] A. Dasgupta, P. Drineas, and B. Harb, Feature Selection Methods for Text Classification, KDD 07, ACM, [2] F. Sebastiani, Text categorization, Alessandro Zanasi(ed.) Text Mining and its Applications, WIT Press, Southampton, UK, pp , [3] A. Khan, B. Baharudin, L. H. Lee, and K. Khan, A Review of Machine Algorithms for Text- Documents Classification, Journal of Advances Information Technology, vol. 1, [4] F. Sebastiani, Machine in Automated Text Categorization, ACM [5] D. Vira, P. Raja, and S. Gada, An Approach to Classification using Bayesian Theorem, GJCST. (USA), vol. 12, Issue 13, ver. 1.0, [6] J. Han, M. Kamber, and J. Pei, (2011, July 6) Data Mining Concepts and Techniques (3rd ed.). The Morgan Kaufmann Series in Data Management Systems. [7] A. Patra, K.Mandal, S.Roy, S.Sau and S. Kunar, An Efficient Spam Filtering Techniques for Account, American Journal of Engineering Research, vol. 02, Issue 10, pp , 2013 [8] F. Temitayo, O. Stephen, and A. Abimbola, Hybrid GA-SVM for Efficient Feature Selection in Classification, IISTE, vol. 3, no. 3, [9] M. Chang, C.K Poon, Using Phrases as Features In Classification, ELSEVIER,vol.82, 2009, pp [10] V. P. Deshpande, R. F. Erbacher, and C. Harris, An Evaluation of Naive Bayesian Anti-Spam Filtering Techniques Proc. of IEEE, [11] S. Youn and D. McLeod, Efficient Spam Filtering using Adaptive Ontology, International Conference on Information Technology (ITNG'07), pp , [12] S. Hershkop, and J. Stolfo, Combining models for false positive reduction, Proc of KDD 05 of ACM. Chicago. [s.n.], pp , [13] S. J. Delany, P. Cunningham, and B. Smyth, ECUE: A spam filter that uses machine learning to track concept drift, Proceedings of the 17th European Conference on Artificial Intelligence (PAIS stream), pp , [14] Y. Yang, An Evolution of statistical Approaches to Text Categorization, Information Retrieval 1, [15] Li, Y. H, and A. K. Jain, Classification of text documents. The Computer Journal, [16] L. S. Larkey, and W. B. Croft, Combining classifiers in text categorization. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (Zurich, CH, 1996), pp [17] O. Zaiane, and M. Antonie, Text Document Categorization by Term Association, Proceedings of ICDM 2002, IEEE,, pp [18] S. Buddeewong1, and W. Kreesuradej, A New Association Rule-Based Text Classifier Algorithm, Proceedings of the 17th IEEE International Conference on Tools with Artificial Intelligence, [19] S.M.Kamruzzaman,and M.R.Chowdhury, Text Categorization using Association Rule and Naive Bayes Classifier CoRR, [20] K. Aas, and L. Eikvil, Text Categorization: A Survey Report No ISBN , June, [21] K. Tretyakov, Machine Techniques in Spam Filtering, Technical report, Institute of Computer Science, University of Tartu, [22] K. A. Vidhya, and G. Aghila, A Survey of Naive Bayes Machine approach in Text Document Classification, (IJCSIS) International Journal of Computer Science and Information Security, Vol. 7, [23] A. McCallum, and K. Nigam, "A Comparison of Event Models for Naive Bayes Text Classification". AAAI/ ICML -98 Workshop on for Text Categorization. [24] K. Sang- Bum, et al, Some Effective Techniques for Naive Bayes Text Classification IEEE Transactions on Knowledge and Data Engineering, Vol. 18, November [25] S. Yirong, and J. Jing, Improving the Performance of Naive Bayes for Text Classification CS224N Spring [26] M. J. Pazzani, Searching for dependencies in Bayesian classifiers Proceedings of the Fifth Int. workshop on AI and, Statistics. Pearl, , IJARCSSE All Rights Reserved Page 278

### An Efficient Spam Filtering Techniques for Email Account

American Journal of Engineering Research (AJER) e-issn : 2320-0847 p-issn : 2320-0936 Volume-02, Issue-10, pp-63-73 www.ajer.org Research Paper Open Access An Efficient Spam Filtering Techniques for Email

### A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters

2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters Wei-Lun Teng, Wei-Chung Teng

### Savita Teli 1, Santoshkumar Biradar 2

Effective Spam Detection Method for Email Savita Teli 1, Santoshkumar Biradar 2 1 (Student, Dept of Computer Engg, Dr. D. Y. Patil College of Engg, Ambi, University of Pune, M.S, India) 2 (Asst. Proff,

### Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different

### Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier

International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing

### Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

### A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences

### A Proposed Algorithm for Spam Filtering Emails by Hash Table Approach

International Research Journal of Applied and Basic Sciences 2013 Available online at www.irjabs.com ISSN 2251-838X / Vol, 4 (9): 2436-2441 Science Explorer Publications A Proposed Algorithm for Spam Filtering

### Bayesian Spam Filtering

Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary amaobied@ucalgary.ca http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating

### Detecting E-mail Spam Using Spam Word Associations

Detecting E-mail Spam Using Spam Word Associations N.S. Kumar 1, D.P. Rana 2, R.G.Mehta 3 Sardar Vallabhbhai National Institute of Technology, Surat, India 1 p10co977@coed.svnit.ac.in 2 dpr@coed.svnit.ac.in

### IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT

IMPROVING SPAM EMAIL FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT M.SHESHIKALA Assistant Professor, SREC Engineering College,Warangal Email: marthakala08@gmail.com, Abstract- Unethical

### Machine Learning in Spam Filtering

Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.

### A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

### Feature Subset Selection in E-mail Spam Detection

Feature Subset Selection in E-mail Spam Detection Amir Rajabi Behjat, Universiti Technology MARA, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012 Feature

### Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

### Spam detection with data mining method:

Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification Keywords: ensemble learning, SVM classifier,

### Behavior Analysis of SVM Based Spam Filtering Using Various Kernel Functions and Data Representations

ISSN: 2278-181 Vol. 2 Issue 9, September - 213 Behavior Analysis of SVM Based Spam Filtering Using Various Kernel Functions and Data Representations Author :Sushama Chouhan Author Affiliation: MTech Scholar

### Towards better accuracy for Spam predictions

Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 czhao@cs.toronto.edu Abstract Spam identification is crucial

### T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

### Mining a Corpus of Job Ads

Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department

### Impact of Feature Selection Technique on Email Classification

Impact of Feature Selection Technique on Email Classification Aakanksha Sharaff, Naresh Kumar Nagwani, and Kunal Swami Abstract Being one of the most powerful and fastest way of communication, the popularity

### Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

### SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2

International Journal of Computer Engineering and Applications, Volume IX, Issue I, January 15 SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2

### SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING

I J I T E ISSN: 2229-7367 3(1-2), 2012, pp. 233-237 SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING K. SARULADHA 1 AND L. SASIREKA 2 1 Assistant Professor, Department of Computer Science and

### WE DEFINE spam as an e-mail message that is unwanted basically

1048 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999 Support Vector Machines for Spam Categorization Harris Drucker, Senior Member, IEEE, Donghui Wu, Student Member, IEEE, and Vladimir

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### Image Content-Based Email Spam Image Filtering

Image Content-Based Email Spam Image Filtering Jianyi Wang and Kazuki Katagishi Abstract With the population of Internet around the world, email has become one of the main methods of communication among

### Spam Filtering using Naïve Bayesian Classification

Spam Filtering using Naïve Bayesian Classification Presented by: Samer Younes Outline What is spam anyway? Some statistics Why is Spam a Problem Major Techniques for Classifying Spam Transport Level Filtering

### Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information

Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information Technology : CIT 2005 : proceedings : 21-23 September, 2005,

### IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION

http:// IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION Harinder Kaur 1, Raveen Bajwa 2 1 PG Student., CSE., Baba Banda Singh Bahadur Engg. College, Fatehgarh Sahib, (India) 2 Asstt. Prof.,

### AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM

ISSN: 2229-6956(ONLINE) ICTACT JOURNAL ON SOFT COMPUTING, APRIL 212, VOLUME: 2, ISSUE: 3 AN EFFECTIVE SPAM FILTERING FOR DYNAMIC MAIL MANAGEMENT SYSTEM S. Arun Mozhi Selvi 1 and R.S. Rajesh 2 1 Department

### International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

### Prediction of Heart Disease Using Naïve Bayes Algorithm

Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,

### An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

### An Efficient Two-phase Spam Filtering Method Based on E-mails Categorization

International Journal of Network Security, Vol.9, No., PP.34 43, July 29 34 An Efficient Two-phase Spam Filtering Method Based on E-mails Categorization Jyh-Jian Sheu Department of Information Management,

### Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007 Naïve Bayes Components ML vs. MAP Benefits Feature Preparation Filtering Decay Extended Examples

### An Approach to Detect Spam Emails by Using Majority Voting

An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H-12 Islamabad, Pakistan Usman Qamar Faculty,

### Naïve Bayesian Anti-spam Filtering Technique for Malay Language

Naïve Bayesian Anti-spam Filtering Technique for Malay Language Thamarai Subramaniam 1, Hamid A. Jalab 2, Alaa Y. Taqa 3 1,2 Computer System and Technology Department, Faulty of Computer Science and Information

### International Journal of Research in Advent Technology Available Online at: http://www.ijrat.org

IMPROVING PEFORMANCE OF BAYESIAN SPAM FILTER Firozbhai Ahamadbhai Sherasiya 1, Prof. Upen Nathwani 2 1 2 Computer Engineering Department 1 2 Noble Group of Institutions 1 firozsherasiya@gmail.com ABSTARCT:

### A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2

UDC 004.75 A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM E-MAIL FILTERING 1 2 I. Mashechkin, M. Petrovskiy, A. Rozinkin, S. Gerasimov Computer Science Department, Lomonosov Moscow State University,

### An incremental cluster-based approach to spam filtering

Available online at www.sciencedirect.com Expert Systems with Applications Expert Systems with Applications 34 (2008) 1599 1608 www.elsevier.com/locate/eswa An incremental cluster-based approach to spam

### CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance Shen Wang, Bin Wang and Hao Lang, Xueqi Cheng Institute of Computing Technology, Chinese Academy of

### Spam Filtering Based On The Analysis Of Text Information Embedded Into Images

Journal of Machine Learning Research 7 (2006) 2699-2720 Submitted 3/06; Revised 9/06; Published 12/06 Spam Filtering Based On The Analysis Of Text Information Embedded Into Images Giorgio Fumera Ignazio

### SVM Ensemble Model for Investment Prediction

19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of

### Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

### A Collaborative Approach to Anti-Spam

A Collaborative Approach to Anti-Spam Chia-Mei Chen National Sun Yat-Sen University TWCERT/CC, Taiwan Agenda Introduction Proposed Approach System Demonstration Experiments Conclusion 1 Problems of Spam

### Shafzon@yahool.com. Keywords - Algorithm, Artificial immune system, E-mail Classification, Non-Spam, Spam

An Improved AIS Based E-mail Classification Technique for Spam Detection Ismaila Idris Dept of Cyber Security Science, Fed. Uni. Of Tech. Minna, Niger State Idris.ismaila95@gmail.com Abdulhamid Shafi i

### Representation of Electronic Mail Filtering Profiles: A User Study

Representation of Electronic Mail Filtering Profiles: A User Study Michael J. Pazzani Department of Information and Computer Science University of California, Irvine Irvine, CA 92697 +1 949 824 5888 pazzani@ics.uci.edu

### International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

### Data Mining Solutions for the Business Environment

Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania ruxandra_stefania.petre@yahoo.com Over

### FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS

FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS Breno C. Costa, Bruno. L. A. Alberto, André M. Portela, W. Maduro, Esdras O. Eler PDITec, Belo Horizonte,

### Clustering Technique in Data Mining for Text Documents

Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

### On Attacking Statistical Spam Filters

On Attacking Statistical Spam Filters Gregory L. Wittel and S. Felix Wu Department of Computer Science University of California, Davis One Shields Avenue, Davis, CA 95616 USA Paper review by Deepak Chinavle

### 6367(Print), ISSN 0976 6375(Online) & TECHNOLOGY Volume 4, Issue 1, (IJCET) January- February (2013), IAEME

INTERNATIONAL International Journal of Computer JOURNAL Engineering OF COMPUTER and Technology ENGINEERING (IJCET), ISSN 0976-6367(Print), ISSN 0976 6375(Online) & TECHNOLOGY Volume 4, Issue 1, (IJCET)

### Decision Support System on Prediction of Heart Disease Using Data Mining Techniques

International Journal of Engineering Research and General Science Volume 3, Issue, March-April, 015 ISSN 091-730 Decision Support System on Prediction of Heart Disease Using Data Mining Techniques Ms.

### Spam Filtering based on Naive Bayes Classification. Tianhao Sun

Spam Filtering based on Naive Bayes Classification Tianhao Sun May 1, 2009 Abstract This project discusses about the popular statistical spam filtering process: naive Bayes classification. A fairly famous

### Customer Classification And Prediction Based On Data Mining Technique

Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor

### Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

### A Review of Data Mining Techniques

Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

### DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

### Combining Global and Personal Anti-Spam Filtering

Combining Global and Personal Anti-Spam Filtering Richard Segal IBM Research Hawthorne, NY 10532 Abstract Many of the first successful applications of statistical learning to anti-spam filtering were personalized

### Anti Spamming Techniques

Anti Spamming Techniques Written by Sumit Siddharth In this article will we first look at some of the existing methods to identify an email as a spam? We look at the pros and cons of the existing methods

### A Survey on Intrusion Detection System with Data Mining Techniques

A Survey on Intrusion Detection System with Data Mining Techniques Ms. Ruth D 1, Mrs. Lovelin Ponn Felciah M 2 1 M.Phil Scholar, Department of Computer Science, Bishop Heber College (Autonomous), Trichirappalli,

### A Novel Technique of Email Classification for Spam Detection

A Novel Technique of Email Classification for Spam Detection Vinod Patidar Student (M. Tech.), CSE Department, BUIT Divakar singh HOD, CSE Department, BUIT Anju Singh Assistant Professor, IT Department,

### Search Result Optimization using Annotators

Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

### Email Spam Detection Using Customized SimHash Function

International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email

### Email Classification Using Data Reduction Method

Email Classification Using Data Reduction Method Rafiqul Islam and Yang Xiang, member IEEE School of Information Technology Deakin University, Burwood 3125, Victoria, Australia Abstract Classifying user

### Tracking and Recognition in Sports Videos

Tracking and Recognition in Sports Videos Mustafa Teke a, Masoud Sattari b a Graduate School of Informatics, Middle East Technical University, Ankara, Turkey mustafa.teke@gmail.com b Department of Computer

### IJMIE Volume 2, Issue 8 ISSN: 2249-0558

An Adaptive Classification approach to filter spam E-mail using Vector Space Model Nakul Dave* Uttam Chauhan** Avani Dave*** Abstract The majority of previous studies of data mining have been concentrate

### Spam E-mail Detection by Random Forests Algorithm

Spam E-mail Detection by Random Forests Algorithm 1 Bhagyashri U. Gaikwad, 2 P. P. Halkarnikar 1 M. Tech Student, Computer Science and Technology, Department of Technology, Shivaji University, Kolhapur,

### Web Document Clustering

Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,

### eprism Email Security Appliance 6.0 Intercept Anti-Spam Quick Start Guide

eprism Email Security Appliance 6.0 Intercept Anti-Spam Quick Start Guide This guide is designed to help the administrator configure the eprism Intercept Anti-Spam engine to provide a strong spam protection

### Three-Way Decisions Solution to Filter Spam Email: An Empirical Study

Three-Way Decisions Solution to Filter Spam Email: An Empirical Study Xiuyi Jia 1,4, Kan Zheng 2,WeiweiLi 3, Tingting Liu 2, and Lin Shang 4 1 School of Computer Science and Technology, Nanjing University

### Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

### ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift

ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift Sarah Jane Delany 1 and Pádraig Cunningham 2 and Barry Smyth 3 Abstract. While text classification has been identified for some time

### Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA

### Decision Support System For A Customer Relationship Management Case Study

61 Decision Support System For A Customer Relationship Management Case Study Ozge Kart 1, Alp Kut 1, and Vladimir Radevski 2 1 Dokuz Eylul University, Izmir, Turkey {ozge, alp}@cs.deu.edu.tr 2 SEE University,

### Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Classification k-nearest neighbors Data Mining Dr. Engin YILDIZTEPE Reference Books Han, J., Kamber, M., Pei, J., (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann

### A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India sumit_13@yahoo.com 2 School of Computer

### Classifying and Identifying of Threats in E-mails Using Data Mining Techniques

Classifying and Identifying of Threats in E-mails Using Data Mining Techniques D.V. Chandra Shekar and S.Sagar Imambi Abstract E-mail has become one of the most ubiquitous methods of communication. The

### An Efficient Three-phase Email Spam Filtering Technique

An Efficient Three-phase Email Filtering Technique Tarek M. Mahmoud 1 *, Alaa Ismail El-Nashar 2 *, Tarek Abd-El-Hafeez 3 *, Marwa Khairy 4 * 1, 2, 3 Faculty of science, Computer Sci. Dept., Minia University,

### A Hybrid ACO Based Feature Selection Method for Email Spam Classification

A Hybrid ACO Based Feature Selection Method for Email Spam Classification KARTHIKA RENUKA D 1, VISALAKSHI P 2 1 Department of Information Technology 2 Department of Electronics and Communication Engineering

### Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

IOSR Journal of Computer Engineering (IOSRJCE) ISSN: 2278-0661, ISBN: 2278-8727 Volume 6, Issue 5 (Nov. - Dec. 2012), PP 36-41 Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

### Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

### Fig. 1 A typical Knowledge Discovery process [2]

Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering

### Spam Filtering and Removing Spam Content from Massage by Using Naive Bayesian

www..org 104 Spam Filtering and Removing Spam Content from Massage by Using Naive Bayesian 1 Abha Suryavanshi, 2 Shishir Shandilya 1 Research Scholar, NIIST Bhopal, India. 2 Prof. (CSE), NIIST Bhopal,

### Cosdes: A Collaborative Spam Detection System with a Novel E-Mail Abstraction Scheme

IJCSET October 2012 Vol 2, Issue 10, 1447-1451 www.ijcset.net ISSN:2231-0711 Cosdes: A Collaborative Spam Detection System with a Novel E-Mail Abstraction Scheme I.Kalpana, B.Venkateswarlu Avanthi Institute

### BIG DATA IN HEALTHCARE THE NEXT FRONTIER

BIG DATA IN HEALTHCARE THE NEXT FRONTIER Divyaa Krishna Sonnad 1, Dr. Jharna Majumdar 2 2 Dean R&D, Prof. and Head, 1,2 Dept of CSE (PG), Nitte Meenakshi Institute of Technology Abstract: The world of

### A Case-Based Approach to Spam Filtering that Can Track Concept Drift

A Case-Based Approach to Spam Filtering that Can Track Concept Drift Pádraig Cunningham 1, Niamh Nowlan 1, Sarah Jane Delany 2, Mads Haahr 1 1 Department of Computer Science, Trinity College Dublin 2 School

### Bayesian Spam Detection

Scholarly Horizons: University of Minnesota, Morris Undergraduate Journal Volume 2 Issue 1 Article 2 2015 Bayesian Spam Detection Jeremy J. Eberhardt University or Minnesota, Morris Follow this and additional

### Survey of Spam Filtering Techniques and Tools, and MapReduce with SVM

Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 11, November 2013,

### A survey on Data Mining based Intrusion Detection Systems

International Journal of Computer Networks and Communications Security VOL. 2, NO. 12, DECEMBER 2014, 485 490 Available online at: www.ijcncs.org ISSN 2308-9830 A survey on Data Mining based Intrusion

### A Secured Approach to Credit Card Fraud Detection Using Hidden Markov Model

A Secured Approach to Credit Card Fraud Detection Using Hidden Markov Model Twinkle Patel, Ms. Ompriya Kale Abstract: - As the usage of credit card has increased the credit card fraud has also increased

### Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 2 February, 2014 Page No. 3951-3961 Bagged Ensemble Classifiers for Sentiment Classification of Movie

### Addressing the Class Imbalance Problem in Medical Datasets

Addressing the Class Imbalance Problem in Medical Datasets M. Mostafizur Rahman and D. N. Davis the size of the training set is significantly increased [5]. If the time taken to resample is not considered,

### INVESTIGATIONS INTO EFFECTIVENESS OF GAUSSIAN AND NEAREST MEAN CLASSIFIERS FOR SPAM DETECTION

INVESTIGATIONS INTO EFFECTIVENESS OF AND CLASSIFIERS FOR SPAM DETECTION Upasna Attri C.S.E. Department, DAV Institute of Engineering and Technology, Jalandhar (India) upasnaa.8@gmail.com Harpreet Kaur

### Unmasking Spam in Email Messages

Unmasking Spam in Email Messages Anjali Sharma 1, Manisha 2, Dr. Manisha 3, Dr. Rekha Jain 4 Abstract: Today e-mails have become one of the most popular and economical forms of communication for Internet