Feature Subset Selection in E-mail Spam Detection



Similar documents
A Content based Spam Filtering Using Optical Back Propagation Technique

Impact of Feature Selection Technique on Classification

T : Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari :

FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS

Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools

An Approach to Detect Spam s by Using Majority Voting

Spam Filtering using Naïve Bayesian Classification

Chapter 6. The stacking ensemble approach

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing Classifier

Detecting Spam Using Spam Word Associations

Towards better accuracy for Spam predictions

A new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique

Towards applying Data Mining Techniques for Talent Mangement

Spam Classification With Artificial Neural Network and Negative Selection Algorithm

SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING

Classification Using Data Reduction Method

Data Mining. Nonlinear Classification

Machine Learning Final Project Spam Filtering


Dr. D. Y. Patil College of Engineering, Ambi,. University of Pune, M.S, India University of Pune, M.S, India

SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2

Bayesian Spam Filtering

1. Classification problems

Mining Product Reviews for Spam Detection Using Supervised Technique

** The High Institute of Zahra for Comperhensive Professions, Zahra-Libya

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems

Social Media Mining. Data Mining Essentials

Machine Learning in Spam Filtering

Data Mining Algorithms Part 1. Dejan Sarka

A Victimization Optical Back Propagation Technique in Content Based Mostly Spam Filtering

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

An Efficient Spam Filtering Techniques for Account

Learning to classify

A MACHINE LEARNING APPROACH TO SERVER-SIDE ANTI-SPAM FILTERING 1 2

Survey of Spam Filtering Techniques and Tools, and MapReduce with SVM

Keywords data mining, prediction techniques, decision making.

Predict Influencers in the Social Network

Keywords - Algorithm, Artificial immune system, Classification, Non-Spam, Spam

A Dynamic Flooding Attack Detection System Based on Different Classification Techniques and Using SNMP MIB Data

Spam? Not Any More! Detecting Spam s using neural networks

Predictive Data modeling for health care: Comparative performance study of different prediction models

A survey on Data Mining based Intrusion Detection Systems

Performance Evaluation of Intrusion Detection Systems using ANN

Improving spam mail filtering using classification algorithms with discretization Filter

Machine learning for algo trading

Evaluation of Feature Selection Methods for Predictive Modeling Using Neural Networks in Credits Scoring

An Efficient Three-phase Spam Filtering Technique

Spam detection with data mining method:

Supervised Learning (Big Data Analytics)

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance

Face Recognition For Remote Database Backup System

REVIEW OF HEART DISEASE PREDICTION SYSTEM USING DATA MINING AND HYBRID INTELLIGENT TECHNIQUES

Data Mining - Evaluation of Classifiers

Using Biased Discriminant Analysis for Filtering

How To Detect Credit Card Fraud

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

Intrusion Detection using Artificial Neural Networks with Best Set of Features

Learning is a very general term denoting the way in which agents:

IMPROVING SPAM FILTERING EFFICIENCY USING BAYESIAN BACKWARD APPROACH PROJECT

Classification algorithm in Data mining: An Overview

CS 348: Introduction to Artificial Intelligence Lab 2: Spam Filtering

ARTIFICIAL INTELLIGENCE METHODS IN STOCK INDEX PREDICTION WITH THE USE OF NEWSPAPER ARTICLES

Mining a Corpus of Job Ads

Increasing the Accuracy of a Spam-Detecting Artificial Immune System

Spam Detection on Twitter Using Traditional Classifiers M. McCord CSE Dept Lehigh University 19 Memorial Drive West Bethlehem, PA 18015, USA

Bayesian Spam Detection

Supervised Feature Selection & Unsupervised Dimensionality Reduction

A Secured Approach to Credit Card Fraud Detection Using Hidden Markov Model

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.

SVM Ensemble Model for Investment Prediction

Lecture 13: Validation

CS 6220: Data Mining Techniques Course Project Description

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Tweaking Naïve Bayes classifier for intelligent spam detection

Gender Identification using MFCC for Telephone Applications A Comparative Study

Genetic Neural Approach for Heart Disease Prediction

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

SMS Spam Filtering Technique Based on Artificial Immune System

Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm

E-commerce Transaction Anomaly Classification

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

More Data Mining with Weka

Application of Data Mining based Malicious Code Detection Techniques for Detecting new Spyware

International Journal of Research in Advent Technology Available Online at:

Impact of Boolean factorization as preprocessing methods for classification of Boolean data

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Spam Filtering Based On The Analysis Of Text Information Embedded Into Images

Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class

Identifying Potentially Useful Header Features for Spam Filtering

FOREX TRADING PREDICTION USING LINEAR REGRESSION LINE, ARTIFICIAL NEURAL NETWORK AND DYNAMIC TIME WARPING ALGORITHMS

Lan, Mingjun and Zhou, Wanlei 2005, Spam filtering based on preference ranking, in Fifth International Conference on Computer and Information

Transcription:

Feature Subset Selection in E-mail Spam Detection Amir Rajabi Behjat, Universiti Technology MARA, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012

Feature Subset Selection in E-mail Spam Detection Introduction Problem Statement Objective Related Works Proposed Architecture Result and Discussion Conclusion

Introduction Today, e-mails have become a very common and convenient medium for our daily communications. Spam is an unsolicited commercial or bulk e-mails, or an uninterested e-mails. It has created a significant security problem for end users and big organizations through carrying malicious virus and phishing scams. There are many solutions available to act as a model for automatic spam detection. Most techniques are based on Machine Learning and data mining techniques such as classification, clustering and Genetic Algorithm that classify spam emails and select their relevant features (Sang, 2010). PAGE 3 "IT Security for the Next Generation", Asia Pacific & MEA Cup 14-16 March, 2012

Problem Statement Lack of useful and relevant features that can distinguish between spam and nonspam emails efficiently (Bilal, 2005). Increase data dimensionality that decreases accuracy (Ren, 2006). The reduction of classification accuracy within the current methods (Burim, 2005). A high growth of false positive in different spam detection systems (Ching, 2010). PAGE 4 "IT Security for the Next Generation", Россия и СНГ 9-11 Марта, 2011

Objective To find related and relevant features to distinguish between spam and non-spam emails and decrease data dimensionality. To increase classification accuracy that shows the accuracy of detection system and MLP classifier. To decrease false positive using GA as a feature selector and MLP classifier. PAGE 5 "IT Security for the Next Generation", Россия и СНГ 9-11 Марта, 2011

Related Works Author and Year Proposed System Ruan and Tan (2007) Matthew and Chung Keung (2009) Gaurav and Shashikala (2010) Lee, Kim and Park (2010) Ruan and Tan (2010) Classified spam and legitimate emails based on support Vector Machine (SVM) and select more relevant features using Genetic Algorithm. Selected features as a mathematical method and classified them based on k-nn and Naïve Bayes. The results of proposed detection system show an accuracy near to 98%. Detected spam emails based on different techniques such as analyzing email contents by spam keywords knowledge base, finding sender of emails, and cross validation without using feature selection. The lack of feature selection and high computational costs decrease accuracy of classifier. Developed an optimal spam detection model based on Random Forest (RF). This model is able to optimize RF parameters and select relevant features simultaneously. Selected features of spam emails such as words and symbols by artificial Immune system to detect spam emails based Artificial Neural Network classifier. The result of this study showed 99% accuracy and 0.2% error rate. PAGE 6 "IT Security for the Next Generation", Asia Pacific & MEA Cup 14-16 March, 2012

Detection System Architecture PAGE 7 "IT Security for the Next Generation", Россия и СНГ 9-11 Марта, 2011

Genetic Algorithm GA finds an optimal binary vector, where each bit is associated with a feature. If the ith bit of this vector = 1, then the ith feature is allowed to participate in classification; if the bit is = 0, then the corresponding feature does not participate. GA method applied in this study decreases the number of irrelevant features and high dimensionality. Thus, the number of features are decreased from 156 to 78. PAGE 8 "IT Security for the Next Generation", Россия и СНГ 9-11 Марта, 2011

The Fitness function of GA PAGE 9 "IT Security for the Next Generation", Россия и СНГ 9-11 Марта, 2011

Multi-Layer perceptron (MLP) Multi-layer perceptron is arranged in different layers, namely input, hidden and output layers. In fact, these layers are organized to minimize appropriate error functions and increase the accuracy of classification. This classifier not only increase the accuracy, but increase the speed of detection process. PAGE 10 "IT Security for the Next Generation", Россия и СНГ 9-11 Марта, 2011

Experiment Dataset Two benchmark corpora used to test our proposed email spam detection system based MLP classifier are the SpamAssassin corpus and LingSpam corpus. Name of Data base Spam Non-spam LingSpam 481 2412 SpamAssassin 3797 6954 PAGE 11 "IT Security for the Next Generation", Asia Pacific & MEA Cup 14-16 March, 2012

Performance Measurement Measure Accuracy Precision Recall Expression (TP+TN)/(TP+FP+FN+TN) TP/(TP+FP) TP/(TP+FN) TP = spam emails and identified as spam FN = spam emails but identified as non-spam TN = non-spam emails and identified as non-spam FP = non-spam emails but identified as spam PAGE 12 "IT Security for the Next Generation", Россия и СНГ 9-11 Марта, 2011

Result Table 1 illustrates the performance of GA based feature subset selection with MLP classifier in the spam detection system. 10 runs were used for each dataset to estimate the classifications. Other techniques from previous studies are compared to the proposed technique i.e MLP Table 1 Methods Accuracy (%) Recall (%) Precision (%) Linear Discriminant 98.55 91.49 99.77 SVM 97.09 97.38 97.02 BP Neural Network 99.13 95.84 98.93 NB 96.41 81.10 96.85 SVM-IG 96.85 81.90 99.0 MLP LingSpam 99.83 93.75 100 MLP SpamAssassin 99.81 100 99.48 PAGE 13 "IT Security for the Next Generation", Asia Pacific & MEA Cup 14-16 March 2012

Cont. Result PAGE 14 "IT Security for the Next Generation", Asia Pacific & MEA Cup 14-16 March, 2012

Conclusion The Genetic Algorithm (GA) as a feature selection algorithm is discussed in this study. This algorithm like an automated heuristic method selects significant features of spam emails to decline data-dimensionality and increase MLP classifier. The results of this paper show the performance near to 100% and 0.01 false positive rate. Integrated experiments of this paper are based on two benchmark corpora LingSpam and SpamAssassin to indicate the strength of GA in feature selection and the ability of MLP classifier for generating high accuracy and decrease the time of computing. PAGE 15 "IT Security for the Next Generation", Россия и СНГ 9-11 Марта, 2011

Thank You Amir Rajabi Behjat, Universiti Technology Mara, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012