INVESTIGATIONS INTO EFFECTIVENESS OF GAUSSIAN AND NEAREST MEAN CLASSIFIERS FOR SPAM DETECTION



Upasna Attri
C.S.E. Department, DAV Institute of Engineering and Technology, Jalandhar (India)
upasnaa.8@gmail.com

Harpreet Kaur
Associate Professor & HOD, C.S.E. Dept, DAV Institute of Engineering and Technology, Jalandhar (India)
preetbajaj23@yahoo.co.in

Satinder Pal
Associate Professor & HOD, C.S.E. & I.T. Dept, Indo Global Engineering College, Abhipur, Mohali (India)
astinderpal@gmail.com

ISSN: 0976-5166, Vol. 2 No. 1

Abstract

This paper presents the results of investigations into the effectiveness of Gaussian and Nearest Mean classifiers for SPAM detection. The results are presented as traces of the probability of error and of the time taken for classification. Since SPAM is becoming increasingly difficult to detect, these automated techniques will help save much of the time and resources required to handle email messages.

Key words: Gaussian Mean, KDD, Nearest Mean, SPAM

Introduction

The knowledge discovery and data mining (KDD) field draws on findings from statistics, databases, and artificial intelligence to construct tools that let users gain insight from massive data sets. People in business, science [1], medicine [2-4], academia, and government collect such data sets, and several commercial packages now offer general-purpose KDD tools. An important KDD goal is to turn data into knowledge. For example, knowledge acquired through such methods on a medical database could be published in a medical journal. Knowledge acquired from analyzing a financial or marketing database could revise business practice and influence a management school's curriculum. In addition, some US laws require reasons for rejecting a loan application, which knowledge from KDD could provide. Occasionally, however, the learned decision criteria must be explained to a court, as in the recent lawsuit Blue Mountain filed against Microsoft for a mail filter that classified electronic greeting cards as spam mail [5].

In one early KDD success story, Robert Evans and Doug Fisher analyzed data from a printing press, found conditions under which the press failed, and identified rules to avoid these failures [6,7]. The increasing affordability of storage capacity, the associated growth in the volumes of data being stored, and the mounting recognition of the value of temporal data (as well as the usefulness of temporal databases and temporal data modeling) have opened the prospect of mining temporal rules from both static and longitudinal data. Data mining can itself be viewed as the application of artificial intelligence and statistical techniques to the increasing quantities of data held in large, more or less structured data sets such as databases, and temporal data mining is an extension of this work [8]. The knowledge discovery process comprises business understanding, identifying data requirements, data preparation, modeling, evaluation, and deployment [9-11]. Many methods have been proposed to increase the efficiency of data mining algorithms [12-13]. Although many applications have been developed in business, science [14-15], and medicine [16-17], few have been applied to controlling spam on the internet. The few attempts made so far are experimental, time-consuming, and expensive to conduct [18-19].

Materials and Methods

Matlab was used as the programming tool for this simulation experiment. Random samples were generated for each class of email, and the samples of each class were randomly partitioned into two equal-sized sets to form a training set and a test set for that class. For each class, the parameters of the Gaussian density function were estimated from the corresponding training set, and these estimates were used to determine the Gaussian discriminant function. The Gaussian classifier for the spam problem was then built and used to classify the test samples of each class. For each case, the probability of classification error (POE) was estimated and the time taken to classify was measured. The nearest mean classifier was then implemented and used to classify the same test samples, and its POE and classification time were obtained in the same way. Finally, the two methods were compared for effectiveness against spam on the basis of POE and time taken to classify.

Results and discussion

In the first iteration, 50 email messages were generated and classified using the Gaussian and nearest mean methods. The plot shows the variation of the probability of error. The maximum POE is almost .93 for the nearest mean method, and the POE of the Gaussian method is generally lower than that of the nearest mean method. In some instances, however, the POE of the Gaussian method is higher, namely at the 25th and 35th email messages (Fig. 1).

Fig. 1: Variation of the probability of error for 50 emails.

In the second iteration, 100 email messages were generated and classified using the Gaussian and nearest mean methods. The maximum POE is almost .97 for the nearest mean method, and the POE of the Gaussian method is generally lower than that of the nearest mean method. In some instances, however, the POE of the Gaussian method is higher, namely at the 3th and 7th email messages (Fig. 2).
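The paper does not write out the two decision rules behind these traces. As a reference sketch, assuming the standard textbook forms (the symbols $\hat{\mu}_i$, $\hat{\Sigma}_i$ and $P(\omega_i)$ are notation introduced here, not the authors'), the Gaussian discriminant for class $i$, the two classifiers, and the POE estimate on the test set can be written as:

```latex
% Gaussian discriminant for class i, with mean and covariance
% estimated from that class's training set (assumed standard form)
g_i(x) = -\tfrac{1}{2}\,(x - \hat{\mu}_i)^{\top} \hat{\Sigma}_i^{-1} (x - \hat{\mu}_i)
         - \tfrac{1}{2}\ln\lvert \hat{\Sigma}_i \rvert + \ln P(\omega_i)

% Gaussian classifier: pick the class with the largest discriminant
\hat{\omega}_{\mathrm{Gauss}}(x) = \arg\max_i \; g_i(x)

% Nearest mean classifier: pick the class whose estimated mean is closest
\hat{\omega}_{\mathrm{NM}}(x) = \arg\min_i \; \lVert x - \hat{\mu}_i \rVert^{2}

% Empirical probability of error on the test set
\widehat{\mathrm{POE}} = \frac{\text{number of misclassified test samples}}{\text{number of test samples}}
```

When every class shares the covariance $\sigma^2 I$ and the priors are equal, maximizing $g_i$ is the same as minimizing the distance to the class mean, so under these forms any gap between the two traces would come from unequal covariances or priors.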

Fig. 2: Variation of the probability of error for 100 emails.

In the third iteration, 150 email messages were generated and classified using the Gaussian and nearest mean methods. The maximum POE is almost .95 for the nearest mean method, and the POE of the Gaussian method is generally lower than that of the nearest mean method. In some instances, however, the POE of the Gaussian method is higher, namely at the 4th and 14th email messages (Fig. 3).

Fig. 3: Variation of the probability of error for 150 emails.

In the fourth iteration, 200 email messages were generated and classified using the Gaussian and nearest mean methods. The maximum POE is almost .19 for the nearest mean method, and the POE of the Gaussian method is generally lower than that of the nearest mean method. In some instances, however, the POE of the Gaussian method is higher, namely at the 2th and 5th email messages (Fig. 4).

Fig. 4: Variation of the probability of error for 200 emails.

In the next iteration, 250 email messages were generated and classified using the Gaussian and nearest mean methods. The maximum POE is almost .11 for the nearest mean method, and the POE of the Gaussian method is generally lower than that of the nearest mean method. In some instances, however, the POE of the Gaussian method is higher, namely at the 12th and 24th email messages (Fig. 5).
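The POE comparison in these iterations follows the procedure given under Materials and Methods. A minimal sketch of that procedure is given here in Python rather than the Matlab used by the authors, and it rests on several assumptions that are not in the paper: two numeric features per email, independent features (a diagonal covariance), equal priors, and made-up class parameters. All function names are hypothetical.

```python
import math
import random

def make_samples(rng, mean, std, n):
    """Draw n feature vectors with independent Gaussian features."""
    return [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(n)]

def split_half(rng, samples):
    """Randomly partition samples into equal-sized training and test sets."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def fit_gaussian(train):
    """Estimate per-feature mean and variance (diagonal covariance assumed)."""
    n, d = len(train), len(train[0])
    mean = [sum(x[j] for x in train) / n for j in range(d)]
    var = [sum((x[j] - mean[j]) ** 2 for x in train) / (n - 1) for j in range(d)]
    return mean, var

def gaussian_discriminant(x, mean, var, prior):
    """Log of prior times Gaussian density, without the constant term."""
    g = math.log(prior)
    for xj, mj, vj in zip(x, mean, var):
        g -= 0.5 * math.log(vj) + (xj - mj) ** 2 / (2.0 * vj)
    return g

def classify_gaussian(x, params):
    """Return the class with the largest Gaussian discriminant."""
    return max(params, key=lambda c: gaussian_discriminant(x, *params[c]))

def classify_nearest_mean(x, params):
    """Return the class whose estimated mean is closest to x."""
    def sq_dist(c):
        return sum((xj - mj) ** 2 for xj, mj in zip(x, params[c][0]))
    return min(params, key=sq_dist)

def probability_of_error(classifier, params, test_sets):
    """Fraction of test samples assigned to the wrong class."""
    errors = sum(classifier(x, params) != label
                 for label, xs in test_sets.items() for x in xs)
    total = sum(len(xs) for xs in test_sets.values())
    return errors / total

def run_experiment(n_per_class=200, seed=0):
    rng = random.Random(seed)
    # Hypothetical (mean, std) per class; the paper does not report these.
    spec = {"spam": ([2.0, 2.0], [2.0, 2.0]), "ham": ([0.0, 0.0], [0.5, 0.5])}
    params, test_sets = {}, {}
    for label, (mean, std) in spec.items():
        train, test = split_half(rng, make_samples(rng, mean, std, n_per_class))
        m, v = fit_gaussian(train)
        params[label] = (m, v, 1.0 / len(spec))  # equal priors assumed
        test_sets[label] = test
    return (probability_of_error(classify_gaussian, params, test_sets),
            probability_of_error(classify_nearest_mean, params, test_sets))

poe_gaussian, poe_nearest = run_experiment()
print(f"Gaussian POE: {poe_gaussian:.3f}, nearest-mean POE: {poe_nearest:.3f}")
```

The two classes are given different variances on purpose: with equal variances and priors the two rules coincide, so the sketch would show no difference between them.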

Fig. 5: Variation of the probability of error for 250 emails.

In the next experiment, email messages were generated and classified using the Gaussian and nearest mean methods, and the time taken to classify was plotted (Fig. 6).

Fig. 6: Evaluation of the Gaussian and nearest mean classifiers for filtering spam mails on a time scale. Plot title: Performance Evaluation of Gaussian and Nearest Mean Classification Methods For Filtering SPAM Mails. Axes: time in sec against number of emails.

Conclusion and future scope

The above iterations show that the Gaussian method gives better performance most of the time, with a lower POE than the nearest mean method. The nearest mean method occasionally produced a lower POE, but such instances are rare. The traces of the time taken by the classifiers show that at low volumes both classifiers consume equal time, but as the load of emails increases the Gaussian classifier takes more time than the nearest mean method. Since spam detection gives more weight to accuracy than to the time taken to classify, it can be concluded that, in fighting cyber crime, Gaussian classification is better than the nearest mean method at classifying email messages as phishing emails.

Acknowledgments

The authors are grateful to Indo Global Engineering College, Abhipur, Mohali (India) for providing continuous support throughout the work.

References

1. Zhang M, Zhang, Tjandra D, Wong STC. (2008): DVMap: a space conscious data visualization and space discovery framework for biomedical data warehouse. IEEE Transactions on Information Technology in Biomedicine, 8(3), pp. 343-353.
2. Chia HWK, Tan CL, Sung SY. (2006): Enhancing knowledge discovery via association based evolution of neural logic networks. IEEE Transactions on Knowledge and Data Engineering, 18(7), pp. 889-91.
3. Jinwook S, Shneiderman B, Sung SY. (2006): Knowledge discovery in high dimensional data: case studies and a user survey for the rank by feature framework. IEEE Transactions on Visualization and Computer Graphics, 12(3), pp. 311-322.
4. Gong X, Nakamura K, Hua Y, Kei Y, Nobuhuro G. (2007): BAAQ: An infrastructure for application integration and knowledge discovery in bioinformatics. IEEE Transactions on Information Technology in Biomedicine, 11(4), pp. 428-434.
5. Caulfield B. Hallmark without the Revenue? Internet World, 22 Feb. 1999.
6. Evans R, Fisher D. (1994): Overcoming process delays with decision tree induction. IEEE Expert, 9(1), pp. 6-66.
7. Pazzani MJ. (2000): Knowledge discovery from data. IEEE Intelligent Systems and Their Applications, 15(2), pp. 1-12.
8. Longbing C, Chengqi Z. (2007): Domain driven actionable discovery. IEEE Intelligent Systems, 22(4), pp. 78-88.
9. Siromoney A, Raghuram L, Korah I, Prasad GNS. (2000): Inductive logic programming for knowledge discovery from MRI data. IEEE Engineering in Medicine and Biology Magazine, 19(4), pp. 72-77.
10. Bojarczuk CC, Lopes HS, Freitas AA. (2000): Genetic programming for knowledge discovery in chest pain diagnosis. IEEE Engineering in Medicine and Biology Magazine, 19(4), pp. 38-44.
11. Tsumoto S. (2000): Automated discovery of positive and negative knowledge in clinical databases. IEEE Engineering in Medicine and Biology Magazine, 19(4), pp. 56-62.
12. Cios KJ, Pedrycz W, Swiniarski RM. (1998): Data mining methods for knowledge discovery. IEEE Transactions on Neural Networks, 9(6), pp. 1533-1534.
13. Limin Fu. (1999): Knowledge discovery by inductive neural networks. IEEE Transactions on Knowledge and Data Engineering, 11(6), pp. 992-998.
14. Pieper J, Srinivasan S, Dom B. (2001): Streaming media knowledge discovery. Computer, 34(9), pp. 68-74.
15. Roddick JF, Spiliopoulou M. (2002): A survey of temporal knowledge discovery paradigms and methods. IEEE Transactions on Knowledge and Data Engineering, 14(4), pp. 75-767.
16. Kulkarni A, McCaslin S. (2004): Knowledge discovery from multi spectral satellite images. IEEE Geoscience and Remote Sensing Letters, 1(4), pp. 246-25.
17. Castro ARG, Miranda V. (2005): Knowledge discovery in neural networks with application to transformer failure diagnosis. IEEE Transactions, 2(2), pp. 717-724.
18. Drucker H, Donghui W, Vapnik VN. (1999): Support vector machines for spam categorization. IEEE Transactions, 1(5), pp. 148-154.
19. Hoanca B. (2006): How good are our weapons in spam wars? IEEE Technology and Society Magazine, 25(1), pp. 22-3.
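The timing comparison reported in Fig. 6 can be approximated along the following lines. This is only a sketch in Python (the authors used Matlab), assuming two features per email, a diagonal covariance, equal priors, and hypothetical fitted parameters and batch sizes; none of these values come from the paper, and absolute timings will differ from machine to machine.

```python
import math
import random
import time

def gaussian_label(x, params):
    """Pick the class with the largest Gaussian log-discriminant.

    Diagonal covariance and equal priors are assumed, so the prior term
    is omitted from the score.
    """
    def score(c):
        mean, var = params[c]
        return -sum(0.5 * math.log(v) + (xj - m) ** 2 / (2.0 * v)
                    for xj, m, v in zip(x, mean, var))
    return max(params, key=score)

def nearest_mean_label(x, params):
    """Pick the class whose mean is closest in squared Euclidean distance."""
    return min(params, key=lambda c: sum((xj - m) ** 2
                                         for xj, m in zip(x, params[c][0])))

def time_classifier(classifier, batch, params):
    """Return wall-clock seconds needed to classify every message in batch."""
    start = time.perf_counter()
    for x in batch:
        classifier(x, params)
    return time.perf_counter() - start

def timing_trace(batch_sizes, seed=0):
    """Time both classifiers on randomly generated batches of each size."""
    rng = random.Random(seed)
    # Hypothetical fitted parameters: class -> (means, variances).
    params = {"spam": ([2.0, 2.0], [4.0, 4.0]),
              "ham": ([0.0, 0.0], [0.25, 0.25])}
    rows = []
    for n in batch_sizes:
        batch = [[rng.gauss(1.0, 2.0), rng.gauss(1.0, 2.0)] for _ in range(n)]
        rows.append((n,
                     time_classifier(gaussian_label, batch, params),
                     time_classifier(nearest_mean_label, batch, params)))
    return rows

for n, t_gauss, t_near in timing_trace([100, 200, 400, 800]):
    print(f"{n:5d} emails: Gaussian {t_gauss:.4f} s, nearest mean {t_near:.4f} s")
```

The Gaussian rule evaluates a logarithm and a division per feature on top of the squared difference that the nearest mean rule needs, which is one plausible reason for it taking longer as the number of emails grows.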