The spam vs. non-spam e-mail classifier




A real-life case study about using PolyAnalyst for building high-precision classification models. A white paper written for Megaputer Intelligence Inc. by Valentin Militaru. Megaputer Intelligence, Inc., 120 West Seventh Street, Suite 310, Bloomington, IN 47404, USA. www.megaputer.com, +1.812.330.0110

Contents
General Context and Industry Problem
Task Definition
Some Methodological Considerations
Data Mining: Building the Classification Model
Data Inspection and Preprocessing
Algorithm Selection
Model Building (Data Mining)
Assessment of the Results
Conclusions
Further Applications
Acknowledgements

General Context and Industry Problem

The case study is based upon a real-life problem, which was presented in the Data Mining Cup 2003 Contest.

Isolating spam e-mails is a matter of critical importance in today's communication environment, and the quality of an e-mail filter is essentially determined by the quality of its classification algorithm. The present case study is based upon a real-life situation which generated the task for the Data Mining Cup 2003 Contest (www.data-mining-cup.de) that took place in Chemnitz, Germany.¹

The problem of unsolicited e-mails, most of them with doubtful content, is well known. It is estimated that about 25 million unwelcome e-mails (so-called spam mails) are sent each day, corresponding to 10 percent of all e-mails worldwide. Surveys conducted among the employees of many enterprises showed that, on average, 40 percent of their daily received e-mails are spam; for some employees, this share reaches up to 90 percent. In spite of corresponding laws, effective legal prosecution is nearly impossible. The receiver thus faces the problem of separating the important e-mails from the spam mails, which are often not recognizable at first sight. Since this is a time-consuming process, in recent years the wish to recognize spam e-mails and filter them out automatically has motivated software producers to build dedicated applications, equipped with more or less "artificial intelligence", that automate the classification task. The efficiency of these solutions varies widely; the key component of any good e-mail filter is its ability to classify e-mails on the basis of different measured and synthesized attributes, and thus to correctly separate spam from non-spam e-mails with high probability.

Task Definition

Starting from a training dataset of 8,000 e-mails, a model for accurately classifying another 11,177 e-mails had to be built.
During a program aiming at the optimization of the communication process, the management of a medium-sized German company observed that an exceedingly large part of all incoming e-mails were advertisement mails. The large amount of working time spent by the 120 employees on the daily sorting and deletion of spam e-mails revealed a high potential for rationalization. For this reason, all e-mails of the company were collected over some time and stored separately as spam and non-spam e-mails. Every e-mail was described by a set of attributes. The accepted error margin was 1% (misclassified non-spam e-mails). Finally, a list of the non-spam e-mails had to be delivered.

The task was to build a classification model for precisely identifying spam mails. In this respect, two datasets were considered:
a) a training dataset, containing information about 8,000 e-mails;
b) a dataset of 11,177 rows, upon which the classification model was to be applied.

The objective was to minimize the number of spam e-mails passing the filter, subject to the constraint that at most 1% of the non-spam e-mails were misclassified. As a result, a list had to be created containing the IDs of those e-mails which passed the filter (i.e. all e-mails recognized as potential non-spam).
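To make the contest objective concrete, it can be written as a small evaluation function. The sketch below is purely illustrative: `evaluate_filter`, `true_labels` and `passed_ids` are hypothetical names, not contest artifacts. It counts the spam mails that slipped through the filter and the share of genuine non-spam mails that were lost.

```python
def evaluate_filter(true_labels, passed_ids):
    """true_labels: dict mapping e-mail id -> 'spam' or 'non-spam';
    passed_ids: set of ids the filter let through as non-spam.
    Returns (spam mails that passed, share of non-spam mails lost)."""
    spam_passed = sum(1 for i in passed_ids if true_labels[i] == "spam")
    nonspam_total = sum(1 for lbl in true_labels.values() if lbl == "non-spam")
    nonspam_kept = sum(1 for i in passed_ids if true_labels[i] == "non-spam")
    nonspam_lost = nonspam_total - nonspam_kept
    return spam_passed, nonspam_lost / nonspam_total

# toy example with made-up data: one spam mail (id 4) slips through
labels = {1: "non-spam", 2: "spam", 3: "non-spam", 4: "spam"}
passed = {1, 3, 4}
spam_passed, lost_share = evaluate_filter(labels, passed)
# goal: minimize spam_passed while keeping lost_share within the margin
```
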

Some Methodological Considerations

Many data mining techniques implemented by PolyAnalyst could have been taken into consideration for solving this task: k-Nearest Neighbor, Artificial Neural Networks, Decision Trees, etc.

With its comprehensive base of data mining engines, PolyAnalyst can address a highly diverse suite of problems. Intuitively, classification tasks can be approached with the Memory Based Reasoning engine, which implements a k-Nearest Neighbor algorithm, or with the Decision Tree / Decision Forest engines. Nevertheless, it is known that, by data preprocessing, any classification task can be transformed into a prediction task and vice versa; therefore, one might assume that the Stepwise Linear Regression engine or the Artificial Neural Networks engine, which implements the PolyNet Predictor algorithm, could be employed too. The logic of selecting the appropriate data mining algorithm is presented later in this white paper. Still, an important observation is that the analyst deals with a supervised classification problem; therefore cluster-based or association analysis engines are practically useless for model building. On the other hand, solving the task required the use of both descriptive and predictive data mining techniques.

The project was developed in four steps:
i) data inspection and preprocessing
ii) algorithm selection
iii) model building (data mining)
iv) assessment of the results
Each of these steps is explained in detail further in this material.

Data Mining: Building the Classification Model

Data Inspection and Preprocessing

The two datasets used for training and testing the model comprised more than 16 million Boolean values (19,177 × 835 = 16,012,795).

Within the Data Mining Cup 2003 Contest, two datasets were provided: the training dataset, containing 8,000 records, and the classification dataset, containing 11,177 records (e-mails) which had to be classified as spam or non-spam.
According to the information provided by the organizers of the Data Mining Cup 2003 contest, both training and classification data were taken from the same sample and thus had the same class distribution. Strictly speaking, all data was drawn from a sample of 19,177 e-mails in total. The structure of both datasets was similar, with the exception of one categorical field (target) which was present only in the training dataset. Values for the target field were assigned based on past experience (target = "yes" → the record is a spam mail; target = "no" → the record is a non-spam mail). The two datasets had an unusually large number of attributes. Each record of the training dataset was described by 834 attributes, as follows:
- the first attribute (id) was numeric and provided the identification number of the current record in the dataset

- the last attribute (target) was used to classify the dataset into spam and non-spam records
- the other 832 attributes, all of Boolean type, provided various information about the records; these descriptive attributes were coded according to those of the popular open-source project SpamAssassin².

[Figure: Distribution of values for the "id" attribute for both spam and non-spam mails (Training_Dataset_Split_0 vs. Training_Dataset_Split_1). Note: only the instances in the training dataset were taken into consideration. The graphic clearly shows that the value range of the "id" attribute is correlated with the target attribute.]

PolyAnalyst's Summary Statistics engine revealed critical information about the redundant data, and an abnormal correlation between the id and the target attribute was revealed using PolyAnalyst's charting capabilities.

The first important piece of information to obtain about the data was which attributes were really relevant for classifying the records into the two categories, target = "yes" and target = "no". For this purpose, the Summary Statistics engine confirmed that, out of 833 Boolean attributes, only 544 were set to "1" at least once in the entire training dataset; therefore, only these provided useful information in the classification process. The other 289 attributes were constant across all 8,000 records of the training dataset, so their respective columns were accounted as redundant data and were not taken into consideration any further.

Another descriptive technique implemented in PolyAnalyst revealed additional valuable information about the training dataset. The histogram chart was used to view the distribution of values for the "id" attribute. For a clearer illustration, the training dataset was split into two subsets, one containing only spam e-mails (target = "yes") and one containing only non-spam e-mails (target = "no").
The graphic representation revealed a surprising distribution of values: all records having an id higher than 384140 were spam e-mails (target = "yes"). Therefore, the id attribute had to be taken into consideration when building the classification model. These preliminary analyses demonstrated that no data transformation was required; the data pre-processing effort was therefore reduced to selecting the attributes with at least minimal relevance to the classification process (544 descriptive attributes plus the id field).
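The constant-column screening described above is easy to reproduce outside PolyAnalyst. The following is a minimal sketch in pure Python with made-up attribute names (the real data had 832 SpamAssassin-coded Boolean columns), not the Summary Statistics engine itself:

```python
def informative_columns(rows, skip=("id", "target")):
    """Return the names of the columns that take more than one distinct
    value across the dataset; constant columns carry no information
    for classification and can be dropped."""
    names = [k for k in rows[0] if k not in skip]
    return [n for n in names if len({r[n] for r in rows}) > 1]

# toy training rows with hypothetical attribute names
rows = [
    {"id": 1, "attr_a": 0, "attr_b": 0, "target": "no"},
    {"id": 2, "attr_a": 1, "attr_b": 0, "target": "yes"},
    {"id": 3, "attr_a": 0, "attr_b": 0, "target": "no"},
]
keep = informative_columns(rows)   # attr_b is constant, so it is dropped
```

On the contest data this screening would keep the 544 attributes that are set to "1" at least once, discarding the 289 constant columns.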

Algorithm Selection

The Decision Tree engine was selected for building the classification model, due to its high flexibility.

As mentioned before, PolyAnalyst offers a powerful base of data mining engines. Yet the following characteristics of the challenge had to be considered when choosing the appropriate algorithm:
- the algorithm had to handle datasets with a large number of attributes very well
- it had to work with both Boolean and numeric attributes
- it had to be able to find a nonlinear solution
- the classification model had to be highly precise
- the algorithm had to take into consideration all the attributes indicated by the analyst, since all 544 + 1 attributes were identified as relevant in the previous step.

According to the above requirements, the only suitable algorithm for the data mining process was the Decision Tree. As stated in PolyAnalyst's documentation, the DT algorithm is well poised for analyzing very large databases, and its calculation time scales very well (grows only linearly) with an increasing number of data columns. Although k-Nearest Neighbor might also have looked like a potential solution, its computation time is proportional to the square of the number of records; thus that algorithm is not designed for exploring large datasets. Regarding the high number of attributes, it is important to note that PolyAnalyst, unlike other data mining tools, can handle datasets with more than 255 columns. This ability proved vital for this particular application.

Model Building (Data Mining)

The DT engine generated a complex structure, a 19-level tree with 48 leaves.
Applying the DT algorithm to the training dataset led to a classification model (tree) with the following characteristics:
- Number of non-terminal nodes: 47
- Number of leaves: 48
- Depth of constructed tree: 19

As expected, the DT rule, based on such a large number of attributes, generated a very complex tree structure, difficult to represent completely as a graph. In order to verify the accuracy of the model, the DT rule was tested on the training dataset (with all 834 attributes selected), with the following results:

The first testing of the model showed an encouraging total classification error of 0.9%.

- Total classification error: 0.90%
- Undefined prediction cases: 0.00%
- Classification probability: 99.10%
- Classification efficiency: 97.69%
- Classification error for class "Non-spam": 0.86%
- Classification error for class "Spam": 0.96%

According to this test, the situation of real vs. predicted cases was satisfactory: only 42 out of 4,888 non-spam mails were misclassified (0.86%), and only 30 out of 3,112 spam mails were accounted as non-spam (0.96%). The overall test error was therefore 72/8,000 cases, that is, 0.9%.
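The per-class error rates quoted above follow directly from the confusion counts. Here is a small cross-check in plain Python (the `class_errors` helper is ours for illustration, not a PolyAnalyst function; the correct-classification counts are derived from the reported totals of 4,888 non-spam and 3,112 spam mails):

```python
def class_errors(confusion):
    """confusion maps (actual, predicted) pairs to counts;
    returns per-class error rates and the overall error rate."""
    classes = {actual for actual, _ in confusion}
    errors = {}
    for c in classes:
        total = sum(n for (a, _), n in confusion.items() if a == c)
        wrong = sum(n for (a, p), n in confusion.items() if a == c and p != c)
        errors[c] = wrong / total
    overall = (sum(n for (a, p), n in confusion.items() if a != p)
               / sum(confusion.values()))
    return errors, overall

# counts from the test on the training dataset
confusion = {
    ("non-spam", "non-spam"): 4846, ("non-spam", "spam"): 42,
    ("spam", "spam"): 3082, ("spam", "non-spam"): 30,
}
errors, overall = class_errors(confusion)
# errors["non-spam"] ≈ 0.0086, errors["spam"] ≈ 0.0096, overall = 0.009
```
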

Considering the total error reasonable (0.9%), the DT rule was applied to the classification dataset, which led to the selection of 6,753 records (out of 11,177) regarded as non-spam e-mails.

Assessment of the Results

Tested on the classification dataset, the model correctly classified 6,699 non-spam e-mails out of 6,803. Only 104 records were omitted by the classification "filter" and only 54 spam mails "passed" through the "filter". The model achieved a total classification error of 1.4%.

The evaluation of the model followed a procedure typical for every contest. The jury knew which of the 11,177 e-mails to be classified were actually non-spam and which were spam. This information was later made available to the participants, so it became possible to test the model against real data. The figures below reflect the model's efficiency.

Going back to the definition of the task, the project's objective was to minimize the number of spam e-mails passing the filter. The actual number of non-spam e-mails in the classification dataset was 6,803. Only 54 (0.79%) spam mails incorrectly passed the filter and were misclassified as non-spam. Still, there were 104 non-spam mails that the model considered to be spam; in percentage terms, the model failed to recognize 1.52% of the non-spam e-mails. To conclude these calculations, testing the model on the classification dataset proved that its real precision was 98.59%, with only 158 misclassified records out of 11,177.

Conclusions

Most important, this case study adds to the evidence of PolyAnalyst's successful application in real-life projects. Its efficiency can be summarized by two aspects: speed and accuracy.

The training took 20 minutes on a Pentium III computer running PolyAnalyst 4.6 under Windows XP.

As presented in Megaputer's technical documentation, "Decision Trees is PolyAnalyst's fastest algorithm when dealing with a large amount of records and/or fields".
Running on a Pentium III at 650 MHz with only 64 MB of RAM, the classification model took less than one hour to build with a demo version of PolyAnalyst 4.5 under Windows 98. Using the same 8,000-record dataset, but this time on a Pentium III at 650 MHz with 128 MB of RAM, the time was reduced to 20 minutes with a licensed version of PolyAnalyst 4.6 under Windows XP. Applying the model to the 11,177-row dataset was only a matter of seconds.

The model offered a classification error far better than most of the solutions submitted in the DMC 2003 contest.

As regards the accuracy of the model, PolyAnalyst's decision tree algorithm proved its capacity to build high-precision classification models. From a dataset of 11,177 records, only 158 were misclassified, which represents a global error of 1.4%. Such a level of precision is critical in some areas of application (such as medicine) and a huge advantage in others, like finance or CRM.

Yet a third aspect deserves mention in our conclusion: the software's ability to handle large volumes of data. This characteristic was critical in our example, since the training dataset contained more than 6.6 million Boolean values and the total size of the project reached 16 million Boolean values (835 columns × 19,177 rows).

Further Applications

The same classification approach can be applied in fields like marketing, retail analysis, fraud detection, finance, etc.

Such a data analysis model can be replicated in many other areas of interest. Classification tasks are widely encountered in data mining projects; some of the most common fields are:
- Customer Relationship Management: which prospects are most likely to respond to a new promotion, or which customers are about to switch to a competitor?
- Fraud detection: which transactions are most likely to be fraudulent?
- Retail/point-of-sale analysis: which location is most appropriate for a future store?
- Portfolio analysis: which securities/assets would fit a predefined portfolio at a certain point in time?
- Medicine: what diagnosis or medication is suitable for a complex of symptoms?

Acknowledgements

The author wishes to thank Mr. Richard Kasprzycki and Mr. Hrishikesh Pore from Megaputer Intelligence Inc. for their confidence and invaluable contribution to the writing of this material. The author also congratulates the organizers of the DMC 2003 contest and thanks Mrs. Sandra Hömke and Mr. Jens Scholz from prudsys AG for their permission to further use the data provided in the contest.

Notes:
1. The Data Mining Cup contest (www.data-mining-cup.de) is a competition organized every year in Chemnitz, Germany, by the Chemnitz University of Technology (www.tu-chemnitz.de), prudsys AG (www.prudsys.de) and the European Knowledge Discovery Network of Excellence (www.kdnet.org). The contest is addressed to students of universities, advanced technical colleges and universities of cooperative education from Germany and abroad. In 2003 there were 514 registered participants from 199 universities and colleges in 39 countries, out of which only 156 found a solution to the problem.
2. SpamAssassin(tm) is a mail filter used to identify spam. Using its rule base, it applies a wide range of heuristic tests on mail headers and body text to identify unsolicited commercial e-mail. For more information see http://spamassassin.org

Author: Militaru, Valentin-Dragos
Faculty of Cybernetics, Statistics and Computer Science, Academy of Economic Studies, Bucharest
TEL: +40.744.689.080 EMAIL: valentin.militaru@home.ro

Corporate and Americas Headquarters
Megaputer Intelligence, Inc.
120 West Seventh Street, Suite 310, Bloomington, IN 47404, USA
TEL: +1.812.330.0110; FAX: +1.812.330.0150
EMAIL: info@megaputer.com

Europe Headquarters
Megaputer Intelligence, Ltd.
B. Tatarskaja 38, Moscow 113184, Russia
TEL: +7.095.951.8079; FAX: +7.095.953.5731
EMAIL: info@megaputer.com

Copyright 2003 Megaputer Intelligence, Inc. All rights reserved. Limited copies may be made for internal use only. Credit must be given to the publisher. Otherwise, no part of this publication may be reproduced without prior written permission of the publisher. PolyAnalyst and PolyAnalyst COM are trademarks of Megaputer Intelligence Inc. Other brand and product names are registered trademarks of their respective owners.