Performance Comparison of Naïve Bayes and J48 Classification Algorithms

Performance Comparison of Naïve Bayes and J48 Classification Algorithms
Anshul Goyal and Rajni Mehta, Assistant Professor, CSE, JCDM COE, Kurukshetra University, Kurukshetra
goyal.anshul90@gmail.com, er.rajnimehta@gmail.com

Abstract
Classification is an important data mining technique with broad applications: it assigns each item in a data set to one of a predefined set of classes or groups, and it is used in virtually every domain of life. This paper presents a performance evaluation of the Naïve Bayes and J48 classification algorithms. Naïve Bayes is based on probability, while J48 builds a decision tree. Using the WEKA tool on a financial-institution dataset, the paper compares the two classifiers with the goal of maximizing the true positive rate and minimizing the false positive rate for defaulters, rather than pursuing higher classification accuracy alone. The experimental results reported here cover classification accuracy and cost analysis, and show that both J48 and Naïve Bayes achieve good efficiency and accuracy on this dataset.

Keywords: Naïve Bayes, J48, True Positive Rate, False Positive Rate, Precision, Recall, F-Measure, ROC Curve.

Data Mining
Advances in technology have brought great growth in the volume of data available on the internet, in digital libraries, news sources, and company-wide intranets. This makes a huge number of databases and information repositories available, but it is impossible to manually organize, analyze, and retrieve information from all of this data. This creates a pressing need for methods that help users efficiently navigate, summarize, and organize the data so that it can be used in applications ranging from market analysis and fraud detection to customer retention.
Therefore, techniques that perform data analysis and can uncover important data patterns are needed. One such technique is data mining, which refers to extracting or mining knowledge from large amounts of data [1][2]. Data mining is performed on database-oriented data sets and applications (relational databases, data warehouses, transactional databases) as well as on advanced data sets and applications such as object-relational databases, temporal, sequence, and time-series databases, spatial and spatiotemporal databases, text and multimedia databases, heterogeneous and legacy databases, data streams, and the World Wide Web [1].

Classification
Classification is a classic data mining technique based on machine learning. Basically, classification is used to assign each item in a set of data to one of a predefined set of classes or groups. Classification methods make use of mathematical techniques such as decision trees, linear programming, neural networks, and statistics. In classification, software is built that learns how to classify data items into groups. For example, given all past records of employees who left a company, classification can predict which current employees are likely to leave in the future. In this case, we divide the employee records into two groups, "leave" and "stay", and then ask the data mining software to classify the employees into one of these groups [3].

More formally: given a collection of records, where each record contains a set of attributes and one of the attributes is the class, the task is to find a model for the class attribute as a function of the values of the other attributes. The goal is that previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model; usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Classification: A Two-Step Process
1. Model construction: describing a set of predetermined classes. Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute [1]. The set of tuples used for model construction is the training set. The model is represented as classification rules, decision trees, or mathematical formulae.
2. Model usage: classifying future or unknown objects, and estimating the accuracy of the model. The known label of each test sample is compared with the classified result from the model. The accuracy rate is the percentage of test set samples that are correctly classified by the model. The test set must be independent of the training set; otherwise over-fitting will occur.
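The two-step process can be sketched in code. The following is a minimal illustration in Python, with a made-up toy dataset and a deliberately trivial majority-class "model" (all names and data here are hypothetical, not from the paper's experiments): split the data, build the model on the training set, and measure the accuracy rate on the held-out test set.

```python
from collections import Counter

# Hypothetical labeled records: (attributes, class label).
records = [({"age": a}, "yes" if a > 30 else "no") for a in range(20, 60)]

# Split into a training set and an independent test set.
split = int(0.7 * len(records))
train, test = records[:split], records[split:]

# Step 1 (model construction): a trivial model that always predicts
# the majority class observed in the training set.
majority_class = Counter(label for _, label in train).most_common(1)[0][0]

# Step 2 (model usage): classify the test set and compute the accuracy
# rate -- the percentage of test samples classified correctly.
correct = sum(1 for _, label in test if majority_class == label)
accuracy = 100.0 * correct / len(test)
print(f"Accuracy rate: {accuracy:.2f}%")
```

A real classifier (J48, Naïve Bayes) replaces the majority-class rule, but the train/test protocol and the accuracy computation stay exactly as above.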

J48: Classification by Decision Tree Induction
A decision tree is:
1. A flow-chart-like tree structure, in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent class labels or class distributions.
2. Generated in two phases:
   - Tree construction: at the start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes.
   - Tree pruning: identify and remove branches that reflect noise or outliers.
3. Used for classifying an unknown sample by testing the attribute values of the sample against the decision tree [4][5].

Algorithm for Decision Tree Induction
1. Basic algorithm (a greedy algorithm): the tree is constructed in a top-down, recursive, divide-and-conquer manner. At the start, all the training examples are at the root. Attributes are assumed to be categorical (continuous-valued attributes are discretized in advance). Examples are partitioned recursively based on selected attributes; test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
2. Conditions for stopping the partitioning: all samples for a given node belong to the same class; there are no remaining attributes for further partitioning (majority voting is then employed to classify the leaf); or there are no samples left [4][5].

Output: a decision tree, in the running example a tree for the class buys_computer.

Extracting Classification Rules from Trees
1. Represent the knowledge in the form of IF-THEN rules.
2. One rule is created for each path from the root to a leaf.
3. Each attribute-value pair along a path forms a conjunction.
4. The leaf node holds the class prediction.
5. Rules are easier for humans to understand [6].

Example:
IF age <= 30 AND student = no THEN buys_computer = no
IF age <= 30 AND student = yes THEN buys_computer = yes
IF age = 31..40 THEN buys_computer = yes
IF age > 40 AND credit_rating = excellent THEN buys_computer = yes
IF age > 40 AND credit_rating = fair THEN buys_computer = no

Naïve Bayes Classifier (I)
A simplifying assumption: attributes are conditionally independent given the class. This greatly reduces the computation cost, since only the class distribution needs to be counted [10].

Naïve Bayes Classifier (II)
Given a training set, we can compute the required probabilities directly. By Bayes' theorem:
1. P(C|X) = P(X|C) P(C) / P(X).
2. P(X) is constant for all classes, so it can be ignored when comparing classes.
3. P(C) is estimated as the relative frequency of class C samples.
4. The predicted class is the C for which P(X|C) P(C) is maximum.
5. Problem: computing P(X|C) directly is infeasible, which is why the conditional-independence assumption above is used [9][10].

WEKA Tool
In WEKA, datasets should be formatted in the ARFF format. The WEKA Explorer handles this automatically; if it does not recognize a given file as an ARFF file, the Preprocess panel has facilities for importing data from a database and for preprocessing the data using a filtering algorithm. These filters can be used to transform the data, and make it possible to delete instances and attributes according to specific criteria
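Under the conditional-independence assumption, P(X|C) factorizes into a product of per-attribute probabilities, so the classifier only needs class-conditional counts. The following is a small self-contained sketch in Python on a hypothetical toy version of the buys_computer data (the tuples, attribute names, and counts are illustrative, not the paper's dataset):

```python
from collections import Counter, defaultdict

# Hypothetical training tuples: (age, student) -> buys_computer.
train = [
    ("<=30", "no", "no"), ("<=30", "yes", "yes"), ("31..40", "no", "yes"),
    ("31..40", "yes", "yes"), (">40", "no", "no"), (">40", "yes", "yes"),
]

# Class priors P(C) come from class counts; P(x_i | C) from
# per-attribute, per-class value counts.
class_counts = Counter(label for *_, label in train)
cond = defaultdict(Counter)  # (attribute index, class) -> value counts
for *attrs, label in train:
    for i, v in enumerate(attrs):
        cond[(i, label)][v] += 1

def classify(sample):
    # Pick the class C maximizing P(C) * prod_i P(x_i | C); the
    # denominator P(X) is constant across classes, so it is dropped.
    def score(label):
        n = class_counts[label]
        p = n / len(train)  # prior P(C), the relative class frequency
        for i, v in enumerate(sample):
            p *= cond[(i, label)][v] / n  # independence factor P(x_i | C)
        return p
    return max(class_counts, key=score)

print(classify(("<=30", "yes")))  # classify an unseen sample
```

Note that this bare sketch assigns probability zero to any attribute value never seen with a class; practical implementations add Laplace smoothing to the counts to avoid that.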

[7][8].

Performance Investigation and Results
Experiments are performed on the Bank data set using classification algorithms in the WEKA tool. The first step is to find the total number of instances of the given data using both the Naïve Bayes and J48 classification algorithms. In the next step, the experiment calculates the classification accuracy and performs the cost analysis.

A confusion matrix contains information about the actual and predicted classifications made by a classification system, and the performance of such a system is commonly evaluated using the data in the matrix. For a two-class classifier, the entries of the confusion matrix have the following meaning in the context of our study:
1. a is the number of correct predictions that an instance is negative,
2. b is the number of incorrect predictions that an instance is positive,
3. c is the number of incorrect predictions that an instance is negative, and
4. d is the number of correct predictions that an instance is positive.

Several standard terms have been defined for this matrix:
1. True positive (TP): if the outcome of a prediction is p and the actual value is also p, it is called a true positive.
2. False positive (FP): if the outcome of a prediction is p but the actual value is n, it is called a false positive.
3. Precision and recall: precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. Both are therefore based on an understanding and measure of relevance.
4. Precision can be seen as a measure of exactness or quality, whereas recall is a measure of completeness or quantity. High recall means that an algorithm returned most of the relevant results; high precision means that an algorithm returned more relevant results than irrelevant ones.

Tables and graphs using J48
The Bank data set has a total of 300 instances. The Gender class has been chosen from the bank dataset. When the J48 algorithm is applied to the dataset, the following confusion matrix is generated for the class Gender, which has two possible values, Male and Female:

             Classified as
              a      b
  a=male     87     67
  b=female   68     78

The table above summarizes the actual and predicted classifications: the total number of true positives for class a is 87 and of false positives for class a is 67; the total number of true positives for class b is 78 and of false positives for class b is 68. The metrics are computed from the confusion matrix as follows:

True positive rate = diagonal element / sum of its row
False positive rate = off-diagonal element of the other row / sum of that row
Precision = diagonal element / sum of its column
F-measure = 2 * Precision * Recall / (Precision + Recall)

True positive rate for class a = 87/(87+67) = .565
False positive rate for class a = 68/(68+78) = .466
True positive rate for class b = 78/(78+68) = .534
False positive rate for class b = 67/(67+87) = .435
Precision for class a = 87/(87+68) = .561
Precision for class b = 78/(78+67) = .538
F-measure for class a = 2*.561*.565/(.561+.565) = .563
F-measure for class b = 2*.538*.534/(.538+.534) = .536

(Figures: graphs using J48, and the cost analysis of J48 for class Male.)
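The calculations above can be reproduced mechanically from the 2x2 confusion matrix. A short Python sketch, using the J48 confusion matrix for the class Gender given in the text:

```python
# J48 confusion matrix for class Gender: rows are actual classes
# (a=male, b=female), columns are predicted classes.
matrix = [[87, 67],
          [68, 78]]

def metrics(m, i):
    """Per-class TPR, FPR, precision and F-measure for class i of a 2x2 matrix."""
    other = 1 - i
    tpr = m[i][i] / sum(m[i])                  # diagonal / sum of its row
    fpr = m[other][i] / sum(m[other])          # other row's off-diagonal / that row
    precision = m[i][i] / (m[0][i] + m[1][i])  # diagonal / sum of its column
    recall = tpr                               # recall equals the TPR
    f = 2 * precision * recall / (precision + recall)
    return tpr, fpr, precision, f

for i, name in enumerate(("a=male", "b=female")):
    tpr, fpr, prec, f = metrics(matrix, i)
    print(f"{name}: TPR={tpr:.3f} FPR={fpr:.3f} precision={prec:.3f} F={f:.3f}")
```

Running this reproduces the paper's figures (TPR .565/.534, FPR .466/.435, precision .561/.538, F-measure .563/.536); the same function applied to the Naïve Bayes matrix yields the values reported in the next section.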

(Figures: cost analysis of J48 for class Female; graphs and cost analysis using Naïve Bayes for classes Male and Female.)

Tables and graphs using Naïve Bayes
When the Naïve Bayes algorithm is applied to the same dataset, the following confusion matrix is generated for the class Gender:

             Classified as
              a      b
  a=male     93     61
  b=female   82     64

The table above summarizes the actual and predicted classifications: the total number of true positives for class a is 93 and of false positives for class a is 61; the total number of true positives for class b is 64 and of false positives for class b is 82. Applying the same formulas:

True positive rate for class a = 93/(93+61) = .604
False positive rate for class a = 82/(82+64) = .562
True positive rate for class b = 64/(64+82) = .438
False positive rate for class b = 61/(61+93) = .396
Precision for class a = 93/(93+82) = .531
Precision for class b = 64/(64+61) = .512
F-measure for class a = 2*.531*.604/(.531+.604) = .565
F-measure for class b = 2*.512*.438/(.512+.438) = .472

The tables and graphs above report the confusion matrix, true positive rate, false positive rate, precision, recall, ROC area, and F-measure on the basis of the class Gender, which has the two categories Male and Female. There are two graphs for each class: the first plots sample size against true positive rate, and the second plots sample size against the cost/benefit analysis. These graphs are drawn on the basis of the confusion matrix and the cost matrix, here using the Naïve Bayes classification algorithm.
Conclusion
Both algorithms were run on the given Bank data set, and their results are presented in this section.

Performance evaluation on the basis of Gender:

            Classification Accuracy        Cost Analysis
  Gender    Naïve Bayes    J48             Naïve Bayes    J48
  Male      48.33%         52.67%          155            142
  Female    51%            52%             147            144

Classification accuracy and cost analysis of class Gender.

The proposed method uses the bank dataset, and the experiments have been performed using the WEKA tool. The Bank data set was taken from the UCI repository and has 300 instances and 9 attributes. J48 is a simple classifier technique to make a

decision tree, and efficient results have been obtained from the bank dataset using the WEKA tool in the experiment. The Naïve Bayes classifier also shows good results. The experimental results reported in this study cover classification accuracy and cost analysis. J48 gives higher classification accuracy for the class Gender (with its two values, Male and Female) in the bank dataset. The results on this dataset also show that the efficiency and accuracy of both J48 and Naïve Bayes are good. The classification technique of data mining is useful in every domain of our life, e.g. universities, crime, medicine, etc.

Future Work
Classification is an important technique of data mining and has applications in various fields. The present study focused on a few issues such as high dimensionality, scalability, and accuracy, but there are still many issues that can be taken into consideration for further research:
1. Algorithms that are not included in WEKA can be tested, and experiments with various feature selection techniques should be compared.
2. The classification technique of data mining is useful in every domain of our life, e.g. the university domain (category-wise), the medical domain, the crime domain, Auto Price, Zoo, etc.
3. Limitations given by the WEKA tool can also be removed by changing the implementation of the algorithms that are already used in the WEKA tool.

References
[1] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, 2001.
[2] Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining.
[3] http://wikipedia.org/wiki/research.
[4] C. Apte and S. Weiss, Data mining with decision trees and decision rules, Future Generation Computer Systems, 13, 1997.
[5] U. M. Fayyad, Branching on attribute values in decision tree generation, in Proc. 1994 AAAI Conf., pages 601-606, AAAI Press, 1994.
[6] M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han, Generalization and decision tree induction: Efficient classification in data mining, in Proc. 1997 Int. Workshop on Research Issues on Data Engineering (RIDE'97), pages 111-120, Birmingham, England, April 1997.
[7] Remco R. Bouckaert, Eibe Frank, Mark Hall, Richard Kirkby, Peter Reutemann, Seewald, David Scuse, WEKA Manual for Version 3-7-5, October 28, 2011.
[8] Rossen Dimov, WEKA: Practical Machine Learning Tools and Techniques in Java, 2006/07.
[9] Uffe B. Kjærulff, Anders L. Madsen, Probabilistic Networks: An Introduction to Bayesian Networks and Influence Diagrams, 10 May 2005.
[10] H. Zhang and J. Su, Naive Bayesian classifiers for ranking, in Proc. ECML 2004, 15th European Conference on Machine Learning, Pisa, Italy, 2004.