# Performance Analysis of Decision Trees

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Performance Analysis of Decision Trees Manpreet Singh Department of Information Technology, Guru Nanak Dev Engineering College, Ludhiana, Punjab, India Sonam Sharma CBS Group of Institutions, New Delhi,India Avinash Kaur Department of Computer Science, Lovely Professional University, Phagwara, Punjab, India ABSTRACT In data mining, decision trees are considered to be most popular approach for classifying different attributes. The issue of growing of decision tree from available data is considered in various discipline like pattern recognition, machine learning and data mining. This paper presents an updated survey of growing of decision tree in two phases following a top down approach. This paper presents a framework for dealing with the condition of uncertainty during decision tree induction process and concludes with comparison of Averaging based approach and Uncertain Decision Tree approach. 1. INTRODUCTION With the passage of time, large amount of data is collected in almost every field. The useful data need to be extracted or mined for fulfilling the future needs. Although there are various techniques through which the data can be extracted but the classification technique used for classifying the data for mining is the most useful technique. The classification technique is implemented using different algorithms like decision tree, neural network, genetic algorithm and K-nearest neighbor. The decision trees are constructed using the data whose values are known and precise. These are implemented using C4.5 algorithm. But during the construction of decision tree major problem of uncertainty arises in the datasets that is responsible for the performance degradation and inaccurate results. The averaging approach and distributed approach are used to handle the problem of uncertainty. These approaches are compared on the same datasets so as to predict which approach reduces the error rate and depict more accurate results. 2 DATA MINING PROCESS 2.1 Data Mining Data mining is the process which takes data as input and provides knowledge as output. It is non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data [1].It is also called as knowledge mining. It is a part of the knowledge discovery process. Advanced Computation methods are applied in data mining for the extraction of information from data [2]. 2.2 Data Mining Techniques Different data mining techniques are used to extract the appropriate data from database. This goal of extraction is divided into two categories: prediction and description. In prediction process, the existing database and the variables in the database are interpreted in order to predict unknown or future values. On the other hand, description focuses on finding the patterns describing the data and subsequent presentation for user interpretation [3]. Every data mining techniqueis based either on the concept of prediction or description. Different Data Mining Techniques are association, classification, regression and clustering [2]. Association rule mining is one of the most and well researched techniques of data mining [4]. It aims to extract interesting correlations, frequent patterns, associations or casual structures among sets of items in the transaction databases or other data repositories [5]. Classification is a process which builds a model based on previous datasets so as to predict future trends with new data sets. The association and classification data mining techniques deals with the categorical attribute. The technique used to predict the numeric or continuous values is known as numeric prediction or regression. It was developed by Sir Frances Galton [2]. It is based on the concept of prediction and is used to predict a model based on the observed data over a period of time [6]. Sometimes, the large amount of data is available but few variables in the datasets do not have class labels. The process of assigning class labels to the variables is known as clustering. In this process, the variables in the data sets are grouped together according to the similarity between each other and dissimilarity from other variables [5] Classification Classification is one of the fundamental tasks in data mining and has also been studied extensively in statistics, machine learning, neural networks and expert system over decades [7]. It is a process which builds a model based on previous datasets so as to predict future trends with new data sets. It is a two-step process in which the first step is the learning step which involves building of a classifier or a model based on analyzation of training dataset so as to predict categorical labels [2]. As the class label is provided in the data set, so it is also known as supervised learning [8]. In the second step, the accuracy of the model is predicted using test data set. If the accuracy is acceptable, then the model is used to classify new 10

2 data tuples. The different classification techniques are decision trees, neural networks, K-nearest neighbor and genetic algorithms [2]. Among these techniques, the decision trees induction algorithms present several advantages over other learning algorithms, such as robustness to noise, low computational cost for generating the model and ability to deal with redundant attributes. These are fast and accurate among all the classification methods [9]. 3 DECISION TREES Decision tree is a predictive modelling based technique developed by Rose Quinlan.It is a sequential classifier in the form of recursive tree structure [10]. There are three kinds of nodes in the decision tree. The node from which the tree is directed and has no incoming edge is called the root node.a node with outgoing edge is called internal or test node. All the other nodes are called leaves (also known as terminal or decision node) [11]. The data set in decision tree is analyzed by developing a branch like structure with appropriate decision tree algorithm. Each internal node of tree splits into branches based on the splitting criteria. Each test node denotes a class. Each terminal node represents the decision. They can work on both continuous and categorical attributes [2]. 3.1 Decision Tree Induction Decision Tree induction is an important step of segmentation methodology. It acts as a tool for analyzing the large datasets. The response of analyzation is predicted in the form of tree structure [12]. The Decision tree are classified using the two phase s tree building phase and tree pruning phase [2] a) Tree Building Phase: In this phase the training data is repeatedly partitioned until all the data in each partition belong to one class or the partition is sufficiently small. The form of the split depends on the type of attribute. The splitting for numeric attributes are of the form A s, where s is a real number and the split for categorical attributes are of the form A c, where c is a subset of all possible values of A [13]. The alternative splits for attribute are compared using split index. The different tests for choosing the best split are Gini Index (Population Diversity), Entropy (Information Gain), Information Gain Ratio [11]. i) The Gini Index measures the impurity of a data partition or a set of training tuples as Gini( t) 1 [ p( j t)] j wherep (j t) is the relative frequency of class j at node t. It is used in CART, SLIQ and SPRINT decision tree algorithms. The node with the least computed Gini index is chosen [14]. ii) Entropy is an information-theoretic measure of the uncertainty contained in a training set, due to the presence of more than one possible classification. If there are K classes, we can denote the proportion of instances with classification i by pi for i = 1 to K. The value of pi is the number of occurrences of class i divided by the total number of instances, which is a number between 0 and 1 inclusive The Entropy is measured in bits. It is given by 2 Entropy t) p( j t)log p( j t) ( 2 j Where p (j t) is the relative frequency of class j at node t[15] iii) Information Gain is the change in information entropy from prior state to a state that takes some information [19]. It measures reduction in entropy achieved because of split. It is given by k i Gainspit Entropy( p) Entropy( i) i1 n where Parent Node, p is split into k partitions and n i is number of records in partition i[15]. The node with the highest information gain is chosen as best split. It is used in ID3 and C4.5 algorithms.the disadvantage of information gain is that it is biased towards the splits that result in large number of partitions, each being small but pure [16]. iv) Gain Ratio is the variant introduced by the Australian academician Ross Quinlan in his influential system C4.5 in order to reduce the effect of the bias resulting from the use of information gain. Gain Ratio adjusts the information gain for each attribute to allow for the breadth and uniformity of the attribute values [17]. This split method doesn't select a test with an equal or below average information gain and a low split info, because a low split info doesn't provide any intuitive insight into class prediction [18].The gain ratio is given by GainRATIO split SplitINFO where Parent Node, p is split into k partitions n i is the number of records in partition i.[15] This technique is used in C4.5 algorithm [19]. b) Tree Pruning Phase: This phase avoids model overfitting [9]. The issue of overfitting arises due to random errors, noise in data or the coincidental patterns, which can lead to strong performance degradation. There are two approaches that fulfill the task of tree pruning. In the Pre-Pruning Approach (Forward Pruning) top-down construction of decision tree is stopped at a point when the sufficient data is no longer available to make the accurate decisions. In the Post-Pruning Approach (Backward Pruning) the tree is first fully grown and then the sub trees are removed which lead to unreliable results. Further these approaches are classified into Reduced Error Pruning and Rule Post Pruning [7]. i) Reduced error pruning: In this process all the internal nodes in the tree are considered. For each node in the tree it is checked that if removing it along the subtree below it does not affect the accuracy of the validation set [11]. This process is repeated recursively for a node until no more improvements on a node are possible. The reduced error pruning is used by ID3 decision tree algorithm. The biggest disadvantage of n GAINsplit SplitINFO k i1 ni ni log 2 n n 11

3 reduced error pruning is that it cannot be used on the small training data set [14]. ii) Rule Post Pruning: In this kind of pruning the tree structure is converted into rule based system and the redundant conditions are removed by pruning each rule.at the end the rules are sorted by accuracy. The Rule post pruning is used by C4.5 decision tree algorithm [15]. The advantage of Rule post pruning is that pruning becomes flexible and improves interpretability. This yields higher accuracy [14]. 3.2 Decision Tree Algorithms The different decision tree algorithms are ID3, C4.5 and CART. a) ID3 (Iterative Dichotomiser) ID3 is a greedy learning decision tree algorithm introduced in 1986 by Quinlan Ross [17]. It is based on Hunts algorithm [20].This algorithm recursively selects the best attribute as the current node using top down induction. Then the child nodes are generated for the selected attribute. It uses an information gain as entropy based measure to select the best splitting attribute and the attribute with the highest information gain is selected as best splitting attribute. The accuracy level is not maintained by this algorithm when there is too much noise in the training data sets. The main disadvantage of this algorithm is that it accepts only categorical attributes and only one attribute is tested at a time for making decision. The concept of pruning is not present in ID3 algorithm. [21] b) C4.5 algorithm C4.5 is an extension of ID3 algorithm developed by Quinlan Ross [20].This algorithm overcomes the disadvantage of overfitting of ID3 algorithm by process of pruning. It handles both the categorical and the continuous valued attributes. Entropy or information gain is used to evaluate the best split. The attribute with the lowest entropy and highest information gain is chosen as the best split attribute. The pessimistic pruning is used in it to remove unnecessary branches indecision tree [7]. d) C5 C5 algorithm is the successor of C4.5 proposed by Quinlan [22].It uses the concept of maximum gain to find the best attribute. It can produce classifiers which can be represented in the form of rule sets or decision trees. Some important features of C5 algorithm are: Accuracy: C5 rule sets are small in size thus they are not highly prone to error Speed : C5 has a great speed efficiency as compared to C4.5 Memory: Memory utilization of C4.5 was very high as compared to C5 thus C4.5 was called memory hungry algorithm while it is not true in case of C5. Decision Trees: C5 produces simple and small decision trees Boosting: C5 adopts a boosting technique to calculate accuracy of data. It is a technique for creating and combining multiple classifiers to generate improved accuracy. Misclassification costs: C5 calculates the error costs separately for individual classes then it constructs classifiers to minimize misclassification costs. It supports sampling and cross validation [23]. 3.3 Uncertainty in Decision Tree Although the different algorithms have been devised for formulation of accurate results but a major problem of uncertainty arises in data sets in decision tree algorithms due to various factors like the missing tuples value, noise. There are two approaches to handle uncertain data. First approach is the averaging approach (AVG) which is based on greedy algorithm and induces top down tree. In this expected value replaces each probability density function (probability that a random variable x falls in a particular range) in uncertain information and thus converts the data tuples into point valued tuples. This function is also denoted by pdf. Thus the decision tree algorithms ID3 and C4.5 can be reused. Another approach is the distributed approach which considers the complete information carried by probability distribution function to build a decision tree. The challenge in this approach is that a training tuples can now pass a test at tree node probabilistically when its pdf properly contains split point of test. This induces a decision tree which is known as Uncertain Decision Tree (UDT) [14]. To test the accuracy of both the approaches and decide which approach is better than the other the experiments are conducted on the 5 real data sets as shown in Table 1 Table 1 Selected Data sets from UCI Machine Learning Repository [14] Data Set Training Tuples No.of Attributes No. of classes Test Tuples Satellite Vehicle fold Ionosphere fold Glass fold Iris fold For the different datasets Gaussian distribution and the Uniform is considered as an error model to generate the probability density function (pdf). Gaussian distribution is chosen for the measure that involves noise and uniform distribution is chosen when the digitization of measured values introduces noise. The results of applying AVG and UDT to different datasets are shown in table 2. Table 2 Accuracy Improvement by Considering 12

4 Data Set AVG UDT In the experiments each PDF is represented by 100 sample points (i.e. s=100). For most of the data sets Gaussian distribution is assumed as an error model but for the datasets Satellite and Vehicle uniform distribution error model has been used. The second and third columns in the table depict the percentage accuracy achieved on a particular data set by the AVG and UDT approach consequently. Comparing the second and third column of table 2, it is seen that UDT approach builds a more accurate decision tree then the AVG approach. For instance for the data set Satellite accuracy is improved from to that is reducing the error rate from 15.52% to 12.27%. For the dataset Ionosphere accuracy is improved from to thus reducing the error rate from 28.97% down to 24.91%.Similarly in other datasets there is reduction in error rate. The Figure 1 depicts that UDT is more accurate than AVG for same data sets. Figure 1: Comparison of UDT and AVG 4 CONCLUSION Best Case Gaussian Uniform Satellite Vehicle Ionosphere N/A Glass N/A Iris N/A Chart Title AVG UDT The concept of data mining and different techniques used for mining of the data are discussed. The data mining technique classification solves a purpose of classifying the data sets and reaching a decision with different algorithms of decision tree. The performance is the issue in decision trees which arises due to data uncertainty. The uncertainty arises due to missing data tuples and noise in data. To handle the problem of uncertainty, two kind of approaches are followed: averaging based approach and distributed approach. The decision trees are induced based on both the approaches. According to the experiments conducted on different data sets, the UDT builds a more accurate decision tree then averaging approach. The experimental results are depicted graphically. The new methodologies can be introduced to improve the accuracy of decision trees which may lead to highly accurate decisions.. 5. REFERENCES [1] F[1] Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (2006, March). "From Data Mining to Knowledge Discovery in Databases". The Knowledge Engineering Review,Vol 21, No.1, pp [2] Han, J., & Kamber, M. (2006). "Data Mining: Concepts and Techniques" (2nd ed.). Morgan Kaufmann Publishers. [3] Pujari, A. K. (2001). "Data Mining Techniques". Universites Press India Private Limited. [4] Agarwal, R., Imieliński, T., & Swami, A. (1993, June). "Mining association rules between sets of items in large databases". ACM SIGMOD Record, Vol 22,No.2, pp [5] Zhao, Q., & Bhowmick, S. S. (2003). "Association Rule Mining: A Survey". Retrieved fromhttp://sci2s.ugr.es/keel/pdf/specific/report/zhao03ars.pdf [6] Oracle Data Mining Concepts 11g. (2008, May). Retrieved from oracle.com: 29.pdf [7] Lavanya, D., & Rani, U. k. (2011, July). "Performance Evaluation of Decision Tree Classifiers on Medical Datasets". International Journal of Computer Applications, Vol 26,No. 4,pp [8] Kotsiantis, S. B. (2007). "Supervised Machine Learning : A review of classification techniques",vol 160,No. 3,Frontiers in Artificial Intelligence and Applications [9] Barros, R. C., Basgalupp, M. P., Carvalho, A. C., &Freitas, A. A. (2010, Jan). "A Survey of Evolutionary Algorithms for DecisionTree Induction". IEEE Transactions on Systems,Mans and Cybernetics, Vol. 10,No. 10,pp [10] Quinlan, J. R. (1987). "Generating production rules from decision trees". Proceedings of the 10th international joint conference on Artificial intelligence, pp [11] Maimon, O., & Rokach, L. (2010). "Data Mining and Knowledge Discovery Handbook". (2nd, Ed.) Springer. [12] De ville, B. (2006). "Decision trees for Business Intelligent and Data Mining using SAS Enterprise Miner". NC, USA: SAS Institute Inc. [13] Apers, P., Bouzeghoub, M., & Gardarin, G. (1996). Advances in Database Technology. 5th International Conference on Extending Database Technology Avignon,Vol

5 [14] Tsang, S., Kao, B., Yip, K. Y., Ho, W.-S., & Lee, S. D. (2011, Jan). "Decision Trees for Uncertain Data". IEEE Transactions on Knowledge and Data Engineering, Vol 23,No.1. [15] Bramer, M. (2007). Principles of data Mining. London: Springer. [16] Quinlan, J. R. (1993). C4.5 programs for machine learning. Morgan,. [17] Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning, Vol 1,No.1,pp [18] Harris, E. (2002). "Information Gain Versus Gain Ratio: A Study of Split Method Biases". AMAI. [19] Das, S., & Saha, B. (2009). "Data Quality Mining using Genetic Algorithm". International Journal of Computer Science and Security, Vol 3,No.2,pp [20] Anyanwu, M. N., & Shiva, S. G. (2009). "Comparative Analysis of Serial Decision Tree Classification". International Journal of Computer Science and Security, Vol 3,No. 3,pp [21] Venkatadri, & Lokanatha. (2011). A Comparative Study of Decision Tree Classification Algorithms in Data Mining.. International Journal of Computer Applications in Engineering, Technology and Sciences, Vol 3,No.3.,pp [22] Ruggieri, S. (2002, April). "Efficient C4.5 [classification algorithm]". IEEE Tansactions on Knowledge and Data Engineering,pp [23] C5 Algorithm. (2012). Retrieved fromhttp://rulequest.com/see5-comparison.html IJCA TM : 14

### Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

### COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

### Classification and Prediction

Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser

### TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

### Data Mining for Knowledge Management. Classification

1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

### ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS

ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS Abstract D.Lavanya * Department of Computer Science, Sri Padmavathi Mahila University Tirupati, Andhra Pradesh, 517501, India lav_dlr@yahoo.com

### DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

### Data mining techniques: decision trees

Data mining techniques: decision trees 1/39 Agenda Rule systems Building rule systems vs rule systems Quick reference 2/39 1 Agenda Rule systems Building rule systems vs rule systems Quick reference 3/39

### Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data

### Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

### Fig. 1 A typical Knowledge Discovery process [2]

Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering

### A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE Kasra Madadipouya 1 1 Department of Computing and Science, Asia Pacific University of Technology & Innovation ABSTRACT Today, enormous amount of data

### Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2. Tid Refund Marital Status

Data Mining Classification: Basic Concepts, Decision Trees, and Evaluation Lecture tes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Classification: Definition Given a collection of

### A Study of Detecting Credit Card Delinquencies with Data Mining using Decision Tree Model

A Study of Detecting Credit Card Delinquencies with Data Mining using Decision Tree Model ABSTRACT Mrs. Arpana Bharani* Mrs. Mohini Rao** Consumer credit is one of the necessary processes but lending bears

### Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

### ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

### Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

### Comparative Analysis of Serial Decision Tree Classification Algorithms

Comparative Analysis of Serial Decision Tree Classification Algorithms Matthew N. Anyanwu Department of Computer Science The University of Memphis, Memphis, TN 38152, U.S.A manyanwu @memphis.edu Sajjan

### EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH

EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH SANGITA GUPTA 1, SUMA. V. 2 1 Jain University, Bangalore 2 Dayanada Sagar Institute, Bangalore, India Abstract- One

### A Review of Data Mining Techniques

Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

### Decision-Tree Learning

Decision-Tree Learning Introduction ID3 Attribute selection Entropy, Information, Information Gain Gain Ratio C4.5 Decision Trees TDIDT: Top-Down Induction of Decision Trees Numeric Values Missing Values

### The Use of Probabilistic Neural Network and ID3 Algorithm for Java Character Recognition

The Use of Probabilistic Neural Network and ID3 Algorithm Java Recognition Gregorius Satia Budhi Inmatics Department greg@petra.ac.id Rudy Adipranata Inmatics Department rudya@petra.ac.id Bondan Sebastian

### Decision Trees. Andrew W. Moore Professor School of Computer Science Carnegie Mellon University. www.cs.cmu.edu/~awm awm@cs.cmu.

Decision Trees Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm awm@cs.cmu.edu 42-268-7599 Copyright Andrew W. Moore Slide Decision Trees Decision trees

### Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis

### Data Mining based on Rough Set and Decision Tree Optimization

Data Mining based on Rough Set and Decision Tree Optimization College of Information Engineering, North China University of Water Resources and Electric Power, China, haiyan@ncwu.edu.cn Abstract This paper

### Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1

Data Mining 1 Introduction 2 Data Mining methods Alfred Holl Data Mining 1 1 Introduction 1.1 Motivation 1.2 Goals and problems 1.3 Definitions 1.4 Roots 1.5 Data Mining process 1.6 Epistemological constraints

### Optimization of C4.5 Decision Tree Algorithm for Data Mining Application

Optimization of C4.5 Decision Tree Algorithm for Data Mining Application Gaurav L. Agrawal 1, Prof. Hitesh Gupta 2 1 PG Student, Department of CSE, PCST, Bhopal, India 2 Head of Department CSE, PCST, Bhopal,

### DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

### Decision Tree Learning on Very Large Data Sets

Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa

### Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

### Interactive Exploration of Decision Tree Results

Interactive Exploration of Decision Tree Results 1 IRISA Campus de Beaulieu F35042 Rennes Cedex, France (email: pnguyenk,amorin@irisa.fr) 2 INRIA Futurs L.R.I., University Paris-Sud F91405 ORSAY Cedex,

### D A T A M I N I N G C L A S S I F I C A T I O N

D A T A M I N I N G C L A S S I F I C A T I O N FABRICIO VOZNIKA LEO NARDO VIA NA INTRODUCTION Nowadays there is huge amount of data being collected and stored in databases everywhere across the globe.

### Implementation of Data Mining Techniques to Perform Market Analysis

Implementation of Data Mining Techniques to Perform Market Analysis B.Sabitha 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, P.Balasubramanian 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

### Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

### Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

### An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

### AnalysisofData MiningClassificationwithDecisiontreeTechnique

Global Journal of omputer Science and Technology Software & Data Engineering Volume 13 Issue 13 Version 1.0 Year 2013 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals

### An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

### Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University

### EFFICIENT K-MEANS CLUSTERING ALGORITHM USING RANKING METHOD IN DATA MINING

EFFICIENT K-MEANS CLUSTERING ALGORITHM USING RANKING METHOD IN DATA MINING Navjot Kaur, Jaspreet Kaur Sahiwal, Navneet Kaur Lovely Professional University Phagwara- Punjab Abstract Clustering is an essential

### International Journal of Advance Research in Computer Science and Management Studies

Volume 2, Issue 12, December 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

### Decision tree algorithm short Weka tutorial

Decision tree algorithm short Weka tutorial Croce Danilo, Roberto Basili Machine leanring for Web Mining a.a. 2009-2010 Machine Learning: brief summary Example You need to write a program that: given a

### Random forest algorithm in big data environment

Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest

### Static Data Mining Algorithm with Progressive Approach for Mining Knowledge

Global Journal of Business Management and Information Technology. Volume 1, Number 2 (2011), pp. 85-93 Research India Publications http://www.ripublication.com Static Data Mining Algorithm with Progressive

### Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

### REVIEW OF ENSEMBLE CLASSIFICATION

Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.

### Comparative Analysis of Classification Algorithms on Different Datasets using WEKA

Volume 54 No13, September 2012 Comparative Analysis of Classification Algorithms on Different Datasets using WEKA Rohit Arora MTech CSE Deptt Hindu College of Engineering Sonepat, Haryana, India Suman

### T3: A Classification Algorithm for Data Mining

T3: A Classification Algorithm for Data Mining Christos Tjortjis and John Keane Department of Computation, UMIST, P.O. Box 88, Manchester, M60 1QD, UK {christos, jak}@co.umist.ac.uk Abstract. This paper

### A New Approach for Evaluation of Data Mining Techniques

181 A New Approach for Evaluation of Data Mining s Moawia Elfaki Yahia 1, Murtada El-mukashfi El-taher 2 1 College of Computer Science and IT King Faisal University Saudi Arabia, Alhasa 31982 2 Faculty

### Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

### Survey on Students Academic Failure and Dropout using Data Mining Techniques

ISSN 2320-2602 Volume 3, No.5, May 2014 P.Senthil Vadivu International et al., International Journal of of Advances in Computer in Computer Science and Science Technology, and 3(5), Technology May 2014,

### Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

### Enhanced Boosted Trees Technique for Customer Churn Prediction Model

IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V5 PP 41-45 www.iosrjen.org Enhanced Boosted Trees Technique for Customer Churn Prediction

### An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang

### Data Mining Part 5. Prediction

Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

### Decision Trees for Mining Data Streams Based on the Gaussian Approximation

International Journal of Computer Sciences and Engineering Open Access Review Paper Volume-4, Issue-3 E-ISSN: 2347-2693 Decision Trees for Mining Data Streams Based on the Gaussian Approximation S.Babu

### Data Mining Solutions for the Business Environment

Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania ruxandra_stefania.petre@yahoo.com Over

### Data Mining for Knowledge Management in Technology Enhanced Learning

Proceedings of the 6th WSEAS International Conference on Applications of Electrical Engineering, Istanbul, Turkey, May 27-29, 2007 115 Data Mining for Knowledge Management in Technology Enhanced Learning

### EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models

### IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES

IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES Bruno Carneiro da Rocha 1,2 and Rafael Timóteo de Sousa Júnior 2 1 Bank of Brazil, Brasília-DF, Brazil brunorocha_33@hotmail.com 2 Network Engineering

### Classification On The Clouds Using MapReduce

Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal simao.martins@tecnico.ulisboa.pt Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal claudia.antunes@tecnico.ulisboa.pt

### Learning from Data: Decision Trees

Learning from Data: Decision Trees Amos Storkey, School of Informatics University of Edinburgh Semester 1, 2004 LfD 2004 Decision Tree Learning - Overview Decision tree representation ID3 learning algorithm

### Data Mining with R. Decision Trees and Random Forests. Hugh Murrell

Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge

### A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery

A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery Runu Rathi, Diane J. Cook, Lawrence B. Holder Department of Computer Science and Engineering The University of Texas at Arlington

### IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION

http:// IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION Harinder Kaur 1, Raveen Bajwa 2 1 PG Student., CSE., Baba Banda Singh Bahadur Engg. College, Fatehgarh Sahib, (India) 2 Asstt. Prof.,

### Dynamic Data in terms of Data Mining Streams

International Journal of Computer Science and Software Engineering Volume 2, Number 1 (2015), pp. 1-6 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining

### Mauro Sousa Marta Mattoso Nelson Ebecken. and these techniques often repeatedly scan the. entire set. A solution that has been used for a

Data Mining on Parallel Database Systems Mauro Sousa Marta Mattoso Nelson Ebecken COPPEèUFRJ - Federal University of Rio de Janeiro P.O. Box 68511, Rio de Janeiro, RJ, Brazil, 21945-970 Fax: +55 21 2906626

### Healthcare Measurement Analysis Using Data mining Techniques

www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 03 Issue 07 July, 2014 Page No. 7058-7064 Healthcare Measurement Analysis Using Data mining Techniques 1 Dr.A.Shaik

### An Overview and Evaluation of Decision Tree Methodology

An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com

### Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Erkan Er Abstract In this paper, a model for predicting students performance levels is proposed which employs three

### A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

### Rule based Classification of BSE Stock Data with Data Mining

International Journal of Information Sciences and Application. ISSN 0974-2255 Volume 4, Number 1 (2012), pp. 1-9 International Research Publication House http://www.irphouse.com Rule based Classification

### Prediction of Heart Disease Using Naïve Bayes Algorithm

Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,

### Studying Auto Insurance Data

Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.

### International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

### DATA MINING METHODS WITH TREES

DATA MINING METHODS WITH TREES Marta Žambochová 1. Introduction The contemporary world is characterized by the explosion of an enormous volume of data deposited into databases. Sharp competition contributes

### PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS

PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS Kalpesh Adhatrao, Aditya Gaykar, Amiraj Dhawan, Rohit Jha and Vipul Honrao ABSTRACT Department of Computer Engineering, Fr.

### DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,

### Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,

### Data Mining: A Preprocessing Engine

Journal of Computer Science 2 (9): 735-739, 2006 ISSN 1549-3636 2005 Science Publications Data Mining: A Preprocessing Engine Luai Al Shalabi, Zyad Shaaban and Basel Kasasbeh Applied Science University,

### ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical

### Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

### Smart Grid Data Analytics for Decision Support

1 Smart Grid Data Analytics for Decision Support Prakash Ranganathan, Department of Electrical Engineering, University of North Dakota, Grand Forks, ND, USA Prakash.Ranganathan@engr.und.edu, 701-777-4431

### TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP

TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### Data Mining Techniques Chapter 6: Decision Trees

Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................

### CS570 Introduction to Data Mining. Classification and Prediction. Partial slide credits: Han and Kamber Tan,Steinbach, Kumar

CS570 Introduction to Data Mining Classification and Prediction Partial slide credits: Han and Kamber Tan,Steinbach, Kumar 1 Classification and Prediction Overview Classification algorithms and methods

### Data Mining: Foundation, Techniques and Applications

Data Mining: Foundation, Techniques and Applications Lesson 1b :A Quick Overview of Data Mining Li Cuiping( 李 翠 平 ) School of Information Renmin University of China Anthony Tung( 鄧 锦 浩 ) School of Computing

### Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Jing Wang Computer Science Department, The University of Iowa jing-wang-1@uiowa.edu W. Nick Street Management Sciences Department,

### Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data

### Professor Anita Wasilewska. Classification Lecture Notes

Professor Anita Wasilewska Classification Lecture Notes Classification (Data Mining Book Chapters 5 and 7) PART ONE: Supervised learning and Classification Data format: training and test data Concept,

### LVQ Plug-In Algorithm for SQL Server

LVQ Plug-In Algorithm for SQL Server Licínia Pedro Monteiro Instituto Superior Técnico licinia.monteiro@tagus.ist.utl.pt I. Executive Summary In this Resume we describe a new functionality implemented

### DATA MINING IN HIGHER EDUCATION : UNIVERSITY STUDENT DROPOUT CASE STUDY

DATA MINING IN HIGHER EDUCATION : UNIVERSITY STUDENT DROPOUT CASE STUDY Ghadeer S. Abu-Oda and Alaa M. El-Halees Faculty of Information Technology, Islamic University of Gaza Palestine ABSTRACT In this

### Leveraging Ensemble Models in SAS Enterprise Miner

ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

### FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS

FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS Breno C. Costa, Bruno. L. A. Alberto, André M. Portela, W. Maduro, Esdras O. Eler PDITec, Belo Horizonte,