COMP3420: Advanced Databases and Data Mining
Classification and prediction: Introduction and Decision Tree Induction
Lecture outline
- Classification versus prediction
- Classification: a two-step process
- Supervised versus unsupervised learning
- Evaluating classification methods
- Decision tree induction
- Attribute selection measures: information gain, gain ratio, and Gini index
- Overfitting and tree pruning
- Enhancements to basic decision tree induction
- Classification in large databases
Classification versus prediction
- Classification
  - Predicts categorical class labels (discrete or nominal)
  - Constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
- Prediction
  - Models continuous-valued functions (for example, predicts unknown or missing values)
- Typical applications: credit approval, target marketing, medical diagnosis, fraud detection
Classification: a two-step process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample/record is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulas
- Model usage: classifying future or unknown objects
  - First estimate the accuracy of the model: the known label of each test sample is compared with the model's classification, and the accuracy rate is the percentage of test set samples that are correctly classified by the model
  - The test set must be independent of the training set, otherwise overfitting will occur
  - If the accuracy is acceptable, use the model to classify data objects whose class labels are not known
Example: Model construction
Training data:

    NAME  RANK            YEARS  TENURED
    Mike  Assistant Prof  3      no
    Mary  Assistant Prof  7      yes
    Bill  Professor       2      yes
    Jim   Associate Prof  7      yes
    Dave  Assistant Prof  6      no
    Anne  Associate Prof  3      no

The classification algorithm produces the classifier (model):

    IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'

Source: Han and Kamber, DM Book, 2nd Ed. (Copyright 2006 Elsevier Inc.)
Example: Using the model in classification
Testing data:

    NAME     RANK            YEARS  TENURED
    Tom      Assistant Prof  2      no
    Merlisa  Associate Prof  7      no
    George   Professor       5      yes
    Joseph   Assistant Prof  7      yes

Accuracy: 75% (the rule classifies three of the four test records correctly; Merlisa is predicted 'yes' because years > 6, but her actual label is 'no')
Unseen data: (Jeff, Professor, 4), Tenured? The model predicts 'yes'
Source: Han and Kamber, DM Book, 2nd Ed. (Copyright 2006 Elsevier Inc.)
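The following small Python sketch (variable and function names are our own) reproduces the 75% accuracy figure by applying the learned rule to the four test records:

    # Apply the learned rule to the testing data and count correct predictions
    test = [('Tom', 'Assistant Prof', 2, 'no'),
            ('Merlisa', 'Associate Prof', 7, 'no'),
            ('George', 'Professor', 5, 'yes'),
            ('Joseph', 'Assistant Prof', 7, 'yes')]

    def predict(rank, years):
        return 'yes' if rank == 'Professor' or years > 6 else 'no'

    correct = sum(predict(rank, years) == tenured
                  for _, rank, years, tenured in test)
    print(correct / len(test))  # 0.75 -- Merlisa is the one misclassified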
Supervised versus unsupervised learning
- Supervised learning (classification and prediction)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations (in the previous example, attribute 'Tenured' with classes 'yes' and 'no')
  - New data is classified based on the training set
- Unsupervised learning (clustering and associations)
  - The class label of the training data is unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Evaluating classification methods
- Accuracy (or other quality measures; more on this later today)
  - Classifier accuracy: predicting class labels
  - Predictor accuracy: guessing values of predicted attributes
- Speed and complexity
  - Time to construct the model (training time)
  - Time to use the model (classification/prediction time)
- Scalability: efficiency in handling disk-based databases
- Robustness: handling noise and outliers
- Interpretability: understanding and insight provided by the model
- Other measures (for example, goodness of rules), such as decision tree size or compactness of classification rules
Decision tree induction: Example training data set
What rule would you learn to identify who buys a computer?

    Age     Income  Student  Credit rating  Buys computer
    <=30    high    no       fair           no
    <=30    high    no       excellent      no
    31..40  high    no       fair           yes
    >40     medium  no       fair           yes
    >40     low     yes      fair           yes
    >40     low     yes      excellent      no
    31..40  low     yes      excellent      yes
    <=30    medium  no       fair           no
    <=30    low     yes      fair           yes
    >40     medium  yes      fair           yes
    <=30    medium  yes      excellent      yes
    31..40  medium  no       excellent      yes
    31..40  high    yes      fair           yes
    >40     medium  no       excellent      no

Source: Han and Kamber, DM Book, 2nd Ed. (Copyright 2006 Elsevier Inc.)
Example output: A decision tree for Buys computer

    age?
      <=30   -> student?
                  no  -> no
                  yes -> yes
      31..40 -> yes
      >40    -> credit rating?
                  excellent -> no
                  fair      -> yes

Source: Han and Kamber, DM Book, 2nd Ed. (Copyright 2006 Elsevier Inc.)
Algorithm for decision tree induction
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all training examples are at the root
  - Take the best immediate (local) decision while building the tree
  - Data is partitioned recursively based on selected attributes
  - Attributes are categorical (continuous-valued attributes need to be discretised in advance)
  - Attributes are selected on the basis of a heuristic or statistical measure (for example, information gain, gain ratio, or Gini index); see the sketch after this list
- Conditions for stopping the partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
  - There are no samples left
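A minimal Python sketch of this greedy loop (ID3-style, categorical attributes only; not the exact algorithm from any particular system). The attribute selection measure is passed in as a function score(rows, attr); the info_gain function sketched after the information gain example below can be used for it:

    from collections import Counter

    def build_tree(rows, attrs, score, class_attr='buys'):
        """rows: list of dicts; attrs: candidate attribute names;
        score(rows, attr): any attribute selection measure."""
        labels = [r[class_attr] for r in rows]
        # Stop: all samples at this node belong to the same class
        if len(set(labels)) == 1:
            return labels[0]
        # Stop: no remaining attributes -> majority voting
        if not attrs:
            return Counter(labels).most_common(1)[0][0]
        # Greedy step: select the attribute with the best measure
        best = max(attrs, key=lambda a: score(rows, a))
        rest = [a for a in attrs if a != best]
        # Partition the data on the selected attribute and recurse
        tree = {}
        for value in set(r[best] for r in rows):
            part = [r for r in rows if r[best] == value]
            tree[(best, value)] = build_tree(part, rest, score, class_attr)
        return tree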
Attribute selection measure: Information gain
- Used in the popular decision tree algorithm ID3
- In each step, select the attribute with the highest information gain
- Let p_i be the probability that an arbitrary record in data set D belongs to class C_i, estimated as |C_{i,D}| / |D| (the number of records of class C_i in D divided by the total number of records in D)
- Expected information (information entropy) needed to classify a record in D (m being the number of classes):
  Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
- Information required (after using attribute A to split D into v partitions) to classify D:
  Info_A(D) = \sum_{j=1}^{v} (|D_j| / |D|) Info(D_j)
- The smaller Info_A(D) is, the purer the partitions are
- Information gained by using attribute A for branching:
  Gain(A) = Info(D) - Info_A(D)
Information gain example
- Class P: Buys computer = 'yes' (9 records); class N: Buys computer = 'no' (5 records):
  Info(D) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) = 0.940
- Partitioning the training data (previous slide) on age gives:

    Age     p_i  n_i
    <=30    2    3
    31..40  4    0
    >40     3    2

  Info_age(D) = (5/14) Info(D_{<=30}) + (4/14) Info(D_{31..40}) + (5/14) Info(D_{>40}) = 0.694
  where, for example, Info(D_{<=30}) = -(2/5) \log_2(2/5) - (3/5) \log_2(3/5) = 0.971
- From this follows: Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246
- Similarly: Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit rating) = 0.048
- Age has the highest information gain and is therefore selected as the first splitting attribute
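As a check on these numbers, here is a small Python sketch of entropy and information gain, run over the training data from the earlier slide (attribute and variable names are our own):

    import math
    from collections import Counter

    def entropy(labels):
        """Info(D) = -sum_i p_i * log2(p_i)."""
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def info_gain(rows, attr, class_attr='buys'):
        """Gain(A) = Info(D) - Info_A(D)."""
        base = entropy([r[class_attr] for r in rows])
        info_a = 0.0
        for value in set(r[attr] for r in rows):
            part = [r[class_attr] for r in rows if r[attr] == value]
            info_a += len(part) / len(rows) * entropy(part)
        return base - info_a

    names = ('age', 'income', 'student', 'credit', 'buys')
    data = [
        ('<=30', 'high', 'no', 'fair', 'no'),
        ('<=30', 'high', 'no', 'excellent', 'no'),
        ('31..40', 'high', 'no', 'fair', 'yes'),
        ('>40', 'medium', 'no', 'fair', 'yes'),
        ('>40', 'low', 'yes', 'fair', 'yes'),
        ('>40', 'low', 'yes', 'excellent', 'no'),
        ('31..40', 'low', 'yes', 'excellent', 'yes'),
        ('<=30', 'medium', 'no', 'fair', 'no'),
        ('<=30', 'low', 'yes', 'fair', 'yes'),
        ('>40', 'medium', 'yes', 'fair', 'yes'),
        ('<=30', 'medium', 'yes', 'excellent', 'yes'),
        ('31..40', 'medium', 'no', 'excellent', 'yes'),
        ('31..40', 'high', 'yes', 'fair', 'yes'),
        ('>40', 'medium', 'no', 'excellent', 'no'),
    ]
    rows = [dict(zip(names, r)) for r in data]

    for a in names[:-1]:
        print(a, round(info_gain(rows, a), 3))
    # age 0.246, income 0.029, student 0.151, credit 0.048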
Calculating information gain for continuous-valued attributes
- Let A be an attribute with continuous values (for example, income)
- We have to determine the best split point for A
  - Sort the values of A in increasing order
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}
  - The point with the minimum expected information requirement for A, Info_A(D), is selected as the split point for A (see the sketch below)
- Split: D_1 is the set of records in D satisfying A <= split point, and D_2 is the set of records in D satisfying A > split point
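A minimal sketch of this procedure, reusing the entropy helper from the previous sketch (the function name is our own):

    def best_split_point(rows, attr, class_attr='buys'):
        """Try each midpoint of adjacent sorted values of a continuous
        attribute and return the split minimising Info_A(D)."""
        pairs = sorted((r[attr], r[class_attr]) for r in rows)
        n = len(pairs)
        best_info, best_split = float('inf'), None
        for i in range(n - 1):
            if pairs[i][0] == pairs[i + 1][0]:
                continue  # no midpoint between equal values
            split = (pairs[i][0] + pairs[i + 1][0]) / 2
            d1 = [lab for v, lab in pairs if v <= split]
            d2 = [lab for v, lab in pairs if v > split]
            info = (len(d1) / n) * entropy(d1) + (len(d2) / n) * entropy(d2)
            if info < best_info:
                best_info, best_split = info, split
        return best_split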
Attribute selection measure: Gain ratio
- The information gain measure is biased towards attributes with a large number of values
- C4.5 (the successor of ID3) uses the gain ratio to overcome this problem (a normalisation of information gain):
  SplitInfo_A(D) = -\sum_{j=1}^{v} (|D_j| / |D|) \log_2(|D_j| / |D|)
  GainRatio(A) = Gain(A) / SplitInfo_A(D)
- The attribute with the maximum gain ratio is selected as the splitting attribute
- Example: gain ratio for attribute income (4 'low', 6 'medium', and 4 'high' records):
  SplitInfo_income(D) = -(4/14) \log_2(4/14) - (6/14) \log_2(6/14) - (4/14) \log_2(4/14) = 1.557
  GainRatio(income) = 0.029 / 1.557 = 0.019
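Continuing the earlier sketch, the gain ratio is a small wrapper around info_gain:

    def gain_ratio(rows, attr, class_attr='buys'):
        """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
        n = len(rows)
        counts = Counter(r[attr] for r in rows)
        split_info = -sum((c / n) * math.log2(c / n)
                          for c in counts.values())
        return info_gain(rows, attr, class_attr) / split_info

    print(round(gain_ratio(rows, 'income'), 3))  # 0.019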
Attribute selection measure: Gini index
- Used in CART and in IBM Intelligent Miner
- If a data set D contains records from m classes, the Gini index gini(D) is defined as:
  gini(D) = 1 - \sum_{j=1}^{m} p_j^2
  where p_j is the relative frequency of class j in D
- If D is split on attribute A into two subsets D_1 and D_2, the Gini index gini_A(D) is defined as:
  gini_A(D) = (|D_1| / |D|) gini(D_1) + (|D_2| / |D|) gini(D_2)
- Reduction in impurity: \Delta gini(A) = gini(D) - gini_A(D)
- The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split a node
- All possible splitting points need to be enumerated for each attribute
Gini index example
- The example data set D has 9 records with Buys computer = 'yes' and 5 records with Buys computer = 'no':
  gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
- Suppose attribute income partitions the data set into 10 records in D_1: {medium, low} and 4 records in D_2: {high}:
  gini_{income in {low, medium}}(D) = (10/14) gini(D_1) + (4/14) gini(D_2)
    = (10/14)(1 - (7/10)^2 - (3/10)^2) + (4/14)(1 - (2/4)^2 - (2/4)^2) = 0.443
- Splitting on {low, high} and {medium} gives a Gini index of 0.458, and splitting on {medium, high} and {low} gives 0.450
- The best split for attribute income is therefore on {low, medium} and {high}, because it minimises the Gini index
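These numbers can be reproduced with a short sketch, again reusing the rows list from the information gain sketch:

    def gini(labels):
        """gini(D) = 1 - sum_j p_j^2."""
        total = len(labels)
        return 1.0 - sum((n / total) ** 2
                         for n in Counter(labels).values())

    def gini_split(rows, attr, subset, class_attr='buys'):
        """gini_A(D) for the binary split: attr value in subset or not."""
        d1 = [r[class_attr] for r in rows if r[attr] in subset]
        d2 = [r[class_attr] for r in rows if r[attr] not in subset]
        n = len(rows)
        return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

    print(round(gini_split(rows, 'income', {'low', 'medium'}), 3))   # 0.443
    print(round(gini_split(rows, 'income', {'low', 'high'}), 3))     # 0.458
    print(round(gini_split(rows, 'income', {'medium', 'high'}), 3))  # 0.450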
Comparing attribute selection measures
- The three measures, in general, return good results, but...
- Information gain
  - Is biased towards multivalued attributes
- Gain ratio
  - Tends to prefer unbalanced splits in which one partition is much smaller than the others
- Gini index
  - Is biased towards multivalued attributes
  - Has difficulty when the number of classes is large
  - Tends to favour tests that result in equal-sized partitions with purity in both partitions
- There are many other selection measures!
Overfitting and tree pruning
- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Such a tree classifies the training records very well, but has poor accuracy on unseen records
- Two approaches to avoid overfitting (see the pruning sketch after this list)
  - Prepruning: halt tree construction early; do not split a node if this would result in a goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)
  - Postpruning: remove branches from a fully grown tree to get a sequence of progressively pruned trees (use a set of data different from the training data to decide which is the best pruned tree)
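A hedged sketch of one simple postpruning strategy (reduced-error pruning, not necessarily the variant used by any particular system), operating on the nested-dict trees produced by the induction sketch above:

    def classify(tree, row):
        if not isinstance(tree, dict):
            return tree  # leaf: a class label
        for (attr, value), subtree in tree.items():
            if row[attr] == value:
                return classify(subtree, row)
        return None  # attribute value not seen during training

    def accuracy(tree, rows, class_attr='buys'):
        return sum(classify(tree, r) == r[class_attr]
                   for r in rows) / len(rows)

    def prune(tree, train_rows, prune_rows, class_attr='buys'):
        """Bottom-up: collapse a subtree to its majority-class leaf
        whenever that does not hurt accuracy on the pruning set.
        Assumes train_rows are the rows the tree was built from."""
        if not isinstance(tree, dict):
            return tree
        pruned = {}
        for (attr, value), subtree in tree.items():
            tr = [r for r in train_rows if r[attr] == value]
            pr = [r for r in prune_rows if r[attr] == value]
            pruned[(attr, value)] = prune(subtree, tr, pr, class_attr)
        leaf = Counter(r[class_attr]
                       for r in train_rows).most_common(1)[0][0]
        if prune_rows and accuracy(leaf, prune_rows) >= accuracy(pruned, prune_rows):
            return leaf
        return pruned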
Enhancements to basic decision tree induction
- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values (a simple strategy is sketched below)
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute (feature) construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication
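For illustration, a minimal sketch of the first missing-value strategy (a missing value is represented here as None; the function name is our own):

    def fill_missing(rows, attr):
        """Replace missing (None) values of attr by the attribute's
        most common observed value in the training data."""
        observed = [r[attr] for r in rows if r[attr] is not None]
        most_common = Counter(observed).most_common(1)[0][0]
        for r in rows:
            if r[attr] is None:
                r[attr] = most_common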
Classification in large databases
- Classification is a classical problem, extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why decision tree induction in data mining?
  - Relatively fast learning speed (compared to other classification methods)
  - Convertible to simple and easy-to-understand classification rules
  - Can use SQL queries for accessing databases
  - Comparable classification accuracy with other methods
Things to do...
- Tutorial 4 is this week (association rules mining and introduction to the Rattle data mining tool)
- Please read the tutorial sheet before going into your tutorial (specifically, have a read of the Rattle introduction documentation)
- Work on assignment 2 (please post questions on the COMP3420 discussion forum)