DCS 802 Data Mining Decision Tree Classification Technique


DCS 802 Data Mining: Decision Tree Classification Technique
Prof. Sung-Hyuk Cha, Spring 2002
School of Computer Science & Information Systems

Decision Tree Learning
- A widely used, practical method for inductive inference.
- Approximates discrete-valued target functions as trees.
- Robust to noisy data and capable of learning disjunctive expressions.
- The family of decision tree learning algorithms includes ID3, ASSISTANT, and C4.5.
- These algorithms use a completely expressive hypothesis space.
- Their inductive bias is a preference for small trees over large trees.

Introduction
- The learned function is represented as a decision tree.
- Learned trees can also be re-represented as sets of if-then rules to improve human readability.

Decision Tree Representation
- Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance.
- Each node in the tree specifies a test of some attribute of the instance.
- Each branch descending from that node corresponds to one of the possible values of this attribute.
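To make the representation concrete, here is a minimal Python sketch (not part of the original slides) of the PlayTennis tree described by Figure 3.1 and by the disjunction on a later slide, encoded as nested dictionaries, together with a routine that sorts an instance down from the root to a leaf. The node format and function name are assumptions made for illustration.

```python
# Minimal sketch: a decision tree as nested dicts with the assumed node format
# {attribute: {value: subtree_or_leaf_label, ...}}.
play_tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, instance):
    """Sort an instance down the tree until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))  # the test at this node
        tree = branches[instance[attribute]]            # follow the matching branch
    return tree

# A sunny day with high humidity is classified "No".
print(classify(play_tennis_tree,
               {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))
```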

Examples for the Decision Tree (PlayTennis)

Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No

Learned Decision Tree for PlayTennis
[Figure 3.1: the decision tree learned from these training examples.]
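For the sketches that follow, the training set above can also be written as a small Python list of dictionaries. The variable name PLAY_TENNIS is an assumption for illustration; the field names mirror the table columns, with "PlayTennis" as the target attribute.

```python
# The 14 PlayTennis training examples from the table above (D1..D14 in order).
PLAY_TENNIS = [
    {"Outlook": "Sunny",    "Temp": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
    {"Outlook": "Sunny",    "Temp": "Hot",  "Humidity": "High",   "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Temp": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temp": "Mild", "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temp": "Cool", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temp": "Cool", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Overcast", "Temp": "Cool", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "Yes"},
    {"Outlook": "Sunny",    "Temp": "Mild", "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
    {"Outlook": "Sunny",    "Temp": "Cool", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temp": "Mild", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Sunny",    "Temp": "Mild", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "Yes"},
    {"Outlook": "Overcast", "Temp": "Mild", "Humidity": "High",   "Wind": "Strong", "PlayTennis": "Yes"},
    {"Outlook": "Overcast", "Temp": "Hot",  "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Temp": "Mild", "Humidity": "High",   "Wind": "Strong", "PlayTennis": "No"},
]
```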

Decision Trees Represent a Disjunction of Conjunctions
- The tree above represents the expression
  (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)

Appropriate Problems for Decision Tree Learning (1)
- Instances are represented by attribute-value pairs.
  - Each attribute takes on a small number of disjoint possible values, e.g., Hot, Mild, Cool.
  - Extensions allow real-valued attributes as well, e.g., Temperature.
- The target function has discrete output values.
  - E.g., Boolean classification (yes or no).
  - Easily extended to multiple-valued functions; can be extended to real-valued outputs as well.

Appropriate Problems for Decision Tree Learning (2)
- Disjunctive descriptions may be required: decision trees naturally represent disjunctive expressions.
- The training data may contain errors: the method is robust to errors both in classifications and in attribute values.
- The training data may contain missing attribute values, e.g., the Humidity value is known only for some training examples.

Appropriate Problems for Decision Tree Learning (3)
- Practical problems that fit these characteristics include:
  - classifying medical patients by their disease,
  - equipment malfunctions by their cause,
  - loan applications by the likelihood of default on payments.

The Basic Decision Tree Learning Algorithm (ID3)
- Top-down, greedy search (no backtracking) through the space of possible decision trees.
- Begins with the question: which attribute should be tested at the root of the tree?
- Answer: evaluate each attribute to see how well it alone classifies the training examples.
- The best attribute is used as the root node, and a descendant of the root is created for each possible value of this attribute.

Which Attribute Is Best for the Classifier?
- Select the attribute that is most useful for classification.
- ID3 uses information gain as a quantitative measure of an attribute.
- Information gain: a statistical property that measures how well a given attribute separates the training examples according to their target classification.

ID3 Notation
[Figure: the root node tests the attribute A that best classifies the examples; the remaining attributes are {Attributes} - A; branches for values v1, v2, and v3 of attribute A lead to subtrees or to leaves with Label = + or Label = -.]

ID3 Algorithm (to learn Boolean-valued functions)
ID3(Examples, Target_attribute, Attributes)
- Examples are the training examples.
- Target_attribute is the attribute (or feature) whose value is to be predicted by the tree.
- Attributes is a list of other attributes that may be tested by the learned decision tree.
- Returns a decision tree (actually the root node of the tree) that correctly classifies the given Examples.

- Create a Root node for the tree.
- If all Examples are positive, return the single-node tree Root with label = +.
- If all Examples are negative, return the single-node tree Root with label = -.
- If Attributes is empty, return the single-node tree Root with label = the most common value of Target_attribute in Examples.
  (Note that we will return the name of a feature at this point; see the continuation of the algorithm below.)

Summary of the ID3 Algorithm, continued
- Otherwise begin:
  - A <- the attribute from Attributes that best* classifies Examples.
  - The decision attribute (feature) for Root <- A.
  - For each possible value v_i of A:
    - Add a new tree branch below Root, corresponding to the test A = v_i.
    - Let Examples_vi be the subset of Examples that have value v_i for A.
    - If Examples_vi is empty, then below this new branch add a leaf node with label = the most common value of Target_attribute in Examples.
    - Else, below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes - {A}).
  - End.
- Return Root.
* The best attribute is the one with the highest information gain, as defined by the equation:
  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

Entropy as a Measure of Homogeneity of Examples
- Information gain is defined in terms of entropy: the expected reduction in entropy caused by partitioning the examples according to an attribute.
- Entropy characterizes the (im)purity of an arbitrary collection of examples.
- Given a collection S of positive and negative examples, the entropy of S relative to this Boolean classification is
  Entropy(S) = -p₊ log₂ p₊ - p₋ log₂ p₋
  where p₊ is the proportion of positive examples in S and p₋ is the proportion of negative examples.
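A minimal Python sketch of the entropy measure just defined (the function name and structure are my own, not from the slides); it also reproduces the 0.940 value computed in the illustration on the next slide for a collection of 9 positive and 5 negative examples.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels (works for two or more classes)."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# The [9+, 5-] collection used in the illustration that follows:
s = ["+"] * 9 + ["-"] * 5
print(round(entropy(s), 3))   # ≈ 0.940
```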

Entropy Illustration
- S is a collection of 14 examples with 9 positive and 5 negative examples.
- Entropy of S relative to the Boolean classification:
  Entropy([9+, 5-]) = -(9/14) log₂(9/14) - (5/14) log₂(5/14) = 0.940
- Entropy is zero if all members of S belong to the same class.

Entropy Function Relative to a Boolean Classification
[Figure: plot of the entropy function against the proportion of positive examples for a Boolean classification.]

Entropy for a Multi-Valued Target Function
- If the target attribute can take on c different values, the entropy of S relative to this c-wise classification is
  Entropy(S) = Σ_{i=1}^{c} -p_i log₂ p_i
  where p_i is the proportion of S belonging to class i.

Information Gain Measures the Expected Reduction in Entropy
- Entropy measures the impurity of a collection.
- The information gain Gain(S, A) of an attribute A is the reduction in entropy caused by partitioning the examples according to this attribute:
  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
  where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v.
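A sketch of the gain measure in Python, working directly from per-class counts so that it stands alone (the function names are assumptions); it reproduces the Gain(S, Wind) = 0.048 computation shown on the next slide.

```python
import math

def entropy_from_counts(counts):
    """Entropy of a collection summarized as per-class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

def information_gain(parent_counts, partition_counts):
    """Gain(S, A) = Entropy(S) - sum over v of (|S_v|/|S|) * Entropy(S_v).
    `partition_counts` holds one count list per value v of attribute A."""
    total = sum(parent_counts)
    remainder = sum((sum(c) / total) * entropy_from_counts(c)
                    for c in partition_counts)
    return entropy_from_counts(parent_counts) - remainder

# Gain(S, Wind): S = [9+, 5-], S_weak = [6+, 2-], S_strong = [3+, 3-].
print(round(information_gain([9, 5], [[6, 2], [3, 3]]), 3))   # ≈ 0.048
```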

Training Examples for Target Concept PlayTennis
(The same 14-example table shown earlier under "Examples for the Decision Tree".)

Stepping Through ID3 for the Example
- Top level (with S = {D1, ..., D14}):
  Gain(S, Outlook)     = 0.246
  Gain(S, Humidity)    = 0.151
  Gain(S, Wind)        = 0.048
  Gain(S, Temperature) = 0.029
- Example computation of the gain for Wind:
  Values(Wind) = {Weak, Strong}, S = [9+, 5-], S_weak = [6+, 2-], S_strong = [3+, 3-]
  Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.00) = 0.048
- Outlook has the highest gain, so it gives the best prediction of the target attribute and is chosen as the root test.
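Putting the pieces together, here is a compact recursive ID3 sketch. It assumes the PLAY_TENNIS list and the entropy helper from the earlier sketches are in scope; everything else (function names, the nested-dict tree format) is my own illustration rather than the slides' notation.

```python
from collections import Counter

def gain(examples, attribute, target):
    """Information gain of `attribute` on `examples` (uses `entropy` from above)."""
    base = entropy([x[target] for x in examples])
    remainder = 0.0
    for value in {x[attribute] for x in examples}:
        subset = [x[target] for x in examples if x[attribute] == value]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return base - remainder

def id3(examples, target, attributes):
    """Return a nested-dict decision tree (or a bare class label for a leaf)."""
    labels = [x[target] for x in examples]
    if len(set(labels)) == 1:          # all examples positive or all negative
        return labels[0]
    if not attributes:                 # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    # One branch per observed value of the best attribute (the slides' algorithm
    # also adds majority-label leaves for attribute values with no examples).
    for value in sorted({x[best] for x in examples}):
        subset = [x for x in examples if x[best] == value]
        tree[best][value] = id3(subset, target, remaining)
    return tree

tree = id3(PLAY_TENNIS, "PlayTennis", ["Outlook", "Temp", "Humidity", "Wind"])
print(tree)   # Outlook ends up at the root, as in the hand computation above
```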

Sample Calculations for Humidity and Wind
[Figure: side-by-side information-gain calculations for splitting S on Humidity and on Wind.]

The Partially Learned Decision Tree
[Figure: the tree after the first split on Outlook, with branches still to be expanded.]

Hypothesis Space Search in Decision Tree Learning
- ID3 searches a hypothesis space for a hypothesis that fits the training examples.
- ID3 performs hill climbing, considering progressively more elaborate hypotheses, to find a decision tree that correctly classifies the training data.
- The hill climbing is guided by an evaluation function: the information gain measure.

Hypothesis Space Search by ID3
[Figure: the information gain heuristic guides ID3's search through the space of decision trees.]

Capabilities and Limitations of ID3
- The hypothesis space is a complete space of all discrete-valued functions.
- ID3 cannot determine how many alternative trees are consistent with the training data (this follows from maintaining a single current hypothesis).
- ID3 in its pure form performs no backtracking, so it carries the usual risks of hill climbing: it may converge to a local optimum.
- ID3 uses all training examples at each step to make statistically based decisions about how to refine its current hypothesis; this makes it more robust than Find-S and Candidate-Elimination, which are incrementally based.

Inductive Bias in ID3
- What is the policy by which ID3 generalizes from observed training examples to classify unseen instances?
- The inductive bias is the basis for choosing one consistent hypothesis over the others.

Inductive Bias in Decision Tree Learning
- Approximate inductive bias of ID3: shorter trees are preferred over larger trees.
  - Breadth-First-Search ID3 (BFS-ID3) would search all trees of depth 1, then all trees of depth 2, and so on; ID3 can be viewed as an efficient approximation of this search.
- A closer approximation to the inductive bias of ID3: shorter trees are preferred over longer trees, and trees that place high-information-gain attributes close to the root are preferred over those that do not.

Restriction Biases and Preference Biases
- ID3 searches a complete hypothesis space (i.e., one capable of expressing any finite discrete-valued function). Its inductive bias is a preference for certain hypotheses; this is referred to as a preference bias or search bias.
- The version space CANDIDATE-ELIMINATION algorithm searches an incomplete hypothesis space (i.e., one that can express only a subset of the potentially teachable concepts). Its inductive bias takes the form of a restriction on the set of hypotheses considered; this is referred to as a restriction bias or language bias.

Why Prefer Short Hypotheses?
- ID3 has an inductive bias favoring shorter decision trees.
- Occam's Razor: prefer the simplest hypothesis that fits the data.
- There are fewer short hypotheses than long ones, so it is less likely that a short hypothesis coincidentally fits the data, whereas many long hypotheses fit the training data yet fail to generalize subsequently.
- E.g., prefer a 5-node tree that fits 20 examples over a 500-node tree; compare a linear versus a polynomial fit of noisy data.

Issues in Decision Tree Learning
- How deeply to grow the decision tree.
- Handling continuous attributes.
- Choosing an appropriate attribute selection measure.
- Handling training data with missing attribute values.
- Handling attributes with differing costs.
- Improving computational efficiency.

Avoiding Overfitting the Data
- One must be careful to avoid overfitting the data when there is noise in the data, or when the number of training examples is too small to produce a representative sample of the true target function.

Overfitting the Data
- Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has smaller error than h over the entire distribution of instances.

Overfitting in Decision Tree Learning
[Figure: illustration of overfitting for the task of learning which patients have a form of diabetes.]

Approaches to Prevent Overfitting
- Approaches that stop growing the tree early, before it reaches the point where it perfectly classifies the training data.
- Approaches that allow the tree to overfit the data and then post-prune the tree.

How to Determine the Correct Tree Size
1. Training and validation: use a separate set of examples, distinct from the training examples, to evaluate the utility of post-pruning nodes from the tree.
2. Use all available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement.
3. Use an explicit measure of the complexity of encoding the training examples and the decision tree, halting growth when this encoding size is minimized.

Training and Validation
- Training set: used to form the learned hypotheses.
- Validation set: used to evaluate the accuracy of a hypothesis over subsequent data, and also to evaluate the impact of pruning the hypothesis.
- Philosophy: the validation set is unlikely to exhibit the same random fluctuations as the training set, so it serves as a check against overfitting.
- Typically the validation set is one half the size of the training set.
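A small sketch of such a split, using the 2:1 training-to-validation ratio mentioned above; the shuffling, seed, and variable names are my own illustration.

```python
import random

def train_validation_split(examples, seed=0):
    """Shuffle the examples and hold out one third as a validation set,
    so the validation set is one half the size of the training set."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]

# With the PLAY_TENNIS data defined earlier: 9 training and 5 validation examples.
train, validation = train_validation_split(PLAY_TENNIS)
print(len(train), len(validation))   # 9 5
```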

Reduced-Error Pruning
- Consider each of the decision nodes in the tree to be a candidate for pruning.
- Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node.
- Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set.

Effect of Reduced-Error Pruning
[Figure: the effect of reduced-error pruning on accuracy as the size of the tree varies.]
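A minimal sketch of the reduced-error pruning idea just described, over the nested-dict tree format used earlier. The helper names and the single bottom-up pass are my own simplifications, not the slides' procedure.

```python
from collections import Counter

def classify_with_default(tree, instance, default):
    """Walk the tree; fall back to `default` if a branch is missing."""
    if not isinstance(tree, dict):
        return tree                                    # leaf: a class label
    attribute, branches = next(iter(tree.items()))
    subtree = branches.get(instance.get(attribute))
    return default if subtree is None else classify_with_default(subtree, instance, default)

def reduced_error_prune(tree, train, validation, target):
    """Bottom-up: at each decision node, replace the subtree with a leaf holding
    the most common class of the training examples at that node, but only if
    validation accuracy does not get worse."""
    if not isinstance(tree, dict):
        return tree
    attribute, branches = next(iter(tree.items()))
    for value in list(branches):                       # prune the children first
        t_sub = [x for x in train if x[attribute] == value]
        v_sub = [x for x in validation if x[attribute] == value]
        branches[value] = reduced_error_prune(branches[value], t_sub, v_sub, target)
    if not train or not validation:                    # nothing to evaluate against
        return tree
    majority = Counter(x[target] for x in train).most_common(1)[0][0]
    kept = sum(classify_with_default(tree, x, majority) == x[target] for x in validation)
    pruned = sum(majority == x[target] for x in validation)
    return majority if pruned >= kept else tree        # prune only if no worse

# Usage sketch: grow a tree on `train` with the id3 sketch above, then prune it
# against `validation`:
# pruned = reduced_error_prune(id3(train, "PlayTennis", ["Outlook", "Temp", "Humidity", "Wind"]),
#                              train, validation, "PlayTennis")
```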

Rule Post-Pruning
1. Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing overfitting to occur.
2. Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.
3. Prune (generalize) each rule by removing any preconditions whose removal improves its estimated accuracy.
4. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.

Incorporating Continuous-Valued Attributes

Temperature:  40  48  60  72  80  90
PlayTennis:   No  No  Yes Yes Yes No
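For a continuous attribute like Temperature, a common approach (not spelled out on the slide) is to sort the values, take candidate thresholds at boundaries where the classification changes, and keep the threshold with the highest information gain. A minimal sketch under those assumptions:

```python
import math

def entropy_of(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_threshold(values, labels):
    """Candidate thresholds sit at midpoints where the class changes; return
    the candidate with the highest information gain for a binary split."""
    pairs = sorted(zip(values, labels))
    candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2
                  for i in range(len(pairs) - 1)
                  if pairs[i][1] != pairs[i + 1][1]]
    base = entropy_of(labels)
    def gain(t):
        below = [l for v, l in pairs if v <= t]
        above = [l for v, l in pairs if v > t]
        return base - (len(below) / len(pairs)) * entropy_of(below) \
                    - (len(above) / len(pairs)) * entropy_of(above)
    return max(candidates, key=gain)

# The Temperature example from the slide above: candidates are 54 and 85.
temps = [40, 48, 60, 72, 80, 90]
play  = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, play))   # 54.0
```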

[Figures: two scatter plots, panels (a) and (b), of instances labeled a and b plotted against attributes x and y.]


Class P examples:

     f1, f2, f3    f'_{1,2}, f'_{1,3}, f'_{2,3}
P1   1, 2, 6       A, A, A
P2   7, 4, 4       B, B, B
P3   5, 3, 2       A, A, A
P4   3, 3, 3       A, A, A
P5   3, 5, 4       A, A, B
P6   3, 4, 2       A, A, A
P7   5, 3, 3       A, A, A
P8   4, 1, 3       A, A, A
P9   5, 2, 1       A, A, A

Class N examples:

     f1, f2, f3    f'_{1,2}, f'_{1,3}, f'_{2,3}
N1   4, 5, 7       A, B, B
N2   5, 6, 6       B, B, B
N3   5, 7, 3       B, A, B
N4   6, 6, 6       B, B, B
N5   6, 4, 8       B, B, B
N6   6, 7, 2       B, B, B
N7   7, 2, 6       B, B, A
N8   7, 5, 4       B, B, B
N9   8, 6, 1       B, B, B

Alternative Measures for Selecting Attributes
- The split information measure penalizes attributes that split the data into many subsets:
  SplitInformation(S, A) = -Σ_{i=1}^{c} (|S_i| / |S|) log₂ (|S_i| / |S|)
  where S_1 through S_c are the c subsets of examples resulting from partitioning S by the c-valued attribute A.

Alternative Measures for Selecting Attributes, continued
- The gain ratio measure combines gain and split information:
  GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

Handling Training Examples with Missing Attribute Values
- Strategies for an example x with a missing value for attribute A at node n:
  - Assign it the value of A that is most common among the training examples at node n.
  - Assign a probability to each of the possible values of A, rather than simply assigning the most common value to A(x).
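A sketch of the two measures above in Python, again working from per-value label counts (the function names are assumptions for illustration):

```python
import math

def _entropy(counts):
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

def split_information(partition_counts):
    """SplitInformation(S, A): entropy of S with respect to the sizes of the
    subsets S_1..S_c produced by attribute A (not to the target classes)."""
    sizes = [sum(c) for c in partition_counts]
    return _entropy(sizes)

def gain_ratio(parent_counts, partition_counts):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    total = sum(parent_counts)
    remainder = sum((sum(c) / total) * _entropy(c) for c in partition_counts)
    gain = _entropy(parent_counts) - remainder
    return gain / split_information(partition_counts)

# Wind on the PlayTennis data: S = [9+, 5-], S_weak = [6+, 2-], S_strong = [3+, 3-].
print(round(gain_ratio([9, 5], [[6, 2], [3, 3]]), 3))   # ≈ 0.049
```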

Handling Attributes with Differing Costs
- Strategies:
  - Replace the information gain attribute selection measure by
    Gain²(S, A) / Cost(A)
  - Or use the attribute selection measure
    (2^Gain(S, A) - 1) / (Cost(A) + 1)^w
    where w ∈ [0, 1] is a constant that determines the relative importance of cost versus gain.

Summary
- Decision tree learning provides a practical method for concept learning and for learning discrete-valued functions.
- ID3 searches a complete hypothesis space (i.e., the space of decision trees can represent any discrete-valued function defined over discrete-valued instances).
- The inductive bias implicit in ID3 includes a preference for smaller trees.

Summary, continued
- Overfitting the training data is an important issue in decision tree learning.
- A large variety of extensions to the basic ID3 algorithm has been developed by different researchers.