Decision Trees. Content. Classification Example: Fisher's Iris Data

Transcription

1 Decision Trees
Content. Examples; What is a decision tree?; How to build a decision tree?; Stopping rule and tree pruning; Confusion matrix (binary).
Classification Example: Fisher's Iris Data. 3 species of iris flowers, 50 observations per species. 4 predictor variables: petal length and width, sepal length and width. Objective: to predict the class of species based on the predictor variables.

2 Classification Example: Stock Selection. To predict whether a stock is underperformed or overperformed. Underperformed means its monthly return is less than the median stock return for the month; otherwise it is overperformed.
Classification Example: In-Patient Data. 1,756,484 records of hospital in-patient statistics in NSW, Australia in [...]. Aim: identify risk factors for an adverse event (AE). An AE is an unintended injury or complication which results in disability, death or prolongation of hospital stay, and is caused by health care management rather than the patient's disease, e.g. an accidental cut during surgery or an incorrect dosage of drugs. Potential predictors: comorbidity (multiple diagnoses), procedures (multiple procedures), gender, insurance, psychiatric status, age, day only, readmitted, etc. AE cases make up 3.4% of the dataset.

3 Classification Example: In-Patient Data. Model performance (confusion matrix): Misclassification rate = 31.6%; Sensitivity = 47403/60004 = 79.0%; Specificity = 68.0%.
Pen-Digits Data. Binary Tree.

4 Tree with Multiway Splits. Boston Housing Data. Regression Tree. Multivariate Step Function.

5 What is a decision tree?
Variation of Decision Trees. Classification tree: the target is discrete (binary, nominal); the leaves give the predicted class as well as the probability of class membership. Regression tree: the target is continuous; the leaves give the predicted value of the target. A tree may use binary splits or multiway splits.

6 Illustrating Classification Task. Example of a Decision Tree. Decision Tree Classification Task. Apply Model to Test Data.

7 Apply Model to Test Data (continued).

8 Apply Model to Test Data.
How to build a decision tree. Root-Node Split.
Recursive partitioning: a top-down, greedy algorithm to fit the decision tree to the data. Top-down: starting at the root node, split the data into subgroups that are as homogeneous as possible with respect to the target. Greedy: the method always makes a locally optimal choice in the hope that this will lead to a globally optimal solution.

9 1-Deep Space. Depth 2. 2-Deep Space.
Three steps in tree construction. (1) Selection of the best split: which input variable could give the best split? Best according to which splitting criterion? (2) Stop-splitting rule: when should the splitting stop? (3) Assignment of each leaf node to a class: predict the value of the target variable (discrete or continuous) at each leaf node.
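The three construction steps can be sketched as a small recursive function. This is only an illustrative sketch, not the algorithm of any particular software: it assumes numeric inputs, binary splits, Gini reduction as the splitting criterion, and two simple stopping rules. The names `best_binary_split`, `build_tree`, `predict`, `max_depth` and `min_size`, and the toy data, are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(rows, labels):
    """Step 1: scan every (input, cut-off) pair and keep the largest Gini reduction."""
    n = len(rows)
    parent = gini(labels)
    best = None  # (gain, input index, cut-off)
    for f in range(len(rows[0])):
        # Every distinct value except the largest is a cut-off, so both branches are non-empty.
        for cut in sorted({row[f] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, labels) if row[f] <= cut]
            right = [y for row, y in zip(rows, labels) if row[f] > cut]
            gain = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
            if best is None or gain > best[0]:
                best = (gain, f, cut)
    return best


def build_tree(rows, labels, depth=0, max_depth=3, min_size=2):
    """Top-down greedy recursion; max_depth and min_size are illustrative stopping rules."""
    split = None
    # Step 2: stop when the node is pure, too small, or too deep.
    if depth < max_depth and len(rows) >= min_size and len(set(labels)) > 1:
        split = best_binary_split(rows, labels)
    if split is None or split[0] <= 0:
        # Step 3: assign the leaf to its majority class.
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, f, cut = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= cut]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > cut]
    return {
        "input": f,
        "cut": cut,
        "left": build_tree([r for r, _ in left], [y for _, y in left],
                           depth + 1, max_depth, min_size),
        "right": build_tree([r for r, _ in right], [y for _, y in right],
                            depth + 1, max_depth, min_size),
    }


def predict(tree, row):
    """Follow the splits from the root down to a leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["input"]] <= tree["cut"] else tree["right"]
    return tree["leaf"]


# Hypothetical toy data: two inputs, two classes.
rows = [(1.0, 5.0), (2.0, 4.0), (3.0, 1.0), (4.0, 2.0)]
labels = ["a", "a", "b", "b"]
tree = build_tree(rows, labels)
```

Each call only looks one split ahead, which is what makes the procedure greedy: the split chosen at a node is the best available there, with no guarantee about the tree as a whole.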

10 No. of possible splits.
Split on a nominal input with L distinct levels: the no. of possible splits into B branches is the Stirling number of the second kind, $S(L, B) = B \cdot S(L-1, B) + S(L-1, B-1)$.
Split on an ordinal input with L distinct levels: the no. of possible splits into B branches is $\binom{L-1}{B-1}$.
Split on a continuous input: treat it as if it were an ordinal input.
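The counts above can be checked numerically. The sketch below implements the recurrence for the nominal case directly; for the ordinal case it uses the standard count $\binom{L-1}{B-1}$ of ways to cut an ordered list of L levels into B contiguous groups. The function names `nominal_splits` and `ordinal_splits` are illustrative, and B is assumed to be at least 1.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def nominal_splits(L, B):
    """S(L, B): ways to partition L unordered levels into B non-empty branches."""
    if B == L:
        return 1  # every level in its own branch (also covers S(0, 0) = 1)
    if B == 0 or B > L:
        return 0
    return B * nominal_splits(L - 1, B) + nominal_splits(L - 1, B - 1)


def ordinal_splits(L, B):
    """Ways to cut L ordered levels into B contiguous branches (B >= 1)."""
    return comb(L - 1, B - 1)


# Binary splits (B = 2) of an input with 4 levels.
nominal_binary = nominal_splits(4, 2)
ordinal_binary = ordinal_splits(4, 2)
```

For B = 2 these reduce to $2^{L-1} - 1$ nominal splits and $L - 1$ ordinal splits, which shows how quickly the nominal search space grows with the number of levels.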

11 Selection of the best splits. Exhaustively examining all possible splits is time consuming. By default, software will use exhaustive search if the no. of possible splits < [...]. Otherwise, a clustering of the levels of an input is used to limit the possible splits to consider. An alternative is to consider binary splits only (B = 2): a nominal input has $2^{L-1} - 1$ possible splits and an ordinal input has $L - 1$ possible splits. After a set of candidate splits is determined, a splitting criterion is used to determine the best one.
Splitting criterion for discrete target. Two approaches for discrete target. Method 1: a statistical test for independence between the input and target variables (Chi-squared test or likelihood ratio test). The best split is the one that is most significant (i.e. its p-value is the smallest).
Statistical approach to splitting. Any split in a classification tree can be arranged in a contingency table. Test of independence between target (row) and input (column). Chi-squared test: $X^2 = \sum (O - E)^2 / E$. Likelihood ratio test: $G^2 = 2 \sum O \ln(O/E)$. Here O = observed frequency and E = expected frequency. $X^2$ and $G^2$ approximately follow a chi-square distribution with d.f. $(r-1)(B-1)$, where r = no. of target levels and B = no. of branches.
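As a rough illustration of the two test statistics, the sketch below computes $X^2$, $G^2$ and the degrees of freedom from a contingency table with target levels in rows and branches in columns. The counts are made up, the function names are illustrative, and the p-value helper is deliberately restricted to d.f. = 1 (a binary split of a binary target), where the chi-square tail probability has a closed form through the complementary error function. It assumes no row or column total is zero.

```python
import math


def independence_tests(table):
    """table[i][j] = count of target level i in branch j. Returns (X2, G2, df)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
            if observed > 0:  # a zero cell contributes nothing to G2
                g2 += 2.0 * observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, g2, df


def p_value_df1(stat):
    """Upper-tail chi-square probability; valid only when d.f. = 1."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical binary split of a binary target.
table = [[30, 10],
         [10, 30]]
x2, g2, df = independence_tests(table)
p = p_value_df1(x2)
```

For tables with more than one degree of freedom a general chi-square tail function would be needed, which is left out here to keep the sketch dependency-free.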

12 Example revisited. Pen-Digits Data: Chi-Squared Test. $X^2$ = [...], d.f. = 1; $G^2$ = [...], d.f. = 1. A smaller p-value means a stronger association between input and target. The split with the smallest p-value, or equivalently the largest logworth = $-\log_{10}(\text{p-value})$, will be chosen.
Splitting criterion for discrete target. Method 2: based on an impurity function of a node. Gini index: $1 - \sum_j p_j^2$. Entropy: $-\sum_j p_j \log_2 p_j$, where $\log_2(x) = \ln(x)/\ln(2)$. Misclassification error: $1 - \max_j p_j$.
Gini Index. The Gini index is a measure of diversity for discrete data. Examples: Gini = $1 - 2(3/8)^2 - 2(1/8)^2 = .69$ and Gini = $1 - (6/7)^2 - (1/7)^2 = .24$. Minimum G = 0 if one of the $p_j$ is 1. Maximum G = $1 - 1/k$ if $p_1 = \dots = p_k = 1/k$.
The best split is the one that gives the maximum reduction in impurity (IP): $\Delta IP = 0.4 - (6/10)(0.33) - (4/10)(0) \approx 0.2$.
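The three impurity functions are short enough to write out directly. The sketch below takes a node's class proportions as a list and reproduces the two Gini examples above; the `impurity_reduction` helper and the parent/child proportions passed to it are hypothetical illustrations, not figures from the slides.

```python
import math


def gini(p):
    """Gini index: 1 - sum_j p_j^2."""
    return 1.0 - sum(q * q for q in p)


def entropy(p):
    """Entropy: -sum_j p_j log2 p_j, treating 0 * log2(0) as 0."""
    return sum(-q * math.log2(q) for q in p if q > 0)


def misclassification_error(p):
    """Misclassification error: 1 - max_j p_j."""
    return 1.0 - max(p)


def impurity_reduction(parent, children, impurity=gini):
    """parent: class proportions; children: list of (share of parent obs., class proportions)."""
    return impurity(parent) - sum(w * impurity(p) for w, p in children)


gini_four_class = gini([3 / 8, 3 / 8, 1 / 8, 1 / 8])  # first Gini example
gini_two_class = gini([6 / 7, 1 / 7])                 # second Gini example

# Hypothetical split of a 50/50 node into two equally sized, purer children.
delta = impurity_reduction([0.5, 0.5], [(0.5, [0.9, 0.1]), (0.5, [0.1, 0.9])])
```

Passing `impurity=entropy` or `impurity=misclassification_error` to `impurity_reduction` gives the corresponding reduction under the other two criteria.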

13 Impurity function. Properties of an impurity function of a node: it is nonnegative, and it decreases when the node is more pure, i.e. one class dominates.
Entropy. For node 1 (class proportions 0.5 and 0.5): Gini = $1 - 0.5^2 - 0.5^2 = 0.5$; Entropy = $-0.5 \log_2(0.5) - 0.5 \log_2(0.5) = 1$; Misclassification error = $1 - 0.5 = 0.5$. For node 2 (class proportions 0.75 and 0.25): Gini = $1 - 0.75^2 - 0.25^2 = 0.375$; Entropy = $-0.75 \log_2(0.75) - 0.25 \log_2(0.25) \approx 0.811$; Misclassification error = $1 - 0.75 = 0.25$.
Remarks. The process of selecting the best split on a node: 1) select the best split on each input variable (i.e. choose the number of branches and the cut-off points); 2) select the best of these. Comparing splits on the same input variable: Gini, entropy, and misclassification error favour splits into greater numbers of branches (large B), so they are not appropriate for evaluating multiway splits. The p-values of the Chi-squared and likelihood ratio tests automatically adjust for this bias through the d.f.
Problem with Impurity Reduction. Impurity reduction tends to prefer splits that result in a large number of partitions, each being small but pure. For example, Customer ID has the highest information gain because the entropy of every child node is zero.
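The Customer ID problem is easy to reproduce. In the sketch below, a multiway split on a unique identifier puts every observation in its own pure branch, so its information gain equals the full entropy of the parent node, beating any genuinely informative input. The eight records, the `gender` column and the function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction from a multiway split with one branch per distinct value."""
    branches = defaultdict(list)
    for value, label in zip(values, labels):
        branches[value].append(label)
    n = len(labels)
    return entropy(labels) - sum(len(b) / n * entropy(b) for b in branches.values())


# Hypothetical data: 8 customers, binary target.
labels = ["yes", "no", "yes", "no", "yes", "no", "yes", "no"]
customer_id = list(range(8))                        # unique per record
gender = ["F", "F", "F", "M", "M", "M", "F", "M"]   # mildly informative

gain_id = information_gain(customer_id, labels)
gain_gender = information_gain(gender, labels)
```

Here `gain_id` reaches the maximum possible value even though a customer ID cannot generalise to new records, which is the bias that significance tests with a d.f. adjustment are meant to counter.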

14 Remarks. Comparing splits on different input variables: the p-values of the Chi-squared and likelihood ratio tests tend to be smaller as the number of possible splits, m, increases. Kass (1980) proposed Bonferroni adjustments of the p-values to account for this bias: Logworth = $-\log_{10}(m \times \text{p-value})$. What is the value of m?
P-Value Adjustments in Chi-Square Test. If all the splits have logworth < $-\log_{10}(0.2) \approx 0.7$ then don't split. Otherwise, the split with the largest logworth is selected as the best split.
Which splitting criterion is the best? There is no single best choice; attempt all and determine the best results.
Splitting Criterion for continuous target. Two approaches for continuous target: one based on an impurity function of a node (sample variance), the other based on a statistical test (F test for one-way ANOVA).
Boston Housing Data: NOX.
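A minimal sketch of the adjusted logworth rule follows. It assumes each candidate split arrives with its raw p-value and the number of possible splits m on its input; capping the adjusted p-value at 1 is a common convention added here, not something stated in the slides. The function names and the candidate values are hypothetical, and p-values are assumed to be strictly positive.

```python
import math


def logworth(p_value, m=1):
    """-log10 of the Bonferroni-adjusted p-value m * p (capped at 1 by assumption)."""
    return -math.log10(min(1.0, m * p_value))


def choose_split(candidates, threshold=-math.log10(0.2)):
    """candidates: list of (name, p_value, m). Returns the best name, or None for no split."""
    if not candidates:
        return None
    scored = [(logworth(p, m), name) for name, p, m in candidates]
    best_worth, best_name = max(scored)
    # Do not split when even the best candidate falls below the threshold.
    return best_name if best_worth >= threshold else None


# Hypothetical candidates: 'age' has the smaller raw p-value but many more possible splits.
candidates = [("age", 0.001, 10), ("gender", 0.005, 1)]
chosen = choose_split(candidates)
```

In this made-up case the adjustment reverses the ranking: `age` wins on the raw p-value, but after multiplying by m = 10 its logworth drops to 2.0, below the 2.3 of `gender`.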

15 The F test is better than (sample) variance reduction as it has a p-value adjustment for different no. of branches. The F test is relatively robust to departures from the normality assumption. However, the F test is sensitive to departures from the constant variance assumption.
Assignment of each leaf node to a class. For a classification tree: classify an observation in a node to the class with maximum posterior probability. Here $p(j)$ is the prior probability of class j and $p(t \mid j)$ = proportion of class j obs. going to node t; by Bayes' rule, $p(j \mid t) = p(j)\, p(t \mid j) / \sum_k p(k)\, p(t \mid k)$. If $p(j)$ = proportion of all obs. belonging to class j, then $p(j \mid t)$ = proportion of obs. in node t belonging to class j. For a regression tree: predict an observation in a node by the sample mean of the target values in the node.
Example. Prior probabilities: $p(1) = p(7) = 364/1064$, $p(9) = 336/1064$. Conditional probabilities: $p(t \mid 1) = 285/364$, $p(t \mid 7) = 143/364$, $p(t \mid 9) = 41/336$. Show the following results for posterior probabilities: $p(1 \mid t) = 285/469$, $p(7 \mid t) = 143/469$, $p(9 \mid t) = 41/469$. Classify to class 1.
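The posterior probabilities in the example can be verified with exact fractions. The sketch below applies Bayes' rule to the priors and conditional probabilities given above; the function name `posteriors` and the dictionary layout are illustrative choices.

```python
from fractions import Fraction


def posteriors(priors, likelihoods):
    """Bayes' rule: p(j | t) = p(j) p(t | j) / sum_k p(k) p(t | k)."""
    joint = {j: priors[j] * likelihoods[j] for j in priors}
    total = sum(joint.values())
    return {j: joint[j] / total for j in joint}


# Values taken from the example: classes are the digits 1, 7 and 9.
priors = {1: Fraction(364, 1064), 7: Fraction(364, 1064), 9: Fraction(336, 1064)}
likelihoods = {1: Fraction(285, 364), 7: Fraction(143, 364), 9: Fraction(41, 336)}

post = posteriors(priors, likelihoods)
predicted_class = max(post, key=post.get)  # class with maximum posterior probability
```

Because the priors equal the class proportions in the data, each joint term reduces to a count over 1064 (285, 143 and 41), so the posteriors are simply the class shares among the 469 observations in node t.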

16 Stop splitting rule. A simple method: continue splitting until every node is pure or contains only one observation. This fits the training data perfectly but may predict poorly on new data. Two approaches: top-down stopping rules (pre-pruning) and bottom-up assessment criteria (post-pruning).
Advantages of Trees. Easy to interpret: tree-structured presentation. Allow mixed input data types: nominal, ordinal, interval. Allow a discrete (binary and nominal) or continuous target; an ordinal target is not allowed. Robust to outliers in inputs. No problem with missing values. Automatically detects interactions (AID). Accommodates nonlinearity. Selects input variables.
Disadvantages of trees. Most algorithms use univariate splits; solution: linear combination split ($a_1 x_1 + a_2 x_2 < c$?). Unstable fitted tree: often a small change in the data results in a very different series of splits; solution: bagging. Lack of smoothness (step function) in regression trees: splitting turns continuous input variables into discrete variables; solution: tree-based regression. Splitting uses a greedy algorithm: while each split is locally optimal, the overall tree need not be.

17 Confusion Matrix. Misclassification rate = (false positives + false negatives)/(total cases). Accuracy (or correct classification rate) = (true positives + true negatives)/(total cases).
Captured Response Curve or Target Concentration Curve. Proportion of all responders in the full sample that are captured in the top 10% (20%, ...) of people as ranked by the model. The aim is to locate all positive targets (all respondents).
Response rate. Response rate = true positives / total predicted positives.
Gains Chart or Response Chart. Proportion of responders among the top 10% (20%, ...) of people as ranked by the model. Lift chart: lift = response rate / baseline.
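The measures above can be computed from the four cells of a binary confusion matrix. In the sketch below the counts are invented, the function names are illustrative, and the baseline for lift is assumed to be the overall proportion of positives in the sample. Sensitivity and specificity are included as used in the in-patient example.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Summary measures from true/false positive and negative counts."""
    total = tp + fp + fn + tn
    return {
        "misclassification_rate": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),    # share of actual positives found
        "specificity": tn / (tn + fp),    # share of actual negatives found
        "response_rate": tp / (tp + fp),  # share of predicted positives that are real
    }


def lift(response_rate, baseline):
    """Lift = response rate / baseline response rate."""
    return response_rate / baseline


# Hypothetical counts for 200 cases, 60 of them actual positives.
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=130)
baseline = 60 / 200  # assumed baseline: overall proportion of positives
model_lift = lift(metrics["response_rate"], baseline)
```

A lift above 1 means the cases flagged by the model contain a higher share of positives than the sample as a whole.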

18 Predictive Power