Data Mining Techniques Chapter 6: Decision Trees




Contents:
- What is a classification decision tree?
- Visualizing decision trees
- Classification tree example
- Regression trees
- How to grow a decision tree
- Finding the splits
- Growing/pruning the tree
- Classification purity: gini coefficient
- Classification purity: entropy reduction
- Classification purity: chi-square
- Regression purity: variance reduction
- Regression purity: F-test
- Pruning
- Extracting rules
- Further refinements

What is a classification decision tree?

A structure used to divide a collection of records into groups using a sequence of simple decision rules, e.g., the classification of all living things. The decision rules aim to correctly classify a categorical target variable: for a catalog company, the target variable might be "place an order" = 0 (no) or 1 (yes). Rules are based on input variables: e.g., if recency < 6 months then predict order = 1, otherwise predict order = 0. Rules are nested (imagine a branching tree structure): e.g., if recency < 6 months AND frequency > 3 purchases per year then predict order = 1, otherwise predict order = 0.

Visualizing decision trees

Start at the root node (at the top!). The root branches into 2 (or more) child nodes; apply a rule to decide which one to go to. Child nodes also branch based on further rules. Keep branching until no more splits can be made (either too few objects remain at a node, or further splits would not improve classification accuracy). The proportion of 1s (the score) in a terminal node (leaf) suggests the likely category: e.g., if the proportion of 1s > 50%, predict target = 1; otherwise predict target = 0.

© Iain Pardoe, 2006
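The nested catalog rules above can be sketched as a plain function. The thresholds (6 months, more than 3 purchases per year) are the slide's illustrative values; the function name is hypothetical.

```python
def predict_order(recency_months, purchases_per_year):
    """Apply the slide's nested decision rules for the catalog example."""
    if recency_months < 6:           # root split: recency
        if purchases_per_year > 3:   # child split: frequency
            return 1                 # predict: places an order
    return 0                         # otherwise: predict no order
```

Each `if` corresponds to one branch of the tree; records failing either test fall through to the 0 leaf.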

Classification tree example

[Slide 4 shows an example classification tree.]

Regression trees

Decision trees can also be used for prediction problems with a quantitative target variable: e.g., target = amount spent. Such trees are called regression trees. The average value of the target variable in a terminal node (leaf) is used as the prediction for objects in that node.

How to grow a decision tree

Classification trees: choose rules to make the purity of the child nodes as high as possible (i.e., child nodes as close to 0% or 100% in the target variable categories as possible). Regression trees: choose rules to make the variance of the child nodes as small as possible.
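A minimal sketch of the regression-tree prediction rule above: each leaf stores the mean target value of its training records, and a new record falling into that leaf is predicted with that mean. The leaf labels and spend amounts below are invented for illustration.

```python
from statistics import mean

# Hypothetical training targets (amount spent) grouped by leaf.
leaf_targets = {
    "recency < 6": [120.0, 80.0, 100.0],
    "recency >= 6": [20.0, 0.0, 10.0],
}

def leaf_prediction(leaf):
    """Predict with the average target value of the leaf's training records."""
    return mean(leaf_targets[leaf])
```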

Finding the splits

Decide which input variable makes the best split (highest purity / lowest variance in the child nodes). Splitting criteria include: for classification, the gini coefficient, entropy reduction (information gain), and chi-square; for regression, variance reduction and the F-test. Splitting on a quantitative input: if X1 < k1 go left, if X1 ≥ k1 go right; this is not sensitive to outliers or skewed data. Splitting on a qualitative input: if X2 ∈ {A, B, C} go left, if X2 ∈ {D, E} go right. There is no problem with missing data ("missing" is its own category).

Growing/pruning the tree

Input variables can be used multiple times at different nodes to fine-tune classifications/predictions. The tree is complete (full) when no more splits are possible: each leaf is completely pure / has zero variance. However, while this fits the training sample perfectly, it won't work well on the validation sample: overfitting! Use the validation sample to prune the tree (cut off lower branches) using fit measures: leaf error rates (proportion of misclassifications, or RMSE); leaf lift (proportion of responders in the leaf / proportion of responders in the population). Use the test sample to assess the fit of the final selected model.
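The leaf-lift fit measure above can be written out directly; the argument names are illustrative.

```python
def leaf_lift(leaf_responders, leaf_size, pop_responders, pop_size):
    """Lift = (proportion of responders in the leaf)
           / (proportion of responders in the population)."""
    return (leaf_responders / leaf_size) / (pop_responders / pop_size)
```

A leaf with 8 responders out of 10 records, in a population with 20 responders out of 100, has lift 0.8 / 0.2 = 4: that leaf concentrates responders four times better than random.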

Classification purity: gini coefficient

The gini coefficient (a.k.a. population diversity) is the sum of squares of the category proportions, e.g.:
- 50%/50%: g = 0.5² + 0.5² = 0.5 (lowest purity);
- 90%/10%: g = 0.9² + 0.1² = 0.82;
- 100%/0%: g = 1² + 0² = 1.0 (highest purity).
Total impact of a split = (proportion reaching node 1) × g1 + (proportion reaching node 2) × g2 + ... Select the split that results in the largest increase in this weighted average. Examples: the class 9 example on credit risk; homework 5 question 6 (based on the book example, pp. 181-2).

Classification purity: entropy reduction

Entropy (a measure of "chaos") is calculated as follows:
- 50%/50%: e = −[0.5 log2(0.5) + 0.5 log2(0.5)] = 1;
- 90%/10%: e = −[0.9 log2(0.9) + 0.1 log2(0.1)] = 0.47;
- 100%/0%: e = −[1 log2(1) + 0 log2(0)] = 0.
Total impact of a split = (proportion reaching node 1) × e1 + (proportion reaching node 2) × e2 + ... Select the split that results in the largest reduction in this weighted average (entropy loss = information gain). Examples: the class 9 example on credit risk; homework 5 question 6 (based on the book example, pp. 181-2). Refinement: the information gain ratio prevents input variables with many categories from leading to bushy trees.
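The two purity measures and the weighted split impact above can be sketched in plain Python, following the slide's definitions (gini as a sum of squared proportions; 0 × log2(0) taken as 0 in the entropy):

```python
from math import log2

def gini(proportions):
    """Sum of squared category proportions (1.0 = pure, 0.5 = 50/50)."""
    return sum(p ** 2 for p in proportions)

def entropy(proportions):
    """-sum(p * log2(p)), skipping zero proportions (0 * log2(0) -> 0)."""
    return -sum(p * log2(p) for p in proportions if p > 0)

def split_impact(reach_proportions, purities):
    """Weighted average purity over child nodes:
    sum over i of (proportion reaching node i) * (purity of node i)."""
    return sum(w * s for w, s in zip(reach_proportions, purities))
```

`gini([0.9, 0.1])` reproduces the slide's 0.82, and `entropy([0.9, 0.1])` its 0.47 (after rounding).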

Classification purity: chi-square

Based on the chi-square test from chapter 5. Class 9 example on credit risk, history split:

                 observed            expected       difference
    loan      bad  good  total     bad    good      bad   good
    good hist   2    14     16     7.5     8.5      5.5    5.5
    bad hist   13     3     16     7.5     8.5      5.5    5.5
    total      15    17     32      15      17

Test statistic: χ² = 5.5²/7.5 + 5.5²/8.5 + 5.5²/7.5 + 5.5²/8.5 = 15.2. The differences are significant, since the p-value = 0.0001 (Excel: =CHIDIST(15.2,1); degrees of freedom = (r − 1)(c − 1)). The credit card split has χ² = 0.536, p-value = 0.464. Prefer the history split, since its differences are more significant.

Regression purity: variance reduction

Designed for quantitative target variables, but also works for a qualitative target with 2 categories. Class 9 example on credit risk, history split:
- Parent (root) node has 15 bad loans (target = 1) and 17 good loans (target = 2): mean = (15/32)(1) + (17/32)(2) = 1.53125; variance = [15(−0.53125)² + 17(0.46875)²]/32 = 0.249.
- History split, first child node, 2 bad, 14 good: mean = (2/16)(1) + (14/16)(2) = 1.875; variance = [2(−0.875)² + 14(0.125)²]/16 = 0.109375.
- History split, second child node, 13 bad, 3 good: mean = (13/16)(1) + (3/16)(2) = 1.1875; variance = [13(−0.1875)² + 3(0.8125)²]/16 = 0.152344.
Reduction in variance = 0.249 − [0.5(0.109375) + 0.5(0.152344)] = 0.118. Better than the credit card split, which gives a 0.004 reduction.
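The chi-square statistic and the variance reduction worked above can be checked with a short sketch; the table orientation (rows = split branches, columns = target classes) and the population-variance convention (divide by n) follow the slide.

```python
def chi_square_2x2(observed):
    """Pearson chi-square statistic for a 2x2 table of observed counts."""
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    n = sum(row)
    return sum((observed[i][j] - row[i] * col[j] / n) ** 2
               / (row[i] * col[j] / n)
               for i in range(2) for j in range(2))

def variance_reduction(parent, children):
    """Parent variance minus the size-weighted average of child variances."""
    def var(values):  # population variance, dividing by n as on the slide
        m = sum(values) / len(values)
        return sum((x - m) ** 2 for x in values) / len(values)
    n = len(parent)
    return var(parent) - sum(len(c) / n * var(c) for c in children)
```

With the history-split table [[2, 14], [13, 3]] this gives χ² ≈ 15.2, and the 1/2-coded parent and children give a variance reduction of about 0.118, matching the slide.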

Regression purity: F-test

Designed for quantitative target variables, but also works for a qualitative target with 2 categories. Class 9 example on credit risk, history split:
- Parent: 15/17 bad/good (coded 1/2), mean = 1.53125.
- History split: first child 2/14, mean = 1.875; second child 13/3, mean = 1.1875.
- Between mean square error = [16(0.34375)² + 16(−0.34375)²]/(2 − 1) = 3.78125.
- Within mean square error = [2(−0.875)² + 14(0.125)² + 13(−0.1875)² + 3(0.8125)²]/(32 − 2) = 0.139583.
- F = 3.78125/0.139583 = 27.09, p-value = 0.00001 (Excel: =FDIST(27.09,1,30)).
Better than the credit card split, with F = 0.133/0.261 = 0.511, p-value = 0.480.

Pruning

Remove lower branches to prevent overfitting.
- XLMiner uses the overall error rate in the validation sample: the minimum error tree; the best pruned tree (the smallest tree within one standard error of the minimum).
- CART adjusts the error rate in the training sample to penalize trees with too many branches (cf. adjusted R²), then assesses the results with the validation sample.
- C5 calculates a confidence interval for the true error rate, and uses the high end of the interval as an estimate.
- Stability-based pruning uses the validation sample directly to prune unstable branches.
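The F-test computation above (between-group mean square over within-group mean square) can be sketched as follows; each group is the list of 1/2-coded target values reaching one child node.

```python
def f_statistic(groups):
    """One-way F statistic: between-group MS / within-group MS."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    # Between MS: group sizes times squared deviations of group means
    # from the grand mean, over (number of groups - 1) df.
    between = sum(len(g) * (m - grand_mean) ** 2
                  for g, m in zip(groups, means)) / (len(groups) - 1)
    # Within MS: squared deviations from each group's own mean,
    # over (n - number of groups) df.
    within = sum((v - m) ** 2
                 for g, m in zip(groups, means)
                 for v in g) / (len(values) - len(groups))
    return between / within
```

For the history split (children with 2 bad/14 good and 13 bad/3 good) this reproduces the slide's F = 27.09.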

Extracting rules

How many distinct rules are there for this tree? [Slide 15 shows an example tree; each path from root to leaf is one rule.]

Further refinements

- Taking asymmetric costs into account.
- Using more than one input variable at a time.
- Allowing linear combinations of quantitative input variables.
- Neural network trees.
- Piecewise regression using trees.
- Alternate tree representations: box diagrams; tree ring diagrams.

Decision trees in practice: as a data exploration tool (picking important input variables); application to sequential events; simulating the future.