An Overview and Evaluation of Decision Tree Methodology




An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com Richard Kittler Yield Dynamics, Inc. Santa Clara, CA rich@ydyn.com

An Overview and Evaluation of Decision Tree Methodology Outline Background What are decision trees? How are decision trees constructed? What are their goals and how are they assessed? Pros and Cons over Traditional methods Software Section Data Set Specifics used for the Evaluation SAS Enterprise Miner v 4.0 (NT) [sas.com] S-PLUS 2000 Professional Release 2 (NT) [insightful.com] Yield Dynamics Yield Mine v 2.0 (NT) [yielddynamics.com] A comparison of results and features 2

What are Decision Trees? A flow chart or diagram representing a classification system or a predictive model. The tree is structured as a sequence of simple questions. The answers to these questions trace a path down the tree. The end product is a collection of hierarchical rules that segment the data into groups, where a decision (classification or prediction) is made for each group. The hierarchy is called a tree, and each segment is called a node. The original segment contains the entire data set, referred to as the root node of the tree. A node with all of its successors forms a branch of the node that created it. The final nodes (terminal nodes) are called leaves. For each leaf, a decision is made and applied to all observations in the leaf. Classification Tree: Categorical Response / Target Regression Tree: Continuous Response / Target Predictor variables => Inputs 3

Example: Classification Tree. Root Node: All Data. Splits: X1 < 2.8? (yes branch: X2 < -10.0?; no branch: X3 < 0.003?). Tree Assessment (Proportion Correctly Classified): Leaf 1: 100%, Leaf 2: 74%, Leaf 3: 69%, Leaf 4: 100%, Overall: 82%. Terminal nodes with classifications: Leaf 1: MAX, Leaf 2: MED, Leaf 3: MED, Leaf 4: MIN. Rules: if X1 < 2.8 and X2 < -10 then class = MAX; if (X1 < 2.8 and X2 >= -10) or (X1 >= 2.8 and X3 < 0.003) then class = MED; if X1 >= 2.8 and X3 >= 0.003 then class = MIN. 4
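The hierarchical rules above translate directly into code: each root-to-leaf path becomes one branch of a nested conditional. A minimal sketch, using the thresholds and class labels from this example slide:

```python
def classify(x1, x2, x3):
    """Apply the example tree's rules to one observation.

    Each root-to-leaf path is one hierarchical rule; the split
    thresholds (2.8, -10.0, 0.003) come from the slide's example.
    """
    if x1 < 2.8:
        if x2 < -10.0:
            return "MAX"  # Leaf 1
        return "MED"      # Leaf 2
    if x3 < 0.003:
        return "MED"      # Leaf 3
    return "MIN"          # Leaf 4

print(classify(1.0, -20.0, 0.0))  # MAX
print(classify(5.0, 0.0, 0.01))   # MIN
```

Every observation falls into exactly one leaf, so the four rules partition the input space without overlap.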

How are Decision Trees Constructed? Basic Elements in Tree Construction: 1. Splitting each node in a tree - How to select each split so that descendant subsets are purer than the data in the parent subset? 2. Deciding when a tree is complete (two approaches) - How large to grow the tree? Which nodes to prune off the tree? 3. Assigning classes (or predicted values) to terminal nodes - Which model to use in the leaves? 5

Basic Elements in Tree Construction 1. Splitting each node in a tree (two steps): Generate a set of candidate splitting rules - Look at all possible splits for all variables included in the analysis - Example: a data set with 215 observations and 19 possible input variables would consider up to (215)(19) = 4085 possible splits Choose the best split from among the candidate set - Rank order each splitting rule on the basis of some quality-of-split criterion (purity function). The most frequently used are: - Entropy reduction (nominal / binary targets) - Gini index (nominal / binary targets) - Chi-square tests (nominal / binary targets) - F-test (interval targets) - Variance reduction (interval targets) 6
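For a binary split, the purity functions above all compare the parent node's impurity to the size-weighted impurity of the two children; the split with the largest reduction wins. A minimal pure-Python sketch of the Gini-index and entropy-reduction versions (illustrative, not any particular package's implementation):

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def impurity_reduction(parent, left, right, measure=gini):
    """Quality of a candidate split: parent impurity minus the
    size-weighted impurity of the two child nodes."""
    n = len(parent)
    return (measure(parent)
            - (len(left) / n) * measure(left)
            - (len(right) / n) * measure(right))

parent = ["HI"] * 4 + ["LOW"] * 4
left, right = ["HI"] * 4, ["LOW"] * 4   # a perfect split
print(impurity_reduction(parent, left, right))           # 0.5
print(impurity_reduction(parent, left, right, entropy))  # 1.0
```

Ranking all candidate splits by `impurity_reduction` and taking the maximum is exactly the "choose the best split" step on this slide.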

Basic Elements in Tree Construction 2. Deciding when a tree is complete (two approaches): Continue splitting nodes until some goodness-of-split criterion fails to be met. - When the quality of a particular split falls below the threshold, the tree is not grown further along that branch. - When all branches from the root reach terminal nodes, the tree is complete. Grow the tree too large and then prune nodes off. - After tree construction stops, create a sequence of subtrees from the original tree. - Choose one subtree for each possible number of leaves (the subtree chosen with p leaves has the best assessment value of all candidate subtrees with p leaves). - Once the sequence of subtrees is established, select which subtree to use according to some criterion, e.g. best assessment value. 7
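The grow-then-prune selection step can be sketched in a few lines. Here each candidate subtree is reduced to a (number of leaves, assessment value) pair with hypothetical numbers; a real pruner would carry the subtree itself alongside:

```python
def select_subtree(candidates):
    """Grow-then-prune selection, as on the slide: from all candidate
    subtrees, keep the best-assessed one for each possible number of
    leaves, then pick the overall best assessment value (here a
    validation classification rate, so higher is better).

    `candidates` is a list of (n_leaves, assessment) pairs.
    """
    best_per_size = {}
    for n_leaves, score in candidates:
        if n_leaves not in best_per_size or score > best_per_size[n_leaves]:
            best_per_size[n_leaves] = score
    # Final choice across the subtree sequence: best assessment value.
    return max(best_per_size.items(), key=lambda kv: kv[1])

# Hypothetical subtree sequence from one grown tree:
subtrees = [(2, 0.61), (3, 0.74), (3, 0.70), (4, 0.73), (6, 0.69)]
print(select_subtree(subtrees))  # (3, 0.74)
```

Using a validation assessment here, rather than the training rate, is what keeps the pruned tree from simply being the largest one.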

Basic Elements in Tree Construction 3. Assigning classes (or predictions) to terminal nodes: The plurality rule is one criterion: - For classification, the group with the greatest representation in the node determines the class assignment. - For prediction, the conditional mean within the node determines the predicted value. Other criteria modify simple plurality - to account for the costs of making a mistake in classification. - to adjust for over- or under-sampling from certain classes. 8
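The plurality rule and its cost-adjusted variant can be sketched as follows. The cost table values are hypothetical; `costs[j][k]` is the (assumed) cost of predicting class `k` when the truth is class `j`:

```python
from collections import Counter

def leaf_class(labels, costs=None):
    """Assign a class to a terminal node.

    Plurality rule: the most represented class wins. With a cost
    table (hypothetical values; costs[j][k] = cost of predicting k
    when the truth is j), pick the class with the lowest expected
    misclassification cost instead.
    """
    counts = Counter(labels)
    if costs is None:
        return counts.most_common(1)[0][0]
    n = len(labels)
    expected_cost = {
        k: sum(costs[j][k] * c / n for j, c in counts.items() if j != k)
        for k in costs
    }
    return min(expected_cost, key=expected_cost.get)

labels = ["HI"] * 6 + ["LOW"] * 4
print(leaf_class(labels))  # HI (simple plurality)
costs = {"HI": {"HI": 0, "LOW": 1},    # truth HI: predicting LOW costs 1
         "LOW": {"HI": 10, "LOW": 0}}  # truth LOW: predicting HI costs 10
print(leaf_class(labels, costs))  # LOW (cost-adjusted)
```

Making mistakes on LOW lots ten times as expensive flips the assignment even though HI holds the plurality, which is exactly the adjustment the slide describes.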

What are their goals and how are they assessed? Goal of Decision Trees: to obtain the most accurate prediction possible. Since these models are used not in isolation but in conjunction with subject matter experts, the most desirable trees are: 1. Accurate (low generalization error rates) 2. Parsimonious (representing and generalizing the relationships succinctly) 3. Non-trivial (producing interesting results) 4. Feasible (in time and resources) 5. Transparent and interpretable (providing high-level representations of, and insights into, the data relationships, regularities, or trends) Tree Assessment: estimates of the error or misclassification rate are derived from both a Training data set (used to train the model) and an independent Test / Validation data set (used for testing the model). - Resubstitution Estimate (internal estimate, biased) - Test Sample Estimate (independent estimate) - V-fold and N-fold Cross-Validation (resampling techniques) 9
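The V-fold cross-validation estimate mentioned above can be sketched generically: hold out each fold once, train on the rest, and pool the held-out errors. The fold assignment and the toy majority-class "trainer" below are illustrative stand-ins, not any package's implementation:

```python
from collections import Counter

def vfold_error(data, labels, train_fn, v=10):
    """V-fold cross-validation estimate of the misclassification rate.

    Each fold is held out once as a test sample for a model trained on
    the remaining v-1 folds; held-out errors are pooled over all folds,
    so every observation is tested exactly once.
    `train_fn(train_pairs)` must return a classifier function.
    """
    n = len(data)
    errors = 0
    for fold in range(v):
        test_idx = set(range(fold, n, v))  # simple deterministic folds
        train = [(data[i], labels[i]) for i in range(n) if i not in test_idx]
        model = train_fn(train)
        errors += sum(model(data[i]) != labels[i] for i in test_idx)
    return errors / n

def majority_trainer(train):
    """Toy stand-in for tree training: always predict the majority class."""
    cls = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: cls

data = list(range(10))
labels = ["A"] * 8 + ["B"] * 2
print(vfold_error(data, labels, majority_trainer, v=5))  # 0.2
```

Setting v = n gives the N-fold (leave-one-out) variant; unlike the resubstitution estimate, every error here is counted on data the model never saw.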

Pros and Cons Over Traditional Methods Pros: Ease of interpretation and graphical in nature Nonparametric based (does not require specification of functional form) Allow more general interactions and non-additive behavior among inputs Ability to learn step functions (discontinuities) Easier to interpret when predictors are both continuous and categorical Robust to the effects of outliers (for both input and target variables) Invariant to monotone transformations of input variables Can process inputs w/missing values (surrogate splits / alternative splits) Cons: Can become overly complex Based on univariate splits Require many branches to approximate linear functions accurately (unlike regression or neural nets) 10
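The last con can be made concrete: because each split is univariate and each leaf predicts a constant, a regression tree approximates even the simple linear function f(x) = x as a staircase, and halving the worst-case error requires doubling the number of leaves. A small sketch with hypothetical equal-width leaves:

```python
def step_approx(x, n_leaves):
    """Piecewise-constant approximation of f(x) = x on [0, 1), as a
    univariate regression tree with n_leaves equal-width leaves would
    produce, each leaf predicting its segment's midpoint."""
    seg = min(int(x * n_leaves), n_leaves - 1)
    return (seg + 0.5) / n_leaves

# Worst-case error is 1/(2 * n_leaves): doubling the leaves halves it.
for n in (2, 4, 8, 16):
    err = max(abs(step_approx(i / 1000, n) - i / 1000) for i in range(1000))
    print(n, err)
```

A regression or neural-net fit captures this relationship with a couple of parameters, which is why trees need many branches where the underlying structure is linear.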

Software: Data Set Description. Target response variable (YCLASS): - Lot level yield classification - 372 lots - Categorical response: HI, MED, LOW. Input (predictor) variables: - 70 electrical test parameters (e.g. P_221342) - Lot level median values - All continuous. Questions: Which combination of parameters is best at predicting lot level yield? What is the process space of this best combination? How good is this prediction? 11

SAS Enterprise Miner: GUI Interface. Windows shown: Project Diagram, Node Types, EM Workspace, and Available Projects. - Process Flow Diagrams (PFD) are built in the workspace. - These are specific to a particular data mining project. - This is the flow built for this decision tree analysis. 12

Enterprise Miner: Components of the Flow Diagram. Input Data Source: - Define data set to be analyzed. Data Set Attributes: - Define target and input variables. - Define target profile (priors and/or cost/profit functions). Data Partition: - Define and create training, validation, and test data sets. - Define sampling method used for partitioning. Tree: - Specify splitting criteria. - Specify tree characteristics (depth, leaf size, branches, etc.). - Specify model assessment criteria. - Run decision tree analysis, view results. Assessment: - Additional model assessment measures. - Used to compare models from different analysis types. Reporter: - Journaling function. - Results and documentation from each node stored in html format with links to data sets, results, graphics, etc. 13

Enterprise Miner: Summary of Decision Tree Analysis Results. Screens shown: Summary Table, Tree Ring, Assessment Table, Assessment Plot, and Default Model Fitting Criteria (this criterion set to 5). Note: YCLASS is recoded as YCLASS2: HI = 2, MED = 1, LOW = 0. 14

Enterprise Miner: Summary of Decision Tree Analysis Results. For categorical targets, the selected subtree has the best classification rate for the validation data and the fewest leaves. Color denotes node purity: mixed (yellow) to pure (red). Shown: proportion-correctly-classified table, training and validation data statistics for Leaves 3, 4, and 5, and the leaf statistics summary table. 15

Enterprise Miner: Summary of Decision Tree Analysis Results Performance Matrix for Training and Validation: Actual vs. Predicted class values Diagnostic Charts for Validation Data Predictor rankings with surrogate rules 16

Enterprise Miner: Analysis Summary from Reporter Node 17

Enterprise Miner: Summary of Features Pros: No scripting required Easy definition of input and target variables Easy definition of target profiles (use of priors and cost matrices) Option for automatic training and validation data set creation Supports both automatic and interactive training of the model Has various types of tree and model assessment graphics Automatic scripting option available Automatic updating of results for different trees Generation of surrogate rules Calculates overall model assessment values as well as leaf specific assessment values Journaling option available in web / html format (Reporter Node) Cons: Available only as part of SAS v.8 or higher Requires some knowledge of data acquisition and manipulation in SAS Classification Chart was incomplete for the predicted levels Does not label final classification in the tree Could not figure out how to get level of importance of predictors in the model 18

S-PLUS Tree Model - Model and Results Tabs. Note: Omit Rows with Missing Values does not appear to be optional; 358 complete records are present of the 372. Unable to find TrainMod. 19

S-PLUS Tree Model - Plot and Prune/Shrink Tabs. Note: validation data can be specified, but we were unable to find ValidateRes. 20

S-PLUS Tree Model - Predict Tab Note: Saved predicted values not easily merged with actual values 21

S-PLUS: Report Window 22

S-PLUS: Graphical Tree Window. Branch length options: uniformly sized, or proportional to node deviance. 23

S-PLUS: Relationship of Report to Tree
1) root 358 761.10 MED ( 0.33800 0.22350 0.4385 )
  2) P.264286<-0.075825 264 528.60 HI ( 0.43560 0.14020 0.4242 )
    4) P.221341<-11.9245 207 424.70 HI ( 0.47340 0.17390 0.3527 )
      8) P.244494<-0.19651 20 38.01 LOW ( 0.20000 0.60000 0.2000 ) *
      9) P.244494>-0.19651 187 365.40 HI ( 0.50270 0.12830 0.3690 )
        18) P.264286<-0.12115 109 195.10 HI ( 0.60550 0.10090 0.2936 ) *
        19) P.264286>-0.12115 78 159.10 MED ( 0.35900 0.16670 0.4744 ) *
    5) P.221341>-11.9245 57 78.82 MED ( 0.29820 0.01754 0.6842 ) *
  3) P.264286>-0.075825 94 166.60 MED ( 0.06383 0.45740 0.4787 )
    6) P.264292<-6.87435 70 123.30 MED ( 0.08571 0.31430 0.6000 ) *
    7) P.264292>-6.87435 24 18.08 LOW ( 0.00000 0.87500 0.1250 ) *
Each line shows: the rule for the branch, N obs in the branch, deviance (homogeneity), the branch class, and the proportions of HI's, LOW's, and MED's in the branch. Indentation reflects branch levels; * marks a terminal node. 24

S-PLUS: Model Assessment Graphics. Pruning Method = misclass; Pruning Method = deviance. Note: this graphic is based on no restriction on final tree size (number of terminal nodes). We believe the assessment is based on Training data (or possibly Validation?). 25

S-PLUS: Summary of Features Pros: No scripting required All relevant summary information succinctly provided in Report window Basic tree graphic with option for branch lengths proportional to split significance Model assessment graphics based on misclass or deviance Option for specifying validation data simultaneously with running training data Seems to have the ability to adjust for cost of misclassification (but not intuitive) Calculates overall model assessment value File Import Data feature is extensive; most file types recognized Cons: No dynamic linking between Report and Graphics (user-intensive interpretation) Journaling option for Report Window only Automatic scripting option not obvious Generation of surrogate rules not obvious Unable to find output/results for Validation data Unable to find saved Model Object Only uses the subset of the data with complete records for the independent variables Predicted values not easily merged with the data (Training or Validation) 26

Genesis Yield Mine: Setting up Yield Mine for YCLASS. [Histogram of YIELD_1 partitioned by YCLASS (HI, LOW, MED), with normal density overlay. Mean: 28.8665; Std Dev: 11.5291; Count: 372; Normal: No at signif. 5.00E-02 (Chi-Sq stat. = 32.5535 on 6 df; p-value = 1.2778E-5).] 27

Genesis Yield Mine: Yield Mine Results for YCLASS. Interactive options available at each node (datatype dependent). 28

Genesis Yield Mine. Graphical representation of topmost rule: distribution of variable P_264286 partitioned by YCLASS [Histogram Chart (Partition)]. Mean: -0.1118; Std Dev: 0.0566; Count: 366; Normal: Yes at signif. 5.00E-02 (Chi-Sq stat. = 12.8960 on 7 df; p-value = 0.0747). Box-Whisker chart showing the P_264286 variable selected as most significant [Box-Whisker Chart]: HI: Mean=-0.1345, Std. Dev.=0.0387, Count=124, Range=0.1920; LOW: Mean=-0.0844, Std. Dev.=0.0772, Count=82, Range=0.4025; MED: Mean=-0.1084, Std. Dev.=0.0486, Count=160, Range=0.2425. 29

Genesis Yield Mine Option to override variable selection by double-clicking on a node 30

Yield Mine: Summary of Features Pros: Easily set up and run by a novice user Integrated within a full interactive data analysis environment No scripting required Can be run interactively or in batch mode Tree display is an interactive graphic Supports linear models at nodes Ability to work with missing data Option to select a surrogate variable by double-clicking on a node Options to view explanatory graphics or run linear models by double-clicking on a node Option to create a subset of all observations used at a node Ability to save a model and apply it to classify future datasets Ability to control depth of tree from setup panel Some ability to look ahead to the next layer and overcome greediness Tuned for use in semiconductor manufacturing Cons: Few adjustable parameters available for access by a statistician No option for automatic training set creation and use No option to specify the minimum number of observations in a terminal node Data repair and outlier screening routines must be run manually beforehand Available only as part of the Genesis Yield Management software system 31

A Comparison of Models

                        Enterprise Miner   Yield Mine   S-PLUS
Final # of predictors   2                  4            4
Final # of leaves       3                  5            6
Overall Assessment      0.56 / 0.42        0.57 / ?     0.61 / ?

Combination of parameters for predicting lot level yield class:
                        P264292            P264286      P264286
                        P221341            P221340      P221341
                                           P244494      P244494
                                           P264292      P264292

Note: P264292 and P264286 are highly correlated; P221341 and P221340 are highly correlated. 32

A Comparison of Process Spaces. [Figures: Enterprise Miner, Yield Mine, and S-PLUS process spaces, each partitioned into HI / MED / LOW regions.] 33

Summary of Features

                                       Enterprise Miner   S-PLUS      Yield Mine
General:
  Tree Graphics                        excellent          excellent   excellent
  Model Assessment graphics            excellent          good        good
  Journaling options                   excellent          fair        excellent
  GUI Interface (user friendliness)    excellent          excellent   excellent
  On-Line Help                         good               good        good
  Good balance between ease of use
    and knowledge of methodology       excellent          excellent   excellent
  Scripting options                    yes                yes         yes
  'On-the-fly' graphics                fair               fair        excellent
  Licensing Cost                       $$$$$$             $           $$$$$$
Specifics of Model:
  Flexibility of model criteria        excellent          good        fair
  Ranking of model predictors          fair               good        good
  Surrogate Rules                      yes                no?         yes
  Robustness to Missing Data           good               poor        good

34

Summary Decision tree methods are now well proven as a form of data mining. The techniques provide an intuitive and efficient method for discovering relationships across a broad range of data sets. The software implementations in SAS Enterprise Miner, S-PLUS, and Genesis Yield Mine were tested on a sample data set from IC manufacturing and produced comparable results. The software evaluations were an initial look at the potential each product offers in decision tree methodology. The strengths and weaknesses of each package should be weighted according to the audience (engineer or statistician) and the problem domain (semiconductor or other). These methods can be a valuable tool in one's arsenal of data analysis techniques, and we encourage you to consider their use in your practice. 35

References Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J., 1984, Classification and Regression Trees, Pacific Grove, CA: Wadsworth. Kittler, R., and Wang, W., 2000, Data Mining for Yield Improvements, Proceedings of MASM 2000. Salford Systems White Paper Series, 2001, An Overview of CART Methodology, http://www.salford-systems.com/whitepaper.htm SAS Institute Inc., 1999, SAS Institute White Paper: Finding the Solution to Data Mining, Cary, NC: SAS Institute Inc. SAS Institute Inc., 2000, Enterprise Miner Version 4.0 On-line Reference Help, Cary, NC: SAS Institute Inc. S-PLUS, 2000, Guide to Statistics, Vol. 1. StatSoft, 2000, The Statistics Homepage, http://www.statsoft.com/textbook/stathome.html 36