An Overview and Evaluation of Decision Tree Methodology
ASA Quality and Productivity Conference
Terri Moore, Motorola, Austin, TX (terri.moore@motorola.com)
Carole Jesse, Cargill, Inc., Wayzata, MN (carole_jesse@cargill.com)
Richard Kittler, Yield Dynamics, Inc., Santa Clara, CA (rich@ydyn.com)
Outline
Background
- What are decision trees?
- How are decision trees constructed?
- What are their goals and how are they assessed?
- Pros and cons over traditional methods
Software Section
- Data set specifics used for the evaluation
- SAS Enterprise Miner v 4.0 (NT) [sas.com]
- S-PLUS 2000 Professional Release 2 (NT) [insightful.com]
- Yield Dynamics Yield Mine v 2.0 (NT) [yielddynamics.com]
- A comparison of results and features
What are Decision Trees?
A flow chart or diagram representing a classification system or a predictive model. The tree is structured as a sequence of simple questions, and the answers to these questions trace a path down the tree. The end product is a collection of hierarchical rules that segment the data into groups, where a decision (classification or prediction) is made for each group.
- The hierarchy is called a tree, and each segment is called a node.
- The original segment contains the entire data set and is referred to as the root node of the tree.
- A node together with all of its successors forms a branch of the node that created it.
- The final nodes (terminal nodes) are called leaves. For each leaf, a decision is made and applied to all observations in the leaf.
Classification tree: categorical response / target
Regression tree: continuous response / target
Predictor variables => inputs
Example: Classification Tree
[Diagram: root node (all data) split by X1 < 2.8; the left branch splits on X2 < -10.0 and the right branch on X3 < 0.003, giving four terminal nodes]
Terminal nodes with classifications: Leaf 1: MAX, Leaf 2: MED, Leaf 3: MED, Leaf 4: MIN
Tree assessment (proportion correctly classified): Leaf 1: 100%, Leaf 2: 74%, Leaf 3: 69%, Leaf 4: 100%; Overall: 82%
Rules:
- if X1 < 2.8 and X2 < -10.0 then class = MAX
- if (X1 < 2.8 and X2 >= -10.0) or (X1 >= 2.8 and X3 < 0.003) then class = MED
- if X1 >= 2.8 and X3 >= 0.003 then class = MIN
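The rules above can be expressed directly as nested conditionals. A minimal Python sketch of the example tree (the function name `classify` is ours, not from any package):

```python
def classify(x1, x2, x3):
    """Trace a path down the example tree: the answer to each simple
    question selects the next branch until a leaf is reached."""
    if x1 < 2.8:
        # Left branch: split on X2
        return "MAX" if x2 < -10.0 else "MED"   # leaves 1 and 2
    else:
        # Right branch: split on X3
        return "MED" if x3 < 0.003 else "MIN"   # leaves 3 and 4

# Every observation falls into exactly one leaf:
print(classify(1.0, -15.0, 0.0))  # -> MAX (leaf 1)
```

Note that the two split questions on the right branch are never asked of observations that went left; that hierarchy is what makes the rule set easy to read.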
How are Decision Trees Constructed?
Basic elements in tree construction:
1. Splitting each node in a tree
- How to select each split so that descendant subsets are purer than the data in the parent subset?
2. Deciding when a tree is complete (two approaches)
- How large to grow the tree? Which nodes to prune off the tree?
3. Assigning classes (or predicted values) to terminal nodes
- Which model to use in the leaves?
Basic Elements in Tree Construction
1. Splitting each node in a tree (two steps):
Generate a set of candidate splitting rules
- Look at all possible splits for all variables included in the analysis
- Example: a data set with 215 observations and 19 possible input variables would consider up to (215)(19) = 4085 possible splits
Choose the best split from among the candidate set
- Rank order each splitting rule on the basis of some quality-of-split criterion (purity function). The most frequently used ones are:
  - Entropy reduction (nominal / binary targets)
  - Gini index (nominal / binary targets)
  - Chi-square tests (nominal / binary targets)
  - F-test (interval targets)
  - Variance reduction (interval targets)
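The Gini and entropy criteria named above can be sketched in a few lines of Python. This is a generic illustration of the purity-function idea, not any package's implementation:

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini index: 1 - sum(p_k^2). Zero for a pure node."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum(p_k * log2(p_k)). Zero for a pure node."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_quality(values, labels, threshold, impurity=gini):
    """Quality of the candidate split `value < threshold`:
    parent impurity minus the size-weighted impurity of the children."""
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    n = len(labels)
    return (impurity(labels)
            - (len(left) / n) * impurity(left)
            - (len(right) / n) * impurity(right))

# A perfect split recovers all of the parent's impurity:
print(split_quality([1, 2, 3, 4], ["HI", "HI", "LOW", "LOW"], 2.5))  # -> 0.5
```

Ranking the candidate splits then amounts to evaluating `split_quality` for every variable and threshold and taking the maximum.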
Basic Elements in Tree Construction
2. Deciding when a tree is complete (two approaches):
Continue splitting nodes until some goodness-of-split criterion fails to be met.
- When the quality of a particular split falls below the threshold, the tree is not grown further along that branch.
- When all branches from the root reach terminal nodes, the tree is complete.
Grow the tree too large and then prune nodes off.
- After tree construction stops, create a sequence of subtrees from the original tree.
- Choose one subtree for each possible number of leaves (the subtree chosen with p leaves has the best assessment value of all candidate subtrees with p leaves).
- Once the sequence of subtrees is established, select which subtree to use according to some criterion (e.g., the best assessment value).
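Once each candidate subtree in the sequence has been scored, the grow-then-prune approach reduces to a simple selection. A sketch, with candidates given as hypothetical (number-of-leaves, assessment-value) pairs:

```python
def select_subtree(candidates):
    """Pick the subtree with the best assessment value, breaking
    ties toward fewer leaves (the more parsimonious tree)."""
    return max(candidates, key=lambda c: (c[1], -c[0]))

# One best candidate per possible leaf count (hypothetical values):
subtrees = [(2, 0.70), (3, 0.78), (5, 0.78), (8, 0.74)]
print(select_subtree(subtrees))  # -> (3, 0.78)
```

The tie-break matters: growing from 3 to 5 leaves here buys no assessment improvement, so the smaller subtree is preferred.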
Basic Elements in Tree Construction
3. Assigning classes (or predictions) to terminal nodes:
The plurality rule is one criterion:
- For classification, the group with the greatest representation determines the class assignment.
- For prediction, the conditional mean within the node determines the predicted value.
Other criteria modify simple plurality
- to account for the costs of making a mistake in classification, or
- to adjust for over- or under-sampling from certain classes.
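The plurality rule and its cost-adjusted variant can be sketched as follows (the cost-matrix values below are hypothetical, chosen only to show the mechanism):

```python
from collections import Counter

def assign_class(leaf_labels, costs=None):
    """Plurality rule: the class with the greatest representation wins.
    With a cost matrix costs[predicted][actual], instead pick the class
    that minimizes the expected misclassification cost in the leaf."""
    counts = Counter(leaf_labels)
    if costs is None:
        return counts.most_common(1)[0][0]
    n = len(leaf_labels)
    def expected_cost(pred):
        return sum(costs[pred].get(actual, 0) * c / n
                   for actual, c in counts.items())
    return min(costs, key=expected_cost)

leaf = ["MED", "MED", "LOW"]
print(assign_class(leaf))  # -> MED (simple plurality)
# Make calling a LOW lot MED ten times as costly as the reverse mistake:
costs = {"MED": {"MED": 0, "LOW": 10}, "LOW": {"MED": 1, "LOW": 0}}
print(assign_class(leaf, costs))  # -> LOW
```

The same reweighting idea handles over- or under-sampled classes: scale each class count by its prior before taking the plurality.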
What are their goals and how are they assessed?
Goal of decision trees: to obtain the most accurate prediction possible. Since these models are used not in isolation but in conjunction with subject matter experts, the most desirable trees are:
1. Accurate (low generalization error rates)
2. Parsimonious (representing and generalizing the relationships succinctly)
3. Non-trivial (producing interesting results)
4. Feasible (time and resources)
5. Transparent and interpretable (providing high-level representations of and insights into the data relationships, regularities, or trends)
Tree assessment: the error or misclassification rate is estimated from both a training data set (used to train the model) and an independent test / validation data set (used for testing the model).
- Resubstitution estimate (internal estimate, biased)
- Test sample estimate (independent estimate)
- V-fold and N-fold cross-validation (resampling techniques)
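The V-fold cross-validation estimate from the list above can be sketched generically; `fit` stands for any training routine returning a classifier (a minimal illustration, not any package's API):

```python
def vfold_error(data, labels, fit, v=5):
    """V-fold cross-validation: hold out each fold in turn, train on
    the remaining folds, and average the misclassification rates."""
    folds = [list(range(i, len(data), v)) for i in range(v)]
    rates = []
    for hold in folds:
        train = [i for i in range(len(data)) if i not in hold]
        model = fit([data[i] for i in train], [labels[i] for i in train])
        wrong = sum(model(data[i]) != labels[i] for i in hold)
        rates.append(wrong / len(hold))
    return sum(rates) / len(rates)

# Toy "model": always predict the majority class of the training labels.
from collections import Counter
majority = lambda X, y: (lambda x, c=Counter(y).most_common(1)[0][0]: c)
print(vfold_error(list(range(10)), ["A"] * 10, majority))  # -> 0.0
```

Setting v equal to the number of observations gives the N-fold (leave-one-out) variant; the resubstitution estimate corresponds to scoring the same data the model was trained on, which is why it is biased.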
Pros and Cons Over Traditional Methods
Pros:
- Ease of interpretation; graphical in nature
- Nonparametric (does not require specification of a functional form)
- Allow more general interactions and non-additive behavior among inputs
- Ability to learn step functions (discontinuities)
- Easier to interpret when predictors are both continuous and categorical
- Robust to the effects of outliers (for both input and target variables)
- Invariant to monotone transformations of input variables
- Can process inputs with missing values (surrogate / alternative splits)
Cons:
- Can become overly complex
- Based on univariate splits
- Require many branches to approximate linear functions accurately (unlike regression or neural nets)
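The monotone-invariance claim in the list above is easy to demonstrate: an exhaustive split search partitions the data identically before and after a log transform, because splits depend only on the ordering of the values. A self-contained sketch (not any package's code):

```python
import math

def best_partition(values, labels):
    """Exhaustive split search: try every distinct value as a threshold
    and return the set of indices sent left by the split with the
    lowest size-weighted Gini impurity."""
    def gini(ys):
        if not ys:
            return 0.0
        return 1.0 - sum((ys.count(c) / len(ys)) ** 2 for c in set(ys))
    best_score, best_left = float("inf"), None
    for t in sorted(set(values))[1:]:
        left = frozenset(i for i, v in enumerate(values) if v < t)
        ys_l = [labels[i] for i in left]
        ys_r = [labels[i] for i in range(len(values)) if i not in left]
        score = len(ys_l) * gini(ys_l) + len(ys_r) * gini(ys_r)
        if score < best_score:
            best_score, best_left = score, left
    return best_left

vals = [0.5, 1.0, 3.0, 9.0]
labs = ["A", "A", "B", "B"]
# Same partition whether we split on raw or log-transformed values:
print(best_partition(vals, labs) ==
      best_partition([math.log(v) for v in vals], labs))  # -> True
```

By contrast, a linear regression fit on `vals` and on `log(vals)` gives different models, which is why this invariance is listed as an advantage over traditional methods.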
Software: Data Set Description
Target response variable (YCLASS):
- Lot-level yield classification
- 372 lots
- Categorical response: HI, MED, LOW
Input (predictor) variables:
- 70 electrical test parameters (e.g., P_221342)
- Lot-level median values
- All continuous
Questions:
- Which combination of parameters is best at predicting lot-level yield?
- What is the process space of this best combination?
- How good is this prediction?
SAS Enterprise Miner: GUI Interface
[Screenshot: Available Projects window, Project Diagram, Node Types window, and EM Workspace window]
- Process Flow Diagrams (PFD) are built in the workspace.
- These are specific to a particular data mining project.
- This is the flow built for this decision tree analysis.
Enterprise Miner: Components of the Flow Diagram
Input Data Source:
- Define the data set to be analyzed
Data Set Attributes:
- Define target and input variables
- Define the target profile (priors and/or cost/profit functions)
Data Partition:
- Define and create training, validation, and test data sets
- Define the sampling method used for partitioning
Tree:
- Specify splitting criteria
- Specify tree characteristics (depth, leaf size, branches, etc.)
- Specify model assessment criteria
- Run the decision tree analysis and view results
Assessment:
- Additional model assessment measures
- Used to compare models from different analysis types
Reporter:
- Journaling function
- Results and documentation from each node stored in HTML format with links to data sets, results, graphics, etc.
Enterprise Miner: Summary of Decision Tree Analysis Results
[Screenshot: summary table, tree ring, assessment table, and assessment plot, with the default model-fitting criteria]
Note: the target YCLASS was recoded as YCLASS2 (HI = 2, MED = 1, LOW = 0).
Enterprise Miner: Summary of Decision Tree Analysis Results
For categorical targets, the selected subtree has the best classification rate for the validation data and the fewest leaves.
[Screenshot: tree diagram with color denoting node purity, from mixed (yellow) to pure (red); proportion-correctly-classified table; training and validation data statistics for leaves 3-5; leaf statistics summary table]
Enterprise Miner: Summary of Decision Tree Analysis Results
[Screenshots: performance matrix (actual vs. predicted class values) for training and validation; diagnostic charts for the validation data; predictor rankings with surrogate rules]
Enterprise Miner: Analysis Summary from Reporter Node
[Screenshot of the Reporter node output]
Enterprise Miner: Summary of Features
Pros:
- No scripting required
- Easy definition of input and target variables
- Easy definition of target profiles (use of priors and cost matrices)
- Option for automatic training and validation data set creation
- Supports both automatic and interactive training of the model
- Various types of tree and model assessment graphics
- Automatic scripting option available
- Automatic updating of results for different trees
- Generation of surrogate rules
- Calculates overall model assessment values as well as leaf-specific assessment values
- Journaling option available in web / HTML format (Reporter node)
Cons:
- Available only as part of SAS v8 or higher
- Requires some knowledge of data acquisition and manipulation in SAS
- Classification chart was incomplete for the predicted levels
- Does not label the final classification in the tree
- Could not figure out how to get the level of importance of predictors in the model
S-PLUS Tree Model: Model and Results Tabs
[Screenshot of the Model and Results tabs]
Notes:
- "Omit Rows with Missing Values" does not appear to be optional; only 358 of the 372 records are complete.
- Unable to find the saved model object TrainMod.
S-PLUS Tree Model: Plot and Prune/Shrink Tabs
[Screenshot of the Plot and Prune/Shrink tabs]
Notes:
- Validation data can be specified.
- Unable to find the validation results object ValidateRes.
S-PLUS Tree Model: Predict Tab
[Screenshot of the Predict tab]
Note: saved predicted values are not easily merged with the actual values.
S-PLUS: Report Window
[Screenshot of the Report window]
S-PLUS: Graphical Tree Window
[Screenshots: the tree drawn with uniformly sized branches and with branch lengths proportional to node deviance]
S-PLUS: Relationship of Report to Tree
The Report window prints one line per node: the rule for the branch, the number of observations, the deviance (a homogeneity measure), the branch class, and the proportions of HIs, LOWs, and MEDs in the branch. Indentation reflects branch levels; * marks a terminal node.

 1) root 358 761.10 MED ( 0.33800 0.22350 0.4385 )
   2) P.264286<-0.075825 264 528.60 HI ( 0.43560 0.14020 0.4242 )
     4) P.221341<-11.9245 207 424.70 HI ( 0.47340 0.17390 0.3527 )
       8) P.244494<-0.19651 20 38.01 LOW ( 0.20000 0.60000 0.2000 ) *
       9) P.244494>-0.19651 187 365.40 HI ( 0.50270 0.12830 0.3690 )
        18) P.264286<-0.12115 109 195.10 HI ( 0.60550 0.10090 0.2936 ) *
        19) P.264286>-0.12115 78 159.10 MED ( 0.35900 0.16670 0.4744 ) *
     5) P.221341>-11.9245 57 78.82 MED ( 0.29820 0.01754 0.6842 ) *
   3) P.264286>-0.075825 94 166.60 MED ( 0.06383 0.45740 0.4787 )
     6) P.264292<-6.87435 70 123.30 MED ( 0.08571 0.31430 0.6000 ) *
     7) P.264292>-6.87435 24 18.08 LOW ( 0.00000 0.87500 0.1250 ) *
S-PLUS: Model Assessment Graphics
[Plots: assessment vs. tree size for pruning method = misclass and pruning method = deviance]
Note: these graphics are based on no restriction on the final tree size (number of terminal nodes). We believe the assessment is based on the training data (or possibly the validation data).
S-PLUS: Summary of Features
Pros:
- No scripting required
- All relevant summary information succinctly provided in the Report window
- Basic tree graphic, with an option for branch lengths proportional to split significance
- Model assessment graphics based on misclassification or deviance
- Option for specifying validation data at the same time the training data are run
- Seems to have the ability to adjust for the cost of misclassification (but not intuitive)
- Calculates an overall model assessment value
- File import feature is extensive; most file types are recognized
Cons:
- No dynamic linking between Report and Graphics (user-intensive interpretation)
- Journaling option for the Report window only
- Automatic scripting option not obvious
- Generation of surrogate rules not obvious
- Unable to find output/results for the validation data
- Unable to find the saved model object
- Uses only the subset of the data with complete records for the independent variables
- Predicted values not easily merged with the data (training or validation)
Genesis Yield Mine: Setting up Yield Mine for YCLASS
[Histogram of YIELD_1 partitioned by YCLASS (HI, LOW, MED), with a normal density overlay. Mean = 28.8665, Std Dev = 11.5291, Count = 372; normality rejected at the 0.05 level (Chi-Sq = 32.5535 on 6 df, p = 1.2778E-5)]
Genesis Yield Mine: Results for YCLASS
[Screenshot: interactive tree display]
Interactive options are available at each node (data-type dependent).
Genesis Yield Mine: Graphical representation of the topmost rule
[Histogram of P_264286 partitioned by YCLASS. Mean = -0.1118, Std Dev = 0.0566, Count = 366; normality not rejected at the 0.05 level (Chi-Sq = 12.8960 on 7 df, p = 0.0747)]
[Box-whisker chart of P_264286 by YCLASS, showing that P_264286 was selected as most significant. HI: mean = -0.1345, std dev = 0.0387, count = 124, range = 0.1920. LOW: mean = -0.0844, std dev = 0.0772, count = 82, range = 0.4025. MED: mean = -0.1084, std dev = 0.0486, count = 160, range = 0.2425]
Genesis Yield Mine
[Screenshot: variable selection override dialog]
Option to override the variable selection by double-clicking on a node.
Yield Mine: Summary of Features
Pros:
- Easily set up and run by a novice user
- Integrated within a full interactive data analysis environment
- No scripting required
- Can be run interactively or in batch mode
- Tree display is an interactive graphic
- Supports linear models at nodes
- Ability to work with missing data
- Option to select a surrogate variable by double-clicking on a node
- Options to view explanatory graphics or run linear models by double-clicking on a node
- Option to create a subset of all observations used at a node
- Ability to save a model and apply it to classify future data sets
- Ability to control the depth of the tree from the setup panel
- Some ability to look ahead to the next layer and overcome greediness
- Tuned for use in semiconductor manufacturing
Cons:
- Few adjustable parameters available for access by a statistician
- No option for automatic training set creation and use
- No option to specify the minimum number of observations in a terminal node
- Data repair and outlier screening routines must be run manually beforehand
- Available only as part of the Genesis Yield Management software system
A Comparison of Models

                         Enterprise Miner   Yield Mine   S-PLUS
Final # of predictors    2                  4            4
Final # of leaves        3                  5            6
Overall assessment       0.56 / 0.42        0.57 / ?     0.61 / ?

Combination of parameters selected for predicting lot-level yield class:
- Enterprise Miner: P264292, P221341
- Yield Mine: P264286, P221340, P244494, P264292
- S-PLUS: P264286, P221341, P244494, P264292

Note: P264292 and P264286 are highly correlated, as are P221341 and P221340.
A Comparison of Process Spaces
[Plots: the process space implied by each model (Enterprise Miner, Yield Mine, S-PLUS), with regions classified LOW, MED, and HI]
Summary of Features

                                       Enterprise Miner  S-PLUS     Yield Mine
General tree graphics                  excellent         excellent  excellent
Model assessment graphics              excellent         good       good
Journaling options                     excellent         fair       excellent
GUI interface (user friendliness)      excellent         excellent  excellent
On-line help                           good              good       good
Balance between ease of use and
  knowledge of methodology             excellent         excellent  excellent
Scripting options                      yes               yes        yes
'On-the-fly' graphics                  fair              fair       excellent
Licensing cost                         $$$$$$            $          $$$$$$
Specifics of the model:
Flexibility of model criteria          excellent         good       fair
Ranking of model predictors            fair              good       good
Surrogate rules                        yes               no?        yes
Robustness to missing data             good              poor       good
Summary
- Decision tree methods are now well proven as a form of data mining. The techniques provide an intuitive and efficient method for discovering relationships across a broad range of data sets.
- The software implementations in SAS Enterprise Miner, S-PLUS, and Genesis Yield Mine were tested on a sample data set from IC manufacturing and produced comparable results.
- The software evaluations were an initial look at the potential each product offers in decision tree methodology. The strengths and weaknesses of each package should be weighed according to the audience (engineer or statistician) and the problem domain (semiconductor or other).
- These methods can be a valuable addition to one's arsenal of data analysis tools, and we encourage you to consider their use in your practice.
References
- Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J., 1984, Classification and Regression Trees, Pacific Grove, CA: Wadsworth.
- Kittler, R., and Wang, W., 2000, "Data Mining for Yield Improvements," Proceedings of MASM 2000.
- Salford Systems White Paper Series, 2001, "An Overview of CART Methodology," http://www.salford-systems.com/whitepaper.htm
- SAS Institute Inc., 1999, SAS Institute White Paper: Finding the Solution to Data Mining, Cary, NC: SAS Institute Inc.
- SAS Institute Inc., 2000, Enterprise Miner Version 4.0 On-line Reference Help, Cary, NC: SAS Institute Inc.
- S-PLUS, 2000, Guide to Statistics, Vol. 1.
- StatSoft, 2000, "The Statistics Homepage," http://www.statsoft.com/textbook/stathome.html