Decision Trees as a Predictive Modeling Method
Gerry Hobbs, Department of Statistics, West Virginia University

Abstract
Predictive modeling has become an important area of interest in tasks such as credit scoring, target marketing, medical diagnosis and fraud detection. The SAS System provides many tools that may be used in the prediction of both continuous and categorical targets. In this presentation I will limit myself to prediction algorithms based on recursive partitioning, commonly called decision trees. Several widely used algorithms exist and are known by names such as CART, C5.0 and CHAID, among others. Decision trees split the data into groups by successively dividing it into subgroups based on empirically derived associations between the response (target) and one or more predictor variables. In that effort, observations are sorted into bins based on the value(s) of the predictor(s). Criteria must be established, for each predictor, to determine which observations go in which bins so as to maximize the association with the response, and then to decide which of the predictors has the best association with the target variable in the particular subgroup being divided. The discussion will describe ways in which those decisions can be made and then ways in which the predictive algorithms, thus derived, can be validated. Simple decision trees often do not perform well in comparison with other predictive modeling methods (neural networks, regression, etc.). Their performance can be improved in a number of ways; we will discuss some methods, such as bootstrapping, that often improve on the initial results.

Introduction
When many researchers, marketers, investigators or other analysts think of prediction, they think in terms of classical (OLS) regression analysis. Indeed, regression in its many and varied guises continues to be both widely and successfully used in a large number of prediction problems.
A very different approach to prediction, called the decision tree, has become increasingly popular in recent years. Just as the regression approach may be applied to problems in which the response (target) variable is continuous or categorical (logistic regression), decision trees may also be applied to categorical or continuous response problems. In a similar way, either methodology may be applied to problems in which the candidate predictors are continuous, categorical or some mixture of the two. The process of fitting a decision tree is an algorithm that leads to a solution typically displayed in the form shown below as Figure 1. The data for Figure 1 come from an artificially created 10,000-observation data set that contained a
binary response divided as 6,319 0s and 3,681 1s. Five hundred of the 1s were selected at random, as were 500 of the 0s, in a process sometimes known as separate sampling or, perhaps even more commonly, as stratified random sampling. The resulting data set is obviously enriched in the proportion of 1s (50%) compared to the original population.

[Figure 1: A portion of the decision tree. The root node (1,000 obs: 500 0s, 500 1s) splits on X4 < .65 vs. X4 > .65 into a node of 616 obs (194 0s, 422 1s) and a node of 384 obs (306 0s, 78 1s). The left node splits on X1 < 3 vs. X1 > 3 into nodes of 348 obs (81 0s, 267 1s) and 268 obs (113 0s, 155 1s); the right node splits on X4 < .81 vs. X4 > .81 into nodes of 180 obs (120 0s, 60 1s) and 204 obs (186 0s, 18 1s).]

Decision tree displays similar to that shown in Figure 1 are available in both SAS/Enterprise Miner software and in JMP software. In the very small portion of a much larger decision tree shown above there are seven candidate predictor variables, X1 through X7, and a binary target. A small portion, five observations, of the 1,000-observation training data set is displayed below. X1 and X2 are ordinal categorical variables while X5 and X7 are (unordered) nominal variables. X3 and X4 are continuous while X6 is a count variable that might be considered ordinal. The goal of this prediction method and, indeed, of prediction generally, is to use the set of predictor variables to form groups that are as homogeneous as possible with respect to the target variable. That is to say, the ideal would be to form groups, based on the values of the predictor variables, in such a way that within each leaf all target values are one or all target values are zero. At each step in the partitioning process our goal is to maximize node purity (minimizing within-node variability is an equivalent expression). In other words, we want each split to separate the target values into groups of zeros and groups of ones as well as it is possible to do so.
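To make the idea of node purity concrete, here is a small Python sketch (an illustration only; the paper itself works in SAS tools) that scores the root split of Figure 1 with the Gini impurity, one of the purity measures discussed later as a splitting criterion. The node counts are taken from the figure; the function name is mine.

```python
def gini(n0, n1):
    """Gini impurity of a node holding n0 zeros and n1 ones."""
    n = n0 + n1
    p1 = n1 / n
    return 2 * p1 * (1 - p1)

# Counts from Figure 1: the root (500 0s, 500 1s) splits on X4 < .65.
root = gini(500, 500)        # 0.5, the maximum: a perfectly impure node
left = gini(194, 422)        # 616-observation child
right = gini(306, 78)        # 384-observation child

# Impurity after the split is the size-weighted average of the children.
weighted = (616 * left + 384 * right) / 1000
```

The weighted impurity after the split (about 0.39) is lower than the root's 0.5; a good split is one that lowers this weighted impurity as much as possible.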
X1   X2   X3     X4     X5   X6   X7     Target
 2    2   24.3   0.26   B     0   Blue     1
 3    4   29     0.42   A     0   Red      1
 2    2   23.8   0.26   C     3   Blue     1
 4    5   31.1   0.55   B     0   Blue     1
 1    2   30.7   0.16   C     1   Red      1

Three of the predictors, X1, X4 (twice) and X6 (twice), are involved in the five binary splits necessary to create the six-leaf tree shown. In this model the predicted value
of any current or future observation depends on just those three values. Since the target is binary, one possibility is that the goal is to estimate the probability that Target=1 for any set of predictors. If, for instance, X1=4, X4=0.5 and X6=0, we would get the predicted probability that Target=1 by following the path from the root node as follows. Because X4=0.5 is < 0.65, we go from the root node to the node below and to the left. From there we go to the node below and to the right because X1=4 is > 3. Finally, we choose the leaf below and to the left because X6=0 is < 1. Terminal nodes are called leaves. Of the 116 observations that fall into that leaf, 79 of them have Target=1 and 37 have Target=0. The proportion where Target=1 is, then, 79/116 = .681 and that can serve as our estimate. Of course, there are other estimators of proportions besides the sample proportion and they could equally well be used. We can also think of our prediction as the decision Target=1, since that is the more likely event, according to our estimate, than Target=0.

How we find splits
Consider now the extremely simple case where we have exactly three observations, with just one predictor variable and a binary (0,1) target. Assume further that the predictor assumes only the three values A, B and C displayed in the table that follows.

predictor   target
    A         1
    B         0
    C         1

Allowing only binary splits, there are 3 ways to form 2 groups from A, B and C. They are A vs. B,C; B vs. A,C; and C vs. A,B. Arranging the data into the three possible 2x2 contingency tables associating the predictor and target variables, we display the associated Pearson Chi-square statistic as follows.

           target
            0   1           0   1           0   1
    A       0   1    B      1   0    C      0   1
    B,C     1   1    A,C    0   2    A,B    1   1

Chi-square    0.75          3.00          0.75

The largest value of the Pearson Chi-square statistic, 3.00, results from placing A and C in one node and B in the other. That suggests that groups formed as A,C vs.
B are more closely associated with the target outcomes than either of the other two possibilities. Splitting criteria other than the Pearson Chi-square statistic are certainly possible. The likelihood ratio Chi-square is another obvious choice and the Gini coefficient is probably the most popular. Please note that we are not using any of these as a test statistic. Statistical significance is not an important issue at the moment. Indeed, one can argue that in pure prediction problems it is not generally an important consideration at all. Now consider the case where the predictor is either an ordinal categorical variable or is continuous. In fact, the big distinction in splitting is whether the predictor is at least ordinal (ordinal or continuous) as compared to nominal, because ordinal and
continuous predictors are treated in the same way. Specifically, when the data are at least ordinal, splits must respect the ordinal nature of the predictor. In other words, a numeric predictor would not be divided so that 3 and 8 were in one group while 4 and 6 were in another. Again, consider a binary target, but this time let the predictor take on the ordered values A, B, C and D in the data below.

predictor   target
    A         0
    B         0
    C         1
    D         1

There are 7 possible ways to split the letters A, B, C & D into two groups, but only three of the splits respect the imposed order structure. They are A|BCD, AB|CD and ABC|D, since the others, for instance AC|BD, place non-contiguous values in the same group. A and C are non-contiguous because B is between them. Displaying the 2x2 tables and associated Chi-squares as before, we get the following. Please note that we could substitute A=1, B=2, C=3 and D=4 into this example and use it to demonstrate splitting on a continuous predictor.

           target
            0   1           0   1           0   1
    A       1   0    AB     2   0    ABC    2   1
    BCD     1   2    CD     0   2    D      0   1

Chi-square    1.333         4.000         1.333

Clearly, the AB vs. CD split produces the largest value of the Pearson Chi-square statistic and so, at least by that single criterion, it would be selected as the chosen split. If a continuous or ordinal predictor has five distinct values then the number of order-consistent binary splits is four instead of fifteen. If a nominal variable has even ten distinct values then the number of possible binary splits is 511. With an increased number of candidate splits to search there is a better chance of achieving a large Chi-square by chance; therefore Bonferroni and other adjustments have been suggested. Indeed, the number of possible splits can be enormous. For a categorical variable that has eight levels there are 2^(8-1) - 1 = 127 possible binary splits and 4,139 possible splits of sizes 2 through 8.
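The split searches in the two worked examples above can be sketched in a few lines of Python (an illustration only, not part of the paper's SAS workflow; the helper names are mine). The Chi-square values reproduce the tables: for the nominal predictor every binary split is a candidate, while for the ordered predictor only the order-respecting splits are scored.

```python
def pearson_chi2(table):
    """Pearson Chi-square for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def split_table(data, left):
    """Cross-tabulate (left group vs. the rest) against the binary target."""
    t = [[0, 0], [0, 0]]
    for x, y in data:
        t[0 if x in left else 1][y] += 1
    return t

# Nominal example: all three binary splits of {A, B, C} are candidates.
nominal = [("A", 1), ("B", 0), ("C", 1)]
scores = {left: pearson_chi2(split_table(nominal, left))
          for left in [("A",), ("B",), ("C",)]}
# B vs. A,C wins with Chi-square = 3.00, as in the first table.

# Ordinal example: only the order-respecting splits are candidates.
ordinal = [("A", 0), ("B", 0), ("C", 1), ("D", 1)]
levels = ["A", "B", "C", "D"]
ordered = {tuple(levels[:k]): pearson_chi2(split_table(ordinal, tuple(levels[:k])))
           for k in range(1, len(levels))}
# AB vs. CD wins with Chi-square = 4.00, as in the second table.
```

In a real search the same scoring loop runs over every candidate predictor, and the best split of the best predictor is the one actually made.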
P values may be associated with the Chi-square statistics (here they all have one degree of freedom) and those p values may be adjusted for the multiplicity of splits considered for any particular variable. Without that adjustment, variables with many levels would be favored over those with few. In our case, of course, the best split is just the one with the smallest p value. Again we emphasize that the p value need not be understood as a test of significance in order to use it as a splitting criterion. In the situation where the response variable is continuous, the goal of node purity is one of minimizing the variability of the response within the chosen splits, i.e., the within-group variance. We can consider the result of any possible split as an analysis of variance problem with two or more groups formed by the splits. For a fixed number of splits, node purity is maximized when the SS(error) is minimized. Equivalently, of course, that amounts to maximizing SS(groups), the F statistic or R².
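This ANOVA view of splitting can be sketched in Python (an illustration, with a function name of my own) using four hypothetical responses 1, 2, 4, 6, the same values used in the example that follows.

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for a split, given the response values
    in each group formed by the split."""
    all_y = [y for g in groups for y in g]
    n, k = len(all_y), len(groups)
    grand = sum(all_y) / n
    # Between-group ("groups") and within-group ("error") sums of squares.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Responses 1, 2, 4, 6 for predictor levels a, b, c, d.
best_two_way = f_statistic([[1, 2], [4, 6]])      # (a,b) vs (c,d): F = 9.80
best_three_way = f_statistic([[1, 2], [4], [6]])  # (a,b) vs c vs d: F = 14.25
```

Turning F into a p value requires the tail of the F distribution (for example, scipy.stats.f.sf(F, df1, df2)); with degrees of freedom (2, 1) the larger three-way F actually has the larger p value, .184 versus .089, which is why splits of different sizes are compared on p rather than on F.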
A p value can be associated with these calculations in the usual way and that gives us a way to directly compare splits of different sizes, i.e., 2-way splits and 3-way splits, as the following example illustrates. Suppose a categorical predictor takes on the values a, b, c, d in a data set with only four observations. The corresponding responses are 1, 2, 4, 6. There are seven ways to split the predictor values into two groups. Placing d in one group and a, b, c in another leads to an ANOVA F value of 4.32 (p=.173), while placing a, b in one group and c, d in the other leads to F = 9.80 (p=.089). That second choice maximizes F over all two-way splits and, because all such splits result in an F that nominally has 1 and 2 degrees of freedom, also minimizes the p value. Therefore (a,b)::(c,d) is the best two-way split. Among three-way splits we must put 2 values in one group and one in each of the other two groups. If we put c and d together the resulting F is 3.19 (p=.368). The best of the three-way splits results when a and b are grouped together. For that split, F=14.25 (p=.184). Note that, while 14.25 is greater than the calculated F (9.80) from the best two-way split, the associated p value is larger than that resulting from the best two-way split. That occurs because the three-way split results in an F with different degrees of freedom than we had for the two-way split. In a real prediction problem, of course, there would be several candidate predictors, so at any point we would have to find the best split for each of the candidates and then choose the best of the best to determine the actual splitting variable. In all of this, the p value is the common currency: smaller adjusted p values mean better splits. The adjustments are too complex to go into here but mainly relate to the number of possible splits considered for each candidate.

Stopping Tree Growth
In certain instances a tree can be grown until each terminal node contains only a single observation.
Each terminal node is then perfectly pure with respect to the target. To do that would be to create a vastly over-fitted model. That is tantamount to fitting a high-degree polynomial, super-flexible spline function or some other overly complex model to a small data set. The problem, of course, is that while the various twists and turns in the fitted function help to fit the given data set, those random complexities are most unlikely to be replicated in any new data set from the same or a similar source. There are a couple of things that we can use in order to avoid over-fitting. The first has to do with limiting the growth of the tree in the first place. The second has to do with pruning the tree back to a simpler form after it has been fully grown. Even when using large data sets, the number of observations in some or all nodes will become small if you move far enough down the tree. With the smaller counts the split Chi-square values become proportionately smaller and so the p values become correspondingly larger. In addition, certain p value adjustments are made related to what can roughly be called multiple comparisons. Those adjustments become larger as you move down the tree. At some threshold, perhaps based on a p value but not necessarily 0.05, we usually choose to stop growing the tree. Other
considerations, such as establishing a minimum leaf size or maximum depth, may also be involved in decisions to stop growing the tree. There are various strategies in tree growth. One of the most popular has been labeled CART (an acronym for Classification And Regression Trees). In that and some other strategies, the goal is to over-fit the data with a view towards using another data set in order to prune the tree back to a more parsimonious size. On the other hand, CHAID (Chi-square Automatic Interaction Detector) is an algorithm that relies on stopping the growth of trees before over-fitting occurs.

Pruning the Tree
The processes we described earlier are meant to be applied to the training data and to find what has sometimes been called a maximal tree. The idea of a maximal tree is to establish a somewhat over-fitted tree that can be the basis for a series of steps in which the tree may be pruned back to a simpler form. Another data set, ordinarily constructed to contain the same proportions of the binary target outcomes, is held back for the purposes of validation. As the tree is grown (in the case where we limit ourselves to binary splits, that is just one additional node at a time) we form a series of trees: first one with two leaves, then one with three leaves, then one with four leaves, and so on. Each of those trees may be thought of as a prediction model and each of them may be applied to the validation data set. Each model in the sequence can then be assessed to see how well it fits the validation data. Any of a number of assessment criteria may be used in the comparison of the series of prediction models. If our prediction takes the form of a decision, say, to contact a person or to ignore them, then perhaps the most obvious choice is to assess the models according to accuracy, where accuracy is simply the proportion of observations in the validation data set that are correctly predicted.
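The selection step just described can be sketched in Python (an illustration only; the validation targets and per-tree predictions below are invented for the sketch, and the helper name is mine).

```python
def accuracy(predicted, actual):
    """Proportion of validation observations predicted correctly."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical predictions, on a 10-observation validation set, from the
# sequence of trees (2, 3 and 4 leaves) grown on the training data.
validation_targets = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
tree_predictions = {
    2: [1, 0, 1, 1, 1, 1, 1, 0, 1, 1],  # 8/10 correct
    3: [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],  # 9/10 correct
    4: [1, 0, 1, 0, 0, 1, 1, 0, 0, 1],  # 7/10 correct
}
best_size = max(tree_predictions,
                key=lambda k: accuracy(tree_predictions[k], validation_targets))
```

Here the 3-leaf tree wins on validation accuracy; in this made-up example the larger tree does worse on the held-back data, which is exactly the over-fitting that pruning is meant to undo.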
Other criteria may be, and often are, more appropriate for specific tasks. We can then choose the prediction model that best fits the validation data according to whatever assessment measure we have selected.

Improving Performance
Decision trees are useful tools for fitting noisy data. They are easy to explain to people who may not be comfortable with mathematics and they do, in many ways, reflect the mindset with which many humans naturally approach the task of prediction. It is also no small point that they can handle missing values of the predictor variables in a direct and nearly seamless manner, a point not discussed here. Unfortunately, they often don't yield predictions that are as precise as we might prefer and, more importantly, they are often out-performed by methodologies like regression, neural networks and some other less well known techniques. One reason they don't always predict well is that they are multivariate step functions and, as such, they are discontinuous. Lacking smoothness, observations that are very close together in the input space may get assigned predicted values that are substantially different, and the topology of the predictor
space may be highly unstable. Some methods have been developed that can mitigate these problems to a large degree. Ensemble is a description given to a general class of models in which the final predictions are averages of the predictions made by other models. Bagging and boosting are two widely used ensemble methodologies. Ensemble models can be derived for almost any set of models; here we focus on ensembles formed from tree models. Random Forests constitute one successful strategy that combines information from many trees. The process involves selecting several, say n_t, bootstrap (with replacement) samples of N observations from the original population of N observations. At each splitting opportunity (node) we select a subset of m << M inputs at random from among the M input variables available. We grow maximal trees in the sense that there is no pruning, although growth may be limited, for instance, by specifying a minimum tree size or some threshold for the Gini statistic. If the best available split fails to meet the threshold we cease splitting. We repeat that process many times. To predict (score) any observation we pass it through each of the trees and average the predictions over all trees. If the prediction simply consists of choosing a (categorical) outcome, then each tree casts a vote and the prediction is the winner of that vote. For continuous targets, each observation is passed through each tree to produce a numerical prediction and the predictions from the many trees are averaged in order to find the final prediction of the target value. In what is somewhere between those two ideas, for a categorical response, you can average the predicted probabilities for an observation and use the averaged value to predict P(Target=1) for that observation. No validation data set is required when using this approach to modeling.
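The bootstrap-and-vote mechanics described above can be sketched in Python (an illustration only; the sample size, votes and probabilities are invented, and the trees themselves are abstracted away).

```python
import random
from collections import Counter

random.seed(1)
N = 10_000
indices = list(range(N))

# One bootstrap sample: N draws with replacement from N observations.
sample = [random.choice(indices) for _ in range(N)]
# The unselected fraction is close to 1/e ~ 0.368: the OOB data.
oob_fraction = 1 - len(set(sample)) / N

# Categorical target: each tree casts a class vote for one observation
# and the ensemble predicts the majority class.
tree_votes = [1, 0, 1, 1, 0]
prediction = Counter(tree_votes).most_common(1)[0][0]

# The in-between option: average each tree's estimate of P(Target=1).
tree_probs = [0.68, 0.55, 0.71, 0.60, 0.42]
p_hat = sum(tree_probs) / len(tree_probs)
```

Each tree is grown on its own bootstrap sample, and the per-tree OOB observations are the ones used to estimate prediction error without a separate validation set.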
Although the proportion of observations not selected in a bootstrap sample will vary for real problems with their finite samples, in a certain limiting sense the fraction of the observations not selected in any bootstrap sample will be 1/e ~ 37%. Those observations are usually referred to as the OOB (out of bag) data and they are usually used to estimate the errors of prediction.

Contact Information: Gerry Hobbs may be reached at ghobbs@stat.wvu.edu

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Random Forest is a registered trademark of Leo Breiman and Adele Cutler.