THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

Introduction

Most data-mining projects involve classification problems, that is, assigning objects to classes, whether it be sifting profitable from unprofitable, detecting fraudulent cases, identifying repeat buyers, profiling high-value customers likely to attrite, or flagging high credit-risk applications. This paper proposes a new technique for solving these types of classification problems by hybridizing two popular classification tools, decision trees and logistic regression. CART and LOGIT, data-mining tools by Salford Systems, are used to demonstrate how to implement this new approach from both a theoretical and a practical perspective.

Strengths and Weaknesses of CART

CART (Classification and Regression Trees) is a state-of-the-art decision-tree tool that can investigate any classification task and provide a robust, accurate predictive model. The methodology is characterized by its ability to automate the modeling process, communicate via pictures, and handle complex data structures. The core features of Salford Systems' CART include:

- automatic separation of relevant from irrelevant predictors
- automatic interaction detection
- impervious to outliers
- unaffected by missing values (using surrogates)
- invariant to monotone variable transformations (e.g., log, square root)

CART, and decision trees in general, are however notoriously weak at capturing strong linear structure. While CART recognizes the structure, it cannot represent it effectively, producing very large trees in an attempt to represent very simple relationships. A second weakness of tree-based tools is that they sometimes produce very coarse-grained response images; that is, the tree may contain only a small number of terminal nodes. Because the classification outcome or score is shared by all cases in a terminal node, a 12-node tree, for example, can predict only 12 different probabilities.
For some problems this may be considered a classification success (e.g., identifying a large set of cases as responders); for other types of problems, however, it can be problematic. A final weakness, again problematic only for particular problem types, is that decision trees generate discontinuous responses; thus, a small change in x could lead to a large change in y.

Strengths and Weaknesses of Logit

Logistic regression (or Logit) is a traditional methodology that relies on classical statistical principles and, like CART, demonstrates remarkable accuracy in a broad range of contexts. In the STATLOG project, for example, variations on logistic discriminant analysis consistently ranked among the top performers, scoring highest in 5 of 21 problems. Logistic regression models effectively capture global linear structure in data and, given that non-linear structure can be reasonably approximated with linear structure, even incorrectly specified models can perform well. The technique also provides a smooth, continuous predicted probability of class membership, so that a small change in a predictor variable yields a small change in the predicted probability.

However, while Logit excels at handling linear and smooth curvilinear data structures, it often requires experts to hand-craft the models, and the results can be challenging to interpret, often understood only via simulation. In addition, while Logit permits some flexibility with transformations, polynomials, and interactions, it is incumbent upon the analyst to correctly identify the best variable representations.

Juxtaposing the strengths and weaknesses of CART and Logit, a natural question arises: given that they excel at different tasks, can we capitalize on their strengths by combining them?

CART | LOGIT
Automatic analysis | Requires experts
Uses surrogates for missing values | Deletes records or imputes missing values
Unaffected by outliers | Sensitive to outliers
Discontinuous response: small change in x could lead to large change in y | Continuous smooth response: small change in x leads to small change in y
Coarse-grained response images: finite number of probabilities | Unique predicted probability for every record
High speed | Low speed to infeasible if too many inputs

Early Attempts to Hybridize

The first attempt to capitalize on the different strengths of the two techniques involved running logistic regression models in the terminal nodes of deliberately shallow decision trees.

[Figure: a shallow CART tree with a separate LOGIT model fit in each terminal node]

An examination of how CART works explains why these first attempts were unsuccessful. As noted above, CART excels in the detection of local data structure. Once a data set is partitioned into two subsets at the root node, each half of the tree is analyzed separately. As the partitioning continues, the analysis is always restricted to the node in focus. The discovery of patterns becomes progressively more localized, and the fit at one node is never adjusted to take into account the fit at another.

In this manner, CART reaches its goal: to split the data into homogeneous subsets. The farther down the tree, the less variability remains in the target (dependent) variable. CART splits send cases with x ≤ c to the left and x > c to the right; thus, the variance in the predictor (independent) variables is also drastically reduced. For example, if x is normally distributed and the splitting cut point is at the mean of x, the variance in each of the two child nodes is reduced by about 64%. For subsequent mean splits, the variance reduction will always be greater than 50%. This reduction in predictor variable variance also applies to correlated predictors.

By the time CART has declared a node terminal, the information remaining in the node is insufficient to support further statistical analysis. The sample size in the terminal nodes is drastically reduced, as is the variance of the target and predictor variables. Thus, in a well-developed CART tree, no parametric model should be supportable within terminal nodes. Estimating Logits or other parametric models earlier in the tree, for example after just a few splits, has the same drawback as terminal-node models, although the results are less extreme. At best, this latter approach provides a mechanism for identifying switching regressions, although it is not very successful in practice.
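The 64% figure quoted above for a mean split of a normally distributed predictor can be checked directly. For a normal x with variance s^2, the variance on either side of the mean is s^2 (1 - 2/pi), so the reduction is 2/pi. The following is a minimal sketch using only the Python standard library; the function names, sample size, and seed are illustrative choices, not part of the original paper.

```python
import math
import random

def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def analytic_reduction():
    """Share of variance removed by a mean split of a normal variable.

    Var(x | x > mean) = s^2 * (1 - 2/pi), so the reduction is 2/pi.
    """
    return 2.0 / math.pi

def simulated_reduction(n=100_000, seed=0):
    """Estimate the same reduction from simulated standard-normal draws."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    cut = sum(xs) / len(xs)               # split at the sample mean
    right = [x for x in xs if x > cut]    # cases sent to the right child
    return 1.0 - variance(right) / variance(xs)

analytic = analytic_reduction()
simulated = simulated_reduction()
print(f"analytic reduction:  {analytic:.3f}")
print(f"simulated reduction: {simulated:.3f}")
```

The same argument applies to the left child by symmetry. The sketch checks only the first split of a normal predictor; it does not attempt to verify the claim about subsequent splits.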
Proposed New Hybrid Approach

The key to running a successful hybrid is to run the Logit in the root node, thereby taking advantage of Logit's strength in detecting global structure, and to include CART terminal-node dummies in the Logit model. This new hybrid approach is implemented as follows:

1. Use CART to assign every case to a terminal node. With CART surrogates, assignment is possible in every case, even those with some or all values missing.
2. Create a new categorical variable equal to the terminal node assignment. The new categorical variable will have as many levels as there are terminal nodes.
3. Feed the categorical variable into the Logit model (LOGIT will automatically expand an x-level categorical into x dummy variables). This first-run Logit model is then used as a baseline model (more on this below).
4. Add main-effects variables to the baseline CART-Logit model. The added variables constitute the hybrid component and can be tested as a group via a likelihood ratio test.

[Figure: a CART tree supplies terminal-node dummies, which enter a single Logit run on the entire dataset together with additional variables]

The Logit formulas for the CART and hybrid models can be represented as follows.

CART only:

$$y = \beta_0 + \beta_1 \mathrm{NODE}_1 + \beta_2 \mathrm{NODE}_2 + \cdots + \beta_K \mathrm{NODE}_K$$

where $\mathrm{NODE}_i$ is a dummy variable for the $i$-th CART node, and

CART-Logit hybrid:

$$y = \beta_0 + \sum_{i=1}^{K} \beta_i \mathrm{NODE}_i + \sum_{j=K+1}^{K+J} \beta_j z_j$$

where the first sum collects the CART node dummies and the second sum collects the $J$ hybrid covariates $z_j$.

The Logit model fit to the CART terminal-node dummies converts the dummies into estimated probabilities; otherwise, it is an exact representation of the CART model. Each dummy represents the rules and interaction structure discovered by CART, albeit buried in a black box. The likelihood score of this model can then be used as a baseline score for further testing and model assessment. Note also that this simple hybrid model is an excellent way to incorporate sampling weights and recalibrate a CART tree.

The addition of main-effect variables to the baseline model then allows the expanded model to capture effects common across all nodes (i.e., global structure). Because all strong effects have already been detected in the initial CART run, the effects detected across the terminal nodes are likely to be weak; nevertheless, a collection of weak effects can be very significant. A good starting point for expanding the LOGIT component of the hybrid model is to:

- add variables already selected as important by CART
- add competitor variables in the root node that never actually appeared as splitters or surrogates in the CART tree
- add variables known to be important from other studies

A stepwise selection procedure can be used to pare down the variable list, and the pared-down list of main-effect variables can then be tested as a group via a likelihood ratio test.

In sum, by looking across nodes, Logit finds effects that CART cannot detect. Because these effects are not very strong, they are not detected by CART and are not used as primary node splitters. Once the sample is split by CART, these effects become progressively more difficult to detect as the subsamples in the child nodes become increasingly homogeneous. While these effects may not be strong individually, collectively they can add enormous predictive power to the model.
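Steps 2 to 4 of the hybrid procedure can be illustrated with a minimal sketch. It is not the Salford Systems implementation: the terminal-node labels are assumed to have been produced already by a tree (here they are simply simulated), the covariate `z`, the simulated coefficients, and the helper names `node_dummies` and `fit_logit` are all hypothetical, and the Logit is fit by plain gradient ascent rather than by the estimation routine of any particular package. The sketch fits the baseline Logit on node dummies, compares its fitted probabilities with each node's observed response rate, then adds one covariate and forms the likelihood ratio statistic.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def node_dummies(node_ids):
    """Expand terminal-node labels into one 0/1 column per node."""
    levels = sorted(set(node_ids))
    rows = [[1.0 if n == lev else 0.0 for lev in levels] for n in node_ids]
    return rows, levels

def fit_logit(X, y, steps=800, lr=1.0):
    """Fit a Logit by gradient ascent; return (coefficients, log-likelihood)."""
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for row, yi in zip(X, y):
            err = yi - sigmoid(sum(b * x for b, x in zip(beta, row)))
            for j in range(k):
                grad[j] += err * row[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    loglik = 0.0
    for row, yi in zip(X, y):
        p = sigmoid(sum(b * x for b, x in zip(beta, row)))
        loglik += math.log(p if yi == 1 else 1.0 - p)
    return beta, loglik

# Simulated stand-in for CART output: three terminal nodes plus a weak
# covariate z whose effect is common to all nodes (all values hypothetical).
rng = random.Random(0)
true_node_effect = {1: -1.0, 2: 0.0, 3: 1.0}
node_ids = [1 + i % 3 for i in range(300)]
z = [rng.gauss(0.0, 1.0) for _ in node_ids]
y = [1 if rng.random() < sigmoid(true_node_effect[n] + 0.8 * zi) else 0
     for n, zi in zip(node_ids, z)]

# Baseline: Logit on node dummies only. No separate intercept is used here,
# so the K dummies are not collinear with a constant.
D, levels = node_dummies(node_ids)
base_beta, base_ll = fit_logit(D, y)

# Observed response rate per node versus the baseline's fitted probability.
node_rate = {lev: sum(yi for n, yi in zip(node_ids, y) if n == lev)
             / node_ids.count(lev) for lev in levels}
fitted_rate = {lev: sigmoid(b) for lev, b in zip(levels, base_beta)}

# Hybrid: node dummies plus the covariate, tested as a group (one term here).
H = [row + [zi] for row, zi in zip(D, z)]
hyb_beta, hyb_ll = fit_logit(H, y)
lr_stat = 2.0 * (hyb_ll - base_ll)  # compare with chi-square, 1 d.f.

print({lev: round(fitted_rate[lev], 3) for lev in levels}, round(lr_stat, 2))
```

With one added covariate, the statistic is referred to a chi-square distribution with one degree of freedom (5% critical value 3.841); with several added covariates, the degrees of freedom equal the number of added terms.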
Finessing the Hybrid Model

Other considerations that must be addressed in building a CART-Logit hybrid model are missing values, variable transformations, and interactions. The simplest approach to missing values is, of course, to ignore the problem by dropping all records with missing values on model variables. Alternatively, CART-predicted probabilities can be assigned to cases with missing values and hybrid-predicted probabilities to all other cases. More complicated approaches include missing-value imputation and adding missing-value dummy indicators to the model plus nesting for non-missing. Given that CART trees give good results, these more complicated approaches are usually not required.
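The fallback rule described above, CART-predicted probabilities for cases with missing values and hybrid-predicted probabilities for all other cases, can be sketched as follows. This is an illustration under stated assumptions: the function names, the single-node setup, and every coefficient value are hypothetical, and missing values are assumed to be encoded as `None` or NaN.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing (an assumed encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def hybrid_probability(node_coef, covariate_coefs, covariates):
    """Logit probability from one node-dummy coefficient plus covariate terms."""
    score = node_coef + sum(b * v for b, v in zip(covariate_coefs, covariates))
    return 1.0 / (1.0 + math.exp(-score))

def score_case(cart_prob, node_coef, covariate_coefs, covariates):
    """Return the CART node probability if any covariate is missing,
    otherwise the hybrid-predicted probability."""
    if any(is_missing(v) for v in covariates):
        return cart_prob
    return hybrid_probability(node_coef, covariate_coefs, covariates)

# Hypothetical values for a single terminal node and two covariates.
cart_prob = 0.30
node_coef = math.log(cart_prob / (1.0 - cart_prob))
covariate_coefs = [0.5, -0.2]

complete = score_case(cart_prob, node_coef, covariate_coefs, [1.0, 2.0])
incomplete = score_case(cart_prob, node_coef, covariate_coefs, [1.0, None])
print(round(complete, 3), incomplete)
```

Because the tree can score every case through its surrogates, this rule yields a prediction for every record without imputation.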

Variable transforms, such as logs and square roots, will also need to be considered. Given that interactions will already be captured in the CART terminal-node dummies, the only interaction terms worth considering are those that capture node-specific effects (i.e., terminal-node interactions with selected variables) and interactions with missing-value indicators. Salford Systems' MARS, which automates the process of identifying optimal variable transformations and interactions, can be used effectively at this stage.

Assessing Node-Specific Logit Fit

CART segments the data into very different subsamples, so why expect a single common Logit to be valid? First, the terminal-node dummies capture all of the complex interactions and non-commonality of the hyper-segments. Second, a simple test can be carried out to check whether the Logit improved on the CART score in each node. To test whether the hybrid model shows a lack of fit in any node or subset of nodes, perform a simple likelihood test node by node. If the CART likelihood is greater than the hybrid model likelihood, do not apply the hybrid model to that particular node.

Real World Results

In our experience, models have often been improved dramatically by the hybrid CART-Logit technique. In the following direct mail and financial market applications, the hybrid outperformed both CART and LOGIT alone:

Direct mail applications:
- response model for catalog
- response model for credit card offer
- response model for an insurance product

Financial applications:
- mortgage model
- loan delinquency model
- fraud detection model

While these real-world examples are valuable case studies, they constitute a small sample, and the results cannot be shared due to confidentiality. Because of this, extensive experiments on artificial data sets were conducted. This analysis permitted a more accurate assessment of the possible benefits and flaws of the hybrid methodology.
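The node-by-node check described under Assessing Node-Specific Logit Fit can be sketched as follows: compute the log-likelihood of the CART probabilities and of the hybrid probabilities within each terminal node, and keep the hybrid only where it is higher. The function names and the toy data are hypothetical, and the paper does not say on which sample (training or hold-out) the comparison should be made.

```python
import math
from collections import defaultdict

def node_log_likelihoods(node_ids, y, probs):
    """Sum the Bernoulli log-likelihood of `probs` separately for each node."""
    totals = defaultdict(float)
    for node, yi, p in zip(node_ids, y, probs):
        totals[node] += math.log(p if yi == 1 else 1.0 - p)
    return dict(totals)

def use_hybrid_by_node(node_ids, y, cart_probs, hybrid_probs):
    """Map each node to True if the hybrid likelihood exceeds the CART one."""
    cart_ll = node_log_likelihoods(node_ids, y, cart_probs)
    hybrid_ll = node_log_likelihoods(node_ids, y, hybrid_probs)
    return {node: hybrid_ll[node] > cart_ll[node] for node in cart_ll}

# Toy example with two terminal nodes (all numbers hypothetical).
node_ids     = ["A", "A", "B", "B"]
y            = [1,   0,   1,   1]
cart_probs   = [0.5, 0.5, 0.9, 0.9]   # one shared probability per node
hybrid_probs = [0.8, 0.3, 0.6, 0.7]   # varies within a node

decision = use_hybrid_by_node(node_ids, y, cart_probs, hybrid_probs)
print(decision)
```

In this toy example the hybrid fits node A better and node B worse, so under the rule above node B would keep its CART score.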
Monte Carlo Test Results

For the Monte Carlo hybrid model assessment, samples of various sizes (2,000 to 100,000 records) were randomly drawn (from?). Each experiment was run 100 different times by resetting the random seed. The resulting models were assessed on the basis of fit and also performance (e.g., profit yielded if the model guides policy). Training and hold-out samples were used to assess possible over-fitting of the data. The specific Monte Carlo experiments are summarized below:

- simple Logit: one variable
- CART tree: one variable, highly non-linear
- hybrid process
- Logit: several variables (possibly missing)
- hybrid: several variables (possibly missing)
- highly non-linear smooth function (not Logit)
- complex Logit with informative missingness

The Monte Carlo results indicated that in smaller samples (n=2,000), LOGIT performed very well even when it was not the true model. In larger samples (n=20,000), the hybrid model dominated on out-of-sample performance measures. In the larger samples, both CART and the hybrid model also manage problems with missing values, whereas the Logit model collapses. Finally, in larger samples with high frequencies of missing values, the hybrid outperforms the other models regardless of which model is true.

References

Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984), Classification and Regression Trees, Pacific Grove: Wadsworth.
Friedman, J. H. (1991), Multivariate Adaptive Regression Splines (with discussion), Annals of Statistics, 19, 1-141 (March).
Michie, D., D. J. Spiegelhalter, and C. C. Taylor, eds. (1994), Machine Learning, Neural and Statistical Classification, London: Ellis Horwood Ltd.
Steinberg, D. and P. Colla (1995), CART: Tree-Structured Non-Parametric Data Analysis, San Diego, CA: Salford Systems.

CART is a registered trademark of California Statistical Software and licensed exclusively to Salford Systems. All other trademarks mentioned are the property of their respective owners.