THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell



Introduction

Most data-mining projects involve classification problems (assigning objects to classes), whether it be sifting profitable from unprofitable cases, detecting fraud, identifying repeat buyers, profiling high-value customers likely to attrite, or flagging high credit-risk applications. This paper proposes a new technique for solving these types of classification problems by hybridizing two popular classification tools: decision trees and logistic regression. CART and LOGIT, data-mining tools by Salford Systems, are used to demonstrate how to implement the new approach from both a theoretical and a practical perspective.

Strengths and Weaknesses of CART

CART (Classification and Regression Trees) is a state-of-the-art decision-tree tool that can investigate any classification task and provide a robust, accurate predictive model. The methodology is characterized by its ability to automate the modeling process, communicate via pictures, and handle complex data structures. The core features of Salford Systems' CART include:

- automatic separation of relevant from irrelevant predictors
- automatic interaction detection
- imperviousness to outliers
- robustness to missing values (using surrogates)
- invariance to variable transformations (e.g., log, square root)

CART, and decision trees in general, are nonetheless notoriously weak at capturing strong linear structure. While CART recognizes such structure, it cannot represent it efficiently, producing very large trees in an attempt to represent very simple relationships. A second weakness of tree-based tools is that they sometimes produce very coarse-grained response images; that is, the tree may contain only a small number of terminal nodes. Because the classification outcome or score is shared by all cases in a terminal node, a 12-node tree, for example, can predict only 12 different probabilities. For some problems this may be counted a classification success (e.g., identifying a large set of cases as "responders"); for other types of problems it can be problematic. A final weakness, again only problematic for particular problem types, is that decision trees generate discontinuous responses: a small change in x can lead to a large change in y.
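To make the coarse-grained response concrete, here is a minimal sketch using scikit-learn's decision tree as an assumed stand-in for CART (a commercial product not shown in the original paper); the synthetic data and the 12-leaf limit are illustrative choices:

```python
# Sketch: a tree with 12 terminal nodes can emit at most 12 distinct scores.
# scikit-learn is an assumed stand-in for Salford Systems CART.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

tree = DecisionTreeClassifier(max_leaf_nodes=12, random_state=0).fit(X, y)
probs = tree.predict_proba(X)[:, 1]   # one score per terminal node
print(len(np.unique(probs)))          # at most 12 distinct values
```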

Strengths and Weaknesses of Logit

Logistic regression (or Logit) is a traditional methodology that relies on classical statistical principles and, like CART, demonstrates remarkable accuracy in a broad range of contexts. In the STATLOG project, for example, variations on logistic discriminant analysis consistently ranked among the top performers, scoring highest in 5 of 21 problems. Logistic regression models effectively capture global linear structure in data, and because non-linear structure can often be reasonably approximated by linear structure, even incorrectly specified models can perform well. The technique also provides a smooth, continuous predicted probability of class membership, so that a small change in a predictor variable yields a small change in the predicted probability. However, while Logit excels at handling linear and smooth curvilinear data structures, it often requires experts to hand-craft the models, and the results can be challenging to interpret, often understood only via simulation. In addition, while Logit permits some flexibility with transformations, polynomials, and interactions, it is incumbent upon the analyst to correctly identify the best variable representations.

Juxtaposing the strengths and weaknesses of CART and Logit, a natural question arises: given that they excel at different tasks, can we capitalize on their strengths by combining them?

CART                                            LOGIT
Automatic analysis                              Requires experts
Uses surrogates for missing values              Deletes records or imputes missing values
Unaffected by outliers                          Sensitive to outliers
Discontinuous response (a small change in x     Continuous smooth response (a small change
  can lead to a large change in y)                in x leads to a small change in y)
Coarse-grained response (a finite number        Unique predicted probability for every
  of probabilities)                               record
High speed                                      Slow, to infeasible, with too many inputs

Early Attempts to Hybridize

The first attempt to capitalize on the different strengths of the two techniques involved running logistic regression models in the terminal nodes of deliberately shallow decision trees.

[Figure: a shallow CART tree with a separate LOGIT model fitted in each of its terminal nodes]
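The idea is easy to state in code. Below is a minimal sketch of this early approach under the same assumed scikit-learn stand-ins (the shallow depth and data are illustrative); the per-node sample counts it prints foreshadow the failure mode analyzed next:

```python
# Early hybrid: grow a shallow tree, then fit a separate logistic
# regression inside each terminal node (assumed stand-ins for CART/LOGIT).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
leaf_id = shallow.apply(X)                   # terminal-node assignment per case

node_models = {}
for leaf in np.unique(leaf_id):
    mask = leaf_id == leaf
    print(f"node {leaf}: n = {mask.sum()}")  # samples shrink node by node
    if len(np.unique(y[mask])) > 1:          # a logit needs both classes present
        node_models[leaf] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
```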

An examination of how CART works explains why these first attempts were unsuccessful. As noted above, CART excels at detecting local data structure. Once a data set is partitioned into two subsets at the root node, each half of the tree is analyzed separately. As the partitioning continues, the analysis is always restricted to the node in focus. The discovery of patterns becomes progressively more localized, and the fit at one node is never adjusted to take into account the fit at another. In this manner, CART reaches its goal: to split the data into homogeneous subsets. The farther down the tree, the less the variability in the target (dependent) variable. CART splits send cases with x <= c to the left and x > c to the right; thus, the variance of the predictor (independent) variables is also drastically reduced. For example, if x is normally distributed and the splitting cut point is at the mean of x, the variance in each child node is reduced by about 64% (a short derivation follows). For subsequent mean splits, the variance reduction will always be greater than 50%. This reduction in predictor-variable variance also applies to correlated predictors.
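The 64% figure can be checked directly: after a split of a normal variable at its mean, each child node sees a half-normal distribution, whose variance is a known fraction of the parent's.

```latex
% Variance of X ~ N(mu, sigma^2) truncated at its mean (either half):
% E[X - \mu \mid X > \mu] = \sigma\sqrt{2/\pi} and E[(X - \mu)^2 \mid X > \mu] = \sigma^2, so
\operatorname{Var}(X \mid X > \mu)
  = \sigma^2 - \left(\sigma\sqrt{2/\pi}\right)^2
  = \sigma^2\left(1 - \frac{2}{\pi}\right) \approx 0.363\,\sigma^2,
% a variance reduction of roughly 64% in each child node.
```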

By the time CART has declared a node terminal, the information remaining in the node is insufficient to support further statistical analysis. The sample size in the terminal nodes is drastically reduced, as are the target and predictor variable variances. Thus, in a well-developed CART tree, no parametric model should be supportable within the terminal nodes. Estimating Logits or other parametric models earlier in the tree, for example after just a few splits, has the same drawbacks as terminal-node models, though in less extreme form. At best, this latter approach provides a mechanism for identifying switching regressions, although this is not very successful in practice.

Proposed New Hybrid Approach

The key to a successful hybrid is to run the Logit in the root node, thereby taking advantage of Logit's strength in detecting global structure, and to include the CART terminal-node dummies in the Logit model. The new hybrid approach is implemented as follows (a code sketch appears after the formulas below):

1. Use CART to assign every case to a terminal node; with CART surrogates, assignment is possible for every case, even those with some or all values missing.
2. Create a new categorical variable equal to the terminal-node assignment; the new categorical will have as many levels as terminal nodes.
3. Feed the categorical variable into the Logit model (LOGIT will automatically expand a K-level categorical into K dummy variables); this first-run Logit model is then used as a baseline model (more on this below).
4. Add main-effects variables to the baseline CART-Logit model; the added variables constitute the hybrid component and can be tested as a group via a log-likelihood ratio test.

[Figure: the CART tree generates one dummy variable per terminal node, and the dummies feed a Logit run on the entire data set]

The Logit index formulas for the CART-only and hybrid models can be represented as follows.

CART only:

    y = \beta_0 + \beta_1 \mathrm{NODE}_1 + \beta_2 \mathrm{NODE}_2 + \dots + \beta_K \mathrm{NODE}_K

where NODE_i is a dummy variable for the i-th CART terminal node, and

CART-Logit hybrid:

    y = \beta_0 + \sum_{i=1}^{K} \beta_i \mathrm{NODE}_i + \sum_{j=1}^{J} \gamma_j X_j
      = \text{CART node dummies} + \text{hybrid covariates}
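Here is a minimal end-to-end sketch of steps 1 through 4. It uses scikit-learn's tree and statsmodels' Logit as assumed stand-ins for the Salford Systems CART and LOGIT products; the data, the 12-leaf tree, and the dummy construction are illustrative choices, not the original implementation:

```python
# Hybrid CART-Logit sketch (assumed stand-ins: scikit-learn tree for CART,
# statsmodels Logit for LOGIT; data and settings are illustrative).
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

# Step 1: assign every case to a terminal node.
tree = DecisionTreeClassifier(max_leaf_nodes=12, min_samples_leaf=200,
                              random_state=0).fit(X, y)
leaf_id = tree.apply(X)

# Steps 2-3: expand the node assignment into dummies; fit the baseline Logit.
# One column per node and no intercept avoids the dummy-variable trap.
leaves = np.unique(leaf_id)
node_dummies = (leaf_id[:, None] == leaves[None, :]).astype(float)
baseline = sm.Logit(y, node_dummies).fit(disp=0)

# Step 4: add main-effects covariates to form the hybrid model.
hybrid = sm.Logit(y, np.column_stack([node_dummies, X])).fit(disp=0)

# Test the added covariates as a group via a likelihood-ratio test.
lr = 2 * (hybrid.llf - baseline.llf)
print(f"LR = {lr:.1f}, p = {chi2.sf(lr, df=X.shape[1]):.4g}")
```

In this dummy coding the baseline model reproduces the CART node probabilities exactly (each coefficient is just the log-odds of its node's response rate), which is the sense in which it re-expresses the tree.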

The Logit model fit to the CART terminal-node dummies converts the dummies into estimated probabilities; otherwise, it is an exact representation of the CART model. Each dummy represents the rules and interaction structure discovered by CART, albeit buried in a black box. The likelihood score of this model can then be used as a baseline score for further testing and model assessment. Note also that this simple hybrid model is an excellent way to incorporate sampling weights and to recalibrate a CART tree.

The addition of main-effect variables to the baseline model then allows the expanded model to capture effects common across all nodes (i.e., global structure). Because all strong effects have already been detected in the initial CART run, the effects detected across the terminal nodes are likely to be weak; nevertheless, a collection of weak effects can be very significant. A good starting point for expanding the LOGIT component of the hybrid model is to:

- add variables already selected as important by CART
- add competitor variables in the root node that never actually appeared as splitters or surrogates in the CART tree
- add variables known to be important from other studies

A stepwise selection procedure can be used to pare down the variable list, and the pared-down list of main-effect variables can then be tested as a group via a likelihood-ratio test. In sum, by looking across nodes, Logit finds effects that CART cannot detect. Because these effects are not very strong, they are not detected by CART and are not used as primary node splitters. Once the sample is split by CART, these effects become progressively more difficult to detect, as the subsamples become increasingly homogeneous in the child nodes. While these effects may not be strong individually, collectively they can add enormous predictive power to the model.

Finessing the Hybrid Model

Other considerations that must be addressed in building a CART-Logit hybrid model are missing values, variable transformations, and interactions. The simplest approach to handling missing values is to ignore the problem by dropping all records with missing values on the model variables. Alternatively, CART-predicted probabilities can be assigned to the cases with missing values and hybrid-predicted probabilities to all other cases (see the sketch below). More complicated approaches include missing-value imputation and adding missing-value dummy indicators to the model, plus nesting for the non-missing values. Given that CART trees give good results, these more complicated approaches are usually not required.
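A minimal sketch of the fallback scoring rule just described, continuing the stand-ins from the earlier sketches (the NaN masking and zero-filling are illustrative assumptions; the real CART routes missing values through surrogate splits):

```python
# Score complete records with the hybrid; fall back to CART-only
# probabilities for records with missing model variables.
import numpy as np

def score_with_fallback(X_new, tree, hybrid, leaves):
    has_missing = np.isnan(X_new).any(axis=1)
    scores = np.empty(len(X_new))

    # Complete records: hybrid probability from node dummies + covariates.
    ok = ~has_missing
    leaf_id = tree.apply(X_new[ok])
    dummies = (leaf_id[:, None] == leaves[None, :]).astype(float)
    scores[ok] = hybrid.predict(np.column_stack([dummies, X_new[ok]]))

    # Records with missings: CART-only probability for class 1.
    filled = np.nan_to_num(X_new[has_missing])   # crude stand-in for surrogates
    scores[has_missing] = tree.predict_proba(filled)[:, 1]
    return scores
```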

Variable transforms, such as logs and square roots, also need to be considered. Because interactions are already captured in the CART terminal-node dummies, the only interaction terms worth considering are those that capture node-specific effects (i.e., terminal-node interactions with selected variables) and interactions with missing-value indicators. Salford Systems' MARS, which automates the process of identifying optimal variable transformations and interactions, can be utilized effectively at this stage.

Assessing Node-Specific Logit Fit

CART segments the data into very different subsamples, so why expect a single common Logit to be valid? First, the terminal-node dummies capture all of the complex interactions and non-commonality of the hyper-segments. Second, a simple test can be carried out to check whether the Logit developed in each node yields an improvement over the CART score. To test whether the hybrid model shows a lack of fit in any node or subset of nodes, perform a simple likelihood test node by node: if the CART likelihood is greater than the hybrid-model likelihood in a node, do not apply the hybrid model to that particular node. A sketch of this per-node check follows.
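The per-node check can be written down directly: compare the within-node Bernoulli log-likelihoods of the two sets of predicted probabilities and keep the CART score wherever it wins. This continues the assumed stand-ins of the earlier sketches:

```python
# Per-node lack-of-fit check: use the hybrid only in nodes where it beats
# the CART-only likelihood.
import numpy as np

def bernoulli_ll(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)                 # guard against log(0)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def choose_model_per_node(y, leaf_id, cart_probs, hybrid_probs):
    use_hybrid = {}
    for leaf in np.unique(leaf_id):
        mask = leaf_id == leaf
        ll_cart = bernoulli_ll(y[mask], cart_probs[mask])
        ll_hybrid = bernoulli_ll(y[mask], hybrid_probs[mask])
        use_hybrid[leaf] = ll_hybrid > ll_cart   # otherwise keep the CART score
    return use_hybrid
```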

Real World Results

In our experience, models have often been improved dramatically by the hybrid CART-Logit technique. In the following direct-mail and financial-market applications, the hybrid outperformed both CART and LOGIT alone.

Direct-mail applications:
- response model for a catalog
- response model for a credit card offer
- response model for an insurance product

Financial applications:
- mortgage model
- loan delinquency model
- fraud detection model

While these real-world examples are valuable case studies, they constitute a small sample, and the results cannot be shared due to confidentiality. For this reason, extensive experiments on artificial data sets were conducted; this analysis permitted a more accurate assessment of the possible benefits and flaws of the hybrid methodology.

Monte Carlo Test Results

For the Monte Carlo assessment of the hybrid model, samples of various sizes (2,000 to 100,000 records) were randomly drawn. Each experiment was run 100 times by resetting the random seed, and the resulting models were assessed on the basis of fit and also of performance (e.g., profit yielded if the model guides policy). Training and hold-out samples were used to assess possible over-fitting. The specific Monte Carlo experiments (i.e., the true data-generating processes) are summarized below:

- simple Logit: one variable
- CART tree: one variable, highly non-linear
- hybrid process
- Logit: several variables (possibly missing)
- hybrid: several variables (possibly missing)
- highly non-linear smooth function (not a Logit)
- complex Logit with informative missingness

The Monte Carlo results indicated that in smaller samples (n = 2,000) LOGIT performed very well, even when it was not the true model. In larger samples (n = 20,000) the hybrid model dominated on out-of-sample performance measures, and both CART and the hybrid model managed problems with missing values whereas the Logit model collapsed. Finally, in larger samples with high frequencies of missing values, the hybrid outperformed the other models regardless of which model was true.

References

Breiman, L., J. Friedman, R. Olshen, and C. Stone (1994), Classification and Regression Trees, Pacific Grove: Wadsworth.

Friedman, J. H. (1991), "Multivariate Adaptive Regression Splines" (with discussion), Annals of Statistics, 19, 1-141 (March).

Michie, D., D. J. Spiegelhalter, and C. C. Taylor, eds. (1994), Machine Learning, Neural and Statistical Classification, London: Ellis Horwood Ltd.

Steinberg, D. and P. Colla (1995), CART: Tree-Structured Non-Parametric Data Analysis, San Diego, CA: Salford Systems.

CART is a registered trademark of California Statistical Software and licensed exclusively to Salford Systems. All other trademarks mentioned are the property of their respective owners.