Data Mining Approaches to Modeling Insurance Risk Dan Steinberg, Mikhail Golovnya, Scott Cardell Salford Systems 2009
Overview of Topics Covered
Examples in the insurance industry:
- Predicting at the outset of a claim the likelihood of the claim becoming serious (CART example)
- Developing a model of total projected customer value for a health insurer
- Premium increase optimization, in combination with GLM
- Other examples
- TreeNet tutorial
Predicting at the outset of a claim the likelihood of the claim becoming serious
- Consulting project in workers' compensation insurance (with PricewaterhouseCoopers Australia); used Salford Systems CART.
- In workers' compensation insurance, serious claims make up a small share of all claims by number but the great majority of incurred cost.
- In our case study, 14% of reported claims were classified as serious, yet those claims accounted for around 90% of total claim cost.
- In most cases it is not obvious which claims will become serious, as many factors contribute to the outcome.
Predicting at the outset of a claim the likelihood of the claim becoming serious (continued)
Results:
- CART identified 19 of 83 candidate variables as the most important predictors, including categorical variables with hundreds of categories.
- All claims were classified as likely or not likely to become serious.
- 31% of claims were classified as serious; of the total, 10% turn out to be serious and 21% are false positives.
- 69% of claims were classified as non-serious; of the total, 4% subsequently prove to be serious.
Interesting findings:
- Expected result: injury details were important in predicting serious claims.
- Unexpected result: the language skills of the claimant were important in predicting serious claims.
Copyright Salford Systems 2009
Predicting at the outset of a claim the likelihood of the claim becoming serious (continued)
Gains chart: 40% lift; the first 20% of the population captures 60% of serious claims.
Developing a model of total projected customer value for a health insurer
- Consulting project in health insurance (with PricewaterhouseCoopers Australia); used Salford Systems CART and MARS.
- Lifetime customer value is the discounted present value of income less expenses associated with a customer.
Developing a model of total projected customer value for a health insurer (continued)
Results:
- The model predicted the high-cost claimants with a good degree of accuracy.
- The 15% of members predicted as highest cost by the data mining model yielded 56% of total actual cost.
- The 30% of members predicted as highest cost yielded 80% of total actual cost.
- Much of the model is accounted for by the dependence of cost on age, but the model improves the gains chart by around 10% of total cost in the high-cost part of the population over a simple age-only model.
- Details of the resulting model are commercially sensitive, but the results showed that many factors besides age contribute to predicted cost.
Developing a model of total projected customer value for a health insurer (continued)
How CART was used:
- Exploratory analysis: a preliminary tree model suggested segmenting the customer base into 4 broad groups according to age and previous claims experience.
- A separate CART model was created for each of the 4 segments discovered in the exploratory phase; sometimes the risk drivers were similar, but in many cases there were significant differences.
How MARS was used:
- The CART decision tree segment for each record was included as a categorical input variable, along with the variables CART selected as important predictors.
- MARS ranked the variables by importance; the predictor representing the CART decision tree output was ranked most important.
- Other, mostly continuous, variables were also ranked as important, showing that MARS was finding minor linear effects not picked up by the decision tree.
Premium Increase Optimization, TreeNet/GLM example
- Consulting project in home insurance (with Taylor Fry Consulting Actuaries); used Salford Systems TreeNet and GLM.
- Although this is a home insurance example, the approach is also applicable to health insurance.
Problem:
- Average renewal rates are around 87.5%.
- The current renewal price is risk cost + expenses + target profit, subject to a constraint of +/- $50 from the current price (details camouflaged at the customer's request).
Goal:
- Price insurance products according to what the customer is prepared to pay.
- What is the best that can be achieved in terms of insurance profit vs. renewal rate?
Premium Increase Optimization, TreeNet/GLM example (continued)
Approach taken:
- First modeling step: used TreeNet, which quickly provided a benchmark model, variable selection, and feature identification.
- Second step: built a GLM model.
- Third step: assessed the performance of the GLM model against the benchmark model.
Without data mining techniques, model building would take far longer, and the TreeNet model gives performance that is usually close to (sometimes better than) the final model adopted for demand modeling.
Premium Increase Optimization, TreeNet/GLM example (continued)
Results:
- Knowledge of the drivers of renewal delivers quick wins.
- Segment customers into high/low renewal groups; analyze drivers and levers.
- Manage low-expected-renewal segments throughout the renewal process.
- Identified customers less sensitive to price increases.
Other Examples
Contact Salford Systems: info@salford-systems.com
Actuarial case study presentations available:
- Combining Linear and Non-Linear Modeling Techniques: Case Study Example from the Insurance Industry
- Insurance Fraud Detection
- Use of Data Mining (CART) and Rule-based Methods to Infer Claims Payment Policy from the Analysis of Paid Claims
- Text Mining Challenges and TreeNet: Impact of Textual Information in Claims Cost Prediction
- Statistical Case Estimation Modelling: An Overview of the NSW WorkCover Model
- CART Plus Linear Models for Actuarial Applications
- Insurance Premium Increase Optimization: Case Study
And more.
Other Examples
Contact Salford Systems: info@salford-systems.com
Marketing case study presentations available:
- Marketing Strategies for Retail Customers Based on Predictive Behavior Models
- Modeling Effectiveness of Marketing Campaigns
- How to Exploit the Homogeneity of Data or the Homogeneity of Past Experience in High Volume Direct Marketing Programs
- Predicting Customer Behavior Trends Over Space and Time
- CART Analysis to Augment Predefined Market Zones and Boost Response Rates for Direct Mail Campaigns
- High-Level Marketing Strategy: Deciding Whether to Customize Product Offerings and Promotion and Whether to Reward Best Customers with Perks
- Application of CART in Determining Reserve Levels for Customer Loyalty Points
- Achieving Better Insights Into Customer Behavior Through Integrating Market Research with Decision Tree Behavior Modeling
Interaction Detection with TreeNet Methodology Example
The challenge of interaction detection
- Classical statistical modeling focuses primarily on the development of linear additive models.
- Models are of the form Y = A + B1*X1 + B2*X2 + ... + Bk*Xk.
- The predictors (attributes) Xi are either raw data columns or data after repairs such as missing value imputation and the capping or elimination of extreme values.
- Non-linearity is introduced into the models in limited ways, such as via the log transform of Y (for continuous Y > 0) and a collection of well-known transforms of the Xi.
- Credit risk scorecard technology introduces transforms derived from binning continuous predictors.
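The classical pipeline described above (a log transform of Y plus scorecard-style binning of the continuous predictors) can be sketched with scikit-learn. The data, bin counts, and coefficients below are illustrative assumptions, not from the slides.

```python
# Sketch of a classical additive model with scorecard-style binning.
# All data and parameters here are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 2))
# Positive continuous Y with multiplicative structure
y = np.exp(0.1 * X[:, 0] + 0.05 * X[:, 1]) * rng.lognormal(0.0, 0.1, 500)

# Log transform of Y linearizes the multiplicative effects
log_y = np.log(y)

# Scorecard-style binning: replace each continuous predictor with bin dummies
binner = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="quantile")
X_binned = binner.fit_transform(X)

model = LinearRegression().fit(X_binned, log_y)
print(model.score(X_binned, log_y))
```

The binned model approximates each predictor's effect as a step function, which is exactly the limited kind of non-linearity the slide describes.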
Classical Statistical Model Performance
- Classical statistical models represent a huge fraction of the real-world models deployed in enterprises worldwide.
- Such models are popular in part because statisticians are well trained to develop them.
- They are also popular because they tend to give good performance however performance is measured (e.g. R-squared, area under the ROC curve, top-decile lift).
- Such models are so restrictive that they must over-simplify, but they benefit from their stability (low variance).
Bias/Variance Trade-Off
- The universe of possible models is huge, whereas the subset of linear additive models is quite small.
- Forcing a model to conform to a specific mathematical form is likely to introduce distortions, so we are willing to say that classical models are almost certainly biased.
- As data become more plentiful, such models almost certainly converge to an incorrect representation of the data generating process.
- In contrast, modern learning machines such as TreeNet and CART converge to a correct, unbiased representation.
- Classical models have the advantage of lower variance, which might translate into an overall better model in smaller samples.
Classical Model Shortcomings: Sources of Bias
The classical statistical model is expected to fall short in two areas:
- Incorrect representation of nonlinearities in individual predictors
- Absence of relevant interactions
Interactions are the key to identifying important data segments that behave differently from the dominant patterns. Specifying a model correctly is the most challenging task facing the modeler, and until now there has been no reliable way for a modeler to determine whether they have found a correct specification.
Classical Interaction Detection
- A popular method of statistical interaction detection begins by building the best possible additive model. Then interaction terms, which look like products of main effects (e.g. Xi*Xj), are added to the model and tested for significance.
- One problem with this method is that the number of possible interactions grows at least as fast as the square of the number of predictors.
- All the usual challenges of model construction apply: the term Xi*Xj may not become visible unless one also includes Xm*Xn.
- There may well be higher-order interactions such as Xi*Xj*Xk.
- Interactions may exist, and may only be detectable, in subregions of the predictors, e.g. (Xi | Xi > c) * (Xj | Xj < d).
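The classical procedure above can be illustrated with a small synthetic example (data and coefficients are assumptions for illustration): fit the additive model, then add the product term and see how much it matters. In practice one would use a formal significance test (e.g. an F-test on the added term); here we just compare fit.

```python
# Classical interaction detection sketch: compare an additive fit with a
# fit that adds the product term Xi*Xj. Illustrative data only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 1000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
# True model contains a genuine 2-way interaction
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 1.5 * x1 * x2 + rng.normal(size=n)

X_add = np.column_stack([x1, x2])          # main effects only
X_int = np.column_stack([x1, x2, x1 * x2])  # main effects + product term

r2_add = LinearRegression().fit(X_add, y).score(X_add, y)
r2_int = LinearRegression().fit(X_int, y).score(X_int, y)
print(r2_add, r2_int)
```

The jump in R-squared when the single correct product term is added shows why the method works when you know where to look, and why it breaks down when the candidate set grows quadratically.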
What is Needed
- The classical approach of building the best possible additive model and then testing for interactions is largely impractical and often infeasible.
- What is needed is a methodology that can automatically discover the right transformations of each predictor and introduce the right interactions.
- This methodology must be far more flexible than classical modeling and yet resistant to overfitting.
- It must be able to tell us whether or not interactions are present and to identify precisely which interactions are present.
TreeNet and Model Development
- TreeNet is Jerome Friedman's stochastic gradient boosting, first described in 1999.
- Starting from the base of MART (Multiple Additive Regression Trees), TreeNet has evolved over the past decade into a powerful modeling machine.
- TreeNet can fit a variety of models, including regression and logistic regression.
- The models are non-parametric and based on hundreds if not thousands of small regression trees; each tree is designed to learn very little from the data, so many are needed to complete the learning process.
- Models take the form of error-correcting updates.
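TreeNet itself is proprietary, but the underlying algorithm (Friedman's stochastic gradient boosting) has an open-source implementation in scikit-learn, which can serve as a sketch of the ideas on this slide. The data and parameter values below are illustrative assumptions.

```python
# Not TreeNet itself: a scikit-learn sketch of Friedman's stochastic
# gradient boosting, the algorithm TreeNet is built on.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(2000, 4))
y = np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=500,    # many small trees, each an error-correcting update
    learning_rate=0.05,  # each tree contributes only a small update
    subsample=0.5,       # grow each tree on a random half of the training data
    max_leaf_nodes=6,    # small trees (Friedman's default size, see below)
    random_state=0,      # loss can also be 'absolute_error' or 'huber'
).fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```

The three highlighted parameters correspond directly to the slide's points: random half-samples, small updates, and many small trees.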
Some Interesting TreeNet Features
- Trees are grown on only a random subset of the training data, typically a random half; we never train on all the training data at one time.
- Model updates are small: we do not allow the model to change more than a little in any training cycle.
- The equivalent of outlying data (points that are badly mispredicted) are eventually ignored in training cycles, so anomalies cannot have much influence on the model.
A TreeNet Model
- For a binary dependent variable, it can be shown that the TreeNet model is a non-parametric logistic regression.
- For a continuous dependent variable, the TreeNet model is a non-parametric regression fit to optimize one of the following objective functions:
  - Least squares (LS) residuals
  - Least absolute deviations (LAD)
  - Huber-M, a hybrid of LS and LAD (LS for small residuals, LAD for large)
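For reference, the Huber-M hybrid mentioned above is the standard Huber loss: quadratic (LS-like) for small residuals r and linear (LAD-like) for large ones, with transition point delta:

```latex
L_{\delta}(r) =
\begin{cases}
  \tfrac{1}{2}\,r^{2}, & |r| \le \delta,\\[4pt]
  \delta\left(|r| - \tfrac{\delta}{2}\right), & |r| > \delta.
\end{cases}
```

The linear tail is what limits the influence of badly mispredicted points, consistent with the outlier-resistance described on the previous slide.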
Some Key TreeNet Features
- Automatic variable selection: TreeNet is a superb ranking machine.
- TreeNet tends to include more predictors than other learning machines, but it gives an effective ranking of predictors, allowing the modeler to select small subsets of reliable predictors.
- TreeNet contains built-in stepwise variable selection (backward, forward, or backward/forward) to automatically search for a best model:
  - Start with all variables in the model.
  - Using the variable importance ranking, remove the R least important predictors (typically R = 1).
  - Repeat until all predictors have been removed.
  - Identify the best model based on the preferred performance metric.
Forward Stepping
- Starting with the best model, test every predictor in a specified subset to see which is best to add.
- Repeat forward stepping for F steps.
- Identify the best new model, which is a final model candidate.
Testing for Interactions
- When a TreeNet model is built with only 2-node trees, the model is essentially limited to an additive model: with a single split, a tree cannot capture any interactions.
- A 2-node-tree TreeNet is thus an additive model. Observe that an additive model can still be highly nonlinear.
- The model is of the form Y = A + F1(X1) + F2(X2) + ... + Fk(Xk), where the Fi are the nonlinear functions discovered by TreeNet.
- Note that there may be many Fi associated with a specific predictor Xj; cumulate these into Gj(Xj) = Σq Fq(Xj).
- If the TreeNet contains only 2-node trees, we can collect all trees associated with a given Xi to arrive at the final Y = A + G1(X1) + G2(X2) + ... + Gk(Xk).
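The decomposition above can be demonstrated with scikit-learn's gradient boosting as a stand-in for TreeNet (illustrative data): when every tree is a stump, each tree uses exactly one predictor, so grouping the trees by split variable recovers Y = A + Σ Gi(Xi) exactly.

```python
# Sketch: a stump-only boosted model decomposes into A + sum_i G_i(X_i).
# Data and settings are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(1000, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.normal(size=1000)

model = GradientBoostingRegressor(
    n_estimators=300, learning_rate=0.1, max_depth=1, random_state=0
).fit(X, y)

# Group the trees by the single feature each stump splits on: G_i(X_i)
G = np.zeros_like(X)
for stage in model.estimators_:     # one stump per boosting stage
    tree = stage[0]
    feat = tree.tree_.feature[0]    # root split variable of this stump
    if feat >= 0:                   # skip degenerate trees with no split
        G[:, feat] += model.learning_rate * tree.predict(X)

# What is left over after removing all G_i is the constant A
baseline = model.predict(X) - G.sum(axis=1)
print(bool(np.allclose(baseline, baseline[0])))
```

The residual `baseline` is the same for every observation, confirming that the stump model is purely additive in the predictors.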
Building 2-Node TreeNets
- Keep in mind that when only one split is allowed in a tree, the amount of learning that can take place is severely limited.
- 2-node-tree TreeNets may therefore require a very large number of trees to extract all the information in the data; some examples we will show use 20,000 trees.
- Unless such TreeNets are fully expanded, it will not be possible to measure the true predictive power of the additive model.
General TreeNet Models
- TreeNet models are permitted to contain trees of any size: we can grow trees with 2, 3, 4, 5, 6, 12, 50, etc. nodes.
- We generally favor small trees because we want to limit the amount of learning in any training cycle, and therefore tend to keep trees to sizes like 6, 9, or 12.
- However, experimentation is always recommended, and we have encountered data where larger trees perform better.
The Default 6-Node Tree
- Friedman recommended a default setting of 6 nodes, and our experiments confirm that this tree size performs well across a broad range of data sets.
- A 6-node tree clearly permits interactions: a tree with 6 terminal nodes contains 5 internal splits, and if it is as close to balanced as possible it can contain up to 3 different variables along its longest branch.
- If the tree is maximally unbalanced, it can contain up to 5 different variables along the longest branch.
- There is no guarantee that different variables will be used as we progress down a tree; the same variable might be used several times.
- In general we observe that the 6-node tree should be adequate to uncover 3-way interactions.
A Global Interaction Test
Compare:
- an unrestricted TreeNet model allowing moderate-sized trees, with
- a restricted TreeNet model confined to just 2-node trees.
We must take into account the total learning in each model (number of trees * number of nodes per tree); otherwise the 2-node-tree model will be at an automatic disadvantage. Simply allow each model to reach convergence.
A Global Interaction Test (continued)
- Consider a TreeNet model built with 2-node trees and compare it with a 6-node-tree TreeNet model.
- Recall that both TreeNets must have been allowed to grow out fully to locate the optimal number of trees for prediction.
- By comparing the restricted 2-node-tree model with the 6-node-tree model, we can conduct a definitive test for the presence of interactions of moderate degree: if the unrestricted larger-tree model sufficiently outperforms the restricted 2-node model, we have compelling evidence that interactions are present in the data generation process.
- At present we do not have a definitive statistic for testing this hypothesis, but classical statistical tests can be developed by comparing the predictions of the constrained and unconstrained models on holdout data.
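The global test can be sketched with scikit-learn gradient boosting as a TreeNet stand-in, on synthetic data that contains a genuine 2-way interaction (all names and settings below are illustrative assumptions): grow a stump-only model and a 6-leaf model to (approximate) convergence and compare holdout error.

```python
# Global interaction test sketch: additive (2-node) vs unrestricted
# (6-node) boosted models compared on holdout MSE. Illustrative data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(3000, 4))
y = X[:, 0] * X[:, 1] + X[:, 2] + 0.05 * rng.normal(size=3000)  # true 2-way interaction

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def holdout_mse(max_leaf_nodes, n_estimators):
    m = GradientBoostingRegressor(
        max_leaf_nodes=max_leaf_nodes, n_estimators=n_estimators,
        learning_rate=0.1, random_state=0,
    ).fit(X_tr, y_tr)
    return mean_squared_error(y_te, m.predict(X_te))

# The stump model gets many more trees so total learning is comparable
mse_additive = holdout_mse(max_leaf_nodes=2, n_estimators=2000)
mse_general = holdout_mse(max_leaf_nodes=6, n_estimators=500)
print(mse_additive, mse_general)
```

Because the target truly contains X0*X1, the unrestricted model should beat the additive model by a wide margin, which is exactly the evidence the global test looks for.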
Which Interactions Are Present?
- Our global test is sufficient to establish the existence of interactions, but not to identify which specific interactions are present.
- All we can conclude from a positive result is that interactions exist, and from a negative result that they do not.
- However, this is a major step forward, as the test is fully automatic.
- We have found evidence that many mainstream consumer risk models are adequately modeled with additive TreeNets.
- The next step is therefore to try to identify specific interactions.
Interaction Detection in TreeNet
Interaction detection in TreeNet has progressed along two fronts: Friedman suggested one strategy in his paper, while at Salford Systems, Cardell, Golovnya, and Steinberg suggested a slightly different method.
Interaction Measurement
- From the TreeNet model, extract the function (or smooth) Y(Xi, Xj | Z) (1), which is based on averaging the Y associated with all observed (Xi, Xj) pairs over all observed Z.
- Now repeat the process for the single dependencies Y(Xi | Xj, Z) and Y(Xj | Xi, Z) (2).
- Compare the predictions derived from (1) and (2) across an appropriate region of the (Xi, Xj) space.
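The comparison of (1) and (2) is closely related to partial-dependence-based interaction measures: if Xi and Xj do not interact, the joint smooth should equal the sum of the single-variable smooths (up to a constant). A brute-force sketch, with illustrative data and a simple max-discrepancy score of my own choosing rather than Salford's exact statistic:

```python
# Partial-dependence sketch of the interaction measurement above.
# The data, grid, and discrepancy score are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(1500, 3))
y = X[:, 0] * X[:, 1] + X[:, 2] + 0.05 * rng.normal(size=1500)

model = GradientBoostingRegressor(max_leaf_nodes=6, n_estimators=300,
                                  random_state=0).fit(X, y)
base = model.predict(X).mean()

def pdep(features, values):
    # Average prediction with the listed features clamped to fixed values,
    # i.e. averaging over all observed values of the other predictors Z
    Xc = X.copy()
    Xc[:, features] = values
    return model.predict(Xc).mean()

def interaction_strength(i, j):
    grid = np.linspace(-0.8, 0.8, 5)
    joint = np.array([[pdep([i, j], [a, b]) for b in grid] for a in grid])
    fi = np.array([pdep([i], a) for a in grid])
    fj = np.array([pdep([j], b) for b in grid])
    # How far the joint smooth is from the sum of the single smooths
    return float(np.abs(joint - (fi[:, None] + fj[None, :] - base)).max())

s01 = interaction_strength(0, 1)  # true interaction present
s02 = interaction_strength(0, 2)  # no true interaction
print(s01, s02)
```

The genuinely interacting pair (X0, X1) scores far higher than the non-interacting pair (X0, X2), which is the pattern the detailed interaction reports look for.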
A Simple Example: Financial Market Behavior
Dependent variable: continuous with a long right-hand tail (chart shows the top 15% of values vs. the bottom 85%).
Summary Stats: Target and 9 Predictors
- The TARGET variable BPS is truly continuous; the other variables are ideal for a tree model.
- We randomly set aside 20% of the data for test.
Baseline CART Model: Test-Data MSE (Random 20%)
- 90-node tree (0 SE rule): 265.48
- 22-node tree (1 SE rule): 334.64
- R-squared better than 0.97 on test data
- Observe that a CART model allows for high-order interactions.
Naïve Regression
R-squared is only 0.74 on train data; R-squared on test data is 0.68.
MARS Model Results
- Main effects (nonlinear): 793.87
- 2-way interactions: 550.51
- 3-way interactions: 358.05
- 4-way interactions: 353.66
MARS main effects and interactions were determined by the 4-way model.
TreeNet Results: Test MSE
- TreeNet, 6-node unconstrained (3,303 trees): 105.51
- TreeNet, 2-node trees (20,000 trees): 513.83
- Although the starting sum of squares in the data is large, the difference between the two models is clearly substantial.
- The 2-node TreeNet is similar in performance to the MARS model with 2-way interactions, even though it prevents interactions.
- The 2-node TreeNet underperforms a single CART tree.
TreeNet Interactions Ranking Report
- Based on the 6-node-tree TreeNet, we calculate the degree of interaction observed for the most important variables.
- This report reveals, for example, that AVG_VOL is involved in important interactions.
Detailed Interaction Reports
- Measured interactions are based on comparing the 2D and 3D relationships between the target and the predictors.
- This allows us to hypothesize which interactions are likely to matter.
Nine predictors permit 36 possible 2-way interactions.
The top-rated 2-way interactions (6 in all) involve combinations of AVG_VOL, AVG_TREND, SPREAD_Q, AVG_SIZE, and MKT_CAP, including:
- AVG_TREND * AVG_SIZE
- AVG_SIZE * MKT_CAP
Testing Interactions
- Salford Systems has developed an Interaction Control Language (ICL) for TreeNet models.
- The language allows the modeler to specify precisely the types of interactions that will be permitted in the TreeNet. For example:
  ICL ALLOW X1 X2 X3 X4 / 2
  specifies that only 2-way interactions are allowed among the listed predictors (X1-X4).
- The ICL language was developed in-house for private clients in 2006 and has been the basis of all of our interaction detection work.
The ICL Language
The ICL allows a broad range of controls, such as:
  ICL ADDITIVE x1 x2 x3
  ICL ALLOW x5 x6 x7 x8 x9 / 3
  ICL DISALLOW x9 x11 x5 x7 x21 x25 / 4
- The ADDITIVE keyword prevents a predictor from interacting with any other variable in the model. Practically, this means that if such a predictor is selected to split the root node of a tree, it must be the only predictor anywhere in that tree.
- TreeNets restricted to ADDITIVE predictors can still contain trees with as many terminal nodes as the modeler prefers, but each such tree will be grown using a single predictor.
ADDITIVE vs 2-Node Trees
- 2-node trees are often thought to guarantee ADDITIVE models, but this is not strictly true.
- If the training data are complete (no missing values present), then a 2-node-tree TreeNet does yield a model additive in the predictors.
- However, if missing values are present, TreeNet requires the use of missing value indicator splits of the form:
  If Xi == MISSING then go LEFT;
  Else if CONDITION then go RIGHT;
- The problem is that TreeNet permits the condition in the ELSE clause to involve a variable other than Xi.
2-Node Trees in TreeNet
Although the user may request 2-node trees in a TreeNet model, if missing values are present the smallest possible tree that can be grown contains at least 3 nodes. The following slides show examples of such splits.
Two-Node Tree with Missings: One Variable in Tree
  Is Xi missing?
    Yes -> Terminal node
    No  -> Is Xi <= c?
             Yes -> Terminal node
             No  -> Terminal node
Both internal nodes are split using the same variable, Xi.
Two-Node Tree with Missings: Two Variables in Tree
  Is Xi missing?
    Yes -> Terminal node
    No  -> Is Xk <= c?
             Yes -> Terminal node
             No  -> Terminal node
Xi (root node split) and Xk (internal split) are the two variables in the tree.
2-Node Tree Details
- A 2-node tree in TreeNet can never contain more than one split on a standard predictor; therefore, if the predictor is never missing, the tree using this variable will have only two nodes.
- However, the TreeNet mechanism does not count a split on a missing value indicator as a genuine split.
- As a result, a 2-node tree may contain any number of missing value indicator splits. The following tree is technically a 2-node tree by TreeNet standards.
Allowable 2-Node Tree
  Missing? (indicator split)
    -> Terminal node
  Missing? (indicator split)
    -> Terminal node
  Genuine split
    -> Terminal node
    -> Terminal node
A chain of missing value indicator splits followed by a single genuine split still counts as a 2-node tree.
2-Node Trees and Interactions
- The important point in this discussion is that in standard TreeNet we cannot guarantee the complete absence of interactions with 2-node trees.
- A 2-node tree allows interactions of any degree between missing value indicators, and also a general interaction between a single predictor and missing value indicators.
- To enforce literal non-interactivity we must rely on the ICL mechanism: if we specify that Xi is to enter the TreeNet as ADDITIVE, then even if we split the root using the missing value indicator (MVI) for Xi, any subsequent split can use only Xi.
Test MSE Performance: Restricted Models
- 2-node trees (20,000 trees): 513.83
- As expected, the interaction report shows 0.00 for all interaction scores.
- The data contain no missing values, so literal 2-node trees are generated for this model.
- We require 20,000 trees because a 2-node tree can learn only a little in any training cycle.
Test MSE Performance: Various Interactions Allowed
- 2-node trees (20,000 trees): 513.83
- Allow 2-way interactions: 113.91
- Allow 3-way interactions: 105.01
- Allow 4-way interactions: 105.29
- Allow 5-way interactions: 105.51
- Allow 6-way interactions: 105.51
- Allow 7-way interactions: 105.51
- Unconstrained TreeNet: 105.51
It is plain that there is a huge difference between an additive and an interaction model.
Refining the Model
Our choices include accepting a 2-way interaction model or searching for an explicit set of interactions to work with:
- AVG_DLY_VOL, AVG_TRND, AVG_SIZE / 2: 139.55
- MKT_CAP, AVG_DLY_VOL, AVG_TRND, AVG_SIZE, SPREAD_Q / 2: 125.20
Here we see that just 3 two-way interactions capture more than 90% of the difference between an additive model and a fully unconstrained model.
TreeNet with ICL
TreeNet in the PRO EX version contains the ICL language and is available on request from Salford Systems.
Contact: David Tolliver, dst@salford-systems.com, 619 543 8880
General information: http://www.salford-systems.com
Salford Systems
- Developer of CART, MARS, TreeNet, and RandomForests
- Advanced statistical software since 1983
- PROC MLOGIT and PROC MPROBIT add-ins for SAS mainframes
Technical advisers and close collaborators:
- Jerome H. Friedman, Stanford University (CART, MARS, TreeNet)
- Leo Breiman*, UC Berkeley (RandomForests)
- Richard Olshen, Stanford University (Survival CART, Bioinformatics)
- Charles Stone, UC Berkeley (CART, MARS large sample theory)
- Rob Tibshirani, Stanford University (modern statistical methods)
* Leo Breiman passed away in July 2005