Applied Data Mining Analysis: A Guide Via Examples Dan Steinberg, Mikhail Golovnya, N Scott Cardell July 2013 Salford Systems http://www.salford-systems.com
Modern Analytics Interest and research in what we now think of as data mining and machine learning goes back to at least the 1960s The Perceptron (pre-neural network) introduced in 1957 IEEE Transactions on Pattern Analysis and Machine Intelligence began January 1979 (vol. 1 no. 1) The ACM KDD (Knowledge Discovery in Databases) series of conferences began informally in 1989, formally in 1995 The field is now in a stage of extraordinary growth and brings together concepts and techniques from statistics and computer science Recent extension of topics to Big Data fueled by Google's MapReduce, Yahoo!'s development of Hadoop, and Amazon EC2: easy, massively parallel data processing
KDD Conference 1995
IEEE Pattern Analysis 1979 Vol 1, No. 1 Copyright Salford Systems 2013
Data Mining Key concepts differentiating data mining from traditional statistics Very few assumptions about the data or about the models to be built Emphasis on learning as much as possible from the data Emphasis on a fair degree of automation and search Allowing for a far larger space of possible models Typically computer intensive methods (a simple MARS spline regression might fit the equivalent of 70,000 models) Some early definitions of data mining emphasized the volume of data being analyzed (data mining=lots of data) but can use these techniques with very few data records Essentially data mining (or machine learning) is defined by the tools we use to analyze the data
Challenges for Data Miners Conceptually same as for statistician Understand problem Acquire appropriate data Define unit of observation and what is being predicted or explained Select appropriate methodology (classification, regression, survival, clustering) and tools Construct useful predictors if not present in data (feature extraction) Some differences in next steps as Data Mining uses much more flexible and adaptive learning algorithms Predictor selection Choice of learning algorithm Avoid overfitting (model too flexible, memorizes train data) Avoid underfitting (model not flexible enough, too few features)
Bias/Variance TradeOff Underfitting Vs Overfitting Rigid models (eg linear regression) when inappropriate have high bias and relatively low variance New training samples tend to yield similar results Overly flexible models can reach the extreme of memorizing the learn data Low bias but high variance Data miner needs to be alert to signals of over- or underfitting and strike the right balance
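The tradeoff can be seen in a tiny sketch (an illustration in Python, not part of the SPM workflow): a rigid degree-1 polynomial versus an overly flexible degree-9 polynomial fit to the same noisy sample. The flexible fit always wins on training error, which is exactly why training error alone cannot signal overfitting.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)  # noisy sample

# Rigid model (high bias, low variance) vs overly flexible model (low bias, high variance)
rigid = np.polyval(np.polyfit(x, y, 1), x)
flexible = np.polyval(np.polyfit(x, y, 9), x)

mse_rigid = float(np.mean((y - rigid) ** 2))
mse_flexible = float(np.mean((y - flexible) ** 2))
# The flexible fit's training MSE is necessarily no worse (nested least squares),
# yet on a fresh sample drawn the same way it would typically do worse: the tradeoff.
```

On a new training sample the rigid line changes little (low variance) while the degree-9 curve swings with the noise; the balance point between the two is what the data miner must find.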
Decision Trees A major advance in analytical technology in which many important concepts of modern analytics were first clearly spelled out Learning machine is actually quite simple conceptually (but the details in making a successful implementation are challenging) Abandons the worlds of hypothesis testing and estimation of parameters (as conventionally understood) Several early versions which did not function well (ID3, AID) were followed by CART which perfected the methodology Paper by Jerome H. Friedman in 1975
Tools Used In This Introduction Decision Tree (single CART tree) MARS Adaptive Regression Splines Gradient Boosting (TreeNet Boosted Trees) RandomForests (Ensembles of CART trees) Regularized Regression (GPS Generalized PathSeeker) These tools can get you very far and cover classification, regression, and unsupervised learning (clustering) Neural Networks have cycled in and out of favor (can require considerable experience to learn to use well)
CAR_CLAIM DataSet Insurance related data focused on FRAUD 15,420 records 923 records labeled FRAUD Should keep in mind that not all FRAUD is caught so there may be a few FRAUD cases lurking among the so-called good records Have a classification problem with a minority of the data in the class of interest Data published on the CD ROM included with Dorian Pyle's (1999) Data Preparation for Data Mining http://hfs1.duytan.edu.vn/upload/ebooks/3836.pdf Data is on separate CD ROM Some used copies appear to be available on Amazon Book is focused principally on the specifics of data preparation for Neural Networks which have their own unique requirements
Salford Predictive Modeler Predictive Modeling package used for the examples Core Statistics Linear Regression Logistic Regression CART Decision Tree (original, by Jerome Friedman) MARS Spline Regression (original, by Jerome Friedman) GPS regularized regression (extended elastic net, Jerome Friedman) TreeNet gradient boosting machine (original, by Jerome Friedman) RandomForests (original, Breiman and Cutler) Automation and model acceleration
Open Raw Data: CarClaim.CSV Basic peek at data set to obtain main table dimensions (rows and columns)
Too many Character Variables We see that variables intended to capture numeric measures have been coded and imported as text
Request Descriptive Statistics Icon on toolbar for statistics. Here we request just numeric variables
Basic Stats We always check for prevalence of missing data Always review the number of distinct values (too few? too many?)
Detailed Stats and Tables
Data Prep/Data Repair No different from what any statistician would do in the earliest stages of data cleaning Remove inconsistent coding that varies across records Enforce consistent spelling of character values Check for missing value codes Some coding uses NULL for a valid value (e.g. NO or 0) In our case many of the character variables are intended to encode numeric information AGE of VEHICLE coded as text: new, 2 years, 3 years, 4 years, 5 years
Straightforward Recoding Here we use the built-in BASIC language and command builder/assistant
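The deck performs the recode in SPM's built-in BASIC; a hypothetical Python equivalent of the same idea is sketched below. The mapping covers only the codes shown on the previous slide; anything else is treated as missing, and the function name is illustrative.

```python
# Hypothetical Python version of the vehicle-age recode: turn the text-coded
# AGE of VEHICLE into a numeric variable. Only the codes listed in the slides
# are mapped; unrecognized codes are treated as missing (None).
VEHICLE_AGE_MAP = {
    "new": 0,
    "2 years": 2,
    "3 years": 3,
    "4 years": 4,
    "5 years": 5,
}

def recode_vehicle_age(text):
    """Return the numeric vehicle age, or None for an unrecognized code."""
    return VEHICLE_AGE_MAP.get(text.strip().lower())
```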
Prepped Data Set has 17 Numeric Variables Previously we had only 8 numeric variables Also requested generation of a SAMPLE$ variable (a random flag partitioning off the TEST data)
CORR for NUMERIC Vars
MDS Scaling of CORR Matrix Positions Variables Quick check for anything bizarre NDAYS_POL_ACCIDENT is at upper right far from other variables AGE and CAR_AGE are on the left
Build CART Model Select TARGET (Dependent variable) Avoid clearly inappropriate predictors (RECORDID), clones of TARGET
Test (or Validation) Method Normally reserve some data for testing (typically called validation data)
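A minimal sketch of such a reservation, assuming a SAMPLE$-style flag and a 20% test share (the function name and seed are illustrative, not SPM syntax):

```python
import random

def make_sample_flag(n_records, test_fraction=0.2, seed=13):
    """Randomly assign each record to LEARN or TEST (a SAMPLE$-style flag)."""
    rng = random.Random(seed)
    n_test = round(n_records * test_fraction)
    test_rows = set(rng.sample(range(n_records), n_test))
    return ["TEST" if i in test_rows else "LEARN" for i in range(n_records)]

# Flag the 15,420 CAR_CLAIM records
flags = make_sample_flag(15420)
```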
CART Model Learn and Test results respectable and also very close to each other Smallest node has 45 records, reasonable size (we can control this if we want) FOCUS class is YES for FRAUD (Blue= Not Fraud, Red= Fraud)
Quick Overview of Main Tree Logic: Variables Driving Model
Root Node Split Very Effective Very low FRAUD rate among insureds with just Liability coverage Need to grow a Probability Tree to make progress on left
Detailed Inspection of an Interesting Split Test data confirms relatively higher FRAUD risk for Chevy, Toyota, Acura But still low risk
Train vs Test Lift By Node: 9 Nodes versus 4 Nodes Simpler tree generalizes better at the node level If we were to deploy the larger tree we might wish to remap node 4 to its parent (a form of shrinking)
Tweaking the Tree BATTERY: Experimental Parameter Variation 35 pre-packaged experiments we might consider running to tweak model
BATTERY ROOT: Limited Look Ahead Dictate which variable splits the ROOT node YEAR in the root appears to give better performance but not ideal for prediction
BATTERY ONEOFF Build Trees on One Predictor Only Best predictors in isolation do make sense Allows for nonlinear relationship for continuous variables
BATTERY BOOTSTRAP: Bootstrap Resample Training Data/Test Data Fixed Assess performance of CART Tree via Bootstrap Resampling (100 times)
Variable Importance Averaging Over Trees Unambiguous ranking of importance of predictors
BATTERY PARTITION: Repartition Data Into Learn/Test (Sample Sizes Fixed) Performance evaluation over 100 splits of the data into learn and test (all same size)
Logistic Regression: All Variables in Model 83 coefficients estimated due to dummy variable expansion of categoricals. Test ROC is .775
Regularized Logistic Regression via GPS: 100 replications on different test partitions Median TEST ROC=.785, 5th pctile=.774, 95th pctile=.794, median coefs=19 Optimal models were LASSO or near-LASSO Slightly better performance than conventional logistic but a much smaller model
Generalized PathSeeker Generalized Elastic Net Even a 3-coefficient model can reach a test partition ROC of .764
LASSO Variable Importance Ranking
YES/NO Odds Graph Nice monotonic pattern
RandomForests: BATTERY NPREDS Here BAGGER works just as well or better than true RF. ROC=.804
RandomForests Performance Curve ROC test partition=.804
YES/NO Odds Graph
RandomForests Variable Importance Measure Forest is a collection of trees (for post-processing reasons we recommend at least 500 trees) Normal scoring: drop a record down each tree and compute number of votes for each possible target outcome Instead of normal scoring do the following for each tree and each variable in the model: Randomly scramble the values of a specified variable in place Summary statistics for that variable are unchanged but the values of that variable are now located on the wrong row Scrambling is repeated anew for each tree Compute number of votes for each target outcome for each record Compute deterioration of overall sample performance Most important variable would be hurt most by this scrambling
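The scrambling idea can be sketched on toy data (a fixed decision rule plays the role of one tree's vote; the data and names are illustrative, not the Breiman-Cutler implementation). Scrambling the variable the model relies on hurts accuracy; scrambling an ignored variable costs nothing.

```python
import random

# Toy data: the class depends only on X1; X2 is pure noise.
rng = random.Random(7)
x1 = list(range(100))
x2 = [rng.random() for _ in range(100)]
y = [1 if v >= 50 else 0 for v in x1]

def vote(a, b):
    """Stand-in for dropping a record down a tree: this rule only looks at X1."""
    return 1 if a >= 50 else 0

def accuracy(col1, col2):
    return sum(vote(a, b) == t for a, b, t in zip(col1, col2, y)) / len(y)

base = accuracy(x1, x2)                      # perfect on this toy data

shuffled_x1 = x1[:]
rng.shuffle(shuffled_x1)                     # values now sit on the wrong rows,
shuffled_x2 = x2[:]                          # but the marginals are unchanged
rng.shuffle(shuffled_x2)

drop_x1 = base - accuracy(shuffled_x1, x2)   # large deterioration: X1 matters
drop_x2 = base - accuracy(x1, shuffled_x2)   # zero: the rule ignores X2
```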
RandomForests Variable Importance Ranking Based on variable scrambling to measure loss of accuracy Scrambling an important variable should hurt accuracy more
Summary Data Preparation was essential to make data fully usable Series of models developed rapidly using automated search tools Judgment assisted perfection of model Variety of well performing models
KDDCup 1998 http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html Page contains data, documentation, description of the challenge (all freely downloadable) KDD conferences began in 1995 and hosted first data mining competition in 1997 KDDCup 1998 used exactly the same data as for 1997 but provided fuller documentation Top performers in 1998 had already analyzed data in 1997 Results of the 1998 competition posted at http://www.kdnuggets.com/meetings/kdd98/kdd-cup-98-results.html
KDDCup 1998 Data Set and Challenge Raw data contains 481 variables
            Learn sample                      Validation sample
N           95,412                            96,367
TARGET_B=0  90,569 (94.92%)                   91,494 (94.94%)
TARGET_B=1  4,843 (5.08%)                     4,873 (5.06%)
TARGET_D    Mean 15.62, IQR 10.00, SD 12.45   Mean 15.61, IQR 10.00, SD 15.51
Objective: Create the net-revenue-maximizing mailing list if each mail piece costs $0.68 to send Optimal decision rule: mail if expected value of donation > 0.68, defined as Prob(Respond)*E(gift|Respond)
Modeling Strategies Require two models RESPONSE to mailing BINARY YES/NO GIFTAMOUNT if responded REGRESSION conditional on response Naïve models ignore sample selection process Model each part separately and combine for final scoring Two stage model: Weight records in regression by inverse probability of sample inclusion Upweight records representative of those excluded Model each historical campaign Every list member responded with a gift at least once Every mailing represents a new opportunity to respond Have gift amount at least once for everybody Start with naïve two separate models for simplicity
KDDCup98 Possible Outcomes Perfect targeting would mail only to the 4,873 respondents: Total gifts $76,090, Mailing Cost $3,314, Net Revenue $72,776 Mail everyone: net revenue $10,560 Winner in 1998: $14,712, mailing 56,330 people (58.5% of list) Using routine methods in SPM (train just on learn sample): net revenue of $15,596, mailing 58,525 people Should beat the winner using SPM out of the box (train on learn)
Check Randomness of Data Partition Pool data into one file containing all data Create a LABEL variable, 1 for the learn sample and 0 otherwise Check if LABEL is predictable using any kind of model We just use all variables available Could of course run t-tests or single variable models looking for any problematic difference We found nothing of concern here But we have seen major problems in other partitioned data sets Something that was supposed to have been randomly determined was clearly not
Nature of Data-I Elementary Demographics AGE GENDER Number of children by age group (0-3, 4-7, 8-12, 13-18) STATE of Residence, ZIPCODE Homeowner/renter Household Income Salutation (Dr., Mr., Mrs., Admiral, etc) Census Tract Level Data 286 socioeconomic and demographic indicators covering ethnicity, occupation, industry of employment, type of housing
Nature of Data-II Behavioral Data on prior gifts and response patterns RFA style of data, details and coded into groups Recency of response -how recently from a given date Frequency of response -how frequently in previous 12-24 months Amount of gift when responded -dollar value of gift Campaign characteristics Offer type Calendars Stickers Christmas Cards Other types of cards, such as birthday, condolence, blank Notepad Thank You printed outside Date of mailing
Activity Window: Useful Launchpad to Next Actions RAW DATA: Select next action such as histograms, View Data, Summary Stats
Stats: Brief View Essential to review data for unexpected coding, quirks, special handling needed
Sort By %Missing Descending Might want to drop variables that are genuinely missing at some ultra-high rate But for variables with only 1 good level the missing usually means something -- Often presence/absence or for continuous measures might really signify a 0 Also useful to sort ascending to note vars which have no or very few missings
Diagnostic CART Run DEPTH=1 LIMIT command: ATOM=50 MINCHILD=30 DEPTH=1 Missing Values Controls: dummies for continuous, extra level for categorical Want to examine power in the ROOT node (nonlinear) Should run very fast Using the suggested controls yields very useful diagnostics: informative missingness, nonlinear relationships
Partition Test=.2 Some exploratory work needs to cover all the learn data Here we allocate a 20% random subset for testing
Ranking ROOT Node Splitters Top splitters are mostly missing. Need to consider why and maybe drop variables RDATE is date of response and missing if donor did not respond to that mailing
Model Setup PENALTY Tab Second stage model Penalize for missingness; penalize for high cardinality categorical
Key Points to Remember Working with RAW DATA Specify automatic creation of missing value dummy vars Dummying repairs all binary variables coded BLANK=0; for all other variables it tests for informative missingness Use a modeling tool that can handle missings CART, MARS, TreeNet have built-in missing handling CART methods most sophisticated and most flexible as can score any future pattern of missings MARS and TreeNet only handle missing patterns seen in learn data RandomForests uses imputation for missing value handling GPS is a regression/logistic regression tool and will listwise delete
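A minimal sketch of the missing-value dummy creation, assuming records stored as Python dicts with None or an empty string marking a missing value (the `_MIS` suffix is illustrative):

```python
def add_missing_dummies(records, variables):
    """For each variable, add VAR_MIS = 1 when the value is absent, else 0.
    This lets any learner test for informative missingness explicitly."""
    for row in records:
        for var in variables:
            val = row.get(var)
            row[var + "_MIS"] = 1 if val in (None, "") else 0
    return records

# Two toy records: a blank RDATE-style field and a missing AGE
rows = [{"RDATE_3": "", "AGE": 54}, {"RDATE_3": "9706", "AGE": None}]
rows = add_missing_dummies(rows, ["RDATE_3", "AGE"])
```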
ROOT Node Splitter Rankings (penalties active) Missing is now a predictive value. RFA_2$ has only 14 levels and so the penalty is inactive (NLEARN > 2^(K-1), where K=14) PEPSTRFL$ blank is predictive Best splitters have 0 missing and all look reasonable RDATE variables now appear only in the form: Missing or Not Missing (gave or not)
CART Run on All Variables Test Partition ROC=.5822 Kitchen sink model with PENALTIES on missing and HLCs Performance for a single tree is very good and hard to beat
BATTERY BOOTSTRAP and PARTITION 30 resamples, test partition fixed (shown below) 30 reruns, test partition varying Can we have confidence in our main CART result ROC=.5822? BATTERY BOOTSTRAP: Median ROC=.5761, 3rd rank=.5644, 27th rank=.5824 BATTERY PARTITION: Median ROC=.5911, 3rd rank=.5767, 27th rank=.6009
Data Prep-1 Here we report on the data prep we did to facilitate subsequent analysis Conversion of dates to days since January 1, 1960 to facilitate date arithmetic (aka SAS dates) Create REGION$ variable from STATE$ (South, West, Northeast, Midwest, Other) Conversion of ZIP to a number and extraction of ZIP1 (first digit) and ZIP3 (first 3 digits) Dummy (0/1) recoding of variables that use 1 vs BLANK coding Break various RFA codes into 3 separate variables (R version, F version, A version)
Exploding RFA L3F becomes three separate variables
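A sketch of the explode step, assuming the three positions of an RFA code are the Recency, Frequency, and Amount components described earlier (the function name is illustrative):

```python
def explode_rfa(code):
    """Split an RFA code such as 'L3F' into its Recency, Frequency,
    and Amount components (one character each)."""
    code = code.strip().upper()
    if len(code) != 3:
        raise ValueError("expected a 3-character RFA code, got %r" % code)
    return code[0], code[1], code[2]

r, f, a = explode_rfa("L3F")   # three separate variables from one code
```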
Data Prep -2 Extract separate Socioeconomic Status (SES) and Urbanicity from the combined variable DOMAIN Calculate average, min, max solicited donations Check for unsolicited donations, count, stats of gift amounts Create trend variables in donation amount, frequency Create one principal component per group of Census vars Create TREND vars for Recency, Frequency, and Amount variables Create overall summaries of RFA dimensions limited to last 24 months
Census Variables Examples Neighborhood level variables POP901 Number of Persons ETH1 Percent White AGE901 Median Age of Population CHIL1 Percent Children Under Age 7 HHN1 Percent 1 Person Households MARR1 Percent Married DW1 Percent Single Unit Structure HV1 Median Home Value in hundreds HVP1 Percent Home Value >= $200,000 RP1 Percent Renters Paying >= $500 per Month IC1 Median Household Income in hundreds TPE1 Percent Driving to Work Alone Car/Truck/Van LFC1 Percent Adults in Labor Force OCC1 Percent Professional EIC1 Percent Employed in Agriculture EC1 Median Years of School Completed by Adults 25+ VC1 Percent Vietnam Veterans Age 16+ ANC1 Percent Dutch Ancestry LSC1 Percent English Only Speaking
Principal Components Created Census variables only Overall Group-specific for 17 subgroups of census variables Also, judgmental selection of a few solo variables PRIN1 PRIN5 over all census variables PRINxx1 group-specific first principal component only
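A sketch of computing one group-specific first principal component with plain numpy (standardize the block of census variables, then project onto the top eigenvector of its covariance); the function name and toy data are illustrative:

```python
import numpy as np

def first_principal_component(X):
    """Score each row on the first principal component of a standardized
    block of variables (e.g. one subgroup of census vars)."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    pc1 = eigvecs[:, -1]                        # eigh returns ascending order
    return Z @ pc1                              # one score per record

rng = np.random.default_rng(1)
block = rng.normal(size=(50, 4))               # toy "census subgroup" of 4 vars
scores = first_principal_component(block)
```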
CART on Census Principal Component Vars Ranks: Ethnicity, Home Costs, Education, Income, Omnibus. ROC=.5288
TreeNet Variable Importance Rankings: ROC=.5520 Ranking: Income, Housing, Transportation Ancestry, Labor Force, Household, Ethnicity, Interests Compared to using all raw variables these are quite effective
Data Prep-3 Create a new data set which contains one record per mailing per donor Donor who was mailed 5 times would have 5 rows of data By definition each donor responded at least once to a campaign Can build donation models using donation data from every person in data set In original data format TARGET_D is available only for the subset of responders to the most recent campaign Alternative approach did not improve test sample performance and so was abandoned But modeling experiments on this data helped refine predictor list
TreeNet Prepped Data: ROC=.6256 Some selection of variables, dropping of redundant variables Very difficult to beat this model (3 rd decimal place ROC only) But can make model performance more stable by trimming predictor list
Prepped TreeNet Variable Importances Large predictor list
Wrapper Vs Filter Feature Selection Filters treat each predictor in isolation and are thus vulnerable to missing variables important only via interactions Wrapper method uses a model to select variables, typically repeatedly and recursively Build model with many variables and rank all by importance Remove some variables from bottom of list and repeat We call this variable shaving and often run process removing one variable at a time until only one left Recursive Feature Elimination Judgment to select smallest defensible model (tradeoff accuracy for substantial simplification)
BATTERY SHAVING: Wrapper Method for Variable Selection Recursive Feature Elimination phrase also seen in literature For rapid scan of data we removed 5 predictors at each step Shaving from the BOTTOM of the ranked list of predictors Shaving from the TOP can also be helpful and enlightening Sometimes removing most important predictors improves generalization error
Tabular Display of Backward Shaving Five variables eliminated in each step Eliminating more than one variable per back step is a rough-and-ready way to make rapid progress Could drop many more when working with tens of thousands of predictors Final refinements are better done dropping ONE variable per back step
SHAVE ERROR and LOVO Instead of dropping the least important predictor we could TEST which variable to drop by running a LOVO experiment Leave One Variable Out (LOVO) With 20 predictors: Shave via Importance Ranking: 20 models required Shave via LOVO: (20*21)/2 = 210 models required SHAVE ERROR number of models is quadratic in K SHAVE ERROR is repeated LOVO We generally shave from the bottom to reach a reduced set of predictors and use SHAVE ERROR for the final refinement
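The two procedures and their model counts can be sketched as follows (the `rank_importance` callback stands in for refitting the model and ranking predictors at each step; all names are illustrative):

```python
def shave_by_importance(variables, rank_importance, steps):
    """Backward shaving: at each step, fit one model, rank the predictors,
    and drop the one at the bottom of the importance list."""
    models_fit = 0
    for _ in range(steps):
        ranked = rank_importance(variables)   # one model fit per step
        models_fit += 1
        variables = ranked[:-1]               # shave the least important
    return variables, models_fit

def lovo_model_count(k):
    """SHAVE ERROR (repeated LOVO): with j variables left, fit j candidate
    models (leave each out in turn), so k + (k-1) + ... + 1 = k(k+1)/2."""
    return k * (k + 1) // 2

# Toy run: alphabetical "importance", shave two steps
kept, models = shave_by_importance(list("abcde"), sorted, 2)
```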
Best Performing Model TARGET_B: ROC=.6293 More stable than previous model using many more predictors Can cut back to 12 predictors and still maintain ROC=.6228
12 variable RESPONSE Model
TreeNet Partial Dependency Plots
Dependency Plots-2
Dependency Plots-3
KDD98 Objective: Maximum Net Revenue Objective is not simply to maximize response rate Want to select mailing list based on expected donation Cost of mailing is $0.68 so maximizing net revenue means mailing to those whose expected donation>$0.68 Have moderately good response model after much experimentation Now need donation model which will be a regression Simple approach is to construct Prob(response) * E(Gift | Predictors, Response=1) Might want to factor in sample selection bias as the regression on TARGET_D is fit to the 5.08% of the prospect list who actually responded
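The naive two-model combination and the $0.68 decision rule can be written out directly (the cost constant comes from the challenge description; function names are illustrative):

```python
MAIL_COST = 0.68  # cost per mail piece, from the KDDCup 1998 challenge

def expected_net_revenue(p_respond, expected_gift_if_respond):
    """Naive two-model combination: E[donation] = Prob(respond) * E(gift | respond)."""
    return p_respond * expected_gift_if_respond - MAIL_COST

def should_mail(p_respond, expected_gift_if_respond):
    """Mail exactly when the expected donation exceeds the cost of the piece."""
    return p_respond * expected_gift_if_respond > MAIL_COST
```

For example, at the base response rate of about 5% and the mean gift of $15.62, the expected donation is $0.78, so mailing the average list member is (barely) profitable, which is why the "mail all" strategy still nets a positive $10,560.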
TARGET_D Amount of Gift to Campaign 97 Learn Data N=4,843
Percentile        Gift Amount
100% (Max)        200
99%               50
97.5%             46
95%               32
90%               25
75% (Q3)          20
50% (Median)      13
25% (Q1)          10
10%               5
5%                5
2.5%              4
1%                3
0% (Min)          1
80% of all gifts between $5 and $25 Smallest gift $1.00, so if we knew for sure someone will give we should mail 2nd model required to perfect the targeted mailing list Sample size much smaller, so a simpler model is probably required
Kitchen Sink Model Prepped Data Exclude only Raw variables with processed versions TreeNet Test Sample MSE=77.229 CART MSE=94.292 No point in pursuing a CART model here
Outlier Analysis: Lift (Percent of Prediction Error/Percent of Data) N Test sample=964 Just 10 records account for 58.33% of SSE (sum of squared prediction errors)
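The lift statistic used here, share of squared error divided by share of the data, can be sketched as follows (the residuals in the example are illustrative, not from the KDD data):

```python
def error_lift(residuals, top_n):
    """Lift = (share of total squared error carried by the top_n
    worst-predicted records) / (share of the records they represent)."""
    sq = sorted((r * r for r in residuals), reverse=True)
    share_sse = sum(sq[:top_n]) / sum(sq)
    share_data = top_n / len(sq)
    return share_sse / share_data

# One big-miss record among five: it alone carries most of the SSE
lift = error_lift([10, 1, 1, 1, 1], top_n=1)
```

A lift far above 1 for a handful of records, as on this slide, flags those records as outliers worth individual inspection.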
Largest Positive Residuals Larger Than Negative
Shaving Five Variables Every Step: For a quick search for a smaller, better model Test sample MSE=76.598 with 50 predictors
Imposing Additivity Constraints: TARGET_D Backwards stepwise imposition of additivity. Slight improvement obtained when two variables are constrained. Fully additive model is only slightly worse
Sequence of Models Built Raw Data Largest plausible KEEP list on Prepared Data Shaving (Backwards Feature Elimination) to Moderate number of variables Judgmentally selected smallest plausible model Less likely to be overfit and likely to have smallest prediction variance Raw data models were inferior to refined models Looked at Validation results only after Moderate models constructed
Results of Several Models Results of models following recommended procedures BATTERY RELATED is intended to correct for sampling bias and uses 114 variables in the KEEP list
Model                                         # sent (validate)   profit (validate)
Combination                                   58,525              $15,596.95
BATTERY RELATED Inverse                       57,682              $15,622.19
BATTERY RELATED no weight                     58,145              $15,848.17
RA Best                                       54,739              $14,781.25
Smallest Keeplist tgtb18tlud17th              56,613              $15,318.11
Smallest Keeplist tgtb18tlud17tlad            53,983              $15,037.01
Smallest Keeplist tgtb18tlud156th             53,823              $15,095.81
Observe that every model reported does better than the original winners ($14,712) Any user modeling with TreeNet and everyday data prep and model selection should beat the winners in 2-3 rounds of model refinement