Better credit models benefit us all

Agenda
- Credit scoring - overview
- Random forest - overview
- Random forests outperform logistic regression for credit scoring out of the box
- Interaction term hypothesis for the random forest advantage
- I* algorithm and synthetic experiments
- Conclusion

Credit Scoring - Overview
- One of the most successful operations research / data science techniques
- Overarching purpose is to predict the risk of serious delinquency or default
- Unique characteristics of credit scoring models include:
  - Highly correlated variables (multicollinearity), so p-values are relatively unreliable
  - Alternative specifications of models using completely different but correlated variables give similar results
  - High visibility (it impacts us all very personally)
  - U.S. law mandates that components of the score be analyzed to ensure fair and equal treatment
  - Borrowers must be given the top 3 denial reasons for a decision, i.e. the most important variables affecting their score
- Guide to credit scoring in R: http://cran.r-project.org/doc/contrib/sharma-creditscoring.pdf

Roger Bannister and credit scoring

Credit scoring is plagued by the "flat maximum" (or 4-minute mile) problem
- Scholars have argued that model performance cannot be enhanced further and has approached its limits: "The predictive ability of linear models is insensitive to large variations in the size of regression weights and to the number of predictors." (Lovie and Lovie, 1986; Overstreet et al., 1992)
- More recently: Classifier Technology and the Illusion of Progress (Hand, 2006). Hand's point is that 80-90% of model lift comes from a simple model. Still true, but a 5-7% relative lift is still stellar for modelers and data scientists.
- "There was a mystique, a belief that it couldn't be done, but I think it was more of a psychological barrier than a physical barrier."
- Record for the mile run over time: 4 min 1.3 sec pre-Bannister; 3 min 59 sec Bannister; 3 min 43 sec El Guerrouj. That 17 sec progress over roughly 4 min is a ~7% lift, proportional to the random forest advantage on application scoring.

Random Forests - Overview
- Ensemble method using random selection of variables and bootstrap samples (developed by Leo Breiman at Berkeley, 1999)
- Grows recursive partitioning trees and classifies by majority vote:
  1) Each tree is grown on a bootstrap sample of the training set.
  2) A number m is specified, much smaller than the total number of variables M. At each node, m variables are selected at random out of the M, and the split is the best split on these m variables. (Breiman, 2002 Wald Lecture, "Looking Inside the Black Box")
- The most powerful aspect of random forests is the variable importance ranking, which estimates the predictive value of a variable by scrambling (permuting) it and seeing how much model performance drops (see the sketch below)
- It is unclear why ensembles work; research has shown a random forest is a form of adaptive nearest neighbor (Lin and Jeon, 2002)
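A minimal sketch of this permutation importance in R with the randomForest package. The data frame train and its 0/1 outcome column bad are hypothetical stand-ins; any credit data set with a default flag would do.

```r
# Fit a random forest and rank variables by permutation importance.
library(randomForest)

set.seed(42)
rf <- randomForest(factor(bad) ~ ., data = train,
                   ntree = 500,       # number of bootstrap trees
                   importance = TRUE) # compute permutation importance

# MeanDecreaseAccuracy: how much out-of-bag accuracy drops when each
# variable is randomly permuted -- the ranking used throughout this talk.
imp <- importance(rf, type = 1)
imp[order(imp, decreasing = TRUE), , drop = FALSE]
```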

Random Forests Outperform in Credit Scoring
- There are two credit modeling domains:
  - Origination scores, used to predict whether to grant a loan to an applicant
  - Behavioral models, which use customer history and behavior-related variables to predict collection or line-of-credit performance later in the life of the loan product
- As a credit risk manager I have found that random forests outperform logit models on numerous credit products (credit cards, mortgages, auto loans, home installment loans, and home equity lines of credit)
- Random forests score an estimated 5-7% higher area under the ROC curve for origination scoring
- Random forests have been shown to improve behavioral models by as much as 15-20%

Thrust of Research
- In practice, variable importance rankings can be used to test one-way interaction effects and add them to logit models, resulting in performance that matches or sometimes exceeds random forests
- A hypothesis for the performance gap between the random forest and the logit model is that ensemble methods ferret out predictive interaction effects between variables
- To test this hypothesis, a synthetic data set was created where the Y variable depends on X, Z, and the X*Z interaction. Both the random forest and the logit model were fit on this data set, and the performance gap (the advantage of the random forest) became clear

I* Algorithm: tuning logit to match or exceed random forest performance
- The I* algorithm takes the random forest variable importance rankings, searches for interaction terms among the top-ranked variables, and keeps adding terms in a stepwise manner as long as they increase out-of-sample performance
- Using the random forest ranking yields an ordered search with a much smaller search space than trying all models, and it converges quickly to an effective model
- Generating interaction terms from the important variables n-way, one at a time, rather than all interactions at once, keeps the candidate set manageable; building stepwise on validation performance is effective
- Reasons interactions have not been explored more: unreliable p-values, and modeling too many interactions at once does not work (overfitting, poor performance) (Gayler, 2008)
- A minimal sketch of the search follows
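A sketch of the I* idea in R, not the author's exact implementation: greedily add pairwise interactions among the top RF-ranked variables to a logit model, keeping a term only if it improves validation AUC. The data frames train and valid with binary outcome bad are assumptions.

```r
library(randomForest)
library(pROC)

# Rank variables with the random forest, keep the top few.
rf  <- randomForest(factor(bad) ~ ., data = train, importance = TRUE)
top <- names(sort(importance(rf, type = 1)[, 1], decreasing = TRUE))[1:5]

# Out-of-sample AUC of a logit model given a formula.
auc_of <- function(formula) {
  fit <- glm(formula, data = train, family = binomial)
  as.numeric(auc(valid$bad, predict(fit, valid, type = "response")))
}

form     <- bad ~ .            # start from the main-effects logit
best_auc <- auc_of(form)

# Stepwise search over pairwise interactions of the top-ranked variables.
for (pair in combn(top, 2, simplify = FALSE)) {
  cand     <- update(form, paste(". ~ . +", pair[1], ":", pair[2]))
  cand_auc <- auc_of(cand)
  if (cand_auc > best_auc) {   # keep the interaction only if it helps
    form     <- cand
    best_auc <- cand_auc
  }
}
```

Because the candidate set is restricted to the RF's most important variables, the search space stays small even when the data set has many columns.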

Summary of improvements to logit via interaction terms
Models and performance (AUC):

Dataset       Original glm (logistic)   I* glm w/ interactions   Original random forest
German        0.81                      0.88                     0.83
Credit card   0.60                      0.64                     0.62
Home equity   0.88                      0.96                     0.95

Example 1: Credit Card Data
- Source: Pacific-Asia KDD 2009 competition, a Brazilian credit card data set of 50k credit cards

Example 2: Home Equity Data Set from SAS
Home equity data set, out-of-sample area under the curve:
- Logistic: 0.79
- Wallinga's neural-net-tuned logit link function: 0.87
- Breiman's random forest, out of the box: 0.96

PAKDD credit card data: tuned logit via I* vs. random forest vs. logit

Home equity: tuned logit vs. random forest vs. logit

Synthetic data example
- Developed a synthetic data set where Y = 1 + 0.5*x1 - 1.5*x2 + 2.0*x3
- x3 is an interaction effect, x3 = x1*x2, where x1 and x2 are normally distributed random numbers with a mean of 0 and a standard deviation of 1
- Three additional noise variables are created: x4 (normal, mean = 1, sd = 50) and x5 and x6, which are random numbers with mean 0 and sd = 1
- x3 is not in the training set
- A minimal data-generation and comparison sketch follows
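A sketch of this experiment in R. The slide gives the linear index but not how the binary outcome is produced, so a logistic link is assumed here; the 50/50 train/validation split is also an assumption.

```r
library(randomForest)
library(pROC)

set.seed(1)
n  <- 10000
x1 <- rnorm(n); x2 <- rnorm(n)
x3 <- x1 * x2                          # interaction effect, withheld from training
x4 <- rnorm(n, mean = 1, sd = 50)      # noise
x5 <- rnorm(n); x6 <- rnorm(n)         # noise

eta <- 1 + 0.5 * x1 - 1.5 * x2 + 2.0 * x3
y   <- rbinom(n, 1, plogis(eta))       # assumed logistic link

dat   <- data.frame(y, x1, x2, x4, x5, x6)   # note: no x3 column
idx   <- sample(n, n / 2)
train <- dat[idx, ]; valid <- dat[-idx, ]

rf        <- randomForest(factor(y) ~ ., data = train)
logit     <- glm(y ~ ., data = train, family = binomial)
logit_int <- glm(y ~ . + x1:x2, data = train, family = binomial)

auc(valid$y, predict(rf, valid, type = "prob")[, 2])        # random forest
auc(valid$y, predict(logit, valid, type = "response"))      # logit without x3
auc(valid$y, predict(logit_int, valid, type = "response"))  # logit with x1*x2
```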

Synthetic performance without x3: RF dominates by ~30% out-of-sample AUC
- AUC for the random forest is 0.897; for glm (logistic) it is 0.66

Results with interaction term x3 added to the models
- AUC for the logit with the interaction effect: 0.905

Synthetic experiments
- Random 50/50 assignment of a binary outcome, with the X variable for each class generated from different probability distributions
- RF dominates logit when X consists of 2-d Gaussians with different variance-covariance matrices
- RF dominates when one of the X distributions is outside the exponential family (e.g. Cauchy, sum of Gaussians)
- Logit dominates when the distributions are 2-d Gaussian with the same variance-covariance structure, or exponential-family distributions like Gaussians

Results of synthetic experiments
The table below lists results from experiments on synthetic binary data. Two classes were sampled randomly and X was sampled from two different distributions (10,000 draws).

Distribution                                                   Winner
Gaussian (mean=0, sd=1) vs. Gaussian (mean=0, sd=1)            Logit
Gaussian (mean=0, sd=1) vs. Uniform (mean=0, sd=2)             Logit
2-d Gaussian with same variances/covariances                   Logit
Gaussian (mean=0, sd=1) vs. Cauchy (scale=1, non-exponential)  Random forest
2-d Gaussian with different variances/covariances              Random forest
Sum of 2 Gaussians (non-exponential) vs. Gaussian              Random forest
Data with interaction effects                                  Random forest

2-d Gaussians without the same variance
- The random forest dominates when the variance/covariance structure differs between classes, but not when it is the same; a minimal sketch of this experiment follows
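A sketch of one such experiment in R, under assumptions about the exact setup: both classes have mean zero, so the logit has no linear signal, while the random forest can exploit the differing covariance structure. The specific covariance matrices below are illustrative.

```r
library(MASS)           # mvrnorm for multivariate Gaussian draws
library(randomForest)
library(pROC)

set.seed(7)
n  <- 10000
y  <- rbinom(n, 1, 0.5)                    # random 50/50 class assignment
S0 <- diag(2)                              # class 0: identity covariance
S1 <- matrix(c(1, 0.9, 0.9, 4), 2, 2)      # class 1: different var-cov
X  <- t(sapply(y, function(cl)
        if (cl == 0) mvrnorm(1, c(0, 0), S0) else mvrnorm(1, c(0, 0), S1)))
dat <- data.frame(y, x1 = X[, 1], x2 = X[, 2])

idx   <- sample(n, n / 2)
train <- dat[idx, ]; valid <- dat[-idx, ]

rf    <- randomForest(factor(y) ~ x1 + x2, data = train)
logit <- glm(y ~ x1 + x2, data = train, family = binomial)

auc(valid$y, predict(rf, valid, type = "prob")[, 2])    # random forest
auc(valid$y, predict(logit, valid, type = "response"))  # logistic regression
```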

Conclusions
- Random forests outperform traditional logistic-regression-based scores when there are predictive interaction effects in the data
- The resulting tuned logit can match or exceed random forest classification
- The performance advantage of the random forest can be folded into logit models via interaction terms identified through random forest variable importance rankings
- In a sense, using variable importance one can algorithmically search for promising interaction effects, which have always existed in credit scoring: debt ratio (expenses * 1/income), months of reserves (assets/PITI), or even Hand's "superscorecard" (the product of 2 scores, i.e. an interaction effect) (2002)
- If a random forest model dominates logit, it suggests there are statistically significant interaction effects in the data that should be modeled
- Random forest variable importance is a powerful tool to identify interaction effects within a small search space

Details in papers
- http://cran.r-project.org/doc/contrib/sharma-creditscoring.pdf
- http://papers.ssrn.com/sol3/cf_dev/absbyauth.cfm?per_id=644975 (papers on I*, synthetic experiments, credit scoring, and a de-biased forest algorithm, with R code appendices and detailed bibliographies)
- http://en.wikipedia.org/wiki/mile_run_world_record_progression
- http://www.stat.berkeley.edu/~breiman/wald2002-2.pdf
- http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1975801
Data:
- Credit card: http://sede.neurotech.com.br/pakdd2010/
- Home equity: www.sasenterpriseminer.com/data/hmeq.xls