Predicting Borrowers' Chance of Defaulting on Credit Loans




Predicting Borrowers' Chance of Defaulting on Credit Loans

Junjie Liang (junjie87@stanford.edu)

Abstract

Credit score prediction is of great interest to banks, since the output of the prediction algorithm is used to determine whether borrowers are likely to default on their loans, which in turn affects whether a loan is approved. In this report I describe an approach to credit score prediction using random forests. The dataset was provided by www.kaggle.com as part of the contest "Give Me Some Credit". My random-forest model made fairly good predictions of the probability of a loan becoming delinquent, achieving an AUC score of 0.867262 and placing me at position 122 in the contest.

1 Introduction

Banks often rely on credit prediction models to decide whether to approve a loan request. A good prediction model lets a bank extend as much credit as possible without exceeding a risk threshold. For this project, I took part in a competition hosted by Kaggle in which a labelled training set of 150,000 anonymous borrowers is provided, and contestants must label a second set of 100,000 borrowers by assigning each borrower a probability of defaulting on their loans within two years. The features given for each borrower are described in Table 1.

Table 1: List of features provided by Kaggle

Variable Name | Description | Type
SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N
RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit (except real estate and installment debt like car loans) divided by the sum of credit limits | percentage
Age | Age of borrower in years |
NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years |
DebtRatio | Monthly debt payments, alimony, and living costs divided by monthly gross income | percentage
MonthlyIncome | Monthly income | real
NumberOfOpenCreditLinesAndLoans | Number of open loans (installment loans such as a car loan or mortgage) and lines of credit (e.g. credit cards) |
NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due |
NumberRealEstateLoansOrLines | Number of mortgage and real estate loans, including home equity lines of credit |
NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years |
NumberOfDependents | Number of dependents in family, excluding themselves (spouse, children, etc.) |

2 Method

Random forests is a popular ensemble method invented by Breiman and Cutler. It was chosen for this contest because of the many advantages it offers: random forests run efficiently on large databases and, by their ensemble nature, do not require much supervised feature selection to work well. Most importantly, the method supports not just classification but regression outputs as well. However, there are still parameters that need to be tuned to improve the performance of the random forest.

2.1 Data Imputation

The data provided by Kaggle were anonymized data taken from a real-world source, so the input is expected to contain errors. From observation, I determined that there were three main types of data that required imputation:

1. Errors from user input: These errors probably came from typos during data entry. For example, some of the borrowers' ages were listed as 0 in the dataset, suggesting that the values were entered wrongly.

2. Coded values: Some of the quantitative values within the dataset were actually coded values with qualitative meanings. For example, under the column NumberOfTime30-59DaysPastDueNotWorse, a value of 96 represents "Others", while a value of 98 represents "Refused to say". These values need to be replaced so that their large quantitative values do not skew the entire dataset.

3. Missing values: Lastly, some entries in the dataset were simply listed as NA, so these values must be filled in before running the prediction. In particular, NA values were found for the features NumberRealEstateLoansOrLines and NumberOfDependents.

Since the choice of imputation method could significantly affect the outcome of the prediction algorithm, I tested several ways of doing the imputation:

1. Leave the data alone: This method can only be used for errors in the first two categories. Since the random forest algorithm works by finding appropriate splits in the data, it will not be adversely affected even if coded keys with large quantitative values remain in the input. However, this does not work for the NA entries, so for this method I removed the features NumberRealEstateLoansOrLines and NumberOfDependents completely from the dataset.

2. Substitute with the median value: Another way to handle missing data is to use the median of all valid values within the feature vector. In effect, this fills the missing entry with an available sample from the dataset that should not greatly affect predictions.

3. Coded value of -1: Since the randomForest library I was using would not accept NA as an input entry, I simply replaced NAs with a value of -1 (a value that did not appear anywhere else) so that NA is seen as a separate value in the data.

As described below, tests determined that method 3 (filling in missing entries with -1) gave the best prediction accuracy.
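The three imputation strategies above can be sketched as follows. This is only an illustrative Python/pandas version, not the author's original code (which used the randomForest library); column names are taken from Table 1.

```python
import numpy as np
import pandas as pd

def impute(df, method):
    """Apply one of the three imputation strategies from Section 2.1."""
    df = df.copy()
    if method == "drop":
        # 1. Leave the data alone, removing the two NA-bearing features.
        return df.drop(columns=["NumberRealEstateLoansOrLines",
                                "NumberOfDependents"])
    if method == "median":
        # 2. Fill each missing entry with its column's median.
        return df.fillna(df.median(numeric_only=True))
    if method == "coded":
        # 3. Code NA as -1, a value that appears nowhere else.
        return df.fillna(-1)
    raise ValueError(f"unknown method: {method}")

# Toy example with missing entries.
toy = pd.DataFrame({"MonthlyIncome": [3000.0, np.nan, 5000.0],
                    "NumberOfDependents": [2.0, np.nan, 0.0],
                    "NumberRealEstateLoansOrLines": [1.0, 0.0, np.nan]})
print(impute(toy, "coded"))
```

With the "coded" method, every NA becomes -1, so the forest can treat missingness itself as a split value.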

Lastly, the labels in the dataset were skewed: there were about 14 times more non-defaulters than defaulters. This is to be expected, since most people do not (or try not to) default on their loans, but it also tends to skew the predictions made by the random forest. To overcome this problem, I "rebalanced" the input by repeating each defaulter's row 14 times, so that overall the input file contained equal numbers of defaulters and non-defaulters.

2.2 Parameter Tuning

One of the nice features of random forests is that relatively few parameters need to be tuned. In particular, the parameters I tested were:

1. Sample size: The number of samples drawn at each iteration of the split when building the random forest.

2. Number of trees: The number of trees to grow. This value cannot be too small, to ensure that every input row gets predicted at least a few times. In theory, setting this value to a large number should not hurt, as the random forest should converge. However, in my tests a value that was too large reduced the accuracy of the prediction, possibly due to overfitting.

2.3 Experimental Setup

For the test setup, I built a data pipeline that first takes the raw input and passes it through the data imputation stage. Thereafter, the parameters of the random forest are chosen, and 4-fold cross-validation is done on the labelled inputs. An AUC score is calculated for each of the 4 runs, and the average AUC is used as an estimate of the performance of the algorithm on the actual test data. The experimental setup is shown in Figure 1.

Figure 1: Experimental workflow

3 Results

3.1 Methods of data imputation

I shall now present some of the findings from my experiments. Among the three methods of data imputation, it was clear that the method using coded values gave the best results. This could be because borrowers who had NA in their entries were grouped together and used to predict each other's credit reliability, implicitly performing a nearest-neighbor-like match. The results of the three methods can be seen in Figure 2.
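The rebalancing step described above can be sketched in a few lines of plain Python; this is an illustration rather than the author's code, and the row layout (label in the first column) is hypothetical.

```python
def rebalance(rows, label_index=0, factor=14):
    """Repeat each defaulter row `factor` times in total (the original plus
    factor-1 extra copies), leaving non-defaulter rows untouched, so that a
    roughly 14:1 class imbalance becomes roughly 1:1."""
    out = []
    for row in rows:
        out.append(row)
        if row[label_index] == 1:  # defaulter: add the extra copies
            out.extend([row] * (factor - 1))
    return out

# Toy input mirroring the 14:1 imbalance: 14 non-defaulters, 1 defaulter.
toy = [(0, 0.3)] * 14 + [(1, 0.9)]
balanced = rebalance(toy)
print(sum(r[0] for r in balanced), len(balanced))  # -> 14 28
```

After rebalancing, the toy input holds 14 defaulter rows and 14 non-defaulter rows, matching the even split described in Section 2.1.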

Figure 2: Comparing different methods of data imputation (average AUC from 4-fold cross-validation, for 500 and 1000 trees)

3.2 Number of inputs sampled at each split

This number refers to the number of inputs sampled at each iteration when building the forest. I tested the prediction performance with different sample sizes, all using the coded-value heuristic for data imputation (since it gave the best performance among the three methods). The results show that a larger sampling size gives less accurate predictions, which could be due to overfitting to the dataset. The results of the tests can be seen in Figure 3.

Figure 3: Comparing different sampling sizes (average AUC from 4-fold cross-validation, for sample sizes 2500, 5000, 10000, 15000, and 20000)

3.3 Number of trees grown

Lastly, using the best performers for the previous two parameters, I ran a test with 500 and 1000 trees grown. Most of the learning occurred within the first 350 trees, with only incremental changes thereafter. Nonetheless, performance was slightly better when growing only 500 trees as opposed to 1000. Again, I suspect this was due to overfitting of the data. Results are shown in Figure 4.
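The AUC metric used to score each cross-validation fold can be illustrated in pure Python (the author's actual pipeline used the randomForest library; this only sketches the metric). AUC is the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter, with ties counted as one half.

```python
def auc(labels, scores):
    """AUC as the Mann-Whitney statistic over all positive/negative pairs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_auc(fold_results):
    """Average AUC over folds; fold_results is one (labels, scores) pair per fold."""
    return sum(auc(y, s) for y, s in fold_results) / len(fold_results)

# A perfect ranker scores 1.0; a constant (uninformative) one scores 0.5.
print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # -> 1.0
print(auc([0, 1], [0.5, 0.5]))                  # -> 0.5
```

Averaging this score over the 4 folds gives the estimate plotted in Figures 2-4.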

Figure 4: Comparing the number of trees built in the forest (average AUC from 4-fold cross-validation, for 500 and 1000 trees)

3.4 Submission to Kaggle

The results above were calculated using 4-fold cross-validation on the labelled data provided by Kaggle. After the best set of parameters was found, the model was trained with those parameters on the full labelled training data and used to predict the unlabelled test data. The predictions were then submitted to Kaggle, which did the final AUC scoring. Using the parameters chosen above, I obtained an AUC of 0.867262. For reference, the top team achieved a better AUC score of 0.869558.

4 Discussion

Simply by tweaking a few parameters, I was able to get fairly good prediction results on this dataset. This illustrates one strength of random forests: the default parameters are generally quite good. From discussions on Kaggle after the competition ended, some teams also using random forests achieved much better results through smarter data imputation and by combining random forests with other ensemble methods such as gradient boosting.

5 References

1. Bastos, Joao. "Credit scoring with boosted decision trees." Munich Personal RePEc Archive 1 (2007). Print.
2. Chen, Chao, Andy Liaw, and Leo Breiman. "Using random forests to learn imbalanced data."
3. Dahinden, C. "An improved Random Forests approach with application to the performance prediction challenge datasets." Hands-on Pattern Recognition (2009). Print.
4. "Description - Give Me Some Credit - Kaggle." Kaggle. Web. 17 Dec. 2011. <http://www.kaggle.com/c/givemesomecredit>.
5. Breiman, Leo. "Random Forests." Machine Learning 45.1 (2001): 5-32. Print.
6. "Random forest." Wikipedia, the free encyclopedia. Web. 17 Dec. 2011. <http://en.wikipedia.org/wiki/random_forest>.
7. Robnik-Sikonja, M. "Improving Random Forests." Lecture Notes in Computer Science 3201 (2004): 359-370. Print.