Credit Scoring Modelling for Retail Banking Sector.




Credit Scoring Modelling for Retail Banking Sector

Elena Bartolozzi, Matthew Cornford, Leticia García-Ergüín, Cristina Pascual Deocón, Oscar Iván Vasquez & Francisco Javier Plaza

II Modelling Week, Universidad Complutense de Madrid, 16th - 24th June 2008

Contents

1 Credit Scoring
  1.1 Introduction
  1.2 Methodology and Data
  1.3 Solution
    1.3.1 Univariate Analysis
    1.3.2 Multivariate Analysis
    1.3.3 Model Creation
    1.3.4 Model Validation
    1.3.5 Model Calibration
2 Capital Allocation
  2.1 The problem
  2.2 Implementation
  2.3 Conclusions

Acknowledgements

We would like to thank Ignacio Villanueva for his guidance in this project and Estela Luna of Accenture for raising the problems.

1 Credit Scoring

1.1 Introduction

Our problem is concerned with how, and to whom, a bank should lend its money. When a client applies for a loan, the bank would like to be sure that the client will pay back the full amount. In the past the decision was made solely on the bank's experience in lending money. This method is highly subjective, and banks would like a more systematic approach. One way of doing this, which we present here, is to fit a Generalized Linear Model to past data and use it to produce a probability that the borrower will repay the loan. This probability, along with the lender's experience, is then used to decide whether the bank should lend to a particular client. The same method can also be used for issuing general insurance, for example car insurance. This is a routine problem that is already addressed by many banks and insurance firms, and many tools exist in MATLAB and SAS which make the implementation of each step straightforward.

1.2 Methodology and Data

There are five clear steps which we took to complete this problem: Univariate Analysis, Multivariate Analysis, Model Creation, Model Validation and Model Calibration. Our data was provided by Accenture and includes details of completed loan agreements. The variables included are age, income, wealth, marital status, length as a client, amount of loan and length of loan.

1.3 Solution

1.3.1 Univariate Analysis

This involves calculating statistics for each variable, such as the mean, median, standard deviation and skewness. The purpose of this task is to provide a general feeling for the data. This information can be used as a first check before applying the model to a particular client. For example, if the average age in the data is 65 and the client is 18, it is likely that the model will not be relevant.

1.3.2 Multivariate Analysis

We need to decide which variables from the data to include in our model. Firstly we compute the correlation matrix. With this information, if any two variables are highly correlated we may discard one of them, as it does not provide any new information. The second test is the χ²-test. This tests the dependence between each variable and the response variable, in our case whether the client defaulted. If there is significant statistical evidence that a variable is independent of default, it does not make sense to include it in the model. With our data the χ²-test indicated we should drop the variables length of loan and amount of loan.

1.3.3 Model Creation

We are looking to compute the default probability using the regression model

Default = f(x_1, ..., x_n) + ε,

where x_1, ..., x_n are the explanatory variables (age, income, ...), ε is the residual, and f is our regression function. As our response variable is binomial we use the logistic (logit) model, where

P(default | X) = 1 / (1 + exp(-Xβ))

and β is a vector of parameters to be determined from the data. β can easily be calculated using the MATLAB function glmfit or the SAS procedure proc logistic.
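The report fits β with MATLAB's glmfit; purely as an illustration of the same maximum-likelihood fit, here is a minimal sketch in Python. The function names, the Newton-Raphson (IRLS) loop and the synthetic data are our own, not the report's code:

```python
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson (IRLS), the same
    algorithm glmfit / proc logistic use internally."""
    X = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # current fitted probabilities
        W = p * (1.0 - p)                       # IRLS weights
        # Newton step: beta += (X' W X)^{-1} X' (y - p)
        beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - p))
    return beta

def default_probability(beta, x):
    """P(default | x) = 1 / (1 + exp(-x.beta)) for one client."""
    return 1.0 / (1.0 + np.exp(-(beta[0] + np.dot(beta[1:], x))))
```

Given a matrix of past clients' explanatory variables and a 0/1 default indicator, `fit_logit` returns the β vector and `default_probability` scores a new applicant.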

The calculated β_i are:

  β_1  Intercept       -1.85136
  β_2  Age             -0.02678
  β_3  Income           0.10025
  β_4  Wealth          -0.01761
  β_5  Marital Status   0.79651
  β_6  Maturity         0.00892

1.3.4 Model Validation

For validation we used the Powerstat method, which is computed implicitly by proc logistic in SAS. We obtained a misclassification rate of 23.6%, classifying a client as a predicted default when the scored probability exceeds the cut-off of 0.31.

Results obtained:

  Estimated defaults: 137
  Wrong estimations: 173
  Observed defaults: 138

1.3.5 Model Calibration

The expected loss is defined as

EL = PD × EAD × LGD,

where PD is the probability of default, EAD is the exposure at default and LGD is the loss given default. Here PD is the default probability calibrated to a one-year horizon. Scoring probabilities allow us to rank people by their risk of default. However, when interpreting these probabilities we must realise that they do not take into account when default happens. This is why we calibrate the scoring results, to obtain the yearly average default probability. To obtain the calibration, we take a sample of people who have been observed over periods of years. The model is applied to each person to obtain a score, and people are then grouped by score.
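The least-squares fit of the calibration formula described below (done in the report with MATLAB's fminsearch) can be mirrored in Python with SciPy's Nelder-Mead simplex solver, which is fminsearch's closest analogue. This is a sketch: `fit_calibration` and the starting point `x0` are invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def fit_calibration(score, observed_rate, x0=(1e-3, 2.0, 1.0)):
    """Least-squares fit of observed default rates to A*(C + score)**B,
    mirroring the report's fminsearch step with Nelder-Mead."""
    score = np.asarray(score, dtype=float)
    observed_rate = np.asarray(observed_rate, dtype=float)

    def sse(params):
        A, B, C = params
        if C + score.min() <= 0:          # keep the power well-defined
            return np.inf
        return np.sum((A * (C + score) ** B - observed_rate) ** 2)

    return minimize(sse, x0, method="Nelder-Mead").x
```

Called on the per-group scores and observed default rates, it returns estimates of (A, B, C).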

We count how many people are in default, obtaining an observed default rate. The aim of the calibration is to model that default rate with a formula of the form

A(C + score)^B.

A, B and C must be estimated by minimising the least-squares error. We compute this in MATLAB using the function fminsearch. The values obtained are A = 0.0004, B = 3.7410, C = 2.7870.

2 Capital Allocation

2.1 The problem

In this problem a lender has a fixed amount of money, EAD, to lend between n blocks of similar customers. The lender would like to know, for a given level of risk, how he should distribute his money between the blocks to maximise his profit. Each block has associated with it an interest rate ρ_i, an a priori probability of default PD_i, a loss given default LGD_i and a number of customers N_i. If the customers in each block were independent of one another, we could easily compute the probability of k defaults using the binomial distribution. However, the customers are correlated via the economy. Using a Gaussian copula we can introduce a latent variable for each customer,

Z_i = Σ_{j=1}^{m} a_ij Y_j + r_i w_i.

The Y_j are indices for different parts of the economy, perhaps the unemployment rate, a stock market index, and so on. We assume that the state of the economy is fixed. The a_ij represent the weighting given to each part of the economy for the customers in block i, w_i is a standard normal variable for the idiosyncratic risk associated with each customer, and r_i is its standard deviation. A customer defaults when Φ(Z_i) < PD_i. We can use the above to show that the probability of default given a particular state of the economy is

p_i = Φ( (Φ^{-1}(PD_i) - Σ_{j=1}^{m} a_ij Y_j) / r_i ).
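The conditional default probability p_i is a one-line computation given the normal CDF and its inverse. A minimal Python sketch (the function name is ours; SciPy supplies Φ and Φ^{-1}):

```python
import numpy as np
from scipy.stats import norm

def conditional_pd(pd_i, a_i, y, r_i):
    """p_i = Phi((Phi^{-1}(PD_i) - a_i . y) / r_i): the default
    probability for block i conditional on the economy state y."""
    return norm.cdf((norm.ppf(pd_i) - np.dot(a_i, y)) / r_i)
```

With a neutral economy (y = 0) and r_i = 1 this recovers the unconditional PD_i; a downturn (negative indices with positive loadings) raises it.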

When presented with this problem, our starting point was to simulate the w_i for each customer in every block and repeat thousands of times to produce a simulated loss distribution. This approach takes a long time to compute, and it is not clear how to optimise over a simulated loss distribution. We would therefore like an analytical expression for the loss distribution. The probability p_i represents the conditionally independent probability of default for each customer in block i, so we can now use the binomial distribution:

P(k defaults) = (N_i choose k) p_i^k (1 - p_i)^(N_i - k).

As the N_i are of the order 10³, we can use the Central Limit Theorem to approximate the binomial distribution by a normal random variable D_i with mean N_i p_i and variance N_i p_i(1 - p_i):

D_i ~ N(N_i p_i, N_i p_i(1 - p_i)).

We use α_1, ..., α_n to denote the fractions of EAD allocated to each of the n blocks, and assume that each person in a block borrows an equal fraction of the money, namely α_i EAD / N_i. The loss distribution can now be constructed as

L = Σ_{i=1}^{n} (α_i EAD / N_i) (LGD_i D_i - (N_i - D_i) ρ_i).

As the D_i are normal random variables, L is also a normal random variable, L ~ N(µ_L, σ²_L), where

µ_L = Σ_{i=1}^{n} α_i EAD (LGD_i p_i - (1 - p_i) ρ_i),
σ²_L = Σ_{i=1}^{n} (α_i² EAD² / N_i) (LGD_i + ρ_i)² p_i (1 - p_i).

The problem is to minimise the expected loss, µ_L, such that the α_i sum to one and the level of risk is fixed. To measure risk we use Value at Risk (VaR) with a 99% confidence level. Formally:
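The moments µ_L and σ_L follow directly from these formulas. A sketch in Python, vectorised over blocks (the function name is ours):

```python
import numpy as np

def loss_moments(alpha, EAD, LGD, rho, p, N):
    """Mean and standard deviation of the normal approximation to the
    portfolio loss L; vector arguments hold one entry per block."""
    mu = np.sum(alpha * EAD * (LGD * p - (1.0 - p) * rho))
    var = np.sum(alpha**2 * EAD**2 * (LGD + rho)**2 * p * (1.0 - p) / N)
    return mu, np.sqrt(var)
```

A negative µ_L corresponds to an expected profit, since interest earned on non-defaulting loans enters with a minus sign.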

Minimise f(α) = µ_L

subject to the constraints

Σ_{i=1}^{n} α_i = 1,
2.3262 σ_L + µ_L = VaR99,

where VaR99 is the fixed level of risk the lender is willing to take.

2.2 Implementation

We used MATLAB to implement the above optimisation problem. To start with, we took 3 blocks of customers with 3 economic indices. Initially we ran through a large number of possible allocations α = (α_1, ..., α_3). For each α we computed VaR99 and µ_L, and plotted the results to produce an efficient border. This did not take long for 3 blocks, but for a larger number of blocks the process is computationally expensive. We then used the MATLAB optimisation function fmincon to find the optimal α for a given VaR99. Figure 1 compares the two methods. The block parameters such as ρ_i, N_i, ... were chosen at random. We can see that the optimisation has worked well in choosing the α which gives the lowest loss for a given level of risk. As a check we also implemented the simulation of the w_i for a given α to compare with our analytic distribution; the two agreed to the order of 10^{-4}. The next step was to increase the number of blocks to 5. The brute-force approach worked but took considerably longer than with 3 blocks. The optimisation, however, did not perform well. The reason for this is not known for sure; it could be down to unrealistic parameters, a bad starting point or a bad choice of solver. Figure 2 shows the brute-force approach applied to five blocks.

2.3 Conclusions

We found that the analytical method outperformed the simulation of the w_i, as expected. To optimise for more than 3 blocks, the choice of optimiser needs to be investigated
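The fmincon step above can be mirrored in Python with SciPy's SLSQP solver, which handles the two equality constraints directly. This is a sketch, not the report's code: the function name, the starting point, the non-negativity bounds on α and all parameter values in the example are our own assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_allocation(var99, EAD, LGD, rho, p, N, z99=2.3262):
    """Minimise mu_L over block weights alpha subject to
    sum(alpha) = 1 and z99*sigma_L + mu_L = var99."""
    n = len(p)

    def mu(a):
        return np.sum(a * EAD * (LGD * p - (1.0 - p) * rho))

    def sigma(a):
        return np.sqrt(np.sum(a**2 * EAD**2 * (LGD + rho)**2
                              * p * (1.0 - p) / N))

    constraints = [
        {"type": "eq", "fun": lambda a: np.sum(a) - 1.0},
        {"type": "eq", "fun": lambda a: z99 * sigma(a) + mu(a) - var99},
    ]
    # Bounds 0 <= alpha_i <= 1 are an extra assumption (no short lending).
    res = minimize(mu, np.full(n, 1.0 / n), method="SLSQP",
                   constraints=constraints, bounds=[(0.0, 1.0)] * n)
    return res.x
```

Sweeping `var99` over a range of attainable values and recording µ_L traces out the efficient border without the brute-force search.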

[Figure 1: Simulation vs. Optimisation — brute force compared with fmincon for 3 blocks; expected loss plotted against VaR.]

[Figure 2: Efficient border for 5 blocks from the brute-force approach; expected loss plotted against VaR.]

further. Another interesting question is to look at the relationship between the efficient border and the interest rates charged for each block. We could also make the economy indices random variables and see how that changes the problem.