Performance Measures for Credit Risk Models




S.C. Keenan* and J.R. Sobehart
Risk Management Services
Moody's Investors Service
99 Church Street, NY 10007
(212) 553-1493 / (212) 553-7737
keenans@moodys.com / sobehartj@moodys.com

Research Report # 1-10-10-99

* The analysis and conclusions set forth are those of the authors only. Moody's Investors Service is not responsible for any statement or conclusion herein, and no opinions, theories, or techniques presented herein necessarily reflect the position of Moody's Investors Service.

1. Introduction

The objective of this article is to provide guidance for testing and benchmarking credit risk models. Transparency in the area of model validation and benchmarking is extremely important: reliable models can provide a cost-effective means of expediting the credit approval process and a consistency check on credit assessment and monitoring functions, while poorly performing models can have serious consequences for credit risk management practices. We focus on default prediction models, although the techniques described can easily be adapted to models that predict loss or other types of credit events.

Many financial institutions currently face a situation in which more than one risk measure is available for each borrower or potential borrower. These may include internal risk grades, model-generated risk scores, and credit opinions from rating agencies expressed as symbolic ratings. Some commercially available credit risk models carry extraordinary claims of predictive accuracy, usually backed by anecdotal comparisons [1] rather than well-documented evidence of superior performance. In such cases, it is beneficial to compare the relative performance of the different measures of credit quality using an objective and rigorous methodology. Institutions that have borrowers' credit histories, including default or credit loss information, are in a position to conduct validation tests and make objective determinations as to the relative performance of different credit risk measures.

In this article we describe four simple, yet powerful, techniques for comparing the performance of credit risk models and analyzing information redundancy:

(1) Cumulative Accuracy Profile
(2) Accuracy Ratio
(3) Conditional Information Entropy Ratio, and
(4) Mutual Information Entropy.

These techniques are quite general and can be used to compare different types of models even when the model outputs differ and are difficult to compare directly. Specifically, discrete risk ratings can be compared to continuous numerical outputs. Even categorical outputs, such as the letter and alphanumeric symbols used by rating agencies, can be evaluated side by side with numerical credit scores. Implementing these tests requires data consisting of multiple risk measures for a cross-sectional or panel data set, together with associated default or loss information that provides the criterion of accuracy. Here we focus on default prediction accuracy as opposed to loss prediction, so the data requirement is a set of dated default flags.

[1] See, for example, Kealhofer, Kwok and Weng (1998).

2. Basis for Inter-Model Comparison

Comparing the performance of different credit risk models is difficult because the models themselves may measure different aspects of credit risk and may express their outputs in different ways. For example, some models explicitly estimate a probability of default, or expected default frequency [2] (EDF), which is therefore a number between zero and one. Others, such as internal bank scores, rank risk on some ordinal scale, say 1-10 or 1-100. Rating agencies rank along a relatively coarse 21-bin alphanumeric scale, while other models such as Z-scores produce scores reported to several decimal places. Model forms also vary widely: credit risk models have been developed using every available type of linear and non-linear statistical technique. Because of this variety, internal model diagnostics are not helpful for comparisons.

Many of the validation tests found in the literature [3] are of limited scope for practical model comparisons. Commonly cited diagnostics like the F-statistic or the Akaike Information Criterion may be helpful for comparing the internal performance of simple regression models, but similar tests are not available for expert systems, neural networks, and other default prediction models, let alone risk scores assigned by analysts. Moreover, even in the linear regression case, the assumptions that underlie these diagnostics are frequently violated in practice (such as independence of samples or the Gaussian distribution of errors). Although it is usually not difficult to determine to what extent these assumptions are violated in each case, it is difficult to determine how to correct the t-statistics or other statistics that authors cite to demonstrate that they have a good model. The techniques discussed below are useful not only because of their power and robustness, but also because they can easily be applied to any type of model output, including analyst-assigned risk ratings.

Inter-model comparison is essentially a comparison of model errors produced on data sets used for model training and validation. Because default events are rare, it is often impractical to create a model using one data set and then test it on a separate hold-out data set containing out-of-sample and out-of-time observations. Such tests would be the best way to compare models' performance; however, there is rarely enough default information to support them. Too many defaulters left out of the training set will impair the model estimation, while too many defaulters left out of the hold-out sample will reduce the power of the validation tests. The model builder's task is more often one of rationalizing the default experience of a sample that contains both defaulters and non-defaulters. The modeler seeks to determine those characteristics that distinguish defaulters from non-defaulters, so that defaulters can be consistently identified when the model is confronted with a new and different sample of obligors.

Whether expressed as probabilities of default, discrete orderings, or continuous risk scores, model outputs are opinions of credit quality representing different degrees of belief of default-like characteristics.

[2] See Kealhofer, Kwok and Weng (1998).
[3] See Caouette, Altman and Narayanan (1998).

Default prediction models can err in one of two ways. First, the model can indicate low risk when, in fact, the risk is high. Typically referred to as Type I error, this corresponds to highly rated issuers that nevertheless default on financial obligations: for example, a firm that defaulted on publicly held bonds while holding an A rating from an agency. Second, the model can indicate high risk when, in fact, the risk is low. Typically referred to as Type II error, this corresponds to low-rated firms that should, in fact, be rated higher. A model that identifies a start-up fashion retailer as a low-risk borrower relative to Ford Motor Company would be committing a Type II error.

It is possible for some risk measures to be better at (i.e., commit less of) one type of error than another. However, success at minimizing one type of error necessarily comes at the expense of increasing the other type of error. A claim such as "Model X assigned a very high probability of default to 90% of the defaulters in the sample" provides an incomplete picture, since we do not know how often the model assigned a high probability of default to a creditworthy borrower who did not default. This is particularly true for models constructed with a proportion of defaulters to non-defaulters that is not representative of the true population of borrowers. [4] Unfortunately, this type of misleading argument is frequently used to reinforce the credibility of some models. It is not unusual to see anecdotal comparisons of the output of quantitative models [5] (e.g., an EDF) against the historical default rate of an agency rating, with incorrect conclusions drawn as to the performance of the model based solely on a higher value of the model output. Good models balance both types of error by effectively differentiating relative credit risk across the entire spectrum of borrowers' credit quality, and do so consistently over time.

In order to demonstrate the usefulness of the methodology introduced here, we compare several univariate and multivariate models on a validation data set extracted from about 9,000 public firms for the period 1989-1999. The total number of firm-year observations is about 54,000, including over 530 default events. It should be stressed that the purpose of the comparison is not to show which of the selected models is better, but to show how the performance measures differentiate the models. To illustrate the performance measures, we compare outputs from the following models:

(1) A univariate model based on return on assets (ROA) only.
(2) The Z-score model (a widely used benchmark). [6]
(3) A hazard model of bankruptcy. [7]
(4) A variant of the Merton (1974) model based on the concept of distance to default (a worked sketch follows this list):

        Distance to Default = (MVA - Default Point) / (MVA · σ)

    Here MVA is the market value of the firm's assets and σ is its volatility; the Default Point is approximately equal to current liabilities plus 50% of long-term liabilities.
(5) A nonlinear regression model [8] based on both market and financial information.
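As a concrete illustration of measure (4), the sketch below computes the distance to default exactly as defined above. It is a minimal example, not the authors' implementation; the function name and the firm figures are hypothetical.

```python
def distance_to_default(mva, sigma, current_liabilities, long_term_liabilities):
    """(MVA - Default Point) / (MVA * sigma), taking the default point
    as current liabilities plus 50% of long-term liabilities."""
    default_point = current_liabilities + 0.5 * long_term_liabilities
    return (mva - default_point) / (mva * sigma)

# Hypothetical firm: $120M asset value, 25% asset volatility,
# $40M current liabilities, $30M long-term liabilities.
print(distance_to_default(120.0, 0.25, 40.0, 30.0))  # ~2.17; larger = safer
```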

These models represent a wide range of modeling approaches, varying from simple univariate and multivariate analysis to implementations of contingent claims analysis and adaptive computation. We also consider Moody's long-term debt ratings, both as a benchmark and to illustrate the flexibility of the performance measures, even though the stated goal of Moody's ratings is not short-term default prediction per se. [9] The results for agency ratings are not included in the performance tests described here because most of the obligors in the data set are unrated companies.

3. Accuracy Accounting

Credit risk model outputs can be interpreted as a ranking of obligors according to the extent to which they exhibit defaulter-like characteristics. One set of performance comparison measures is based on an accurate accounting of how the model performs over an entire data set ranked from riskiest to safest. These measures are superior to anecdotal comparisons because they compare models' ability to predict many default events as well as many non-default events. This section describes methods for directly comparing credit quality discrimination over an entire data set comprising both defaulters and non-defaulters.

Cumulative Accuracy Profiles (CAPs)

A feature common to all models and agency ratings is the exponential-like curve displayed by the default rate as a function of credit quality or risk score: the better the model, the steeper the curve relative to the distribution of underlying scores. However, a simple analysis of the curvature of the default rate would provide an incomplete picture of the discriminatory power of a model. A more refined method is the use of Cumulative Accuracy Profiles (CAPs), which help to visualize local features across the spectrum of credit quality and give a graphical representation of each risk measure's ability to distinguish defaulters from non-defaulters. CAP curves belong to a class of performance measures variously dubbed lift curves or power curves, which are widely used in many fields to visualize the overall ability of a model to separate two populations.

To plot a CAP curve, companies are ordered by risk score from riskiest to safest. For a given fraction x% of the total number of companies ordered by risk score, a Type I CAP curve is constructed by calculating the percentage y(x) of the defaulters whose risk score is equal to or lower than the one for fraction x.

[4] See Caouette, Altman and Narayanan (1998), pp. 112-122.
[5] Kealhofer, Kwok and Weng (1998).
[6] Here we used the 1968 version of the Z-score model for illustration purposes.
[7] See Shumway (1998).
[8] See Sobehart, Stein, Mikitkyanskaya, Li (2000).
[9] See Keenan, Shtogrin and Sobehart (1999).

A Type II CAP curve is constructed similarly using the function z(x) of the non-defaulters. Technically, the CAP curve for Type I errors represents the cumulative fraction of default events for different percentiles of the risk score scale, and the CAP curve for Type II errors represents its complement.

A good model concentrates the defaulters at the riskiest scores, so the percentage of all defaulters identified (the y variable above) increases quickly as one moves up the sorted sample (along the x axis). If the model were totally uninformative (if, for example, it assigned risk scores randomly), we would expect to capture a proportional fraction of the defaulters with each increment of the sorted sample. That is, x% of the defaulters would be contained in the first x% of the observations, generating a straight-line CAP.

Exhibit 1. Hypothetical Cumulative Accuracy Profiles
[Figure: Type I and Type II CAP curves y(x) and z(x) plotted against the population fraction x (0-100%), together with the ideal Type I, ideal Type II, and random curves; a vertical dashed line marks the fraction of defaulters.]

A good model also concentrates the non-defaulters at the safest scores; therefore, the percentage of all non-defaulters (the z variable) should increase slowly at first. One of the most useful properties of CAPs is that they reveal information about the predictive accuracy of the model over its entire range of risk scores for a particular time horizon. Hypothetical Type I CAPs for ideal, intermediate, and uninformative (random) risk models are presented in Exhibit 1, along with the corresponding Type II curves. The vertical dashed line represents the fraction of defaulters in the total population. In Exhibit 1, the fraction of defaulters has been exaggerated to a hypothetical 20% for illustration purposes; in practice, the fraction of defaulters is of the order of a few percent (in our validation sample, around 1%). Exhibit 2 shows the results for the benchmark models.
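The construction just described reduces to a sort and two cumulative sums. The sketch below is an illustrative implementation, not the authors' code; it assumes the convention that a higher score means higher risk (the ordering would simply be reversed for scales that run the other way).

```python
import numpy as np

def cap_curves(scores, defaults):
    """Type I and Type II CAP curves.

    scores:   risk scores, higher = riskier (illustrative convention)
    defaults: 1 for defaulters, 0 for non-defaulters
    Returns x (population fraction), y (cumulative fraction of
    defaulters captured), z (cumulative fraction of non-defaulters).
    """
    order = np.argsort(-np.asarray(scores, dtype=float))  # riskiest first
    d = np.asarray(defaults)[order]
    x = np.arange(1, len(d) + 1) / len(d)                 # sample fraction
    y = np.cumsum(d) / d.sum()                            # Type I CAP
    z = np.cumsum(1 - d) / (len(d) - d.sum())             # Type II CAP
    return x, y, z

# Toy example: ten obligors, two defaults concentrated at high scores.
scores   = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
defaults = [1,   0,   1,   0,   0,   0,   0,   0,   0,   0]
x, y, z = cap_curves(scores, defaults)
print(y[:3])  # a good model captures most defaulters early: [0.5 0.5 1.]
```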

Exhibit 2. Selected Cumulative Accuracy Profiles
[Figure: CAP curves for the random benchmark, ROA, the Z-score model, the hazard model, the Merton model variant, and the nonlinear model; cumulative defaults (0-100%) vs. population (0-100%).]

Accuracy Ratios (ARs)

It is convenient to have a single summary measure that ranks the predictive accuracy of each risk measure for both Type I and Type II errors. We obtain such a measure by comparing the CAP of any risk measure with both the ideal and random CAPs. The closer a CAP is to the ideal, the more area there is between it and the random CAP; the largest amount of area that can possibly be enclosed is identified by the ideal CAP. The ratio of the area between a model's CAP and the random CAP to the area between the ideal CAP and the random CAP is the Accuracy Ratio (AR). Differences in the proportion of defaulters to non-defaulters in the data sets used to test each model affect the relative performance of each model; thus, AR measures are directly comparable across models as long as they are applied to the same data set. Here we derive an accuracy ratio that provides the same performance measure for Type I and Type II errors. The definition of the AR is based on the sample frequencies of defaults and non-defaults. Technically, the AR value is defined as

    AR = \frac{1}{1-f} \left( 2 \int_0^1 y(x)\,dx - 1 \right) = \frac{1}{f} \left( 1 - 2 \int_0^1 z(x)\,dx \right)    (1)

Here y(x) and z(x) are the Type I and Type II CAP curves for a population x of ordered risk scores, and f = D/(N+D) is the fraction of defaults, where D is the total number of defaulting obligors and N is the total number of non-defaulting obligors. A geometric interpretation of equation (1) can be obtained by examining Exhibit 1 in detail and noticing that the vertical dashed line is located at x = f.
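Equation (1) can be evaluated with a trapezoidal integration of the CAP curves. The sketch below builds on the hypothetical cap_curves helper above; on the toy data from that sketch, the Type I and Type II expressions return the same value (0.875), as the derivation implies.

```python
import numpy as np

def _trapz(y, x):
    # Trapezoidal rule, written out to avoid NumPy version differences.
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def accuracy_ratio(scores, defaults):
    """Accuracy Ratio per equation (1), from both CAP curves."""
    x, y, z = cap_curves(scores, defaults)   # helper sketched above
    f = float(np.mean(defaults))             # f = D / (N + D)
    x0 = np.r_[0.0, x]                       # anchor the curves at (0, 0)
    y0, z0 = np.r_[0.0, y], np.r_[0.0, z]
    ar_type1 = (2.0 * _trapz(y0, x0) - 1.0) / (1.0 - f)
    ar_type2 = (1.0 - 2.0 * _trapz(z0, x0)) / f
    return ar_type1, ar_type2

ar1, ar2 = accuracy_ratio(scores, defaults)  # toy data from the CAP sketch
print(f"AR (Type I): {ar1:.3f}, AR (Type II): {ar2:.3f}")  # both 0.875
```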

The AR measures the proportion of defaulters in a sample that can be identified per increment of the risk score being evaluated. It is a fraction between 0 and 1. Risk measures with ARs close to 0 display little advantage over a random assignment of risk scores, while those with ARs near 1 display almost perfect foresight. Most of the models we tested had ARs in the range of 50% to 75% for the selected sample of public firms. In order to reduce the sensitivity of the AR to outliers and to the rare-event nature of defaults (small samples), we perform sensitivity tests using random resampling. [10] Exhibit 3 shows AR values for the tested models.

Exhibit 3. Selected Accuracy Ratios

  Model                   AR
  ROA only                0.53
  Z-Score Model           0.56
  Hazard Model            0.59
  Merton Model Variant    0.67
  Nonlinear Model         0.73

4. Entropy Based Performance Measures

Information Entropy (IE)

Information Entropy (IE) is a summary measure of the uncertainty that a probability distribution represents. The concept has its origin [11] in the fields of statistical mechanics and communication theory. Intuitively, the information entropy measures the overall amount of uncertainty represented by a probability distribution. We define it as follows. Assume the existence of an event with only two possible outcomes: (A) the issuer defaults, with probability p, and (B) the issuer does not default, with probability 1-p. The amount of additional information an investor requires to determine which outcome actually occurred is defined as

    Information = -\log_2(p)    (2)

where \log_2(p) is the logarithm of p in base 2. If only the first outcome is possible, then p = 1 and the information required is -\log_2(p) = 0. In this case, there is no uncertainty about the outcome and, therefore, there is no relevant information that was not previously known. If the two events are equally likely for the investor (the uninformative case), then p = 1/2 and the amount of information required reaches its maximum value of -\log_2(p) = 1 bit. Exactly 1 bit of information (the equivalent of a yes-no answer) is required by the investor to know which of the two equally likely possibilities has occurred.

[10] See Herrity, Keenan, Sobehart, Carty and Falkenstein (1999).
[11] See Shannon and Weaver (1949), Jaynes (1957), and Pierce (1970).
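The two limiting cases just described are easy to verify numerically; here is a minimal sketch of equation (2), with arbitrary probabilities:

```python
import math

def information_bits(p):
    """Additional information (in bits) needed to learn that an outcome
    with probability p occurred: equation (2)."""
    return -math.log2(p)

print(information_bits(1.0))  # 0.0 bits: a certain outcome carries no news
print(information_bits(0.5))  # 1.0 bit: one yes/no answer resolves it
```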

The use of 2 as the logarithmic base has certain advantages for this example, but any base can be used; usually, natural logarithms are used for convenience. Note, however, that the amount of information depends on the logarithmic base used, which determines the unit of measure of information. The information entropy of the event is defined as

    H_0(p) = -\left( p \log(p) + (1-p) \log(1-p) \right)    (3)

Exhibit 4 shows the information entropy as a function of p; it reaches its maximum at p = 1/2. This is a state of absolute ignorance, because both possibilities are equally likely for the investor. If the assigned probability of the event differs from 1/2, one outcome is more likely to occur than the other; that is, the investor has less uncertainty about the possible outcomes. The reduction in the uncertainty of the outcomes is reflected in the reduction of the entropy.

Exhibit 4. Information Entropy as a Function of p
[Figure: H in bits (0 to 1.0) plotted against p (0 to 1.0); the curve peaks at p = 1/2.]

Consider again the two mutually exclusive outcomes of an event: (A) the issuer defaults, and (B) the issuer does not default, one of which must be true. Given a set of risk scores S = {R_1, ..., R_n} produced by a model, the conditional entropy, which measures the information about the propositions A and B for a specific risk score R_j, is

    h(R_j) = -\left( P(A|R_j) \log P(A|R_j) + P(B|R_j) \log P(B|R_j) \right)    (4)

where P(A|R_j) is the probability that the issuer defaults given that the risk score is R_j, and P(B|R_j) = 1 - P(A|R_j). This value quantifies the average information gained from observing which of the two events A and B actually occurred. The average over all possible risk scores is the conditional information entropy

    H_1(S, \delta) = H_1(R_1, \ldots, R_n, \delta) = \sum_{k=1}^{n} h(R_k) P(R_k)    (5)

For models with continuous outputs, the most straightforward way to estimate the quantities defined in equations (4) and (5) is to use a bin-counting approach. The range of the model output is divided into a number of bins of size δ, related to the accuracy of the output. Because equation (4) requires estimating the conditional distributions of defaults and non-defaults, the bins of size δ have to be bigger than the precision of some of the model outputs to provide meaningful statistics. For illustration, we use δ = 5% of the model output range for each model. [12] Thus, the IE defines an absolute measure of the amount of uncertainty contained in the models, as long as all the models' outputs describe the same data set.

The properties that make the information entropy so appealing are:

a) if the risk score set S contains more information about the outcomes A and B than another set S', then H(S) < H(S');
b) the acquisition of new information can never increase the value of H.

Conditional Information Entropy Ratio (CIER)

In the same way we reduced the CAP to a single AR statistic in order to have a measure that lends itself to comparison across models, we can use the IE to produce another summary statistic for how well a given model predicts defaults. This is done via the Conditional Information Entropy Ratio (CIER). The CIER compares the amount of uncertainty about default in the case where we have no model (a state of more uncertainty about the possible outcomes) to the amount of uncertainty left over after we have introduced a model (presumably, a state of less ignorance).

To calculate the CIER, we first calculate the IE H_0(p), where p is the default rate of the sample. That is, without attempting to control for any knowledge that we might have about credit quality, we measure the uncertainty associated with the event of default. This entropy reflects knowledge common to all models, that is, the likelihood of the event given by the probability of default. We then calculate the IE H_1(S_1, δ) after having taken into account the predictive power of the model. The CIER is one minus the ratio of the latter to the former, that is,

    CIER(S_1, \delta) = \frac{H_0 - H_1(S_1, \delta)}{H_0}    (6)

If the model held no predictive power, the CIER would be 0. In this case, the model provides no additional information on the likelihood of the outcomes that is not already known.

[12] This resolution allows an easy comparison with agency ratings, whose precision is 1/21 ≈ 5%.
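Equations (3) through (6) can be estimated with the bin-counting approach described above. The sketch below is illustrative rather than the authors' procedure: it assumes model outputs already normalized to [0, 1], natural logarithms, and δ = 0.05.

```python
import numpy as np

def entropy(p):
    """Two-outcome information entropy, equation (3), natural log."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def cier(scores, defaults, delta=0.05):
    """Conditional Information Entropy Ratio, equations (4)-(6).

    scores:   model outputs normalized to [0, 1]
    defaults: 1 for defaulters, 0 otherwise
    """
    scores, defaults = np.asarray(scores, dtype=float), np.asarray(defaults)
    h0 = entropy(defaults.mean())            # H_0: no-model uncertainty
    bins = np.clip((scores / delta).astype(int), 0, int(1 / delta) - 1)
    h1 = 0.0
    for b in np.unique(bins):
        in_bin = bins == b
        p_bin = in_bin.mean()                # P(R_k)
        p_def = defaults[in_bin].mean()      # P(A | R_k)
        h1 += p_bin * entropy(p_def)         # equation (5)
    return 1.0 - h1 / h0                     # equation (6)
```

On a sample where the scores separate defaulters cleanly, H_1 falls well below H_0 and the CIER approaches 1; for randomly assigned scores it stays near 0.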

If the model were perfectly predictive, the information entropy ratio would be 1. In this case, there would be no uncertainty about the outcomes and, therefore, perfect default prediction. Because the CIER measures the relative reduction of uncertainty when the model is introduced, a higher CIER indicates a better model. Exhibit 5 shows the results for the tested models. Using the resampling technique described above, the typical deviation is 0.02.

Exhibit 5. Selected Information Entropy Ratios

  Model                   CIER
  ROA only                0.06
  Z-Score Model           0.09
  Hazard Model            0.11
  Merton Model Variant    0.14
  Nonlinear Model         0.19

Mutual Information Entropy

The information-based measures introduced above are not the only tools available for characterizing credit risk models. Many information-based statistics can be expressed in terms of the information entropy. To quantify the dependence between two models 1 and 2, we use a modified version of the mutual information entropy [13] (also called information redundancy). Let S_1 = {r_1, ..., r_n} and S_2 = {R_1, ..., R_m} be the risk scores associated with models 1 and 2 for a given set of obligors. The mutual information entropy is defined as

    MIE(S_1, S_2, \delta) = \frac{1}{H_0} \left( H_1(S_1, \delta) + H_1(S_2, \delta) - H_2(S_1, S_2, \delta) \right)    (7)

Here H_0 is the entropy of the sample, and

    H_2 = -\sum_{j=1}^{n} \sum_{k=1}^{m} P(r_j, R_k) \left( P(A|r_j, R_k) \log P(A|r_j, R_k) + P(B|r_j, R_k) \log P(B|r_j, R_k) \right)    (8)

The conditional entropy H_2 is also implemented with a bin-counting approach. A partition size δ is chosen and the outputs of the models are then discretized into integers j = 1, ..., n and k = 1, ..., m depending on which bin of size δ they fall into. The mutual information entropy is a measure of how much uncertainty about default events is introduced by model 2 given the output of model 1 with accuracy δ. The last two terms in equation (7) represent the marginal contribution to the overall uncertainty introduced by model 2. If model 2 is completely dependent on model 1, then MIE(S_1, S_2, δ) = 1 - CIER(S_1); that is, the uncertainty introduced by the two models reduces to the uncertainty of one model only.

[13] For the standard definition of mutual entropy see Prichard and Theiler (1995).
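The MIE estimate mirrors the CIER computation, with an additional joint binning for H_2. The sketch below keeps the same assumptions as the CIER sketch (normalized outputs, natural logarithms, δ = 0.05) and reuses its hypothetical entropy helper.

```python
import numpy as np

def mie(scores1, scores2, defaults, delta=0.05):
    """Mutual information entropy, equations (7)-(8).

    scores1, scores2: outputs of models 1 and 2, normalized to [0, 1]
    defaults:         1 for defaulters, 0 otherwise
    """
    s1, s2 = np.asarray(scores1, dtype=float), np.asarray(scores2, dtype=float)
    d = np.asarray(defaults)
    n_bins = int(1 / delta)
    b1 = np.clip((s1 / delta).astype(int), 0, n_bins - 1)
    b2 = np.clip((s2 / delta).astype(int), 0, n_bins - 1)

    h0 = entropy(d.mean())  # H_0: entropy of the sample

    def h1(bins):
        # Conditional information entropy of a single model, equation (5).
        return sum((bins == b).mean() * entropy(d[bins == b].mean())
                   for b in np.unique(bins))

    # H_2: entropy conditioned on the joint (j, k) bins, equation (8).
    h2 = 0.0
    for j in np.unique(b1):
        for k in np.unique(b2[b1 == j]):
            cell = (b1 == j) & (b2 == k)
            h2 += cell.mean() * entropy(d[cell].mean())

    return (h1(b1) + h1(b2) - h2) / h0  # equation (7)
```

Passing the same score set in as both models reproduces the complete-dependence limit stated above, MIE(S_1, S_1, δ) = 1 - CIER(S_1).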

Because the MIE is calculated from the joint conditional distribution of the risk scores S_1 and S_2, this measure requires a large number of defaults to be accurate. Exhibit 6 shows the results for selected pairs of the tested models. In contrast to the CIER, a higher MIE value reveals an increase in the overall uncertainty. Note that the diagonal elements of Exhibit 6 are related to the values of Exhibit 5 through the equality MIE(S_1, S_1, δ) = 1 - CIER(S_1).

Exhibit 6. Selected Mutual Entropy Ratios (MIE)

  Model                   ROA     Z-Score   Hazard   Merton Variant   Nonlinear
  ROA only                0.94    0.98      0.97     0.97             0.97
  Z-Score Model           -       0.91      0.96     0.96             0.95
  Hazard Model            -       -         0.89     0.95             0.93
  Merton Model Variant    -       -         -        0.86             0.87
  Nonlinear Model         -       -         -        -                0.81

5. Model Precision

A key issue in model comparison is determining whether a higher degree of refinement in the scale of a given model's output reflects greater precision, and hence a more powerful model, or whether small increments in estimated risk add no statistically significant value to the assessment of credit risk; that is, whether model outputs can be aggregated into coarse grades with no significant loss of information. Importantly, this limitation does not apply only to agency ratings or other discrete score outputs: owing to data limitations and statistical significance, most models will exhibit some effective granularity in their outputs. For example, EDFs are reported with a granularity of 1/1,000 in steps of 2 basis points, although their true precision and statistical significance are unknown. [14] If the resolution for very low EDFs (high credit quality) is statistically significant, it could indicate that at least a few defaults occurred among what the model considers high quality obligors; in that case, the model is not discriminating these defaulters from the true population of high quality obligors. In contrast, if there are no defaulters among the population of high quality obligors, the EDF value is determined only by the statistical method employed to create the distribution of low EDFs (for example, kernel estimation, spectral methods, or simple histograms). In this situation, the precision of the model for the high credit quality tail might not be supported by the default data directly, but could simply be an artifact of the algorithm used to process the data.

[14] J.A. McQuown (1993).

These two situations would be reflected in the performance measures described above, such as CAP curves or the AR. That is, model precision can be defined in terms of its impact on a performance measure. The analysis is done by quantifying the average information gained with each refinement of the model output; that is, by generating an ensemble of surrogate data sets of model outputs, each of which reproduces the basic properties of the original set at a specific finite precision. For example, the model output range is normalized [15] to the [0,1] interval, and the outputs are then rounded to three digits, then to two digits, and so on. Rounding to two digits provides a precision of 1 in 100 (or 1:100). The minimum finite precision that produces a statistically significant difference in the performance of the model determines the precision of the model output with respect to the selected performance measure.

Exhibit 7 shows the estimated lower and upper precision bounds for the selected benchmark models using the AR on our test sample. The precision of the tested models is in the range of 2% to 10%, which agrees reasonably well with the precision of most institutions' internal scales and agency ratings. The lower precision bound indicates a performance reduction of at least one deviation of the AR value; refinement beyond the upper bound makes no difference in the value of the performance measure.

Exhibit 7. Model Precision Using AR

  Model                   Lower   Upper
  ROA                     1:10    1:50
  Z-Score Model           1:15    1:50
  Hazard Model            1:15    1:50
  Merton Model Variant    1:20    1:50
  Nonlinear Model         1:20    1:50

[15] Models whose outputs increase exponentially (such as probabilities of default) need to be transformed to a linear scale using a logarithmic transformation.
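The rounding experiment just described is straightforward to mechanize. The sketch below is a simplified illustration (not the authors' procedure) that reuses the hypothetical accuracy_ratio helper from the AR section; comparing each level's AR drop against the resampling deviation of the AR would locate the precision bounds of Exhibit 7.

```python
import numpy as np

def precision_profile(scores, defaults, levels=(1000, 100, 50, 20, 10)):
    """AR after rounding scores to 1:level precision, per Section 5."""
    s = np.asarray(scores, dtype=float)
    s = (s - s.min()) / (s.max() - s.min())        # normalize to [0, 1]
    profile = {}
    for level in levels:
        rounded = np.round(s * level) / level      # 1:level precision
        ar, _ = accuracy_ratio(rounded, defaults)  # helper sketched above;
        profile[f"1:{level}"] = ar                 # ties broken arbitrarily
    return profile

# The coarsest level whose AR drops by more than one (resampled)
# deviation from the full-precision AR marks the lower precision bound.
```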

6. Conclusions

We have discussed an approach to validating credit risk models based on alternative model performance measures. These measures are robust and easy to implement, and they can be added to the standard tools used to validate models. In particular, we introduced two types of performance measures: (a) accuracy accounting metrics, which measure cumulative accuracy in predicting defaults, and (b) information content metrics, which measure the level of uncertainty in the risk scores produced by the tested models. Both types of measures can be used to evaluate model performance over the entire range of credit quality, or can be reduced to a single summary statistic that can be used to rank-order competing models.

When models appear to be performing equally well, it is important to know whether they are producing the same information, or different information of equal value. In the former case, either model will do, while in the latter case using both models simultaneously may increase predictive power even if one model outperforms the other. The Mutual Information Entropy and the Conditional Information Entropy Ratio provide measures that can distinguish between cases where different models contribute additional information and cases where they are redundant.

The techniques described in this report are both powerful and flexible under the appropriate conditions. Importantly, all of them produce measures that are specific to the data set on which they are based. Thus, inter-model comparisons should always be based on identical or nearly identical samples, and on samples representative of the general population of obligors. When large and broadly representative testing data are available, these techniques can help determine which model is likely to have the best out-of-sample predictive power.

Bibliography

Caouette, J.B., Altman, E.I., and Narayanan, P., Managing Credit Risk (John Wiley and Sons, NY, 1998).

Herrity, J., Keenan, S.C., Sobehart, J.R., Carty, L.V., and Falkenstein, E.G., "Measuring Private Firm Default Risk," Moody's Special Comment (June 1999).

Jaynes, E.T., "Information Theory and Statistical Mechanics," Physical Review 106 (4) (1957), 620-630.

Kealhofer, S., Kwok, S., and Weng, W., "Uses and Abuses of Bond Default Rates," CreditMetrics Monitor (First Quarter 1998), 37-55.

Keenan, S.C., Shtogrin, I., and Sobehart, J.R., "Historical Default Rates of Corporate Bond Issuers, 1920-1998," Moody's Special Comment (January 1999).

McQuown, J.A., "A Comment on Market vs. Accounting Based Measures of Default Risk," KMV Corporation (1993).

Merton, R.C., "On the Pricing of Corporate Debt: The Risk Structure of Interest Rates," Journal of Finance 29 (1974), 449-470.

Pierce, J.R., Symbols, Signals and Noise: The Nature and Process of Communication (Harper & Brothers, NY, 1970).

Prichard, D., and Theiler, J., "Generalized Redundancies for Time Series Analysis," Physica D 84 (1995), 476-493.

Shannon, C., and Weaver, W., The Mathematical Theory of Communication (University of Illinois Press, Urbana, 1949).

Shumway, T., "Forecasting Bankruptcy More Accurately: A Simple Hazard Model," University of Michigan Business School working paper (1998).

Sobehart, J.R., Stein, R., Mikitkyanskaya, V., and Li, L., "Moody's Public Firm Risk Model: A Hybrid Approach to Modeling Default Risk," Moody's Investors Service Special Comment (March 2000).