Recovery Rate Modelling of Non-performing Consumer Credit Using Data Mining Algorithms

Markus Hoechstoetter, Abdolreza Nazemi, Svetlozar T. Rachev and Caslav Bozic

RMI Working Paper No. 12/09
Submitted: September, 2012

Abstract

There have been more studies on recovery rate modeling of bonds than of personal loans and retail credit. As far as the authors are aware, there exists no research of recovery rate modeling in retail credit for third-party buyers. The goal of this paper is to fill this gap. In our study, over nine million defaulted or non-performing consumer credit data provided by a German debt collection company are used. According to the findings, the optimum times of the collection processes are not the same for all industries. Moreover, from a variety of characteristics, those debtor characteristics that are most significant in predicting the recovery have been determined. To select the best prediction and classification model, a variety of statistical and data mining methods such as logistic regression, neural network, K-nearest neighbor, CHAID, CART, Support Vector Machine and regression will be examined. A two-stage model which first classifies debts to extreme and non-extreme recovery rate is applied; then, the extreme debts are classified into full payment and non-payment. Moreover, the non-extreme recovery rates are predicted.

Keywords: Recovery rate, Data Mining, Third-party Company

Markus Hoechstoetter, School of Business Engineering, Karlsruhe Institute of Technology, Germany
Svetlozar T. Rachev, College of Business, Stony Brook University, New York, USA
Abdolreza Nazemi, National University of Singapore, Risk Management Institute
Caslav Bozic, School of Business Engineering, Karlsruhe Institute of Technology, Germany

Abdolreza Nazemi. Views expressed herein are those of the author and do not necessarily reflect the views of NUS Risk Management Institute (RMI).
1 Markus Hoechstoetter is a postdoctoral researcher at the Chair of Statistics, Econometrics and Mathematical Finance at the School of Economics and Business Engineering, Karlsruhe Institute of Technology, Germany.
2 Abdolreza Nazemi is a postdoctoral researcher at the Risk Management Institute, National University of Singapore, Singapore.
3 Svetlozar Rachev is Professor at the College of Business, Stony Brook University in New York, and Chief Scientist of FinAnalytica, New York, USA.
4 Caslav Bozic is a doctoral student at the AIFB Institute of the Faculty of Economics and Business Engineering, Karlsruhe Institute of Technology, Germany, and a research and development engineer at Avedas AG, Karlsruhe, Germany.
1. Introduction

The latest, or should we say current, crisis is still present in various aspects of life. The damage to the corporate world is documented in Standard & Poor's (2011), for example. However, individuals have also suffered from the crisis, as many employees were laid off and, to make things worse, banks began to restrict lending on a large scale, which resulted in the credit crunch. This is even more worrying since borrowing or, more generally, being in debt has become commonplace worldwide. For example, in the USA, corporate and consumer debt has reached dizzying levels. According to the Board of Governors of the Federal Reserve System (2011), the accumulated debt amounts to over 10 trillion dollars. This trend, however, is not unique to the USA. In Europe, the trend is similar, as pointed out by Thomas (2009); and, if this trend is not reversed, the need for the expansion of lending will remain a pressing issue. A sophisticated system is therefore needed that enables credit to be extended while cushioning the impact of defaulted debt. Hence, there is a need for third-party buyers to relieve originators from the stressful collection business.

In the context of lending and its inherent risk, the terminology and definitions given in the Basel II Accords are widely used. Mainly, the terms 'expected loss', 'loss-given-default' (the loss conditional on default) and 'recovery rate' are commonly used in the context of financial debt. Before the regulation in the Basel Accords was established, lenders in the retail sector designed a now widely used tool, the credit scorecard. This device helps to assess the probability of a new customer defaulting. The actual score is usually a three-digit number computed from a set of consumer characteristics. The scorecard helps to distinguish between potentially good and bad borrowers, i.e., those who do not default and those who do.
As references, we recommend Hand (2001) and Thomas et al. (2002). However, this tool has not proved reliable in predicting complete loss or recovery once the debtor has defaulted on his/her obligations. Instead, logistic regression has been applied more often in the analysis of recovery rates. A problem arising from the definitions in the Basel Accords, however, is that some parameters are not uniquely specified, leaving room for interpretation and diverging implementations. For example, the recovery rate, i.e., the percentage of the original amount of debt outstanding that is repaid to or repossessed by the lender, can be stated either as the price the distressed debt would achieve if sold immediately after default or as the sum of discounted future payments received from the debtor.

The main interest of this paper is the prediction of the recovery rate or, conversely, the loss-given-default (LGD). Practitioners as well as researchers agree that the recovery rate has to be modelled by some quantity that is restricted to the interval between zero and one. But there are no standardized models for the modelling and prediction of the recovery rate. This is common to basically all industries. So, some models include macro-economic variables, which is in line with the Basel Accords, while others are restricted to information on the debt. Most of all, the
methodological side suffers from a lack of freely accessible data, thus hampering the production of reliable results of general value. Moreover, since the collection process can be carried out by the original lender, i.e., in-house, as well as by some third party acting either as agent or as purchaser of the debt, the models have to consider the difference in accessibility of relevant information for the two options. In the model, we will use only debtor-related variables without macro-economic factors. Our findings will reveal that the age of the debtor and the amount of debt outstanding are important determinants of the recovery rate. We furthermore see as one of our main contributions the presentation of results on recovery rates from various non-banking industries on a scale that, to our knowledge, has not been presented before in the literature.

The remainder of the paper is organized as follows. The following section presents the features of the different options of the lender with respect to collection and ownership of the debt. In section 3, statistical and data mining methods will be reviewed in brief. Then, we present our data in section 4. The important results of the exploratory data analysis are set out in section 5. We discuss the recovery rate modelling using data mining methods in section 6. This is followed by an explanation of our results in the final section.

2. Recovery and collection process

In this section, we will present the different collection options that exist for a lender, and their implications for LGD, an excellent account of which is given in Thomas et al. (2011). Generally, collecting debt and recovering it after default consists of a sequence of measures such as written reminders, more specific letters, telephone calls and, ultimately, legal action such as court orders and sending marshals. The different measures commonly result in various actions by the debtor. Often, as a result of a telephone call, creditor and debtor negotiate a repayment plan.
It is believed by most collection entities that a telephone call is preferable to an appearance in person because the conversation is still personal while the debtor is not confronted with a physical intrusion. Usually, the collection process becomes more robust as default continues; in the public perception this is commonly attributed to a debt collector's alleged rude demeanour and it may even discredit the entire lending industry. Proof of this public disapproval of the loan and collection business has been appearing in abundance in the press since the culmination of the latest crisis.

Unless hindered by the legal framework, a lender might consider any of three basic collection options during the life of the lending relationship, the first of which is the internal (or in-house) collection and recovery process. Lenders usually resort to this, at least in the beginning. Secondly, a lender may decide to remain owner of the debt but assign collection to an agent. The third option is to sell the debt to a third party and, hence, end the relationship with the borrower. The measures taken may be the result of discretionary considerations relating to certain customers only, or part of a standardized provision formulating automated procedures. For as long as repayment is
on schedule, the lender will retain the collection in his business. Also, if the lender determines that he is better able than an outside agent to recover the debt after default, based on his experience and also because the relationship with the defaulting borrower is valuable to him, he will opt to collect it himself. When the collection process is too involved for the creditor or might harm his reputation because the language to be applied at a certain stage of collection may have to become more direct, the lender may choose to hand the collection process to an outside agent while retaining ownership of the debt. If default seems more likely or the ownership of the distressed debt causes additional aggravation for the lender, he will probably sell the debt at a discount to a third-party buyer. The price will have to be lower than the recovery value expected by the buyer.

An advantage of internal collection may be that all characteristics concerning the debt are known, whereas a third-party buyer lacks important information such as loan details, borrower repayment behaviour or change in score, which is a privilege of the original lender, according to Fama (1985). The third-party buyer does receive essential information on exactly when default occurred, the exact amount outstanding, and when the last payment was made. The third-party buyer, like the outside agent, however, may suffer from a negative selection because he will most likely only have access to poorly performing debt, as argued in Ramakrishnan and Thakor (1984). According to Thomas et al. (2011), it is not surprising that only 7% repaid the whole debt, 16.3% paid a fraction, and the vast majority, i.e. 83%, did not pay anything when the collection was carried out by a third party. This is in sharp contrast to the outcome when collection was undertaken by the in-house collection department: 30% repaid the total amount, while 60% and 10% repaid only a share or nothing, respectively.
In our results section, we will provide very detailed information on recovery rates faced by a third-party buyer of non-performing consumer debt.

3. Statistical and Data Mining Models

3.1. Regression

3.2. Logistic Regression

Logistic regression is used with quantitative explanatory variables when the response variable is categorical. Let us define a binary random variable as

$$Y = \begin{cases} 1 & \text{if default occurs} \\ 0 & \text{if default does not occur} \end{cases}$$

with $\pi = \Pr(Y = 1)$ and $1 - \pi = \Pr(Y = 0)$, the familiar example being the occurrence of default. The multiple logistic regression is as follows:

$$\pi_i = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)} = \frac{\exp(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}{1 + \exp(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)} \qquad (1)$$
where $\beta$ is the vector of coefficients and $X$ is the matrix of observations.

3.3. Neural Network

The design of the neural network is especially appealing because of its several layers of perceptrons. Common to any design are an input layer, one or more hidden layers of neurons, and an output layer. In the simplest version with just one hidden layer, input data consisting of observations $x_j$ of $j = 1, 2, \dots, d$ variables enter neuron $i$ of the hidden layer to be transformed there into a weighted functional output

$$h_i = f^{(1)}\Big(b_i + \sum_{j=1}^{d} w_{i,j}\, x_j\Big)$$

with weights $w_{i,j}$ and neuron-specific constant $b_i$. Output from all $n_h$ hidden neurons is then turned into network output

$$y = f^{(2)}\Big(b^{(2)} + \sum_{i=1}^{n_h} v_i\, h_i\Big)$$

with neuron weights $v_i$. The neural network allows for a flexible yet sometimes unintuitive design. This technique is particularly apt in separating samples with respect to objective functions such as, for example, zero or full recovery.

3.4. K-nearest neighbor

KNN is a popular method in data mining and is used for classification based on the closest observations in the variable space. The K-nearest neighbor was introduced in the early 1950s. KNN can also be applied for prediction.5 Suppose that the learning sample has $n$ attributes, which means that each learning sample is represented as a point in $n$-dimensional space. To classify a new sample, KNN searches for the K learning samples that are nearest to the new sample and then assigns the new sample the majority class of these K nearest neighbors. When $k = 1$, the class of the new sample is the same as the class of the learning sample nearest to it. The Euclidean distance or the Mahalanobis distance is applied as a distance metric. The Euclidean distance between two points or tuples $X_1 = (x_{11}, x_{12}, \dots, x_{1n})$ and $X_2 = (x_{21}, x_{22}, \dots, x_{2n})$ is defined as

$$D_E(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$

5 See Han and Kamber (2006).
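The classification rule just described (Euclidean distance, then a majority vote among the K nearest learning samples) can be sketched as follows. This is a minimal illustration and not the implementation used in the study; the function names, the toy data and the coding of classes as 0 (non-payment) and 1 (full payment) are assumptions made for the example.

```python
import numpy as np
from collections import Counter


def euclidean_distance(x1, x2):
    """Euclidean distance D_E between two points of equal dimension."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return float(np.sqrt(np.sum((x1 - x2) ** 2)))


def knn_classify(train_X, train_y, new_x, k=3):
    """Assign new_x the majority class among its k nearest learning samples."""
    distances = [euclidean_distance(row, new_x) for row in train_X]
    nearest = np.argsort(distances, kind="stable")[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two normalized attributes per debt,
# class 0 = non-payment, class 1 = full payment (labels are illustrative).
train_X = np.array([[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8], [0.85, 0.95]])
train_y = [0, 0, 1, 1, 1]

print(knn_classify(train_X, train_y, [0.15, 0.15], k=1))
print(knn_classify(train_X, train_y, [0.7, 0.8], k=3))
```

With k = 1 the function reduces to the nearest neighbor rule; larger k averages over more, but more distant, learning samples.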
For calculation, we usually use the normalized variables. Suppose that, for $X = (x_1, \dots, x_n)^T$, the covariance matrix is equal to $\Sigma$ and $\mu = (\mu_1, \dots, \mu_n)^T$ is the mean vector; then the Mahalanobis distance is defined as

$$D_M(X) = \sqrt{(x - \mu)^T\, \Sigma^{-1}\, (x - \mu)}$$

The Mahalanobis distance is based on correlations between variables by which different patterns can be identified and analyzed. It is a useful way of determining similarity of variables. The Euclidean distance applies in many classification problems. In KNN, when the value of a variable is missing, KNN uses the maximum difference between two samples; this means that, if both values of a normalized variable in samples $X_1$ and $X_2$ are missing, the difference is assumed to be 1. On the other hand, if only one of them is missing and the other one has a value $b$, the difference is considered to be $1 - b$ or $b$. For a categorical variable, KNN assumes it to be 1.

For the K-nearest neighbor algorithm, we need to determine the number of nearest neighbors and the distance measure. The best value of $k$ depends only on the data, and we can find it by comparing error rates for different values of $k$. In general, the larger the value of $k$, the more stable the classification model. However, larger values of $k$ mean that the learning samples now being included are not very close to the new sample. If $k$ is equal to 1, then the class of the new sample is predicted as the class of the closest learning sample; this is called the nearest neighbor algorithm and it makes a rather unstable classifier.

3.5. Trees

Breiman and Friedman introduced recursive partitioning algorithms in [...]. Decision trees are usually classified into two groups: classification trees and regression trees. In classification trees, the target variable is categorical or qualitative but we can also use classification trees when the dependent variable is continuous. Thomas et al.
(2002) mentioned that classification trees were applied in credit scoring by Makowski in [...]. The basic idea of tree construction is to find subsets with maximum homogeneity, i.e., subsets whose cases belong to only one class of the target variable. At each step of splitting, tree algorithms split cases on the independent variables that yield maximum homogeneity.

6 See Giudici and Figini (2009), Hand et al. (2001), Han and Kamber (2006).
7 See Thomas et al. (2002), Giudici and Figini (2009).

We define the impurity of a node as a function of the probabilities of the different classes in the node under consideration:

$$i(t) = \varphi(p_1, p_2, \dots, p_J)$$
where $p_j$ is the probability of a case belonging to class $j$. There are different kinds of impurity function with these characteristics. One of the principal differences between tree algorithms is related to the impurity function. Breiman et al. (1984), Deville (2006), Giudici (2003) and Thomas et al. (2002) pointed out certain impurity functions, for example,

Gini: $i(t) = \sum_j p(j \mid t)\,(1 - p(j \mid t))$,
Entropy: $i(t) = -\sum_j p(j \mid t)\, \log p(j \mid t)$, or
Maximize half-sum of squares: $\mathrm{Chi} = \dfrac{n(l)\, n(r)}{n(l) + n(r)}\, \big(p(0 \mid l) - p(0 \mid r)\big)^2$

where $n(r)$ and $n(l)$ are the numbers of observations in the right and left nodes. A large value of the $\chi^2$ statistic Chi means that the two proportions are not the same. The reduction of impurity obtained by a split defines the quality of the split:

$$\Delta i = i(v) - [\pi(l)\, i(l) + \pi(r)\, i(r)]$$

where $\pi(l)$ and $\pi(r)$ are the observed proportions of observations in the left and right nodes. In fact, tree algorithms select the variable that has the best quality of split. Finally, tree algorithms label leaf nodes based on the majority class of the target variable. In regression trees, the fitted value $\hat{y}_i$ is equal to the mean of the dependent variable over the observations in the leaf node under consideration.

Classification and regression trees (CART) are the most common tree algorithms (Breiman et al., 1984). In CART, the target variable can be categorical or continuous. The impurity function of CART is assumed to be Gini or entropy. Chi-square Automatic Interaction Detection (CHAID) was developed by Kass in [...]. Its impurity function is assumed to be chi-square.

3.6. Support Vector Machine

SVMs are used to separate debtors into two categories ($y = 1$ or $y = -1$) based on some hyperplane threshold with perpendicular vector $w$ maximizing the minimal distance of each of the two groups from the threshold. With the optimal hyperplane, the training data keep a minimum distance of $b$ from the hyperplane to guarantee generality of the model.
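To make "maximizing the minimal distance" concrete, the following sketch computes, for a given candidate hyperplane, the smallest signed distance of any labelled point to it; a separating hyperplane is better in the SVM sense when this value is larger. It only evaluates candidates and does not solve the optimization itself. The function name, the toy points and the two candidate hyperplanes are assumptions chosen for illustration, with `w` the perpendicular vector and `b` the offset of the decision function sign(<w, x> + b).

```python
import numpy as np


def geometric_margin(w, b, X, y):
    """Smallest signed distance y_i * (<w, x_i> + b) / ||w|| over all points.

    A positive value means the hyperplane separates the two classes.
    """
    w = np.asarray(w, dtype=float)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    signed_distances = y * (X @ w + b) / np.linalg.norm(w)
    return float(signed_distances.min())


# Hypothetical toy data with labels +1 and -1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

# Two hypothetical separating hyperplanes, each given as (w, b).
margin_a = geometric_margin([1.0, 1.0], -2.0, X, y)
margin_b = geometric_margin([1.0, 0.0], -1.0, X, y)

print(margin_a, margin_b)
```

Both candidates classify the toy points correctly, but the first keeps every point further away from the threshold and would therefore be preferred.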
The optimization problem using all $n$ observations $(y_i, x_i)$, $x_i \in \mathbb{R}^d$, is thus given by

$$\min_{w,b}\ \frac{\|w\|^2}{2}, \quad \text{s.t. } y_i(\langle w, x_i\rangle + b) \ge 1,\ i = 1, 2, \dots, n \qquad (2)$$

or in the dual form

$$\max_{a}\ \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i,j=1}^{n} a_i a_j y_i y_j \langle x_i, x_j\rangle, \quad \text{s.t. } \sum_{i=1}^{n} a_i y_i = 0,\ a_i \ge 0 \qquad (3)$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product. The separating rule is then given by $f(x) = \mathrm{sign}(\langle w, x\rangle + b)$ or, equivalently, $f(x) = \mathrm{sign}\big(\sum_{i=1}^{n} a_i y_i \langle x_i, x\rangle + b\big)$. A problem occurs if the data are not linearly separable as required. To this end, the original data vector $x \in \mathbb{R}^d$ is mapped into a higher dimensional ($K > d$) feature space with a non-linear function $\varphi: \mathbb{R}^d \to \mathbb{R}^K$, $x \mapsto \varphi(x)$. To circumvent the calculation of the inner products and associated dot products in the higher dimension, the so-called kernel trick is applied, requiring only computation
of kernel functions $k(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j)\rangle$ for the dot products. Thus, the transformation into the higher dimensional space can actually be avoided. The resulting separating function is now $f(x) = \mathrm{sign}\big(\sum_{i=1}^{n} a_i y_i\, k(x_i, x) + b\big)$. Common kernel functions are, for example, polynomial $k(x_i, x) = \langle x_i, x\rangle$ or radial basis $k(x_i, x) = \exp(-\|x_i - x\|^2 / c)$. The authors state that the advantages are given by the use of key observations only for the sake of speed, the translation of the discrimination problem into a quadratic problem, and the projection of the original problem onto a higher dimensional space to apply a linear discrimination function. They begin the modelling with a stepwise selection process of the most powerful variables to separate the data set into homogeneous subsets.

LSSVM is a version of SVM that conducts a linear regression of the form $y = w^T \varphi(x) + b + \varepsilon$, with the original data $x$ mapped into a higher feature space by $\varphi$ to obtain a higher degree of linearity. Using a kernel $K(x, x_i) = \varphi(x)^T \varphi(x_i)$ simplifies the optimization in the preferred dual form $y = \sum_{i=1}^{n} a_i\, \varphi(x)^T \varphi(x_i) + e$.

3.7. Dmneural

As far as we are aware, dmneural network training is not a very popular method in data mining, but we will compare the performance of this model to other models. In the learning data-set, dmneural specifies the best principal components of the independent variables for maximum variation in the response variable; consequently, it chooses the best group of independent variables for prediction or classification of the response variable. Dmneural omits the independent variables that carry less information for prediction of the response variable. By construction, the principal components are uncorrelated. An activation function is applied to the linear combination of independent variables and principal components. Matignon (2007) points out eight different activation functions, among them Gaussian, logistic, exponential and square.8 The misclassification rate in the classification problem and the sum of squared errors in the prediction problem are used in specifying the best activation function in the next phases. From the first step onwards, dmneural uses the response variable and the residuals in the prediction or classification of the response variable. Following Matignon (2007), the dmneural model constructs an additive non-linear model of the form

$$\hat{y} = \sum_{i=1}^{nphases} g(f(x, \alpha))$$

where $g$ is the link function and $f$ is the best activation function in phase $i$.

8 See Matignon (2007).
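Two properties mentioned in this subsection, that principal components are uncorrelated and that components are chosen for how much variation in the response they capture, can be illustrated with a short sketch. It does not reproduce the dmneural procedure itself; the function names, the simulated data and the use of the squared correlation with the response as ranking criterion are assumptions made purely for illustration.

```python
import numpy as np


def principal_component_scores(X):
    """Project the centred data onto its principal axes (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt.T


def rank_components_by_r2(scores, y):
    """Rank components by squared correlation with the response (illustrative criterion)."""
    yc = y - y.mean()
    r2 = np.zeros(scores.shape[1])
    for j in range(scores.shape[1]):
        s = scores[:, j]
        denom = (s @ s) * (yc @ yc)
        r2[j] = (s @ yc) ** 2 / denom if denom > 0 else 0.0
    order = np.argsort(-r2)  # most informative component first
    return order, r2


# Simulated, hypothetical data: four correlated inputs and a response
# that is an exact linear function of two of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]
y = 2.0 * X[:, 0] - X[:, 2]

scores = principal_component_scores(X)
order, r2 = rank_components_by_r2(scores, y)
print(order, np.round(r2, 3))
```

Because the component scores are mutually uncorrelated, their individual squared correlations with the response add up to the share of variation explained by all components together, which is what makes a component-by-component selection meaningful.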
4. Data description

4.1. Data provider

Our data consist of close to ten million different unsecured debts purchased between 2001 and 2010 by arvato infoscore, one of the largest debt purchasers in Germany. The company combines a collection business (German Inkasso), scoring services, and factoring. Factoring, as known in Germany, is a particular form of third-party financial service for originator lenders. The most common variations are full-service, selective, notification, semi-factoring, and silent factoring. In the case of normal factoring, the debt buyer, i.e., the factor, receives all debt from the originator in an automatically revolving process agreed upon in advance. The factor is owner as well as collector of the debt after its cession from the originator. It is the most common form of factoring in Germany. Selective factoring describes a construct where only selected debt is sold off to the third-party factor. When the third party offers notification factoring, the debtor is informed about the sale of the debt and can only repay to the third-party factor. The default risk, however, remains with the originator which, in the case of default, has to reimburse the factor. In silent factoring, the debtor is ignorant of the sale of the debt and payment is only possible to the original creditor. A negative consequence for the factor is a lack of influence over the debtor since he is not entitled to collect. And finally, when semi-factoring is chosen between originator and factor, the debtor remains ignorant of the sale of the debt as well, but payments are to be made exclusively to accounts or addresses that belong to the factor. In the case of arvato infoscore, the company engages in full-service factoring.
When collection and scoring businesses are combined in the same company, the third-party buyer, although a legally separate entity, has the advantage that the collection department has often developed long-lasting relationships with debtors during the period of initial ownership by the respective originators. These relationships yield precious information to which the third-party buyer would not have access under any other circumstances. However, legally, this is sometimes limited to the information that would be offered to any third-party buyer. In the following, we consider data only accessible to regular outside buyers.

4.2. Data

The data consist of roughly ten million defaulted or non-performing unsecured receivables from nine different categories that are customers of the third-party buyer. On the one hand, these categories represent the following industries: telecommunications, online shopping and mail ordering, financial services including credit cards, and the utility and energy sector. Moreover, receivables from the non-profit entities of the public sector (community services) and public transport are also part of these categories, as are failed return debit notes as well as anything that does not fit into any of the prior categories; these are subsumed in the miscellaneous category including,
for example, unpaid parking tickets. In the following, we will use these abbreviations to indicate the respective industries: mail order (MO), business-to-business (B2B), energy and utilities (NRGY), financial services (FS), miscellaneous (MI), public sector (PS), return debit note (RDN), telecommunications (TC), and public transport (PT).

Each debtor is assigned a unique identification number. For each receivable a unique identification number is issued, and all payments on the account of a particular receivable have to be labelled with the respective identification number. The relationship between receivable and borrower is not unique since a borrower might have defaulted on more than one receivable in arrears, whereas a receivable in arrears can only belong to one borrowing entity. A payment is characterized by the identification number of the receivable and, thus, can be traced to the corresponding debtor. Furthermore, we selected from all given payment characteristics those that could most easily be transformed into a numerical variable or a categorical variable of low dimension. The resulting variables relating to the debtor include age, gender, residential status and address, as well as current credit history. The variables related to the accounts receivable include age of debt, date of purchase by third party, amount outstanding and last payment date, while the original receivable amount is usually unknown. This yields about 15 variables that can be used for the subsequent analysis. For example, information on the quality of the location of the residence, which can be obtained by transforming the postal code into a rating, has not yet been considered. Henceforth, we will use the terms 'category' and 'industry' interchangeably.

In Table 1, we have presented the most important statistics of the data sorted by industry. It also contains some initial results that we will discuss further in the last section.
As we can see from adding the values of `# debts (original)', the total number of receivables is 9,793,590. Because our computational capacity was limited at the time we received the data, we decided to use only 100,000 randomly selected receivables from the mail order industry and only 500,000 randomly selected receivables from the public transport category. We will use the complete data-set for the other categories. We assume that the means of debts in the complete data-sets of the mail order and public transport categories are the same as in their samples; thus, the amount of debt outstanding is 1,248,266,[...] Euros. The recovery rate is not the same in all categories, and the mean recovery rate ranges between [...] and [...]. The mean recovery rate of the public sector is the lowest and the mail order category has the highest mean recovery rate. The financial services category has the highest mean debt, at [...], and one of the lowest mean recovery rates, at [...].

In Table 1, the difference between mean debt age and mean debt age in third-party gives the duration between default occurrence and sell-off to the third-party company. The categories with shorter periods between default and sell-off to the third-party company have seen more recovery than the categories with longer periods between default and sell-off. For example, these periods in financial services and the public sector are around 43 and 23 months, whereas for mail order and telecommunications they are around five months. The payments that did not convey reliable information on all the
used variables were discarded from further analysis. The missing information is indicated by the respective superscripts of the industry. This was age of debtor, age of receivable, or the identity of the debtor. Moreover, we cleaned with respect to `Earliest entry' and `Last entry' per industry, since there were unreasonable values, most likely the result of laxity during data entry. Eliminating these outliers resulted in the new values as presented here. At this point, it becomes obvious how important the quality of the data is for the third-party buyer since he generally has no means of validating and, if necessary, correcting them. However, the third-party buyer has to cope with many flaws in the data.

TABLE 1

5. Exploratory data analysis

We use the complete data sets except for the mail order and public transport industries. Because our computational capacity was limited at the time we received the data, we decided to use only 100,000 observations randomly selected from mail order and 500,000 observations from the public transport section. The mail order section has the highest recovery rate, namely [...], and the public sector has the lowest recovery rate in the portfolio, namely [...].

Table 2 shows the quantiles of debt amount, recovery rate, time until full payment by debtors who paid fully, and time of last payment by debtors who paid at least something. The very important result from this table is that the optimum time of the collection process depends on the industry. As an illustration, in financial services 99% of fully-paying debtors paid fully before 48 months and 90% of them paid fully before 28 months; meanwhile, 99% of non-fully-paying debtors did not pay more after 39 months and 90% of them did not pay more after 26 months. In contrast, 99% of fully-paying debtors paid fully before 11 months and 90% of them paid fully before 4 months in the miscellaneous category.
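Quantile statements of this kind ("90% paid fully before x months") can be computed per industry in a few lines. The following is a minimal sketch, not the authors' code: the function name, the input layout (a mapping from industry abbreviation to the months until full payment of each fully-paying debtor) and all numbers in the toy data are invented for illustration and are not taken from Table 2.

```python
import numpy as np


def payment_time_quantiles(months_by_industry, probs=(0.90, 0.99)):
    """Return, per industry, the requested quantiles of the time until full payment."""
    result = {}
    for industry, months in months_by_industry.items():
        values = np.asarray(months, dtype=float)
        result[industry] = {p: float(np.quantile(values, p)) for p in probs}
    return result


# Hypothetical months until full payment (invented values, two industries).
toy_months = {
    "FS": [3, 6, 9, 12, 18, 24, 28, 36, 48],
    "MI": [1, 1, 2, 2, 3, 3, 4, 6, 11],
}

for industry, quantiles in payment_time_quantiles(toy_months).items():
    print(industry, quantiles)
```

Comparing the resulting quantiles across industries is one way to judge after how many months further collection effort is unlikely to change the final recovery rate.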
Additionally, 90% of non-fully-paying debtors did not pay more after 6 months and 99% of them did not pay more after 12 months in the miscellaneous category. In other words, the time for calculation of the final recovery rate and the reasonable collection process time differ between categories.

We also analyze the recovery rate distributions for the horizons of 12, 24 and 36 months. (For each horizon, only receivables with an age greater than or equal to the horizon are included.) Our conclusion is that, for all nine industries, the variation in the respective distributions is minimal, with mass slightly shifting from RR = 0 to RR = 1 since more debtors pay off debts as time progresses. After one year, representing most of the receivables, the frequency of RR = 0 is slightly over 60% while, after three years, this frequency decreases minimally to just below 60%. So, either the first year successfully predicted the recovery rate or three years are not long enough as a horizon, since we have censored data as payments are observed on debt that is much older
than the period considered by our scope. So far, our findings appear to contrast with those of the bank loan data.

TABLE 2

6. Recovery rates modelling

The above-mentioned literature, mostly on bank loan data, reports fairly high recovery rates of 60% or more, on average. This may be due to two factors. First, the collection may have been retained by the banks; second, banks tend to have an advantage since they have insight into the borrower's financial situation which lenders from other industries fail to acquire. This is argued by Fama (1985), for example. Since our data are from a non-bank third-party buyer, we expect rather low recovery rates. From Table 1, we see that this is justifiable given that recovery rates are below 40% and even below 30% in many cases.

In the next analysis, we consider the empirical distribution of the recovery rate across the nine different industries. It is apparent that nearly all probability mass is at RR = 0 and RR = 1. Across all nine industries, the majority, by far, of the recoveries are equal to 0. We hypothesize that this is the result of the relatively low average debt amounts (EAD), except for financial services (FS).

After the univariate exploratory analysis, we turn to the modelling itself. We apply a two-stage model which first classifies debts into extreme and non-extreme recovery rates; we then classify the extreme debts into full payment and non-payment. Moreover, the non-extreme recovery rate is predicted. The first two steps are classification problems: the target variable is binary, and the goals are to classify whether a defaulted debt will be extreme or non-extreme and whether an extreme debt will end in full payment or non-payment. The final step is a prediction problem: the response variable is continuous, and the aim is to predict non-extreme recovery rates. As pointed out before, the complete data sets are used in the modelling step, except for the mail order and public transport industries.
We use only 100,000 randomly selected items from the mail order section and 500,000 observations from the public transport section. We delete outlier samples and impute the missing values in the data cleaning phase. After data cleaning, we randomly divide each data set into two sets: a training (or learning) data set and a validation data set. The training data sets contain 70% of the observations and the validation data sets contain 30% of the debts in each industry. We use stratified sampling with equal sizes based on the target variable for training and validation. The number of observations decreases because the data sets are not balanced with respect to the response variable. We build each model on the training data set and then evaluate these models on the validation data set to classify the debts. The misclassification rate is one of the usual criteria in classification model comparison.
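The sampling scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the field name `extreme`, the seed and the toy data are all hypothetical. Each class is undersampled to the size of the rarest class, which is why the number of observations decreases, and each class is then split 70/30.

```python
import random


def balanced_split(records, target, train_frac=0.7, seed=0):
    """Undersample every class to the rarest class size, then split 70/30."""
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[target], []).append(rec)
    n = min(len(group) for group in by_class.values())  # rarest class size
    cut = int(round(train_frac * n))
    train, valid = [], []
    for group in by_class.values():
        sample = rng.sample(group, n)  # equal size per class, random order
        train.extend(sample[:cut])
        valid.extend(sample[cut:])
    return train, valid


def misclassification_rate(y_true, y_pred):
    """Share of observations whose predicted class differs from the true one."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Hypothetical toy data: 70 extreme debts (1) and 30 non-extreme debts (0).
records = [{"id": i, "extreme": 1 if i < 70 else 0} for i in range(100)]
train, valid = balanced_split(records, "extreme")
```

With this toy input, both classes are reduced to 30 debts each, so 40 of the 100 original observations are dropped before the split.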
Data mining algorithms will be used for the classification steps: neural network, CART, CHAID, K-nearest neighbor, dmneural, logistic regression and Support Vector Machine. Thereafter, neural network, CART, CHAID, regression and Support Vector Machine will be applied as prediction models in the final step. In each step, we will compare models using R-squared, ROC curve, misclassification rate, average square error and sum of square errors. However, the basic criteria are misclassification rate and R-squared.

6.1 Classifying debts as extreme and non-extreme

Models building and comparisons

We come to the modelling of the recovery rate by means of the well-known logistic regression model, i.e. the recovery rate RR is the non-linear transform of the linear model including real and coded categorical numerical data. The target variable distinguishes extreme and non-extreme recovery, where 0 indicates non-extreme recovery and 1 indicates extreme recovery. Extreme recovery consists of full payment and non-payment. As an initial step of selecting the individually most significant debtor-related variables, we perform the logistic regression for each individual variable alone and assess its ability through an R-squared measure. This yielded the following set of seven variables individually most significant for predicting RR: the debt amount (debt outstanding at sale), debtor age, prop title, debt date until sell-off (time between default and purchase by third party), rating (classifying the creditworthiness into an ordinal rating with seven levels), address (the validity of the debtor's address), and debtor type (either male, female or corporate entity). We choose a logistic regression model, so a model selection procedure is not applied.
This yielded the logistic regression model

\[
\log\frac{P(Y=1)}{P(Y=0)} = \mu + \alpha\cdot\text{amount} + \beta\cdot\text{debtor age} + \gamma\cdot\text{prop title} + \delta\cdot\text{debt date till selloff} + \rho_1\cdot\text{rating}_1 + \rho_2\cdot\text{rating}_2 + \rho_3\cdot\text{rating}_3 + \rho_4\cdot\text{rating}_4 + \rho_5\cdot\text{rating}_5 + \rho_6\cdot\text{rating}_6 + \rho_7\cdot\text{rating}_7 + \varphi\cdot\text{addresok} + \theta_1\cdot\text{debtor type}_1 + \theta_2\cdot\text{debtor type}_2 + \epsilon
\]

Here, we use the dummy variables rating 1 through rating 6 as well as debtor type 1 and debtor type 2 for the categorical variables. Table 3 shows the maximum likelihood estimates of the logistic regression parameters for classifying debts as extreme or non-extreme, corresponding to the final model, together with the statistical significance of the parameters. For an explanatory variable with a p-value lower than 0.05, the null hypothesis is rejected, which means that the variable has a statistically significant influence on the response variable.

We now interpret the logistic regression model. In our model, Y = 1 when the debt has an extreme recovery rate; thus, variables with negative coefficients decrease the probability of an extreme recovery rate and, inversely, variables with positive coefficients increase it. For example, in financial services, address ok, time between debt occurrence and sell-off to the third-party company, and rating original equal to 5 have statistically significant positive effects on the probability of extreme recovery
rate. The probability of an extreme recovery rate for debtors with an address could be higher than that for debtors without an address. Debts with a longer time between default occurrence and sell-off to the third-party company have a higher probability of an extreme recovery rate. On the other hand, debts with rating original 0, 2 and 4 present a higher probability of a non-extreme recovery rate than other debts. In summary, for financial services, the variables that increase the probability of a non-extreme recovery rate are: not having an address; a longer period between default and sell-off. Comparing with Table 5, we can see that the CHAID tree uses exactly the variables that have a statistically significant influence in the logistic regression. Moreover, the CART tree uses the variables that are significant in the logistic regression and debtor age.

TABLE 3

CART and CHAID algorithms are used as classification methods. The chi-squared statistic and the Gini index are the impurity measures for CHAID and CART, respectively. To obtain a parsimonious tree, we use a significance level of 0.2 in the stopping rule. Table 4 presents the results from the CART classification tree analysis for financial services. The total number of splitting variables in the CART classification tree for financial services is 6: address, rating, debt amount, debt date until sell-off, prop title and debtor age. All the independent variables are used in the CART tree, except debtor type. The splitting variables in the CHAID classification tree for financial services are address, rating, debt amount, debt date until sell-off, prop title and debtor age. The CART tree is slightly more complicated than the CHAID tree for financial services. As mentioned, we classify all the debts based on the majority of debts in each leaf. We can calculate the misclassification rate as a performance measure. In the next section, we will examine the misclassification rate of our models on the training and testing data sets.
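To make the Gini criterion mentioned above concrete, the following minimal sketch computes the Gini impurity of a node, the impurity reduction achieved by a candidate binary split, and the majority class used to label a leaf. It is an illustration only: the function names and the toy labels are hypothetical, and the split on a valid address is an assumed example rather than a result from Table 4.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum(p_k^2) over the class shares p_k in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(parent, left, right):
    """Impurity reduction of a split: parent impurity minus weighted children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


def majority_class(labels):
    """Class assigned to every debt that falls into this leaf."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical node: 1 = extreme recovery, 0 = non-extreme recovery.
parent = [1, 1, 1, 0, 1, 0, 0, 0]
with_address = [1, 1, 1, 1]      # assumed left child (address valid)
without_address = [0, 0, 0, 0]   # assumed right child (address invalid)
gain = gini_gain(parent, with_address, without_address)
```

A CART-style tree would evaluate such a gain for every candidate split and keep the largest one; a perfectly separating split, as in this toy case, removes all of the parent's impurity.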
Before applying a neural network, we have to specify its structure. We used neural networks with one and two hidden layers consisting of between two and five neurons. For example, we examined neural networks with 2, 3, 4 and 5 neurons for the data sets. In general, neural networks are black boxes and we cannot interpret them. An important step in using the K-nearest-neighbor method is the specification of the width K. This establishes the size of the neighborhood of the independent variables that will be used to classify the target variable. We checked the misclassification rate of KNN for different values of K.

TABLE 4
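The check of different values of K can be sketched as below. This is a simplified illustration under stated assumptions: Euclidean distance, majority voting and the two-dimensional toy points are choices made here for the example and are not taken from the paper.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, point, k):
    """Majority class among the k training points closest to `point`."""
    by_distance = sorted(
        (math.dist(row, point), label) for row, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def misclassification_by_k(train_x, train_y, valid_x, valid_y, candidates):
    """Validation misclassification rate for each candidate K."""
    rates = {}
    for k in candidates:
        predictions = [knn_predict(train_x, train_y, p, k) for p in valid_x]
        wrong = sum(1 for t, p in zip(valid_y, predictions) if t != p)
        rates[k] = wrong / len(valid_y)
    return rates


# Hypothetical, already scaled features; 1 = extreme, 0 = non-extreme.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 0.8), (0.8, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]
valid_x = [(0.15, 0.15), (0.85, 0.85)]
valid_y = [0, 1]

rates = misclassification_by_k(train_x, train_y, valid_x, valid_y, [1, 3, 5])
best_k = min(rates, key=rates.get)
```

Odd values of K are used here so that a binary vote cannot tie. In practice the features would need to be put on a comparable scale before distances are computed.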
As mentioned before, we divide all the data sets into two sets: training (or learning) data sets and validation data sets. The training data sets contain 70% of the observations and the validation data sets contain 30% of the debts in each industry. We use stratified sampling with equal sizes in the target variable, which means that the numbers of extreme and non-extreme debts are the same in both the training and the validation data sets. We build each model on the training data set and then evaluate these models on the validation data set to classify the debts. The misclassification rate is one of the usual criteria in model comparison. We do not need to use cross-validation because our data sets are large; consequently, our results are stable.

In Table 5, we show criteria such as the misclassification rate, the sum of square errors in training and validation, and the average square error in the validation data sets for the different industries. The misclassification rate in the validation data set is the principal criterion for model comparison in this study. The CHAID classification tree is the best model for classifying the debts into extreme and non-extreme recovery in public sector, financial services and mail order. Based on the misclassification rate in the validation data set, neural network is the best classifier for business-to-business, miscellaneous and return debit note. On the other hand, the CART classification tree is the best classification model for energy and utilities and public transport. The Support Vector Machine has the highest accuracy rate in the validation data set for telecommunications. For example, Figure 1 shows the ROC curve for financial services. In a ROC curve, one model dominates another when its curve lies completely above the curve of the other model. The CHAID classification tree has the highest accuracy rate between our models in different industries at 0.805%.
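A ROC curve of the kind referred to above can be computed from the scores a classifier assigns to the validation debts. The sketch below is illustrative only: the scores are made up, the function names are hypothetical, and the area under the curve is added here as a convenient one-number summary that the paper itself does not report. It assumes both classes are present in the data.

```python
def roc_points(y_true, scores):
    """(false positive rate, true positive rate) pairs from (0, 0) to (1, 1)."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; a debt is flagged 1 if score >= threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a ROC curve given as ordered (fpr, tpr) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical validation labels (1 = extreme) and scores from two models.
y_true = [1, 1, 0, 1, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
scores_b = [0.6, 0.3, 0.7, 0.5, 0.4, 0.2]

auc_a = area_under_curve(roc_points(y_true, scores_a))
auc_b = area_under_curve(roc_points(y_true, scores_b))
```

If the curve of one model lies completely above that of another, its area is at least as large, so comparing areas is consistent with the dominance criterion. The converse does not hold: crossing curves can still differ in area.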
TABLE 5 and FIG. 1

6.2 Classifying extreme debts to full payment and non-payment

Models building and comparisons

We now come to the modelling of the recovery rate by the logistic regression model. The response variable distinguishes full payment and non-payment, where 0 indicates non-payment and 1 indicates full payment. The independent variables are the debt amount, debtor age, prop title, debt date until sell-off, rating, address, and debtor type. This yielded the regression model

\[
\log\frac{P(Y=1)}{P(Y=0)} = \mu + \alpha\cdot\text{amount} + \beta\cdot\text{debtor age} + \gamma\cdot\text{prop title} + \delta\cdot\text{debt date till selloff} + \rho_1\cdot\text{rating}_1 + \rho_2\cdot\text{rating}_2 + \rho_3\cdot\text{rating}_3 + \rho_4\cdot\text{rating}_4 + \rho_5\cdot\text{rating}_5 + \rho_6\cdot\text{rating}_6 + \rho_7\cdot\text{rating}_7 + \varphi\cdot\text{addresok} + \theta_1\cdot\text{debtor type}_1 + \theta_2\cdot\text{debtor type}_2 + \epsilon
\]

Here, we use the dummy variables rating 1 through rating 6 as well as debtor type 1 and debtor type 2 for the categorical variables. The maximum likelihood estimates of the logistic regression parameters for classifying extreme debts are shown in Table 6.
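The log-odds form of the model can be inverted to obtain the probability of full payment, P(Y = 1) = 1 / (1 + exp(-η)), where η denotes the right-hand side of the equation. The short sketch below illustrates how the sign of a coefficient moves that probability. The coefficient values are invented for the illustration and are not the estimates reported in Table 6; only two of the explanatory variables are included to keep the example short.

```python
import math


def linear_predictor(intercept, coefficients, debt):
    """Right-hand side of the logit equation for one debt."""
    return intercept + sum(coefficients[name] * debt[name] for name in coefficients)


def probability(log_odds):
    """Inverse logit: turns log(P(Y=1) / P(Y=0)) into P(Y=1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients, NOT the estimates from Table 6.
intercept = 0.5
coefficients = {"amount": -0.8, "addresok": 1.2}

small_debt = {"amount": 0.5, "addresok": 1}
large_debt = {"amount": 2.0, "addresok": 1}
no_address = {"amount": 0.5, "addresok": 0}

p_small = probability(linear_predictor(intercept, coefficients, small_debt))
p_large = probability(linear_predictor(intercept, coefficients, large_debt))
p_no_address = probability(linear_predictor(intercept, coefficients, no_address))
```

With a negative coefficient on the debt amount, the larger debt receives the lower probability of full payment; with a positive coefficient on the address indicator, the debt without a valid address receives a lower probability than the otherwise identical debt with one.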
We now interpret the logistic regression model. In our model, Y = 1 when the debtor has paid in full; thus, variables with negative coefficients decrease the probability of full payment and, inversely, variables with positive coefficients increase it. For example, in financial services, address not ok, debt amount, debt date until sell-off, and rating original with labels 5 and 6 have a statistically significant negative effect on the recovery rate of the debtor; also, prop title and rating original with labels 0, 2 and 4 have a significant positive influence on the recovery rate in financial services. The probability of full payment by debtors with an address could be higher than that by debtors without an address. Similarly, debts with a longer period from default to sell-off have a lower probability of full payment. On the other hand, debts with rating original 0 and 1 present a higher probability of full payment than other debts. In summary, for financial services, the variables that increase the probability of non-payment are: not having an address; a higher debt amount; a longer period between default and sell-off; rating original = 05, 06.

As with the telecommunications and financial services categories, we can interpret the logistic regression models for the other categories. The address ok, debt amount, rating original and debt date until sell-off variables are statistically significant in all industries. Also, the debtor age and debtor type variables are significant except in financial services, and the prop title variable has a significant influence in financial services, miscellaneous, return debit note, energy and utilities, telecommunications and mail order. As in the previous step, we use CART, CHAID, SVM, neural network, KNN and dmneural as classification methods. The misclassification rate in the validation data sets is used for model comparison.
Table 7 shows criteria such as the misclassification rate, the sum of square errors in training and validation, and the average square error in the validation data set for the different industries. The classification tree is the best model for classifying the debts with extreme recovery into full payment and non-payment in return debit note, energy and utilities and telecommunications. Based on the misclassification rate in the validation data set, neural network is the best classifier for business-to-business, financial services and mail order. The Support Vector Machine has the highest accuracy rate in the validation data sets for the public sector, public transport and miscellaneous categories.

TABLES 6, 7
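The targets of the two classification steps, and the way the fitted models could be chained into one recovery rate prediction, can be sketched as follows. This is an assumption-laden illustration: the function names, the tolerance used to decide whether a recovery rate counts as 0 or 1, and the stand-in models passed as callables are all hypothetical. The codings (1 = extreme, 0 = non-extreme; 1 = full payment, 0 = non-payment) follow the text.

```python
def stage_labels(recovery_rate, tol=1e-9):
    """Return (extreme, full_payment) targets for one debt.

    extreme is 1 for RR = 0 or RR = 1 and 0 otherwise; full_payment is only
    defined for extreme debts (1 = full payment, 0 = non-payment).
    """
    is_zero = recovery_rate <= tol
    is_one = recovery_rate >= 1.0 - tol
    if is_zero or is_one:
        return 1, (1 if is_one else 0)
    return 0, None


def predict_recovery(debt, is_extreme, is_full_payment, predict_rate):
    """Chain the three fitted models into one recovery rate prediction."""
    if is_extreme(debt):                        # step 1: extreme or not
        return 1.0 if is_full_payment(debt) else 0.0   # step 2
    return predict_rate(debt)                   # step 3: non-extreme rate


# Hypothetical stand-ins for fitted models, for demonstration only.
example = predict_recovery(
    {"amount": 120.0},
    is_extreme=lambda d: d["amount"] < 500.0,
    is_full_payment=lambda d: d["amount"] < 200.0,
    predict_rate=lambda d: 0.4,
)
```

Passing the models as callables keeps the sketch independent of any particular algorithm, so a tree, a neural network or an SVM could be plugged into each step.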
6.3 Prediction of non-extreme debts

Models building and comparisons

We now come to the prediction of the recovery rate using the regression model. The response variable is the recovery rate of non-extreme debts. The independent variables are the same as in the previous steps: the debt amount (debt outstanding at sale), debtor age, prop title, debt date until sell-off (time between default and purchase by third party), rating (classifying the creditworthiness into an ordinal rating with seven levels), address (the validity of the debtor's address), and debtor type (either male, female, or corporate entity). Our regression model is:

\[
y = \mu + \alpha\cdot\text{amount} + \beta\cdot\text{debtor age} + \gamma\cdot\text{prop title} + \delta\cdot\text{debt date till selloff} + \rho_1\cdot\text{rating}_1 + \rho_2\cdot\text{rating}_2 + \rho_3\cdot\text{rating}_3 + \rho_4\cdot\text{rating}_4 + \rho_5\cdot\text{rating}_5 + \rho_6\cdot\text{rating}_6 + \rho_7\cdot\text{rating}_7 + \varphi\cdot\text{addresok} + \theta_1\cdot\text{debtor type}_1 + \theta_2\cdot\text{debtor type}_2 + \epsilon
\]

We choose a regression model, so a model selection procedure is not applied. Table 8 shows the maximum likelihood estimates of the regression parameters for predicting the recovery rate of non-extreme debts. For an explanatory variable with a p-value lower than 0.05, the null hypothesis is rejected, which means that the variable has a statistically significant influence on the response variable.

We now interpret this model. The response variable in our model is the recovery rate of debts; thus, variables with negative coefficients decrease the debt's recovery rate and, inversely, variables with positive coefficients increase it. For example, in business-to-business, the address not ok, debt amount and debtor age variables decrease the recovery rate. On the other hand, female debtors, rating original equal to 0 and 1, prop title and time between default and sell-off to the third-party company have a positive influence on the recovery rate.
The interpretation of the regression models for the other categories is the same as for the business-to-business category. The debt amount, debt date until sell-off and rating variables are significant in all industries, while the address variable is not significant except in the miscellaneous category. The debtor type is statistically significant in the prediction of the recovery rate except in mail order and public transport. Only in financial services does the debtor age have no significant influence.

We use CART and CHAID algorithms as prediction models. The F-test statistic and variance reduction are the impurity measures for CHAID and CART, respectively. In addition, SVM, regression and neural network are applied to all the data sets. As mentioned before, we divided the non-extreme data sets into two sets: training and validation data sets. The training data sets contain 70% of the debts and the validation data sets contain 30% of the observations in each industry. We build each model on the learning data set and then
these models are evaluated on the validation data set for prediction of the recovery rate of non-extreme debts. The R-squared and the sum of square errors are used as performance measures. Table 9 shows criteria such as R-squared, sum of square errors and average square error for the different industries. The CHAID or CART tree is the best prediction model for the recovery rate of non-extreme debts in public sector, financial services, business-to-business, mail order, telecommunications and public transport. On the other hand, neural network is the best model for the return debit note and mail order categories, and SVM is the best predictor in the miscellaneous category.

TABLES 8, 9

7. Conclusions

We analyzed the recovery rates of 9,779,239 debts of a third-party company in this paper; these were classified into 9 categories: mail order, business-to-business, energy and utilities, financial services, miscellaneous, public sector, return debit note, telecommunications, and public transport. The loss given default is not the same in all categories. The mail order category has the highest mean recovery rate and the mean recovery rate of the public sector is the lowest. The financial services category has the highest mean debt and the lowest mean recovery rate. According to Table 1, the category with a shorter debt date until sell-off has a higher mean recovery rate. As shown in Table 2, the optimum durations of the collection process are not the same in all the industries. For instance, one year is definitely enough collection time for miscellaneous but it is not long enough in the financial services category.

Neural network, CART, CHAID, Support Vector Machine, K-nearest neighbor, dmneural and logistic regression were applied in classifying debts into extreme and non-extreme recovery rates. These techniques were used on all the data sets.
The decision tree algorithms are the best models in five industries and neural network has the best performance in three industries. The Support Vector Machine is the best classifier in the telecommunications category. The important advantage of the classification tree is the interpretability of this model. Conversely, neural network and SVM are black box methods.

We applied neural network, CART, CHAID, Support Vector Machine, K-nearest neighbor, dmneural and logistic regression in classifying extreme debts into full payment and non-payment. The decision tree algorithms are the best models in three industries and neural network has the best performance in three industries. The Support Vector Machine is the best classifier in the public sector, public transport and miscellaneous categories.

Neural network, CART, CHAID, Support Vector Machine and regression were used as prediction models for debts with non-extreme recovery rates. The neural network has the best performance
in two industries and the Support Vector Machine is the best predictor for the miscellaneous category. Decision trees have the best results on the other data sets. In summary, the non-statistical methods produce better results than the statistical methods.

References

Altman, E. I., A. Resti, and A. Sironi (2005). Recovery risk: The next challenge in credit risk management, Chapter Loss given default; a review of the literature in recovery risk. Risk Books, London.
Asarnow, E. and D. Edwards (1995). Measuring loss on defaulted bank loans: A 24-year study. Journal of Commercial Lending Vol. 77(7).
Avery, R. B., P. U. Calem, and G. B. Canner (2004). Consumer credit scoring: Do situational circumstances matter. Journal of Banking and Finance Vol. 28.
Bastos, J. (2010a). Forecasting bank loans loss-given default. Journal of Banking and Finance Vol. 34(10).
Bastos, J. (2010b). Predicting bank loan recovery rates with neural networks. Technical report, Working Paper.
BCBS (2005). International convergence of capital measurement and capital standards: A revised framework. Bank for International Settlements.
Bellotti, T. and J. Crook (2008). Modelling and predicting loss given default for credit cards. Technical report, Working Paper.
Bellotti, T. (2011, forthcoming). Loss given default models incorporating macroeconomic variables for credit cards. International Journal of Forecasting.
Board of Governors of the Federal Reserve System (2011). Federal Reserve statistical release.
Calabrese, R. (2010). Regression for recovery rates with both continuous and discrete characteristics. Proceedings of the 45th Scientific Meeting of the Italian Statistical Society (SIS), Italy.
Chen, T. H. and C. W. Chen (2010). Application of data mining to the spatial heterogeneity of foreclosed mortgages. Expert Systems with Applications Vol. 37(2).
Crook, J., D. Edelman, and L. Thomas (2007). Recent developments in consumer credit risk assessment.
European Journal of Operational Research Vol. 183(3).