Recovery Rate Modelling of Nonperforming Consumer Credit Using Data Mining Algorithms


Markus Hoechstoetter, Abdolreza Nazemi, Svetlozar T. Rachev and Caslav Bozic

RMI Working Paper No. 12/09
Submitted: September, 2012

Abstract

There have been more studies on recovery rate modeling of bonds than of personal loans and retail credit. As far as the authors are aware, there exists no research on recovery rate modeling in retail credit for third-party buyers. The goal of this paper is to fill this gap. In our study, data on over nine million defaulted or nonperforming consumer credits provided by a German debt collection company are used. According to the findings, the optimum times of the collection processes are not the same for all industries. Moreover, from a variety of characteristics, those debtor characteristics that are most significant in predicting the recovery have been determined. To select the best prediction and classification model, a variety of statistical and data mining methods such as logistic regression, neural network, K-nearest neighbor, CHAID, CART, Support Vector Machine and regression will be examined. A two-stage model is applied which first classifies debts into extreme and non-extreme recovery rates; then, the extreme debts are classified into full payment and non-payment. Moreover, the non-extreme recovery rates are predicted.

Keywords: Recovery rate, Data Mining, Third-party Company

Markus Hoechstoetter, School of Business Engineering, Karlsruhe Institute of Technology, Germany
Svetlozar T. Rachev, College of Business, Stony Brook University, New York, USA
Abdolreza Nazemi, National University of Singapore, Risk Management Institute
Caslav Bozic, School of Business Engineering, Karlsruhe Institute of Technology, Germany

Abdolreza Nazemi. Views expressed herein are those of the author and do not necessarily reflect the views of NUS Risk Management Institute (RMI).
Markus Hoechstoetter (a, 1), Abdolreza Nazemi (b, 2), Svetlozar T. Rachev (c, d, 3), Caslav Bozic (a, 4)

a School of Business Engineering, Karlsruhe Institute of Technology, Germany
b Risk Management Institute, National University of Singapore, Singapore
c College of Business at Stony Brook University, New York, USA
d FinAnalytica, New York, USA

1 Markus Hoechstoetter is a postdoctoral researcher at the Chair of Statistics, Econometrics and Mathematical Finance at the School of Economics and Business Engineering, Karlsruhe Institute of Technology, Germany.
2 Abdolreza Nazemi is a postdoctoral researcher at the Risk Management Institute, National University of Singapore, Singapore.
3 Svetlozar Rachev is Professor at the College of Business, Stony Brook University in New York and Chief Scientist of FinAnalytica.
4 Caslav Bozic is a doctoral student at the AIFB Institute of the Faculty of Economics and Business Engineering, Karlsruhe Institute of Technology, Germany and a research and development engineer at Avedas AG, Karlsruhe, Germany.
1. Introduction

The latest, or should we say current, crisis is still present in various aspects of life. The damage to the corporate world can be looked up in Standard & Poor's (2011), for example. However, individuals have also suffered from the crisis as many employees were laid off and, to make things worse, banks began to suppress lending on a large scale, which resulted in the credit crunch. This is even more worrying since borrowing or, more generally, being in debt has become an omnipresent situation worldwide. For example, in the USA, corporate and consumer debt has reached dizzying levels. According to the Board of Governors of the Federal Reserve System (2011), the accumulated debt amounts to over 10 trillion dollars. This trend, however, is not unique to the USA. In Europe, the trend is similar, as pointed out by Thomas (2009); and, if this trend is not reversed, the need for the expansion of lending will remain a pressing issue. It is obvious that there has to be a sophisticated system that will enable credit to be extended and guarantee that the impact of defaulted debt will be cushioned. Hence, there is a need for third-party buyers to relieve originators from the stressful collection business.

In the context of lending and the inherent risk, the terminology and definitions given in the Basel II Accords are widely used. Mainly, the terms 'expected loss', 'loss-given-default' (the loss conditional on default) and 'recovery rate' are commonly used in the context of financial debt. Before the regulation in the Basel Accords was constituted, lenders in the retail sector designed a now widely used tool, the credit scorecard. This device helps to assess the probability of a new customer defaulting. The actual score is usually a three-digit number. It is computed from the set of consumer characteristics. The scorecard helps to distinguish between potentially good and bad borrowers, i.e., those who do not default and those who do.
As references, we recommend Hand (2001) and Thomas et al. (2002). However, this tool has not proved reliable in predicting complete loss or recovery once the debtor has defaulted on his/her obligations. Instead, logistic regression has been applied more often in the analyses of recovery rates.

A problem arising from the definitions in the Basel Accords, however, is that some parameters are not uniquely specified, leaving room for interpretation and diverging implementations. For example, the recovery rate, as the percentage of the original amount of debt outstanding that is repaid to or repossessed by the lender, can be stated either as the price the distressed debt would achieve if sold immediately after default or as the sum of discounted future payments received from the debtor.

The main interest of this paper is the prediction of the recovery rate or, conversely, the loss-given-default (LGD). Practitioners as well as researchers agree that the recovery rate has to be modelled by some quantity that is restricted to the interval between zero and one. But there are no standardized models for the modelling and prediction of the recovery rate. This is common to basically all industries. So, some models include macroeconomic variables, which is in line with the Basel Accords, while others are restricted to information on the debt. Most of all, the
methodological side suffers from a lack of freely accessible data, thus hampering the production of reliable results of general value. Moreover, since the collection process can be carried out by the original lender, i.e., in-house, as well as by some third party acting either as agent or as purchaser of the debt, the models have to consider the difference in accessibility of relevant information for the two options. In the model, we will use only debtor-related variables without macroeconomic factors. Our findings will reveal that the age of the debtor and the amount of debt outstanding are important determinants of the recovery rate. We furthermore see as one of our main contributions the presentation of results on recovery rates from various non-banking industries on a scale that, to our knowledge, has not been presented before in the literature.

The remainder of the paper is organized as follows. The following section presents the features of the different options of the lender with respect to collection and ownership of the debt. In section 3, statistical and data mining methods will be reviewed in brief. Then, we present our data in section 4. The important results of the exploratory data analysis are set out in section 5. We discuss the recovery rate modelling using data mining methods in section 6. This is followed by an explanation of our results in section 7.

2. Recovery and collection process

In this section, we will present the different collection options that exist for a lender, and their implications for LGD, an excellent account of which is given in Thomas et al. (2011). Generally, collecting debt and recovering it after default consists of a sequence of measures such as written reminders, more specific letters, telephone calls and, ultimately, legal action such as court orders and sending marshals. The different measures commonly result in various actions by the debtor. Often, as a result of a telephone call, creditor and debtor negotiate a repayment plan.
Most collection entities believe that a telephone call is preferable to an appearance in person because the conversation is still personal while the debtor is not confronted with a physical intrusion. Usually, the collection process becomes more robust as default continues; in the public perception this is commonly attributed to a debt collector's alleged rude demeanour and it may even discredit the entire lending industry. Proof of this public disapproval of the loan and collection business has been appearing in abundance in the press since the culmination of the latest crisis.

Unless hindered by the legal framework, a lender might consider any of three basic collection options during the life of the lending relationship. The first is the internal (or in-house) collection and recovery process. Lenders usually resort to this, at least in the beginning. Secondly, a lender may decide to remain owner of the debt but assign collection to an agent. The third option is to sell the debt to a third party and, hence, end the relationship with the borrower. The measures taken may be the result of discretionary considerations relating to certain customers only, or part of a standardized provision formulating automated procedures. For as long as repayment is
on schedule, the lender will retain the collection in his business. Also, if the lender determines that he is better able than an outside agent to recover the debt after default, based on his experience and also because the relationship with the defaulting borrower is valuable to him, he will opt to collect it himself. When the collection process is too involved for the creditor or might harm his reputation, because the language to be applied at a certain stage of collection may have to become more direct, the lender may choose to hand the collection process to an outside agent while retaining ownership of the debt. If default seems more likely or the ownership of the distressed debt causes additional aggravation for the lender, he will probably sell the debt at a discount to a third-party buyer. The price will have to be lower than the recovery value expected by the buyer.

An advantage of internal collection may be that all characteristics concerning the debt are known, whereas a third-party buyer lacks important information such as loan details, borrower repayment behaviour or change in score, which is a privilege of the original lender, according to Fama (1985). The third-party buyer does receive essential information on exactly when default occurred, the exact amount outstanding, and when the last payment was made. The third-party buyer, like the outside agent, however, may suffer from a negative selection because he will most likely only have access to poorly performing debt, as argued in Ramakrishnan and Thakor (1984). According to Thomas et al. (2011), it is not surprising that only 0.7% repaid the whole debt, 16.3% paid a fraction, and the vast majority, i.e. 83%, did not pay anything when the collection was carried out by a third party. This is in sharp contrast to the outcome when collection was undertaken by the in-house collection department: 30% repaid the total amount, while 60% and 10% repaid only a share or nothing, respectively.
In our results section, we will provide very detailed information on recovery rates faced by a third-party buyer of nonperforming consumer debt.

3. Statistical and Data Mining Models

3.1. Regression

3.2. Logistic Regression

Logistic regression is used for quantitative variables, particularly when the response variable is categorical. Let us define a binary random variable as

$$Y = \begin{cases} 1 & \text{if default occurs} \\ 0 & \text{if default does not occur} \end{cases}$$

with $\pi = \Pr(Y = 1)$ and $1 - \pi = \Pr(Y = 0)$, which is the well-known example of the occurrence of default. The multiple logistic regression is as follows:

$$\pi_i = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)} = \frac{\exp(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}{1 + \exp(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)} \qquad (1)$$
where $\beta$ is the vector of coefficients and $X$ is the matrix of observations.

3.3. Neural Network

The design of the neural network is especially appealing because of its several layers of perceptrons. Common to any design are an input layer, one or more hidden layers of neurons, and an output layer. In the simplest version with just one hidden layer, input data consisting of observations $x_j$ of $j = 1, 2, \dots, d$ variables enters neuron $i$ of the hidden layer to be transformed there into a weighted functional output

$$h_i = f^{(1)}\Big(b_i + \sum_{j=1}^{d} w_{i,j}\, x_j\Big)$$

with weights $w_{i,j}$ and neuron-specific constant $b_i$. Output from all $n_h$ hidden neurons is then turned into network output

$$y = f^{(2)}\Big(b^{(2)} + \sum_{i=1}^{n_h} v_i\, h_i\Big)$$

with neuron weights $v_i$. The neural network allows for a flexible yet sometimes unintuitive design. This technique is particularly apt in separating samples with respect to objective functions such as, for example, zero or full recovery.

3.4. K-nearest neighbor

KNN is a popular method in data mining and is used for classification based on the closest observations in the variable space. The K-nearest neighbor method was introduced in the early 1950s. KNN can also be applied for prediction (see Han and Kamber, 2006). Suppose that the learning sample has $n$ attributes, which means each learning sample is represented as a point in $n$-dimensional space. To classify a new sample, KNN looks for the $K$ learning samples that are nearest to the new sample. It then assigns the new sample to the majority class among these $K$ nearest neighbors. When $k = 1$, the class of the new sample is the same as the class of the learning sample nearest to it. The Euclidean distance or the Mahalanobis distance is applied as a distance metric. The Euclidean distance between two points or tuples $X_1 = (x_{11}, x_{12}, \dots, x_{1n})$ and $X_2 = (x_{21}, x_{22}, \dots, x_{2n})$ is defined as

$$D_E(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$
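The majority-vote rule and the Euclidean distance described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the two normalized features (debtor age and amount outstanding) and the class labels "full" and "none" are hypothetical and are not taken from the data used in this paper.

```python
import math
from collections import Counter


def euclidean(x1, x2):
    """Euclidean distance D_E between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


def knn_classify(train, new_point, k):
    """Assign new_point to the majority class among its k nearest samples.

    train is a list of (features, label) pairs.
    """
    neighbours = sorted(train, key=lambda s: euclidean(s[0], new_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical learning sample with normalized features
# (debtor age, amount outstanding); not data from the paper.
train = [
    ((0.10, 0.20), "full"),
    ((0.15, 0.25), "full"),
    ((0.80, 0.90), "none"),
    ((0.85, 0.80), "none"),
    ((0.90, 0.95), "none"),
]

print(knn_classify(train, (0.12, 0.22), k=1))  # full
print(knn_classify(train, (0.12, 0.22), k=5))  # none
```

With k = 5 every learning sample votes, so the three distant "none" samples outvote the two close "full" samples; the choice of k therefore changes the predicted class even for the same new sample.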
For calculation, we usually use the normalized variables. Suppose that, for $X = (x_1, \dots, x_n)^T$, the covariance matrix is equal to $\Sigma$ and $\mu = (\mu_1, \dots, \mu_n)^T$ is the mean vector; then the Mahalanobis distance is defined as

$$D_M(X) = \sqrt{(x - \mu)^T\, \Sigma^{-1}\, (x - \mu)}$$

The Mahalanobis distance is based on correlations between variables by which different patterns can be identified and analyzed. It is a useful way of determining similarity of variables. The Euclidean distance applies in many classification problems. In KNN, when the value of a variable is missing, KNN uses the maximum difference between two samples; this means that, if both values of a normalized variable in samples $X_1$ and $X_2$ are missing, the difference is assumed to be 1. On the other hand, if only one of them is missing and the other one has a value $b$, the difference is considered to be $1 - b$ or $b$, whichever is larger. For a categorical variable, KNN assumes it to be 1.

For the K-nearest neighbor algorithm, we need to determine the number of nearest neighbors and the distance measure. The best value of $k$ depends only on the data and we can find it by comparing error rates for different values of $k$. In general, the larger the value of $k$, the more stable the classification model. However, larger values of $k$ mean that the learning samples now being included are not very close to the new sample. If $k$ is equal to 1, then the class of the new sample is predicted as the class of the closest learning sample; this is called the nearest neighbor algorithm and it makes a rather unstable classifier.

3.5. Trees

Breiman and Friedman introduced recursive partitioning algorithms. Decision trees are usually classified into two groups: classification trees and regression trees. In classification trees, the target variable is categorical or qualitative but we can also use classification trees when the dependent variable is continuous. Thomas et al.
(2002) mentioned that classification trees were applied in credit scoring by Makowski. The basic idea of tree construction is to find subsets with maximum homogeneity, that is, subsets whose cases belong to only one class of the target variable. At each splitting step, tree algorithms split the cases on the independent variable that yields maximum homogeneity.

6 See Giudici and Figini (2009), Hand et al. (2001), Han and Kamber (2006).
7 See Thomas et al. (2002), Giudici and Figini (2009).

We define the impurity of a node as a function of the probabilities of the different classes in the node under consideration:

$$i(t) = \varphi(p_1, p_2, \dots, p_J)$$
where $p_j$ is the probability of cases belonging to class $j$. There are different kinds of impurity function with these characteristics, and one of the principal differences between tree algorithms lies in the impurity function used. Breiman et al. (1984), Deville (2006), Giudici (2003) and Thomas et al. (2002) pointed out certain impurity functions, for example:

Gini: $i(t) = \sum_j p(j \mid t)\,(1 - p(j \mid t))$
Entropy: $i(t) = -\sum_j p(j \mid t)\, \log p(j \mid t)$
Maximize half-sum of squares: $\mathrm{Chi} = \frac{n(l)\, n(r)}{n(l) + n(r)}\, \big(p(0 \mid l) - p(0 \mid r)\big)^2$

where $n(r)$ and $n(l)$ are the numbers of observations in the right and left nodes. A large value of the $\chi^2$ statistic Chi means that the two proportions are not the same. The quality of a split is defined as the reduction of impurity that the split obtains:

$$\Delta i = i(v) - [\pi(l)\, i(l) + \pi(r)\, i(r)]$$

where $\pi(l)$ and $\pi(r)$ are the observed proportions of observations assigned to the left and right nodes. In fact, tree algorithms select the variable that has the best split quality. Finally, tree algorithms label leaf nodes based on the majority class of the target variable. In regression trees, the fitted value $\hat{y}_i$ equals the mean of the dependent variable over the observations in the leaf node under consideration.

Classification and regression trees (CART) are the most common tree algorithms (Breiman et al., 1984). In CART, the target variable can be categorical or continuous. The impurity function of CART is assumed to be Gini or entropy. Chi-square Automatic Interaction Detection (CHAID) was developed by Kass. Its impurity function is assumed to be chi-square.

3.6. Support Vector Machine

SVMs are used to separate debtors into two categories ($y = 1$ or $y = -1$) based on some hyperplane threshold with perpendicular vector $w$ maximizing the minimal distance of each of the two groups from the threshold. With the optimal hyperplane, the training data keep a minimum distance from the hyperplane to guarantee generality of the model.
The optimization problem using all $n$ observations $(y_i, x_i)$, $x_i \in \mathbb{R}^d$, is thus given by

$$\min_{w,b}\ \frac{\|w\|^2}{2}, \quad \text{s.t. } y_i(\langle w, x_i \rangle + b) \ge 1, \quad i = 1, 2, \dots, n \qquad (2)$$

or in the dual form

$$\max_{a}\ \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i,j=1}^{n} a_i a_j y_i y_j \langle x_i, x_j \rangle, \quad \text{s.t. } \sum_{i=1}^{n} a_i y_i = 0,\ a_i \ge 0 \qquad (3)$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product. The separating rule is then given by $f(x) = \mathrm{sign}(\langle w, x \rangle + b)$ or, equivalently, $f(x) = \mathrm{sign}\big(\sum_{i=1}^{n} a_i y_i \langle x_i, x \rangle + b\big)$.

A problem occurs if the data are not linearly separable as required. To this end, the original data vector $x \in \mathbb{R}^d$ is mapped into a higher dimensional ($K > d$) feature space with a nonlinear function $\varphi : \mathbb{R}^d \to \mathbb{R}^K$, $x \mapsto \varphi(x)$. To circumvent the calculations of the inner products and associated dot products in the higher dimension, the so-called kernel trick is applied, requiring only computation
of kernel functions $k(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$ for the dot products. Thus, the transformation into the higher dimensional space can actually be avoided. The resulting separating function is now $f(x) = \mathrm{sign}\big(\sum_{i=1}^{n} a_i y_i\, k(x_i, x) + b\big)$. Common kernel functions are, for example, polynomial $k(x_i, x) = \langle x_i, x \rangle$ or radial basis $k(x_i, x) = \exp(-\|x_i - x\|^2 / c)$.

The authors state that the advantages are given by the use of key observations only for the sake of speed, the translation of the discrimination problem into a quadratic problem, and the projection of the original problem onto a higher dimensional space to apply a linear discrimination function. They begin the modelling with a stepwise selection process of the most powerful variables to separate the data set into homogeneous subsets.

LS-SVM is a version of SVM that conducts a linear regression of the form $y = \varphi(x)^T b + \varepsilon$ with the original data $x$ mapped into a higher feature space by $\varphi$ to obtain a higher degree of linearity. Using a kernel $K(x, x_i) = \varphi(x)^T \varphi(x_i)$ simplifies the optimization in the preferred dual form $y = \sum_{i=1}^{n} a_i\, \varphi(x)^T \varphi(x_i) + e$.

3.7. Dmneural

As far as we are aware, dmneural network training is not a very popular method in data mining, but we will compare the performance of this model to other models. In the learning dataset, dmneural specifies the principal components of the independent variables that best explain the variation in the response variable; consequently, it chooses the best group of independent variables for prediction or classification of the response variable. Dmneural omits the independent variables that carry less information for prediction of the response variable. By construction, principal components are uncorrelated. An activation function is applied to the linear combination of independent variables and principal components. Matignon (2007) points out eight different activation functions, among them Gaussian, logistic, exponential and square.
The misclassification rate in the classification problem and the sum of squared errors in the prediction problem are used to specify the best activation function in the next phases. Starting from the first step, dmneural uses the response variable and the residuals in the prediction or classification of the response variable. Dmneural constructs an additive nonlinear model, given by Matignon (2007) as:

$$\hat{y} = \sum_{i=1}^{n_{\text{phases}}} g(f(x, \alpha))$$

where $g$ is the link function and $f$ is the best activation function in phase $i$.

8 See Matignon (2007).
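Returning to the tree algorithms of Section 3.5, the Gini and entropy impurity functions and the split quality $\Delta i$ can be illustrated with a small sketch. It assumes that class probabilities are estimated by the observed class shares in a node and that $\pi(l)$ and $\pi(r)$ are the shares of the parent's observations sent to each child; the function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def class_probs(labels):
    """Observed class shares p(j|t) in a node holding these labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: sum_j p(j|t) * (1 - p(j|t))."""
    return sum(p * (1 - p) for p in class_probs(labels))


def entropy(labels):
    """Entropy impurity: -sum_j p(j|t) * log p(j|t)."""
    return -sum(p * math.log(p) for p in class_probs(labels))


def split_quality(parent, left, right, impurity=gini):
    """Impurity reduction: i(v) - [pi(l) * i(l) + pi(r) * i(r)]."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Toy node: 1 = full payment, 0 = non-payment (illustrative labels only).
parent = [1, 1, 1, 1, 0, 0, 0, 0]
print(split_quality(parent, [1, 1, 1, 1], [0, 0, 0, 0]))  # perfect split
print(split_quality(parent, [1, 1, 0, 0], [1, 1, 0, 0]))  # uninformative split
```

A split that separates the two classes completely removes all impurity of the parent node, whereas a split that reproduces the parent's class mix in both children yields no reduction; a tree algorithm would prefer the former.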
4. Data description

4.1. Data provider

Our data consist of close to ten million different unsecured debts purchased between 2001 and 2010 by arvato infoscore, one of the largest debt purchasers in Germany. The company combines a collection business (German: Inkasso), scoring services, and factoring. Factoring, as known in Germany, is a particular form of third-party financial service for originator lenders. The most common variations are full-service, selective, notification, semi-factoring, and silent factoring.

In the case of normal factoring, the debt buyer, i.e., the factor, receives all debt from the originator in an automatically revolving process agreed upon in advance. The factor is owner as well as collector of the debt after its cession from the originator. It is the most common form of factoring in Germany. Selective factoring describes a construct where only selected debt is sold off to the third-party factor. When the third party offers notification factoring, the debtor is informed about the sale of the debt and can only repay to the third-party factor. The default risk, however, remains with the originator which, in the case of default, has to reimburse the factor. In silent factoring, the debtor is ignorant of the sale of the debt and payment is only possible to the original creditor. A negative consequence for the factor is a lack of influence over the debtor since he is not entitled to collect. And finally, when semi-factoring is chosen between originator and factor, the debtor remains ignorant of the sale of the debt as well, but payments are to be made exclusively to accounts or addresses that belong to the factor. In the case of arvato infoscore, the company engages in full-service factoring.
Although a legally separate entity, in the case of collection and scoring businesses combined in the same company, the third-party buyer has the advantage that the collection department has often developed long-lasting relationships with debtors during the period of initial ownership by the respective originators. These relationships yield precious information the third-party buyer would not have access to under any other circumstances. However, legally, this is sometimes limited to the information that would be offered to any third-party buyer. In the following, we consider only data accessible to regular outside buyers.

4.2. Data

The data consist of roughly ten million defaulted or nonperforming unsecured receivables from nine different categories that are customers of the third-party buyer. These categories represent the following industries: telecommunications, online shopping and mail ordering, financial services including credit cards, and the utility and energy sector. Moreover, receivables from the non-profit entities of the public sector (community services) and public transport are also part of these categories, as are failed return debit notes as well as anything that does not fit into any of the prior categories; these are subsumed in the miscellaneous category including,
for example, unpaid parking tickets. In the following, we will use these abbreviations to indicate the respective industries: Mail order (MO), business-to-business (B2B), energy and utilities (NRGY), financial services (FS), miscellaneous (MI), public sector (PS), return debit note (RDN), telecommunications (TC), and public transport (PT).

Each debtor is assigned a unique identification number. For each receivable a unique identification number is issued, and all payments on the account of a particular receivable have to be labelled with the respective identification number. The relationship between receivable and borrower is not one-to-one since a borrower might have defaulted on more than one receivable in arrears, whereas a receivable in arrears can only belong to one borrowing entity. A payment is characterized by the identification number of the receivable and, thus, can be traced to the corresponding debtor.

Furthermore, we selected from all given payment characteristics those that could most easily be transformed into a numerical variable or a categorical variable of low dimension. The resulting variables relating to the debtor include age, gender, residential status and address, as well as current credit history. The variables related to the accounts receivable include age of debt, date of purchase by third party, amount outstanding and last payment date, while the original receivable amount is usually unknown. This yields about 15 variables that can be used for the subsequent analysis. Some information has not yet been considered, for example the quality of the location of the residence, which can be obtained by transforming the postal code into a rating. Henceforth, we will use the terms 'category' and 'industry' interchangeably.

In Table 1, we have presented the most important statistics of the data sorted by industry. It also contains some initial results that we will discuss further in the last section.
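The relationship between debtors, receivables and payments described above can be sketched as a small data structure. The sketch assumes, purely for illustration, that the recovery rate of a receivable is the undiscounted sum of its payments divided by the amount outstanding, restricted to the interval between zero and one; the class names, field names and figures are hypothetical and do not reproduce the provider's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Receivable:
    receivable_id: int         # unique per receivable
    debtor_id: int             # one debtor may hold several receivables
    industry: str              # e.g. "TC", "MO", "FS"
    amount_outstanding: float  # assumed to be strictly positive


@dataclass
class Payment:
    receivable_id: int  # links the payment to exactly one receivable
    amount: float


def recovery_rates(receivables, payments):
    """Map each receivable id to its recovery rate in [0, 1].

    Assumption: RR = undiscounted payments / amount outstanding.
    """
    paid = defaultdict(float)
    for payment in payments:
        paid[payment.receivable_id] += payment.amount
    return {
        r.receivable_id: min(1.0, max(0.0, paid[r.receivable_id] / r.amount_outstanding))
        for r in receivables
    }


# Debtor 7 has defaulted on two receivables, debtor 8 on one.
receivables = [
    Receivable(101, 7, "TC", 200.0),
    Receivable(102, 7, "MO", 50.0),
    Receivable(103, 8, "FS", 1000.0),
]
payments = [Payment(101, 50.0), Payment(101, 50.0), Payment(102, 50.0)]

rates = recovery_rates(receivables, payments)
print(rates)  # {101: 0.5, 102: 1.0, 103: 0.0}
```

Because every payment carries the receivable's identification number, tracing it back to the debtor only requires a lookup of the receivable record.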
As we can see from adding the values of '# debts (original)', the total number of receivables is 9,793,590. Because our computational capacity was limited at the time we received the data, we decided to use only 100,000 randomly selected receivables from the mail order industry and only 500,000 randomly selected receivables from the public transport category. We will use the complete dataset for other categories. We assume that the means of debts in the complete datasets of mail order and public transport categories are the same as in their samples; thus, the amount of debt outstanding is 1,248,266, Euros. The recovery rate is not the same in all categories. The mean recovery rate of the public sector is the lowest and the mail order category has the highest mean recovery rate. The financial services category has the highest mean debt and one of the lowest mean recovery rates.

In Table 1, if we subtract mean debt age from mean debt age in third-party, we have the duration between default occurrence and sell-off to the third-party company. The categories with shorter periods between default and sell-off to the third-party company have seen more recovery than the categories with longer periods between default and sell-off. For example, these periods in financial services and public sector are around 43 and 23 months, whereas for mail order and telecommunications, they are around five months. The payments that did not convey reliable information on all the
used variables were discarded from further analysis. The missing information is indicated by the respective superscripts of the industry. This was age of debtor, age of receivable, or the identity of the debtor. Moreover, we cleaned with respect to 'Earliest entry' and 'Last entry' per industry, since there were unreasonable values, most likely the result of laxity during data entry. Eliminating these outliers resulted in the new values as presented here. At this point, it becomes obvious how important the quality of the data is for the third-party buyer since he generally has no means of validating and, if necessary, correcting them. However, the third-party buyer has to cope with many flaws in the data.

TABLE 1

5. Exploratory data analysis

We use the complete data sets except for the mail order and public transport industries. Because our computational capacity was limited at the time we received the data, we decided to use only 100,000 observations randomly selected from mail order and 500,000 observations from the public transport section. The mail order section has the highest recovery rate and the public sector has the lowest recovery rate in the portfolio. Table 2 shows the quantiles of debt amount, recovery rate, time until full payment by debtors who paid fully and time of last payment by debtors who paid at least something. A very important result from this table is that the optimum time of the collection process depends on the industry. As an illustration, in financial services 99% of fully-paying debtors paid fully before 48 months and 90% of them paid fully before 28 months; meanwhile, 99% of non-fully-paying debtors did not pay more after 39 months and 90% of them did not pay more after 26 months. In contrast, 99% of fully-paying debtors paid fully before 11 months and 90% of them paid fully before 4 months in the miscellaneous category.
Additionally, 90% of non-fully-paying debtors did not pay more after 6 months and 99% of them did not pay more after 12 months in the miscellaneous category. In other words, the time for calculation of the final recovery rate and the reasonable collection process time differ across categories.

We also analyze the recovery rate distributions for the horizons of 12, 24 and 36 months (for each horizon, only receivables with an age greater than or equal to the horizon are included). Our conclusion is that, for all nine industries, the variation in the respective distributions is minimal, with mass slightly shifting from RR = 0 to RR = 1 since more debtors pay off debts as time progresses. After one year, representing most of the receivables, the frequency of RR = 0 is slightly over 60% while, after three years, this frequency decreases minimally to just below 60%. So, either the first year successfully predicted the recovery rate or three years are not long enough as a horizon, since we have censored data as payments are observed on debt that is much older
than the period considered by our scope. So far, our findings appear to contrast with those of the bank loan data.

TABLE 2

6. Recovery rates modelling

The above-mentioned literature, mostly on bank loan data, reports fairly high recovery rates of 60% or more, on average. This may be due to two factors. First, the collection may have been retained by the banks; second, banks tend to have an advantage since they have insight into the borrower's financial situation which lenders from other industries fail to acquire. This is argued by Fama (1985), for example. Since our data are from a non-bank third-party buyer, we expect rather low recovery rates. From Table 1, we see that this expectation is justified, given that recovery rates are below 40% and even below 30% in many cases.

In the next analysis, we consider the empirical distribution of the recovery rate across the nine different industries. It is apparent that nearly all probability mass is at RR = 0 and RR = 1. Across all nine industries, the majority, by far, of the recoveries are equal to 0. We hypothesize that this is the result of the relatively low average debt amounts (EAD), except for financial services (FS).

After the univariate exploratory analysis, we turn to the modelling itself. We apply a two-stage model which first classifies debts into extreme and non-extreme recovery rates; we then classify the extreme debts into full payment and non-payment. Moreover, the non-extreme recovery rate is predicted. The first two steps are classification problems, since the target variable is binary: the goals are to classify whether a defaulted debt will be extreme or non-extreme and whether an extreme debt will end in full payment or non-payment. The final step is a prediction problem, since the response variable is continuous and the aim is to predict non-extreme recovery rates. As pointed out before, the complete datasets are applied, apart from the modelling step for the mail order and public transport industries.
We use only 100,000 randomly selected items from the mail order section and 500,000 observations from the public transport section. We delete outlier samples and impute the missing values in the data cleaning phase. After data cleaning, we randomly divide each dataset into two sets: a training (or learning) dataset and a validation dataset. The training datasets contain 70% of the observations and the validation datasets contain 30% of the debts in each industry. We used stratified sampling with equal sizes based on the target variable for training and validation. The number of observations decreases because the datasets are not balanced with respect to the response variable. We build each model on the training dataset and then evaluate it on the validation dataset to classify the debts. The misclassification rate is one of the usual criteria in classification model comparison.
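The sampling scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "equal sizes" means downsampling every class to the size of the rarest class before the 70/30 split, and the names `balanced_split`, `misclassification_rate` and the toy field `extreme` are hypothetical.

```python
import random
from collections import defaultdict

def balanced_split(records, target, train_frac=0.7, seed=0):
    """Downsample each class to the size of the rarest class (assumption),
    then split every class train_frac / (1 - train_frac)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec[target]].append(rec)
    n_min = min(len(group) for group in by_class.values())
    train, valid = [], []
    for group in by_class.values():
        sample = rng.sample(group, n_min)   # equal size per class
        cut = round(train_frac * n_min)
        train.extend(sample[:cut])
        valid.extend(sample[cut:])
    return train, valid

def misclassification_rate(actual, predicted):
    """Share of observations whose predicted class differs from the actual one."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Toy data: 300 extreme debts (1) and 100 non-extreme debts (0).
debts = [{"id": i, "extreme": 1 if i % 4 else 0} for i in range(400)]
train, valid = balanced_split(debts, "extreme")
```

With unbalanced classes the balanced sample is smaller than the original data, which is consistent with the remark that the number of observations decreases.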
Data mining algorithms will be used for the classification steps: neural network, CART, CHAID, K-nearest neighbor, dmneural, logistic regression and Support Vector Machine. Thereafter, neural network, CART, CHAID, regression and Support Vector Machine will be applied as prediction models in the final step. In each step, we will compare models using R-squared, the ROC curve, the misclassification rate, the average square error and the sum of square errors. However, the basic criteria are the misclassification rate and R-squared.

6.1 Classifying debts as extreme and nonextreme

Models building and comparisons

We come to the modelling of the recovery rate by means of the well-known logistic regression model, i.e. the recovery rate RR is a non-linear transform of a linear model that includes real-valued and numerically coded categorical data. The target variable distinguishes extreme and non-extreme recovery, where 0 indicates non-extreme recovery and 1 indicates extreme recovery. Extreme recovery consists of full payment and non-payment. As an initial step in selecting the individually most significant debtor-related variables, we perform the logistic regression for each variable alone and assess its explanatory ability through an R² measure. This yielded the following set of seven variables that are individually most significant for predicting RR: the debt amount (debt outstanding at sale), debtor age, prop title, debt date until selloff (time between default and purchase by the third party), rating (classifying the creditworthiness into an ordinal rating with seven levels), address (the validity of the debtor's address), and debtor type (either male, female or corporate entity). We fix the logistic regression model in advance, so no model selection procedure is applied.
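The one-variable-at-a-time screening can be illustrated with a small sketch. The text does not say which R² measure was used, so McFadden's pseudo-R² is assumed here purely for illustration; the fitting routine (plain gradient ascent on a standardized variable), the function names and the toy data are likewise hypothetical.

```python
import math

def fit_logit(x, y, lr=0.5, steps=2000):
    """Fit P(Y=1) = 1 / (1 + exp(-(a + b*z))) by gradient ascent,
    where z is the standardized version of x."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    z = [(v - mean) / std for v in x]
    a = b = 0.0
    for _ in range(steps):
        p = [1.0 / (1.0 + math.exp(-(a + b * zi))) for zi in z]
        a += lr * sum(yi - pi for yi, pi in zip(y, p)) / n
        b += lr * sum((yi - pi) * zi for yi, pi, zi in zip(y, p, z)) / n
    return a, b, z

def log_likelihood(a, b, z, y):
    """Sum of log fitted probabilities of the observed classes."""
    total = 0.0
    for zi, yi in zip(z, y):
        eta = a + b * zi
        signed = eta if yi == 1 else -eta
        total -= math.log1p(math.exp(-signed))
    return total

def mcfadden_r2(x, y):
    """1 - logL(model) / logL(intercept-only model)."""
    a, b, z = fit_logit(x, y)
    rate = sum(y) / len(y)
    null = sum(math.log(rate) if yi else math.log(1 - rate) for yi in y)
    return 1.0 - log_likelihood(a, b, z, y) / null

# Toy target and two candidate variables: one informative, one unrelated.
y = [1 if i >= 100 else 0 for i in range(200)]
informative = [i / 100 for i in range(200)]
noise = [i % 2 for i in range(200)]
r2_informative = mcfadden_r2(informative, y)
r2_noise = mcfadden_r2(noise, y)
```

Ranking the candidate variables by such a score and keeping the highest ones is one way to arrive at a short list like the seven variables named above.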
Including these seven variables yields the logistic regression model

\[
\log\frac{P(Y=1)}{P(Y=0)} = \mu + \alpha\,\mathrm{amount} + \beta\,\mathrm{debtor\ age} + \gamma\,\mathrm{prop\ title} + \delta\,\mathrm{debt\ date\ till\ selloff} + \rho_1\,\mathrm{rating}_1 + \rho_2\,\mathrm{rating}_2 + \rho_3\,\mathrm{rating}_3 + \rho_4\,\mathrm{rating}_4 + \rho_5\,\mathrm{rating}_5 + \rho_6\,\mathrm{rating}_6 + \rho_7\,\mathrm{rating}_7 + \varphi\,\mathrm{addresok} + \theta_1\,\mathrm{debtor\ type}_1 + \theta_2\,\mathrm{debtor\ type}_2 + \epsilon
\]

Here, we use the dummy variables rating 1 through rating 6 as well as debtor type 1 and debtor type 2 for the categorical variables. Table 3 shows the maximum likelihood estimates of the logistic regression parameters for classifying debts as extreme or non-extreme in the final model, together with the statistical significance of the parameters. For the explanatory variables, when the p-value is lower than 0.05, the null hypothesis is rejected. This means these explanatory variables have a statistically significant influence on the response variable.

Now, we want to interpret the logistic regression model. In our model, Y = 1 when the debt has an extreme recovery rate; thus, variables with negative coefficients decrease the probability of an extreme recovery rate and, inversely, variables with positive coefficients increase it. For example, in financial services, address ok, the time between debt occurrence and selloff to the third-party company, and rating original equal to 5 have statistically significant positive effects on the probability of extreme recovery
rate. The probability of an extreme recovery rate for debtors with a valid address is higher than for debtors without one. Debts with a longer time between default occurrence and selloff to the third-party company have a higher probability of an extreme recovery rate. On the other hand, debts with rating original equal to 0, 2 and 4 present a higher probability of a non-extreme recovery rate than other debts. In summary, for financial services, the variables that increase the probability of a non-extreme recovery rate are:

- Do not have address
- Longer period between default and selloff

If we compare with Table 5, we can see that the CHAID tree uses exactly the variables that have a statistically significant influence in the logistic regression. Moreover, the CART tree uses the variables that are significant in the logistic regression plus debtor age.

TABLE 3

The CART and CHAID algorithms are used as classification methods. The chi-squared statistic and the Gini index are the impurity measures for CHAID and CART, respectively. To obtain a parsimonious tree, we use a significance level of 0.2 in the stopping rule. Table 4 presents the results from the CART classification tree analysis for financial services. The total number of splitting variables in the CART classification tree for financial services is six: address, rating, debt amount, debt date until selloff, prop title and debtor age. All the independent variables are used in the CART tree, except debtor type. The splitting variables in the CHAID classification tree for financial services are address, rating, debt amount, debt date until selloff, prop title and debtor age. The CART tree is slightly more complicated than the CHAID tree for financial services. As mentioned, we classify all the debts in a leaf according to the majority class in that leaf. We can then calculate the misclassification rate as a performance measure. In the next section, we will examine the misclassification rate of our models on the training and testing datasets.
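To make the two splitting criteria concrete, the following sketch evaluates one candidate binary split with the Gini impurity reduction (the CART criterion) and the Pearson chi-squared statistic (the basis of the CHAID criterion). The counts are invented for illustration and the function names are hypothetical; CHAID implementations additionally convert the statistic into an adjusted p-value, which is omitted here.

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_gain(left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    parent = [l + r for l, r in zip(left, right)]
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    weighted = (n_left / n) * gini(left) + (n_right / n) * gini(right)
    return gini(parent) - weighted

def chi_squared(left, right):
    """Pearson chi-squared statistic of the child-by-class contingency table."""
    col_totals = [l + r for l, r in zip(left, right)]
    n = sum(col_totals)
    stat = 0.0
    for row in (left, right):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical split on address validity; counts are [non-extreme, extreme].
valid_address = [30, 70]
invalid_address = [70, 30]
gain = gini_gain(valid_address, invalid_address)
stat = chi_squared(valid_address, invalid_address)
```

Both numbers grow as the children become purer than the parent, so each algorithm prefers the split that maximizes its own criterion.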
Before applying a neural network, we have to specify its structure. We used neural networks with one and two hidden layers consisting of between two and five neurons. For example, we examined neural networks with 2, 3, 4 and 5 neurons for the datasets. In general, neural networks are black boxes and we cannot interpret them. An important step in using K-nearest neighbor (KNN) is the specification of K. This establishes the size of the neighborhood in the space of the independent variables that is used to classify the target variable. We checked the misclassification rate of KNN for different values of K.

TABLE 4
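Checking the misclassification rate for different K can be sketched as a small search over candidate values. The data, the candidate set (1, 3, 5) and the function names below are hypothetical, and the features are assumed to be already scaled to comparable ranges.

```python
from collections import Counter

def knn_predict(train, point, k):
    """Majority vote among the k training rows closest to `point`
    (squared Euclidean distance)."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], point)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def validation_error(train, valid, k):
    """Misclassification rate of k-NN on the validation rows."""
    wrong = sum(knn_predict(train, x, k) != label for x, label in valid)
    return wrong / len(valid)

# Toy rows: ((feature_1, feature_2), class label).
train = [((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.3), 0), ((0.8, 0.9), 0),
         ((0.7, 0.8), 1), ((0.9, 0.7), 1), ((0.8, 0.6), 1), ((0.2, 0.3), 1)]
valid = [((0.15, 0.15), 0), ((0.25, 0.2), 0), ((0.85, 0.8), 1), ((0.75, 0.7), 1)]

# Odd K values avoid tied votes with two classes.
errors = {k: validation_error(train, valid, k) for k in (1, 3, 5)}
best_k = min(errors, key=errors.get)
```

The K with the lowest validation misclassification rate is kept; on data sets of the size used in this study an approximate or indexed neighbor search would be needed instead of the full sort shown here.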
As mentioned before, we divide all the datasets into two sets: training (or learning) datasets and validation datasets. The training datasets contain 70% of the observations and the validation datasets contain 30% of the debts in each industry. We use stratified sampling with equal sizes in the target variable, which means that the numbers of extreme and non-extreme debts are the same within both the training and the validation datasets. We build each model on the training dataset and then evaluate it on the validation dataset to classify the debts. The misclassification rate is one of the usual criteria in model comparison. We do not need to use cross-validation because our datasets are large; consequently, our results are stable.

In Table 5, we show criteria such as the misclassification rate, the sum of square errors in training and validation, and the average square error in the validation datasets for the different industries. The misclassification rate in the validation dataset is the principal criterion for model comparison in this study. The CHAID classification tree is the best model for classifying the debts into extreme and non-extreme recovery in public sector, financial services and mail order. Based on the misclassification rate in the validation dataset, the neural network is the best classifier for business-to-business, miscellaneous and return debit note. On the other hand, the CART classification tree is the best classification model for energy and utilities and public transport. The Support Vector Machine has the highest accuracy rate in the validation dataset for telecommunications. For example, Figure 1 shows the ROC curve for financial services. In a ROC plot, one model dominates another when its curve lies completely above the curve of the other model. The CHAID classification tree has the highest accuracy rate among our models in the different industries, at 0.805.
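The dominance criterion for ROC curves can be checked mechanically. The sketch below builds the ROC points of two score vectors and tests whether one curve is nowhere below the other on a grid of false positive rates. Labels, scores and function names are hypothetical, scores are assumed to be distinct (no tie handling), and weak dominance (greater than or equal) is used.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs obtained by lowering
    the decision threshold one observation at a time."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def tpr_at(points, fpr):
    """Highest true positive rate reachable without exceeding `fpr`."""
    return max(t for f, t in points if f <= fpr)

def dominates(a, b, grid=101):
    """True if curve a is at least as high as curve b at every grid point."""
    fprs = [i / (grid - 1) for i in range(grid)]
    return all(tpr_at(a, f) >= tpr_at(b, f) for f in fprs)

labels = [1, 1, 0, 1, 0, 0]
scores_a = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]   # ranks every positive first
scores_b = [0.9, 0.4, 0.8, 0.3, 0.7, 0.1]   # mixes the two classes
curve_a = roc_points(labels, scores_a)
curve_b = roc_points(labels, scores_b)
```

When neither curve dominates, a single-number summary such as the misclassification rate used in this study is needed to rank the models.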
TABLE 5 and FIG. 1

6.2 Classifying extreme debts to full payment and nonpayment

Models building and comparisons

We now come to the modelling of the recovery rate by the logistic regression model. The response variable distinguishes full payment and non-payment, where 0 indicates non-payment and 1 indicates full payment. The independent variables are the debt amount, debtor age, prop title, debt date until selloff, rating, address, and debtor type. This yielded the regression model

\[
\log\frac{P(Y=1)}{P(Y=0)} = \mu + \alpha\,\mathrm{amount} + \beta\,\mathrm{debtor\ age} + \gamma\,\mathrm{prop\ title} + \delta\,\mathrm{debt\ date\ till\ selloff} + \rho_1\,\mathrm{rating}_1 + \rho_2\,\mathrm{rating}_2 + \rho_3\,\mathrm{rating}_3 + \rho_4\,\mathrm{rating}_4 + \rho_5\,\mathrm{rating}_5 + \rho_6\,\mathrm{rating}_6 + \rho_7\,\mathrm{rating}_7 + \varphi\,\mathrm{addresok} + \theta_1\,\mathrm{debtor\ type}_1 + \theta_2\,\mathrm{debtor\ type}_2 + \epsilon
\]

Here, we use the dummy variables rating 1 through rating 6 as well as debtor type 1 and debtor type 2 for the categorical variables. The maximum likelihood estimates of the logistic regression parameters for classifying extreme debts are shown in Table 6.
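As a reading aid for the coefficient interpretation, the log-odds equation can be solved for the probability of full payment. Writing \eta for the whole right-hand side of the model (a symbol introduced here for brevity, not one used in the paper) gives the standard logistic form:

```latex
\[
\eta = \log\frac{P(Y=1)}{P(Y=0)}
\quad\Longrightarrow\quad
P(Y=1) = \frac{e^{\eta}}{1+e^{\eta}} = \frac{1}{1+e^{-\eta}}
\]
\[
\eta = 0 \;\Rightarrow\; P(Y=1) = 0.5,
\qquad
\frac{\partial P(Y=1)}{\partial \eta} = P(Y=1)\,\bigl(1 - P(Y=1)\bigr) > 0
\]
```

Because the derivative is positive, a variable with a positive coefficient raises P(Y=1) as it grows, and a variable with a negative coefficient lowers it.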
Now, we want to interpret the logistic regression model. In our model, Y = 1 when the debtor has paid fully; thus, variables with negative coefficients decrease the probability of full payment and, inversely, variables with positive coefficients increase it. For example, in financial services, address not ok, debt amount, debt date until selloff, and rating original with labels 5 and 6 have a statistically significant negative effect on the recovery rate of the debtor; in addition, prop title and rating original with labels 0, 2 and 4 have a significant positive influence on the recovery rate in financial services. The probability of full payment by debtors with a valid address is higher than the probability of full payment by debtors without one. Similarly, debts with a longer period from default to selloff have a lower probability of full payment. On the other hand, debts with rating original equal to 0 and 1 present a higher probability of full payment than other debts. In summary, for financial services, the variables that increase the probability of non-payment are:

- Do not have address
- Higher debt amount
- Longer period between default and selloff
- Rating original = 05, 06

As with the telecommunications and financial services categories, we can interpret the logistic regression models for the other categories. Address ok, debt amount, rating original and debt date until selloff are statistically significant in all industries. Also, the debtor age and debtor type variables are significant except in financial services, and the prop title variable has a significant influence in financial services, miscellaneous, return debit note, energy and utilities, telecommunications and mail order. As in the previous step, we use CART, CHAID, SVM, neural network, KNN and dmneural as classification methods. The misclassification rates in the validation datasets are used for model comparison.
Table 7 shows criteria such as the misclassification rate, the sum of square errors in training and validation, and the average square error in the validation dataset for the different industries. The classification tree is the best model for classifying the debts with extreme recovery into full payment and non-payment in return debit note, energy and utilities and telecommunications. Based on the misclassification rate in the validation dataset, the neural network is the best classifier for business-to-business, financial services and mail order. The Support Vector Machine has the highest accuracy rate in the validation datasets for the public sector, public transport and miscellaneous categories.

TABLES 6,7
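The way the two classification steps and the prediction step of the two-stage model fit together can be sketched as a simple routing function. The three model arguments stand for any fitted classifier or regressor; the stand-in rules below are arbitrary placeholders rather than findings from the data, and clipping the predicted rate to [0, 1] is an added safeguard, not something stated in the text.

```python
def predict_recovery_rate(debt, is_extreme, is_full_payment, non_extreme_rate):
    """Route one debt through the staged models and return a recovery rate."""
    if is_extreme(debt):                        # step 1: extreme vs. non-extreme
        return 1.0 if is_full_payment(debt) else 0.0   # step 2: full vs. none
    rate = non_extreme_rate(debt)               # step 3: continuous prediction
    return min(max(rate, 0.0), 1.0)             # keep the result a valid rate

# Arbitrary stand-in models, for illustration only.
def toy_is_extreme(debt):
    return debt["amount"] < 500

def toy_is_full_payment(debt):
    return debt["address_ok"]

def toy_non_extreme_rate(debt):
    return 0.5

models = (toy_is_extreme, toy_is_full_payment, toy_non_extreme_rate)
rr_full = predict_recovery_rate({"amount": 100, "address_ok": True}, *models)
rr_none = predict_recovery_rate({"amount": 100, "address_ok": False}, *models)
rr_partial = predict_recovery_rate({"amount": 900, "address_ok": True}, *models)
```

Since each stage may use a different algorithm, the best classifier per industry from each step can be plugged in independently.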
6.3 Prediction of nonextreme debts

Models building and comparisons

We now come to the prediction of the recovery rate using the regression model. The response variable is the recovery rate of non-extreme debts. The independent variables are the same as in the previous steps: the debt amount (debt outstanding at sale), debtor age, prop title, debt date until selloff (time between default and purchase by the third party), rating (classifying the creditworthiness into an ordinal rating with seven levels), address (the validity of the debtor's address), and debtor type (either male, female, or corporate entity). Our regression model is:

\[
y = \mu + \alpha\,\mathrm{amount} + \beta\,\mathrm{debtor\ age} + \gamma\,\mathrm{prop\ title} + \delta\,\mathrm{debt\ date\ till\ selloff} + \rho_1\,\mathrm{rating}_1 + \rho_2\,\mathrm{rating}_2 + \rho_3\,\mathrm{rating}_3 + \rho_4\,\mathrm{rating}_4 + \rho_5\,\mathrm{rating}_5 + \rho_6\,\mathrm{rating}_6 + \rho_7\,\mathrm{rating}_7 + \varphi\,\mathrm{addresok} + \theta_1\,\mathrm{debtor\ type}_1 + \theta_2\,\mathrm{debtor\ type}_2 + \epsilon
\]

We fix the regression model in advance, so no model selection procedure is applied. Table 8 shows the maximum likelihood estimates of the regression parameters for predicting the recovery rate of non-extreme debts. For the explanatory variables, where the p-value is lower than 0.05, the null hypothesis is rejected. This means that these explanatory variables have a statistically significant influence on the response variable.

Now, we want to interpret this model. The response variable in our model is the recovery rate of debts; thus, variables with negative coefficients decrease the debt's recovery rate and, inversely, variables with positive coefficients increase it. For example, in business-to-business, the address not ok, debt amount and debtor age variables decrease the recovery rate. On the other hand, female debtors, rating original equal to 0 and 1, prop title and the time between default and selloff to the third-party company have a positive influence on the recovery rate.
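Once such a regression has been fitted, its predictions on the validation data can be summarized with the three measures used in this study: the sum of square errors, the average square error and R-square. The sketch below assumes the usual definitions (average square error as the sum of square errors divided by the number of observations, R-square as one minus the ratio of residual to total sum of squares); the function name and the numbers are illustrative only.

```python
def regression_metrics(actual, predicted):
    """Sum of square errors, average square error and R-square."""
    n = len(actual)
    mean = sum(actual) / n
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean) ** 2 for a in actual)   # total sum of squares
    return {"sse": sse, "ase": sse / n, "r2": 1.0 - sse / sst}

# Toy non-extreme recovery rates and the corresponding model predictions.
observed = [0.2, 0.4, 0.6, 0.8]
fitted = [0.25, 0.35, 0.65, 0.75]
metrics = regression_metrics(observed, fitted)
```

Computing the same measures on training and validation data makes overfitting visible as a gap between the two.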
The regression models for the other categories are interpreted in the same way as for the business-to-business category. The debt amount, debt date until selloff and rating variables are significant in all industries, while the address variable is not significant except in the miscellaneous category. The debtor type is statistically significant in the prediction of the recovery rate except in mail order and public transport. Only in financial services does the debtor age have no significant influence. We use the CART and CHAID algorithms as prediction models. The F-test statistic and variance reduction are the impurity measures for CHAID and CART, respectively. In addition, SVM, regression and neural network are applied to all the datasets. As mentioned before, we divided the non-extreme datasets into two sets: training and validation datasets. The training datasets contain 70% of the debts and the validation datasets contain 30% of the observations in each industry. We build each model on the learning dataset and then
these models are evaluated on the validation dataset for the prediction of the recovery rate of non-extreme debts. The R-square and the sum of square errors are applied as performance measures. Table 9 shows criteria such as the R-square, the sum of square errors and the average square error for the different industries. The CHAID or CART trees are the best models for predicting the recovery rate of non-extreme debts in public sector, financial services, business-to-business, mail order, telecommunications and public transport. On the other hand, the neural network is the best model for the return debit note and mail order categories, and SVM is the best predictor in the miscellaneous category.

TABLES 8,9

7. Conclusions

In this paper, we analyzed the recovery rates of 9,779,239 debts of a third-party company; these were classified into nine categories: mail order, business-to-business, energy and utilities, financial services, miscellaneous, public sector, return debit note, telecommunications, and public transport. The loss given default is not the same in all categories, and the mean recovery rate varies across them (Table 1). The mail order category has the highest mean recovery rate and the mean recovery rate of the public sector is the lowest. The financial services category has the highest mean of debt and the lowest mean of recovery rate. According to Table 1, a category with a lower debt date until selloff has a higher mean recovery rate. As shown in Table 2, the optimum duration of the collection process is not the same in all the industries. For instance, one year is definitely enough collection process time for miscellaneous but it is not long enough in the financial services category. Neural network, CART, CHAID, Support Vector Machine, K-nearest neighbor, dmneural and logistic regression were applied in classifying debts into extreme and non-extreme recovery rates. These techniques were used on all the datasets.
The decision tree algorithms are the best models in five industries and the neural network has the best performance in two industries. The Support Vector Machine is the best classifier in the telecommunications category. The important advantage of the classification tree is the interpretability of the model. Conversely, neural network and SVM are black box methods. We applied neural network, CART, CHAID, Support Vector Machine, K-nearest neighbor, dmneural and logistic regression in classifying extreme debts into full payment and non-payment. The decision tree algorithms are the best models in three industries and the neural network has the best performance in three industries. The Support Vector Machine is the best classifier in the public sector, public transport and miscellaneous categories. Neural network, CART, CHAID, Support Vector Machine and regression were used as prediction models for debts with a non-extreme recovery rate. The neural network has the best performance
in two industries and the Support Vector Machine is the best predictor for the miscellaneous category. Decision trees have the best results in the other datasets. In summary, the non-statistical methods produce better results than the statistical methods.

References

[1] Altman, E. I., A. Resti, and A. Sironi (2005). Recovery risk: The next challenge in credit risk management, Chapter Loss given default; a review of the literature in recovery risk. Risk Books, London.
[2] Asarnow, E. and D. Edwards (1995). Measuring loss on defaulted bank loans. A 24-year study. Journal of Commercial Lending Vol. 77, No. 7.
[3] Avery, R. B., P. U. Calem, and G. B. Canner (2004). Consumer credit scoring: Do situational circumstances matter. Journal of Banking and Finance 28.
[4] Bastos, J. (2010a). Forecasting bank loans loss-given default. Journal of Banking and Finance Vol. 34(10).
[5] Bastos, J. (2010b). Predicting bank loan recovery rates with neural networks. Technical report, Working Paper.
[6] BCBS (2005). International convergence of capital measurement and capital standards. A revised framework, Bank for International Settlements.
[7] Bellotti, T. and J. Crook (2008). Modelling and predicting loss given default for credit cards. Technical report, Working Paper.
[8] Bellotti, T. (2011, forthcoming). Loss given default models incorporating macroeconomic variables for credit cards. International Journal of Forecasting.
[9] Board of Governors of the Federal Reserve System (2011). Federal Reserve statistical release.
[10] Calabrese, R. (2010). Regression for recovery rates with both continuous and discrete characteristics, Proceedings of the 45th Scientific Meeting of the Italian Statistical Society (SIS), Italy.
[11] Chen, T. H. and C. W. Chen (2010). Application of data mining to the spatial heterogeneity of foreclosed mortgages. Expert Systems with Application Vol. 37(2).
[12] Crook, J., D. Edelman, and L. Thomas (2007).
Recent developments in consumer credit risk assessment. European Journal of Operations Research Vol. 183(3),
More informationA Basic Guide to Modeling Techniques for All Direct Marketing Challenges
A Basic Guide to Modeling Techniques for All Direct Marketing Challenges Allison Cornia Database Marketing Manager Microsoft Corporation C. Olivia Rud Executive Vice President Data Square, LLC Overview
More informationNew Work Item for ISO 35345 Predictive Analytics (Initial Notes and Thoughts) Introduction
Introduction New Work Item for ISO 35345 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.
More informationIssues in Information Systems Volume 16, Issue IV, pp. 3036, 2015
DATA MINING ANALYSIS AND PREDICTIONS OF REAL ESTATE PRICES Victor Gan, Seattle University, gany@seattleu.edu Vaishali Agarwal, Seattle University, agarwal1@seattleu.edu Ben Kim, Seattle University, bkim@taseattleu.edu
More information11. Analysis of Casecontrol Studies Logistic Regression
Research methods II 113 11. Analysis of Casecontrol Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
More informationData Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
More informationAt first glance, small business
Severity of Loss in the Event of Default in Small Business and Larger Consumer Loans by Robert Eales and Edmund Bosworth Westpac Banking Corporation has undertaken several analyses of the severity of loss
More informationSimple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
More informationLearning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal
Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether
More informationAlgorithmic Scoring Models
Applied Mathematical Sciences, Vol. 7, 2013, no. 12, 571586 Algorithmic Scoring Models Kalamkas Nurlybayeva MechanicalMathematical Faculty AlFarabi Kazakh National University Almaty, Kazakhstan Kalamkas.nurlybayeva@gmail.com
More informationSTATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 SigmaRestricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
More informationDECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING
DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING ABSTRACT The objective was to predict whether an offender would commit a traffic offence involving death, using decision tree analysis. Four
More informationStandardization and Its Effects on KMeans Clustering Algorithm
Research Journal of Applied Sciences, Engineering and Technology 6(7): 3993303, 03 ISSN: 0407459; eissn: 0407467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03
More informationJetBlue Airways Stock Price Analysis and Prediction
JetBlue Airways Stock Price Analysis and Prediction Team Member: Lulu Liu, Jiaojiao Liu DSO530 Final Project JETBLUE AIRWAYS STOCK PRICE ANALYSIS AND PREDICTION 1 Motivation Started in February 2000, JetBlue
More informationData mining and statistical models in marketing campaigns of BT Retail
Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120
More informationSection 6: Model Selection, Logistic Regression and more...
Section 6: Model Selection, Logistic Regression and more... Carlos M. Carvalho The University of Texas McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Model Building
More informationUNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee
UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee 1. Introduction There are two main approaches for companies to promote their products / services: through mass
More informationMarketing Mix Modelling and Big Data P. M Cain
1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored
More informationData Mining for Knowledge Management. Classification
1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh
More informationSTATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Webbased Analytics Table
More informationFig. 1 A typical Knowledge Discovery process [2]
Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering
More informationUsing multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
More informationBinary Logistic Regression
Binary Logistic Regression Main Effects Model Logistic regression will accept quantitative, binary or categorical predictors and will code the latter two in various ways. Here s a simple model including
More informationClassification Problems
Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems
More informationKnowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19  Bagging. Tom Kelsey. Notes
Knowledge Discovery and Data Mining Lecture 19  Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.standrews.ac.uk twk@standrews.ac.uk Tom Kelsey ID505919B &
More informationStatistical Models in Data Mining
Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of
More informationIntroduction to Support Vector Machines. Colin Campbell, Bristol University
Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multiclass classification.
More informationStep 5: Conduct Analysis. The CCA Algorithm
Model Parameterization: Step 5: Conduct Analysis P Dropped species with fewer than 5 occurrences P Logtransformed species abundances P Rownormalized species log abundances (chord distance) P Selected
More informationInsurance Analytics  analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics  analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
More informationLOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as
LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values
More informationIntroduction to Regression and Data Analysis
Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it
More informationActive Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
More informationSupport Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano
More informationLeast Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN13: 9780470860809 ISBN10: 0470860804 Editors Brian S Everitt & David
More informationPredict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
More informationEfficiency in Software Development Projects
Efficiency in Software Development Projects Aneesh Chinubhai Dharmsinh Desai University aneeshchinubhai@gmail.com Abstract A number of different factors are thought to influence the efficiency of the software
More informationPerformance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com
More informationArtificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence
Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network?  Perceptron learners  Multilayer networks What is a Support
More informationSegmentation of stock trading customers according to potential value
Expert Systems with Applications 27 (2004) 27 33 www.elsevier.com/locate/eswa Segmentation of stock trading customers according to potential value H.W. Shin a, *, S.Y. Sohn b a Samsung Economy Research
More informationData Mining Part 5. Prediction
Data Mining Part 5. Prediction 5.7 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Linear Regression Other Regression Models References Introduction Introduction Numerical prediction is
More informationChapter 12 Discovering New Knowledge Data Mining
Chapter 12 Discovering New Knowledge Data Mining BecerraFernandez, et al.  Knowledge Management 1/e  2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to
More informationCourse Syllabus. Purposes of Course:
Course Syllabus Eco 5385.701 Predictive Analytics for Economists Summer 2014 TTh 6:00 8:50 pm and Sat. 12:00 2:50 pm First Day of Class: Tuesday, June 3 Last Day of Class: Tuesday, July 1 251 Maguire Building
More informationThe relation between news events and stock price jump: an analysis based on neural network
20th International Congress on Modelling and Simulation, Adelaide, Australia, 1 6 December 2013 www.mssanz.org.au/modsim2013 The relation between news events and stock price jump: an analysis based on
More informationOptimization of technical trading strategies and the profitability in security markets
Economics Letters 59 (1998) 249 254 Optimization of technical trading strategies and the profitability in security markets Ramazan Gençay 1, * University of Windsor, Department of Economics, 401 Sunset,
More informationThe Financial Crisis and the Bankruptcy of Small and Medium SizedFirms in the Emerging Market
The Financial Crisis and the Bankruptcy of Small and Medium SizedFirms in the Emerging Market SungChang Jung, Chonnam National University, South Korea Timothy H. Lee, Equifax Decision Solutions, Georgia,
More informationUSING LOGIT MODEL TO PREDICT CREDIT SCORE
USING LOGIT MODEL TO PREDICT CREDIT SCORE Taiwo Amoo, Associate Professor of Business Statistics and Operation Management, Brooklyn College, City University of New York, (718) 9515219, Tamoo@brooklyn.cuny.edu
More informationTRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP
TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions
More informationLecture 10: Regression Trees
Lecture 10: Regression Trees 36350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
More informationSimple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
More informationCREDIT RISK MANAGEMENT
GLOBAL ASSOCIATION OF RISK PROFESSIONALS The GARP Risk Series CREDIT RISK MANAGEMENT Chapter 1 Credit Risk Assessment Chapter Focus Distinguishing credit risk from market risk Credit policy and credit
More informationClassification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
More informationCONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19
PREFACE xi 1 INTRODUCTION 1 1.1 Overview 1 1.2 Definition 1 1.3 Preparation 2 1.3.1 Overview 2 1.3.2 Accessing Tabular Data 3 1.3.3 Accessing Unstructured Data 3 1.3.4 Understanding the Variables and Observations
More informationPredicting required bandwidth for educational institutes using prediction techniques in data mining (Case Study: Qom Payame Noor University)
260 IJCSNS International Journal of Computer Science and Network Security, VOL.11 No.6, June 2011 Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case
More informationMonitoring the Behaviour of Credit Card Holders with Graphical Chain Models
Journal of Business Finance & Accounting, 30(9) & (10), Nov./Dec. 2003, 0306686X Monitoring the Behaviour of Credit Card Holders with Graphical Chain Models ELENA STANGHELLINI* 1. INTRODUCTION Consumer
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationCOMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction
COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised
More information