Recovery Rate Modelling of Non-performing Consumer Credit Using Data Mining Algorithms


Markus Hoechstoetter, Abdolreza Nazemi, Svetlozar T. Rachev and Caslav Bozic

RMI Working Paper No. 12/09
Submitted: September, 2012

Abstract

There have been more studies on recovery rate modeling of bonds than of personal loans and retail credit. As far as the authors are aware, there exists no research on recovery rate modeling in retail credit for third-party buyers. The goal of this paper is to fill this gap. In our study, data on over nine million defaulted or non-performing consumer credits provided by a German debt collection company are used. According to the findings, the optimum times of the collection processes are not the same for all industries. Moreover, from a variety of characteristics, those debtor characteristics that are most significant in predicting the recovery have been determined. To select the best prediction and classification model, a variety of statistical and data mining methods such as logistic regression, neural network, K-nearest neighbor, CHAID, CART, Support Vector Machine and regression will be examined. A two-stage model is applied which first classifies debts into extreme and non-extreme recovery rates; then, the extreme debts are classified into full payment and non-payment. Moreover, the non-extreme recovery rates are predicted.

Keywords: Recovery rate, Data Mining, Third-party Company

Markus Hoechstoetter
School of Business Engineering
Karlsruhe Institute of Technology, Germany

Svetlozar T. Rachev
College of Business
Stony Brook University, New York, USA

Abdolreza Nazemi
National University of Singapore
Risk Management Institute

Caslav Bozic
School of Business Engineering
Karlsruhe Institute of Technology, Germany

Abdolreza Nazemi. Views expressed herein are those of the author and do not necessarily reflect the views of NUS Risk Management Institute (RMI).

Markus Hoechstoetter a,1, Abdolreza Nazemi b,2, Svetlozar T. Rachev c,d,3, Caslav Bozic a,4

a School of Business Engineering, Karlsruhe Institute of Technology, Germany
b Risk Management Institute, National University of Singapore, Singapore
c College of Business at Stony Brook University, New York, USA
d FinAnalytica, New York, USA

1 Markus Hoechstoetter is a postdoctoral researcher at the Chair of Statistics, Econometrics and Mathematical Finance at the School of Economics and Business Engineering, Karlsruhe Institute of Technology, Germany.
2 Abdolreza Nazemi is a postdoctoral researcher at the Risk Management Institute, National University of Singapore, Singapore.
3 Svetlozar Rachev is Professor at the College of Business, Stony Brook University in New York, and Chief Scientist of FinAnalytica.
4 Caslav Bozic is a doctoral student at the AIFB Institute of the Faculty of Economics and Business Engineering, Karlsruhe Institute of Technology, Germany, and a research and development engineer at Avedas AG, Karlsruhe, Germany.

1. Introduction

The latest, or should we say current, crisis is still present in various aspects of life. The damage to the corporate world is documented in Standard & Poor's (2011), for example. However, individuals have also suffered from the crisis as many employees were laid off and, to make things worse, banks began to restrict lending on a large scale, which resulted in the credit crunch. This is even more worrying since borrowing or, more generally, being in debt has become omnipresent worldwide. For example, in the USA, corporate and consumer debt has reached dizzying levels. According to the Board of Governors of the Federal Reserve System (2011), the accumulated debt amounts to over 10 trillion dollars. This trend, however, is not unique to the USA. In Europe, the trend is similar, as pointed out by Thomas (2009); and, if this trend is not reversed, the need for the expansion of lending will remain a pressing issue. There clearly has to be a sophisticated system that enables credit to be extended and cushions the impact of defaulted debt. Hence, there is a need for third-party buyers to relieve originators of the stressful collection business.

In the context of lending and the inherent risk, the terminology and definitions given in the Basel II Accords are widely used. In particular, the terms 'expected loss', 'loss-given-default' (the loss conditional on default) and 'recovery rate' are commonly used in the context of financial debt. Before the regulation in the Basel Accords was established, lenders in the retail sector designed a now widely used tool, the credit scorecard. This device helps to assess the probability of a new customer defaulting. The actual score is usually a three-digit number computed from a set of consumer characteristics. The scorecard helps to distinguish between potentially good and bad borrowers, i.e., those who do not default and those who do.
As references, we recommend Hand (2001) and Thomas et al. (2002). However, this tool has not proved reliable in predicting complete loss or recovery once the debtor has defaulted on his/her obligations. Instead, logistic regression has been applied more often in the analysis of recovery rates.

A problem arising from the definitions in the Basel Accords, however, is that some parameters are not uniquely specified, leaving room for interpretation and diverging implementations. For example, the recovery rate, i.e., the percentage of the original amount of debt outstanding that is repaid to or repossessed by the lender, can be stated either as the price the distressed debt would achieve if sold immediately after default or as the sum of discounted future payments received from the debtor.

The main interest of this paper is the prediction of the recovery rate or, conversely, the loss-given-default (LGD). Practitioners as well as researchers agree that the recovery rate has to be modelled by some quantity that is restricted to the interval between zero and one. But there are no standardized models for the modelling and prediction of the recovery rate. This is common to basically all industries. So, some models include macro-economic variables, which is in line with the Basel Accords, while others are restricted to information on the debt. Most of all, the methodological side suffers from a lack of freely accessible data, thus hampering the production of reliable results of general value. Moreover, since the collection process can be carried out by the original lender, i.e., in-house, as well as by some third party acting either as agent or as purchaser of the debt, the models have to consider the difference in accessibility of relevant information for the two options.

In the model, we will use only debtor-related variables without macro-economic factors. Our findings will reveal that the age of the debtor and the amount of debt outstanding are important determinants of the recovery rate. We furthermore see as one of our main contributions the presentation of results on recovery rates from various non-banking industries on a scale that, to our knowledge, has not been presented before in the literature.

The remainder of the paper is organized as follows. The following section presents the features of the different options of the lender with respect to collection and ownership of the debt. In section 3, statistical and data mining methods are reviewed in brief. Then, we present our data in section four. The important results of the exploratory data analysis are set out in section five. We discuss the recovery rate modelling using data mining methods in section 6. This is followed by an explanation of our results in section 7.

2. Recovery and collection process

In this section, we present the different collection options that exist for a lender, and their implications for LGD, an excellent account of which is given in Thomas et al. (2011). Generally, collecting debt and recovering it after default consists of a sequence of measures such as written reminders, more specific letters, telephone calls and, ultimately, legal action such as court orders and sending marshals. The different measures commonly result in various actions by the debtor. Often, as a result of a telephone call, creditor and debtor negotiate a repayment plan.
It is believed by most collection entities that a telephone call is preferable to an appearance in person because the conversation is still personal while the debtor is not confronted with a physical intrusion. Usually, the collection process becomes more robust as default continues; in the public perception this is commonly attributed to a debt collector's alleged rude demeanour and it may even discredit the entire lending industry. Proof of this public disapproval of the loan and collection business has been appearing in abundance in the press since the culmination of the latest crisis.

Unless hindered by the legal framework, a lender might consider any of three basic collection options during the life of the lending relationship. The first is the internal (or in-house) collection and recovery process. Lenders usually resort to this, at least in the beginning. Secondly, a lender may decide to remain owner of the debt but assign collection to an agent. The third option is to sell the debt to a third party and, hence, end the relationship with the borrower. The measures taken may be the result of discretionary considerations relating to certain customers only, or part of a standardized provision formulating automated procedures. For as long as repayment is on schedule, the lender will retain the collection in his business. Also, if the lender determines that he is better able than an outside agent to recover the debt after default, based on his experience and also because the relationship with the defaulting borrower is valuable to him, he will opt to collect it himself. When the collection process is too involved for the creditor or might harm his reputation because the language to be applied at a certain stage of collection may have to become more direct, the lender may choose to hand the collection process to an outside agent while retaining ownership of the debt. If default seems more likely or the ownership of the distressed debt causes additional aggravation for the lender, he will probably sell the debt at a discount to a third-party buyer. The price will have to be lower than the recovery value expected by the buyer.

An advantage of internal collection may be that all characteristics concerning the debt are known, whereas a third-party buyer lacks important information such as loan details, borrower repayment behaviour or change in score, which is a privilege of the original lender, according to Fama (1985). The third-party buyer does receive essential information on exactly when default occurred, the exact amount outstanding, and when the last payment was made. The third-party buyer, like the outside agent, however, may suffer from a negative selection because he will most likely only have access to poorly performing debt, as argued in Ramakrishnan and Thakor (1984). According to Thomas et al. (2011), it is not surprising that only 7% repaid the whole debt, 16.3% paid a fraction, and the vast majority, i.e. 83%, did not pay anything when the collection was carried out by a third party. This is in sharp contrast to the outcome when collection was undertaken by the in-house collection department: 30% repaid the total amount, while 60% and 10% repaid only a share or nothing, respectively.
In our results section, we will provide very detailed information on recovery rates faced by a third-party buyer of non-performing consumer debt.

3. Statistical and Data Mining Models

3.1. Regression

3.2. Logistic Regression

Logistic regression is used for quantitative variables, particularly when the response variable is categorical. Let us define a binary random variable as

$$Y = \begin{cases} 1 & \text{if default occurs} \\ 0 & \text{if default does not occur} \end{cases}$$

with $\pi = \Pr(Y = 1)$ and $1 - \pi = \Pr(Y = 0)$, which is the well-known example of the occurrence of default. The multiple logistic regression is as follows:

$$\pi_i = \frac{\exp(x_i^{T}\beta)}{1 + \exp(x_i^{T}\beta)} = \frac{\exp(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}{1 + \exp(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)} \qquad (1)$$
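Equation (1) can be evaluated directly once coefficients are available. The following is a minimal sketch, not the estimation procedure used in the paper: the function name `default_probability`, the two explanatory variables and the coefficient values are hypothetical and chosen only for illustration.

```python
import math

def default_probability(x, beta):
    """Evaluate equation (1) for one observation.

    beta[0] is the intercept beta_0; beta[1:] are the slopes beta_1..beta_n.
    """
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    # Numerically stable evaluation of exp(eta) / (1 + exp(eta)).
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

# Hypothetical coefficients: intercept, standardized debtor age,
# standardized amount outstanding (assumed values, not estimates).
beta = [-1.0, 0.5, 0.8]
p = default_probability([0.0, 0.0], beta)
```

With both explanatory variables at zero, only the intercept contributes, so `p` equals exp(-1) / (1 + exp(-1)). Whatever the inputs, the result stays strictly between zero and one, which is why the logistic form suits a binary response.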

In equation (1), $\beta$ is the vector of coefficients and $X$ is the matrix of observations.

3.3. Neural Network

The design of the neural network is especially appealing because of its several layers of perceptrons. Common to any design are an input layer, one or more hidden layers of neurons, and an output layer. In the simplest version with just one hidden layer, input data consisting of observations $x_j$ of $j = 1, 2, \dots, d$ variables enter neuron $i$ of the hidden layer to be transformed there into a weighted functional output

$$h_i = f^{(1)}\Big(b_i + \sum_{j=1}^{d} w_{i,j}\, x_j\Big)$$

with weights $w_{i,j}$ and neuron-specific constant $b_i$. Output from all $n_h$ hidden neurons is then turned into network output

$$y = f^{(2)}\Big(b^{(2)} + \sum_{i=1}^{n_h} v_i\, h_i\Big)$$

with neuron weights $v_i$. The neural network allows for a flexible yet sometimes unintuitive design. This technique is particularly apt in separating samples with respect to objective functions such as, for example, zero or full recovery.

3.4. K-nearest neighbor

KNN is a popular method in data mining and is used for classification based on the closest observations in the variable space. The K-nearest neighbor method was introduced in the early 1950s. KNN can also be applied for prediction.⁵ Suppose that each learning sample has $n$ attributes, so that it is represented as a point in $n$-dimensional space. To classify a new sample, KNN searches for the K learning samples that are nearest to the new sample and assigns the new sample to the majority class among these K nearest neighbors. When $k = 1$, the class of the new sample is the same as the class of the nearest learning sample. The Euclidean distance or the Mahalanobis distance is applied as a distance metric. The Euclidean distance between two points or tuples $X_1 = (x_{11}, x_{12}, \dots, x_{1n})$ and $X_2 = (x_{21}, x_{22}, \dots, x_{2n})$ is defined as

$$D_E(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$

⁵ See Han and Kamber (2006).
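The majority-vote rule with the Euclidean distance can be sketched in a few lines. This is an illustrative sketch only: the function names, the two normalized attributes and the toy learning sample are assumptions, not data or code from the study.

```python
import math
from collections import Counter

def euclidean(x1, x2):
    """Euclidean distance D_E between two tuples of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(new_sample, samples, labels, k):
    """Assign the majority class among the k nearest learning samples."""
    order = sorted(range(len(samples)),
                   key=lambda i: euclidean(new_sample, samples[i]))
    votes = Counter(labels[i] for i in order[:k])
    # On a tie, most_common returns the class that was counted first.
    return votes.most_common(1)[0][0]

# Hypothetical learning sample with two normalized attributes per debt.
samples = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.75)]
labels = ["non-payment", "non-payment",
          "full payment", "full payment", "full payment"]

predicted = knn_classify((0.7, 0.7), samples, labels, k=3)
```

Here the three nearest neighbors of (0.7, 0.7) all carry the label "full payment", so that is the predicted class. With two classes, an odd k avoids tied votes.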

For the calculation, we usually use normalized variables. Suppose that, for $X = (x_1, \dots, x_n)^T$, the covariance matrix is $\Sigma$ and $\mu = (\mu_1, \dots, \mu_n)^T$ is the mean vector; then the Mahalanobis distance is defined as

$$D_M(X) = \sqrt{(x - \mu)^T\, \Sigma^{-1}\, (x - \mu)}$$

The Mahalanobis distance is based on correlations between variables, by which different patterns can be identified and analyzed. It is a useful way of determining the similarity of variables. The Euclidean distance is applied in many classification problems. When the value of a variable is missing, KNN uses the maximum possible difference between the two samples: if both values of a normalized variable in samples $X_1$ and $X_2$ are missing, the difference is assumed to be 1. If only one of them is missing and the other one has the value $b$, the difference is taken to be the larger of $1 - b$ and $b$. For a categorical variable, KNN assumes the difference to be 1.

For the K-nearest neighbor algorithm, we need to determine the number of nearest neighbors and the distance measure. The best value of $k$ depends only on the data, and we can find it by comparing the error rates for different values of $k$. In general, the larger the value of $k$, the more stable the classification model. However, larger values of $k$ mean that the learning samples now being included are not very close to the new sample. If $k$ is equal to 1, then the class of the new sample is predicted as the class of the closest learning sample; this is called the nearest neighbor algorithm, and it makes a rather unstable classifier.⁶

3.5. Trees

Breiman and Friedman introduced recursive partitioning algorithms.⁷ Decision trees are usually classified into two groups: classification trees and regression trees. In classification trees, the target variable is categorical or qualitative, but we can also use classification trees when the dependent variable is continuous. Thomas et al. (2002) mentioned that classification trees were applied in credit scoring by Makowski. The basic idea of tree construction is to find subsets with maximum homogeneity, ideally subsets whose cases belong to only one class of the target variable. At each splitting step, tree algorithms split the cases on the independent variable that yields maximum homogeneity. We define the impurity of a node as a function of the probabilities of the different classes in the node under consideration:

$$i(t) = \phi(p_1, p_2, \dots, p_J)$$

where $p_j$ is the probability of cases belonging to class $j$. There are different kinds of impurity function with these characteristics. One of the principal differences between tree algorithms is related to the impurity function. Breiman et al. (1984), Deville (2006), Giudici (2003) and Thomas et al. (2002) pointed out certain impurity functions, for example, Gini,

$$i(t) = \sum_j p(j \mid t)\,\big(1 - p(j \mid t)\big),$$

entropy,

$$i(t) = -\sum_j p(j \mid t)\, \log p(j \mid t),$$

or the half-sum of squares, which is to be maximized,

$$\mathrm{Chi} = \frac{n(l)\, n(r)}{n(l) + n(r)}\, \big(p(0 \mid l) - p(0 \mid r)\big)^2$$

where $n(r)$ and $n(l)$ are the numbers of observations in the right and left nodes. A large value of the $\chi^2$ statistic Chi means that the two proportions are not the same. The quality of a split is defined as the reduction of impurity that the split obtains:

$$\Delta i = i(v) - [\pi(l)\, i(l) + \pi(r)\, i(r)]$$

where $\pi(l)$ and $\pi(r)$ are the observed proportions of observations assigned to the left and right nodes. In fact, tree algorithms select the variable that has the best split quality. Finally, tree algorithms label leaf nodes based on the majority class of the target variable. In regression trees, the fitted value $\hat{y}_i$ equals the mean of the dependent variable over the observations in the leaf node under consideration. Classification and regression trees (CART) are the most common tree algorithms (Breiman et al., 1984). In CART, the target variable can be categorical or continuous. The impurity function of CART is assumed to be Gini or entropy. Chi-squared Automatic Interaction Detection (CHAID) was developed by Kass; its impurity function is assumed to be chi-square.

⁶ See Giudici and Figini (2009), Hand et al. (2001), Han and Kamber (2006).
⁷ See Thomas et al. (2002), Giudici and Figini (2009).

3.6. Support Vector Machine

SVMs are used to separate debtors into two categories ($y = 1$ or $y = -1$) based on a hyperplane threshold with perpendicular vector $w$ that maximizes the minimal distance of each of the two groups from the threshold. With the optimal hyperplane, the training data keep a minimum distance, the margin, from the hyperplane to guarantee generality of the model.
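To make "maximizing the minimal distance" concrete, the sketch below computes that minimal distance for two candidate hyperplanes and compares them. It illustrates the criterion only and is not an SVM solver; the function name `geometric_margin`, the intercept `b` and the four toy points are assumptions made for this example.

```python
import math

def geometric_margin(w, b, samples, labels):
    """Smallest signed distance y_i * (<w, x_i> + b) / ||w|| over all samples.

    A negative value means at least one sample lies on the wrong side.
    """
    norm = math.sqrt(sum(c * c for c in w))
    return min(
        y * (sum(c * v for c, v in zip(w, x)) + b) / norm
        for x, y in zip(samples, labels)
    )

# Hypothetical, linearly separable debtors with labels +1 and -1.
samples = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

margin_a = geometric_margin((1.0, 1.0), -2.0, samples, labels)  # diagonal line
margin_b = geometric_margin((1.0, 0.0), -1.0, samples, labels)  # vertical line
```

Both candidates separate the two groups, but the diagonal hyperplane keeps every point at least sqrt(2) away while the vertical one only guarantees a distance of 1. Under the criterion described above, the diagonal one is preferred.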
The optimization problem using all $n$ observations $(y_i, x_i)$, $x_i \in \mathbb{R}^d$, is thus given by

$$\min_{w,b}\ \frac{\lVert w \rVert^2}{2}, \quad \text{s.t. } y_i\big(\langle w, x_i \rangle + b\big) \ge 1, \quad i = 1, 2, \dots, n \qquad (2)$$

or in the dual form

$$\max_{a}\ \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i,j=1}^{n} a_i a_j y_i y_j \langle x_i, x_j \rangle, \quad \text{s.t. } \sum_{i=1}^{n} a_i y_i = 0,\ a_i \ge 0 \qquad (3)$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product. The separating rule is then given by $f(x) = \mathrm{sign}(\langle w, x \rangle + b)$ or, equivalently, $f(x) = \mathrm{sign}\big(\sum_{i=1}^{n} a_i y_i \langle x_i, x \rangle + b\big)$. A problem occurs if the data are not linearly separable as required. To this end, the original data vector $x \in \mathbb{R}^d$ is mapped into a higher dimensional ($k > d$) feature space with a non-linear function $\varphi : \mathbb{R}^d \to \mathbb{R}^k$, $x \mapsto \varphi(x)$. To circumvent the calculation of the inner products in the higher dimension, the so-called kernel trick is applied, requiring only the computation of kernel functions $k(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$ for the dot products. Thus, the transformation into the higher dimensional space can actually be avoided. The resulting separating function is now $f(x) = \mathrm{sign}\big(\sum_{i=1}^{n} a_i y_i\, k(x_i, x) + b\big)$. Common kernel functions are, for example, polynomial, $k(x_i, x) = \langle x_i, x \rangle$ in its simplest (linear) form, or radial basis, $k(x_i, x) = \exp(-\lVert x_i - x \rVert^2 / c)$.

The authors state that the advantages are given by the use of key observations only for the sake of speed, the translation of the discrimination problem into a quadratic problem, and the projection of the original problem onto a higher dimensional space to apply a linear discrimination function. They begin the modelling with a stepwise selection process of the most powerful variables to separate the data set into homogeneous subsets. LSSVM is a version of SVM that conducts a linear regression of the form $y = \varphi(x)^{T} b + \varepsilon$ with the original data $x$ mapped into a higher dimensional feature space by $\varphi$ to obtain a higher degree of linearity. Using a kernel $K(x, x_i) = \varphi(x)^T \varphi(x_i)$ simplifies the optimization in the preferred dual form $y = \sum_{i=1}^{n} a_i\, \varphi(x)^T \varphi(x_i) + e$.

3.7. Dmneural

As far as we are aware, dmneural network training is not a very popular method in data mining, but we will compare the performance of this model to other models. In the learning dataset, dmneural selects the principal components of the independent variables that best explain the variation in the response variable; consequently, it chooses the best group of independent variables for the prediction or classification of the response variable. Dmneural omits the independent variables that carry less information for predicting the response variable. By construction, principal components are uncorrelated. An activation function is applied to the linear combination of independent variables and principal components. Matignon (2007) points out eight different activation functions, among them Gaussian, logistic, exponential and square.⁸ The misclassification rate in the classification problem and the sum of squared errors in the prediction problem are used in specifying the best activation function in the next phases. Dmneural uses the response variable and the residuals from the first step in the prediction or classification of the response variable. Following Matignon (2007), the dmneural model constructs an additive non-linear model:

$$\hat{y} = \sum_{i=1}^{\text{nphases}} g\big(f(x, \alpha)\big)$$

where $g$ is the link function and $f$ is the best activation function in phase $i$.

⁸ See Matignon (2007).
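Of the methods reviewed in this section, the tree impurity measures of section 3.5 lend themselves to a short worked example. The sketch below evaluates the Gini and entropy impurities and the split quality for a toy node; the function names and the eight hypothetical labels (1 for full payment, 0 for non-payment) are assumptions made for illustration.

```python
import math

def gini(p):
    """Gini impurity: sum_j p_j * (1 - p_j)."""
    return sum(q * (1.0 - q) for q in p)

def entropy(p):
    """Entropy impurity: -sum_j p_j * log(p_j), skipping empty classes."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def class_proportions(labels):
    """Observed class proportions p(j | t) in a node."""
    n = len(labels)
    return [labels.count(c) / n for c in sorted(set(labels))]

def split_quality(parent, left, right, impurity=gini):
    """Reduction of impurity: i(v) - [pi(l) * i(l) + pi(r) * i(r)]."""
    n = len(parent)
    pi_l, pi_r = len(left) / n, len(right) / n
    children = (pi_l * impurity(class_proportions(left))
                + pi_r * impurity(class_proportions(right)))
    return impurity(class_proportions(parent)) - children

# Hypothetical node: 1 = full payment, 0 = non-payment.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
pure_split = split_quality(parent, [0, 0, 0, 0], [1, 1, 1, 1])
useless_split = split_quality(parent, [0, 0, 1, 1], [0, 0, 1, 1])
```

A split that separates the two classes perfectly removes all of the parent's Gini impurity (0.5 for a balanced binary node), whereas a split that reproduces the parent's class mix in both children yields no reduction. A tree algorithm would therefore choose the first split.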

4. Data description

4.1. Data provider

Our data consist of close to ten million different unsecured debts purchased between 2001 and 2010 by arvato infoscore, one of the largest debt purchasers in Germany. The company combines a collection business (German: Inkasso), scoring services, and factoring. Factoring, as known in Germany, is a particular form of third-party financial service for originator lenders. The most common variations are full-service, selective, notification, semi-factoring, and silent factoring. In the case of normal factoring, the debt buyer, i.e., the factor, receives all debt from the originator in an automatically revolving process agreed upon in advance. The factor is owner as well as collector of the debt after its cession from the originator. It is the most common form of factoring in Germany. Selective factoring describes a construct where only selected debt is sold off to the third-party factor. When the third party offers notification factoring, the debtor is informed about the sale of the debt and can only repay to the third-party factor. The default risk, however, remains with the originator, which, in the case of default, has to reimburse the factor. In silent factoring, the debtor is unaware of the sale of the debt and payment is only possible to the original creditor. A negative consequence for the factor is a lack of influence over the debtor since he is not entitled to collect. And finally, when semi-factoring is agreed between originator and factor, the debtor likewise remains unaware of the sale of the debt, but payments are to be made exclusively to accounts or addresses that belong to the factor. In the case of arvato infoscore, the company engages in full-service factoring.
When collection and scoring businesses are combined in the same company, albeit as legally separate entities, the third-party buyer has the advantage that the collection department has often developed long-lasting relationships with debtors during the period of initial ownership by the respective originators. These relationships yield precious information to which the third-party buyer would not have access under any other circumstances. However, legally, this is sometimes limited to the information that would be offered to any third-party buyer. In the following, we consider only data accessible to regular outside buyers.

4.2. Data

The data consist of roughly ten million defaulted or non-performing unsecured receivables from nine different categories of customers of the third-party buyer. On the one hand, these categories represent the following industries: telecommunications, online shopping and mail ordering, financial services including credit cards, and the utility and energy sector. On the other hand, receivables from non-profit entities of the public sector (community services) and public transport are also part of these categories, as are failed return debit notes. Anything that does not fit into any of the prior categories is subsumed in the miscellaneous category including, for example, unpaid parking tickets. In the following, we will use these abbreviations to indicate the respective industries: mail order (MO), business-to-business (B2B), energy and utilities (NRGY), financial services (FS), miscellaneous (MI), public sector (PS), return debit note (RDN), telecommunications (TC), and public transport (PT).

Each debtor is assigned a unique identification number. For each receivable a unique identification number is issued, and all payments on the account of a particular receivable have to be labelled with the respective identification number. The relationship between receivable and borrower is not one-to-one since a borrower might have defaulted on more than one receivable, whereas a receivable in arrears can only belong to one borrowing entity. A payment is characterized by the identification number of the receivable and, thus, can be traced to the corresponding debtor. Furthermore, we selected from all given payment characteristics those that could most easily be transformed into a numerical variable or a categorical variable of low dimension. The resulting variables relating to the debtor include age, gender, residential status and address, as well as current credit history. The variables related to the accounts receivable include age of debt, date of purchase by the third party, amount outstanding and last payment date, while the original receivable amount is usually unknown. This yields about 15 variables that can be used for the subsequent analysis. Some potentially useful information, for example on the quality of the location of the residence, which can be obtained by transforming the postal code into a rating, has not yet been considered.

Henceforth, we will use the terms 'category' and 'industry' interchangeably. In Table 1, we present the most important statistics of the data sorted by industry. It also contains some initial results that we will discuss further in the last section.
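The identification scheme described in this section (payments carry the receivable's number, and each receivable belongs to exactly one debtor) can be sketched as follows. The record layout, all identifiers and amounts are hypothetical. The recovery rate is computed here as total payments divided by the amount outstanding, capped to the interval from zero to one; that formula is an assumption made for this sketch, as the section does not state the paper's exact definition.

```python
from collections import defaultdict

# Hypothetical records: (receivable_id, debtor_id, amount_outstanding).
receivables = [
    ("R1", "D1", 200.0),
    ("R2", "D1", 50.0),   # debtor D1 has defaulted on two receivables
    ("R3", "D2", 120.0),
]
# Hypothetical payments, each labelled with its receivable's number.
payments = [("R1", 80.0), ("R1", 20.0), ("R2", 50.0)]

def recovery_rates(receivables, payments):
    """Assumed definition: sum of payments / amount outstanding, in [0, 1]."""
    paid = defaultdict(float)
    for receivable_id, amount in payments:
        paid[receivable_id] += amount
    return {
        receivable_id: min(1.0, max(0.0, paid[receivable_id] / outstanding))
        for receivable_id, _debtor_id, outstanding in receivables
    }

rates = recovery_rates(receivables, payments)
```

In this toy portfolio, R1 is partially recovered (0.5), R2 is fully recovered (1.0) and R3 has no payments (0.0).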
As we can see from adding the values of '# debts (original)', the total number of receivables is 9,793,590. Because our computational capacity was limited at the time we received the data, we decided to use only 100,000 randomly selected receivables from the mail order industry and only 500,000 randomly selected receivables from the public transport category. We use the complete data set for the other categories. We assume that the mean debt in the complete data sets of the mail order and public transport categories is the same as in their samples; thus, the amount of debt outstanding is 1,248,266, Euros. The recovery rate is not the same in all categories; the range of the mean recovery rate can be read from Table 1. The mean recovery rate of the public sector is the lowest, and the mail order category has the highest mean recovery rate. The financial services category has the highest mean debt and one of the lowest mean recovery rates. In Table 1, if we subtract the mean debt age in third-party ownership from the mean debt age, we obtain the duration between default occurrence and sell-off to the third-party company. The categories with shorter periods between default and sell-off to the third-party company have seen more recovery than the categories with longer periods. For example, these periods in financial services and public sector are around 43 and 23 months, whereas for mail order and telecommunications they are around five months.

The payments that did not convey reliable information on all the used variables were discarded from further analysis. The missing information is indicated by the respective superscripts of the industry. This was the age of the debtor, the age of the receivable, or the identity of the debtor. Moreover, we cleaned the data with respect to 'Earliest entry' and 'Last entry' per industry, since there were unreasonable values, most likely the result of laxity during data entry. Eliminating these outliers resulted in the new values as presented here. At this point, it becomes obvious how important the quality of the data is for the third-party buyer since he generally has no means of validating and, if necessary, correcting them. Nevertheless, the third-party buyer has to cope with many flaws in the data.

TABLE 1

5. Exploratory data analysis

As described in section 4.2, we use the complete data sets except for the mail order and public transport industries, from which we use 100,000 and 500,000 randomly selected observations, respectively, because our computational capacity was limited at the time we received the data. The mail order section has the highest recovery rate and the public sector has the lowest recovery rate in the portfolio. Table 2 shows the quantiles of the debt amount, the recovery rate, the time until full payment for debtors who paid fully and the time of last payment for debtors who paid at least something. The most important result from this table is that the optimum duration of the collection process depends on the industry. As an illustration, in financial services 99% of fully-paying debtors paid fully within 48 months and 90% of them within 28 months; meanwhile, 99% of non-fully-paying debtors made no further payments after 39 months and 90% of them none after 26 months. In contrast, in the miscellaneous category 99% of fully-paying debtors paid fully within 11 months and 90% of them within 4 months.
Additionally, 90% of non-fully-paying debtors made no further payment after 6 months and 99% of them made no further payment after 12 months in the miscellaneous category. In other words, the time at which the final recovery rate can be calculated and the reasonable duration of the collection process differ across categories. We also analyze the recovery rate distributions for the horizons of 12, 24 and 36 months (for each horizon, only receivables with an age greater than or equal to the horizon are included). Our conclusion is that, for all nine industries, the variation in the respective distributions is minimal, with mass slightly shifting from RR = 0 to RR = 1 since more debtors pay off debts as time progresses. After one year, representing most of the receivables, the frequency of RR = 0 is slightly over 60% while, after three years, this frequency decreases minimally to just below 60%. So, either the first year successfully predicted the recovery rate or three years are not long enough as a horizon, since we have censored data as payments are observed on debt that is much older

than the period considered by our scope. So far, our findings appear to contrast with those of the bank loan data.

TABLE 2

6. Recovery rates modelling

The above-mentioned literature, mostly on bank loan data, reports fairly high recovery rates of 60% or more, on average. This may be due to two factors. First, the collection may have been retained by the banks; second, banks tend to have an advantage since they have insight into the borrower's financial situation, which lenders from other industries fail to acquire. This is argued by Fama (1985), for example. Since our data are from a non-bank third-party buyer, we expect rather low recovery rates. From Table 1, we see that this is justifiable given that recovery rates are below 40% and even below 30% in many cases. In the next analysis, we consider the empirical distribution of the recovery rate across the nine different industries. It is apparent that nearly all probability mass is at RR = 0 and RR = 1. Across all nine industries, the majority, by far, of the recoveries are equal to 0. We hypothesize that this is the result of the relatively low average debt amounts (EAD), except for financial services (FS).

After the univariate exploratory analysis, we turn to modelling. We apply a two-stage model which first classifies debts into extreme and non-extreme recovery rate; we then classify the extreme debts into full payment and non-payment. Moreover, the non-extreme recovery rate is predicted. We therefore have classification problems in two steps, as the target variable is binary and the goals are to classify whether a defaulted debt will be extreme or non-extreme and whether an extreme debt will end in full payment or non-payment. We also have a prediction problem in the final step, as the response variable is continuous and the aim is to predict non-extreme recovery rates. As pointed out before, the complete data-sets are used in the modelling step, except for the mail order and public transport industries.
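The targets of the three modelling steps described above can be derived from the recovery rate alone. The following is a minimal sketch, not the authors' implementation: the function name `two_stage_targets` and the tolerance `tol` are hypothetical, and treating any value at or above 1 as full payment is an assumption. The coding follows the text: 1 for extreme and 0 for non-extreme in the first step, 1 for full payment and 0 for non-payment in the second.

```python
from typing import Optional, Tuple


def two_stage_targets(rr: float, tol: float = 1e-9) -> Tuple[int, Optional[int], Optional[float]]:
    """Map a recovery rate to (extreme, full_payment, non_extreme_rr).

    extreme        : 1 if rr is 0 or 1, else 0 (target of the first step)
    full_payment   : 1 if rr is 1, 0 if rr is 0, None for non-extreme debts
    non_extreme_rr : rr itself if 0 < rr < 1, else None (target of the final step)
    """
    if rr <= tol:            # non-payment
        return 1, 0, None
    if rr >= 1.0 - tol:      # full payment (assumption: values >= 1 count as full)
        return 1, 1, None
    return 0, None, rr       # partial recovery, predicted in the final step


# Hypothetical recovery rates, for illustration only.
examples = [0.0, 1.0, 0.35]
targets = [two_stage_targets(r) for r in examples]
```

Each debt therefore enters the first classifier, and then either the second classifier or the prediction model, but never both.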
We use only 100,000 randomly selected items from the mail order section and 500,000 observations from the public transport section. In the data cleaning phase, we delete outliers and impute the missing values. After data cleaning, we randomly divide each data-set into two sets: a training (or learning) data-set and a validation data-set. The training data-sets contain 70% of the observations and the validation data-sets contain 30% of the debts in each industry. We used stratified sampling with equal sizes based on the target variable for training and validation. The number of observations decreases because the data-sets are not balanced with respect to the response variable. We build each model on the training data-set and then evaluate it on the validation data-set to classify the debts. The misclassification rate is one of the usual criteria in classification model comparison.
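The sampling scheme above can be sketched as follows. This is one possible reading of "stratified sampling with equal sizes", namely downsampling every class to the size of the rarest class before a 70/30 split within each class; the function names, the `seed` parameter and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def balanced_split(rows, target, train_frac=0.7, seed=0):
    """Downsample each class to the rarest class size, then split every
    class into training and validation parts with the same fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)
    n = min(len(group) for group in by_class.values())
    train, valid = [], []
    for group in by_class.values():
        sample = rng.sample(group, n)   # equal size per class
        cut = round(train_frac * n)
        train.extend(sample[:cut])
        valid.extend(sample[cut:])
    return train, valid


def misclassification_rate(y_true, y_pred):
    """Share of observations whose predicted class differs from the true class."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Hypothetical toy data: 100 extreme (1) and 40 non-extreme (0) debts.
debts = [{"id": i, "extreme": 1 if i < 100 else 0} for i in range(140)]
train, valid = balanced_split(debts, "extreme")
```

With unbalanced classes the balanced sample is smaller than the original data-set, which matches the decrease in the number of observations noted in the text.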

Data mining algorithms such as neural network, CART, CHAID, K-nearest neighbor, dmneural, logistic regression and Support Vector Machine will be used for the classification steps. Thereafter, neural network, CART, CHAID, regression and Support Vector Machine will be applied as prediction models in the final step. In each step, we will compare models using R-squared, ROC curve, misclassification rate, average square error and sum of square errors. However, the basic criteria are misclassification rate and R-squared.

6.1 Classifying debts as extreme and non-extreme

Models building and comparisons

We come to the modelling of the recovery rate by means of the well-known logistic regression model, i.e. the target is modelled as a non-linear transform of a linear model that includes real-valued and coded categorical data. The target variable distinguishes extreme from non-extreme recovery, where 0 indicates non-extreme recovery and 1 indicates extreme recovery. Extreme recovery consists of full payment and non-payment. As an initial step in selecting the individually most significant debtor-related variables, we perform the logistic regression for each variable alone and assess its explanatory ability through an R-squared measure. This yielded the following set of seven variables individually most significant for predicting RR: the debt amount (debt outstanding at sale), debtor age, prop title, debt date until sell-off (time between default and purchase by third party), rating (classifying the creditworthiness into an ordinal rating with seven levels), address (the validity of the debtor's address), and debtor type (either male, female or corporate entity). We choose a logistic regression model, so a model selection procedure is not applied.
This yielded the logistic regression model

$$
\log\frac{P(Y=1)}{P(Y=0)} = \mu + \alpha\cdot\text{amount} + \beta\cdot\text{debtor age} + \gamma\cdot\text{prop title} + \delta\cdot\text{debt date till selloff} + \rho_1\cdot\text{rating}_1 + \rho_2\cdot\text{rating}_2 + \rho_3\cdot\text{rating}_3 + \rho_4\cdot\text{rating}_4 + \rho_5\cdot\text{rating}_5 + \rho_6\cdot\text{rating}_6 + \rho_7\cdot\text{rating}_7 + \varphi\cdot\text{addresok} + \theta_1\cdot\text{debtor type}_1 + \theta_2\cdot\text{debtor type}_2 + \varepsilon
$$

Here, we use the dummy variables rating 1 through rating 6 as well as debtortype 1 and debtortype 2 for the categorical variables. Table 3 shows the maximum likelihood estimates of the logistic regression parameters for classifying debts into extreme and non-extreme, corresponding to the final model, together with the statistical significance of the parameters. For an explanatory variable with a p-value lower than 0.05, the null hypothesis of no effect is rejected. This means that the explanatory variable has a statistically significant influence on the response variable. Now, we want to interpret the logistic regression model. In our model, Y = 1 when the debt has an extreme recovery rate; thus, variables with negative coefficients decrease the probability of an extreme recovery rate and, inversely, variables with positive coefficients increase it. For example, in financial services, address ok, time between debt occurrence and sell-off to the third-party company, and rating original equal to 5 have statistically significant positive effects on the probability of extreme recovery

rate. The probability of an extreme recovery rate could be higher for debtors with an address than for debtors without one. Debts with a longer time between default occurrence and sell-off to the third-party company have a higher probability of an extreme recovery rate. On the other hand, debts with rating original 0, 2 and 4 present a higher probability of a non-extreme recovery rate than other debts. In summary, for financial services, the variables that increase the probability of a non-extreme recovery rate are: do not have address; longer period between default and sell-off. If we compare with Table 5, we can see that the CHAID tree uses exactly the variables that have a statistically significant influence in the logistic regression. Moreover, the CART tree uses the variables that are significant in the logistic regression and debtor age.

TABLE 3

CART and CHAID algorithms are used as classification methods. The chi-squared statistic and the Gini index are the impurity measures for CHAID and CART, respectively. To obtain a parsimonious tree, we use a significance level of 0.2 in the stopping rule. Table 4 presents the results from the CART classification tree analysis for financial services. The total number of splitting variables in the CART classification tree for financial services is 6: address, rating, debt amount, debt date until sell-off, prop title and debtor age. All the independent variables are used in the CART tree, except debtor type. The splitting variables in the CHAID classification tree for financial services are address, rating, debt amount, debt date until sell-off, prop title and debtor age. The CART tree is slightly more complicated than the CHAID tree for financial services. As mentioned, we classify all the debts based on the majority of debts in each leaf. We can calculate the misclassification rate as a performance measure. In the next section, we will examine the misclassification rate of our models in the training and validation data-sets.
Before applying a neural network, we have to specify its structure. We used neural networks with one and two hidden layers consisting of between two and five neurons. For example, we examined neural networks with 2, 3, 4 and 5 neurons for the data-sets. In general, neural networks are black boxes and we cannot interpret them. An important step in using the K-nearest-neighbor method is the specification of K. This establishes the size of the neighborhood, in the space of the independent variables, that is used to classify the target variable. We checked the misclassification rate of KNN for different values of K.

TABLE 4
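Checking the misclassification rate of KNN for several values of K can be sketched as below. This is an illustration under stated assumptions rather than the software used in the study: the distance is squared Euclidean, the candidate values 1, 3 and 5 are arbitrary, and the two-feature data points are hypothetical and assumed to be standardized.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training points.
    train is a list of (features, label) pairs."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def validation_error(train, valid, k):
    """Misclassification rate of k-NN on the validation set."""
    wrong = sum(1 for x, y in valid if knn_predict(train, x, k) != y)
    return wrong / len(valid)


# Hypothetical toy data; the last training point is a deliberately noisy label.
train = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.2, 0.1), 0),
         ((1.0, 1.0), 1), ((0.9, 1.1), 1), ((1.1, 0.9), 1),
         ((0.3, 0.3), 1)]
valid = [((0.15, 0.15), 0), ((0.95, 0.95), 1), ((0.28, 0.27), 0)]

errors = {k: validation_error(train, valid, k) for k in (1, 3, 5)}
best_k = min(errors, key=errors.get)  # smallest K among those with minimal error
```

In this toy setting K = 1 follows the noisy neighbor and misclassifies one validation point, while larger neighborhoods smooth it out; on real data the trade-off has to be checked empirically, as the text describes.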

As mentioned before, we divide all the data-sets into two sets: training (or learning) data-sets and validation data-sets. The training data-sets contain 70% of the observations and the validation data-sets contain 30% of the debts in each industry. We use stratified sampling with equal sizes in the target variable, which means that the numbers of extreme and non-extreme debts are the same in both the training and the validation data-sets. We build each model on the training data-set and then evaluate it on the validation data-set to classify the debts. The misclassification rate is one of the usual criteria in model comparison. We do not need to use cross-validation because our data-sets are large; consequently, our results are stable. In Table 5, we show criteria such as the misclassification rate, the sum of square errors in training and validation, and the average square error in the validation data-sets for the different industries. The misclassification rate in the validation data-set is the principal criterion for model comparison in this study. The CHAID classification tree is the best model for classifying the debts into extreme and non-extreme recovery in public sector, financial services and mail order. Based on the misclassification rate in the validation data-set, neural network is the best classifier for business-to-business, miscellaneous and return debit note. On the other hand, the CART classification tree is the best classification model for energy and utilities and public transport. The Support Vector Machine has the highest accuracy rate in the validation data-set for telecommunications. For example, Figure 1 shows the ROC curve for financial services. In a ROC plot, one model dominates another when its curve lies completely above the curve of the other model. The CHAID classification tree has the highest accuracy rate among our models across the different industries, at 0.805.
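The notion of ROC dominance mentioned above can be made concrete with a small sketch. It assumes each model outputs a score for the positive class; the helper names and the toy scores are hypothetical, and comparing the curves only at their breakpoints is a simplification rather than a full geometric test.

```python
def roc_points(y_true, scores):
    """ROC points (FPR, TPR), one per distinct threshold, from (0, 0) to (1, 1)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def tpr_at(points, fpr):
    """Highest TPR of the piecewise-linear ROC curve at the given FPR."""
    best = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= fpr <= x1:
            if x1 == x0:                       # vertical segment
                y = max(y0, y1)
            else:                              # linear interpolation
                y = y0 + (y1 - y0) * (fpr - x0) / (x1 - x0)
            best = max(best, y)
    return best


def dominates(a, b):
    """True if curve a is never below curve b at any breakpoint and
    strictly above it at least once."""
    grid = sorted({x for x, _ in a} | {x for x, _ in b})
    diffs = [tpr_at(a, x) - tpr_at(b, x) for x in grid]
    return all(d >= 0 for d in diffs) and any(d > 0 for d in diffs)


# Hypothetical validation labels and scores of two models.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
scores_b = [0.9, 0.5, 0.6, 0.8, 0.3, 0.7, 0.4, 0.2]
roc_a = roc_points(y_true, scores_a)
roc_b = roc_points(y_true, scores_b)
a_dominates_b = dominates(roc_a, roc_b)
```

When neither curve dominates the other, a single summary such as the misclassification rate in the validation data-set, the principal criterion of this study, is needed to rank the models.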
TABLE 5 and FIG. 1

6.2 Classifying extreme debts to full payment and non-payment

Models building and comparisons

We now come to the modelling of the recovery rate by a logistic regression model. The response variable distinguishes full payment from non-payment, where 0 indicates non-payment and 1 indicates full payment. The independent variables are the debt amount, debtor age, prop title, debt date until sell-off, rating, address, and debtor type. This yielded the regression model

$$
\log\frac{P(Y=1)}{P(Y=0)} = \mu + \alpha\cdot\text{amount} + \beta\cdot\text{debtor age} + \gamma\cdot\text{prop title} + \delta\cdot\text{debt date till selloff} + \rho_1\cdot\text{rating}_1 + \rho_2\cdot\text{rating}_2 + \rho_3\cdot\text{rating}_3 + \rho_4\cdot\text{rating}_4 + \rho_5\cdot\text{rating}_5 + \rho_6\cdot\text{rating}_6 + \rho_7\cdot\text{rating}_7 + \varphi\cdot\text{addresok} + \theta_1\cdot\text{debtor type}_1 + \theta_2\cdot\text{debtor type}_2 + \varepsilon
$$

Here, we use the dummy variables rating 1 through rating 6 as well as debtortype 1 and debtortype 2 for the categorical variables. The maximum likelihood estimates of the logistic regression parameters for classifying extreme debts are shown in Table 6.
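To read the log-odds model above as a probability of full payment, the logit has to be inverted. The sketch below shows this with two of the explanatory variables only; the coefficient values, the intercept and the names `amount` and `address_ok` are hypothetical and are not the estimates reported in the paper's tables.

```python
import math


def probability_from_logit(intercept, coefs, x):
    """P(Y = 1) implied by a logistic regression: the inverse of the log-odds."""
    eta = intercept + sum(coefs[name] * value for name, value in x.items())
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients, chosen only to illustrate the sign interpretation.
coefs = {"amount": -0.002, "address_ok": 0.8}

with_address = probability_from_logit(
    -0.5, coefs, {"amount": 100.0, "address_ok": 1})
without_address = probability_from_logit(
    -0.5, coefs, {"amount": 100.0, "address_ok": 0})
```

A positive coefficient raises the linear predictor and hence the probability that Y = 1, while a negative coefficient lowers it, which is the rule used when interpreting the estimated parameters.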

Now, we want to interpret the logistic regression model. In our model, Y = 1 when the debtor has paid fully; thus, variables with negative coefficients decrease the probability of full payment and, inversely, variables with positive coefficients increase it. For example, in financial services, address not ok, debt amount, debt date until sell-off, and rating original with labels 5 and 6 have a statistically significant negative effect on the recovery rate of the debtor; also, prop title and rating original with labels 0, 2 and 4 have a significant positive influence on the recovery rate in financial services. The probability of full payment could be higher for debtors with an address than for debtors without one. Similarly, debts with a longer period from default to sell-off have a lower probability of full payment. On the other hand, debts with rating original 0 and 1 present a higher probability of full payment than other debts. In summary, for financial services, the variables that increase the probability of non-payment are: do not have address; higher debt amount; longer period between default and sell-off; rating original = 05, 06. As with the telecommunications and financial services categories, we can interpret the logistic regression models for the other categories. Address ok, debt amount, rating original and debt date until sell-off are statistically significant in all industries. Also, the debtor age and debtor type variables are significant except in financial services, and the prop title variable has a significant influence in financial services, miscellaneous, return debit note, energy and utilities, telecommunications and mail order. As in the previous step, we use CART, CHAID, SVM, neural network, KNN and dmneural as classification methods. The misclassification rate in the validation data sets is used for model comparison.
Table 7 shows criteria such as the misclassification rate, the sum of square errors in training and validation, and the average square error in the validation data set for the different industries. The classification tree is the best model for classifying the debts with extreme recovery into full payment and non-payment in return debit note, energy and utilities and telecommunications. Based on the misclassification rate in the validation data set, neural network is the best classifier for business-to-business, financial services and mail order. The Support Vector Machine has the highest accuracy rate in the validation data sets for the public sector, public transport and miscellaneous categories.

TABLES 6, 7

6.3 Prediction of non-extreme debts

Models building and comparisons

We now come to the prediction of the recovery rate using the regression model. The response variable is the recovery rate of non-extreme debts. The independent variables are the same as in the previous steps: the debt amount (debt outstanding at sale), debtor age, prop title, debt date until sell-off (time between default and purchase by third party), rating (classifying the creditworthiness into an ordinal rating with seven levels), address (the validity of the debtor's address), and debtor type (either male, female, or corporate entity). Our regression model is:

$$
y = \mu + \alpha\cdot\text{amount} + \beta\cdot\text{debtor age} + \gamma\cdot\text{prop title} + \delta\cdot\text{debt date till selloff} + \rho_1\cdot\text{rating}_1 + \rho_2\cdot\text{rating}_2 + \rho_3\cdot\text{rating}_3 + \rho_4\cdot\text{rating}_4 + \rho_5\cdot\text{rating}_5 + \rho_6\cdot\text{rating}_6 + \rho_7\cdot\text{rating}_7 + \varphi\cdot\text{addresok} + \theta_1\cdot\text{debtor type}_1 + \theta_2\cdot\text{debtor type}_2 + \varepsilon
$$

We choose a regression model, so a model selection procedure is not applied. Table 8 shows the maximum likelihood estimates of the regression parameters for predicting the recovery rate of non-extreme debts. For an explanatory variable with a p-value lower than 0.05, the null hypothesis of no effect is rejected. This means that the explanatory variable has a statistically significant influence on the response variable. Now, we want to interpret this model. The response variable in our model is the recovery rate of debts; thus, variables with negative coefficients decrease the debt's recovery rate and, inversely, variables with positive coefficients increase it. For example, in business-to-business, the address not ok, debt amount and debtor age variables decrease the recovery rate. On the other hand, women debtors, rating original equal to 0 and 1, prop title and the time between default and sell-off to the third-party company have a positive influence on the recovery rate.
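The prediction models in this step are compared on R-squared, the sum of square errors and the average square error. A minimal sketch of these measures follows; it assumes the average square error is the sum of square errors divided by the number of observations and that R-squared is computed against the mean of the observed values. The function name and the toy numbers are hypothetical.

```python
def regression_metrics(y_true, y_pred):
    """Sum of square errors, average square error and R-squared."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)   # total sum of squares
    return {"sse": sse, "ase": sse / n, "r2": 1.0 - sse / sst}


# Hypothetical observed and predicted non-extreme recovery rates.
observed = [0.2, 0.4, 0.6, 0.8]
predicted = [0.25, 0.35, 0.65, 0.75]
metrics = regression_metrics(observed, predicted)
```

Computed on the validation data-set, a higher R-squared and a lower sum of square errors indicate the better predictor of non-extreme recovery rates.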
The interpretation of the regression models for the other categories is the same as for the business-to-business category. The debt amount, debt date until sell-off and rating variables are significant in all industries, while the address variable is not significant except in the miscellaneous category. The debtor type is statistically significant in the prediction of the recovery rate except in mail order and public transport. Only in financial services does the debtor age have no significant influence. We use the CART and CHAID algorithms as prediction models. The F-test statistic and variance reduction are the impurity measures for CHAID and CART, respectively. In addition, SVM, regression and neural network are applied to all the data sets. As mentioned before, we divided the non-extreme data sets into two sets: training and validation data sets. The training data sets contain 70% of the debts and the validation data sets contain 30% of the observations in each industry. We build each model on the learning data set and then

these models are evaluated on the validation data set for prediction of the recovery rate of non-extreme debts. The R-square and the sum of square errors are used as performance measures. Table 9 shows criteria such as R-square, sum of square errors and average square error for the different industries. The CHAID or CART trees are the best models for predicting the recovery rate of non-extreme debts in public sector, financial services, business-to-business, mail order, telecommunications and public transport. On the other hand, neural network is the best model for the return debit note and mail order categories, and SVM is the best predictor in the miscellaneous category.

TABLES 8, 9

7. Conclusions

We analyzed the recovery rates of 9,779,239 debts of a third-party company in this paper; these were classified into 9 categories: mail order, business-to-business, energy and utilities, financial services, miscellaneous, public sector, return debit note, telecommunications, and public transport. The loss-given-default is not the same in all categories, and the mean recovery rate ranges between and . The mail order category has the highest mean recovery rate and the public sector has the lowest. The financial services category has the highest mean debt at and one of the lowest mean recovery rates at . According to Table 1, the categories with a shorter debt date until sell-off have a higher mean recovery rate. As shown in Table 2, the optimum duration of the collection process is not the same in all industries. For instance, one year is definitely enough collection time for miscellaneous but it is not long enough in the financial services category. Neural network, CART, CHAID, Support Vector Machine, K-nearest neighbor, dmneural and logistic regression were applied in classifying debts into extreme and non-extreme recovery rate. These techniques were used on all the data-sets.
The decision tree algorithms are the best models in five industries and neural network has the best performance in three industries. The Support Vector Machine is the best classifier in the telecommunications category. The important advantage of the classification tree is the interpretability of the model. Conversely, neural network and SVM are black box methods. We applied neural network, CART, CHAID, Support Vector Machine, K-nearest neighbor, dmneural and logistic regression in classifying extreme debts into full payment and non-payment. The decision tree algorithms are the best models in three industries and neural network has the best performance in three industries. The Support Vector Machine is the best classifier in the public sector, public transport and miscellaneous categories. Neural network, CART, CHAID, Support Vector Machine and regression were used as prediction models for debts with a non-extreme recovery rate. The neural network has the best performance

in two industries and Support Vector Machine is the best predictor for the miscellaneous category. Decision trees have the best results in the other data sets. In summary, the non-statistical methods produce better results than the statistical methods.

References

[1] Altman, E. I., A. Resti, and A. Sironi (2005). Recovery risk: The next challenge in credit risk management, Chapter Loss given default: a review of the literature in recovery risk. Risk Books, London.
[2] Asarnow, E. and D. Edwards (1995). Measuring loss on defaulted bank loans: A 24-year study. Journal of Commercial Lending Vol. 77, No. 7.
[3] Avery, R. B., P. U. Calem, and G. B. Canner (2004). Consumer credit scoring: Do situational circumstances matter? Journal of Banking and Finance 28.
[4] Bastos, J. (2010a). Forecasting bank loans loss-given-default. Journal of Banking and Finance Vol. 34(10).
[5] Bastos, J. (2010b). Predicting bank loan recovery rates with neural networks. Technical report, Working Paper.
[6] BCBS (2005). International convergence of capital measurement and capital standards: A revised framework. Bank for International Settlements.
[7] Bellotti, T. and J. Crook (2008). Modelling and predicting loss given default for credit cards. Technical report, Working Paper.
[8] Bellotti, T. (2011, forthcoming). Loss given default models incorporating macroeconomic variables for credit cards. International Journal of Forecasting.
[9] Board of Governors of the Federal Reserve System (2011). Federal Reserve statistical release.
[10] Calabrese, R. (2010). Regression for recovery rates with both continuous and discrete characteristics. Proceedings of the 45th Scientific Meeting of the Italian Statistical Society (SIS), Italy.
[11] Chen, T. H. and C. W. Chen (2010). Application of data mining to the spatial heterogeneity of foreclosed mortgages. Expert Systems with Applications Vol. 37(2).
[12] Crook, J., D. Edelman, and L. Thomas (2007). Recent developments in consumer credit risk assessment. European Journal of Operational Research Vol. 183(3).

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Introduction to Logistic Regression

Introduction to Logistic Regression OpenStax-CNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance. Chapter 6: Behavioural models Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Statistics in Retail Finance. Chapter 2: Statistical models of default

Statistics in Retail Finance. Chapter 2: Statistical models of default Statistics in Retail Finance 1 Overview > We consider how to build statistical models of default, or delinquency, and how such models are traditionally used for credit application scoring and decision

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

Predicting Recovery Rates for Defaulting Credit Card Debt

Predicting Recovery Rates for Defaulting Credit Card Debt Predicting Recovery Rates for Defaulting Credit Card Debt Angela Moore Quantitative Financial Risk Management Centre School of Management University of Southampton Abstract Defaulting credit card debt

More information

E-commerce Transaction Anomaly Classification

E-commerce Transaction Anomaly Classification E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION

MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION MERGING BUSINESS KPIs WITH PREDICTIVE MODEL KPIs FOR BINARY CLASSIFICATION MODEL SELECTION Matthew A. Lanham & Ralph D. Badinelli Virginia Polytechnic Institute and State University Department of Business

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information
