USING LOGIT MODEL TO PREDICT CREDIT SCORE

Taiwo Amoo, Associate Professor of Business Statistics and Operation Management, Brooklyn College, City University of New York, (718) 951-5219, Tamoo@brooklyn.cuny.edu
Viju Raghupathi, Assistant Professor, School of Business, Brooklyn College, City University of New York, (718) 951-5219, Vraghupathi@brooklyn.cuny.edu

ABSTRACT

Loans approved by financial institutions depend on FICO scores from credit reporting agencies. The scores are based on the amount of money the consumer owes, the payment history, the length of credit history, new credit, and types of credit. The lending institutions get their reports from accredited reporting agencies, which standardize how the scores are determined. This paper discusses step-by-step procedures for using the logit (logistic regression) model to determine the consumer credit score. We deploy standard variables used by the reporting agencies and model dichotomous outcome variables using a linear combination of the predictor variables.

Keywords: Logit model, Credit scoring, Logistic regression

INTRODUCTION

It is important for any financial institution to have a less biased and more efficient credit scoring model. A credit score is a score that determines whether a borrower will be eligible to borrow money from lending institutions. The borrowed money could be for a mortgage, a car loan, an education loan, or any other type of loan that requires the lending institution to check the consumer's score and determine the level of risk. Information used to determine the credit score is obtained from the consumer's credit report. The three major reporting agencies in the United States are Experian, Equifax, and TransUnion. The authors examined a simple logit model to determine a consumer's credit score.
This was developed by first determining the types of variables in the credit report (either categorical or numerical), specifying the coding techniques, attaching the initial weights and finally applying the mathematical functions (summation and transfer). The output of the transfer function is primarily used to predict if a customer is likely to default or not, and subsequently to calculate the score to decide the consumers' credit worthiness. In this paper, two sets of variables or factors were applied to the model. The first set of variables consists of the five factors that are currently approved by the credit reporting agencies. The second set of nine variables or factors is recommended by the authors of this paper to determine the credit score. The logit model function uses the input information to detect a pattern in customers' behavior. The merit of this approach is that
it can be applied to different types of variables, whether qualitative or quantitative, and can also handle noisy data. A simulated data set was used to test the function and determine the output. The next section gives more information about credit scoring and the logit model.

RESEARCH BACKGROUND

With the growing concern about credit risk, the banking industry is focused on developing effective systems that can evaluate and manage credit. There is a strong need to measure, monitor, manage and control the financial risk associated with bank loans. In order to be proactive in preventing delinquency or default by customers, banks need to collect, analyze and classify the different elements comprising a customer's credit. Credit scoring is a technique by which financial institutions develop a numerical score for each applicant so as to reduce the probability of delinquency or payment default by customers. Financial institutions and banks employ the credit scoring model for many applications such as personal loans, home loans, small business loans and insurance applications/renewals [1]. A credit scoring model assesses the likelihood that a borrower will repay the loan on time, based on quantifiable borrower characteristics [2]. In the case of personal loans, a bank accepts or rejects a client's application for a loan based on a combination of judgmental techniques and credit scoring models. The judgmental approach is based on the 3C's, 4C's or 5C's, which are character, capital, collateral, capacity and condition [3]. Techniques for credit scoring involve one or a combination of the following: discriminant analysis, linear regression, logistic regression, probit analysis, nonparametric analysis, expert systems, genetic algorithms and neural networks [4]. Constructing a credit scoring model requires historical data on the performance of prior loans and on borrowers' characteristics, so as to extrapolate current repayment trends [3].
Prior research has used a combination of variables such as age, annual income, gender, marital status, number of children, number of other credit cards held [1], loan amount, loan duration, sex, monthly salary, additional income, house owned or rented, and education level [5]. Some of the indicators for credit scoring have been designated as demographic, financial, employment and behavioral indicators [6]. Logistic regression has been used in research on reducing high default rates. Sohn et al. [7] developed a scoring model for business loans using logistic regression with 16 individual technology attributes. Sohn et al. [8] used a logistic regression credit scoring model that also considered changes in the financial condition of firms after they received loans. Many studies have used logistic regression in credit scoring applications [5] [9] [10] [11] [12].

Our study discusses step-by-step procedures for using the logit model to determine the consumer's credit score with a combination of standard variables deployed by reporting agencies and other variables recommended by the authors. The Logit model,
also called the logistic regression model, is used to model dichotomous outcome variables. In the logit model, the log odds of the outcome are modeled as a linear combination of the predictor variables [13].

VARIABLES AND CODING TECHNIQUES

One of the major advantages of logit is its ability to handle both numerical and categorical variables. Since determining a consumer's credit score requires both types of variables, it is important to transform the original data into standardized values. The transformation depends on whether the variable of interest is numerical continuous, numerical discrete or categorical. The most efficient way of doing this is to convert the original data to standardized values between 0 and 1 in all three cases.

Numerical Continuous

The variables used in this category are Income and Debt-to-Income Ratio. In order to standardize the values to between 0 and 1, we use the formula shown below. For a given set of sample data:

$$\text{Normalized Value} = \frac{\text{Original Value} - \text{Minimum Value}}{\text{Maximum Value} - \text{Minimum Value}}$$

Numerical Discrete

The variables used in this category are Number of Credits, Years at Residence, Length of Credit History, Years at Employment, and Number of Dependants. In order to standardize their values to between 0 and 1, the following steps are helpful:

(1) Specify the range of the given sample data. For example, number of credits may have records between the values of 9 and 54 (9 is the smallest number of credits and 54 is the largest number of credits in the sample).

(2) Determine the width of the class to be used. This depends on the number of class groups you want to have. If you plan to have 5 classes, then the width to be used for this variable will be (54-9)/5 = 9.
Thus the class intervals will be as follows:

Number of Credits        Assigned Codes
0 but less than 9        0.00
9 but less than 18       0.17
18 but less than 27      0.34
27 but less than 36      0.51
36 but less than 45      0.68
45 but less than 54      0.85
54 but less than 63      1.00

Notice that we intended to have 5 classes but ended up with 7, because the classes start at 0 rather than at the sample minimum of 9 and the last class has to include the maximum of 54. This is acceptable since we are just trying to estimate the number of classes that will be needed for the assigned codes. The table above shows that a customer with fewer than 9 credits will be assigned a code of 0.00. To get the first nonzero code, we divide 1 by 6 (the number of classes minus one), which gives approximately 0.17. We then add 0.17 repeatedly to get each following code, and the last class is assigned 1.00. We expect that customers with good credit will have the opportunity to get more loans approved. This system of coding (from 0.00 to 1.00) will work for most numerical discrete variables in the credit scoring model.

Some numerical variables may behave differently, and you will have to code from 1.00 to 0.00. For example, a variable like the number of dependants may be coded as follows:

(1) Suppose the range of values in the sample data is between 0 and 5 (from no dependant to a maximum of 5 dependants).

(2) The width of the class is 5/3 = 1.67, rounded up to 2 in the table below.

Number of Dependants     Assigned Codes
0 but less than 2        1.00
2 but less than 4        0.67
4 but less than 6        0.33

In this case, the code means that the higher the number of dependants, the less likely the customer is to have extra money left to pay for a new loan or credit.

Categorical Variables

Variables of this type are Marital Status (Single, Married, Divorced, Separated), Loan Payment (Late, Not Late), Residential Status (Own, Rent), etc. The assigned codes depend on the number of categories involved in a variable. Examples are given below.

Example 1:

Residential Status       Assigned Codes
Own                      1.00
Rent                     0.50

Owning a residence implies more stability of address, which makes it easier for the creditor to locate the customer. A code of 1.00 therefore means a customer with a more reliable address.

Example 2:

Marital Status           Assigned Codes
Married                  1.00
Single                   0.75
Divorced                 0.50
Separated                0.25

A creditor may regard a married customer as more responsible and reliable than one who is not married.
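The three coding rules described in this section can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function and constant names are hypothetical, the income figures in the usage lines are made up, and clamping out-of-range values to the first or last class is an assumption the paper does not state. The class boundaries and codes are copied from the tables above.

```python
from bisect import bisect_right


def min_max_normalize(value, minimum, maximum):
    """Numerical continuous coding: (value - min) / (max - min)."""
    if maximum == minimum:
        raise ValueError("maximum and minimum must differ")
    return (value - minimum) / (maximum - minimum)


def code_from_classes(value, lower_bounds, codes):
    """Numerical discrete coding by class lookup.

    lower_bounds must be sorted ascending; class i covers values from
    lower_bounds[i] up to (but not including) lower_bounds[i + 1].
    Values outside the table are clamped to the first or last class
    (an assumption, not stated in the paper).
    """
    index = bisect_right(lower_bounds, value) - 1
    index = max(0, min(index, len(codes) - 1))
    return codes[index]


# Class tables copied from the paper.
CREDIT_LOWER_BOUNDS = [0, 9, 18, 27, 36, 45, 54]
CREDIT_CODES = [0.00, 0.17, 0.34, 0.51, 0.68, 0.85, 1.00]

DEPENDANT_LOWER_BOUNDS = [0, 2, 4]
DEPENDANT_CODES = [1.00, 0.67, 0.33]  # coded from 1.00 downward

# Categorical coding is a direct lookup.
RESIDENTIAL_STATUS_CODES = {"Own": 1.00, "Rent": 0.50}
MARITAL_STATUS_CODES = {
    "Married": 1.00,
    "Single": 0.75,
    "Divorced": 0.50,
    "Separated": 0.25,
}

# Usage with hypothetical customer data.
income_code = min_max_normalize(30000, 10000, 90000)
credits_code = code_from_classes(20, CREDIT_LOWER_BOUNDS, CREDIT_CODES)
dependants_code = code_from_classes(3, DEPENDANT_LOWER_BOUNDS, DEPENDANT_CODES)
marital_code = MARITAL_STATUS_CODES["Single"]
```

Keeping the class tables as explicit lists, rather than computing the codes from a formula, keeps the sketch faithful to the published codes, which are built by adding 0.17 repeatedly and so differ slightly from exact multiples of 1/6.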
For the Loan Payment variable, since customers have more than one credit, it is important to find the percentage of credits paid late out of the total number of credits. This is required for each record or customer. We can then apply the coding method for numerical discrete variables described above to the resulting percentages.

THE SCORING MODEL AND ITS ANALYSIS

The proposed model consists of a summation function and a sigmoid function. The summation function is a linear combination of the variable input codes and the weights. The initial weights are randomly selected and are assigned to the input variables in order of importance. Adjusting the weights then depends on training the network. The training method, known as the back propagation method, is discussed in the next section.

After coding the variables into values between 0 and 1, the next stage is to develop the statistical functions needed to predict the customer's credit rating or score. The first function, called the summation function, represents the linear combination of the initial weights and the coded values of the variables. Mathematically, it is written as:

$$X = W_1 V_1 + W_2 V_2 + W_3 V_3 + \dots + W_n V_n$$

Where:
X is the value of the summation function.
n is the number of variables.
W_i are the initial weights assigned to the variables (where i = 1, 2, 3, ..., n). The initial weights can be selected at random but are assigned to the variables in order of importance.
V_i represents the coded value of each variable.

The value of X is then passed to the sigmoid function, which is used to predict whether or not a customer is likely to default on a loan. The output value from the sigmoid function ranges between 0 and 1. In addition, the value can be used to determine the credit score of a customer. Thus, interpreting the score will suggest to the decision maker whether the customer has good or bad credit. The sigmoid function is given below:

$$\text{Probability (that a customer will not default)} = \frac{1}{1 + e^{-X}}$$

The results of the analysis are shown in Tables 1 and 2.
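As a minimal sketch of the summation and sigmoid functions above, the following Python code computes X and the probability of not defaulting for one customer. The function names, weights and coded values are hypothetical and chosen only for illustration. The last helper turns a list of sigmoid outputs into credit scores by min-max normalization; the multiplier of 1000 is an assumption inferred from the score range reported in Tables 1 and 2, not a value stated explicitly in the formulas.

```python
import math


def summation(weights, coded_values):
    """Summation function: X = W1*V1 + W2*V2 + ... + Wn*Vn."""
    if len(weights) != len(coded_values):
        raise ValueError("weights and coded_values must have the same length")
    return sum(w * v for w, v in zip(weights, coded_values))


def sigmoid(x):
    """Probability that a customer will not default: 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Equivalent form that avoids overflow for large negative x.
    e = math.exp(x)
    return e / (1.0 + e)


def credit_scores(probabilities, scale=1000):
    """Min-max normalize sigmoid outputs and rescale them to 0..scale.

    The scale of 1000 is an assumption inferred from Tables 1 and 2.
    """
    lowest, highest = min(probabilities), max(probabilities)
    if highest == lowest:
        raise ValueError("need at least two distinct probabilities")
    return [round(scale * (p - lowest) / (highest - lowest))
            for p in probabilities]


# Hypothetical weights (in order of importance) and coded input values.
weights = [0.9, 0.7, 0.5]
coded_values = [1.00, 0.50, 0.67]

x = summation(weights, coded_values)
probability = sigmoid(x)
print(f"X = {x:.3f}, probability of not defaulting = {probability:.2f}")

# Sigmoid values as reported (to two decimals) in Table 2.
table2_sigmoid = [0.94, 0.91, 0.97, 0.93, 0.84, 0.98, 0.97]
print(credit_scores(table2_sigmoid))
```

Applied to the two-decimal sigmoid values listed in Table 2, this normalization gives the same scores as the table, which suggests the published scores were computed from the rounded sigmoid values.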
An output value of the sigmoid function close to 1 indicates a high chance of being a good customer (unlikely to default on a loan), and a value close to 0 indicates a high chance of being a bad customer (highly likely to default on a loan). To determine the credit score, the sigmoid values were normalized using the formula for coding numerical continuous data (shown above) and expressed on a scale from 0 to 1000. What constitutes a good or a bad score depends on the creditor or the decision maker. Table 1 shows the distribution of the score. This implies that a customer with the smallest probability value will score zero and a customer with the highest probability value will
score 1000. Similar behavior of the score is exhibited in Table 2. It is the responsibility of the decision maker to set the cutoff point for the score. For example, in Table 1, it may be decided that customers who score less than 900 points should be disqualified from getting a loan. However, in Table 2, the cutoff point can be 600 points. If more customers are evaluated, the scores would be more widely distributed.

Table 1: Credit Scoring Analysis - Model A (5 Factors)

Customer   Summation   Sigmoid   Credit Score
1          11.25       0.99      1000
2          3.94        0.98      938
3          1.56        0.83      0
4          8.06        0.99      1000
5          4.38        0.98      938
6          1.69        0.84      63

Table 2: Credit Scoring Analysis - Model B (10 Factors)

Customer   Summation   Sigmoid   Credit Score
1          2.73        0.94      714
2          2.32        0.91      500
3          3.42        0.97      929
4          2.52        0.93      643
5          1.62        0.84      0
6          3.78        0.98      1000
7          3.44        0.97      929

CONCLUSION

The proposed credit score model is reliable, dynamic and easy to use. It is capable of estimating the chance that a customer will not default on a loan, considering the important variables used to determine credit score. In addition, the output of the model depends on the values of the input variables and their corresponding weights. For example, a good customer record input will automatically produce a good credit score and vice versa. Finally, the scoring model is sensitive to changes in both the input values and the weights.

REFERENCES

[1] Koh, H. C., Tan, W. C., Goh, C. P. Credit Scoring Using Data Mining Techniques, Singapore Management Review, 2004, 26(2), 25-47.
[2] Dinh, T. H. T., Kleimeier, S. A credit scoring model for Vietnam's retail banking market, International Review of Financial Analysis, 2007, 16, 471-495.
[3] Yap, B. W., Ong, S. H., Husain, N. H. M. Using data mining to improve assessment of credit worthiness via credit scoring models, Expert Systems with Applications, 2011, 38, 13274-13283.
[4] Hand, D. J., Henley, W. E. Statistical classification methods in consumer credit scoring: A review, Journal of the Royal Statistical Society: Series A (Statistics in Society), 1997, 160(3), 523-541.
[5] Abdou, H., Pointon, J., El-Masry, A. Neural nets versus conventional techniques in credit scoring in Egyptian banking, Expert Systems with Applications, 2008, 35, 1275-1292.
[6] Vojtek, M., Kocenda, E. Credit scoring methods, Czech Journal of Economics and Finance, 2006, 56(3-7), 152-167.
[7] Sohn, S. Y., Moon, T. H., Kim, H. S. Improved technology scoring model for credit guarantee fund, Expert Systems with Applications, 2005, 28(2), 327-331.
[8] Sohn, S. Y., Moon, T. H., Kim, H. S. Behavioral credit scoring model for technology-based firms that considers uncertain financial ratios obtained from relationship banking, Small Business Economics, 2013, 41, 931-943.
[9] Baesens, B., Gestel, T. V., Viaene, S., Stepanova, M., Suykens, J., Vanthienen, J. Benchmarking State-of-the-Art Classification Algorithms for Credit Scoring, Journal of the Operational Research Society, 2003, 54(6), 627-635.
[10] Crook, J., Edelman, D., Thomas, L. Recent developments in consumer credit risk assessment, European Journal of Operational Research, 2007, 183(3), 1447-1465.
[11] Desai, V. S., Crook, J. N., Overstreet, G. A. A Comparison of Neural Networks and Linear Scoring Models in the Credit Union Environment, European Journal of Operational Research, 1996, 95(1), 24-37.
[12] Lee, T. H., Jung, S. Forecasting creditworthiness: Logistic vs. artificial neural net, The Journal of Business Forecasting Methods and Systems, 2000, 18(4), 28-30.
[13] Hosmer, D., Lemeshow, S. Applied Logistic Regression (Second Edition). New York: John Wiley & Sons, Inc, 2000.