Lending Club Interest Rate Data Analysis

1. Introduction

Lending Club is an online financial community that brings together creditworthy borrowers and savvy investors so that both can benefit financially [1]. In this analysis, I focus on the relationship between the interest rates of loans on Lending Club and the characteristics of the person asking for the loan, such as employment history, credit history and creditworthiness scores. Among all variables, the FICO score is the most important one for lenders making a lending decision. Applicants with higher FICO scores might be offered better interest rates on mortgages or automobile loans, as well as higher credit limits [2]. Understanding the relationship between interest rate and FICO score can help people who want to borrow money through Lending Club to estimate their interest rate independently and make better investment decisions. I also extend the analysis to the relationship between interest rate and other characteristics of the individual at the same FICO score. Using exploratory analysis and general linear regression techniques, I find that there is a significant relationship between interest rate and FICO score, even after adjusting for important confounders such as loan purpose and employment length.

2. Method

Data collection

In this analysis, I used data issued by Lending Club. The data were downloaded from Lending Club (https://www.lendingclub.com/home.action) on November 8, 2013 using the R programming language [3].

Exploratory data analysis

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods [4]. Exploratory data analysis can be used to (1) check for missing data and potential confounders, (2) summarize the data and (3) transform variables to fit the regression model, for example, using logs to address right-skewed data.
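The effect of a log transformation on right-skewed data can be illustrated with a minimal sketch. It is written in Python rather than the R used for the analysis, and the income values are hypothetical, chosen only to produce a long right tail. The helper name skewness is introduced here for illustration.

```python
import math
import statistics


def skewness(values):
    """Population skewness: mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical monthly incomes (NOT taken from the Lending Club data):
# most values are moderate and a few are very large.
incomes = [2500, 3000, 3200, 3500, 4000, 4200, 5000, 6500, 12000, 40000]

raw_skew = skewness(incomes)
log_skew = skewness([math.log10(v) for v in incomes])

print(f"skewness of raw values:   {raw_skew:.2f}")
print(f"skewness of log10 values: {log_skew:.2f}")
```

For this sample the log10 values are less skewed than the raw values, which is why a log scale is often a better fit for a linear model of income-like variables.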
Statistical modeling

Because the dependent variable, interest rate, is numeric and continuous and the independent variables are both numeric and categorical, I decided to use a general linear model. The term general linear model is used for Normal linear models with any combination of categorical and continuous explanatory variables [5]. Model selection is based on the results of the exploratory data analysis. Linear regression is an approach to modeling the relationship between a scalar dependent variable y and one or more explanatory variables denoted X [6]. Thus, I think general linear regression is the most suitable technique for analyzing the relationship between interest rate and the other variables.

3. Results

Original dataset description

There are 2500 observations and 14 variables in the original dataset. The independent variables are amount requested (AR), amount funded by investors (AFI), loan length (LL), loan purpose (LP), debt to income ratio (D2I), state (Stat), home ownership (HO), monthly income (MI), FICO range (FICO), open credit lines (OCL), revolving credit balance (RC), inquiries in the last 6 months (Inq) and employment length (EL). The dependent variable is interest rate (IR).

Missing values and confounders

During exploratory data analysis, I found that 2 observations have missing values in the open credit lines, revolving credit balance and inquiries in the last 6 months variables. There are also 77 missing values in employment length. In addition to these missing values, there are 2 negative values in the amount funded by investors variable. I treated the 2 negative values as invalid data and deleted them along with the observations that have missing values. After processing the data, I kept interest rate in percentage units. I did not convert interest rate into a decimal because I think the decimal form is less intuitive to read. Specifically, a value of 8.42 means an interest rate of 8.42%. Through exploratory analysis, I also found that the distribution of monthly income is extremely right skewed. Revolving credit balance has the same problem. Thus, in the following analysis, I used a log transformation for these two variables. Moreover, I noticed that the correlation between amount requested and amount funded by investors is very high.
It is easy to see that the amount requested, to some extent, determines the amount investors are going to fund. Keeping these two variables together would produce multicollinearity. Therefore, I deleted the variable amount funded by investors. For the variable state, the number of observations differs greatly between states. Specifically, there are 171 observations in state TX, while there is only 1 observation in state MS. Having seen this, I strongly think the variable state is not a confounder. By contrast, loan purpose is a confounder in the dataset. Loan purpose not only influences the interest rate, but also affects the amount requested, since different loan purposes require different amounts of money. After running a simple linear regression of amount requested on loan purpose, the p-value for the model is highly significant (about 2.2e-16), which supports treating loan purpose as a confounder. For employment length, the average interest rate for each employment length is almost the same, around 13%.
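The cleaning rules and the collinearity check described in this subsection can be sketched as follows. This is a Python illustration, not the R code used for the analysis. The records, the field names and the helpers is_valid and pearson are all hypothetical.

```python
import math

# Hypothetical loan records; the field names are illustrative and are
# NOT the column names of the Lending Club file.
loans = [
    {"requested": 20000, "funded": 20000, "employment": "< 1 year"},
    {"requested": 12000, "funded": 11750, "employment": "2 years"},
    {"requested": 6000, "funded": -0.01, "employment": "5 years"},  # negative amount
    {"requested": 10000, "funded": 9975, "employment": None},       # missing value
    {"requested": 28000, "funded": 27500, "employment": "10+ years"},
    {"requested": 5000, "funded": 5000, "employment": "3 years"},
]


def is_valid(row):
    """Keep a row only if no field is missing and the funded amount is not negative."""
    return all(v is not None for v in row.values()) and row["funded"] >= 0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


clean = [row for row in loans if is_valid(row)]
r = pearson([row["requested"] for row in clean],
            [row["funded"] for row in clean])

print(f"rows kept: {len(clean)} of {len(loans)}")
print(f"correlation between requested and funded: {r:.4f}")
```

A correlation this close to 1 between two predictors is the situation in which keeping both in the same regression inflates the standard errors of their coefficients, so dropping one of them is a common remedy.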
Model analysis

From my knowledge of interest rates, I know the FICO score plays a key role. So I first ran a linear regression of interest rate on FICO score. Because FICO range is a categorical variable, I dummy coded it to fit the regression model. To summarize the FICO scores, I made a frequency table of FICO range. I noticed that there are only 4, 3 and 4 observations in FICO ranges 640-644, 645-649 and 655-659, respectively. Also, there are only 7, 5, 1 and 1 observations in FICO ranges 810-814, 815-819, 820-824 and 830-834, respectively. Thus, I grouped the first 3 levels into one group, 640-659, and the last 4 levels into another group, 810-834, so that no group is too small. After drawing the box plot of interest rate by FICO score, I found that the interest rate in the 640-659 group is much lower than in the next several groups. Considering that lower FICO scores are more likely to produce higher interest rates and that the number of observations in 640-659 is small, I treated the group 640-659 as an outlier. I ran the linear regression both with and without 640-659 (see Table 1.1 and Table 1.2). With 640-659, the p-values for some groups are not significant. Without 640-659, the p-values for all remaining groups are statistically significant. In the first regression, after dummy coding, 640-659 is the reference group, so its mean interest rate is the intercept.
The first regression model is as follows:

IR = b0 + b1(FICO) + e

                    Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)          15.1408      0.8170    18.53    0.0000
FICO.Range660-664     3.3027      0.8572     3.85    0.0001
FICO.Range665-669     2.2908      0.8517     2.69    0.0072
FICO.Range670-674     1.1137      0.8456     1.32    0.1880
FICO.Range675-679     0.7451      0.8469     0.88    0.3791
FICO.Range680-684    -0.0141      0.8476    -0.02    0.9867
FICO.Range685-689    -0.4628      0.8541    -0.54    0.5880
FICO.Range690-694    -0.3978      0.8522    -0.47    0.6407
FICO.Range695-699    -0.9390      0.8499    -1.10    0.2693
FICO.Range700-704    -1.7602      0.8556    -2.06    0.0398
FICO.Range705-709    -2.5230      0.8536    -2.96    0.0031
FICO.Range710-714    -2.6215      0.8608    -3.05    0.0023
FICO.Range715-719    -3.9478      0.8686    -4.54    0.0000
FICO.Range720-724    -4.0373      0.8608    -4.69    0.0000
FICO.Range725-729    -4.4900      0.8692    -5.17    0.0000
FICO.Range730-734    -5.1847      0.8675    -5.98    0.0000
FICO.Range735-739    -5.5173      0.8892    -6.20    0.0000
FICO.Range740-744    -5.5410      0.9097    -6.09    0.0000
FICO.Range745-749    -5.2392      0.9032    -5.80    0.0000
FICO.Range750-754    -6.6972      0.8949    -7.48    0.0000
FICO.Range755-759    -6.0233      0.9217    -6.54    0.0000
FICO.Range760-764    -6.4555      0.9195    -7.02    0.0000
FICO.Range765-769    -7.2955      0.9580    -7.62    0.0000
FICO.Range770-774    -8.3879      1.0670    -7.86    0.0000
FICO.Range775-779    -6.3967      1.0156    -6.30    0.0000
FICO.Range780-784    -7.4543      0.9877    -7.55    0.0000
FICO.Range785-789    -6.5214      1.0670    -6.11    0.0000
FICO.Range790-794    -7.5803      1.0334    -7.34    0.0000
FICO.Range795-799    -6.7493      1.1329    -5.96    0.0000
FICO.Range800-804    -7.4850      1.1554    -6.48    0.0000
FICO.Range805-809    -7.7145      1.1813    -6.53    0.0000
FICO.Range810-834    -7.3148      1.0961    -6.67    0.0000

Table 1.1 Linear regression with 640-659

                    Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)          18.4435      0.2599    70.97    0.0000
FICO.Range665-669    -1.0119      0.3546    -2.85    0.0044
FICO.Range670-674    -2.1890      0.3397    -6.44    0.0000
FICO.Range675-679    -2.5576      0.3427    -7.46    0.0000
FICO.Range680-684    -3.3168      0.3446    -9.63    0.0000
FICO.Range685-689    -3.7655      0.3603   -10.45    0.0000
FICO.Range690-694    -3.7005      0.3558   -10.40    0.0000
FICO.Range695-699    -4.2417      0.3501   -12.12    0.0000
FICO.Range700-704    -5.0629      0.3638   -13.92    0.0000
FICO.Range705-709    -5.8257      0.3590   -16.23    0.0000
FICO.Range710-714    -5.9242      0.3759   -15.76    0.0000
FICO.Range715-719    -7.2505      0.3936   -18.42    0.0000
FICO.Range720-724    -7.3400      0.3759   -19.53    0.0000
FICO.Range725-729    -7.7927      0.3948   -19.74    0.0000
FICO.Range730-734    -8.4874      0.3912   -21.70    0.0000
FICO.Range735-739    -8.8200      0.4372   -20.17    0.0000
FICO.Range740-744    -8.8437      0.4778   -18.51    0.0000
FICO.Range745-749    -8.5419      0.4651   -18.36    0.0000
FICO.Range750-754    -9.9999      0.4489   -22.28    0.0000
FICO.Range755-759    -9.3260      0.5002   -18.65    0.0000
FICO.Range760-764    -9.7582      0.4961   -19.67    0.0000
FICO.Range765-769   -10.5982      0.5645   -18.77    0.0000
FICO.Range770-774   -11.6906      0.7350   -15.90    0.0000
FICO.Range775-779    -9.6994      0.6579   -14.74    0.0000
FICO.Range780-784   -10.7570      0.6137   -17.53    0.0000
FICO.Range785-789    -9.8241      0.7350   -13.37    0.0000
FICO.Range790-794   -10.8830      0.6851   -15.89    0.0000
FICO.Range795-799   -10.0520      0.8281   -12.14    0.0000
FICO.Range800-804   -10.7877      0.8586   -12.56    0.0000
FICO.Range805-809   -11.0172      0.8934   -12.33    0.0000
FICO.Range810-834   -10.6175      0.7767   -13.67    0.0000

Table 1.2 Linear regression without 640-659
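Table 1.1 and Table 1.2 differ mainly in their reference group. With dummy coding and a single categorical predictor, the intercept is the mean of the reference group and every other coefficient is a difference from that mean. Moving the reference from 640-659 to 660-664 should therefore shift each coefficient by the 660-664 estimate. The sketch below checks this on a few rows copied from the two tables. It is a Python illustration, not the R code used for the analysis, and the function name rebase is introduced here for illustration.

```python
# A few coefficients copied from Table 1.1 (reference group: 640-659).
table_1_1 = {
    "(Intercept)": 15.1408,
    "660-664": 3.3027,
    "665-669": 2.2908,
    "740-744": -5.5410,
    "810-834": -7.3148,
}

# The matching rows of Table 1.2 (reference group: 660-664).
table_1_2 = {
    "(Intercept)": 18.4435,
    "665-669": -1.0119,
    "740-744": -8.8437,
    "810-834": -10.6175,
}


def rebase(coefs, new_reference):
    """Re-express dummy-coded coefficients relative to a new reference level."""
    shift = coefs[new_reference]
    # The new intercept is the mean of the new reference group.
    rebased = {"(Intercept)": coefs["(Intercept)"] + shift}
    for level, value in coefs.items():
        if level not in ("(Intercept)", new_reference):
            rebased[level] = value - shift
    return rebased


rebased = rebase(table_1_1, "660-664")
for level, value in rebased.items():
    print(f"{level:12s} rebased = {value:9.4f}   Table 1.2 = {table_1_2[level]:9.4f}")
```

The rebased values agree with Table 1.2 to the printed precision. This is expected because group means do not change when another group is dropped. What does change is the standard errors, which are much smaller in Table 1.2 because the new reference group is larger than the 11 observations in 640-659.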
We can see that after removing FICO range 640-659, the p-value of every FICO range is significant. The relationship between FICO score and interest rate is clear: individuals with higher FICO scores get lower interest rates. For example, under the regression without 640-659 (Table 1.2), the interest rate for borrowers whose FICO range is 740-744 is 8.84 percentage points lower than for those whose FICO range is 660-664. However, between FICO ranges 750-754 and 810-834, the decrease in interest rate with increasing FICO score is not obvious.

The second regression model includes potential confounders, to see whether the relationship between FICO score and interest rate is still significant and to find other variables that may influence interest rate. The second regression model is as follows:

IR = b0 + b1(ar) + b2(ll) + b3(d2i) + b4(ho) + b5 log10(mi) + b6(fico) + b7 log10(rc + 1) + b8(inq) + e

The result is in Table 1.3. In addition to FICO range, the p-values for amount requested, 60 months loan length, monthly income, home ownership (rent), revolving credit balance and inquiries in the last 6 months are all significant.
                                       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                             19.3800      0.7945    24.39    0.0000
Amount.Requested                         0.0002      0.0000    26.60    0.0000
C(Loan.Length,, 1)2                      3.1644      0.1015    31.19    0.0000
Debt.To.Income.Ratio                     0.0000      0.0056     0.00    0.9975
log10(monthly.income)                   -0.7980      0.2119    -3.77    0.0002
as.factor(home.ownership)other           1.2602      0.9376     1.34    0.1790
as.factor(home.ownership)own             0.1189      0.1504     0.79    0.4292
as.factor(home.ownership)rent            0.2595      0.0845     3.07    0.0022
log10(revolving.credit.balance + 1)     -0.3405      0.0640    -5.32    0.0000
Inquiries.in.the.Last.6.Months           0.3338      0.0321    10.40    0.0000
as.factor(fico.range)665-669            -0.6041      0.2337    -2.58    0.0098
as.factor(fico.range)670-674            -1.6302      0.2237    -7.29    0.0000
as.factor(fico.range)675-679            -2.2573      0.2261    -9.98    0.0000
as.factor(fico.range)680-684            -3.0830      0.2272   -13.57    0.0000
as.factor(fico.range)685-689            -3.6106      0.2372   -15.22    0.0000
as.factor(fico.range)690-694            -3.8087      0.2351   -16.20    0.0000
as.factor(fico.range)695-699            -4.2625      0.2309   -18.46    0.0000
as.factor(fico.range)700-704            -5.0629      0.2398   -21.11    0.0000
as.factor(fico.range)705-709            -5.8192      0.2370   -24.56    0.0000
as.factor(fico.range)710-714            -5.9139      0.2478   -23.87    0.0000
as.factor(fico.range)715-719            -7.0336      0.2601   -27.04    0.0000
as.factor(fico.range)720-724            -7.2720      0.2483   -29.29    0.0000
as.factor(fico.range)725-729            -7.7557      0.2604   -29.79    0.0000
as.factor(fico.range)730-734            -8.4360      0.2581   -32.68    0.0000
as.factor(fico.range)735-739            -8.9781      0.2886   -31.11    0.0000
as.factor(fico.range)740-744            -8.9387      0.3149   -28.38    0.0000
as.factor(fico.range)745-749            -8.9642      0.3069   -29.20    0.0000
as.factor(fico.range)750-754            -9.4747      0.2972   -31.88    0.0000
as.factor(fico.range)755-759            -9.6350      0.3314   -29.07    0.0000
as.factor(fico.range)760-764            -9.4758      0.3281   -28.88    0.0000
as.factor(fico.range)765-769           -10.0611      0.3733   -26.95    0.0000
as.factor(fico.range)770-774           -10.8712      0.4860   -22.37    0.0000
as.factor(fico.range)775-779           -10.0850      0.4368   -23.09    0.0000
as.factor(fico.range)780-784           -10.5659      0.4058   -26.04    0.0000
as.factor(fico.range)785-789           -10.6053      0.4885   -21.71    0.0000
as.factor(fico.range)790-794           -10.7768      0.4555   -23.66    0.0000
as.factor(fico.range)795-799           -10.9371      0.5500   -19.89    0.0000
as.factor(fico.range)800-804           -10.4325      0.5683   -18.36    0.0000
as.factor(fico.range)805-809           -11.2514      0.5920   -19.01    0.0000
as.factor(fico.range)810-834           -10.4526      0.5198   -20.11    0.0000

Table 1.3 Linear regression with potential confounders
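To show how the coefficients in Table 1.3 combine into a fitted interest rate, here is a minimal Python sketch (not the R code used for the analysis). Several points are assumptions: the borrower is hypothetical, the home ownership baseline is taken to be the level absent from the table (assumed to be mortgage), and C(Loan.Length,, 1)2 is read as the 60 months indicator, following the text. Only four FICO levels are copied, and the Amount.Requested estimate is rounded to 0.0002 in the table, so the result is approximate.

```python
import math

# Coefficients copied from Table 1.3 (interest rate in percentage points).
COEF = {
    "intercept": 19.3800,
    "amount_requested": 0.0002,   # rounded in the table, so only approximate
    "loan_length_60": 3.1644,     # C(Loan.Length,, 1)2, read as 60 months
    "debt_to_income": 0.0000,
    "log10_monthly_income": -0.7980,
    "log10_revolving_plus_1": -0.3405,
    "inquiries": 0.3338,
}
# ASSUMPTION: "mortgage" is the baseline level (it has no row in the table).
HOME = {"mortgage": 0.0, "other": 1.2602, "own": 0.1189, "rent": 0.2595}
# 660-664 is the baseline; only a few levels are copied here.
FICO = {"660-664": 0.0, "675-679": -2.2573, "720-724": -7.2720, "750-754": -9.4747}


def predict_rate(amount, sixty_month, dti, income, home, revolving, inquiries, fico):
    """Fitted interest rate (in percent) under the Table 1.3 model."""
    return (
        COEF["intercept"]
        + COEF["amount_requested"] * amount
        + COEF["loan_length_60"] * (1 if sixty_month else 0)
        + COEF["debt_to_income"] * dti
        + COEF["log10_monthly_income"] * math.log10(income)
        + COEF["log10_revolving_plus_1"] * math.log10(revolving + 1)
        + COEF["inquiries"] * inquiries
        + HOME[home]
        + FICO[fico]
    )


# Hypothetical borrower; none of these values come from the dataset.
borrower = dict(amount=10000, sixty_month=False, dti=15.0, income=5000,
                home="rent", revolving=10000, inquiries=1)

rate_675 = predict_rate(fico="675-679", **borrower)
rate_720 = predict_rate(fico="720-724", **borrower)
extra_1000 = predict_rate(fico="720-724", **{**borrower, "amount": 11000}) - rate_720

print(f"fitted rate, FICO 675-679: {rate_675:.2f}%")
print(f"fitted rate, FICO 720-724: {rate_720:.2f}%")
print(f"effect of requesting 1000 dollars more: {extra_1000:.2f} percentage points")
```

Because the model is additive, the gap between the two FICO groups equals the difference of their coefficients, whatever the other inputs are.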
Last, I want to identify how other variables influence interest rate at the same FICO score. In other words, I want to check the relationship between interest rate and other variables without the influence of FICO. I picked FICO range 675-679 and FICO range 720-724. The results are shown in Table 1.4 and Table 1.5.

                                       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                             12.5478      3.5743     3.51    0.0007
Amount.Requested                         0.0002      0.0000     5.80    0.0000
C(Loan.Length,, 1)2                      2.6520      0.4762     5.57    0.0000
Debt.To.Income.Ratio                     0.0188      0.0250     0.75    0.4539
C(Home.Ownership,, 4)RENT                0.0700      0.3535     0.20    0.8433
log10(monthly.income)                   -0.8990      1.0051    -0.89    0.3733
Open.CREDIT.Lines                       -0.1504      0.0455    -3.31    0.0013
log10(revolving.credit.balance + 1)     -0.1156      0.4731    -0.24    0.8075
Inquiries.in.the.Last.6.Months           0.6924      0.1572     4.40    0.0000

Table 1.4 Linear regression under FICO score 675-679

                                       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                             13.8910      3.1909     4.35    0.0000
Amount.Requested                         0.0002      0.0000     5.93    0.0000
C(Loan.Length,, 1)2                      2.6587      0.4659     5.71    0.0000
Debt.To.Income.Ratio                     0.0295      0.0284     1.04    0.3015
C(Home.Ownership,, 4)OTHER               0.8576      2.0412     0.42    0.6750
C(Home.Ownership,, 4)OWN                 0.5569      0.7084     0.79    0.4330
C(Home.Ownership,, 4)RENT                0.6682      0.3515     1.90    0.0592
log10(monthly.income)                    0.3033      0.9059     0.33    0.7382
Open.CREDIT.Lines                        0.0954      0.0428     2.23    0.0271
log10(revolving.credit.balance + 1)     -0.9984      0.4075    -2.45    0.0154
Inquiries.in.the.Last.6.Months           0.4525      0.1457     3.11    0.0023

Table 1.5 Linear regression under FICO score 720-724

You can see that at the same FICO score, the p-values for amount requested, 60 months loan length, inquiries in the last 6 months and open credit lines are significant. The coefficient of amount requested is only 0.0002, which suggests that each additional dollar requested increases the interest rate by 0.0002 percentage points. Although this increase is small, the interest rate would increase by 0.2 percentage points if the amount requested increases by 1000 dollars. Thus, the relationship between amount requested and interest rate is still strong.

4. Conclusions

My analysis of the Lending Club data suggests that there is a significant negative relationship between FICO score and interest rate. Also, according to the box plot I made in the final figure, I strongly think that after FICO range 750-754, the relationship between these two variables becomes weak. Besides, at the same FICO score, amount requested, 60 months loan length and some other variables are related to the interest rate. Among those variables, amount requested, loan length and inquiries in the last 6 months are positively associated with interest rate, while open credit lines is negatively associated with interest rate in the 675-679 group (Table 1.4) but positively associated in the 720-724 group (Table 1.5).
Because the dataset is not very large, I think the estimates of interest rate are not very precise. Also, the number of observations differs considerably across FICO ranges, which is likely to produce some outliers and confounding. A larger collection of interest rates across FICO scores may be more appropriate for understanding the relationship between FICO score and interest rate. In practice, borrowers on Lending Club who want a lower interest rate and a better deal should improve their FICO scores. However, they probably do not need to reach the highest scores, because above the 750-754 range the interest rate does not decrease much. Furthermore, they can adjust the amount requested and reduce inquiries in the last 6 months to lower the interest rate.
5. References

1. Lending Club web page. URL: https://www.lendingclub.com/public/about-us.action. Accessed 11/10/2013.
2. Wikipedia, "Credit score in the United States". URL: http://en.wikipedia.org/wiki/credit_score_in_the_united_states#fico_score. Accessed 11/10/2013.
3. R Core Team (2012). R: A language and environment for statistical computing. URL: http://www.r-project.org
4. Wikipedia, "Exploratory data analysis". URL: http://en.wikipedia.org/wiki/exploratory_data_analysis. Accessed 11/12/2013.
5. Annette J. Dobson. An Introduction to Generalized Linear Models. Vol. 115. Chapman & Hall/CRC, 2002.
6. Seber, George A. F., and Alan J. Lee. Linear Regression Analysis. Vol. 936. Wiley, 2012.