Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending

Proceedings of 2012 4th International Conference on Machine Learning and Computing
IPCSIT vol. 25 (2012), IACSIT Press, Singapore

Xue Ru 1 +, Bingwu Liu 2 and Shaohua Tan 1
1 Department of Intelligence Science, School of Electronic Engineering and Computer Science, Peking University, Beijing 100871, China
2 School of Information, Beijing Wuzi University, Beijing 101149, China

Abstract. Peer-to-peer (P2P) lending connects people who want to borrow with people who want to invest. Identifying the determinant factors of funding success and predicting whether a listing will get funded are two key issues in P2P lending. In this study, a Bayesian network model based on a new learning algorithm, HEK2 (Hierarchy Exact K2), is proposed to address these two issues. With the DAG (directed acyclic graph) structure learned in our model, the causal relationships among the entire factor set can be revealed visually. Consequently, the determinants of funding success and several hidden patterns rarely discussed before are extracted directly. Comparison with earlier work shows that the prediction accuracy of our method is 7.5 percentage points higher than SVM and 13.5 percentage points higher than KNN, both of which are popular classifiers. Empirical results show the effectiveness and flexibility of our model.

Keywords: P2P lending, causal relationship, funding success, Bayesian network, HEK2

1. Introduction
Peer-to-peer (P2P) lending, an emerging alternative to traditional institutional lending, is based on an online reverse auction. In P2P lending, people can either request loans by creating listings, taking the Borrower's role, or buy loans by making bids, taking the Lender's role [1][2]. Compared with traditional financial services middlemen, P2P lending has several advantages [3]. For example, the returns are said to be higher (10.69%) and the borrowing interest rate to be lower (starting at 6.59% for AA loans) [4].
In the study of P2P lending, identifying the determinant factors of funding success and predicting whether a listing will get funded are two key issues, and solving them provides valuable decision support for borrowers. A growing number of studies concentrate on these two issues. For example, a pairwise correlation test is used to identify the determinants of funding success, and a regression model is then used to predict funding success [2][5]. However, there is a risk of multicollinearity in the regression model. For example, the factors StartingRate and Amount are both included in the regression model in [2], yet the correlation between them is 0.55, which is statistically significant. To avoid multicollinearity, popular classification techniques such as SVM and KNN are used in [1] to predict funding success. However, these classifiers provide no explanation of the relationships among factors.

In this study, a Bayesian network model is used to solve the two key issues mentioned above, and we believe it has several key novelties compared with earlier work. First, the Bayesian network model avoids multicollinearity, as SVM and KNN do. Second, with the DAG structure learned in our model, the causal relationships among the entire factor set can be revealed visually. By contrast, the correlation matrix in [5] only shows whether two factors are correlated, and SVM and KNN in [1] provide no information about relationships among factors. Causal relationships discovered in our model directly reveal the hidden patterns buried in the data and identify the factors which actually drive the variation of funding success probabilities.

In addition, we should not neglect the skewing problem in the meta-data. To solve this problem, a data filtering method is proposed in [1], but the samples of the testing set are not randomly selected, which limits the usefulness of the method in a practical environment. From a practical point of view, we use the weight adjustment technique as a solution.

The rest of the paper is organized as follows. Section 2 introduces how causal relationships among factors are modeled. Section 3 describes the processing of the meta-data. In Section 4, we illustrate and analyze the experimental results. Conclusions and discussions are in Section 5.

+ Corresponding author. Tel.: +86 10 62755745. E-mail address: bjxueru@gmail.com

2. Build Bayesian Network Model
A Bayesian network model is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a DAG (directed acyclic graph). Several algorithms, such as K2, HillClimbing and SimulatedAnnealing, can be used to build the Bayesian network model. However, these algorithms only return approximate search results [6]. In this study, we propose the HEK2 (Hierarchy Exact K2) algorithm, which returns an exact search result by finding the best-matched structure. The HEK2 algorithm mainly consists of two steps. First, decide the level division of the factors collected from the P2P lending marketplace. Second, use the score-search approach to find the best-matched structure. Here we use Bayesian Dirichlet as our scoring criterion [6]:

$$P(B_s, D) = P(B_s) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})}$$

Here n is the number of factors, r_i is the number of values of factor v_i, q_i is the number of configurations of its parent set, N_{ijk} is the number of instances in D in which v_i takes its k-th value while its parents take their j-th configuration, N'_{ijk} is the corresponding prior count, and N_{ij} and N'_{ij} are the sums of N_{ijk} and N'_{ijk} over k.

In our methodology, we take a hierarchical view based on Assumption 1: if the value of a factor v_i is determined before that of another factor v_j, then v_i cannot be a descendant of v_j. Under this assumption, we can divide the factors into three layers. Details about the layers can be seen in Section 3.

The HEK2 algorithm can be seen as an extension of the original K2 algorithm. In the original K2 algorithm, the order of factors is given as an input.
However, the result relies heavily on the given order, so K2 only returns an approximate search result [8]. In the HEK2 algorithm, every possible order of the factors within the same layer is considered. The parent set of a factor is drawn from the factors before it under a fixed order and the factors from the previous layer. Our method then searches through the space of all possible DAGs, and the structure with the highest score is returned. The pseudo code of HEK2 is given in Algorithm 1. Dynamic programming can be used to accelerate the search. Assume that there are k_1 factors in layer 1; then in this study the time complexity is

$$\left[ k_1 2^{k_1 - 1} + (11 - k_1) 2^{10} + 2^{11} \right] O(n).$$

As for the inference part, there are well-developed algorithms for Bayesian network models [9].

Algorithm 1
Input: FactorsSet, PreviousLayerFactorsSet
Output: BestStructure, BestScore
Algorithm:
  List all the orders over the FactorsSet;
  For each Order:
    For each Node in FactorsSet:
      List all the possible parent sets of the Node;
      For each ParentSet:
        Calculate the score of the Node and its ParentSet;
      Find the ParentSet with highest Score;
      Add the Node and its ParentSet to temp Structure;
      Add the Score to temp Score;
  Find the BestStructure and associated BestScore;
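The per-layer search of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`k2_log_score`, `best_parents`, `hek2_layer`) are hypothetical, the data are assumed to be a list of dictionaries with discrete values, and the Bayesian Dirichlet score is instantiated with the K2 prior (all prior counts equal to 1), which the paper does not specify. The best parent set for each candidate set is memoised so that it is computed only once across orders.

```python
import math
from itertools import combinations, permutations


def k2_log_score(data, node, parents):
    """Log Bayesian Dirichlet score of one node, assuming prior counts N'_ijk = 1."""
    r = len({row[node] for row in data})      # number of observed values of the node
    counts = {}                               # parent configuration -> value -> N_ijk
    for row in data:
        cfg = tuple(row[p] for p in parents)
        by_value = counts.setdefault(cfg, {})
        by_value[row[node]] = by_value.get(row[node], 0) + 1
    score = 0.0
    for by_value in counts.values():          # unobserved configurations contribute log 1 = 0
        n_ij = sum(by_value.values())
        # log of Gamma(N'_ij) / Gamma(N'_ij + N_ij), with N'_ij = r
        score += math.lgamma(r) - math.lgamma(r + n_ij)
        # log of prod_k Gamma(1 + N_ijk) / Gamma(1); lgamma(1) is 0
        score += sum(math.lgamma(1 + n) for n in by_value.values())
    return score


def best_parents(data, node, candidates, cache):
    """Best-scoring subset of `candidates` as parents of `node` (memoised)."""
    key = (node, frozenset(candidates))
    if key not in cache:
        options = (
            (k2_log_score(data, node, ps), ps)
            for size in range(len(candidates) + 1)
            for ps in combinations(sorted(candidates), size)
        )
        cache[key] = max(options, key=lambda item: item[0])
    return cache[key]


def hek2_layer(data, factors, previous):
    """Try every order of `factors`; parents come from `previous` plus earlier factors."""
    cache = {}
    best_score, best_structure = -math.inf, None
    for order in permutations(factors):
        score, structure = 0.0, {}
        for pos, node in enumerate(order):
            candidates = list(previous) + list(order[:pos])
            node_score, parent_set = best_parents(data, node, candidates, cache)
            score += node_score
            structure[node] = parent_set
        if score > best_score:
            best_score, best_structure = score, structure
    return best_structure, best_score


# Toy data (hypothetical): B copies A, C is balanced and independent of both.
toy = [{"A": a, "B": a, "C": c} for a in (0, 1) for c in (0, 1) for _ in range(10)]
structure, score = hek2_layer(toy, factors=["B", "C"], previous=["A"])
print(structure)
```

On this toy data the search keeps A as the only parent of B and gives C no parents, because adding a parent that carries no information lowers the score.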

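As a rough check on the size of the search in Section 2, the bracketed term of the time complexity can be evaluated for the layering of Table 1. Two things here are assumptions rather than statements from the paper: the reading of the expression as k_1 2^{k_1 - 1} + (11 - k_1) 2^{10} + 2^{11}, and the interpretation of each term as a count of (factor, candidate parent set) scores. The value k_1 = 5 is taken from the five first-layer factors.

$$
\begin{aligned}
k_1 2^{k_1 - 1} + (11 - k_1) 2^{10} + 2^{11}
  &= 5 \cdot 2^{4} + 6 \cdot 2^{10} + 2^{11} \\
  &= 80 + 6144 + 2048 = 8272 .
\end{aligned}
$$

Under that reading, roughly 8,000 local scores need to be computed once and can then be reused across orders, which would explain why the exact search remains affordable for 12 factors.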
3. Data Processing
Prosper.com is the world's largest peer-to-peer lending marketplace, with more than 1,170,000 members and $272,000,000 in funded loans. Cross-sectional annual data for the 5 years from 2006 to 2010 are collected from Prosper.com in this study [4]. After removing irrelevant factors, 12 factors remain, including the class factor. Under Assumption 1, these factors are divided into three layers. Some factors need to be transformed. GroupKey is entered as True if the member belongs to a group, otherwise as False. The same transformation is applied to Description and Images. As for the class factor Status, the status completed is entered as True, and expired, withdrawn and canceled as False. The other values are omitted. Instances with missing values are removed directly. The equal-frequency discretization method is adopted to discretize the continuous variables. Details about the factors can be seen in Table 1.

Table 1: Factors.
Hierarchy      Factor                         Value Type
First Layer    DebtToIncomeRatio              Nominal
               CreditGrade (ProsperRating)    Nominal
               GroupKey
               VerifiedBankAccount
               IsBorrowerHomeOwner
Second Layer   AmountRequested                Nominal
               BorrowerMaximumRate            Nominal
               Description
               Duration                       Nominal
               FundingOption                  Nominal
               Images
Third Layer    Status

4. Experimental Analysis
The HEK2 algorithm introduced in Section 2 is applied to each of the 5 annual datasets. For clarity, we only show the learned structure of year 2006 as a representative (see Fig. 1). As can be seen from the graph, CreditGrade and BorrowerMaxRate are both determinants of the class factor Status. GroupKey, AmountRequested and DebtToIncomeRatio are ancestors of Status, which means that they have indirect influences. DebtToIncomeRatio has no significant influence, as it is too far from the class factor Status in the graph. Description and IsBorrowerHomeOwner have no effect on Status. All these results are in line with earlier work [2][5]. Images also has a direct influence on Status. VerifiedBankAccount does not have a sufficiently strong relationship with any other factor. These interesting findings have rarely been reported before.
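The distinction drawn above between direct determinants (parents of Status), indirect influences (other ancestors) and factors that are "too far" from Status can be read off a DAG mechanically. The sketch below is illustrative only: the first four edges follow statements in the text and the Fig. 1 caption, while the remaining three edges are hypothetical placeholders chosen to give the named ancestors a path to Status; they are not the edges of Fig. 1. The function names are likewise hypothetical.

```python
from collections import deque

EDGES = [
    # Stated in the text or the Fig. 1 caption.
    ("CreditGrade", "Status"),
    ("BorrowerMaxRate", "Status"),
    ("Images", "Status"),
    ("CreditGrade", "IsBorrowerHomeOwner"),
    # Hypothetical placeholder paths, NOT taken from Fig. 1.
    ("AmountRequested", "BorrowerMaxRate"),
    ("GroupKey", "AmountRequested"),
    ("DebtToIncomeRatio", "GroupKey"),
]


def parents(edges, node):
    """Direct determinants: sources of edges pointing at `node`."""
    return sorted(src for src, dst in edges if dst == node)


def ancestor_distances(edges, target):
    """Shortest directed path length from every ancestor to `target`."""
    incoming = {}
    for src, dst in edges:
        incoming.setdefault(dst, []).append(src)
    dist = {}
    queue = deque([(target, 0)])
    while queue:                      # breadth-first search against edge direction
        node, d = queue.popleft()
        for src in incoming.get(node, []):
            if src not in dist:
                dist[src] = d + 1
                queue.append((src, d + 1))
    return dist


direct = parents(EDGES, "Status")
distances = ancestor_distances(EDGES, "Status")
indirect = sorted(n for n, d in distances.items() if d > 1)
print(direct, indirect, distances)
```

In this toy graph IsBorrowerHomeOwner never appears among the ancestors of Status, which matches the statement that it has no effect on Status even though both depend on CreditGrade.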
A high correlation between IsBorrowerHomeOwner and Status is expected in both [2] and [5], but in fact the correlation between them is relatively low, which is hard to explain. However, it can be seen clearly in our learned structure that both are resulting factors of CreditGrade; there is no direct relationship between them. If an edge with the same direction appears in at least three of the 5 cross-sectional datasets, we confirm it as a credible relationship (see Fig. 2). To summarize the 5 cross-sectional datasets: VerifiedBankAccount and Description have no sufficiently strong relationship with any other factor. CreditGrade (ProsperRating), AmountRequested and BorrowerMaxRate are determinant factors of the class factor Status. GroupKey is an important factor influencing other listing options. CreditGrade (ProsperRating) has the widest effect on other factors.

Soft-margin SVM with different kernels and KNN are applied to the annual dataset of year 2007 to predict funding success in [1]. The results show that SVM with a radial basis kernel has the highest accuracy, 85%, while the prediction accuracy of KNN is 79%. On the 2007 dataset, the prediction accuracy of our model (92.50%) is 7.5 percentage points higher than SVM and 13.5 percentage points higher than KNN. The prediction performance of our model can be seen in Table 2.
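The rule for confirming a credible relationship (the same directed edge appearing in at least three of the five annual structures) amounts to a simple vote count. The sketch below assumes each learned structure is available as a set of (source, target) pairs; the edge sets shown are hypothetical stand-ins, not the structures learned in this study, and `credible_edges` is an illustrative name.

```python
from collections import Counter

# Hypothetical per-year edge sets, for illustration only.
YEARLY_EDGES = {
    2006: {("CreditGrade", "Status"), ("BorrowerMaxRate", "Status"),
           ("Images", "Status"), ("AmountRequested", "BorrowerMaxRate")},
    2007: {("CreditGrade", "Status"), ("BorrowerMaxRate", "Status"),
           ("AmountRequested", "Status"), ("AmountRequested", "BorrowerMaxRate")},
    2008: {("CreditGrade", "Status"), ("AmountRequested", "Status"),
           ("BorrowerMaxRate", "AmountRequested")},
    2009: {("CreditGrade", "Status"), ("BorrowerMaxRate", "Status"),
           ("AmountRequested", "Status"), ("BorrowerMaxRate", "AmountRequested")},
    2010: {("CreditGrade", "Status"), ("Images", "Status")},
}


def credible_edges(yearly_edges, min_count=3):
    """Keep directed edges that appear in at least `min_count` yearly structures."""
    votes = Counter(edge for edges in yearly_edges.values() for edge in set(edges))
    return {edge: n for edge, n in votes.items() if n >= min_count}


credible = credible_edges(YEARLY_EDGES)
print(sorted(credible.items()))
```

Because direction matters, the toy link between AmountRequested and BorrowerMaxRate is not confirmed: it appears in four years, but only twice in each direction.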

However, the prediction sensitivity, which indicates the proportion of successful listings that we correctly recognize, is too low to be acceptable. This is because the data skew heavily towards failed listings. For example, only 9% of all the listings in 2006 got funded. The weight adjustment technique is used to solve this problem: we increase the relative weight of successful listings to raise the sensitivity. Since there is a tradeoff between sensitivity and accuracy (see Table 3), the relative weight can be decided according to the relative importance of the different classes. In the case of 2006, 4.6 may be a proper value for the weight: the sensitivity rises to 67.60%, while the accuracy and specificity stay at 86.72% and 88.49%.

Fig. 1: Bayesian network structure for year 2006. A directed edge in the graph represents the causal relationship between two factors. CreditGrade, BorrowerMaxRate and Images are believed to have direct influences on Status.

Fig. 2: General model for 5 cross-sectional annual datasets. A directed edge represents the causal relationship between two factors. The number beside the edge represents the number of times this relationship appears in the 5 annual datasets. A relationship with 3 appearances or more is confirmed to be credible. CreditGrade (ProsperRating), AmountRequested and BorrowerMaxRate are three stable factors influencing Status across 5 years.

Table 2: Prediction accuracy for cross-sectional annual datasets.
Year   #Training Instances   #Testing Instances   Accuracy
2006   43,322                21,837               91.69%
2007   95,210                47,611               92.50%
2008   66,272                33,103               89.88%
2009   8,304                 3,996                84.38%
2010   14,714                7,600                78.33%

Table 3: Prediction performance with different weights.
Weight            1.0     1.9     2.8     3.7     4.6     5.4
Accuracy (%)      91.69   90.88   89.33   88.66   86.72   86.72
Sensitivity (%)   10.85   42.06   55.62   60.64   67.60   67.60
Specificity (%)   99.18   95.40   92.45   91.26   88.49   88.49

5. Conclusion and Discussion
In this study, we propose the HEK2 algorithm to build a Bayesian network model on empirical data collected from a P2P lending marketplace. The method is effective in discovering the complicated causal relationships among various factors. With the DAG structure learned in our model, the important factors which actually drive the variation of funding success probabilities are clearly illustrated. Empirically, our basic results are in line with earlier work; the difference is that our model reveals more hidden patterns. The prediction accuracy of our model is 7.5 percentage points higher than SVM and 13.5 percentage points higher than KNN as reported in earlier work. With the help of the weight adjustment technique, our model is of practical significance. However, our algorithm has exponential time complexity. Finding a more efficient exact search method is one direction for future research.

6. Acknowledgements
Supported by the Key Project of Beijing Natural Science Foundation (category B, No. KJ201210037037).

7. References
[1] Herrero-Lopez, A Sheng-Ying Pao, R Bhattacharyya. The Effect of Social Interactions on P2P Lending. media.mit.edu.
[2] L Puro, JE Teich, H Wallenius, J Wallenius. Borrower Decision Aid for people-to-people lending. Decision Support Systems, Volume 49, Issue 1, April 2010, Pages 52-60.
[3] M Klafft. Online peer-to-peer lending: A lenders' perspective. Proceedings of the International Conference on E-Learning, E-Business, Enterprise Information Systems, and E-Government, EEE 2008.
[4] http://www.prosper.com
[5] J Ryan, K Reuk, C Wang. To Fund Or Not To Fund: Determinants Of Loan Fundability in the Prosper.com Marketplace. Stanford Graduate School of Business.
[6] R Daly, Q Shen, S Aitken. Learning Bayesian networks: approaches and issues. The Knowledge Engineering Review (2011), 26: pp 99-157.
[7] FM Malvestuto. Approximating discrete probability distributions with decomposable models. Statistics and Computing, Volume 6, Number 2, 169-176.
[8] GF Cooper and E Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, Volume 9, Number 4, 309-347.
[9] A Darwiche. Recursive conditioning. Artificial Intelligence, Volume 126, Issues 1-2, February 2001, Pages 5-41.