Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending




Proceedings of 2012 4th International Conference on Machine Learning and Computing
IPCSIT vol. 25 (2012) © (2012) IACSIT Press, Singapore

Xue Rui 1 +, Bingwu Liu 2 and Shaohua Tan 1
1 Department of Intelligence Science, School of Electronic Engineering and Computer Science, Peking University, Beijing 100871, China
2 School of Information, Beijing Wuzi University, Beijing 101149, China

Abstract. Peer-to-peer lending, or P2P lending, connects people who want to borrow with people who want to invest. Identifying the determinant factors of funding success and predicting whether a listing will get funded are two key issues in P2P lending. In this study, a Bayesian network model based on a new learning algorithm, HEK2 (Hierarchy Exact K2), is proposed to address both issues. With the DAG (directed acyclic graph) structure learned in our model, the causal relationships of the entire factor set can be revealed in a visible manner. Consequently, the determinants of funding success and several hidden patterns rarely discussed before are extracted directly. Comparison with earlier work shows that the prediction accuracy of our method is 7.5 percentage points higher than SVM and 13.5 percentage points higher than KNN, which are both popular classifiers. Empirical results show the effectiveness and flexibility of our model.

Keywords: P2P lending, causal relationship, funding success, Bayesian network, HEK2

1. Introduction

Peer-to-peer (P2P) lending, an emerging alternative to traditional institutional lending, is based on an online reverse auction. In P2P lending, people can either request loans by creating listings, taking the Borrower's role, or buy loans by making bids, taking the Lender's role [1][2]. Compared with traditional financial services middlemen, P2P lending has several advantages [3]. For example, the returns are said to be higher (10.69%) and the borrowing interest rates (starting at 6.59% for AA loans) lower [4].
In the study of P2P lending, identifying the determinant factors of funding success and predicting whether a listing will get funded are two key issues, and solving them provides valuable decision support for borrowers. A growing number of studies concentrate on these two issues. For example, a pairwise correlation test is used to identify the determinants of funding success, and a regression model is then used to predict funding success [2][5]. However, there is a risk of multicollinearity in the regression model. As an example, the factors StartingRate and Amount are both included in the regression model in [2], but the correlation between them is 0.55, which is statistically significant. To avoid multicollinearity, popular classification techniques such as SVM and KNN are used in [1] to predict funding success. However, these classifiers provide no explanation of the relationships among factors.

In this study, a Bayesian network model is used to solve the two key issues mentioned above, and it is believed to have several key novelties compared with earlier work. First, the Bayesian network model avoids multicollinearity, just as SVM and KNN do. Second, with the DAG structure learned in our model, the causal relationships of the entire factor set can be revealed in a visible manner. In contrast, the correlation matrix in [5] only shows whether two factors are correlated or not, and SVM and KNN in [1] provide no information about relationships among factors. The causal relationships discovered in our model directly reveal the hidden patterns buried in the data and identify the factors which actually drive the variation of funding success probabilities.

Besides, we should not neglect that the meta-data are skewed. To solve this problem, a data filtering method is proposed in [1], but the samples of the testing set are not randomly selected, which makes the method of little use in a practical environment. From a practical point of view, we use a weight adjustment technique as a solution.

The rest of the paper is organized as follows. Section 2 introduces how causal relationships among factors are modeled. Section 3 describes the processing of the meta-data. In Section 4, we illustrate and analyze the experimental results. Conclusions and discussion are in Section 5.

+ Corresponding author. Tel.: +86 10 62755745. E-mail address: bjxuerui@gmail.com

2. Build Bayesian Network Model

A Bayesian network model is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a DAG (directed acyclic graph). Several algorithms, such as K2, HillClimbing and SimulatedAnnealing, can be used to build the Bayesian network model. However, these algorithms only return approximate search results [6]. In this study, we propose the HEK2 (Hierarchy Exact K2) algorithm, which returns an exact search result by finding the best matched structure.

The HEK2 algorithm mainly consists of two steps. First, decide the level division of the factors collected from the P2P lending marketplace. Second, use the score-search approach to find the best matched structure. Here we use Bayesian Dirichlet as our scoring criterion [6]:

$$P(B_s, D) = P(B_s) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})}$$

In the standard form of this score, n is the number of variables, q_i is the number of parent configurations of variable i, r_i is the number of states of variable i, N_{ijk} is the number of instances in D in which variable i takes its k-th state while its parents take their j-th configuration, N'_{ijk} is the corresponding prior count, N_{ij} = \sum_k N_{ijk} and N'_{ij} = \sum_k N'_{ijk}.

In our methodology, we take a hierarchical view based on Assumption 1: if the value of a factor v_i is determined before another factor v_j, then v_i can't be a descendant of v_j. Under this assumption, we can divide the factors into three layers. Details about the different layers can be seen in Section 3.

The HEK2 algorithm can be seen as an extension of the original K2 algorithm. In the original K2 algorithm, the order of factors is given as an input.
However, the result relies heavily on the given order, so K2 only returns an approximate search result [8]. In the HEK2 algorithm, every possible order of the factors in the same layer is considered. The candidate parents of a factor are the factors before it under a fixed order and the factors from the previous layer. Our method then searches through the space of all possible DAGs, and the structure with the highest score is returned. The pseudo code of HEK2 is given in Algorithm 1. Dynamic programming can be used to accelerate the search. Assume that there are k_1 factors in layer 1; then in this study the time complexity is $[k_1 2^{k_1 - 1} + (11 - k_1) 2^{10} + 2^{11}] \cdot O(n)$. As for the inference part, there are well-developed algorithms for the Bayesian network model [9].

Algorithm 1
Input: FactorsSet, PreviousLayerFactorsSet
Output: BestStructure, BestScore
Algorithm:
    List all the orders over the FactorsSet;
    For each Order:
        For each Node in FactorsSet:
            List all the possible parent sets of the Node;
            For each ParentSet:
                Calculate the score of the Node and its ParentSet;
            Find the ParentSet with the highest Score;
            Add the Node and its ParentSet to the temp Structure;
            Add the Score to the temp Score;
    Find the BestStructure and the associated BestScore;
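As an illustration of Algorithm 1, the following Python sketch scores each node with the Bayesian Dirichlet criterion in log form and tries every order within one layer. It is a minimal sketch under stated assumptions, not the authors' implementation: all prior counts N'_{ijk} are set to 1 (the K2 prior), the structure prior P(B_s) is taken as uniform, the search is brute force without the dynamic programming mentioned above, and the names log_k2_score, all_subsets and search_layer are hypothetical.

```python
from collections import Counter
from itertools import combinations, permutations
from math import lgamma


def log_k2_score(data, node, parents, arity):
    """Log BD score of one node given a parent set, assuming every N'_ijk = 1."""
    r = arity[node]            # number of states of the node, so N'_ij = r
    n_ij = Counter()           # parent configuration -> count
    n_ijk = Counter()          # (parent configuration, node state) -> count
    for row in data:
        cfg = tuple(row[p] for p in parents)
        n_ij[cfg] += 1
        n_ijk[(cfg, row[node])] += 1
    score = 0.0
    # Unobserved configurations contribute log(1) = 0, so only observed ones are summed.
    for n in n_ij.values():
        score += lgamma(r) - lgamma(r + n)
    for n in n_ijk.values():
        score += lgamma(1 + n)  # lgamma(1) = 0, so the denominator vanishes
    return score


def all_subsets(items):
    """Yield every subset of items as a tuple, smallest subsets first."""
    for size in range(len(items) + 1):
        yield from combinations(items, size)


def search_layer(data, layer, previous_layers, arity):
    """Try every order of one layer and return (best structure, best log score)."""
    best_structure, best_score = None, float("-inf")
    for order in permutations(layer):
        structure, total = {}, 0.0
        for pos, node in enumerate(order):
            # Candidate parents: earlier layers plus nodes earlier in this order.
            candidates = list(previous_layers) + list(order[:pos])
            scored = [(log_k2_score(data, node, ps, arity), ps)
                      for ps in all_subsets(candidates)]
            node_score, parents = max(scored, key=lambda t: t[0])
            structure[node] = parents
            total += node_score
        if total > best_score:
            best_structure, best_score = structure, total
    return best_structure, best_score
```

Here data is assumed to be a list of dictionaries mapping factor names to discretized values, and arity maps each factor to its number of states. Because the subsets are enumerated from smallest to largest, ties are broken in favour of the smaller parent set.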

3. Data Processing

Prosper.com is the world's largest peer-to-peer lending marketplace, with more than 1,170,000 members and $272,000,000 in funded loans. Cross-sectional annual data for the 5 years from 2006 to 2010 are collected from Prosper.com in this study [4]. After removing irrelevant factors, 12 factors remain, including the class factor. Under Assumption 1, these factors are divided into three layers.

Some factors need to be transformed. The status of GroupKey is entered as True if the member has a group, otherwise as False. The same transformation is applied to Description and Images. As for the class factor Status, the status completed is entered as True, and expired, withdrawn and canceled as False. The other values are omitted. Instances with missing values are removed directly. The equal frequency discretization method is adopted to discretize the continuous variables. Details about the factors can be seen in Table 1.

Table 1: Factors.

Hierarchy      Factor                        Value Type
First Layer    DebtToIncomeRatio             Nominal
               CreditGrade (ProsperRating)   Nominal
               GroupKey
               VerifiedBankAccount
               IsBorrowerHomeOwner
Second Layer   AmountRequested               Nominal
               BorrowerMaximumRate           Nominal
               Description
               Duration                      Nominal
               FundingOption                 Nominal
               Images
Third Layer    Status

4. Experimental Analysis

The HEK2 algorithm introduced in Section 2 is applied to each of the 5 annual datasets. For clarity, we only show the learned structure of year 2006 as a representative (see Fig. 1). As can be seen from the graph, CreditGrade and BorrowerMaxRate are both determinants of the class factor Status. GroupKey, AmountRequested and DebtToIncomeRatio are ancestors of Status, which means that they have indirect influences. DebtToIncomeRatio has no significant influence, as it is too far from the class factor Status in the graph. Description and IsBorrowerHomeOwner have no effect on Status. All these results are in line with earlier work [2][5]. Images also has a direct influence on Status. VerifiedBankAccount doesn't have a relationship strong enough with any other factor. These interesting findings have rarely been reported before.
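The distinction drawn above between determinants (direct parents of Status) and ancestors (factors with only indirect influence) can be read off any learned DAG mechanically. The sketch below shows one way to do so; the edge list is a hypothetical toy graph with abstract node names, not the structure learned from the Prosper data, and the function names are illustrative.

```python
def parents_of(edges, node):
    """Direct parents of node in a DAG given as (source, target) pairs."""
    return {u for u, v in edges if v == node}


def ancestors_of(edges, node):
    """All nodes with a directed path to node (parents included)."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent in parents_of(edges, current):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical toy DAG: A -> B -> T and C -> T, with T as the class node.
toy_edges = [("A", "B"), ("B", "T"), ("C", "T")]
direct = parents_of(toy_edges, "T")               # factors with direct influence
indirect = ancestors_of(toy_edges, "T") - direct  # factors with indirect influence only
```

In this toy graph B and C play the role of determinants of the class node, while A influences it only through B.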
A high correlation between IsBorrowerHomeOwner and Status is expected in both [2] and [5], but in fact the correlation between them is relatively low, which is hard to explain. However, it can be seen clearly under our learned structure that they are both resulting factors of CreditGrade. There is no direct relationship between them.

If an edge with the same direction appears at least three times in the 5 cross-sectional datasets, we confirm it as a credible relationship (see Fig. 2). To summarize the 5 cross-sectional datasets, VerifiedBankAccount and Description have no relationship strong enough with any other factor. CreditGrade (ProsperRating), AmountRequested and BorrowerMaxRate are determinant factors of the class factor Status. GroupKey is an important factor influencing other listing options. CreditGrade (ProsperRating) has the widest effect on other factors.

Soft margin SVM with different kernels and KNN are applied to the annual dataset of year 2007 to predict funding success in [1]. The result shows that SVM with a Radial Basis Kernel has the highest accuracy, 85%. The prediction accuracy of KNN is 79%. The prediction accuracy of our model on the 2007 dataset is 7.5 percentage points higher than SVM and 13.5 percentage points higher than KNN. The prediction performance of our model can be seen in Table 2.
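The credibility rule described above (a directed edge is kept when it appears in at least three of the five annual structures) amounts to a simple vote count. The following sketch assumes each learned structure is available as a collection of (source, target) pairs; the function name credible_edges and the toy edge lists with abstract node names are hypothetical and do not reproduce the edges of Fig. 2.

```python
from collections import Counter


def credible_edges(structures, min_count=3):
    """Count each directed edge once per structure and keep the frequent ones."""
    counts = Counter(edge for edges in structures for edge in set(edges))
    return {edge: n for edge, n in counts.items() if n >= min_count}


# Hypothetical edge lists for five annual structures.
yearly = [
    [("A", "T"), ("B", "T")],
    [("A", "T"), ("T", "B")],
    [("A", "T"), ("B", "T")],
    [("A", "T")],
    [("B", "T"), ("C", "A")],
]
stable = credible_edges(yearly)
```

Because edges are compared as ordered pairs, ("B", "T") and ("T", "B") are counted separately, which matches the requirement that the direction be the same.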

The prediction sensitivity, which indicates the proportion of successful listings that are correctly recognized, is nevertheless too low to accept. This is because the data skew heavily towards the failed listings. For example, only 9% of all the listings in 2006 got funded. The weight adjustment technique is used to solve this problem: we increase the relative weight of successful listings to raise the sensitivity. Since there is a tradeoff between sensitivity and accuracy (see Table 3), the relative weight can be decided according to the relative importance of the different classes. In the case of 2006, 4.6 may be a proper value for the weight. The sensitivity rises to 67.60% while the accuracy and specificity stay at 86.72% and 88.49%.

Fig. 1: Bayesian network structure for year 2006. A directed edge in the graph represents the causal relationship between two factors. CreditGrade, BorrowerMaxRate and Images are believed to have direct influences on Status.

Fig. 2: General model for 5 cross-sectional annual datasets. A directed edge represents the causal relationship between two factors. The number beside the edge represents the number of times this relationship appears in the 5 annual datasets. A relationship with 3 appearances or more is confirmed to be credible. CreditGrade (ProsperRating), AmountRequested and BorrowerMaxRate are three stable factors influencing Status across 5 years.

Table 2: Prediction accuracy for cross-sectional annual datasets.

Year   #Training Instances   #Testing Instances   Accuracy
2006   43,322                21,837               91.69%
2007   95,210                47,611               92.50%
2008   66,272                33,103               89.88%
2009   8,304                 3,996                84.38%
2010   14,714                7,600                78.33%

Table 3: Prediction performance with different weights.

Weight            1.0     1.9     2.8     3.7     4.6     5.4
Accuracy (%)      91.69   90.88   89.33   88.66   86.72   86.72
Sensitivity (%)   10.85   42.06   55.62   60.64   67.60   67.60
Specificity (%)   99.18   95.40   92.45   91.26   88.49   88.49

5. Conclusion and Discussion

In this study, we propose the HEK2 algorithm to build a Bayesian network model on empirical data collected from a P2P lending marketplace. The method is effective in discovering the complicated causal relationships among various factors. With the DAG structure learned in our model, the important factors which actually drive the variation of funding success probabilities are clearly illustrated. Empirically, our basic results are in line with earlier work. The difference is that our model reveals more hidden patterns. Compared with earlier work, the prediction accuracy of our model is 7.5 percentage points higher than SVM and 13.5 percentage points higher than KNN. With the help of the weight adjustment technique, our model is of practical significance. However, our algorithm has exponential time complexity. Finding a more efficient exact search method is one of the future research directions.

6. Acknowledgements

Supported by the Key Project of Beijing Natural Science Foundation (category B, No. KJ201210037037).

7. References

[1] Herrero-Lopez, A Sheng-Ying Pao, R Bhattacharyya. The Effect of Social Interactions on P2P Lending. media.mit.edu.
[2] L Puro, JE Teich, H Wallenius, J Wallenius. Borrower Decision Aid for people-to-people lending. Decision Support Systems, Volume 49, Issue 1, April 2010, Pages 52-60.
[3] M Klafft. Online peer-to-peer lending: A lenders' perspective. Proceedings of the International Conference on E-Learning, E-Business, Enterprise Information Systems, and E-Government, EEE 2008.
[4] http://www.prosper.com
[5] J Ryan, K Reuk, C Wang. To Fund Or Not To Fund: Determinants Of Loan Fundability in the Prosper.com Marketplace. Stanford Graduate School of Business.
[6] R Daly, Q Shen, S Aitken. Learning Bayesian networks: approaches and issues. The Knowledge Engineering Review (2011), 26: pp 99-157.
[7] F. M. Malvestuto. Approximating discrete probability distributions with decomposable models. Statistics and Computing, Volume 6, Number 2, 169-176.
[8] GF Cooper and E Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, Volume 9, Number 4, 309-347.
[9] A Darwiche. Recursive conditioning. Artificial Intelligence, Volume 126, Issues 1-2, February 2001, Pages 5-41.