Data mining application in banking sector with clustering and classification methods

Proceedings of the 2015 International Conference on Industrial Engineering and Operations Management, Dubai, United Arab Emirates (UAE), March 3-5, 2015

Aslı Çaliş, Gazi University, Department of Industrial Engineering, Ankara, Turkey
Ahmet Boyaci, Hitit University, Department of Management, Çorum, Turkey
Kasım Baynal, Kocaeli University, Department of Industrial Engineering, Kocaeli, Turkey

Abstract
The rapid growth in the volume of information has created a need in every field for forecasting systems that support strategy development. Data mining techniques are therefore used extensively in banking, as in many other fields. This study, conducted in the banking sector, aims to minimize the risk in decision making by analyzing existing personal loan customers and estimating the payment performance of potential customers. It uses the k-means method, which is one of the clustering techniques, and the decision trees method, which is one of the classification models in data mining. SPSS Clementine was used as the data mining software, and an application was carried out to evaluate personal loan customers.

Keywords: classification; clustering; data mining; personal loans; SPSS Clementine

I. INTRODUCTION
Advances in computer technology have increased both the production of information and the volume of database systems. Data mining is the process of discovering potentially useful data held in databases and deriving meaningful patterns from them. In today's consumer-focused markets, businesses face intense and continuous competition. To succeed under these conditions, they have to apply effective, low-cost marketing strategies [9].
Effective marketing strategies require accurate information, and obtaining it requires forward-looking forecasting systems that can analyze data along multiple dimensions. For this reason, data mining techniques are widely used in banking, as in many other fields. Since credit allocation is risky for banks, this study aims to obtain reliable information through data mining in order to minimize the risk in decision making and to identify potential future customers. Clustering and classification models are used. The credit repayment performance of existing individual customers of a first-class branch of one of the largest banks in the Turkish financial sector is assessed with the k-means method, and the repayment behavior of potential future customers is estimated using decision trees.

II. LITERATURE SEARCH
Aşan (2007) aimed to group customers who use credit cards by their socio-economic characteristics. The study first defines individual banking and credit cards, explains the place and importance of these concepts in the country, and then groups bank customers who use credit cards by clustering analysis. The customers were divided into three clusters according to their socio-economic characteristics, and the clusters were observed to differ on ten socio-economic indicators [2].

Doğan (2008) applied clustering analysis to the financial ratios of the commercial banks active in the Turkish banking sector in the period of ( ). The study carries out and explains a cluster analysis based on the financial ratios of the commercial banks that were active as of the date in question.
The compatibility of the application results with the results of other analyses carried out for banks was discussed and, in line with the purpose of cluster analysis, the financial performance of the banks was determined and banks that resemble each other financially were identified. The study also examined whether cluster analysis could be used in the supervision of banks as a technique complementary to the existing ones [7].

Chien and Chen (2008) aimed to derive association rules linking personnel characteristics to behavior at work, including job performance and separation from work, by presenting a data mining framework for personnel selection. Their study focused on decision trees and association rules, with the goal of closing the gap between data mining and personnel selection and of benefiting the selection process. Rules for personnel selection decisions were formed by decision tree analysis. Because a considerably large share of personnel data is categorical, a CHAID decision tree was used for classification. Lift was used as the criterion for assessing the performance of the classification method and for arriving at useful rules. The study covered the recruitment of indirect workers, including engineers and managers, for different business functions of the firm. The results made it possible to determine decision rules relating to personnel performance and separation from work [5].

Hsia et al. (2008) used data mining at a university in Taiwan to analyze course preferences and course completion rates. The student records for the years  were examined with three data mining algorithms: decision tree, link analysis and decision forest. The objective was to use these techniques to determine students' course preferences and the courses they are likely to attend in the future. Decision trees were used to find the course preferences of students, link analysis to determine the correlation between course category and participant profession, and decision forest to determine the probability that participants complete their preferred course.
In the study, CHAID was used as the decision tree algorithm. Course category and participant profession were the predictor variables, and the status of participants at the time of joining was the target variable. After the decision tree had been built to find the preferred courses, link analysis was used to find the relationship between course category and participant profession. Finally, the decision forest was used to determine the courses preferred by participants from different sectors [11].

Fu et al. (2007) studied women and men from two countries from the angles of culture, behavior and social loyalty as factors that determine quality of life. CART was used to determine the quality of life of 278 Australian and 398 Taiwanese women and men. Four dependent variables (physical, psychological, spiritual (mental) and environmental health) were measured to capture the multi-dimensional quality of life. The independent variables were culture, behavior and social loyalty, together with socio-demographic status and religious and spiritual characteristics. The socio-demographic variables were age, marital status, level of education, current employment status and annual household income. Age was treated as a continuous variable, while the other variables were used as dummy variables in multiple regression analysis. The study concluded that the CART algorithm can be used with parametric data without any data transformation, and that one of its major advantages is that it reveals the hierarchy among the independent variables [10].

Questier et al. (2005) applied CART and multivariate regression trees (MRT) to supervised and unsupervised feature selection. The CART method allows supervised modeling with more than one explanatory variable x and a single response variable y.
MRT is derived from CART and can handle more than one response variable y. Applications on artificial and real data sets show that the proposed method can be used effectively for feature selection. When the number of features is reduced, the most important cluster structure is revealed. At the same time, the method improves the cluster structure by removing unnecessary and unrelated features [13].

Albayrak and Yılmaz (2009) carried out a data mining application on data from the Istanbul Stock Exchange (İMKB). Using the annual financial indicators for the years  of the 100 industry and service sector enterprises in the İMKB 100 index, they applied the decision tree technique. The CHAID algorithm was applied to the data derived from the companies' financial information. The results determined the positions of the enterprises relative to one another and the variables that most strongly affect the sector variable [1].

Dolgun et al. (2009) applied data mining to the analysis of unstructured data. Unstructured data were converted into structured form using text and web mining methods, and their contribution to the success of the model after being included in it was analyzed. Models built with the C5.0 algorithm, which is one of the decision tree methods, were compared with each other and the best model was determined [8].
Emel and Taşkın (2005) used a database of customer-level sales behavior at a retail enterprise to carry out a sales analysis with detailed and relative measurement results. The C&RT decision tree technique was used for the sales forecasting model. As a result, the customers were divided into classes according to their amount of spending. This made it possible to determine the shortfalls against targets in sales success and the relative contribution of different factors to them [9].

Özekes and Çamurcu (2002) made a data mining application on classification and prediction. By examining the credits a bank had given to customers in the past and the credit contracts that had ended, they formed a decision tree and classification rules. These rules were then used to estimate the repayment status of customers whose credit contracts were still running [12].

III. METHOD/MODEL
The application uses the k-means method, which is one of the clustering techniques, and the decision trees method, which is one of the classification models in data mining. It is carried out with the SPSS Clementine program. The effect of each variable on the clusters is examined separately with the k-means method, and existing customers are assessed accordingly. The results of the C5.0 and C&RT algorithms are also compared.

A. K-Means for Clustering
Clustering is one of the basic data processing tasks. It is widely used in customer segmentation and fraud detection. A clustering application involves three tasks [6]:
1. Partitioning the data set into clusters,
2. Verifying the clustering results,
3. Interpreting the clusters.
The objective of clustering models is that the elements within a cluster closely resemble each other while differing markedly from the elements of other clusters. The records in the database are divided into these distinct clusters. In the K-means algorithm, the value of K may or may not be fixed in advance by the problem. A clustering criterion, such as the squared error criterion, is required. The algorithm starts by randomly selecting one object to represent each cluster. Each remaining object is assigned to a cluster, and the clustering criterion is used to compute the average of each cluster. These averages are taken as the new cluster centers, and each object is reassigned to the cluster it most resembles.
The cluster averages are recomputed and the cycle continues until the clusters no longer change or the change falls below the desired error level [4].

B. Decision Trees for Classification
Decision trees are data mining approaches that are frequently used in classification and estimation. Although other methodologies such as neural networks can also be used for classification, decision trees give decision makers the advantage of being easy to interpret and understand [5]. Decision trees have low cost, are easy to understand, interpret and integrate with databases, and have good reliability. For these reasons, they are one of the most widely used classification techniques.

Classifying data with a decision tree is a two-step process consisting of learning and classification. In the learning step, a known training data set is analyzed by a classification algorithm in order to build a model. The learned model is expressed as classification rules or as a decision tree. In the classification step, test data are used to determine the accuracy of the classification rules or of the decision tree. If the accuracy is at an acceptable rate, the rules are used to classify new data.

The order in which the areas (attributes) of the training data are used to form the decision tree must be determined. The most widely used measure for this purpose is entropy. The higher the entropy of an area, the more uncertain and unstable the results obtained by using that area. Therefore, the area with the lowest entropy is used at the root of the decision tree [12]. Let area A have k different values {a_1, a_2, ..., a_k}.
The formula for the entropy measure of area A is [12]:

$$E(C \mid A) = \sum_{j=1}^{M_k} p(a_{k,j}) \times \left[ -\sum_{i=1}^{N} p(c_i \mid a_{k,j}) \log_2 p(c_i \mid a_{k,j}) \right] \tag{1}$$

Where:
E(C | A) = entropy measure of the classification characteristic of area A,
p(a_{k,j}) = probability that area a_k takes the value j,
p(c_i | a_{k,j}) = probability that the class value is c_i when area a_k takes the value j,
M_k = number of values contained in area a_k; j = 1, 2, ..., M_k,
N = number of different classes; i = 1, 2, ..., N,
K = number of areas; k = 1, 2, ..., K.

If the elements of a set S fall into the classes C_1, C_2, ..., C_N, the information required to determine the class of an element of S is computed with the formula:

$$I(S) = -\left( p_1 \log_2 p_1 + p_2 \log_2 p_2 + \dots + p_N \log_2 p_N \right) \tag{2}$$

In this formula, p_i is the probability that a random sample belongs to class C_i and is expressed as |S_i| / |S|, where |S_i| is the number of samples of S that are in class C_i. The expected information based on entropy, for the separation of S into subsets according to A, can be expressed as follows:

$$E(A) = \sum_{i=1}^{n} \frac{|S_i|}{|S|} \, I(S_i) \tag{3}$$

In Eq. (3), S_1, ..., S_n denote the subsets of S produced by the n values of area A. In the branching made on area A, the information gain is then computed with the formula:

$$\mathrm{Gain}(A) = I(S) - E(A) \tag{4}$$

In other words, Gain(A) is the decrease in entropy that results from knowing the value of area A.

C. C4.5 and C5.0 Algorithms
The most widely used decision tree algorithm is C4.5, an improved version of the ID3 algorithm that Quinlan proposed in 1986. C5.0 is in turn an improved version of C4.5 and is used especially for large data sets. C5.0 uses boosting to increase accuracy, which is why its trees are also known as boosting trees. C5.0 is faster than C4.5 and uses memory more efficiently [14]. Even where the two algorithms give the same results, C5.0 tends to produce smoother decision trees.

D. CART Algorithm
CART is a continuation of the AID (Automatic Interaction Detection) decision tree of Morgan and Sonquist and was proposed by Breiman and others. The CART algorithm accepts both numerical and nominal data types as input and predicted variables, and can be used for both classification and regression problems. A CART decision tree has a binary structure in which each node is split in two. CART uses the Gini index as its branching criterion and, with no stopping rule during construction, keeps splitting and growing. Once no new branching can be made, pruning starts from the tips toward the root. After each pruning step, the resulting tree is assessed on independently selected test data in order to identify the most successful decision tree [15].

IV. APPLICATION
A. Data
Within the scope of the study, data containing the customer numbers and the credit repayment status of the credit customers of the branch where the application was made were obtained from the relevant unit operating within the structure of the General Directorate. Using the customer numbers and the existing system of the bank, information was retrieved on gender, marital status, age, monthly income, spouse's income, education, house and car ownership, having children, whether the customer receives his or her salary from the bank, and type of work. To observe confidentiality, the customer numbers were changed. The data used in the study were converted to categorical form. After missing and erroneous data were deleted, the remaining data were entered into Microsoft Excel and preliminary processing was performed. A 200 x 12 matrix covering a total of 200 customers was formed. The data used in the application are based on six months of legal follow-up and normal payment records belonging to individuals at the branch.

B. Application for Clustering
The most critical issue in clustering analysis is deciding the number of clusters. The researcher must minimize subjectivity in this decision. However, the articles published to date offer no definitive result on this subject. The best known of the initially proposed approaches is:

$$k = \sqrt{n/2} \tag{5}$$

where k is the number of clusters and n is the number of units. It is recommended for research based on small samples. With large samples, it becomes difficult to reach sound results [3].
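The k-means procedure of Section III.A and the rule of thumb in Eq. (5) can be illustrated with a short sketch. This is not the SPSS Clementine implementation used in the study; it is a minimal plain-Python version assuming Euclidean distance and the sum of squared errors as the clustering criterion. The function names, the `seed` parameter and the toy points are hypothetical and are not taken from the paper's data.

```python
import math
import random


def rule_of_thumb_k(n):
    """Eq. (5): k = sqrt(n / 2), rounded to the nearest integer."""
    return round(math.sqrt(n / 2))


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means. Returns (centers, labels, sum of squared errors)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # one randomly chosen object per cluster
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assign every object to the cluster whose center it resembles most.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # no change in the clusters: stop
            break
        labels = new_labels
        # Recompute each cluster average and use it as the new center.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse


# Hypothetical toy data: nine 2-D points (not the bank's customer records).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.1),
          (10.0, 0.0), (10.1, 0.2), (9.8, 0.1)]

print("rule-of-thumb k for n = 200:", rule_of_thumb_k(200))
for k in range(2, 5):
    _, _, sse = kmeans(points, k)
    print("k =", k, "SSE =", round(sse, 3))
```

Running the loop over a range of k values and comparing the resulting sums of squared errors mirrors the kind of comparison reported for this application; the result depends on the random initial objects, so a different `seed` can give a different partition.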
Two different methods are used in the application to determine the number of clusters. First, the number of clusters is found to be 10 from $k = (200/2)^{1/2}$, where 200 is the number of customers. Second, the number of clusters is increased by one from k = 2 to k = 10 and the sum of squared errors is determined for each value. The sum of squared errors for k = 10, the value given by the formula above, is compared with those for the other cluster numbers, and the value with the smallest sum of squared errors is accepted as the number of clusters. Table I gives the sum of squared errors for each number of clusters. The number of clusters with the smallest sum of squared errors is 3.

TABLE I. Number of Clusters for K-means and Sum of Squared Errors

The program determined that every variable had an effect on the three clusters, so the effects of all the variables were examined and interpreted. With respect to payment status, this variable is important for all three clusters. As Figure I shows, all customers in the first cluster had problems repaying their credit and were subjected to legal follow-up. The second and third clusters consisted of persons who had problems repaying their credit at rates of 63.64% and 98.48%, respectively.

Figure I. Effect of Payment Status Variable on Clusters

When the customer profiles are assessed in light of these data, the first cluster consists mostly of retired male customers aged 45-51 who own neither a house nor a car, have a monthly income in the range of  TL, receive their salaries from different banks and are primary school graduates. All customers in this cluster delayed their credit payments and entered legal follow-up. The second cluster consists generally of single customers in the age interval of  years, employed in the public and private sectors. Unlike the other clusters, it contains more women than men. Also noteworthy is that  % of the customers in this cluster do not own a house and do not have normal payment status. The third cluster consists mainly of public employees and retired male customers with a monthly income in the range of  TL who receive their salaries from the bank from which they took credit. They are in the age interval of  years and own a house and a car. A significant portion of the customers in this cluster have a spouse with income, and  % of them make their payments in an orderly manner.

C. Application for Classification
The application used a data set of 11 independent variables and 1 dependent variable belonging to 200 individual credit customers. Of the data, 60% were allotted to the training set and the remaining 40% to the test set. The dependent variable, payment status, comprised 100 legal follow-up and 100 normal payment records. This 50% ratio was preserved in both the training and test sets. The model built with SPSS Clementine is shown in Figure II.

Figure II. Model Built with SPSS Clementine

1) Results of C5.0 Algorithm: When customers are classified by payment status with the C5.0 algorithm, the first branching in the decision tree is on education. In other words, education is the variable with the greatest effect on payment status. Of the university graduates,  % paid their credits without delay and 9.52% entered legal follow-up. For customers with primary school and high school education, the decision tree continued to branch on whether the customer receives his or her salary from the bank. In this group, all customers who receive their salary from a different bank entered legal follow-up. Under the existing system of the bank, credit installments are collected regularly from customers' salary accounts; if an installment is delayed, the account is blocked and the amount is collected. Consequently, such a customer is not subjected to follow-up unless an exceptional condition arises. The tree structure of the C5.0 algorithm is shown in Figure III.

Figure III. Structure of decision tree belonging to the C5.0 Algorithm

2) Rate of accuracy for the C5.0 Algorithm: The accuracy of the algorithm is 96.67% for the training set and  % for the test set. As Figure IV shows, 4 records in the training set and 9 records in the test set were incorrectly classified. Since the accuracy is high for both sets, the model can be considered successful.

Figure IV. Rate of accuracy for the C5.0 Algorithm for training and test sets

3) Results of C&RT Algorithm: Branching in the C&RT algorithm started with the monthly income variable. Customers with a monthly income of 750 TL or less and customers with a monthly income of  TL were put into the same group, and all of the customers in this group who did not have equivalent salaries were in legal follow-up. For the customers placed in the other group by monthly income, age was the second effective variable. The tree structure of the C&RT algorithm is shown in Figure V.

Figure V. Structure of decision tree of the C&RT Algorithm

4) Rate of accuracy belonging to the C&RT Algorithm: The accuracy of the algorithm is  % for the training set and  % for the test set. As Figure VI shows, 4 records were incorrectly classified in the training set and 11 in the test set. Since the accuracy is high for both sets, the model can be considered successful.

Figure VI. Rate of Accuracy for the C&RT Algorithm for training and test sets

5) Gains for the C&RT Algorithm: The gains provided by the C&RT algorithm are shown in Table II.

TABLE II. Gains for the C&RT Algorithm

When the C&RT algorithm is examined in terms of the gains it provides, the following conclusions can be drawn.
Considering the nodes in this table for the payment status of legal follow-up, the 16th node gives:
Node = 2 (of the 120 records in total, 2 are in the 16th node),
Node (%) = 1.67 (2/120),
Gain = 2 (of the 120 records, 60 belong to customers in legal follow-up; the 16th node contains 2 records and both belong to customers in legal follow-up),
Gain (%) = 3.33 (legal follow-up records in the node / all legal follow-up records = 2/60),
Response (%) = 100 (legal follow-up records in the node / records arriving at the node = 2/2),
Index (%) = 200 (Response (%) / ratio of legal follow-up records to all records = 100/(60/120)).

6) Binary Classifier Results: As Table III shows, the highest test set accuracy is achieved by the Artificial Neural Networks classifier, at 93.75%. Logistic Regression ranks second with  %, and C5.0 ranks third with  %, followed by the C&RT algorithm.

TABLE III. Binary Classifier Results

The clustering analysis set the number of clusters at three. In the first cluster, all customers delayed their credit payments and entered legal follow-up. If customers in this cluster request credit again, their applications can be assessed negatively, or they can be asked to provide a mortgage or a guarantee to reduce the risk involved. In the second cluster,  % of the customers do not own a house and do not have normal payment status. Here, customers who did not delay their installments can be encouraged to use housing loans. In the third cluster, spouses earn income and  % of the customers make their payments in an orderly manner.
Customers in this cluster can be treated as special customers and offered cross-sales of internet banking, investment accounts with no account operation expense, foreign exchange accounts, credit cards, Rapid Transit System (HGS) devices, and insurance products covering events such as accident, earthquake and fire. In addition, special interest reduction policies can be applied if they request credit again. In this way, the loyalty of these customers to the bank can be maintained. According to the classification results, the C5.0 algorithm forms a tree with multiple branches from each node, whereas the C&RT algorithm generates rules through binary splits. The most important variable in forming the decision tree was education for the C5.0 algorithm and monthly income for the C&RT algorithm. In terms of accuracy, the two models have the same percentage on the training set. On the test set, C5.0 produces better results than the C&RT algorithm, with a difference of 2.5%. The decision tree of the C5.0 algorithm can therefore be used to estimate the repayment performance of potential customers and to reduce the rate of entry into legal follow-up. Increasing the number of levels of the decision tree may make it possible to reach higher estimation success.

V. COMPARISON / CONCLUSIONS
For banks to gain competitive advantage in the sector and remain in operation over the long term, they must understand their customers correctly and separate risky customers from the others. This study aimed to analyze existing personal loan customers and estimate the repayment performance of potential customers.
First, existing customers that closely resemble each other according to predetermined criteria were grouped together using the k-means method, which is one of the clustering techniques. Rules for potential future credit customers were then formed using the decision trees method, which is one of the classification models in data mining. In light of the data, the target is to reduce the rate of entry into legal follow-up.

REFERENCES
[1] Albayrak A. S., Yılmaz Ş. K., Data mining: Decision tree algorithms and an application on data of IMKB, Süleyman Demirel University The Journal of Faculty of Economics and Administrative Sciences, vol. 14, pp.
[2] Aşan Z., Examining the socioeconomic characteristics of customers using credit cards, with clustering analysis, Dumlupınar University The Journal of Social Sciences, vol. 17, pp.
[3] Atbaş A. C., A study on determining the cluster number in clustering analysis, Master Thesis, Ankara University, Graduate School of Natural Sciences.
[4] Bilen H., Data mining application for personnel selection and performance evaluation in banking sector, Master Thesis, Gazi University, Graduate School of Natural and Applied Sciences.
[5] Chien C.-F., Chen L.-F., Data mining to improve personnel selection and enhance human capital: A case study in high-technology industry, Expert Systems with Applications, vol. 34, pp.
[6] Ching W. K., Pong M. K., Advances in data mining and modeling, 1st ed., World Scientific, Hong Kong, China.
[7] Doğan B., Clustering analysis as a tool under the supervision of banks: An application for Turkish banking sector, PhD Thesis, Kadir Has University, Graduate School of Social Sciences, 2008.
[8] Dolgun M. Ö., Özdemir T. G., Oğuz D., Analysis of unstructured data in data mining: Text and web mining, Journal of Statisticians, vol. 2, pp.
[9] Emel G. G., Taşkın Ç., Decision trees in data mining and a sales analysis application, Eskişehir Osmangazi University Journal of Social Sciences, vol. 6, pp.
[10] Fu S.-Y. K., Anderson D., Courtney M., Hu W., The relationship between culture, attitude, social networks and quality of life in midlife Australian and Taiwanese citizens, Maturitas, vol. 58, pp.
[11] Hsia T.-C., Shie A.-J., Chen L.-C., Course planning of extension education to meet market demand by using data mining techniques: an example of Chinkuo technology university in Taiwan, Expert Systems with Applications, vol. 34, pp.
[12] Özekes S., Çamurcu A. Y., A classification and prediction application in data mining, Marmara University Journal of Science, vol. 18, pp. 1-17.
[13] Questier F., Put R., Coomans D., Walczak B., Heyden Y. V., The use of CART and multivariate regression trees for supervised and unsupervised feature selection, Chemometrics and Intelligent Laboratory Systems, vol. 76, pp.
[14] Sancak S., Comparison of techniques belonging to intrusion detection systems, Master Thesis, Gebze Institute of Technology, Graduate School of Engineering and Sciences, 2008.
[15] Sezer E. A., Bozkır A. S., Yağız S., Gökçeoğlu C., The effect of the depth of decision tree to the prediction capacity in C&RT algorithm: an application on the progress rate of a tunnel boring machine, Symposium of Innovations and Applications in Intelligent Systems, Kayseri, Turkey, June 2010.