A Study of Detecting Credit Card Delinquencies with Data Mining using Decision Tree Model

Mrs. Arpana Bharani - Shri Cloth Market Kanya Vanijya Mahavidyalaya, Indore
Mrs. Mohini Rao - Shri Cloth Market Kanya Vanijya Mahavidyalaya, Indore

ABSTRACT

Consumer credit is a necessary process, but lending bears credit risk. Credit risk is the economic loss that arises when a borrower fails to repay according to the terms of his or her contract with the lender. Managing credit risk means estimating the ability of borrowers to repay their debts. Using quantitative models, researchers have identified factors that contribute to credit risk, but the role of data mining techniques in identifying credit risk cannot be ignored. Very little research shows the use of data mining techniques in this context, and such studies could be very helpful for practitioners and academicians; this study fills that gap. Data mining is a well-defined procedure that takes data as input and produces output in the form of models or patterns. In other words, the task of data mining is to analyze a massive amount of data and to extract usable information that can be interpreted for future use. Online frauds have also increased, and data mining is currently a popular way to combat fraud because of its effectiveness.

Keywords: Decision Tree, Entropy, Gini, Hunt's Algorithm, Online Frauds, Tracing Email, Tracing IP.

1. Introduction

Decision tree learning is a common method used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables, with edges to children for each possible value of that variable. Each leaf represents a value of the target variable given the values of the input variables along the path from the root to that leaf. A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion terminates when all records in a node's subset have the same value of the target variable, or when splitting no longer adds value to the predictions.

This paper is organized as follows. A brief description of related work is given in Section 2. The decision tree model and Hunt's algorithm as applied to fraud detection are presented in Section 3. Split criteria are described with an example in Section 4. Model testing and evaluation are discussed in Section 5, and Section 6 concludes the paper.

2. Literature Review

Credit card fraud detection has drawn a lot of research, and a special emphasis on data mining has been suggested. Brause et al. [1] developed an approach that combines advanced data mining techniques and neural network algorithms to obtain high fraud coverage. Stolfo et al. [2] suggest a credit card fraud detection system (FDS) that uses meta-learning techniques to learn models of fraudulent credit card transactions. Meta-learning is a general strategy that provides a means for combining and integrating a number of separately built classifiers or models. They consider naïve Bayes, C4.5, and back-propagation neural networks as the base classifiers, and a meta-classifier determines which classifier should be used based on the skewness of the data. Phua et al. [3] carried out an extensive survey of existing data-mining-based FDSs and published a comprehensive report. Prodromidis and Stolfo [4] use an agent-based approach with distributed learning for detecting fraud in credit card transactions; it is based on artificial intelligence and combines inductive learning algorithms and meta-learning methods to achieve higher accuracy. The following sections describe the classification technique that can be used in fraud detection.

3. Decision Tree

In this model we identify fraudulent customers and merchants by tracing fake mail and IP addresses. A customer or merchant is suspicious if the mail is fake; the sender is then traced through the IP address, which can reveal the customer's location and other details. The decision tree is one of the most powerful techniques in data mining and is a vital part of credit card fraud detection.
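As a rough illustration of this rule (a sketch only, not the paper's actual implementation), the Python snippet below flags transactions whose mail is marked as fake and groups the flagged senders by IP address so they can be examined further. The field names cname, mail_type and ip_address, and the "Fake" marker, are assumptions made for illustration.

from collections import defaultdict

def suspicious_by_ip(transactions):
    """Group senders of fake mails by IP address for further tracing.
    Each transaction is assumed to be a dict with hypothetical fields
    'cname', 'mail_type' and 'ip_address'."""
    suspects = defaultdict(set)
    for t in transactions:
        if t["mail_type"] == "Fake":        # suspicious if the mail is fake
            suspects[t["ip_address"]].add(t["cname"])
    return dict(suspects)

transactions = [
    {"cname": "Nitin", "mail_type": "Fake",    "ip_address": "61.16.173.243"},
    {"cname": "Akash", "mail_type": "Genuine", "ip_address": "61.16.173.243"},
]
print(suspicious_by_ip(transactions))        # {'61.16.173.243': {'Nitin'}}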

3.1 Decision Tree Algorithm

The decision tree algorithm is a data mining induction technique that recursively partitions a data set of records using a depth-first greedy approach (Hunt et al., 1966) or a breadth-first approach (Shafer et al., 1996) until all the data items belong to a particular class. A decision tree structure is made of root, internal, and leaf nodes, and this structure is used to classify unknown data records. At each internal node of the tree, the best split is chosen using impurity measures (Quinlan, 1993). The tree leaves hold the class labels to which the data items have been assigned. The fundamental algorithm for building a decision tree is Hunt's algorithm, described below.

3.2 Hunt's Algorithm

Step 1: Let Xt be the set of training records at node t.
Step 2: Let Y = {y1, y2, ..., yn} be the class labels.
Step 3: If all records in Xt belong to the same class yt, then t is a leaf node labeled yt.
Step 4: If Xt contains records that belong to more than one class, select an attribute test condition to partition the records into smaller subsets, and create a child node for each outcome of the test.

For example, let X contain the five attributes x1 = TID, x2 = CName, x3 = MailType, x4 = IPAddress, x5 = TransAmt, and let Y be the class label TransType, whose values are either Legal or Fraud. One class contains the records y3, y4, y5, y7, y9, y10, y11, y12 and the other contains y1, y2, y6, y8. All records for Akash belong to the same class, so Akash becomes a leaf (terminal) node, while Nitin and Vikas have records of more than one class; for them we apply the attribute test conditions, partition the records into smaller subsets, and create a child node for each outcome of the test.

4. Split Criteria

The best split is defined as the one that does the best job of separating the data into groups in which a single class predominates. Purity is the measure used to evaluate a potential split; the best split is the one that increases the purity of the subsets by the greatest amount. A good split also creates nodes of similar size, or at least does not create very small nodes. Tests for choosing the best split:
- Entropy (information gain)
- Information gain ratio
- Gini index
How Hunt's recursive partitioning and such a split test fit together is sketched below.
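The following is a minimal sketch of Hunt's recursive partitioning, assuming records are Python dicts keyed by attribute name. The helper pick_split_attribute is a hypothetical placeholder that simply takes the first remaining attribute; Sections 4.1 and 4.2 describe how the split would really be chosen with entropy or the Gini index.

from collections import Counter

def pick_split_attribute(records, labels, attributes):
    # Placeholder: a real implementation would score each attribute with
    # information gain or the Gini index (Sections 4.1 and 4.2) and return
    # the best one. Here we simply take the first remaining attribute.
    return attributes[0]

def hunts(records, labels, attributes):
    """Recursive partitioning in the spirit of Hunt's algorithm.
    records: list of dicts; labels: parallel list of class labels;
    attributes: attribute names still available for splitting."""
    if len(set(labels)) == 1:               # Step 3: pure node -> leaf
        return labels[0]
    if not attributes:                      # no tests left -> majority class
        return Counter(labels).most_common(1)[0][0]
    attr = pick_split_attribute(records, labels, attributes)   # Step 4
    partitions = {}
    for rec, lab in zip(records, labels):
        partitions.setdefault(rec[attr], ([], []))
        partitions[rec[attr]][0].append(rec)
        partitions[rec[attr]][1].append(lab)
    rest = [a for a in attributes if a != attr]
    # One child node per outcome of the attribute test.
    return {attr: {value: hunts(recs, labs, rest)
                   for value, (recs, labs) in partitions.items()}}

Calling hunts(records, labels, ["CName", "MailType", "IPAddress", "TransAmt"]) on a list of transaction records would return a nested dictionary representing the learned tree.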

4.1 Entropy (Information Gain)

Entropy is used to select the attribute to test at each node by choosing the most useful attribute for classifying the examples. It measures how well a given attribute separates the training examples according to the target classification, and this measure is used to select among the candidate attributes at each step while growing the tree.

Table 1: Credit Card Transaction Data.

TID   CName   Mail Type   IP Address      Trans Amt   Trans Type
T1    Nitin   Customer    61.16.173.243
T2    Nitin   Customer    61.16.173.243
T3    Akash   Customer    61.16.173.243
T4    Vikas   Customer    61.16.173.243
T5    Vikas   Merchant
T6    Vikas   Merchant
T7    Akash   Merchant
T8    Nitin   Merchant
T9    Nitin   Merchant
T10   Vikas   Customer
T11   Nitin   Customer
T12   Akash   Customer
T13   Nitin   Customer
T14   Nitin   Customer

The general form for calculating the information is

Entropy(S) = -Σ Pi log2(Pi)

where Pi is the probability that a record belongs to class i.

1. S is a sample of training examples.
2. Pi is the proportion of examples in S belonging to class i (positive or negative).
3. The logarithm is taken to base 2, so the result is measured in bits.

Entropy(S) is the average amount of information needed to identify the class label of a record in S. For the credit card transaction data given in Table 1, there are nine instances with one value of TransType and five instances with the other, so the entropy of the sample is:

E(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits.

For example, the target class is TransType, which can be Legal or Fraud, and the attributes in the collection are TID, CName, TransAmt, IPAddress, MailType, and TransType. Detailed calculation of the information gain for the root split:

Entropy(TransType) = -[(8/12) log2(8/12) + (4/12) log2(4/12)] = 0.9185

CName   Entropy
Nitin   0.9709
Akash   0
Vikas   0.8112
Gain(CName) = 0.2486

Applying the same process to the remaining attributes, we get:

Attribute   Gain
Email       0.01086
IP          0.1263
TransAmt    0.0568
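The arithmetic above can be checked with a short Python sketch. The helper functions below compute entropy, the Gini index (used in Section 4.2), and a generic information gain; the 9/5 split of Legal and Fraud labels is an assumption, since the source table does not state which class has nine records.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels (see Section 4.2)."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(records, attribute, labels):
    """Parent entropy minus the weighted entropy of the children
    obtained by splitting on `attribute` (records are dicts)."""
    total = len(labels)
    subsets = {}
    for record, label in zip(records, labels):
        subsets.setdefault(record[attribute], []).append(label)
    weighted = sum(len(sub) / total * entropy(sub) for sub in subsets.values())
    return entropy(labels) - weighted

# Overall class distribution from Table 1: nine records of one class, five of the other.
labels_14 = ["Legal"] * 9 + ["Fraud"] * 5   # assumed labelling
print(entropy(labels_14))                   # ≈ 0.940 bits, matching E(S) above
print(gini(labels_14))                      # ≈ 0.459, matching Section 4.2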

Comparing the information gain of the four attributes, we see that CName has the highest value, so CName becomes the root node of the decision tree. Applying the same process to the left branch of the root node (Nitin), we get:

Entropy(Nitin) = 0.9309
Gain(Nitin, Email) = 0.0233
Gain(Nitin, IP) = 0.1387
Gain(Nitin, TransAmt) = 0.0972

The information gain of IP is highest, so IP becomes the decision node for this branch.

The centre branch of the root node (Akash) is a special case: Entropy(Akash) = 0, since all records for Akash belong to exactly one target class. We can therefore skip the calculations and attach the corresponding target classification value directly to the tree.

Applying the same process to the right branch of the root node (Vikas), we get:

Entropy(Vikas) = 0.7987
Gain(Vikas, MailType) = 0.1089
Gain(Vikas, IP) = 0.0065
Gain(Vikas, TransAmt) = 0.0636

The information gain of MailType is highest, so MailType becomes the decision node for this branch. With IP and MailType as decision nodes, the tree can no longer be split on the remaining attributes, because every branch has reached a target classification class; this gives the final decision tree.
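A comparable tree can be grown with scikit-learn, which supports both split criteria discussed here. The sketch below is illustrative only: the toy records and their encoding are assumptions, not the actual values of Table 1, since those values are not fully reproduced above.

from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical transactions: [CName, MailType, IPAddress, TransAmt band]
X_raw = [
    ["Nitin", "Fake",    "61.16.173.243", "High"],
    ["Nitin", "Genuine", "61.16.173.243", "Low"],
    ["Akash", "Genuine", "10.0.0.7",      "Low"],
    ["Vikas", "Fake",    "10.0.0.7",      "High"],
    ["Vikas", "Genuine", "61.16.173.243", "Low"],
    ["Akash", "Genuine", "10.0.0.7",      "Low"],
]
y = ["Fraud", "Legal", "Legal", "Fraud", "Legal", "Legal"]

# Encode the categorical attributes as numbers for scikit-learn.
X = OrdinalEncoder().fit_transform(X_raw)

# criterion="entropy" grows the tree by information gain (Section 4.1);
# criterion="gini" would use the Gini index of Section 4.2 instead.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

print(export_text(clf, feature_names=["CName", "MailType", "IPAddress", "TransAmt"]))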

4.2 Gini Index

The Gini index is used in CART. It measures the impurity of T, a data partition or set of training records:

Gini(t) = 1 - Σ Pi^2

where Pi is the probability that a record in t belongs to class Ci, estimated as |Ci,t| / |t|. Let t be the training data in which nine records have one value of TransType and the remaining five have the other. Then:

Gini(TransType) = 1 - (9/14)^2 - (5/14)^2 = 0.459

5. TESTING AND MODEL EVALUATION

Two alternative models can be built, each based on a different method, and tested against the training set. The decision tree model is prepared using a splitting algorithm, while a neural network and a belief network can be trained using the whole sample as training data.

6. CONCLUSION

Fraud detection techniques are very important in today's world of computers. The fast growth of the internet, together with large financial openings in electronic business and the lack of truly secure systems, creates more opportunities for criminals to attack the system. In this paper we study a credit card fraud detection model using an effective algorithm for decision tree learning, with the focus on information-gain-based decision tree learning. The best split according to the purity measures Gini, entropy, and information gain ratio is estimated to find the best classifier attribute. With this technique we identify the fraudulent customer or merchant by tracing fake mail and IP addresses: a customer or merchant is suspicious if the mail is fake, and all information about the owner or sender is then traced through the IP address, which can reveal the customer's location and other details. The decision tree is one of the most powerful techniques in data mining and a vital part of credit card fraud detection.

7. REFERENCES

1. R. Brause, T. Langsdorf, and M. Hepp, "Neural Data Mining for Credit Card Fraud Detection".
2. S. J. Stolfo, D. W. Fan, W. Lee, A. L. Prodromidis, and P. K. Chan, "Credit Card Fraud Detection Using Meta-Learning: Issues and Initial Results", Proc. AAAI Workshop on AI Methods in Fraud and Risk Management, pp. 83-90, 1997.
3. Clifton Phua, Vincent Lee, Kate Smith, and Ross Gayler, "A Comprehensive Survey of Data Mining-based Fraud Detection Research", final version 2, 9/02/2005.
4. Philip K. Chan, Wei Fan, Andreas L. Prodromidis, and Salvatore J. Stolfo, "Distributed Data Mining in Credit Card Fraud Detection", IEEE Intelligent Systems, November/December 1999.
5. Chun Wei Clifton Phua, "Investigative Data Mining in Fraud Detection", thesis, November 2003.

8. WEBLIOGRAPHY

1. www.ijcsi.org
2. http://support.sas.com/resources/papers/proceedings14/1581-2014.pdf
3. www.ijdmta.com