A Study of Detecting Credit Card Delinquencies with Data Mining using Decision Tree Model

Mrs. Arpana Bharani - Shri Cloth Market Kanya Vanijya Mahavidyalaya, Indore
Mrs. Mohini Rao - Shri Cloth Market Kanya Vanijya Mahavidyalaya, Indore

ABSTRACT

Consumer credit is a necessary process, but lending bears credit risk. Credit risk is the economic loss that arises when a borrower fails to repay according to the terms of his or her contract with the lender. Managing credit risk means estimating the ability of borrowers to repay their debts. Using quantitative models, researchers have identified factors that contribute to credit risk, but the role of data mining techniques in identifying credit risk cannot be ignored. Very little research shows the use of data mining techniques in this context, and such studies could be very helpful for practitioners and academicians; this study fills that gap. Data mining is a well-defined procedure that takes data as input and produces output in the form of models or patterns. In other words, the task of data mining is to analyze a massive amount of data and to extract usable information that can be interpreted for future use. Online frauds have also increased, and data mining is currently a popular way to combat fraud because of its effectiveness.

Keywords: Decision Tree, Entropy, Gini, Hunt's Algorithm, Online Frauds, Tracing Email, Tracing IP.

1. Introduction

Decision tree learning is a common method used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. Each interior node corresponds to one of the input variables, with edges to children for each possible value of that variable. Each leaf represents a value of the target variable given the values of the input variables along the path from the root to that leaf. A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion terminates when all records in a node's subset have the same value of the target variable, or when splitting no longer adds value to the predictions.

This paper is organized as follows. A brief description of related work is given in Section 2. The decision tree model and Hunt's algorithm as applied to fraud detection are presented in Section 3. Split criteria are described with an example in Section 4. Model testing and evaluation are discussed in Section 5, and Section 6 concludes the paper.

2. Literature Review

Credit card fraud detection has drawn a lot of research, and a special emphasis on data mining has been suggested. Brause et al. [1] developed an approach that combines advanced data mining techniques and neural network algorithms to obtain high fraud coverage. Stolfo et al. [2] suggest a credit card fraud detection system (FDS) that uses meta-learning techniques to learn models of fraudulent credit card transactions. Meta-learning is a general strategy that provides a means for combining and integrating a number of separately built classifiers or models. They consider naïve Bayes, C4.5, and back-propagation neural networks as the base classifiers, and a meta-classifier determines which classifier should be used based on the skewness of the data. Phua et al. [3] carried out an extensive survey of existing data-mining-based FDSs and published a comprehensive report. Prodromidis and Stolfo [4] use an agent-based approach with distributed learning for detecting fraud in credit card transactions; it is based on artificial intelligence and combines inductive learning algorithms and meta-learning methods to achieve higher accuracy. The following sections describe the classification technique that can be used in fraud detection.

3. Decision Tree

In this model we identify fraudulent customers and merchants by tracing fake mail and IP addresses. A customer or merchant is suspicious if the mail is fake; the sender is then traced through the IP address, which can reveal the customer's location and other details. The decision tree is one of the most powerful techniques in data mining and is a vital part of credit card fraud detection.
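As a rough illustration of this rule (a sketch only, not the paper's actual implementation), the Python snippet below flags transactions whose mail is marked as fake and groups the flagged senders by IP address so they can be examined further. The field names cname, mail_type and ip_address, and the "Fake" marker, are assumptions made for illustration.

from collections import defaultdict

def suspicious_by_ip(transactions):
    """Group senders of fake mails by IP address for further tracing.
    Each transaction is assumed to be a dict with hypothetical fields
    'cname', 'mail_type' and 'ip_address'."""
    suspects = defaultdict(set)
    for t in transactions:
        if t["mail_type"] == "Fake":        # suspicious if the mail is fake
            suspects[t["ip_address"]].add(t["cname"])
    return dict(suspects)

transactions = [
    {"cname": "Nitin", "mail_type": "Fake",    "ip_address": "61.16.173.243"},
    {"cname": "Akash", "mail_type": "Genuine", "ip_address": "61.16.173.243"},
]
print(suspicious_by_ip(transactions))        # {'61.16.173.243': {'Nitin'}}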

3.1 Decision Tree Algorithm

The decision tree algorithm is a data mining induction technique that recursively partitions a data set of records using a depth-first greedy approach (Hunt et al., 1966) or a breadth-first approach (Shafer et al., 1996) until all the data items belong to a particular class. A decision tree structure is made of root, internal, and leaf nodes, and this structure is used to classify unknown data records. At each internal node of the tree, the best split is chosen using impurity measures (Quinlan, 1993). The tree leaves hold the class labels to which the data items have been assigned. The fundamental algorithm for building a decision tree is Hunt's algorithm, described below.

3.2 Hunt's Algorithm

Step 1: Let Xt be the set of training records at node t.
Step 2: Let Y = {y1, y2, ..., yn} be the class labels.
Step 3: If all records in Xt belong to the same class yt, then t is a leaf node labeled yt.
Step 4: If Xt contains records that belong to more than one class, select an attribute test condition to partition the records into smaller subsets, and create a child node for each outcome of the test.

For example, let X contain the five attributes x1 = TID, x2 = CName, x3 = MailType, x4 = IPAddress, x5 = TransAmt, and let Y be the class label TransType, whose values are either Legal or Fraud. One class contains the records y3, y4, y5, y7, y9, y10, y11, y12 and the other contains y1, y2, y6, y8. All records for Akash belong to the same class, so Akash becomes a leaf (terminal) node, while Nitin and Vikas have records of more than one class; for them we apply the attribute test conditions, partition the records into smaller subsets, and create a child node for each outcome of the test.

4. Split Criteria

The best split is defined as the one that does the best job of separating the data into groups in which a single class predominates. Purity is the measure used to evaluate a potential split; the best split is the one that increases the purity of the subsets by the greatest amount. A good split also creates nodes of similar size, or at least does not create very small nodes. Tests for choosing the best split:
- Entropy (information gain)
- Information gain ratio
- Gini index
How Hunt's recursive partitioning and such a split test fit together is sketched below.
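The following is a minimal sketch of Hunt's recursive partitioning, assuming records are Python dicts keyed by attribute name. The helper pick_split_attribute is a hypothetical placeholder that simply takes the first remaining attribute; Sections 4.1 and 4.2 describe how the split would really be chosen with entropy or the Gini index.

from collections import Counter

def pick_split_attribute(records, labels, attributes):
    # Placeholder: a real implementation would score each attribute with
    # information gain or the Gini index (Sections 4.1 and 4.2) and return
    # the best one. Here we simply take the first remaining attribute.
    return attributes[0]

def hunts(records, labels, attributes):
    """Recursive partitioning in the spirit of Hunt's algorithm.
    records: list of dicts; labels: parallel list of class labels;
    attributes: attribute names still available for splitting."""
    if len(set(labels)) == 1:               # Step 3: pure node -> leaf
        return labels[0]
    if not attributes:                      # no tests left -> majority class
        return Counter(labels).most_common(1)[0][0]
    attr = pick_split_attribute(records, labels, attributes)   # Step 4
    partitions = {}
    for rec, lab in zip(records, labels):
        partitions.setdefault(rec[attr], ([], []))
        partitions[rec[attr]][0].append(rec)
        partitions[rec[attr]][1].append(lab)
    rest = [a for a in attributes if a != attr]
    # One child node per outcome of the attribute test.
    return {attr: {value: hunts(recs, labs, rest)
                   for value, (recs, labs) in partitions.items()}}

Calling hunts(records, labels, ["CName", "MailType", "IPAddress", "TransAmt"]) on a list of transaction records would return a nested dictionary representing the learned tree.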

4.1 Entropy (Information Gain)

Entropy is used to select the attribute to test at each node by choosing the most useful attribute for classifying the examples. It measures how well a given attribute separates the training examples according to the target classification, and this measure is used to select among the candidate attributes at each step while growing the tree.

Table 1: Credit Card Transaction Data.

TID   CName   Mail Type   IP Address      Trans Amt   Trans Type
T1    Nitin   Customer    61.16.173.243
T2    Nitin   Customer    61.16.173.243
T3    Akash   Customer    61.16.173.243
T4    Vikas   Customer    61.16.173.243
T5    Vikas   Merchant
T6    Vikas   Merchant
T7    Akash   Merchant
T8    Nitin   Merchant
T9    Nitin   Merchant
T10   Vikas   Customer
T11   Nitin   Customer
T12   Akash   Customer
T13   Nitin   Customer
T14   Nitin   Customer

The general form for calculating the information is

Entropy(S) = -Σ Pi log2(Pi)

where Pi is the probability that a record belongs to class i.

1. S is a sample of training examples.
2. Pi is the proportion of examples in S belonging to class i (positive or negative).
3. The logarithm is taken to base 2, so the result is measured in bits.

Entropy(S) is the average amount of information needed to identify the class label of a record in S. For the credit card transaction data given in Table 1, there are nine instances with one value of TransType and five instances with the other, so the entropy of the sample is:

E(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits.

For example, the target class is TransType, which can be Legal or Fraud, and the attributes in the collection are TID, CName, TransAmt, IPAddress, MailType, and TransType. Detailed calculation of the information gain for the root split:

Entropy(TransType) = -[(8/12) log2(8/12) + (4/12) log2(4/12)] = 0.9185

CName   Entropy
Nitin   0.9709
Akash   0
Vikas   0.8112
Gain(CName) = 0.2486

Applying the same process to the remaining attributes, we get:

Attribute   Gain
Email       0.01086
IP          0.1263
TransAmt    0.0568
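The arithmetic above can be checked with a short Python sketch. The helper functions below compute entropy, the Gini index (used in Section 4.2), and a generic information gain; the 9/5 split of Legal and Fraud labels is an assumption, since the source table does not state which class has nine records.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels (see Section 4.2)."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(records, attribute, labels):
    """Parent entropy minus the weighted entropy of the children
    obtained by splitting on `attribute` (records are dicts)."""
    total = len(labels)
    subsets = {}
    for record, label in zip(records, labels):
        subsets.setdefault(record[attribute], []).append(label)
    weighted = sum(len(sub) / total * entropy(sub) for sub in subsets.values())
    return entropy(labels) - weighted

# Overall class distribution from Table 1: nine records of one class, five of the other.
labels_14 = ["Legal"] * 9 + ["Fraud"] * 5   # assumed labelling
print(entropy(labels_14))                   # ≈ 0.940 bits, matching E(S) above
print(gini(labels_14))                      # ≈ 0.459, matching Section 4.2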

Comparing the information gain of the four attributes, we see that CName has the highest value, so CName becomes the root node of the decision tree. Applying the same process to the left branch of the root node (Nitin), we get:

Entropy(Nitin) = 0.9309
Gain(Nitin, Email) = 0.0233
Gain(Nitin, IP) = 0.1387
Gain(Nitin, TransAmt) = 0.0972

The information gain of IP is highest, so IP becomes the decision node for this branch.

The centre branch of the root node (Akash) is a special case: Entropy(Akash) = 0, since all records for Akash belong to exactly one target class. We can therefore skip the calculations and attach the corresponding target classification value directly to the tree.

Applying the same process to the right branch of the root node (Vikas), we get:

Entropy(Vikas) = 0.7987
Gain(Vikas, MailType) = 0.1089
Gain(Vikas, IP) = 0.0065
Gain(Vikas, TransAmt) = 0.0636

The information gain of MailType is highest, so MailType becomes the decision node for this branch. With IP and MailType as decision nodes, the tree can no longer be split on the remaining attributes, because every branch has reached a target classification class; this gives the final decision tree.
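A comparable tree can be grown with scikit-learn, which supports both split criteria discussed here. The sketch below is illustrative only: the toy records and their encoding are assumptions, not the actual values of Table 1, since those values are not fully reproduced above.

from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical transactions: [CName, MailType, IPAddress, TransAmt band]
X_raw = [
    ["Nitin", "Fake",    "61.16.173.243", "High"],
    ["Nitin", "Genuine", "61.16.173.243", "Low"],
    ["Akash", "Genuine", "10.0.0.7",      "Low"],
    ["Vikas", "Fake",    "10.0.0.7",      "High"],
    ["Vikas", "Genuine", "61.16.173.243", "Low"],
    ["Akash", "Genuine", "10.0.0.7",      "Low"],
]
y = ["Fraud", "Legal", "Legal", "Fraud", "Legal", "Legal"]

# Encode the categorical attributes as numbers for scikit-learn.
X = OrdinalEncoder().fit_transform(X_raw)

# criterion="entropy" grows the tree by information gain (Section 4.1);
# criterion="gini" would use the Gini index of Section 4.2 instead.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

print(export_text(clf, feature_names=["CName", "MailType", "IPAddress", "TransAmt"]))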

4.2 Gini Index

The Gini index is used in CART. It measures the impurity of T, a data partition or set of training records:

Gini(t) = 1 - Σ Pi^2

where Pi is the probability that a record in t belongs to class Ci, estimated as |Ci,t| / |t|. Let t be the training data in which nine records have one value of TransType and the remaining five have the other. Then:

Gini(TransType) = 1 - (9/14)^2 - (5/14)^2 = 0.459

5. TESTING AND MODEL EVALUATION

Two alternative models can be built, each based on a different method, and tested against the training set. The decision tree model is prepared using a splitting algorithm, while a neural network and a belief network can be trained using the whole sample as training data.

6. CONCLUSION

Fraud detection techniques are very important in today's world of computers. The fast growth of the internet, together with large financial openings in electronic business and the lack of truly secure systems, creates more opportunities for criminals to attack the system. In this paper we study a credit card fraud detection model using an effective algorithm for decision tree learning, with the focus on information-gain-based decision tree learning. The best split according to the purity measures Gini, entropy, and information gain ratio is estimated to find the best classifier attribute. With this technique we identify the fraudulent customer or merchant by tracing fake mail and IP addresses: a customer or merchant is suspicious if the mail is fake, and all information about the owner or sender is then traced through the IP address, which can reveal the customer's location and other details. The decision tree is one of the most powerful techniques in data mining and a vital part of credit card fraud detection.

7. REFERENCES

1. R. Brause, T. Langsdorf, and M. Hepp, "Neural Data Mining for Credit Card Fraud Detection".
2. S. J. Stolfo, D. W. Fan, W. Lee, A. L. Prodromidis, and P. K. Chan, "Credit Card Fraud Detection Using Meta-Learning: Issues and Initial Results", Proc. AAAI Workshop on AI Methods in Fraud and Risk Management, pp. 83-90, 1997.
3. Clifton Phua, Vincent Lee, Kate Smith, and Ross Gayler, "A Comprehensive Survey of Data Mining-based Fraud Detection Research", final version 2, 9/02/2005.
4. Philip K. Chan, Wei Fan, Andreas L. Prodromidis, and Salvatore J. Stolfo, "Distributed Data Mining in Credit Card Fraud Detection", IEEE Intelligent Systems, November/December 1999.
5. Chun Wei Clifton Phua, "Investigative Data Mining in Fraud Detection", thesis, November 2003.

8. WEBLIOGRAPHY

1. www.ijcsi.org
2. http://support.sas.com/resources/papers/proceedings14/1581-2014.pdf
3. www.ijdmta.com