A COMPARATIVE ASSESSMENT OF SUPERVISED DATA MINING TECHNIQUES FOR FRAUD PREVENTION
Sherly K.K
Department of Information Technology, Toc H Institute of Science & Technology, Ernakulam, Kerala, India. shrly_shilu@yahoo.com

Abstract- With the extensive growth of the internet and the vast financial possibilities opening up, more and more systems are subject to attack by intruders. In a competitive environment, fraud has become a business-critical problem, and it is very important to prevent unauthorized access to system resources and data. To build a completely secure system, behavior analysis is required in addition to the authentication process, prior to completing a transaction. Many companies have interactions with millions of external parties, and it is cost prohibitive to manually check the external parties' identities and activities; the riskiest ones can instead be determined through data mining techniques. This study evaluates three classification methods for solving fraud detection problems with data mining and shows how advanced techniques can be combined successfully to obtain high fraud coverage with maximum confidence and a minimum false alarm rate.

Keywords- Bayesian classifier, Data mining, Decision tree, Fraud detection, Neural network.

I. INTRODUCTION
Due to the extensive growth of e-commerce, fraud detection has become a necessity. The term fraud here refers to the abuse of a profit organization's system without necessarily leading to direct legal consequences. Fraud detection is a continuously evolving discipline, as the tactics used to commit fraud are ever changing. It is in the interest of the company and the card issuer to prevent fraud or, failing this, to detect it as soon as possible; otherwise consumer trust in both the company and the card decreases, and revenue is lost in addition to the direct losses made through fraudulent sales.
Fraud is an adaptive crime, and it is increasing every year, so special methods of intelligent data analysis are needed to detect and prevent it. Classification and prediction are two forms of data analysis that can be used to extract models, and many classification and prediction methods have been proposed by researchers in data mining. This paper discusses three main classification techniques used to prevent fraud. One main objective is to evaluate the use of data mining methods in differentiating fraud and non-fraud observations. This paper is organized as follows. A brief description of the related work is given in section 2. Decision tree model functionalities in fraud detection are given in section 3. Section 4 describes the neural network approach in card fraud detection. The application of the Bayesian classifier in credit card fraud detection is described in section 5. Model testing and evaluation are discussed in section 6, and section 7 concludes the paper.

II. RELATED WORK
Credit card fraud detection has drawn a lot of research attention, with special emphasis on data mining approaches. Ghosh and Reilly [1] have proposed credit card fraud detection with a neural network. They built a detection system which is trained on a large sample of labeled credit card account transactions. These transactions contain example fraud cases due to lost cards, stolen cards, application fraud, counterfeit fraud, mail-order fraud, and non-received issue (NRI) fraud. Aleskerov et al. [2] present CARDWATCH, a database mining system used for credit card fraud detection. The system, based on a neural learning module, provides an interface to a variety of commercial databases. Syeda et al. [3] have used parallel granular neural networks (PGNNs) for improving the speed of the data mining and knowledge discovery process in credit card fraud detection; a complete system has been implemented for this purpose. Fan et al.
[7] suggest the application of distributed data mining in credit card fraud detection. Brause et al. [8] have developed an approach that involves advanced data mining techniques and neural network algorithms to obtain high fraud coverage. Stolfo et al. [9] suggest a credit card fraud detection system (FDS) using metalearning techniques to learn models of fraudulent credit card transactions. Metalearning is a general strategy that provides a means for combining and integrating a number of separately built classifiers or models. They consider naïve Bayesian, C4.5, and back propagation neural networks as the base classifiers; a metaclassifier is used to determine which classifier should be considered based on the skewness of the data. Phua et al. [10] have done an extensive survey of existing data-mining-based FDSs and published a comprehensive report. Prodromidis and Stolfo [11] use an agent-based approach with distributed learning for detecting frauds in credit card transactions; it is based on artificial intelligence and combines inductive learning algorithms and metalearning methods to achieve higher accuracy. The following sections describe three classification techniques used in fraud detection.
III. DECISION TREE CLASSIFIER
Decision trees are powerful and popular tools for classification and prediction. A decision tree can be used to predict a pattern or to classify the class of a data item. It is a decision support tool that uses a tree-like graph where each internal node denotes a test on an attribute, each branch represents an outcome of the test and each leaf node holds a class label. A decision tree produces a sequence of rules (or series of questions) that can be used to recognize the class. Rules can readily be expressed so that humans can understand them, or even used directly in a database access language such as SQL so that records falling into a particular category may be retrieved. Decision tree programs construct a decision tree from a set of training cases. Popular decision tree algorithms include ID3, C4.5 and CART (classification and regression trees). The central focus of the decision tree growing algorithm is selecting which attribute to test at each node in the tree, the goal being to select the attribute that is most useful for classifying examples. A good quantitative measure of the worth of an attribute is a statistical property called information gain, which measures how well a given attribute separates the training examples according to their target classification. This measure is used to select among the candidate attributes at each step while growing the tree. In order to define information gain precisely, we need a measure commonly used in information theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples.

A. Entropy
Entropy is the quantitative measure of disorder in a system. In decision tree construction, entropy is used to determine which node to split next; it is a measure of the degree of impurity. If the target attribute takes on c different values, then the entropy of a set D is defined as

Entropy(D) = − Σ_{i=1}^{c} p_i log2(p_i)    (1)

where p_i is the proportion of tuples in D belonging to the i-th class.
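As a concrete illustration, the entropy measure defined above can be computed in a few lines of Python; the class labels below are hypothetical:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(D) = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

# A pure set has entropy 0; an even fraud/legitimate split has entropy 1.
print(entropy(["legitimate"] * 4))           # 0.0
print(entropy(["fraud", "legitimate"] * 2))  # 1.0
```

When all transactions in a node share one label, the node is pure (entropy 0) and needs no further splitting.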
Four transaction attributes, namely transaction amount, time, merchant and city of purchase/order placing, which are relevant for identifying the user's spending behavior, are considered from the transaction data for the tree construction. Each individual transaction amount usually depends on the corresponding type of item purchased, which also has a great role in identifying the spending nature of the card holder; the type of each purchase is linked to the type of business of the corresponding merchant. The four continuous parameters, transaction amount, transaction time/frequency of transaction, type/quality of item purchased and billing/order placing city, are converted into categorical parameters. The transaction time attribute is categorized by dividing a month into four weeks and each week into two slots, namely weekday (wd) and weekend (we). Thus the transaction date is categorized into eight groups: wd1, we1, wd2, we2, wd3, we3, wd4, we4. The transaction amount is quantized into three levels: Low, Medium and High. The purchased item can be categorized into five groups: Textile items (Ti), Electronic items (EI), Gold (Gl), Medical (MD) and Miscellaneous (Mi). Some merchants sell a variety of items; purchases from such merchants may be treated as miscellaneous for convenience.

TABLE 1: Card Transaction Data

B. Information Gain
Information gain is a measure of the effectiveness of an attribute A in classifying the training data. It is defined as the difference between the original information requirement, based on just the proportion of classes, and the new requirement obtained after partitioning on A; that is, the impurity degree of the parent table minus the weighted summation of the impurity degrees of the subset tables:

Gain(A) = Entropy(D) − Σ_k (n_k / n) × Entropy(S_k)

where S_k is the subset of D for which attribute A takes its k-th value, n_k = |S_k| and n = |D|. The attribute with the maximum information gain is selected as the splitting attribute.

C.
Decision Tree approach in card fraud prediction
Credit card transaction data of different customers from the past few years is used to construct the decision tree. Sample transaction data is shown in table 1.

TABLE 2: Categorized Data
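To make the attribute-selection step concrete, the sketch below (a minimal ID3-style step in Python, over a tiny hypothetical sample using the categorized values described above) computes the information gain of each attribute and picks the best splitting attribute:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(A) = Entropy(D) - sum(n_k/n * Entropy(S_k)) over values k of A."""
    n = len(rows)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Hypothetical categorized transactions (amount, time slot, item, city).
rows = [
    {"amount": "Low",  "time": "wd1", "item": "Ti", "city": "LC"},
    {"amount": "High", "time": "wd1", "item": "Gl", "city": "IC"},
    {"amount": "Low",  "time": "we1", "item": "Ti", "city": "LC"},
    {"amount": "High", "time": "we1", "item": "Mi", "city": "NC"},
]
labels = ["legitimate", "fraud", "legitimate", "fraud"]

# The splitter chooses the attribute with the highest information gain.
best = max(["amount", "time", "item", "city"],
           key=lambda a: info_gain(rows, labels, a))
print(best)  # "amount" attains the maximal gain on this toy sample
```

The chosen attribute becomes the test at the current node, and the procedure recurses on each resulting subset until the subsets are pure.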
The fourth attribute, city of purchase/order placing, is also an important parameter which can assist fraud detection. For a card-present transaction the city of purchase is considered, while for an online transaction the order-placing city plays the relevant role; in an online transaction the order-placing city can be identified from the IP address. Transactions coming from dynamic IPs show irregular behavior. Therefore, to help identify fraud, the city of purchase/order placing is classified into three categories: local (LC), national (NC) and international (IC). All these converted attributes, shown in table 2, can be used as the input data to create a decision tree for identifying fraud. A sample decision tree constructed in this way is shown in fig. 1.

Fig. 1: Sample Decision Tree

IV. NEURAL NETWORK CLASSIFIER
Neural networks resemble the human brain: they can acquire knowledge through learning, and the knowledge is stored within inter-neuron connection strengths known as synaptic weights. The network is composed of a large number of highly interconnected processing elements (neurons) working in parallel to solve a specific problem. The disadvantage is that, because the network finds out how to solve the problem by itself, its operation can be unpredictable.

B. Neural network in the context of card fraud detection
There are different kinds of neural networks and neural network algorithms. The most popular neural network algorithm is back propagation, which works on multilayer feed-forward networks and is well suited for card fraud detection. A multilayer feed-forward neural network consists of an input layer, one or more hidden layers, and an output layer. Each unit takes as input a weighted sum of the outputs from the units in the previous layer and applies an activation function to this weighted input.

Fig. 2: A multilayer feed-forward neural network (inputs x1..xn in the input layer, a hidden layer, and outputs Ok in the output layer, connected by weighted links)

A.
Artificial neuron
An artificial neuron is a device with many inputs and one output. The neuron has two modes of operation: the training mode and the using mode. In the training mode, the neuron can be trained to fire (or not) for particular input patterns. In the using mode, when a taught input pattern is detected at the input, its associated output becomes the current output; if the input pattern does not belong to the taught list of input patterns, the firing rule is used to determine whether to fire or not.

Back propagation learns by iteratively processing a data set of training tuples, comparing the network's prediction for each tuple with the actual known value. The target value may be the known class label of the training tuple or a continuous value. For each training tuple, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual target value. These modifications are made in the backward direction, that is, from the output layer through each hidden layer down to the first hidden layer. In general the weights will eventually converge and the learning process stops.

A neural network is thus just a function with a number of weights which produces a score based on the data within card transactions. If random values were assigned to the weights, this function would be unlikely to generate meaningful scores or good detection performance, so the weights in the neural network need to be optimized. This optimization process is often referred to as learning or training and involves an iterative process of passing through a historical database of card transactions (with fraudulent and legitimate transactions clearly identified) and systematically adjusting the weights so that the score discriminates well between fraudulent and legitimate transactions.
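The iterative weight-adjustment idea can be sketched in miniature. The paper's setting is a multilayer back propagation network; for brevity, the toy example below trains just a single sigmoid unit (the degenerate one-layer case, not a full multilayer network) by gradient descent on a hypothetical labeled history with two scaled features per transaction:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scaled features (amount, distance from home); label 1 = fraud.
data = [((0.1, 0.0), 0), ((0.2, 0.1), 0), ((0.9, 0.8), 1), ((0.8, 0.9), 1)]

random.seed(0)
w = [random.uniform(-0.5, 0.5) for _ in range(2)]  # random initial weights
b = 0.0
lr = 1.0  # learning rate

def score(x):
    """Fraud score: activation of the weighted sum of the inputs."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

for _ in range(2000):          # iterative passes over the historical database
    for x, y in data:
        err = y - score(x)     # discrepancy between label and current score
        w = [wi + lr * err * xi for wi, xi in zip(w, x)]  # adjust weights
        b += lr * err

# After training, the score discriminates fraud-like from legitimate-like input.
print(score((0.9, 0.8)) > 0.5, score((0.1, 0.0)) < 0.5)  # True True
```

In the full multilayer case the same error signal is propagated backward through the hidden layers to update every weight, which is what gives back propagation its name.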
The term intelligence, used in connection with neural networks, refers to the knowledge about fraud patterns which is reflected in the values of the weights of the trained network: intelligence = the values of the weights.
The network can be simplified by removing weighted links that have the least effect on the trained network; this is called network pruning. Once the trained network has been pruned, clustering is used to find the set of common activation values for each hidden unit in a given trained two-layer neural network. The combinations of these activation values for each hidden unit are analyzed, and rules are derived relating combinations of activation values to corresponding output unit values. Similarly, the sets of input values and activation values are studied to derive rules describing the relationship between the input and hidden unit layers. Finally, the two sets of rules may be combined to form IF-THEN rules. A major disadvantage of neural networks lies in their knowledge representation: acquired knowledge in the form of a network of units connected by weighted links is difficult for humans to interpret.

V. BAYESIAN CLASSIFIER
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes' theorem.

A. Naïve Bayesian approach in card fraud prediction
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. Let D be the transaction history of a card holder, consisting of tuples X1, X2, ..., X|D| with associated class labels fraudulent and legitimate (Ci). Each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn) described by the attributes Y1, Y2, ..., Yn respectively. Some attributes which can be derived from transaction records are individually highly predictive of fraud. The classifier will predict that a transaction belongs to the class having the highest posterior probability P(Ci|X).
In order to reduce the computation involved in evaluating P(X|Ci) for transaction sets with many attributes (transaction amount, time, merchant, country, etc.), the naive assumption of class conditional independence is made (that is, there are no dependence relationships among the attributes). Thus

P(X|Ci) = Π_{k=1}^{n} P(x_k|Ci)    (2)

We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), ... from the training tuples; x_k refers to the value of attribute A_k for tuple X.

Card fraud detection is a complex problem domain involving many different input variables (for example transaction amount, time, merchant, merchant category code (MCC), country) arising from multiple transactions in a sequence. Some of these variables are continuous (e.g. amount, time) whereas others are categorical (e.g. MCC, country). A classifier computes a fraud score based on a multiplicity of continuous and categorical variables.

a) If A_k is categorical, then P(x_k|Ci) is the number of tuples of class Ci in D having the value x_k for A_k, divided by |Ci,D|, the number of training tuples of class Ci in D.

b) If A_k is continuous-valued, then P(x_k|Ci) is computed from a Gaussian distribution with mean µ and standard deviation σ, defined by

g(x, µ, σ) = (1 / (√(2π) σ)) e^(−(x−µ)² / (2σ²))    (4)

so that P(x_k|Ci) = g(x_k, µ_Ci, σ_Ci), where µ_Ci and σ_Ci are the mean and standard deviation of attribute A_k for the training tuples of class Ci.

The predicted class label is the class Ci for which P(X|Ci)P(Ci) is the maximum.    (3)

B. Bayesian Belief Networks
When the assumption of class conditional independence holds true, the naïve Bayesian classifier is the most accurate in comparison with all other classifiers. In practice, however, dependencies can exist between variables. Bayesian belief networks specify joint conditional probability distributions. This approach computes the probability distribution of each of the features and uses a process called evidence integration to compute a consolidated fraud probability from the individual feature probabilities. A belief network is defined by two components: a directed acyclic graph and a set of conditional probability tables (CPTs). Each node in the graph represents a random variable.
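The two estimation rules above can be sketched as follows; the transaction history, attribute values and category codes are hypothetical, with the transaction amount treated as Gaussian (rule b) and the city category as a simple frequency estimate (rule a):

```python
import math
from collections import Counter
from statistics import mean, stdev

# Hypothetical labeled history: (amount, city category, class label).
history = [
    (40.0, "LC", "legitimate"), (55.0, "LC", "legitimate"),
    (60.0, "NC", "legitimate"), (45.0, "LC", "legitimate"),
    (900.0, "IC", "fraud"), (700.0, "IC", "fraud"), (800.0, "NC", "fraud"),
]

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) for continuous attributes, as in eq. (4)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def score(amount, city, cls):
    """P(X|Ci)P(Ci) under the class-conditional independence assumption."""
    rows = [r for r in history if r[2] == cls]
    prior = len(rows) / len(history)                            # P(Ci)
    amounts = [r[0] for r in rows]
    p_amount = gaussian(amount, mean(amounts), stdev(amounts))  # continuous
    p_city = Counter(r[1] for r in rows)[city] / len(rows)      # categorical
    return p_amount * p_city * prior

# Predict the class with the maximum P(X|Ci)P(Ci) for a new transaction.
pred = max(["legitimate", "fraud"], key=lambda c: score(850.0, "IC", c))
print(pred)  # fraud
```

In practice a smoothing correction (e.g. Laplace smoothing) is applied to the categorical estimate so that attribute values unseen in a class do not force the whole product to zero.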
The variables may be discrete or continuous valued. Each arc represents a probabilistic dependence. If an arc is drawn from a node Y to a node Z, then Y is a parent or immediate predecessor of Z and Z is a descendant of Y. Each variable is conditionally independent of its nondescendants in the graph, given its parents. A belief network has one conditional probability table (CPT) for each variable; the CPT for a variable Y specifies the conditional distribution P(Y | Parents(Y)). Let X = (x1, ..., xn) be a transaction tuple described by the variables or attributes Y1, ..., Yn respectively. An example of a directed acyclic graph and its CPT are shown in fig. 3 and table 3 respectively. The joint probability distribution of the attribute values is given by

P(x1, ..., xn) = Π_{i=1}^{n} P(xi | Parents(Yi))    (5)

where each factor P(xi | Parents(Yi)) corresponds to an entry in the CPT for Yi.
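The factored form of the joint distribution can be sketched with a hypothetical two-variable network (Foreign → Fraud), where each CPT is a plain dictionary:

```python
# Hypothetical CPTs: P(Foreign) and P(Fraud | Foreign).
p_foreign = {True: 0.1, False: 0.9}
p_fraud_given = {
    True:  {True: 0.30, False: 0.70},
    False: {True: 0.01, False: 0.99},
}

def joint(foreign, fraud):
    """P(Foreign, Fraud) = P(Foreign) * P(Fraud | Foreign), as in eq. (5)."""
    return p_foreign[foreign] * p_fraud_given[foreign][fraud]

print(joint(True, True))                           # 0.1 * 0.3
# Marginal fraud probability: sum the joint over the parent's values.
print(sum(joint(f, True) for f in (True, False)))  # 0.1*0.3 + 0.9*0.01
```

With more nodes the same chain-rule product runs over every variable given its parents, which is exactly what eq. (5) expresses.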
Fig. 3: Directed Acyclic Graph
TABLE 3: Conditional Probability Table

C. Training Bayesian Belief Networks
Several algorithms exist for learning the network topology from the training data. Experts must specify conditional probabilities for the nodes that participate in direct dependencies; these probabilities can then be used to compute the remaining probability values. If the network topology is known and the variables are observable, then learning the network consists of computing the CPT entries, as is similarly done in naïve Bayesian classification. When the network topology is given and some of the variables are hidden, the gradient descent method can be used to train the belief network.

Let D be a training set of data tuples X1, X2, ..., X|D|, and let w_ijk be the CPT entry for the variable Yi = y_ij having the parents Ui = u_ik, i.e. w_ijk = P(Yi = y_ij | Ui = u_ik). The w_ijk are viewed as weights and are initialized to random probability values. The gradient descent method performs greedy hill climbing: in each iteration the weights are updated, and they will eventually converge to a local optimum solution. We maximize

P_w(D) = Π_{d=1}^{|D|} P_w(X_d)    (8)

This can be done by the following steps:

1. Compute the gradients: for each i, j, k,

∂ ln P_w(D) / ∂w_ijk = Σ_{d=1}^{|D|} P(Yi = y_ij, Ui = u_ik | X_d) / w_ijk    (9)

2. Update the weights:

w_ijk ← w_ijk + l (∂ ln P_w(D) / ∂w_ijk)    (10)

where l is the learning rate, which is set to a small constant.

3. Renormalize the weights so that Σ_j w_ijk = 1 for all i, k.

TABLE 4: Classifiers Comparison

Sl.no | Feature | Neural Network | Bayesian classifier | Decision Tree
1 | Transparency of reasoning | The acquired knowledge, in the form of a network of units connected by weighted links, is difficult for humans to interpret. | The fraud score is calculated from the probabilities of the attributes and is transparent to a user. | A statistical property called information gain measures the purity of training samples according to their target classification, which is transparent to a user.
2 | Sparsity | Effective for continuous attributes but inaccurate where the data is sparse. | Accurate fraud scores in the presence of sparsity. | Accurate fraud scores in the presence of sparsity.
3 | Size of training set | Produces the best results for large transaction sets. | Effective even for small to medium size transaction sets. | The model overfits the data for large data sets; pruning techniques are required to correct the overfitting problem.
4 | Training time | Very long training time. | Training times are short. | Training time required is more than Bayesian and less than neural network.

VI. TESTING AND MODEL EVALUATION
Three alternative models can be built, each based on a different method, and tested against the training set. The decision tree model is prepared by using a splitter algorithm. The neural network and belief network can be trained by using the whole sample as a training set and tested against it. The key differences between the decision tree, neural network and Bayesian approaches to card fraud detection are in the areas of transparency of reasoning, handling of sparsity, model training time and the data required. The disadvantage of
decision tree induction is that it considers only one attribute at a time, which reduces its performance. Neural network classifiers are suitable only for larger databases and take a long time to train. Bayesian classifiers are more accurate, much faster to train, and suitable for small, medium and large databases, but they are slower when applied to new instances. A comparative assessment of the models' performance leads to the conclusion that the Bayesian belief network outperforms the other two models and achieves outstanding classification accuracy. The neural network achieves a satisfactorily high performance. Finally, the decision tree's performance is considered rather low.

VII. CONCLUSION
Intrusion detection is important in today's computing environment. The combination of factors such as the extensive growth of the internet, the vast financial possibilities opening up in electronic trade and the lack of truly secure systems creates more opportunities for criminals to attack systems. A hybrid of the anomaly and misuse detection models can improve fraud detection and the security of systems. The inclusion of biometric identifiers such as fingerprint scans, retinal patterns, DNA sequences, signature or voice can produce a more secure system. A specially designed fraud pattern mining algorithm reduces the detection delay. The parallel architecture of neural networks is very well suited for real-time applications, but with changing fraud patterns, training the model requires a long time, so it is better suited for large data sets. The Bayesian classifier offers improved detection performance at reduced cost; this enables smaller and mid-size institutions to implement cost-effective intelligent fraud solutions, which is impossible with neural-network-based applications. Popular supervised algorithms such as neural networks, Bayesian networks and decision trees can be combined or applied in a sequential fashion to improve results.

VIII. REFERENCES
[1] S. Ghosh and D.L. Reilly, "Credit Card Fraud Detection with a Neural Network," Proceedings of the International Conference on System Science, pp. 621-630, 1994.
[2] E. Aleskerov, B. Freisleben and B. Rao, "CARDWATCH: A Neural Network Based Database Mining System for Credit Card Fraud Detection," Proceedings of the IEEE/IAFE Conference on Computational Intelligence for Financial Engineering (CIFEr), pp. 220-226, 1997.
[3] M. Syeda, Y.Q. Zhang and Y. Pan, "Parallel Granular Neural Networks for Fast Credit Card Fraud Detection," Proceedings of the IEEE International Conference on Fuzzy Systems, pp. 572-577, 2002.
[4] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann Publishers.
[5] http://paynetsystems.com/credit card processing» 2007».htm
[6] NeuroDimension - Fraud Detection Using Neural Networks and Sentinel Solutions (Smartsoft).htm
[7] Wei Fan, Haixun Wang, Philip S. Yu and Salvatore J. Stolfo, "A Fully Distributed Framework for Cost-Sensitive Data Mining," Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS 02).
[8] R. Brause, T. Langsdorf and M. Hepp, "Neural Data Mining for Credit Card Fraud Detection."
[9] S.J. Stolfo, D.W. Fan, W. Lee, A.L. Prodromidis and P.K. Chan, "Credit Card Fraud Detection Using Meta-Learning: Issues and Initial Results," Proc. AAAI Workshop on AI Methods in Fraud and Risk Management, pp. 83-90, 1997.
[10] Clifton Phua, Vincent Lee, Kate Smith and Ross Gayler, "A Comprehensive Survey of Data Mining-based Fraud Detection Research," 2005.
[11] Philip K. Chan, Wei Fan, Andreas L. Prodromidis and Salvatore J. Stolfo, "Distributed Data Mining in Credit Card Fraud Detection," IEEE Intelligent Systems, November/December 1999.
[12] Chun Wei Clifton Phua, "Investigative Data Mining in Fraud Detection," thesis, November 2003.