DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2
Multimedia University, Cyberjaya, MALAYSIA
1 lakshmi.mahanra@gmail.com
2 y.p.singh@mmu.edu.my

Abstract

Credit card fraud is a serious and growing problem in the banking industry. With the rise of the many web services provided by banks, banking fraud is also on the rise. Banking systems maintain strong security measures in order to detect and prevent fraudulent activity in transactions of any kind. Although totally eliminating banking fraud is almost impossible, machine learning techniques can minimize fraud and help prevent it from happening. This paper conducts experiments on banking fraud using ensemble tree learning techniques and a genetic algorithm to induce ensembles of decision trees on bank transaction datasets for identifying and preventing bank fraud. It also evaluates the effectiveness of ensembles of decision trees on credit card datasets.

Keywords: ensemble decision tree induction, ensemble methods, genetic algorithm, credit card fraud dataset.

1. Introduction

With developments in information technology and improvements in communication channels, fraud is spreading all over the world, resulting in huge financial losses. Considerable research has been done in the field of fraud detection, with various methods employed for detection and prevention [2]. Methods such as decision tree learning, support vector machines, neural networks, expert systems and artificial immune systems have been explored for fraud detection [1], [2]. The scope of this paper is limited to credit card application fraud and risk, addressed through decision tree induction using ensemble learning techniques and genetic algorithms. Due to the sensitivity of customers' financial information, obtaining clean data for mining applications is difficult. The datasets used in the present paper, the German Credit dataset and the Australian Credit dataset, are obtained from the UCI (University of California, Irvine) Machine Learning Repository.

Fraud prevention has always attracted interest from financial institutions, since the advent of technologies such as the telephone, automatic teller machines (ATMs) and credit card systems has increased the volume of fraud losses for many banks [6]. In this context, fraud prevention, and automatic fraud detection in particular, is an open field for the application of all known classification methods. Classification techniques play a very important role, since they are able to learn from past experience (fraud that happened in the past) and classify new instances (transactions) as fraudulent or legitimate.

The rest of the paper is organized as follows. Section 2 describes the C4.5 decision tree classification algorithm and the AdaBoost ensemble learning algorithm. Section 3 describes the genetic algorithm and wrapper methods. Section 4 presents the performance results of the C4.5 algorithm, the wrapper methods and the AdaBoost ensemble learning algorithm with different parameters. Finally, Section 5 concludes with the experimental results and a summary.
2. Decision Tree Induction Techniques

2.1. Decision Tree and C4.5

Classifier construction is a common task in many data mining applications. A decision tree is a structure used to model data. Decision trees use a divide-and-conquer technique, which divides a problem into simpler problems until it becomes easy enough to solve [3]. C4.5 was developed by Ross Quinlan as an extension of the ID3 decision tree learning algorithm [5]. C4.5 is based on a greedy, top-down recursive partitioning of the dataset. Given a training set in which every instance has a class label, the C4.5 algorithm learns a classifier on it. This classifier predicts a class label for an unknown instance to accomplish the classification task. C4.5 builds a decision tree using information gain as the heuristic for selecting the best attribute as a node in the tree to split the training set. Quinlan introduced the gain ratio in this version in place of information gain [5].

2.2. C4.5 Decision Tree Induction Algorithm

Input: Training set S of n examples, node R;
Output: decision tree with root R;
1. If the instances in S belong to the same class, or the number of instances in S is too small, set R as a leaf node and label R with the most frequent class in S;
2. Otherwise, choose a test attribute X with two or more values (outcomes) based on a selection criterion, and label the node R with X;
3. Partition S into subsets S1, S2, ..., Sm according to the outcome of attribute X for each instance; generate m child nodes R1, R2, ..., Rm of R;
4. For every pair (Si, Ri), recursively build a subtree with root Ri.

2.3. Ensemble Methods

Ensemble methods can be applied to improve a classifier's prediction accuracy. Ensemble methods are normally used to construct an ensemble of trees for a given dataset; examples are bagging and boosting. One way to classify an instance is to take the votes, or weighted votes, of an ensemble of classifiers and compute the decision as a weighted average; both bagging and boosting use this approach. In bagging the models receive equal weights, whereas in boosting more weight is given to the more successful models [4]. AdaBoost, also known as Adaptive Boosting, is used as part of the implementation to boost the performance of the decision tree, and it is implemented in WEKA (Waikato Environment for Knowledge Analysis) as AdaBoost.M1 [4]. The boosting algorithm can be applied to any classifier learning algorithm [9]. The AdaBoost algorithm in pseudo-code form is given in Figure 1 [7]. The algorithm creates an ensemble of classifiers, each having a weighted vote that is a function of its training error ε_t [4].
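As a complementary illustration of the ensemble idea (the AdaBoost.M1 pseudo-code itself follows in Figure 1), the minimal sketch below trains a boosted ensemble of decision trees and compares it with a single tree. It uses scikit-learn rather than the WEKA implementation employed in this paper's experiments, and the synthetic dataset and parameter values are assumptions made purely for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a 1000-instance, 20-attribute credit approval dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# A single decision tree versus a boosted ensemble; in the ensemble, every tree
# casts a vote weighted by how well it performed on the reweighted training data.
single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
booster = AdaBoostClassifier(n_estimators=100, random_state=0)  # default base learner is a depth-1 decision tree
booster.fit(X_train, y_train)

print("Single tree accuracy: %.2f%%" % (100 * single_tree.score(X_test, y_test)))
print("Boosted accuracy:     %.2f%%" % (100 * booster.score(X_test, y_test)))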
Input: Training set S = {(x_i, y_i)}, i = 1, ..., N, with class labels y_i in Y = {c_1, ..., c_m}; T: number of iterations; I: weak learner
Output: Boosted classifier H(x) = arg max over y in Y of the sum of α_t over all t with h_t(x) = y, where h_t and α_t are the induced classifiers (with h_t(x) in Y) and their assigned weights respectively
1: D_1(i) <- 1/N for i = 1, ..., N
2: for t = 1 to T do
3:   h_t <- I(S, D_t)
4:   ε_t <- sum of D_t(i) over all i with h_t(x_i) ≠ y_i
5:   if ε_t > 1/2 then
6:     T <- t - 1
7:     return
8:   end if
9:   α_t <- (1/2) ln((1 - ε_t) / ε_t)
10:  D_{t+1}(i) <- D_t(i) · exp(α_t) if h_t(x_i) ≠ y_i, and D_t(i) otherwise, for i = 1, ..., N
11:  Normalise D_{t+1} to be a proper distribution
12: end for
Figure 1: Pseudocode for AdaBoost.M1

3. Decision Tree Induction Using Genetic Algorithms

The majority of existing algorithms for learning decision trees are greedy: a tree is induced top-down, making locally optimal decisions at each node. In most cases, however, the constructed tree is not globally optimal. Furthermore, greedy algorithms require a fixed amount of time and are not able to generate a better tree if additional time is available. The Genetic Algorithm (GA) is an evolutionary technique that mimics the process of natural evolution and is used as a heuristic search algorithm for decision tree learning. A GA works by randomly creating a group of individuals (represented as chromosomes), here a population of decision trees. The individuals, or solutions, are then evaluated by determining a fitness level for each solution in the population. Fitness is a value assigned to a solution that indicates how far from, or close to, the best solution it is; the greater the assigned value, the better the solution. These solutions are then reproduced to create one or more offspring, which are mutated randomly. This continues until a suitable solution is reached. The algorithm evolves through three operators: selection, crossover and mutation. In this paper, decision trees generated using the genetic algorithm are evaluated and compared against decision trees generated initially with and without boosting using AdaBoost.M1.
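To make the three operators concrete, the following minimal sketch (not WEKA's GeneticSearch, which is what this paper's experiments use) evolves bit-string chromosomes that encode attribute subsets and scores each chromosome by the cross-validated accuracy of a decision tree trained on the selected attributes, anticipating the wrapper idea of Sections 3.1 and 3.2. The population size of 50 matches the setting used later in the paper; the synthetic dataset, mutation rate and number of generations are assumptions.

import random
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in dataset; a chromosome is a list of 0/1 bits, one per attribute.
X, y = make_classification(n_samples=500, n_features=20, n_informative=6, random_state=0)
N_FEATURES, POP_SIZE, GENERATIONS, MUTATION_RATE = X.shape[1], 50, 20, 0.05
rng = random.Random(0)

def fitness(chromosome):
    """Wrapper-style fitness: cross-validated accuracy of a tree on the selected attributes."""
    cols = [i for i, bit in enumerate(chromosome) if bit]
    if not cols:                                   # an empty attribute subset cannot be evaluated
        return 0.0
    return cross_val_score(DecisionTreeClassifier(random_state=0), X[:, cols], y, cv=3).mean()

def select(population, scores):
    """Tournament selection: pick two candidates at random, keep the fitter one."""
    a, b = rng.sample(range(len(population)), 2)
    return population[a] if scores[a] >= scores[b] else population[b]

def crossover(parent1, parent2):
    """Single-point crossover of two bit strings."""
    point = rng.randrange(1, N_FEATURES)
    return parent1[:point] + parent2[point:]

def mutate(chromosome):
    """Flip each bit with a small probability."""
    return [1 - bit if rng.random() < MUTATION_RATE else bit for bit in chromosome]

population = [[rng.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    scores = [fitness(c) for c in population]
    population = [mutate(crossover(select(population, scores), select(population, scores)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print("Selected attributes:", [i for i, bit in enumerate(best) if bit])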
3.1. Feature Selection

Feature selection, or variable selection, has been a focus of research in many application areas, such as fraud detection, text processing and network intrusion detection, where the number of attributes can be considerably large. For example, the German credit application dataset has 20 attributes, but for credit card transactions there can be more than 30 attributes, and the number of attributes grows throughout the banking process. As such, an effective and accurate feature selection method is required in order to classify instances correctly. Feature selection means selecting a subset of the attributes occurring in the training set to be used as features in classification. The wrapper approach has been examined as a way of integrating the GA for selecting a subset of attributes. Choosing irrelevant attributes causes accuracy to degrade (overfitting), whereas applying the induction algorithm to a set of relevant attributes increases accuracy. This was shown experimentally by Kohavi and John (1997) using C4.5 on a credit approval dataset [8].

3.2. Wrapper Approach

In the wrapper approach to attribute subset selection, a search for a good subset is "wrapped" around the induction algorithm, and the learning algorithm itself is used as the evaluation function [8]. The learning algorithm is fed the dataset, which is partitioned into internal training and test sets, with different sets of features removed from the data. The feature subset with the highest evaluation is chosen as the final set on which to run the learning algorithm. The resulting classifier is then evaluated on an independent test set that was not used during the search. To evaluate GA-based attribute selection using the wrapper method, experiments are conducted with ID3 and C4.5 on the German credit approval dataset. Tree size, accuracy and the time taken for attribute selection are recorded for comparison. The population size is set to 50 for the GeneticSearch engine in WEKA.

4. Experiments and Results

The implementation of the proposed solution is limited to credit approval risk. Based on this scope, a credit approval dataset is obtained for the experimental studies. Two forms of experimental results are provided: results for decision trees without any boosting technique, and results for decision trees with AdaBoost.M1.

4.1. German Credit Dataset

The German Credit dataset has been obtained from the UCI Repository of Machine Learning Databases. This dataset classifies people, described by a set of attributes, as good or bad credit risks. It contains 1000 instances with 7 numerical attributes and 13 categorical (nominal) attributes, making a total of 21 attributes together with the risk class (Good or Bad).

4.2. Experiment: Decision Tree without Boosting

Decision trees are first induced without any boosting using the ID3 and C4.5 algorithms. The dataset has numeric attributes, which are discretized before using ID3. The resulting decision tree is then recorded for the performance analysis. A percentage split of 70% is used, in which 70% of the data is used for training and 30% is kept as test data. For ID3, classification accuracy is 69.00%, with 207 instances correctly classified, as shown in Table 7. For C4.5, classification accuracy is 73.67%, as shown in Table 7, with 221 instances correctly classified, as given in the confusion matrix in Table 1.
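A minimal sketch of this percentage-split evaluation is given below. It uses a scikit-learn decision tree in place of WEKA's ID3/J48, and the synthetic stand-in data is an assumption; the real German Credit data would have to be loaded and its nominal attributes encoded before this could be applied to it.

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 1000-instance, 20-attribute German Credit dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = tree.predict(X_test)                    # 300 held-out test instances, as in the paper
print("Accuracy: %.2f%%" % (100 * accuracy_score(y_test, y_pred)))
print(confusion_matrix(y_test, y_pred))          # rows = actual class, columns = predicted class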
WEKA uses 300 samples for testing the tree. Based on the confusion matrix for C4.5 given below in Table 1, 29 instances of the good class are wrongly classified (as bad) and 50 instances of the bad class are wrongly classified (as good).
Table 1: Confusion Matrix for C4.5 without Boosting

   a    b   <-- classified as
 192   29 |  a = 1
  50   29 |  b = 2

Table 2: Confusion Matrix for ID3 without Boosting

   a    b   <-- classified as
 167   42 |  a = 1
  30   40 |  b = 2

4.3. Experiment: Decision Tree with Boosting

For the boosting experiment, the number of iterations is set to 100 and resampling is set to true. A percentage split of 70% is used, in which 70% of the data is used for training and 30% is kept as test data; this split criterion is used throughout the testing, as in the first experiment. It is observed from the resulting tree that attribute A1 (Status of existing checking account) represents the top node of the boosted ID3 decision tree. The classification accuracy using the Percentage Split testing option is 75%, so the number of correctly classified instances increased compared to the earlier test without boosting. The confusion matrix in Table 3 shows that 23 instances of the good class are incorrectly classified (as bad) and 52 instances of the bad class are incorrectly classified (as good).

Table 3: Confusion Matrix for ID3 with Boosting

   a    b   <-- classified as
 198   23 |  a = 1
  52   27 |  b = 2

Similarly for C4.5, with the same parameters as for ID3, it is observed that attribute A1 (Status of existing checking account) represents the top node of the decision tree. The classification accuracy using the Percentage Split testing option is 79%, so the number of correctly classified instances again increased compared to the earlier test without boosting. The confusion matrix in Table 4 shows that 23 instances of the good class are incorrectly classified (as bad) and 40 instances of the bad class are incorrectly classified (as good).
Table 4: Confusion Matrix for C4.5 with Boosting

   a    b   <-- classified as
 198   23 |  a = 1
  40   39 |  b = 2

It is observed that the boosted decision trees outperform the decision trees without boosting, as shown in Table 7.

Table 7: Summary Results for ID3, C4.5 and Boosting Experiments (Percentage Split 70%)

Classifier               Correctly Classified   Incorrectly Classified   Accuracy
ID3                      207                    72                       69.00%
ID3 using AdaBoost.M1    225                    75                       75.00%
C4.5                     221                    79                       73.67%
C4.5 using AdaBoost.M1   237                    63                       79.00%

4.4. Experiment: Decision Tree with GeneticSearch

The training set of the German dataset is loaded and pre-processed using the filter option in WEKA. After attribute selection is applied to the dataset, seven attributes are selected. Using this subset, the remaining attributes are removed accordingly and the data is passed on to ID3 for final evaluation. Classification accuracy increases to 74.67% compared to evaluating the ID3 classifier alone on the dataset, and a smaller tree is derived. The confusion matrix in Table 5 shows that 24 instances of the good class are wrongly classified (as bad) and 50 instances of the bad class are wrongly classified (as good).

Table 5: Confusion Matrix for ID3 with GeneticSearch

   a    b   <-- classified as
 197   24 |  a = 1
  50   27 |  b = 2

For C4.5, classification accuracy increases to 76.67% compared to evaluating the C4.5 classifier alone on the dataset, and a smaller tree is again derived. The confusion matrix in Table 6 shows that 10 instances of class 1 are wrongly classified and 18 instances of class 0 are wrongly classified.
Table 6: Confusion Matrix for C4.5 with GeneticSearch

   a    b   <-- classified as
  84   10 |  a = 1
  18   95 |  b = 0

Table 8 summarizes the results of the experiments with the GA.

Table 8: Summary Results for ID3 and C4.5 with GeneticSearch on the German Credit Dataset (Percentage Split 70%)

Classifier      Correctly Classified   Incorrectly Classified   Accuracy
ID3 using GA    224                    74                       74.67%
C4.5 using GA   230                    70                       76.67%

5. Conclusion

This paper investigated decision tree learning using ID3, C4.5, ensemble methods and wrapper techniques, conducting experiments aimed at identifying and preventing bank fraud. It also evaluated the effectiveness of ensembles of decision trees on the credit card dataset. Experimental results show that the GA combined with ID3 or C4.5 performs better than the ID3 or C4.5 classifier alone; the results are provided and compared in Tables 7 and 8. They also show that C4.5 with AdaBoost.M1 gives higher accuracy than the other configurations.

References

[1] Kou, Y., Lu, C., Sirwongwattana, S., Huang, Y., Survey of Fraud Detection Techniques. International Conference on Networking, Sensing & Control, 749-754, 2004.
[2] Delamaire, L., Abdou, H., Pointon, J., Credit Card Fraud and Detection Techniques: A Review. Banks and Bank Systems, 4(2), 57-68, 2009.
[3] Rocha, B.C., Sousa Junior, R., Identifying Bank Frauds Using CRISP-DM and Decision Trees. International Journal of Computer Science & Information Technology, 2(5), 162-169, 2010.
[4] Witten, I.H., Frank, E., Hall, M.A., Data Mining: Practical Machine Learning Tools and Techniques (3rd ed.), Morgan Kaufmann, 2011.
[5] Quinlan, J.R., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[6] Bernama (2009, December 9). 1,191 Internet banking fraud cases detected in M'sia Jan-June. The Star.
[7] Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F., A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(4), 463-484, 2012.
[8] Kohavi, R., John, G.H., Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1-2), 273-324, 1997.
[9] Freund, Y., Schapire, R.E., Experiments with a New Boosting Algorithm. Machine Learning: Proceedings of the Thirteenth International Conference, 148-156, 1996.