DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES
Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2
Multimedia University, Cyberjaya, MALAYSIA

Abstract
Credit card fraud is a serious and growing problem in the banking industry. With the rise of the many web services provided by banks, banking fraud is also on the rise. Banking systems maintain strong security controls in order to detect and prevent fraudulent transactions of any kind. Although totally eliminating banking fraud is almost impossible, machine learning techniques can minimise fraud and help prevent it from happening. This paper conducts experiments that study banking fraud using ensemble tree learning techniques and a genetic algorithm to induce ensembles of decision trees on bank transaction datasets for identifying and preventing bank fraud. It also evaluates the effectiveness of ensembles of decision trees on the credit card dataset.

Keywords: ensemble decision tree induction, ensemble methods, genetic algorithm, credit card fraud dataset.

1. Introduction
With developments in information technology and improvements in communication channels, fraud is spreading all over the world, resulting in huge financial losses. Several studies have been carried out in the field of fraud detection, employing various methods of detection and prevention [2]. Methods such as decision tree learning, support vector machines, neural networks, expert systems and artificial immune systems have been explored [1], [2] for fraud detection. The scope of this paper is restricted to credit card application fraud and risk, based on decision tree induction using ensemble learning techniques and genetic algorithms. Due to the sensitivity of customers' financial information, obtaining clean data for mining applications is hard.
The datasets used in the present paper, the German Credit Card dataset and the Australian Credit Card dataset, are obtained from the UCI (University of California, Irvine) machine learning repository. Fraud prevention is a subject that has always attracted interest from financial institutions, since technologies such as the telephone, automatic teller machines (ATMs) and credit card systems have leveraged the volume of fraud losses of many banks [6]. In this context fraud prevention, and automatic fraud detection in particular, arises as an open field for the application of all known classification methods. Classification techniques play a very important role, since they are able to learn from past experience (fraud that happened in the past) and classify new instances (transactions) into a fraud group or a legitimate group. The rest of the paper is organized as follows. Section 2 describes the C4.5 decision tree classification algorithm and the AdaBoost ensemble learning algorithm. Section 3 describes the genetic algorithm through wrapper methods. Section 4 shows the performance of the C4.5 algorithm, wrapper methods and the AdaBoost ensemble learning algorithm with different parameters. Finally, Section 5 concludes with experimental results and a summary.

Organized by WorldConferences.net 321
2. Decision Tree Induction Techniques

2.1. Decision Tree and C4.5
Classifier construction is a common task in many data mining applications. A decision tree is a structure that is used to model data. Decision trees use a divide-and-conquer technique, which divides a problem into simpler problems until they become easy to solve [3]. C4.5 was developed by Ross Quinlan as an extension of the ID3 decision tree learning algorithm [5]. C4.5 is based on a greedy, top-down recursive partitioning of datasets. Given a training set in which every instance has a class label, the C4.5 algorithm learns a classifier on it. This classifier predicts a class label for an unknown instance to accomplish the classification task. C4.5 builds a decision tree using information gain as a heuristic value for selecting the best attribute as a node in the tree to split the training set; Quinlan introduced the gain ratio in this version in place of plain information gain [5].

C4.5 Decision Tree Induction Algorithm
Input: Training set S of n examples, node R
Output: decision tree with root R
1. If the instances in S belong to the same class, or the number of instances in S is too small, set R as a leaf node and label R with the most frequent class in S;
2. Otherwise, choose a test attribute X with two or more values (outcomes) based on a selection criterion, and label the node R with X;
3. Partition S into subsets S1, S2, ..., Sm according to the outcome of attribute X, and generate m child nodes R1, R2, ..., Rm;
4. For every pair (Si, Ri), recursively build a subtree with root Ri.

2.2. Ensemble Methods
Ensemble methods can be applied to improve a classifier's prediction accuracy. Ensemble methods are normally used to construct an ensemble of trees for a given dataset; examples are bagging and boosting. One way to classify is to take votes or weighted votes of an ensemble of classifiers and base the decision on the weighted average.
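As a small illustration of the attribute-selection criterion described in Section 2.1 (a toy sketch, not Quinlan's implementation), information gain and gain ratio for a categorical attribute can be computed as:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    """Information gain and gain ratio of splitting `rows` on one attribute.

    `rows` is a list of attribute tuples, `labels` the matching class labels.
    """
    base = entropy(labels)
    # Partition the class labels by the value of the chosen attribute.
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(y)
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    gain = base - remainder
    # Split information penalises many-valued attributes (C4.5's correction to ID3).
    split_info = entropy([row[attr_index] for row in rows])
    return gain, (gain / split_info if split_info > 0 else 0.0)

# Toy data: attribute 0 separates the classes perfectly, attribute 1 does not.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["good", "good", "bad", "bad"]
g0, gr0 = gain_ratio(rows, labels, 0)
g1, gr1 = gain_ratio(rows, labels, 1)
```

On this toy set attribute 0 achieves the maximum gain (1 bit) and would be chosen as the split, while attribute 1 contributes no information.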
Both bagging and boosting use this approach. In bagging the models receive equal weights, whereas in boosting more influence is given to the more successful models [4]. AdaBoost, also known as Adaptive Boosting, is used here to boost the performance of the decision tree, and it is implemented in WEKA (Waikato Environment for Knowledge Analysis) as AdaBoost.M1 [4]. This boosting algorithm can be applied to any classifier's learning algorithm [9]. The AdaBoost algorithm in pseudocode form is given in Figure 1 [7]. This algorithm creates an ensemble of classifiers, each having a weighted vote that is a function of its weighted training error [4].
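As an illustrative sketch of this weighting scheme (assuming a pool of simple threshold "stumps" as the weak learners; this is not the WEKA implementation), each boosting round picks the weak classifier with the lowest weighted error, then down-weights the examples it classified correctly:

```python
import math

def train_adaboost(points, labels, stumps, T=10):
    """AdaBoost.M1-style boosting on a toy 1-D problem.

    `stumps` is a pool of candidate weak classifiers (functions x -> label).
    Returns (classifier, vote_weight) pairs; the vote weight is ln(1/beta_t).
    """
    n = len(points)
    D = [1.0 / n] * n                       # start with a uniform distribution
    ensemble = []
    for _ in range(T):
        # Weighted error of a stump under the current distribution D.
        def werr(h):
            return sum(d for d, x, y in zip(D, points, labels) if h(x) != y)
        h = min(stumps, key=werr)
        eps = werr(h)
        if eps == 0:
            ensemble.append((h, 1e9))       # a perfect stump dominates the vote
            break
        if eps >= 0.5:                      # M1 stops if the weak learner fails
            break
        beta = eps / (1.0 - eps)
        # Down-weight correctly classified examples, then renormalise.
        D = [d * (beta if h(x) == y else 1.0)
             for d, x, y in zip(D, points, labels)]
        s = sum(D)
        D = [d / s for d in D]
        ensemble.append((h, math.log(1.0 / beta)))
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of the boosted ensemble."""
    votes = {}
    for h, w in ensemble:
        votes[h(x)] = votes.get(h(x), 0.0) + w
    return max(votes, key=votes.get)
```

For example, with points [1, 2, 3, 4], labels ["bad", "bad", "good", "good"] and threshold stumps at 0..3, boosting selects the stump that splits at 2 and the ensemble reproduces the labels.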
Input: Training set S = {(x_i, y_i)}, i = 1, ..., N; T: number of iterations; I: weak learner
Output: Boosted classifier H(x) = arg max_y Σ_{t: c_t(x) = y} ln(1/β_t), where the c_t are the induced classifiers and ln(1/β_t) their assigned weights
1: D_1(i) = 1/N for i = 1, ..., N
2: for t = 1 to T do
3:   c_t = I(S, D_t)
4:   ε_t = Σ_{i: c_t(x_i) ≠ y_i} D_t(i)
5:   if ε_t > 1/2 then
6:     T = t - 1
7:     return
8:   end if
9:   β_t = ε_t / (1 - ε_t)
10:  D_{t+1}(i) = D_t(i) · β_t^{[c_t(x_i) = y_i]} for i = 1, ..., N
11:  Normalise D_{t+1} to be a proper distribution
12: end for

Figure 1: Pseudocode for AdaBoost.M1

3. Decision Tree Induction Using Genetic Algorithm
The majority of existing algorithms for learning decision trees are greedy: a tree is induced top-down, making locally optimal decisions at each node. In most cases, however, the constructed tree is not globally optimal. Furthermore, greedy algorithms require a fixed amount of time and are not able to generate a better tree if additional time is available. The Genetic Algorithm (GA) is an evolutionary technique which mimics the process of natural evolution and is used as a search heuristic for decision tree learning. A GA works by randomly creating a group of individuals (represented as chromosomes), here a population of decision trees. The individuals, or solutions, are then evaluated by determining a fitness level for each solution in the population. Fitness is a value assigned to a solution to determine how far from or close to the best solution it is; the greater the assigned value, the better the solution. These solutions are then reproduced to create one or more offspring, which are mutated randomly. This continues until a suitable solution is reached. Basically, the algorithm evolves through three operators: selection, crossover and mutation. In this paper, decision trees generated using the genetic algorithm are compared for performance against decision trees generated without boosting and with AdaBoost.M1 boosting.
3.1. Feature Selection
Feature selection, or variable selection, has been a focus of research in many application areas, such as fraud detection, text processing and network intrusion detection. In these areas the number of attributes can be considerably large. For example, the German credit card application dataset has 20 attributes, but for credit card transactions there can be more than 30 attributes, and the number of attributes grows throughout the banking process. As such, an effective and accurate feature selection method is required in order to classify the instances correctly. Feature selection means selecting a subset of the attributes occurring in the training set to use as features in classification. The wrapper approach has been examined as a way to integrate a GA for selecting the subset of attributes. Choosing irrelevant attributes causes accuracy to degrade (overfitting), whereas applying the induction algorithm to a set of relevant attributes increases accuracy. This was shown experimentally by Kohavi and John (1997) using C4.5 on a credit approval dataset [8].

3.2. Wrapper Approach
Attribute subset selection is performed by having an induction algorithm "wrapped" around a search engine, using the learning algorithm itself as the evaluation function [8]. The learning algorithm is fed with the dataset and partitions it into internal training and test sets, with different sets of features removed from the data. The feature subset with the highest evaluation is chosen as the final set on which to run the learning algorithm. The resulting classifier is then evaluated on an independent test set that was not used during the search. To evaluate attribute selection by GA using the wrapper method, experiments are conducted with ID3 and C4.5 on the German credit approval dataset.
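The GA's operators (selection, crossover, mutation) applied to attribute-subset search can be sketched as follows. This is a toy illustration, not WEKA's GeneticSearch: subsets are bit masks, and the hypothetical `toy_fitness` function stands in for the wrapper's inner train/test evaluation of a classifier.

```python
import random

def ga_feature_select(n_features, fitness, pop_size=20, generations=30,
                      p_mut=0.1, seed=0):
    """Toy genetic search over feature subsets represented as bit masks."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        next_pop = scored[:2]                       # elitism: keep the best two
        while len(next_pop) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, n_features)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with probability p_mut per gene.
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

# Hypothetical fitness: reward selecting the first three "relevant" attributes,
# penalise every extra attribute (mimicking overfitting on irrelevant ones).
def toy_fitness(mask):
    return sum(mask[:3]) - 0.2 * sum(mask[3:])

best = ga_feature_select(10, toy_fitness)
```

In a real wrapper, the fitness call would train the classifier on the internal training split restricted to the masked attributes and return its accuracy on the internal test split.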
The tree size, accuracy and time taken for attribute selection are recorded for comparison. The population has been set to 50 for the GeneticSearch engine in WEKA.

4. Experiments and Results
The implementation of the proposed solution is limited to credit approval risk. Based on this scope, a dataset containing credit approvals was obtained for the experimental studies. Two forms of experimental results are provided: results for decision trees without any boosting, and results for decision trees with AdaBoost.M1.

4.1. German Credit Dataset
The German Credit dataset was obtained from the UCI Repository of Machine Learning Databases. This dataset classifies people, described by a set of attributes, as good or bad credit risks. It contains 1000 instances with 7 numerical and 13 categorical (nominal) attributes, making a total of 21 attributes together with the risk class (Good or Bad).

4.2. Experiment: Decision Tree without Boosting
Decision trees are induced without any boosting using the ID3 and C4.5 algorithms. The dataset has numeric attributes, which are discretized before using ID3. The resulting decision tree is then recorded for the performance analysis. A percentage split of 70% is used, in which 70% of the data is used for training and 30% is kept as test data. Classification accuracy is 69.00% for ID3, with 207 instances correctly classified, as shown in Table 7. For C4.5, classification accuracy is 73.67% (Table 7), with 221 instances correctly classified as given in the confusion matrix in Table 1. WEKA used 300 samples for testing the tree. Based on the confusion matrix given below in Table 1 for C4.5, 29 instances are wrongly classified as good, and 50 instances are wrongly classified as bad.
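The split sizes and accuracy figures above follow from simple arithmetic, which can be checked as:

```python
def split_counts(n_instances, train_fraction=0.7):
    """Sizes of the train/test partitions under a percentage split."""
    n_train = round(n_instances * train_fraction)
    return n_train, n_instances - n_train

def accuracy(correct, total):
    """Classification accuracy as a percentage."""
    return 100.0 * correct / total

# German Credit dataset: 1000 instances, 70% percentage split.
n_train, n_test = split_counts(1000)
id3_acc = accuracy(207, n_test)    # 207 of 300 test instances correct
c45_acc = accuracy(221, n_test)    # 221 of 300 test instances correct
```

This reproduces the 700/300 split and the 69.00% (ID3) and 73.67% (C4.5) accuracies quoted from the experiments.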
Table 1: Confusion Matrix for C4.5 without Boosting (matrix entries not preserved in this transcription)

Table 2: Confusion Matrix for ID3 without Boosting (matrix entries not preserved in this transcription)

4.3. Experiment: Decision Tree with Boosting
For the boosting experiment, the number of iterations is set to 100 and resampling is set to true. A percentage split of 70% is used, in which 70% of the data is used for training and 30% is kept as test data. This split criterion is used throughout the testing, as in the first experiment. For ID3 with boosting, attribute A1 (Status of existing checking account) is the top node of the decision tree. The classification accuracy using the Percentage Split testing option is 75%; the number of correctly classified instances increased compared with the earlier testing without boosting. The confusion matrix in Table 3 shows that 23 instances are incorrectly classified as good, and 52 instances are incorrectly classified as bad.

Table 3: Confusion Matrix for ID3 with Boosting (matrix entries not preserved in this transcription)

Similarly for C4.5, with the same parameters as for ID3, attribute A1 (Status of existing checking account) is again the top node of the decision tree. The classification accuracy using the Percentage Split testing option is 79%; the number of correctly classified instances increased compared with the earlier testing without boosting. The confusion matrix in Table 4 shows that 40 instances are incorrectly classified as good, and 23 instances are incorrectly classified as bad.
Table 4: Confusion Matrix for C4.5 with Boosting (matrix entries not preserved in this transcription)

It is observed that the boosted decision trees outperformed the decision trees without boosting, as shown in Table 7.

Table 7: Summary Result for ID3, C4.5 and Boosting Experiments (Percentage Split 70%)
Classifier                  Correctly Classified Percentage
ID3                         69.00%
ID3 using AdaBoost.M1       75.00%
C4.5                        73.67%
C4.5 using AdaBoost.M1      79.00%

4.4. Experiment: Decision Tree with GeneticSearch
The training set of the German dataset is loaded and pre-processed using the filter option in WEKA. After attribute selection is applied to the dataset, seven attributes are selected. Using this subset, the remaining attributes are removed and the data is passed on to ID3 for final evaluation. Classification accuracy increases to 75.33% (226 of 300 instances correctly classified) compared with evaluating the ID3 classifier alone on the dataset, and a smaller tree is derived. The confusion matrix in Table 5 shows that 24 instances are wrongly classified as good and 50 instances are wrongly classified as bad.

Table 5: Confusion Matrix for ID3 with GeneticSearch (matrix entries not preserved in this transcription)

For C4.5, it is noted that classification accuracy is reduced to 76.67% compared with evaluating the C4.5 classifier alone on the dataset. A smaller tree is again derived. The confusion matrix in Table 6 shows that 10 instances are wrongly classified as good and 18 instances are wrongly classified as bad.
Table 6: Confusion Matrix for C4.5 with GeneticSearch (matrix entries not preserved in this transcription)

Table 8 summarises the experiment with the GA.

Table 8: Summary Result for ID3 and C4.5 with GA on the German Credit Dataset (Percentage Split 70%)
Classifier          Correctly Classified Percentage
ID3 using GA        75.33%
C4.5 using GA       76.67%

5. Conclusion
This paper investigated decision tree learning using ID3, C4.5, ensemble methods and wrapper techniques, conducting experiments aimed at identifying and preventing bank fraud. It also evaluated the effectiveness of ensembles of decision trees on the credit card dataset. The experimental results show that the GA with ID3 or C4.5 performed better than the ID3 and C4.5 classifiers alone; the results are compared in Tables 7 and 8. They also show that C4.5 with AdaBoost.M1 gives higher accuracy than the other methods.

References
[1] Kou, Y., Lu, C., Sirwongwattana, S., Huang, Y., Survey of Fraud Detection Techniques. IEEE International Conference on Networking, Sensing and Control, 2004.
[2] Delamaire, L., Abdou, H., Pointon, J., Credit Card Fraud and Detection Techniques: A Review. Banks and Bank Systems, 4(2), 57-68, 2009.
[3] Rocha, B.C., Sousa Junior, R., Identifying Bank Frauds Using CRISP-DM and Decision Trees. International Journal of Computer Science & Information Technology, 2(5), 2010.
[4] Witten, I.H., Frank, E., Hall, M.A., Data Mining: Practical Machine Learning Tools and Techniques (3rd ed.), Morgan Kaufmann, 2011.
[5] Quinlan, J.R., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[6] Bernama (2009, December 9). 1,191 Internet banking fraud cases detected in M'sia Jan-June. The Star.
[7] Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F., A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(4), 463-484, 2012.
[8] Kohavi, R., John, G.H., Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1-2), 273-324, 1997.
[9] Freund, Y., Schapire, R.E., Experiments with a New Boosting Algorithm. Machine Learning: Proceedings of the Thirteenth International Conference, 148-156, 1996.
Data Mining: A Preprocessing Engine
Journal of Computer Science 2 (9): 735-739, 2006 ISSN 1549-3636 2005 Science Publications Data Mining: A Preprocessing Engine Luai Al Shalabi, Zyad Shaaban and Basel Kasasbeh Applied Science University,
SVM Ensemble Model for Investment Prediction
19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of
Proposal of Credit Card Fraudulent Use Detection by Online-type Decision Tree Construction and Verification of Generality
Proposal of Credit Card Fraudulent Use Detection by Online-type Decision Tree Construction and Verification of Generality Tatsuya Minegishi 1, Ayahiko Niimi 2 Graduate chool of ystems Information cience,
Professor Anita Wasilewska. Classification Lecture Notes
Professor Anita Wasilewska Classification Lecture Notes Classification (Data Mining Book Chapters 5 and 7) PART ONE: Supervised learning and Classification Data format: training and test data Concept,
Classification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
Automatic Resolver Group Assignment of IT Service Desk Outsourcing
Automatic Resolver Group Assignment of IT Service Desk Outsourcing in Banking Business Padej Phomasakha Na Sakolnakorn*, Phayung Meesad ** and Gareth Clayton*** Abstract This paper proposes a framework
An innovative application of a constrained-syntax genetic programming system to the problem of predicting survival of patients
An innovative application of a constrained-syntax genetic programming system to the problem of predicting survival of patients Celia C. Bojarczuk 1, Heitor S. Lopes 2 and Alex A. Freitas 3 1 Departamento
DATA MINING APPROACH FOR PREDICTING STUDENT PERFORMANCE
. Economic Review Journal of Economics and Business, Vol. X, Issue 1, May 2012 /// DATA MINING APPROACH FOR PREDICTING STUDENT PERFORMANCE Edin Osmanbegović *, Mirza Suljić ** ABSTRACT Although data mining
Data Mining based on Rough Set and Decision Tree Optimization
Data Mining based on Rough Set and Decision Tree Optimization College of Information Engineering, North China University of Water Resources and Electric Power, China, haiyan@ncwu.edu.cn Abstract This paper
Data Mining with Weka
Data Mining with Weka Class 1 Lesson 1 Introduction Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz Data Mining with Weka a practical course on how to
An Experimental Study on Ensemble of Decision Tree Classifiers
An Experimental Study on Ensemble of Decision Tree Classifiers G. Sujatha 1, Dr. K. Usha Rani 2 1 Assistant Professor, Dept. of Master of Computer Applications Rao & Naidu Engineering College, Ongole 2
Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal
Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether
Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier
Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,
Mining the Software Change Repository of a Legacy Telephony System
Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,
Decision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
Using Random Forest to Learn Imbalanced Data
Using Random Forest to Learn Imbalanced Data Chao Chen, chenchao@stat.berkeley.edu Department of Statistics,UC Berkeley Andy Liaw, andy liaw@merck.com Biometrics Research,Merck Research Labs Leo Breiman,
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent
DATA MINING TECHNIQUES AND APPLICATIONS
DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,
Predicting Student Performance by Using Data Mining Methods for Classification
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, No 1 Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0006 Predicting Student Performance
Increasing Classification Accuracy. Data Mining: Bagging and Boosting. Bagging 1. Bagging 2. Bagging. Boosting Meta-learning (stacking)
Data Mining: Bagging and Boosting Increasing Classification Accuracy Andrew Kusiak 2139 Seamans Center Iowa City, Iowa 52242-1527 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak Tel: 319-335
DATA MINING USING INTEGRATION OF CLUSTERING AND DECISION TREE
DATA MINING USING INTEGRATION OF CLUSTERING AND DECISION TREE 1 K.Murugan, 2 P.Varalakshmi, 3 R.Nandha Kumar, 4 S.Boobalan 1 Teaching Fellow, Department of Computer Technology, Anna University 2 Assistant
Improving spam mail filtering using classification algorithms with discretization Filter
International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational
Network Intrusion Detection Using a HNB Binary Classifier
2015 17th UKSIM-AMSS International Conference on Modelling and Simulation Network Intrusion Detection Using a HNB Binary Classifier Levent Koc and Alan D. Carswell Center for Security Studies, University
II. RELATED WORK. Sentiment Mining
Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract
Email Spam Detection A Machine Learning Approach
Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn
Ensembles and PMML in KNIME
Ensembles and PMML in KNIME Alexander Fillbrunn 1, Iris Adä 1, Thomas R. Gabriel 2 and Michael R. Berthold 1,2 1 Department of Computer and Information Science Universität Konstanz Konstanz, Germany First.Last@Uni-Konstanz.De
First Semester Computer Science Students Academic Performances Analysis by Using Data Mining Classification Algorithms
First Semester Computer Science Students Academic Performances Analysis by Using Data Mining Classification Algorithms Azwa Abdul Aziz, Nor Hafieza IsmailandFadhilah Ahmad Faculty Informatics & Computing
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
A Study of Detecting Credit Card Delinquencies with Data Mining using Decision Tree Model
A Study of Detecting Credit Card Delinquencies with Data Mining using Decision Tree Model ABSTRACT Mrs. Arpana Bharani* Mrs. Mohini Rao** Consumer credit is one of the necessary processes but lending bears
Evolutionary Tuning of Combined Multiple Models
Evolutionary Tuning of Combined Multiple Models Gregor Stiglic, Peter Kokol Faculty of Electrical Engineering and Computer Science, University of Maribor, 2000 Maribor, Slovenia {Gregor.Stiglic, Kokol}@uni-mb.si
Decision-Tree Learning
Decision-Tree Learning Introduction ID3 Attribute selection Entropy, Information, Information Gain Gain Ratio C4.5 Decision Trees TDIDT: Top-Down Induction of Decision Trees Numeric Values Missing Values
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News Sushilkumar Kalmegh Associate Professor, Department of Computer Science, Sant Gadge Baba Amravati
A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery
A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery Runu Rathi, Diane J. Cook, Lawrence B. Holder Department of Computer Science and Engineering The University of Texas at Arlington
Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2. Tid Refund Marital Status
Data Mining Classification: Basic Concepts, Decision Trees, and Evaluation Lecture tes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Classification: Definition Given a collection of
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -
Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create
Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data
Fifth International Workshop on Computational Intelligence & Applications IEEE SMC Hiroshima Chapter, Hiroshima University, Japan, November 10, 11 & 12, 2009 Extension of Decision Tree Algorithm for Stream