Comparison of Data Mining Techniques used for Financial Data Analysis



Similar documents
Data Mining Practical Machine Learning Tools and Techniques

Knowledge Discovery and Data Mining

Model Combination. 24 Novembre 2009

Chapter 12 Bagging and Random Forests

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

Data Mining. Nonlinear Classification

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

Using multiple models: Bagging, Boosting, Ensembles, Forests

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

Social Media Mining. Data Mining Essentials

Random forest algorithm in big data environment

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Gerry Hobbs, Department of Statistics, West Virginia University

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

An Experimental Study on Ensemble of Decision Tree Classifiers

Chapter 6. The stacking ensemble approach

Decision Trees from large Databases: SLIQ

Data Mining Methods: Applications for Institutional Research

Ensemble Data Mining Methods

A Secured Approach to Credit Card Fraud Detection Using Hidden Markov Model

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

DATA MINING TECHNIQUES AND APPLICATIONS

REVIEW OF ENSEMBLE CLASSIFICATION

EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

L25: Ensemble learning

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

Leveraging Ensemble Models in SAS Enterprise Miner

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

II. RELATED WORK. Sentiment Mining

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski

Advanced Ensemble Strategies for Polynomial Models

Classification and Prediction

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

Knowledge Discovery and Data Mining

Data Mining - Evaluation of Classifiers

Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008

A New Approach for Evaluation of Data Mining Techniques

Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

On the effect of data set size on bias and variance in classification learning

How To Perform An Ensemble Analysis

MS1b Statistical Data Mining

Supervised Learning (Big Data Analytics)

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

How To Make A Credit Risk Model For A Bank Account

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms

An Introduction to Data Mining

Customer Classification And Prediction Based On Data Mining Technique

Data Mining Classification: Decision Trees

Beating the MLB Moneyline

How To Solve The Kd Cup 2010 Challenge

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

Predictive Modeling and Big Data

CS570 Data Mining Classification: Ensemble Methods

Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -

Distributed forests for MapReduce-based machine learning

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1

Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information

Spam Detection A Machine Learning Approach

CLUSTERING AND PREDICTIVE MODELING: AN ENSEMBLE APPROACH

Data Mining Solutions for the Business Environment

Classification of Bad Accounts in Credit Card Industry

Data Mining is the process of knowledge discovery involving finding

Decision Support System on Prediction of Heart Disease Using Data Mining Techniques

Applied Multivariate Analysis - Big data analytics

Knowledge Discovery and Data Mining

ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS

Ensembles and PMML in KNIME

Predicting borrowers chance of defaulting on credit loans

An Overview of Knowledge Discovery Database and Data mining Techniques

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing Classifier

Ensemble Methods. Adapted from slides by Todd Holloway h8p://abeau<fulwww.com/2007/11/23/ ensemble- machine- learning- tutorial/

The Predictive Data Mining Revolution in Scorecards:

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

Benchmarking of different classes of models used for credit scoring

Better credit models benefit us all

not possible or was possible at a high cost for collecting the data.

Prediction of Heart Disease Using Naïve Bayes Algorithm

Final Project Report

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

Predictive Data modeling for health care: Comparative performance study of different prediction models

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case Study: Qom Payame Noor University)

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

Principles of Data Mining by Hand&Mannila&Smyth

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

Transcription:

Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract In this paper comparison of different data mining techniques used in financial data analysis is done. Financial data analysis is used in many financial institutes for accurate analysis of customer data to find defaulter and valid customer. In this paper we study about different data mining techniques and further we compare the accuracy of these algorithms by applying on different datasets to find the most consistent and most accurate data mining algorithm Keywords Data mining, Bayes Classification, Decision Tree, Boosting, Bagging, Random forest algorithm I. INTRODUCTION In Banking Sectors and other such leading organization the accurate assessment of consumer is of uttermost importance. Credit loans and finances have risk of being defaulted. These loans involve large amounts of capital and their non-retrieval can lead to major loss for the financial institution. Therefore, the accurate assessment of the risk involved is a crucial matter for banks and other such organizations. Not only is it important to minimize risk in granting credit but also the errors in declining any valid customer. This is to save the banks from lawsuits. For this purpose we should know which algorithm has highest accuracy. Loan default risk assessment is one of the crucial issues which financial institutions, particularly banks are faced and determining the effective variables is one of the critical parts in this type of studies. In this Credit scoring is a widely used technique that helps financial institutions evaluates the likelihood for a credit applicant to default on the financial obligation and decide whether to grant credit or not. The precise judgment of the credit worthiness of applicants allows financial institutions to increase the volume of granted credit while minimizing possible losses. II. DATA MINING TECHNIQUES Data mining a field at the intersection of computer science and statistics is the process that attempts to discover patterns in large data sets. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Mostly use Data mining techniques are as follows: A. Bayes Classification A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong independent assumptions and is particularly suited when the dimensionality of the inputs is high. A naive Bayes classifier assumes that the existence (or nonexistence) of a specific feature of a class is unrelated to the existence (or nonexistence) of any other feature. Bayesian Algorithm: 1. Order the nodes according to their topological order. 2. Initiate importance function Pr 0 (X\E), the desired number of samples m, the updating interval l, and the score arrays for every node. 3. k 0, T Ø 4. for i 1 to m do 5. if (i mod l == 0) then 6. k k+1 7. update importance function Pr k (X\E) based on T end if 8. s i generate a sample according to Pr k (X\E) 9. T T U {s i } 10. Calculate Score( s i, Pr(X\E,e), Pr k (X E) and add it to the corresponding entry of every array according to the instantiated states. 11. Normalize the score arrays for every node. The major disadvantage of this model is that the predictive accuracy is highly correlated with this assumption. An advantage of this method is that it requires a small amount of training data to estimate the parameters necessary for classification. 112

B. Decision Tree A decision tree is a mapping from observations about an item to conclusion about its target value as a predictive model in data mining and machine learning. Generally, for such tree models, other descriptive names are classification tree (discrete target) or regression tree (continuous target). In these tree structures, the leaf nodes represent classifications, the inner nodes represent the current predictive attributes and branches represent conjunctions of attributions that lead to the final classifications. The popular decision trees algorithms include ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993) which is an extension of ID3 algorithm and CART. The most popular boosting algorithm is AdaBoost (Freund and Schapire, 1996) AdaBoost, short for Adaptive Boosting, is a machine learning algorithm, formulated by Yoav Freund and Robert Schapire It is a meta-algorithm, and can be used in conjunction with many other learning algorithms to improve their performance. The Boosting Algorithm AdaBoost Given: (x 1, y 1 ),, (x m,y m ); x i ϵ X, y i ϵ {-1,1} Initialize weights D 1 (i)=1/m For t= 1,, T: 1. (Call WeakLearn), which returns the weak classifier h t : X {-1,1} with minimum error w.r.t distribution D t ; 2. Choose α t ϵ R, 3. Update D t+1 (i)=(dt(i) exp (-α t y i h t (x i )))/Z t Where Z t is a normalization factor Choosen so that D t+1 is a distribution Output the strong Classifier H(x) =sign ( T t=1 α t h t (x)) C. Boosting Figure 1- Decision Tree The concept of boosting applies to the area of predictive data mining, to generate multiple models or classifiers, and to derive weights to combine the predictions from those models into a single prediction or predicted classification. Boosting will generate a sequence of classifiers, where each consecutive classifier in the sequence is an "expert" in classifying observations that were not well classified by those preceding it. During deployment (for prediction or classification of new cases), the predictions from the different classifiers can then be combined (e.g., via voting, or some weighted voting procedure) to derive a single best prediction or classification. D. Bagging Bagging (Bootstrap aggregating) was proposed by Leo Breiman in 1994 to improve the classification by combining classifications of randomly generated training sets. Bootstrap aggregating (bagging) is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid over fitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach. Bagging leads to "improvements for unstable procedures, which include, for example, neural nets, classification and regression trees, and subset selection in linear regression. On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors. 113

The Bagging Algorithm Training phase 1. Initialize the parameter 2. For k=1,, L 3. Return D. D=Ø, the ensemble. L k the number of classifiers to train. Take a bootstrap sample S k from Z. Build a classifier D k using S k as the training set Add the classifier to the current ensemble, D=D U D k. Classification phase 4. Run D 1,, D k on the input x. 5. The clas with the maximum number of votes is chosen as the lable for x. E. Random forest Algorithm Random forests are an ensemble learning method for classification (and regression) that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by individual trees. No. of Instances No. of predictive attributes Percentage of good credit Percentage of bad credit German Australian Crx 1000 460 690 20 15 16 70% 60% 44.5% 30% 40% 55.5% The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler, and "Random Forests" is their trademark. The term came from random decision forest which was first proposed by Tin Kam Ho of Bell Labs in 1995. It is better to think of random forests as a framework rather than as a particular model. The framework consists of several interchangeable parts which can be mixed and matched to create a large number of particular models, all built around the same central theme. Constructing a model in this framework requires making several choices: 1. The shape of the decision to use in each node. 2. The type of predictor to use in each leaf. 3. The splitting objective to optimize in each node. 4. The method for injecting randomness into the trees. III. DATASET INFORMATION Data from the data warehouse about the customers is cleaned and processed. Relevant attributes of the customers required for the data analysis to score their credit behaviour is chosen. We select a good mix of customers who have taken credit of the same type. Meaning they have either take home loans or car loans or credit card. A good mix of customers implies that the data is taken for customers who have repaid the loan immediately or have repaid it after certain amount of delay or have defaulted on the loan. All these types of customers are included in the dataset. The data set excludes any special cases. The data for each customer is checked for errors. We decide on a certain amount of customers to be included in the dataset. This is so that processing will be easier and creation of a model will not take long. The larger the dataset, more time is required to create the decision tree. It must then also consider all variations in the large dataset decreasing its accuracy. Some public datasets are available on the net at the UCI machine learning repository. Since these are cleaned and scored they are convenient for use with our algorithm. Also, they have been used in the making of other similar algorithms before. Therefore their data s diversity is qualified for thus use. We have taken three different datasets of different size like small, medium and large dataset. The given below table shows the detailed information of these datasets IV. COMPARISON For comparison we apply different data mining on different datasets and check the accuracy of these algorithms. Following table shows the accuracy of each algorithm on different datasets 114

Data mining Techniques % Accuracy Australian German CRX Bayes Classification 85.20% 68.60% 85.40% Decision Tree 82.20% 63.40% 85.70% Boosting 82.80% 63.50% 80.90% Bagging 84.30% 67.60% 85.70% Random Forest 84.40% 71.80% 86.40% algorithm applied on australian dataset from this we can say that Bayes algorithm work better on small dataset. algorithm applied on german dataset from this we can say that Random forest algorithm work better on large dataset. algorithm applied on Crx dataset from this we can say that Random forest algorithm work better on medium dataset. V. CONCLUSIONS The system using data mining for loan Default risk analysis enables the bank to reduce the manual errors involved in the same. From the experiment we conducted we can conclude that out of these five data mining algorithm we applied on different datasets of different size, random forest algorithm is most consistent and has highest accuracy compared to others. 115

Further we can see Boosting has already increased the efficiency of decision trees. The assessment of risk will enable banks to increase profit and can result in reduction of interest rate. The accuracy and efficiency of boosted decision trees in financial data analysis can be improved by integration of domain knowledge. REFERENCES [1 ] Jiawei Han, Micheline Kamber, Data Mining Concepts and Technique, 2 nd edition [2 ] Margaret H. Dunham, Data Mining-Introductory and Advanced Topocs Pearson Education,Sixtyh Imnpression, 2009. [3 ] Abbas Keramati, Niloofar Yousefi, A Proposed Classification of Data Mining Techniques in Credit Scoring", International Conference on Industrial Engineering and Operations Management, 2011 [4 ] Boris Kovalerchuk, Evgenii Vityaev, DATA MINING FOR FINANCIAL APPLICATIONS, 2002 [5 ] Defu Zhang, Xiyue Zhou, Stephen C.H. Leung, JieminZheng, Vertical bagging decision trees model for credit scoring, Expert Systems with Applications 37 (2010) 7838 7843. [6 ] Raymond Anderson, Credit risk assessment: Enterprise-credit frameworks. [7 ] Girisha Kumar Jha, Artificial Neural Networks. [8 ] Hossein Falaki, AdaBoost Algorithm. [9 ] Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan, Mining Very Large Database IEEE,1999 [10 ] http://en.wikipedia.org 116