ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA



Similar documents
ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS

Data Mining Approach to Detect Heart Dieses

An Experimental Study on Ensemble of Decision Tree Classifiers

Real Time Data Analytics Loom to Make Proactive Tread for Pyrexia

Decision Trees from large Databases: SLIQ

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

How To Solve The Kd Cup 2010 Challenge

Keywords data mining, prediction techniques, decision making.

Social Media Mining. Data Mining Essentials

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Data Mining Classification: Decision Trees

Random forest algorithm in big data environment

Comparison of Data Mining Techniques used for Financial Data Analysis

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

D-optimal plans in observational studies

DATA MINING TECHNIQUES AND APPLICATIONS

EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH

Performance Analysis of Decision Trees

Data Mining for Knowledge Management. Classification

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

Classification and Prediction

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

COMBINED METHODOLOGY of the CLASSIFICATION RULES for MEDICAL DATA-SETS

DATA MINING AND REPORTING IN HEALTHCARE

ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS

Heart Disease Diagnosis Using Predictive Data mining

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing Classifier

Customer Classification And Prediction Based On Data Mining Technique

Data Mining. Nonlinear Classification

Distributed forests for MapReduce-based machine learning

Automatic Resolver Group Assignment of IT Service Desk Outsourcing

REVIEW OF ENSEMBLE CLASSIFICATION

AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

CLUSTERING AND PREDICTIVE MODELING: AN ENSEMBLE APPROACH

Prediction of Heart Disease Using Naïve Bayes Algorithm

Using multiple models: Bagging, Boosting, Ensembles, Forests

Manjeet Kaur Bhullar, Kiranbir Kaur Department of CSE, GNDU, Amritsar, Punjab, India

Feature Selection for Classification in Medical Data Mining

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

A Survey on classification & feature selection technique based ensemble models in health care domain

An Overview of Knowledge Discovery Database and Data mining Techniques

Comparison of Six Classification Techniques for Post Operative Patient data in the Medicable discipline

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

AnalysisofData MiningClassificationwithDecisiontreeTechnique

Data Mining Part 5. Prediction

Healthcare Data Mining: Prediction Inpatient Length of Stay

Data Mining - Evaluation of Classifiers

Application of Data Mining Techniques to Model Breast Cancer Data

Comparison of K-means and Backpropagation Data Mining Algorithms

Decision Tree Learning on Very Large Data Sets

Knowledge Discovery and Data Mining

Comparative Analysis of Serial Decision Tree Classification Algorithms

New Ensemble Combination Scheme

IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES

K Thangadurai P.G. and Research Department of Computer Science, Government Arts College (Autonomous), Karur, India.

Introducing diversity among the models of multi-label classification ensemble

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.

Predictive Data modeling for health care: Comparative performance study of different prediction models

Data Mining Methods: Applications for Institutional Research

International Journal of Software and Web Sciences (IJSWS)

Artificial Neural Network Approach for Classification of Heart Disease Dataset

Blog Post Extraction Using Title Finding

Gerry Hobbs, Department of Statistics, West Virginia University

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Classification algorithm in Data mining: An Overview

Data Mining Practical Machine Learning Tools and Techniques

DATA PREPARATION FOR DATA MINING

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

Effective Analysis and Predictive Model of Stroke Disease using Classification Methods

DATA MINING USING INTEGRATION OF CLUSTERING AND DECISION TREE

Advanced Ensemble Strategies for Polynomial Models

Studying Auto Insurance Data

Mining Wiki Usage Data for Predicting Final Grades of Students

INVESTIGATIONS INTO EFFECTIVENESS OF GAUSSIAN AND NEAREST MEAN CLASSIFIERS FOR SPAM DETECTION

Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel

EFFICIENCY OF DECISION TREES IN PREDICTING STUDENT S ACADEMIC PERFORMANCE

Data Mining Analysis (breast-cancer data)

Chapter 6. The stacking ensemble approach

2.1. Data Mining for Biomedical and DNA data analysis

REPORT DOCUMENTATION PAGE

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY DATA MINING IN HEALTHCARE SECTOR.

SURVIVABILITY ANALYSIS OF PEDIATRIC LEUKAEMIC PATIENTS USING NEURAL NETWORK APPROACH

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Transcription:

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh, India lav_dlr@yahoo.com 2 Associate Professor, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh, India usharanikuruba@yahoo.co.in ABSTRACT Data mining is the process of analyzing large quantities of data and summarizing it into useful information. In medical diagnoses the role of data mining approaches increasing rapidly. Particularly Classification algorithms are very helpful in classifying the data, which is important in decision making process for medical practitioners. Further to enhance the classifier accuracy various pre-processing techniques and ensemble techniques were developed. In this study a hybrid approach, CART classifier with feature selection and bagging technique has been considered to evaluate the performance in terms of accuracy and time for classification of various breast cancer datasets. KEYWORDS Data mining, Classification, Decision Trees, Ensemble, Bagging, Breast Cancer datasets. 1. INTRODUCTION In the data mining community, decision tree algorithms are very popular since they are relatively fast to train and In data mining, decision tree algorithms are very popular due to their characteristics such as fast to train and produce transparent models. The machine learning community has produced a large number of programs to create decision trees for classification. Very notable among these for classification of data include ID3, C4.5, C5, CART, CHAID, SLIQ, SPRINT, ScalParc and so on. These classifiers provide support for many health care areas in decision making. Out of these CART has been proved to the best classifier for medical data [4]. Classification and regression trees (CART) were initially developed in the 1960 s by Morgan and Sonquist92 and have since been refined and popularized by Breiman [3]. Based on recursive partitioning they provide non-parametric flexibility in the prediction of categorical i.e., classification or continuous i.e., regression outcomes. Accuracy is very important in classification of medical data; a hybrid approach has been presented to enhance the classification accuracy of breast cancer. As the Breast Cancer is one of leading causes of death in women in this study various breast cancer datasets are considered to analyze the performance of CART decision tree algorithm with feature selection and bagging technique for higher classification accuracy and improved diagnosis. DOI : 10.5121/ijitcs.2012.2103 17

The organization of the paper is paper is: A brief overview of related work, the theory of decision tree, feature selection and bagging algorithm will be given in section 2; The experiments and evaluation are presented in section 3;And Section 4 contains the conclusion of the study. 2. BACKGROUND 2.1 Overview of Related Work Several studies have been reported that they have focused on the importance of bagging technique in the field of medical diagnosis. These studies have applied different approaches to the given problem and achieved high classification accuracies. Here are some examples: My Chau Tu s [14] proposed the use of bagging with C4.5 algorithm, bagging with Naïve bayes algorithm to diagnose the heart disease of a patient. My Chau Tu s [15] used bagging algorithm to identify the warning signs of heart disease in patients and compared the results of decision tree induction with and without bagging. Tsirogiannis s [19] applied bagging algorithm on medical databases using the classifiers neural networks, SVM S and decision trees. Results exhibits improved accuracy of bagging than without bagging. Pan wen [16] conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C4.5 with bagging. Kaewchinporn C s [9] presented a new classification algorithm TBWC combination of decision tree with bagging and clustering. This algorithm is experimented on two medical datasets: cardiocography1, cardiocography2 and other datasets not related to medical domain. Jinyan LiHuiqing Liu s [8] experimented on ovarian tumor data to diagnose cancer using C4.5 with and without bagging. Dong-Sheng Cao s [6] proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy with bagging to find the structure activity relationships in the area of chemometrics related to pharmaceutical industry. Liu Ya-Qin s [12] experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability. Tan AC s [18] used C4.5 decision tree, bagged decision tree on seven publicly available cancerous micro array data, and compared the prediction performance of these methods. 2.2 Decision Trees Decision tree [7] is one of the classification methods, which classify the labeled trained data into a tree or rules. Once the tree or rules are derived in learning phase to test the accuracy of a classifier test data is taken randomly from training data. After Verification of accuracy, unlabeled data is classified using the tree or rules obtained in learning phase. The structure of a decision tree is similar to the tree with a root node, a left sub tree and right sub tree. The leaf nodes in a tree represent a class label. The arcs from one node to another node denote the conditions on the attributes. The Tree can be built as: 18

The selection of attribute as a root node is done based on attribute splits The decisions about the node to represent as terminal node or to continue for splitting the node. The assignment of terminal node to a class. The attribute splits depends on the impurity measures such as Information gain, gain ratio, gini index e.t.c. Once the tree is built then it is pruned to check for over fitting and noise. Finally the tree is an optimized tree.the advantage of tree structured approach is easy to understand and interpret, handles categorical and numeric attributes, robust to outliers and missing values. Decision tree classifiers are used extensively for diagnosis of diseases such as breast cancer, ovarian cancer and heart sound diagnosis and so on [1], [17], [10],[2]. 2.3 Feature Selection A preprocessing technique feature selection identifies and removes irrelevant attributes that do not play any role in the classification task. Several feature selection methods are available with different search techniques to produce a reduced data set. This reduced data set improves accuracy compared with original dataset. Feature selection does not alter the relevance or meaning the data set. The feature selection methods are categorized as filter, wrapper and hybrid. The result of these methods varies in time and accuracy. A brief summary of Feature Selection process [13] is as follows: Generate candidate feature subset from the original data set using methods such as complete, heuristic and random and then evaluate the generated candidate feature subset using different evaluation functions such as distance, information, dependency, consistency, classifier error rate. These functions produce relevancy value. This relevancy value used as termination condition to decide the subset as an optimal feature subset or not. If the subset is an optimal feature subset then a validation is performed on that otherwise the above process continues. In the field of medical diagnosis accuracy is very important hence one can incorporate preprocessing techniques in classification tasks. 2.4 Bagging Bagging means Bootstrap aggregation [11] an ensemble method to classify the data with good accuracy. In this method first the decision trees are derived by building the base classifiers c 1, c 2, ----, c n on the bootstrap samples D 1, D2, ----, Dn respectively with replacement from the data set D. Later the final model or decision tree is derived as a combination of all base classifiers c 1, c 2, ----, c n with the majority votes. Bagging can be applied on any classifier such as neural networks, Bayesian algorithms, Rule based algorithms, neural networks, Support vector machines, Associative classification, Distance based methods and Genetic Algorithms. Applying bagging on classifiers especially on decision trees, Neural nets increases accuracy of classification. Bagging plays an important role in the field of medical diagnosis.many research works in this aspect is depicted in related work. The process of bagging is shown in the figure 1. 19

Figure 1. Bagging Process 3. EXPERIMENTS Experiments are conducted on three breast cancer datasets. In this study CART decision tree algorithm is chosen for analysis of data because it was proved as the best algorithm in our previous study [4] among the frequently used decision tree classifiers ID3 and C4.5 to classify the medical data. Generally, datasets may contain some attributes that do not have any significance in classification task. Such attributes can be eliminated by feature selection methods. For this purpose, in our previous study, experiments are conducted on breast cancer data sets using several feature selection methods in conjunction with cart algorithm [5]. The results proved that a particular feature selection method is best for a specific breast cancer dataset. Because the data sets chosen differ in the number of instances or records and number of attributes. The conjunction of CART algorithm with feature selection methods increased the classification accuracy. Further, to enhance accuracy a hybrid approach is considered in this study. The hybrid approach is formed by combining the best feature selection method to a particular breast cancer data set, bagging and cart decision tree algorithm. The data sets are collected from UCI machine learning repository [www.ics.uci.edu] which is publicly available. The description of the datasets is given in Table 1. 20

Table 1: Description of Breast Cancer Datasets Dataset No. of Attributes No. of Instances No. of Classes Missing values Breast Cancer 10 286 2 yes Breast Cancer Wisconsin (Original) 11 699 2 yes Breast Cancer Wisconsin (Diagnostic) 32 569 2 no To obtain consistency the missing values in the datasets are replaced with mean value of the respective attribute. Experiments are conducted using Weka tool and the results are compared with bagging and without bagging using 10-fold cross validation. The procedure of hybrid approach is illustrated below. Apply the best feature selection method on breast cancer data set. Once the reduced data set is obtained then conduct experiment of bagging with cart algorithm. Thus, the above procedure is performed on three breast cancer data sets iteratively to obtain the classification results. The best feature selection method to each breast cancer data [5] is given in the following Table 2. Table 2: Best Feature Selection Method for Breast Cancer Datasets Data Set Breast Cancer Breast Cancer Wisconsin (Original) Feature selection Technique SVMAttributeEval PrincipalComponentsAttributeEval Breast Cancer Wisconsin (Diagnostic) SymmetricUncertAttributesetEval As accuracy is very important in the field of medical domain, the performance measure accuracy of classification is considered in this study. The results of the hybrid approach are tabulated in Table 3. Table 3: Hybrid Approach Accuracy and time to build a model. Data Set Accuracy Time Breast Cancer 74.47 28.25 Breast Cancer Wisconsin (Original) 97.85 1.38 Breast Cancer Wisconsin (Diagnostic) 95.96 1.95 The comparison of the accuracies of three methods is represented in Table 4. With this comparison, it is clear that the hybrid approach enhanced the classification accuracy than other approaches for all three Breast cancer data sets. 21

Table 4: Accuracy of CART algorithm, CART with Feature Selection Method and Hybrid Approach Dataset CART CART With Feature Selection Method Hybrid Approach Breast Cancer 69.23 73.03 74.47 Breast Cancer Wisconsin (Original) 94.84 96.99 97.85 Breast Cancer Wisconsin (Diagnostic) 92.97 94.72 95.96 The comparison of the accuracies of the above three methods is presented in the graphical form in figure 2. 100 90 80 70 CART Accuracy(%) 60 50 40 30 CART With Feature Selection Method Hybrid Approach 20 10 0 Breast Cancer Breast Cancer Wisconsin (Original) Breast Cancer Wisconsin (Diagnostic) Data Sets Figure 2. Accuracy of CART algorithm, CART with Feature Selection Method and Hybrid Approach 4. CONCLUSION For medical diagnosis various data mining techniques are available. In this study, for classification of medical data we employed decision tree algorithm because it produce human readable classification rules which are easy to interpret. A hybrid method is proposed to enhance the classification accuracy of Breast Cancer data sets. The training data is tested with 10-fold cross validation. The data sets are preprocessed to remove missing values. The feature selection methods used to eliminate those attributes that have no significance in the classification process. Bagging the training dataset is one of the most common methods of improving decision tree. 22

The experimental results of a hybrid approach with the combination of preprocessing, bagging with cart demonstrated the enhanced classification accuracy of the selected data sets. REFERENCES 1. Antonia Vlahou, John O. Schorge, Betsy W.Gregory and Robert L. Coleman, Diagnosis of Ovarian Cancer Using Decision Tree Classification of Mass Spectral Data, Journal of Biomedicine and Biotechnology 2003:5 (2003) 308 314. 2. Aruna, Dr S.P. Rajagopalan and L.V.Nandakishore, An Empirical Comparison of Supervisedlearning algorithms in Disease Detection. International Journal of Information Technology Convergence and Services (IJITCS) Vol.1, No.4, August 2011. 3. Breiman, Friedman, Olshen, and Stone. Classification and Regression Trees, Wadsworth, 1984, Mezzovico, Switzerland. 4. D.Lavanya, Dr.K.Usha Rani, Performance Evaluation of Decision Tree Classifiers on Medical Datasets. International Journal of Computer Applications 26(4):1-4, July 2011. 5. D.Lavanya, Dr.K.Usha Rani,., Analysis of feature selection with classification: Breast cancer datasets,indian Journal of Computer Science and Engineering (IJCSE),October 2011. 6. Dong-Sheng Cao, Qing-Song Xu,Yi-Zeng Liang, Xian Chen, Automatic feature subset selection for decision tree-based ensemble methods in the prediction of bioactivity, Chemometrics and Intelligent Laboratory Systems. 7. J. Han and M. Kamber, Data Mining; Concepts and Techniques, Morgan Kaufmann Publishers, 2000. 8. Jinyan LiHuiqing Liu, See-Kiong Ng and Limsoon Wong, Discovery of significant rules for classifying cancer diagnosis data, Bioinformatics 19(Suppl. 2)Oxford University Press 2003. 9. Kaewchinporn.C, Vongsuchoto. N, Srisawat. A A Combination of Decision Tree Learning and Clustering for Data Classification, 2011 Eighth International Joint Conference on Computer Science and Software Engineering (JCSSE). 10. Kuowj, Chang RF,Chen DR and Lee CC, Data Mining with decision trees for diagnosis of breast tumor in medical ultrasonic images,march 2001. 11. L. Breiman, Bagging predictors, Machine Learning, 26, 1996, 123-140. 12. Liu Ya-Qin, Wang Cheng, Zhang Lu, Decision Tree Based Predictive Models for Breast Cancer Survivability on Imbalanced Data, 3rd International Conference on Bioinformatics and Biomedical Engineering, 2009. 13. Mark A. Hall, Lloyd A. Smith, Feature Subset Selection: A Correlation Based Filter Approach, In 1997 International Conference on Neural Information Processing and Intelligent Information Systems (1997), pp. 855-858. 14. My Chau Tu, Dongil Shin, Dongkyoo Shin, Effective Diagnosis of Heart Disease through Bagging Approach, 2nd International Conference on Biomedical Engineering and Informatics,2009. 15. My Chau Tu, Dongil Shin, Dongkyoo Shin, A Comparative Study of Medical Data Classification Methods Based on Decision Tree and Bagging Algorithms Eighth IEEE International Conference on Dependable, Autonomic and Secure Computing, 2009. 16. Pan Wen, Application of decision tree to identify a abnormal high frequency electrocardiograph, China National Knowledge Infrastructure Journal, 2000. 23

17. Stasis, A.C. Loukis, E.N. Pavlopoulos, S.A. Koutsouris, D. Using decision tree algorithms as a basis for a heart sound diagnosis decision support system, Information Technology Applications in Biomedicine, 2003. 4th International IEEE EMBS Special Topic Conference, April 2003. 18. Tan AC, Gilbert D. Ensemble machine learning on gene expression data for cancer classification, Appl Bioinformatics. 2003;2(3 Suppl):S75-83. 19. Tsirogiannis, G.L, Frossyniotis, D, Stoitsis, J, Golemati, S, Stafylopatis, A Nikita,K.S, Classification of Medical Data with a Robust Multi-Level Combination scheme, IEEE international joint Conference on Neural Networks. 24