An Empirical Comparison of Ensemble and Hybrid Classification

Proc. of Int. Conf. on Recent Trends in Signal Processing, Image Processing and VLSI, ICrtSIV
DOI: 03.AETS
Association of Computer Electronics and Electrical Engineers, 2014

B V Sumana (1) and T. Santhanam (2)
(1) Assistant Professor, Department of Computer Science, Vijaya College, Jayanagar, Bangalore, India
(2) Associate Professor and Head, Department of Computer Applications, DG Vaishnav College, Chennai, India

Abstract
Data mining has been applied successfully in almost all fields, including the medical domain. Medical data mining is the process of extracting useful knowledge and hidden patterns from medical data. This paper proposes a hybrid model with hybrid feature selection for classifying the Cleveland Heart dataset and compares its performance with that of base classifiers and ensemble classifiers. The model is developed in four stages. In the first stage, the Cleveland Heart dataset from the UCI repository is cleaned by deleting all instances with missing values. In the second stage, Fuzzy and Rough Set techniques are used in a cascaded fashion to extract the relevant features. In the third stage, the resulting dataset is clustered into two segments using K-means, and the incorrectly clustered samples are eliminated to obtain the final samples. Finally, the correctly clustered samples from the previous stage are used to train 5 different classifiers, and the final classifier model is built using 10-fold cross validation. Experimental results showed that the proposed hybrid model achieved higher classification accuracy than the base classifiers and the ensemble classifiers. Its highest accuracy was 99.54%.

Index Terms: Classification, Clustering, Fuzzy, Rough Set, K-means, Hybrid

I. INTRODUCTION
Recent technological advances have led society to generate large amounts of data in almost all fields, such as business, marketing, surveillance, science, medicine, economics, fraud detection and sports. Stored data is growing exponentially from terabytes to petabytes, and may later reach yottabytes. Information is often hidden in the data and is not readily available until the data is analyzed. This huge amount of data is a key source to be processed and analyzed for knowledge extraction that supports cost savings and decision making. Human analysts may take weeks to extract useful information, so automated techniques are needed. Data mining has become an important tool for transforming such data into information, and it has been applied successfully in almost all fields, including the medical domain. Knowledge extracted from medical data by data mining gives physicians an additional source of knowledge for decisions in their practice, treatment planning, risk analysis and other predictions. Disease diagnosis is one of the applications in which data mining has produced successful results.

Data mining, which is a confluence of disciplines such as machine learning, statistics, pattern recognition and visualization, provides various techniques such as association, regression, prediction, clustering and classification [1]. The two most widely used data mining techniques are classification and clustering. Classification is a supervised learning technique whose goal is to predict the target class for each case in the data, where the classes are predefined. Clustering is an unsupervised learning technique whose goal is to partition the data into groups of similar objects, such that objects within the same cluster are similar to one another and dissimilar to the objects in other clusters [2].

Although various classification algorithms are available in the literature, researchers face the problem of choosing the best model for a particular data set, because these traditional algorithms suffer from common problems such as computational complexity, getting stuck in local minima, and over-fitting to the data set used for training [3]. One of the most widely used approaches to overcome these problems is ensemble learning, which is primarily used to improve the performance of a classifier. It is a process in which the predictions of multiple classifiers are combined to classify new samples in order to achieve better prediction accuracy [4]. The commonly used ensemble techniques are bagging, boosting, voting and stacking. However, recent research has noted some problems: ensembles do not always improve the accuracy of the model, and they tend to increase the error of each individual base classifier [5]. The use of many classifiers makes ensembles more complex and produces output that is very hard to analyze [6]. It is stated in [7] that not all ensemble classifiers improve the classification accuracy on all datasets. Sarvestan Soltani et al. [8] found that an ensemble takes more time to build than a single classifier. The most challenging issues faced while developing ensemble classifiers are:
(i) The combination of classifiers to be used must be chosen.
(ii) The base classifiers used in the ensemble must be simple so that they do not over-fit.
(iii) To obtain a good ensemble, the base learners should be as accurate as possible and as distinct as possible.
(iv) Accuracy is sometimes poor because of the difficulty of selecting the correct combination of classifiers.

To overcome these problems, recent research has shown improved results in disease diagnosis when more than one technique is hybridized. Hybridization is an emerging approach in which more than one technique is combined, for example clustering with classification, or clustering with association. The literature shows that recent research has focused on hybridizing more than one technique, but none of the papers surveyed used hybrid feature selection. This research gap motivates the present work: a hybrid model combining clustering and classification is proposed, with hybrid feature selection combining fuzzy and rough set techniques to optimize the accuracy of the classifier.

The objective of this paper is to review the hybrid clustering and classification model [9, 10] and to analyze the performance of the hybrid model against single classification models and ensembles, based on classification accuracy, error rate, specificity, sensitivity and the time taken to build the model on the Cleveland Heart dataset. This work allowed us to propose the best model for the heart dataset, and to offer some guidance to researchers who face the problem of choosing the algorithm best suited to a particular dataset, by proposing a hybrid model that combines classification and clustering to diagnose heart disease.

The heart dataset was selected because heart disease is one of the most common causes of death globally. As estimated in [...], [...] million people died from cardiovascular disease. 80% of these deaths took place in low and middle income countries, and it is estimated that by 2030 more than 23 million people will die annually from CVDs [11]. In the USA, 161,894 deaths were recorded between 1 January 2013 and 31 March 2013.
About 9.5 million deaths, which is about one in six deaths worldwide, occur in India every year [12]. Heart disease is a condition that affects the heart's ability to work. The different categories of heart disease are coronary heart disease, cardiomyopathy, cardiovascular disease, congestive heart failure and heart attack. Heart attacks and strokes are usually acute events and are mainly caused by a blockage that prevents blood from flowing to the heart or brain. The most common reason for this is a build-up of fatty deposits on the inner walls of the blood vessels that supply the heart or brain. Strokes can also be caused by bleeding from a blood vessel in the brain or by blood clots.

The overall objective of this paper is to compare the performance of single classification models and ensembles with that of the hybrid model on the heart dataset, segmenting patients into two clusters, one with presence of disease and one with absence of disease, and to find which model performs best.

The rest of the paper is organized as follows. Section II provides a brief review of the related work. Section III explains the proposed model and the Heart dataset used in this study. Section IV presents an overview of the classifiers used. Section V discusses the performance measures adopted in this study. Section VI reports the results of the experiment, which followed standard guidelines. Finally, Sections VII and VIII conclude the paper with an outlook on future work.

II. LITERATURE REVIEW
A great deal of research has been carried out on various medical data sets, and work continues today. It is not possible to list all of it, so only a few studies are listed below. Table I lists some studies of single, ensemble and hybrid clustering and classification methods performed on medical datasets from the UCI repository.

TABLE I. DATA MINING TECHNIQUES APPLIED ON DIFFERENT MEDICAL DATASETS
Type (groups the rows below, in order): Single; Ensemble; Hybrid
Author | Year | Data set | Technique
Sellappan Palaniappan et al. | | Heart | Naïve Bayes, Decision Trees, Neural Network
Sarvestan Soltani A. et al. | | Breast Cancer | MLP, SOM, RBF, PNN
Anbarasi et al. | 2010 | Heart | Genetic with Decision Tree, Naïve Bayes and Classification via Clustering
AH Chen et al. | | Heart | ANN
Sam Chao et al. | 2009 | |
Rajeswari K et al. | | Heart | Decision Tree Algorithms
Umair Abdullah et al. | | |
Shariq Bashir et al. | | |
Thanh-Trung Nguyen | 2010 | | Apriori and FP Growth
Sunil Joshi et al. | | |
Xiaoyong Lin et al. | | |
Lior Rokach et al. | | |
Sotiris et al. | | |
Indra Bhan et al. | | |
Srimani et al. | | | classifiers
Bendi Venkata Ramana et al. | 2012 | Liver | Rotation Forest
Sarwesh et al. | | |
Shantakumar B. Patil et al. | | Heart |
P. Rajendran et al. | | Medical Image | Association and Classification
Sung Ho Ha et al. | | Chest |
S Kartik et al. | | Liver | Rough Set and Classification
Asha T et al. | | Tuberculosis |
Sarojini et al. | 2011 | Diabetes, heart and cancer |
Asha Gowda Karegowda et al. | 2012 | Diabetes | Clustering and Classification
Shomona Gracia Jacob et al. | 2012 | Lymphography |
Shezad Shaikh et al. | | NSL-KDD |

III. METHODS AND MATERIALS
This experiment uses WEKA, an open source tool, and the Cleveland Heart dataset collected from the UCI Machine Learning Repository [13]. The experiment was conducted using 10-fold cross validation to test the accuracy and time complexity of the classifiers.

A. Proposed Model
The methodology consists of four stages based on clustering and classification, and classifies the dataset into two clusters.
In the first stage, the selected Heart dataset is cleaned by deleting all instances with missing values. In the second stage, Fuzzy Rough Set is used in a cascaded fashion to extract the relevant features. In the third stage, the resulting dataset is clustered into two segments using K-means, and the incorrectly clustered samples are eliminated to obtain the final samples. Finally, the correctly clustered samples from the previous stage are used to train 5 different classifiers, and the final classifier model is built using 10-fold cross validation.

1) Data Preprocessing: Preprocessing is an important preliminary step in the data mining process and includes cleaning, integration, transformation, feature extraction and selection [14]. The accuracy and quality of the analysis depend on the quality of the data. The data may contain missing values and noisy, irrelevant or redundant information. If the data is not handled properly, the mining process will produce misleading results. The Heart dataset used in this study has 7 missing values, which would distort the classification if left untreated. There are many approaches to handling missing data. In our approach, the Heart data set was refined by deleting the records containing missing values, since they made up only 2% of the data samples, and the data was then transformed into a form appropriate for clustering. Data preprocessing thus prepares the data for the clustering stage.

2) Feature Selection: Data may contain many redundant or irrelevant features. Redundant features are those which provide no more information than the features already selected, and irrelevant features are those which provide no useful information. The classification accuracy of a given algorithm generally depends on the nature of the dataset rather than on the algorithm itself. The main characteristics of a dataset are its attributes, its classes and its number of instances. Feature selection is a form of dimensionality reduction in which the input data is transformed into a reduced set of features, eliminating irrelevant features and selecting a subset of relevant features for model construction so as to optimize the accuracy of the classifiers. In this approach, Fuzzy-Rough set Feature Selection (FRFS) was adopted to select the best attributes, and clustering was used as a reduction technique through which the wrongly clustered instances were eliminated to obtain the final samples. FRFS was adopted because it can analyze both quantitative and qualitative features, and can reduce a mixture of nominal and continuous valued features based only on the original data, without any additional information about the data. Although Rough set theory, proposed by Pawlak (1982), has been used successfully to extract feature subsets, it is limited to nominal data; fuzzy set theory is therefore combined with Rough set theory to handle continuous data.
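To make the rough-set side of this stage concrete, the following is a minimal sketch of the classical (crisp) rough-set dependency degree together with a greedy, QuickReduct-style attribute search over nominal attributes. It is not the fuzzy-rough FRFS implementation used in WEKA for this study: the toy rows, the attribute values and the function names `dependency_degree` and `quickreduct` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def dependency_degree(rows, condition_attrs, decision_attr):
    """Fraction of rows in the positive region of the decision attribute.

    Rows that agree on every condition attribute form one equivalence
    class; a class counts as positive only if all of its rows share the
    same decision value.
    """
    blocks = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in condition_attrs)
        blocks[key].append(row[decision_attr])
    positive = sum(len(d) for d in blocks.values() if len(set(d)) == 1)
    return positive / len(rows)


def quickreduct(rows, condition_attrs, decision_attr):
    """Greedily add the attribute that raises the dependency degree most,
    stopping once the subset matches the dependency of the full set."""
    target = dependency_degree(rows, condition_attrs, decision_attr)
    selected, current = [], 0.0
    while current < target:
        best_attr, best_gamma = None, current
        for attr in condition_attrs:
            if attr in selected:
                continue
            gamma = dependency_degree(rows, selected + [attr], decision_attr)
            if gamma > best_gamma:
                best_attr, best_gamma = attr, gamma
        if best_attr is None:  # no attribute improves the dependency
            break
        selected.append(best_attr)
        current = best_gamma
    return selected


# Hypothetical nominal data: "sex" carries no information about "disease".
rows = [
    {"cp": "typical", "exang": "no", "sex": "m", "disease": 0},
    {"cp": "typical", "exang": "yes", "sex": "f", "disease": 1},
    {"cp": "asymptomatic", "exang": "no", "sex": "m", "disease": 1},
    {"cp": "asymptomatic", "exang": "yes", "sex": "f", "disease": 1},
    {"cp": "typical", "exang": "no", "sex": "f", "disease": 0},
    {"cp": "asymptomatic", "exang": "no", "sex": "f", "disease": 1},
]
reduct = quickreduct(rows, ["cp", "exang", "sex"], "disease")
print(reduct)
```

On this toy table the search keeps `cp` and `exang` and discards `sex`. With a continuous attribute, almost every row would fall into its own equivalence class and the crisp dependency would be trivially 1, which illustrates the limitation that the fuzzy extension is meant to address.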
3) Clustering Using K-Means Algorithm: Clustering is an unsupervised learning technique in which data elements are segmented into related groups without prior knowledge of the group definitions. Its basic task is to group objects into meaningful categories and to develop classification labels automatically. Numerous clustering methods are available in the literature. The k-means algorithm is one of the most widely recognized clustering algorithms and is applied in numerous scientific and industrial applications. The K-means clustering algorithm is adopted in the present approach because it is an unsupervised partitioning method that is simple and takes relatively little computational time [15, 16, 17]. The K-means algorithm takes as input a positive integer k denoting the number of clusters, and groups the data into k distinct clusters according to their characteristic values, so that the objects of one cluster are similar to each other and dissimilar to the objects of the other clusters.

Finally, the relevant features identified in the second stage and the correctly clustered instances retained in the third stage were given as input to five different WEKA classifiers using 10-fold cross validation. The performance of the classifiers was evaluated based on the confusion matrix. Table II summarizes this process.

TABLE II. SUMMARY OF PREPROCESSING
Data set | No. of instances | No. of instances after preprocessing | No. of incorrectly clustered instances using K-means | Error in clustering (%) | No. of instances after elimination of wrongly clustered instances
Heart | | | | % | 219

IV. OVERVIEW OF THE CLASSIFIERS
A large number of classifiers are available in the literature, such as Bayesian, rule based, neural network and tree classifiers. Whatever their type, their ultimate goal is to predict the class.
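Before the individual classifiers are described, the instance-elimination step of the third stage (summarized in Table II) can be sketched as follows. This is a minimal illustration under stated assumptions, not the WEKA procedure used in the study: K-means is written out as plain Lloyd's iterations with a deterministic farthest-point seeding, each cluster is mapped to the majority class label of its members, and an instance counts as "incorrectly clustered" when its own label differs from that majority. The function names and the toy points are hypothetical.

```python
from collections import Counter


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm; returns one cluster index per point."""
    # Deterministic farthest-point seeding (an assumption of this sketch).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    assignment = None
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment


def keep_correctly_clustered(points, labels, k=2):
    """Indices of instances whose label matches their cluster's majority label."""
    assignment = kmeans(points, k)
    majority = {
        c: Counter(lab for lab, a in zip(labels, assignment) if a == c).most_common(1)[0][0]
        for c in set(assignment)
    }
    return [i for i, (lab, a) in enumerate(zip(labels, assignment)) if majority[a] == lab]


# Hypothetical data: the last point lies in the "1" group but is labelled 0.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.0)]
labels = [0, 0, 0, 1, 1, 1, 0]
kept = keep_correctly_clustered(points, labels)
print(kept)
```

Only the instances returned by `keep_correctly_clustered` would be passed on to the classifiers. Because the class labels are used to decide which instances to discard, the retained set is by construction easier to separate than the original data.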
In our approach, the Cleveland Heart dataset was evaluated using five widely used classifiers.

Bayesian classifiers: Naïve Bayes. Naïve Bayes (NB) is a probabilistic classification method based on Bayes' theorem. It assumes that the input features are conditionally independent of each other given the class.

Support Vector Machines using Sequential Minimal Optimization. The Support Vector Machine (SVM) algorithm builds a hyperplane to separate the instances into their respective classes. SMO, a fast and efficient version of SVM implemented in WEKA, uses the sequential minimal optimization algorithm to train a support vector classifier with polynomial or Gaussian kernels.

Instance Based Learners: IBk. IBk is an instance based classifier in which classification is done on the basis of a majority vote of the k neighboring instances.

Trees: J48 Decision Tree. The decision tree (J48) is an implementation of C4.5 in WEKA and takes a divide-and-conquer approach to the problem of learning. The tree consists of nodes (attributes) at every stage, which are structured with the help of the training examples.

Rule Learner: PART. PART is a rule learner proposed by Frank and Witten (1998) that combines C4.5 and Ripper. The algorithm generates an ordered set of rules called a decision list. New data is compared with each rule in the list in turn, and the item is assigned the category of the first matching rule (a default is applied if no rule matches). In each iteration PART builds a decision tree using C4.5's heuristics and makes the best leaf into a rule.

Ensembles:
Bagging combines multiple models generated by training a single algorithm on random sub-samples of a given dataset. Unbiased voting is used during the fusion process.
AdaBoost: Boosting, in contrast to bagging, reweights the training instances so that successive models concentrate on previously misclassified instances, and combines the models by weighted voting.
Rotation Forest is a successful ensemble technique in which each tree is trained on the whole data set in a rotated feature space.
Dagging creates a number of disjoint, stratified folds out of the data and feeds each chunk of data to a copy of the supplied base classifier.
Random Forest constructs random forests by bagging ensembles of random trees.

The characteristics of the dataset used are given in Table III.

TABLE III. DATA SET DESCRIPTIONS
Data Set | No. of Attributes (including class) | No. of Classes | No. of Instances | Missing Values
Heart Cleveland | | | | Yes (7)

Sl no | Attribute | Description | Values
1 | Age | Age in years | Continuous
2 | Sex | Male or female | 1 = male; 0 = female
3 | Cp | Chest pain type | 1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic
4 | Trestbps | Resting blood pressure | Continuous value in mm Hg
5 | Chol | Serum cholesterol | Continuous value in mg/dl
6 | Restecg | Resting electrocardiographic results | 0 = normal; 1 = having ST-T wave abnormality; 2 = left ventricular hypertrophy
7 | Fbs | Fasting blood sugar | 1 = above 120 mg/dl; 0 = 120 mg/dl or below
8 | Thalach | Maximum heart rate achieved | Continuous value
9 | Exang | Exercise induced angina | 0 = no; 1 = yes
10 | Oldpeak | ST depression induced by exercise relative to rest | Continuous value
11 | Slope | Slope of the peak exercise ST segment | 1 = upsloping; 2 = flat; 3 = downsloping
12 | Ca | Number of major vessels colored by fluoroscopy | 0-3
13 | Thal | Defect type | 3 = normal; 6 = fixed defect; 7 = reversible defect

V. ESTIMATIONS FOR MODEL PERFORMANCE
A. Stratified 10 Fold Cross Validation Method
In this study, stratified cross validation with 10 folds was used to evaluate the classifier models. Cross validation is a statistical technique for evaluating the performance of a predictive model, and for comparing learning algorithms, by dividing the data into 2 segments: one used to train a model and the other used to validate it [18]. Stratification is the process of partitioning the data such that each class is properly represented in both the training and the test sets [19]. In stratified 10-fold cross validation the data is divided randomly into 10 parts, in each of which the classes are represented in approximately the same proportions as in the full dataset. Each part is held out in turn and the learning scheme is trained on the remaining nine-tenths; its error rate is then calculated on the holdout set. The learning procedure is thus executed a total of 10 times on different training sets, and finally the 10 error rates are averaged to yield an overall error estimate. When seeking an accurate error estimate, it is standard procedure to repeat the CV process 10 times [19].

B. Performance Measures
Supervised Machine Learning (ML) has several ways of evaluating the performance of classifiers. The quality of a classification algorithm is measured based on the confusion matrix, which records the correctly and incorrectly recognized examples for each class. Table IV presents a confusion matrix for binary classification, where TP denotes true positives, TN true negatives, FP false positives and FN false negatives. The measures derived from the confusion matrix are given below.

TABLE IV. CONFUSION MATRIX
Actual class | Predicted: Test Negative (T-) | Predicted: Test Positive (T+)
Disease Absent (D-) | True Negative (TN) | False Positive (FP)
Disease Present (D+) | False Negative (FN) | True Positive (TP)

Accuracy: The accuracy of a classifier is the percentage of the test set tuples that are correctly classified by the classifier.
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Sensitivity: Sensitivity, also referred to as the true positive rate, is the proportion of positive tuples that are correctly identified.
Sensitivity = TP / (TP + FN)

Specificity: Specificity is the true negative rate, that is, the proportion of negative tuples that are correctly identified.
Specificity = TN / (TN + FP)

Time: The amount of time required to build the model.

VI. EXPERIMENTAL RESULTS
This section presents the experimental results and their analysis. The classification accuracy of 5 classification algorithms was analyzed on the Heart dataset from the UCI repository, and the accuracy of the proposed model was compared empirically with the benchmark comparison of results given in [20].

TABLE V. EXPERIMENTAL RESULTS USING BASIC LEARNING CLASSIFIERS
Performance Evaluators | Naïve Bayes | K-NN (IBk) | PART | C4.5 (J48) | SVM (SMO)
Accuracy |
Error Rate |
Sensitivity |
Specificity |
Time to build the model |

TABLE VI. EXPERIMENTAL RESULTS USING ENSEMBLE CLASSIFIERS
Performance Evaluators | Bagging | AdaBoost (AdaBoostM1) | Rotation Forest | Dagging | Random Forest
Accuracy |
Error Rate |
Sensitivity |
Specificity |
Time to build the model |
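The evaluation protocol behind these tables can be illustrated with a short, self-contained sketch: one function that builds stratified folds and one that turns confusion-matrix counts into the measures of Section V. The study itself relied on WEKA for both steps, so the function names, the fold-dealing strategy, the toy labels and the definition of error rate as the misclassified fraction (1 - accuracy) are assumptions of this sketch, which also assumes that both classes occur in the data.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split instance indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def confusion_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for a binary problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def performance_measures(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,  # assumed to be 1 - accuracy
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical labels: 20 positive and 30 negative instances.
all_labels = [1] * 20 + [0] * 30
folds = stratified_folds(all_labels, k=10)

# Hypothetical predictions for one held-out set.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
measures = performance_measures(*confusion_counts(actual, predicted))
print(measures)
```

In a full run, each fold would be held out once, a classifier trained on the remaining nine folds, and the counts (or the per-fold error rates) combined over the 10 folds before the measures are reported.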

TABLE VII. EXPERIMENTAL RESULTS USING PROPOSED HYBRID MODEL (WITHOUT FEATURE SELECTION)
Performance Evaluators | Naïve Bayes | K-NN (IBk) | PART | C4.5 (J48) | SVM (SMO)
Accuracy |
Error Rate |
Sensitivity |
Specificity |
Time to build the model |

TABLE VIII. EXPERIMENTAL RESULTS USING PROPOSED HYBRID MODEL + HYBRID FEATURE SELECTION (FUZZY + ROUGH SET FRFS WITH ANTSEARCH)
Performance Evaluators | Naïve Bayes | K-NN (IBk) | PART | C4.5 (J48) | SVM (SMO)
Accuracy |
Error Rate |
Sensitivity |
Specificity |
Time to build the model |

TABLE IX. COMPARISON OF PROPOSED HYBRID MODEL OVER BENCHMARK COMPARISON OF CLASSIFIERS
Data set | Accuracy range (benchmark single classifier model) | Accuracy (proposed hybrid model) | Classifier with highest accuracy on proposed model
Heart Cleveland | 46.2% to 90.0% | 99.54% | PART

The experimental results in Tables V, VI, VII and VIII show that cascaded K-means clustering and classification, combined with cascaded Fuzzy Rough Set feature selection, gave enhanced classification accuracy. Table IX compares the proposed model with the benchmark comparison of results given in [20].

A. Research Findings
1) Table VIII shows that the proposed model achieved good accuracy compared with the hybrid models from the literature and with the benchmark comparison of results given in [20]. The highest accuracy obtained by the proposed model on the Cleveland Heart dataset is 99.54%.
2) The literature shows that ensemble classifiers do not always give improved results [4]; the outcome depends on the type of base classifiers used. The proposed model in this study, by contrast, gave improved results irrespective of the classifier when compared with traditional base classifiers and ensemble classifiers. It was also found that integrating feature selection with K-means enhanced the accuracy of the classifier even further.

VII. CONCLUSION
Various data mining techniques are available for diagnosis of a disease.
The main goal of this research was to identify the improvement offered by the hybrid model over the single classifier model and the ensemble model. The accuracy of different base classifiers depends on the type of data and the type of features. Ensemble classifiers have an advantage over traditional classifiers because they work on the principle that, when dealing with a complicated problem, a group of experts with varied experience in the same area has a higher probability of reaching a satisfactory solution than a single expert. To obtain good ensemble accuracy, the base classifiers should be simple and accurate so that they do not over-fit, because a known disadvantage of ensemble learning is that trying to maximize classifier accuracy tends to increase the error of each individual base classifier. In this study, therefore, a hybrid model of clustering and classification with hybrid feature selection was proposed to diagnose the presence or absence of heart disease and to enhance the classification accuracy on the data set, tested with 10-fold cross validation. The experimental results showed that, irrespective of the type of classifier, the proposed hybrid approach combining preprocessing and hybrid feature selection gave enhanced classification accuracy on the Heart data set, because the redundant and irrelevant features were eliminated in the feature selection phase, the irrelevant instances were eliminated in the clustering phase, and the resulting dataset was then used to train the different classifiers.

VIII. FUTURE WORK
The accuracy of the classifier also depends on the error rate of the clustering algorithm, so our future work will focus on applying a clustering algorithm that produces a lower error rate than k-means. The investigation also found that feature selection before clustering enhanced the classification accuracy, so our future work will also focus on applying different feature selection algorithms and on testing the performance of the proposed model on different datasets.

REFERENCES
[1] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, ISBN
[2] Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining, Pearson/Addison Wesley, ISBN
[3] T. G. Dietterich, Ensemble methods in machine learning, in MCS '00: Proceedings of the First International Workshop on Multiple Classifier Systems, pp. 1-15, Springer-Verlag, London, UK, 2000.
[4] Ethem Alpaydın, Introduction to Machine Learning, Second Edition, MIT Press,
[5] Cios K. J. and Moore G. W., 2002. Uniqueness of medical data mining. Artificial Intelligence in Medicine 26(1-2),
[6] Sarwesh Site (M.Tech Scholar, LNCT Bhopal, India) and Dr. Sadhna K. Mishra (Prof., LNCT Bhopal, India), A Review of Ensemble Technique for Improving Majority Voting for Classifier, IJARCSSE, Volume 3, Issue 1, ISSN: X
[7] Srimani P. K. and Manjula Sanjay Koti (2013). "Medical Diagnosis Using Ensemble Classifiers - A Novel Machine-Learning Approach". Journal of Advanced Computing (2013) 1: 9-27, doi: /jac
[8] Sarvestan Soltani A., Safavi A. A., Parandeh M. N. and Salehi M., Predicting Breast Cancer Survivability using data mining techniques, Software Technology and Engineering (ICSTE), 2nd International Conference, 2010, vol. 2, pp
[9] Asha Gowda Karegowda, M.A. Jayaram, A.S. Manjunath, Cascading K-means Clustering and K-Nearest Neighbor Classifier for Categorization of Diabetic Patients, International Journal of Engineering and Advanced Technology (IJEAT), ISSN: , Volume-1, Issue-3 (Feb 2012).
[10] T. Asha, S. Natarajan and K. N. B. Murthy, A Data Mining Approach to the Diagnosis of Tuberculosis by Cascading Clustering and Classification, Journal of Computing, vol. 3, no. 4, 2011.
[11] World Health Organization. Cardiovascular diseases (CVDs) Fact sheet. Updated March 2013. Available at.
[12] Prabhat Jha, Vendhan Gajalakshmi, Prakash C Gupta, Rajesh Kumar, Prem Mony, Neeraj Dhingra, Richard Peto and RGI-CGHR Prospective Study Collaborators, Prospective study of 1 million deaths in India: Rationale, Design and Validation results. Mauricio Hernandez Avila, Academic Editor.
[13] UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science.
[14] S.S. Baskar, Dr. L. Arockiam, S. Charles, A Systematic Approach on Data Pre-processing In Data Mining, COMPUSOFT International Journal of Advanced Computer Technology, ISSN
[15] Khaled Hammouda, Fakhreddine Karray, A Comparative Study of Data Clustering Techniques, University of Waterloo, Ontario, Canada, Volume 13, Issues 2-3, November 1997, pp
[16] Pallavi, Sunila Godara, A Comparative Performance Analysis of Clustering Algorithms, International Journal of Engineering Research and Applications (IJERA), ISSN: , Vol. 1, Issue 3, pp
[17] Narendra Sharma, Aman Bajpai and Ratnesh Litoriya, Comparison the various clustering algorithms of weka tools, International Journal of Emerging Technology and Advanced Engineering, Volume 2, Issue 5, May 2012.
[18] Payam Refaeilzadeh, Lei Tang, Huan Liu (Arizona State University), Cross-Validation, Encyclopedia of Database Systems, Springer US, 2009, pp
[19] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition, Morgan Kaufmann Pub, 2005.
[20] Benchmark datasets used for classification: comparison of results. Computational Intelligence Laboratory, Department of Informatics, Nicolaus Copernicus University.