A Survey on classification & feature selection technique based ensemble models in health care domain



Similar documents
Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing Classifier

REVIEW OF HEART DISEASE PREDICTION SYSTEM USING DATA MINING AND HYBRID INTELLIGENT TECHNIQUES

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Artificial Neural Network Approach for Classification of Heart Disease Dataset

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE

SURVIVABILITY ANALYSIS OF PEDIATRIC LEUKAEMIC PATIENTS USING NEURAL NETWORK APPROACH

ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS

Evaluation of Feature Selection Methods for Predictive Modeling Using Neural Networks in Credits Scoring

Comparison of Six Classification Techniques for Post Operative Patient data in the Medicable discipline

Prediction of Heart Disease Using Naïve Bayes Algorithm

Comparison of K-means and Backpropagation Data Mining Algorithms

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

A Content based Spam Filtering Using Optical Back Propagation Technique

Introduction to Data Mining

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Effective Analysis and Predictive Model of Stroke Disease using Classification Methods

INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY DATA MINING IN HEALTHCARE SECTOR.

Data quality in Accounting Information Systems

Keywords data mining, prediction techniques, decision making.

Random forest algorithm in big data environment

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems

Diagnosis of Breast Cancer Using Intelligent Techniques

Keywords Data Mining, Knowledge Discovery, Direct Marketing, Classification Techniques, Customer Relationship Management

1. Classification problems

Data Mining On Diabetics

Social Media Mining. Data Mining Essentials

Mining Health Data for Breast Cancer Diagnosis Using Machine Learning

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

DATA MINING TECHNIQUES AND APPLICATIONS

Data Mining Analysis (breast-cancer data)

International Journal of Software and Web Sciences (IJSWS)

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin

Impelling Heart Attack Prediction System using Data Mining and Artificial Neural Network

Towards applying Data Mining Techniques for Talent Mangement

REVIEW ON PREDICTION SYSTEM FOR HEART DIAGNOSIS USING DATA MINING TECHNIQUES

Manjeet Kaur Bhullar, Kiranbir Kaur Department of CSE, GNDU, Amritsar, Punjab, India

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Data Mining Part 5. Prediction

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Data Mining Techniques for Prognosis in Pancreatic Cancer

Heart Disease Diagnosis Using Predictive Data mining

Knowledge Discovery from patents using KMX Text Analytics

Towards better accuracy for Spam predictions

Utilization of Neural Network for Disease Forecasting

Chapter 6. The stacking ensemble approach

An Introduction to Data Mining

Application of Data Mining in Medical Decision Support System

Meta-learning. Synonyms. Definition. Characteristics

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

Data Mining: A Preprocessing Engine

Use of Artificial Neural Network in Data Mining For Weather Forecasting

Data Mining Solutions for the Business Environment

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

How To Solve The Kd Cup 2010 Challenge

Data Mining using Artificial Neural Network Rules

AnalysisofData MiningClassificationwithDecisiontreeTechnique

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

A Review of Data Mining Techniques

Scalable Developments for Big Data Analytics in Remote Sensing

Detection of Heart Diseases by Mathematical Artificial Intelligence Algorithm Using Phonocardiogram Signals

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

Comparison of Data Mining Techniques used for Financial Data Analysis

AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM

Feature Selection for Classification in Medical Data Mining

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Role of Neural network in data mining

An Overview of Knowledge Discovery Database and Data mining Techniques

Face Recognition For Remote Database Backup System

Customer Classification And Prediction Based On Data Mining Technique

A Novel Feature Selection Method Based on an Integrated Data Envelopment Analysis and Entropy Mode

PERFORMANCE ANALYSIS OF VARIOUS DATA MINING CLASSIFICATION TECHNIQUES ON HEALTHCARE DATA

DATA MINING AND REPORTING IN HEALTHCARE

Advanced Ensemble Strategies for Polynomial Models

Impact of Boolean factorization as preprocessing methods for classification of Boolean data

Genetic Neural Approach for Heart Disease Prediction

Data Mining - Evaluation of Classifiers

Wireless Remote Monitoring System for ASTHMA Attack Detection and Classification

Single Level Drill Down Interactive Visualization Technique for Descriptive Data Mining Results

SVM Ensemble Model for Investment Prediction

Learning is a very general term denoting the way in which agents:

Data Mining Algorithms Part 1. Dejan Sarka

In Proceedings of the Eleventh Conference on Biocybernetics and Biomedical Engineering, pages , Warsaw, Poland, December 2-4, 1999

life science data mining

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Neural Networks and Support Vector Machines

Predictive Data modeling for health care: Comparative performance study of different prediction models

First Semester Computer Science Students Academic Performances Analysis by Using Data Mining Classification Algorithms

Data Mining Classification Techniques for Human Talent Forecasting

Dan French Founder & CEO, Consider Solutions

Chapter 12 Discovering New Knowledge Data Mining

Feature Subset Selection in Spam Detection

A Comparative Analysis of Classification Techniques on Categorical Data in Data Mining

Using Data Mining for Mobile Communication Clustering and Characterization

FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS

Transcription:

A Survey on classification & feature selection technique based ensemble models in health care domain GarimaSahu M.Tech (CSE) Raipur Institute of Technology,(R.I.T.) Raipur, Chattishgarh, India garima.sahu03@gmail.com Rakesh KumarKhare (H.O.D.) Department of Information Technology Raipur Institute of Technology,(R.I.T.) Raipur, Chattishgarh, India rakesh_khare2001@yahoo.com Abstract There are various ensemble models used in healthcare domain but classification & feature selection technique based models are very effective and useful in diagnosis of skin diseases. The classification technique used in the model determines the scalability, while feature selection technique makes the model quick to diagnose the skin disease. Over the years, various ensemble models have been proposed which are having combination of various classification techniques and some models are based upon feature selection techniques. Some are hybrid models so that accuracy of the model gets increase. Our survey indicates the latest developments in techniques used to build the ensemble model. The combination of techniques varies with the disease so the combination suitable in dermatology can be determined through this survey. The ensemble model can be evaluated on the basis of system parameters such as accuracy, sensitivity, recall etc. The system parameters are represented in terms of percentage, by which ensemble model produces the results with the dataset used. Keywords-dermatology; expert system; feature selection; classification; sensitivity 1. Introduction Dermatology is the study related to skin diseases. The process to diagnose the skin diseases is very complex so that sometimes it happens that care providers may not diagnose the disease within time and it may turn out to be skin cancer. The similar features of six categories of skin diseases make it difficult diagnosis. In order to timely identification and accurate results, the concept of ensemble model came into existence so that dependency upon previous experience of care provider can be minimised. To increase the quickness of diagnosis process, verification of \some less important, redundant features available in data collected from patients can be avoided. To fulfil this purpose, a set of minimum number of features is required so that accurate diagnosis results can be produced quickly. The classification of data can be performed on the basis of features related to diseases, combination of classification techniques are used to build an ensemble model. The classification techniques works in two ways. First is training mode and other is testing mode. In training mode, the model is used to build using algorithm with the training data related to the domain. In testing mode, the built model is being tested as per the expected results. The deviation of the model from the desired result and the accuracy along with other system features are measured. This paper gives an overview of the use of data mining techniques such as classification and feature selection technique. Combination of these techniques has been used in various expert systems. These expert systems which are mentioned in this paper are mainly related to health care domain. This paper also conveys the information related to the efficiency of the model proposed by different researchers. 2. Clinical Knowledge Discovery Clinical knowledge discovery refers to identification of different patterns from raw data related to health care domain and extraction of the relevant information from large databases which consists of patient data. In this process, several steps are present. Those steps involve removal of noise and redundant data. The data which has been 957

collected from different sources needs to be combined. The data which is specifically related to the analysis task is to be selected or retrieved from large database and the transformation occurs on the application of techniques to utilize the data in specified manner. The whole process is carried down in every expert system with the difference of order and usage of techniques.the care provider may escape themselves from providing the wrong treatment so that expert system may help them in the process of providing effective treatment to the patient. The associative rules present in expert system closely analyse the features available in the patient data. The effectiveness of the features can be determined through the rules and redundant features can be eliminated to form a set of minimum number of features which are only required for the purpose of diagnosis of the disease. Based upon this methodology, different expert systems are proposed and used by care providers for better treatment. The combination of classification techniques and feature selection methods describe the uniqueness and effectiveness of the proposed expert system. 3. Literature Survey The clinical decision support systems which are in current use in health care sector and the proposed one are described here for the consolidated view of research and development in the field of clinical decision support system [Table- I]. Marcano-Cedeno et al [1]proposed the expert system for the patients of acquired brain injury. It is based upon the application of Decision Tree (DT), Multi Layer Perceptron (MLP), General Regression Neural Network (GRNN) classification techniques so that efficient and proper care can be provided to patients. Their proposed expert system depicts better results in terms of system features such as accuracy (90.38% with DT, 78.77% with MLP & 75.96% with GRNN), (90.35% with DT, 77.83% with MLP & 75.66% with GRNN) and (90.62% with DT, 88.26% with MLP & 78.99% with GRNN). The experimental results clearly rank Decision Tree technique as a favourable one to be implemented in the expert system to get better and efficient result. The results are produced using the patient data stored in the database of hospital named The Institute Guttman. To get On-demand report of the patients based upon the primary input such as heart rate, number of various blood cells & some other physiological factors collected via different sensors are processed locally and/or at remote location, if required. The infrastructure involves uses wireless network for local analysis and after the determination of index value, if it is greater than a threshold value, the data is to be transferred to the remote system for further accurate analysis to determine the disease and preparation of the reports. In the central database, various techniques such as fuzzy logic and artificial neural network are applied to ensure proper analysis. The two tier architecture is proposed by GennaroTartarisco et al [2]. The report indicates the stress level of the patients and support care providers in decision making by avoiding personal risk. Use of sentic computing is suggested by Erik Cambria et al [3] to bridge the gap between the text based questionnaire related to symptoms gathered through patients and the use of common sense knowledge. This helps to provide quick medication to the patients. H. Altay Guvenir et al [4] proposed expert system with voting feature intervals technique which is also known as VFI5. This voting feature interval classification technique uses to identify the new input gathered through the patients by analysing it with the help of available training data. This technique is very effective to increase the efficiency of the diagnosis process because every feature is involved in the voting process separately. This voting is used to determine the predicted class. This accurate and quick classification of features enhanced the productivity of the expert system. This technique involves the projection of features. This projection creates the points of projections, the similarity among these points of projection enables them to be grouped into intervals. An interval is the representation of features with similar class. This process is followed to select the features from a given set of features collected from 958

the patients suffering from disease of dermatology family. Apart from using a single classification technique in the expert system, Alaa M. Elsayad et al [5] proposed an effective ensemble model which uses three different classification techniques such as C5.0 decision tree technique, multilayer perceptron neural network and linear discriminant analysis. The model is constitute with the optimization effect of individual models involved in the ensemble model so that limitation of individual model can be eliminated so with the help of combination of the techniques, more effective model is built. The decision tree technique divides the pattern in two parts to constitute tree form. The multilayer perceptron neural network presents each input pattern to network and apply the function of weighted sums. Linear discriminant analysis classifies the data patterns belonging to unknown classes based upon available training patterns of already used classes. The proposed ensemble model achieves the accuracy of 98.23% which shows the usability of the build model. To diagnose thyroid disease (Hyperthyroidism & Hypothyroidism) an expert system is proposed by Ali Keles et al [6]. The expert system uses neuro fuzzy classification method for the purpose of diagnosis of thyroid disease and achieved accuracy of 95.33% by cross validation of ten folds. The system consists of three layer of feed forward neural network. Those layers of neural network represent the input, fuzzy rules and the output. The proposed model has been implemented in an diagnosis application having selection processes to be chosen by the care providers/individuals. It is difficult to diagnose thyroid disease at early stage so that this expert system is very useful so that proper can be arranged. Ali Keles et al [7] proposed an expert system for another very critical disease which is breast cancer. This expert system is also based upon neuro fuzzy rules. It produces the results related to accuracy, specificity & sensitivity as 96%, 97% & 76%. Kung-Jeng Wang el al [9] try to estimate of survival of breast cancer patients using decision tree classifier along with two new data mining techniques. A remarkable result has been produced in terms of system accuracy of 99.25% at training stage and 98.99% at testing stage of experiment setup. The accuracy has been achieved by using support vector machine &artificial neural network data mining technique. The proposed expert system by Dinesh K. Sharma et al [8] is very effective to diagnose skin diseases using voting method on the basis of weight assigned to each individual model so that ensemble model may produce better and efficient results. The support vector machine technique creates a hyper plane similar to the surface of decision. In decision surface, the distance between the positive & negative samples gets increased. Through many support vectors, a classification task is classified. This expert system generates good results for other system features i.e.: sensitivity, specificity etc. Table-I System parameters of proposed classification models for critical diseases Authors/Research ers A. Marcano- Cedeno, Paloma Chausa, Alejandro Garcia, Cesar Caceres, Josep M. Tormos and Enrique J. Gomez[1] Ali Keles, AyturkKeles[6] Kung-Jeng Wang, BunjiraMakond, Kun-Huang Chen, Kung-Min Wang [9] H. Altay Guvenir, GulsenDemiroz, NilselIlter[4] Classificatio n technique used Decision Tree neuro fuzzy classification method synthetic minority oversampling technique,par ti-cle swarm optimization & C5.0 decision tree Voting Feature Intervals-5 Experiment al Results 90.38% 90.35% 90.62% 95.33% 94.25% 97.4% 94.5% 96.2% 959

Alaa M. Elsayad[5] A. Marcano- Cedeno, Paloma Chausa, Alejandro Garcia, Cesar Caceres, Josep M. Tormos and Enrique J. Gomez [1] Ali Keles, AyturkKeles, UgurYavuz[7] Dinesh K. Sharma, H. S. Hota[8] A. Marcano- Cedeno, Paloma Chausa, Alejandro Garcia, Cesar Caceres, Josep M. Tormos and Enrique J. Gomez [1] Shelly Gupta, Dharminder Kumar and Anand Sharma[10] H.A. Guvenir, N. Emeksiz [12] Ridwan AI Iqbal [14] C5.0 decision tree technique, multilayer perceptron neural network and linear discriminant analysis Multi Layer Perceptron neuro fuzzy classification method support vector machine & artificial neural network General Regression Neural Network (GRNN) support vector machine nearestneighb or classifier, naive Bayesian classifier and voting feature intervals-5. Attribute Cost Minimized FOCL 98.23% 78.77% 77.83% 88.26% 96% 76% 97% 98.99% 98.61% 99.8% 75.96% 75.66% 78.99% 99.25% 99.2% 90.17% Shelly Gupta el al [10] has compared various classification techniques using advanced tool such as WEKA [11], Tanagra & Clementine. All the techniques are evaluated on the basis of accuracy percentage and their respective error rate with 10 fold cross validation. The experiment clearly ranks all the available classification techniques which can be implemented to build an expert system for the patients of diabetes. H.A. Guvenir et al [12] not only propose an expert system but implement it into a visual tool which can be used for the purpose of diagnosis of complex skin diseases. It uses three classification techniques named nearest neighbor classification, naive bayesian classifier with normal distribution method and the advanced classification technique, voting feature intervals-5. The nearest neighborclassification determines the closeness of unknown sample to a group of similar patterns. The closeness is determined by calculating the Euclidean distance between them (Jiawei Han, MichelineKamber [13]). The conditional probability of each feature is calculated in naive bayesian classification technique and in VFI-5, score of each feature is being calculated by giving assigned weightage to them, on the basis of voting, class is determined for each feature. Ridwan Al Iqbal [14] suggested a hybrid approach for the formation of clinical decision support system having the qualities of knowledge based expert system as well as data mining based expert system. It reduces the dependency upon the reliability of the training data so that more accuracy can be achieved. In this hybrid approach, inference engine will map new sample with the existing rules derived from the knowledge of domain experts and if the correct matching of attribute is not available, expert system will try to classify with the available training data and model will accurately produce the expected results. The feature selection methodology proposed by Ahmed K. Farahat et al [15] is unique because it uses the recursive method of unsupervised learning environment and implements it into the supervised learning environment. It selects the features based upon criteria of measurement of error in reconstruction of matrix of data collected from the subset of the features which 960

are already selected. They propose the recursive formula for the criteria along with the effective greedy algorithm. Diagnosis time has been reduced in a great amount with the application of feature selection technique before the application of classification technique or building the model because there can be many irrelevant and redundant features which are not required to be incorporated within the formation of the model for the detection of skin diseases. The feature selection techniques used by Shuzlina Abdul-Rahman et al [16] are correlation feature selection and fast correlation based filter. Correlation feature selection ranks all the features on the basis of their respective co relevancy with the class attribute and fast correlation based filter method directly remove the features which score less than a threshold value and provide the ranks to them. After feature selection process getscompleted, the classification technique of artificial neural network is applied. The neural network used in this model is of multilayer perceptron having one hidden layer. Through back propagation neural network, better performance can be achieved by adjusting the parameters. This model achieved the accuracy of 91.2% which is a significant number for the expert system. 4. Conclusion from Literature Survey This survey paper elaborates the key combination of classification techniques and the effect of using more than one technique. This paper also aims the development in the feature selection methods which definitely accelerates the efficiency of the expert systems of health care domain. Many complex diseases can be detected automatically at early stage with the help of these artificial intelligence systems. This paper also gives an idea to the researchers about the effect of data mining techniques over the dataset related to the patients of different diseases so that classification and feature selection technique can be chosen easily for specific disease. The research dataset available at UCI repository [17] is very reliable because most of the expert systems have been proposed with the experimental result upon it. This survey paper embarks the development and automation in health care domain. It shows good signs for further research and development in the health care sector. 5. REFERENCES [1] A. Marcano-Cedeno, Paloma Chausa, Alejandro Garcia, Cesar Caceres, Josep M. Tormos and Enrique J. Gomez, Data mining applied to the cognitive rehabilitation of patients with acquiredbrain injury, Expert Systems with Applications (2012). [2] GennaroTartarisco, Giovanni Baldus, Daniele Corda,RossellaRaso, AntoninoArnao,Marcello Ferro, Andrea Gaggioli and Giovanni Pioggia, Personal Health System architecture for stress monitoring and supportto clinical decisions, Computer Communications 35 (2012) 1296-1305. [3] Erik Cambria, Tim Benson, Chris Eckl and Amir Hussain, Sentic PROMs: Application of sentic computing to the development of a novelunified framework for measuring health-care quality, Expert Systems with Applications 39 (2012) 10533 10543. [4] H. Altay Guvenir,GulsenDemiroz, NilselIlter, Learning differential diagnosis oferythematosquamous diseases using voting featureintervals, Artificial Intelligence in Medicine 13 (1998) 147 165. [5] Alaa M. Elsayad, Diagnosis of Erythemato- Squamous Diseases using Ensemble of DataMining Methods, ICGST-BIME Journal, Volume 10, Issue 1, December 2010. [6] Ali Keles, AyturkKeles, ESTDD: Expert system for thyroid diseases diagnosis, Expert Systems with Applications 34 (2008) 242 246. [7] Ali Keles, AyturkKeles, UgurYavuz, Expert system based on neuro-fuzzy rules for diagnosis breast cancer, Expert Systems with Applications 38 (2011) 5719 5726. [8] Dinesh K. Sharma, H. S. Hota, Data mining techniques for prediction of different categories of dermatology diseases, Academy of Information and Management Sciences Journal, Volume 16, Number 2, 2013 103-115. [9] Kung-Jeng Wang, BunjiraMakond, Kun-Huang Chen, Kung-Min Wang, A hybrid classifier combining SMOTE with PSO to estimate 5-year 961

survivability of breast cancer patients, Applied Soft Computing (2013). [10] Shelly Gupta, Dharminder Kumar and Anand Sharma, Performance analysis of various data mining classification techniques on health care data, International Journal of Computer Science & Information Technology (IJCSIT) Vol 3, No 4, August 2011. [11] WEKA (2013).An open source free data mining tool for teaching and research, web source:http://www.cs.waikato.ac.nz/ml/wekalast accessed on February, 2014. [12] H.A. Guvenir, N. Emeksiz, An expert system for the differential diagnosis oferythematosquamous diseases, Expert Systems with [15] Ahmed K. Farahat Ali Ghodsi Mohamed S. Kamel, An Efficient Greedy Method for Unsupervised Feature Selection, 2011 11th IEEE International Conference on Data Mining. Applications 18 (2000) 43 49. [13] Jiawei Han &MichelineKamber, Data Mining Concepts and Techniques, Second edition, Morgan Kaufmann Publications, San Francisco, CA. USA. [14] Ridwan AI Iqbal, Hybrid Clinical Decision Support System: AnAutomated Diagnostic System For RuralBangladesh,IEEE/OSA/IAPR International Conference on Informatics, Electronics & Vision (2012). [16] Shuzlina Abdul-Rahman, Ahmad KhairilNorhan, Marina Yusoff, Azlinah Mohamed, SofianitaMutalib, Dermatology Diagnosis with Feature SelectionMethods andartificial Neural Network, 2012 IEEE EMBS International Conference on Biomedical Engineering and Sciences, Langkawi, 17th-19th December,2012 (371-376). [17] NilselIlter, H. Altay Guvenir (March, 2014), UCI Machine Learning Repository [http://archive.ics.uci.edu/ml/datasets/dermatology ], Irvine, CA: University of California, School of Information and Computer Science. 962