A Novel Feature Selection Method Based on an Integrated Data Envelopment Analysis and Entropy Model




A Novel Feature Selection Method Based on an Integrated Data Envelopment Analysis and Entropy Model
Seyed Mojtaba Hosseini Bamakan, Peyman Gholami
Research Centre of Fictitious Economy & Data Science, University of Chinese Academy of Sciences
ITQM 2014, 03/06/2014

Abstract
Data mining is one of the fastest-growing sciences in the world and can give many firms a competitive advantage. Based on their functions, data mining algorithms can be divided into four categories:
o Classification
o Feature selection
o Association rules
o Clustering

[Slide diagram of the proposed pipeline: datasets → Entropy method → DEA modeling → selected features → classification and testing.]

Abstract
o Feature selection algorithms are mostly used to obtain more precise and robust machine learning models while reducing computation time.
o Data Envelopment Analysis (DEA) is a useful technique for determining the efficiency of decision-making units.
o The Entropy method weights the criteria in order to select the appropriate features.

Problem Definition
Classification methods are widespread and powerful tools for dealing with real problems such as firm bankruptcy prediction, credit card assessment, intrusion detection, fraud detection, etc. In general, a classification machine contains the following four fundamental components: (1) a set of attribute or characteristic values, (2) a sample training data set, (3) an acceptance domain, and (4) a classification function. Classification accuracy and predictive power are the two main issues for classification methods.

Problem Definition
Selecting an appropriate set of features to represent the main information of the original dataset is an important factor influencing the accuracy of classification methods. The goal of this paper is to provide a novel feature selection method, based on Data Envelopment Analysis and Entropy, that gains more classification accuracy.

Feature selection
o Enhancing classification accuracy and predictive ability
o Increasing the training speed
o Decreasing the storage demands
o Better understanding and interpretability of a domain

Feature selection
Different kinds of methods have been proposed for feature reduction. In general, they can be divided into two main groups:
o feature extraction
o feature selection
Although a number of comprehensive studies have combined feature selection with classification methods to select the best subset of features and improve classification accuracy, this study focuses on applying a new model based on MCDM methods.

Shannon's entropy
Shannon's entropy is a well-known method for calculating criteria weights in multiple-criteria decision-making problems.
Step 1: Normalize the decision matrix:
    p_{ij} = x_{ij} / \sum_{i=1}^{m} x_{ij},   i = 1, ..., m,   j = 1, ..., n
Normalizing the decision matrix makes it unit-free.
Step 2: Calculate the entropy of each criterion:
    E_j = -k \sum_{i=1}^{m} p_{ij} \ln p_{ij},   k = 1 / \ln m,   j = 1, ..., n
where n is the number of attributes and m is the number of samples.

Shannon's entropy
Step 3: Calculate the degree of deviation of each criterion from its entropy value:
    d_j = 1 - E_j
Step 4: Calculate the degree of importance (weight) of each criterion:
    w_j = d_j / \sum_{j=1}^{n} d_j,   j = 1, ..., n,   so that \sum_{j=1}^{n} w_j = 1
The entropy method is based on the variation of the values within each criterion: the more a criterion's values deviate, the lower its entropy, the larger its degree of deviation d_j, and hence the more important that criterion is for classification.
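The four steps above can be sketched in Python; this is a minimal illustration (the function name and the toy matrix are ours, not from the paper), assuming strictly positive entries in the decision matrix:

```python
import numpy as np

def entropy_weights(X):
    """Shannon entropy weighting for an m-samples x n-criteria matrix.

    Steps 1-4 above: normalize each column, compute the entropy E_j
    with k = 1/ln(m), then turn the degrees of deviation d_j = 1 - E_j
    into weights that sum to 1.
    """
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    # Step 1: normalize each column so its entries sum to 1.
    P = X / X.sum(axis=0)
    # Step 2: entropy of each criterion; 0 * ln(0) is taken as 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        logP = np.where(P > 0, np.log(P), 0.0)
    E = -(P * logP).sum(axis=0) / np.log(m)
    # Steps 3-4: degree of deviation and normalized weights.
    d = 1.0 - E
    return d / d.sum()

# A constant criterion has entropy 1 and therefore weight 0;
# the varying criterion receives all the weight.
X = np.array([[1.0, 1.0],
              [1.0, 9.0],
              [1.0, 1.0]])
w = entropy_weights(X)
```

This also illustrates the point about deviation: the constant first column is useless for discriminating samples and gets zero weight.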

Data Envelopment Analysis
DEA is used to calculate the efficiency of decision-making units (DMUs). It is a nonparametric method based on linear programming and was first proposed by Charnes, Cooper & Rhodes (1978). The basic DEA model is known as the CCR model:
    \max E_p = \frac{\sum_{r=1}^{s} u_r y_{rp}}{\sum_{i=1}^{m} v_i x_{ip}}
    s.t. \frac{\sum_{r=1}^{s} u_r y_{rj}}{\sum_{i=1}^{m} v_i x_{ij}} \le 1,   j = 1, ..., n
         u_r, v_i \ge 0,   r = 1, ..., s,   i = 1, ..., m

Data Envelopment Analysis
There are two ways to linearize this problem: output maximization or input minimization. Here we choose the first, setting the denominator equal to 1, which gives the output-maximization CCR model:
    \max E_p = \sum_{r=1}^{s} u_r y_{rp}
    s.t. \sum_{i=1}^{m} v_i x_{ip} = 1
         \sum_{r=1}^{s} u_r y_{rj} - \sum_{i=1}^{m} v_i x_{ij} \le 0,   j = 1, ..., n
         u_r, v_i \ge 0
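The output-maximization CCR model above is an ordinary linear program and can be solved per DMU with an off-the-shelf LP solver. A sketch (our own helper, using SciPy's `linprog`; the two-DMU example is illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import linprog

def ccr_efficiency(X, Y, p):
    """CCR efficiency of DMU p from the output-maximization LP above.

    X: (m inputs, n DMUs) input matrix; Y: (s outputs, n DMUs) output
    matrix. Decision variables are the multipliers [u_1..u_s, v_1..v_m].
    """
    m, n = X.shape
    s = Y.shape[0]
    # Objective: maximize u . y_p (linprog minimizes, so negate).
    c = np.concatenate([-Y[:, p], np.zeros(m)])
    # Normalization constraint: v . x_p = 1.
    A_eq = np.concatenate([np.zeros(s), X[:, p]])[None, :]
    b_eq = [1.0]
    # Ratio constraints: u . y_j - v . x_j <= 0 for every DMU j.
    A_ub = np.hstack([Y.T, -X.T])
    b_ub = np.zeros(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None))
    return -res.fun

# Two DMUs, one input, one output: the second DMU produces twice
# the output per unit input, so it is the efficient one.
X = np.array([[1.0, 1.0]])
Y = np.array([[1.0, 2.0]])
eff = [ccr_efficiency(X, Y, p) for p in range(2)]  # [0.5, 1.0]
```

DMUs with efficiency 1 lie on the efficient frontier; the rest are dominated by some combination of their peers.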

Experimental Evaluation
Table 1: Characteristics of the selected datasets
Row | Name                                   | Attributes | Instances | Classes | Attribute characteristics
1   | Breast Cancer Wisconsin (Diagnostic)   | 32         | 569       | 2       | Integer
2   | Statlog (Landsat Satellite) Data Set   | 36         | 6435      | 6       | Integer
3   | Statlog (Vehicle Silhouettes) Data Set | 18         | 946       | 4       | Integer

Our proposed model
The following steps show how our model works:
Step 1: Compute the entropy value of each attribute in each class: (a) first separate the dataset according to class, (b) then calculate the entropy of each attribute.
Step 2: Consider each attribute as a decision-making unit (DMU).
Step 3: Set the input of each DMU equal to 1.
Step 4: Set the outputs of each DMU equal to the entropy values obtained in Step 1.
Step 5: Compute the efficiency of each attribute.
Step 6: Select the efficient attributes.
Step 7: Apply other feature selection algorithms to the same datasets.
Step 8: Compare the results of our model with the results of Step 7.
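Steps 1-6 can be sketched as follows. This is our own reading of the procedure, not the authors' code: each attribute is a DMU with a single unit input and its per-class entropies as outputs, and because the input is fixed at v = 1 the CCR LP reduces to maximizing u . y_p subject to u . y_j <= 1. The helper names, the tolerance, and the synthetic data are assumptions; strictly positive attribute values are assumed for the entropy computation.

```python
import numpy as np
from scipy.optimize import linprog

def class_entropies(X, y):
    """Step 1: Shannon entropy of each attribute within each class.

    Returns a (classes x attributes) matrix, normalized by ln(m_c)
    where m_c is the number of samples in class c.
    """
    rows = []
    for c in np.unique(y):
        Xc = X[y == c]
        P = Xc / Xc.sum(axis=0)          # assumes positive values
        rows.append(-(P * np.log(P)).sum(axis=0) / np.log(Xc.shape[0]))
    return np.array(rows)

def dea_entropy_select(X, y, tol=1e-6):
    """Steps 2-6: each attribute is a DMU (input = 1, outputs = its
    per-class entropies); attributes with CCR efficiency 1 are kept."""
    Yout = class_entropies(X, y)         # (s outputs, n DMUs)
    s, n = Yout.shape
    eff = np.empty(n)
    for p in range(n):
        # With the unit input, v = 1, so the multiplier CCR LP is:
        # max u . y_p  s.t.  u . y_j <= 1 for all j,  u >= 0.
        res = linprog(-Yout[:, p], A_ub=Yout.T, b_ub=np.ones(n),
                      bounds=(0, None))
        eff[p] = -res.fun
    selected = np.flatnonzero(eff >= 1.0 - tol)
    return selected, eff

# Synthetic example: 40 samples, 3 attributes, 2 classes.
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 1.0, size=(40, 3))
y = np.array([0, 1] * 20)
selected, eff = dea_entropy_select(X, y)
```

At least one attribute always reaches efficiency 1 (for each output, the DMU with the largest value does), so the selected set is never empty.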

The selected features
Table 2: The selected features from the 1st dataset by different feature selection algorithms and our proposed model
Feature Selection Method | Selected features                                                     | Number of selected features
CfS Subset eval          | 2,7,8,14,19,21,23,24,25,27,28                                         | 11
Consistency subset eval  | 2,11,13,21,22,27,28,29                                                | 8
Filtered Subset eval     | 2,7,8,14,21,23,24,27,28                                               | 9
Our Proposed Model       | 7,8,11,13,14,16,17,20,26,27                                           | 10

Table 3: The selected features from the 2nd dataset by different feature selection algorithms and our proposed model
Feature Selection Method | Selected features                                                     | Number of selected features
CfS Subset eval          | 1,4,5,6,9,10,12,13,14,16,17,18,20,21,22,24,25,26,28,29,30,33,36       | 23
Consistency subset eval  | 1,2,7,10,11,17,18,24,28,29,31,33                                      | 12
Filtered Subset eval     | 1,4,5,6,9,10,12,13,14,16,17,18,20,21,22,24,25,26,28,29,30,33,36       | 23
Our Proposed Model       | 2,3,4,6,7,8,10,11,12,14,16,18,22,24,26,28,30,32,34,35,36              | 21

Table 4: The selected features from the 3rd dataset by different feature selection algorithms and our proposed model
Feature Selection Method | Selected features                                                     | Number of selected features
CfS Subset eval          | 4,5,6,7,8,9,11,12,14,15,16                                            | 11
Consistency subset eval  | 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18                          | 18
Filtered Subset eval     | 4,5,6,7,8,9,11,12,14,15,16                                            | 11
Our Proposed Model       | 3,4,6,7,8,11,12,13,15,16                                              | 10

Experimental Evaluation
Furthermore, we built new datasets from the selected features, classified them with three classification algorithms in the SPSS Clementine software, and compared their accuracy. We used 75% of each dataset for training and the rest for testing. The results are shown in Tables 5 to 7.
Table 5: Accuracy of the classification algorithms on the selected features of the Breast Cancer Wisconsin dataset
Feature Selection Method | SVM   | C5.0  | Logistic Regression | Average
CfS Subset eval          | 87.84 | 92.57 | 95.95               | 92.12
Consistency subset eval  | 93.24 | 92.57 | 98.65               | 94.82
Filtered Subset eval     | 87.84 | 91.22 | 96.62               | 91.89
Our Proposed Model       | 89.86 | 93.92 | 95.95               | 93.24

Experimental Evaluation
Table 6: Accuracy of the classification algorithms on the selected features of the Landsat Satellite dataset
Feature Selection Method | SVM   | C5.0  | Logistic Regression | Average
CfS Subset eval          | 86.84 | 85.42 | 84.98               | 85.74
Consistency subset eval  | 87.10 | 85.42 | 84.89               | 85.80
Filtered Subset eval     | 86.84 | 85.42 | 84.98               | 85.74
Our Proposed Model       | 88.96 | 86.04 | 85.42               | 86.80

Table 7: Accuracy of the classification algorithms on the selected features of the Vehicle Silhouettes dataset
Feature Selection Method | SVM   | C5.0  | Logistic Regression | Average
CfS Subset eval          | 58.74 | 66.50 | 67.96               | 64.40
Consistency subset eval  | 69.42 | 68.45 | 74.27               | 70.71
Filtered Subset eval     | 58.74 | 66.50 | 67.96               | 64.40
Our Proposed Model       | 61.17 | 73.30 | 64.56               | 66.34

Conclusion and future work
As shown in the previous section, applying Data Envelopment Analysis and the Entropy method for feature selection produces results comparable with the other methods, and in some cases better ones. Based on these results, we suggest that other researchers use different MCDM methods, such as TOPSIS and SAW, integrated with other weighting methods, such as the Expected Value method, for feature selection. Furthermore, our proposed model can be used to rank features rather than merely select them.

References
1. Shi Y, Peng Y, Kou G, et al. Introduction to data mining techniques via multiple criteria optimization approaches and applications. Data Warehousing and Mining: Concepts, Methodologies, Tools, and Applications 2008;3.
2. Yan H, Wei Q. Data envelopment analysis classification machine. Information Sciences 2011;181:5029-5041.
3. Zhang D, Shi Y, Tian Y, et al. A class of classification and regression methods by multiobjective programming. Frontiers of Computer Science in China 2009;3:192-204.
4. Allwein EL, Schapire RE, Singer Y. Reducing multiclass to binary: A unifying approach for margin classifiers. The Journal of Machine Learning Research 2001;1:113-141.
5. Qi Z, Tian Y, Shi Y. A nonparallel support vector machine for a classification problem with universum learning. Journal of Computational and Applied Mathematics 2014;263:288-298.
6. Chen Z-Y, Fan Z-P, Sun M. A hierarchical multiple kernel support vector machine for customer churn prediction using longitudinal behavioral data. European Journal of Operational Research 2012.
7. Anbazhagan S, Kumarappan N. A neural network approach to day-ahead deregulated electricity market prices classification. Electric Power Systems Research 2012;86:140-150.
8. Rennie JD, Rifkin R. Improving multiclass text classification with the support vector machine. 2001.
9. Smadja D, Touboul D, Cohen A, et al. Detection of subclinical keratoconus using an automated decision tree classification. American Journal of Ophthalmology 2013.
10. Zhang Z, Shi Y, Zhang P, et al. A rough set-based multiple criteria linear programming approach for classification. Computational Science - ICCS 2008: Springer, 2008;476-485.

References
11. Peng Y, Kou G, Chen Z, et al. Cross-validation and ensemble analyses on multiple-criteria linear programming classification for credit cardholder behavior. Computational Science - ICCS 2004: Springer, 2004;931-939.
12. Kwak W, Shi Y, Cheh JJ, et al. Multiple criteria linear programming data mining approach: An application for bankruptcy prediction. Data Mining and Knowledge Management: Springer, 2005;164-173.
13. De Stefano C, Fontanella F, Marrocco C, et al. A GA-based feature selection approach with an application to handwritten character recognition. Pattern Recognition Letters 2013.
14. Yao M, Qi M, Li J, et al. A novel classification method based on the ensemble learning and feature selection for aluminophosphate structural prediction. Microporous and Mesoporous Materials 2014;186:201-206.
15. Sakar CO, Kursun O, Gurgen F. A feature selection method based on kernel canonical correlation analysis and the minimum Redundancy Maximum Relevance filter method. Expert Systems with Applications 2012;39
16. Kohavi R, John GH. Wrappers for feature subset selection. Artificial Intelligence 1997;97:273-324.
17. Guyon I, Elisseeff A. An introduction to variable and feature selection. The Journal of Machine Learning Research 2003;3:1157-1182.
18. Sikora R, Piramuthu S. Framework for efficient feature selection in genetic algorithm based data mining. European Journal of Operational Research 2007;180:723-737.
19. Shirouyehzad H, Lotfi FH, Dabestani R. Aggregating the results of ranking models in data envelopment analysis by Shannon's entropy: a case study in hotel industry. International Journal of Modelling in Operations Management 2013;3:149-163.
20. Charnes A, Cooper WW, Rhodes E. Measuring the efficiency of decision making units. European Journal of Operational Research 1978;2:429-444.
21. Azadeh A, Saberi M, Moghaddam RT, et al. An integrated Data Envelopment Analysis - Artificial Neural Network - Rough Set algorithm for assessment of personnel efficiency. Expert Systems with Applications 2011;38:1364-1373.
22. Amado CA, Santos SP, Sequeira JF. Using Data Envelopment Analysis to support the design of process improvement interventions in electricity distribution. European Journal of Operational Research 2013.
23. Andersen P, Petersen NC. A procedure for ranking efficient units in data envelopment analysis. Management Science 1993;39:1261-1264.

Thank you