Available online at www.ijiere.com
International Journal of Innovative and Emerging Research in Engineering
e-ISSN: 2394-3343, p-ISSN: 2394-5494

Analysis of Various Techniques to Handling Missing Value in Dataset
Rajnik L. Vaishnav a, Dr. K. M. Patel b
a School of Engineering, RK University, Rajkot, India, rajnik.vaishnav55@gmail.com, +91-9033319249
b Associate Professor, School of Engineering, RK University, Rajkot, India, kamlesh.patel@rku.ac.in, +91-9099240604

ABSTRACT: Data mining has made great progress in recent years, but the problem of missing data or missing values remains a great challenge for the field. Data mining is the study of experimental data sets for the discovery of interesting and potentially useful relationships. Missing data or values in a dataset can degrade the performance of a classifier, which makes it difficult to extract useful information from the dataset. There are a number of alternative ways of dealing with missing data. Several methods, such as listwise deletion, pairwise deletion, mean imputation, regression, K-means imputation (KMI), fuzzy K-means clustering (FKMI) and support vector machine imputation (SVMI), impute missing values using the available values in the data set. In this study, these methods are reviewed and compared along with their advantages and disadvantages.

Keywords: missing value, data mining, KMI, FKMI, SVMI

I. INTRODUCTION
The aim of data mining is to extract knowledge from huge sets of data, and the knowledge that is mined should be useful and advantageous. The field draws on many areas such as medical diagnosis, databases, machine learning and statistical analysis. Data integrity is the foremost aim of a database; missing data degrade that integrity, so to avoid the degradation, or to improve or maintain data integrity, missing data must be imputed properly. Improper handling of missing values causes three major issues: 1. loss of information and, as a consequence, a loss of efficiency; 2. complications in data handling, computation and analysis due to irregularities in the data structures; 3. bias due to systematic differences between the observed and the missing data.[3]
Missing data are the absence of data items for a subject; they hide information that may be important. In practice, missing data have been a major factor affecting data quality, and their presence is a general and challenging problem in data analysis. Fortunately, missing-data imputation techniques can be used to improve data quality: imputation refers to any strategy that fills in the missing values of a data set so that standard analysis methods can be applied to the completed data set. Missing values occur when no value is stored for an attribute in the current record. Data may be missing because the value is not relevant to a particular case, could not be recorded when the data were collected, or was withheld by users because of privacy concerns. Most information systems have some missing values due to the unavailability of data; sometimes data are simply not recorded or become corrupted because of inconsistencies in data files. Missing data are therefore a common problem that can have a significant effect on the conclusions drawn from the data.[6]
Missing-data randomness is classified into three classes. Missing completely at random (MCAR): missing values are scattered randomly across all instances.
This is the highest level of randomness: it occurs when the probability of an instance having a missing value for an attribute depends neither on the known values nor on the missing data itself. At this level of randomness, any missing-data treatment method can be applied without the risk of introducing bias into the data.[2] Missing at random (MAR): MAR occurs when missing values are not randomly distributed across all observations but are randomly distributed within one or more subsamples; the probability of an instance having a missing value for an attribute may depend on the known values, but not on the value of the missing data itself.[8] Not missing at random (NMAR): the probability of an instance having a missing value for an attribute may depend on the value of that attribute itself; this is also called non-ignorable missingness.[9]
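To make these three mechanisms concrete, the short sketch below (not part of the original paper) simulates MCAR, MAR and NMAR missingness on a toy age/blood-pressure table; NumPy, pandas, the column names and the probability thresholds are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 80, size=1000),
    "blood_pressure": rng.normal(120.0, 15.0, size=1000),
})

# MCAR: every blood_pressure value has the same 10% chance of being missing,
# independent of age and of the blood_pressure value itself.
mcar = df.copy()
mcar.loc[rng.random(len(df)) < 0.10, "blood_pressure"] = np.nan

# MAR: the chance of a missing blood_pressure depends on an observed attribute
# (age), but not on the blood_pressure value itself.
mar = df.copy()
p_mar = np.where(mar["age"] > 60, 0.30, 0.05)
mar.loc[rng.random(len(df)) < p_mar, "blood_pressure"] = np.nan

# NMAR (non-ignorable): the chance of a missing blood_pressure depends on the
# unobserved blood_pressure value itself, e.g. high readings go unreported.
nmar = df.copy()
p_nmar = np.where(nmar["blood_pressure"] > 140, 0.40, 0.05)
nmar.loc[rng.random(len(df)) < p_nmar, "blood_pressure"] = np.nan
```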

II. PREVIOUS WORK
In this section we share our survey of imputation techniques such as mean-mode imputation, hot-deck imputation, k-nearest-neighbours imputation, multiple imputation and multivariate imputation by chained equations (MICE).
Mean-Mode imputation (MMimpute): MMimpute fills each missing value with the mean (or the mode for qualitative attributes) computed from all known values in the data set.
Hot-deck (HD) imputation: Given an incomplete pattern, HD replaces the missing data with values from the input vector that is closest in terms of the attributes that are known in both patterns. HD attempts to preserve the distribution by substituting different observed values for each missing item. The related cold-deck imputation method takes its values from a data source other than the current dataset.[3]
K-nearest-neighbours imputation (KNNimpute): For an incomplete pattern in the training dataset, the k nearest neighbours are selected using the known values so that they minimize some distance measure. Once the k nearest neighbours have been found, a replacement value to substitute for the missing attribute value must be estimated; how it is calculated depends on the type of data.
Regression: With regression imputation, a model is fitted on the observed features and the predicted values are used to fill in the missing values (MVs).[11]
Fuzzy K-means clustering (FKMI): In FKMI the membership function plays an important role. A membership function is assigned to every data object and describes the degree to which the object belongs to each cluster. Data objects are not allotted to a single concrete cluster denoted by its centroid (as in K-means); instead, every object has a membership degree with each of the K clusters. The missing attributes of each incomplete record are then substituted by FKMI on the basis of the membership degrees and the cluster centroid values.[9]
Support vector machine imputation (SVMI): SVMI is a regression-based method for imputing MVs. It swaps the roles of the attributes: the decision attribute (output) and the complete condition attributes are used as inputs, and SVM regression is then applied to predict the values of the condition attribute that has missing entries.
Classification algorithm: Classification is a supervised learning method, meaning that the learning of the classifier is supervised: it is told to which class each training tuple belongs. Data classification is a two-step process; first, a classifier is built that describes a predetermined set of data classes or concepts. The classification process therefore has two phases:[7]
Learning: the classification algorithm analyses the training data, and the classifier is represented in the form of classification rules. This phase can also be viewed as the learning of a mapping.
Figure 1. The learning phase.
Classification: test data are used to estimate the accuracy of the classification algorithm, and the rules can then be applied to the classification of new data tuples. The accuracy of a classifier on a given test set is the percentage of test-set tuples that are correctly classified by the classifier; the associated class label of each test tuple is compared with the class predicted for that tuple by the learned classifier.[14]
Figure 2. The classification phase.
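As an illustration of two of the reviewed techniques, the brief sketch below applies mean imputation and KNN imputation to a small toy matrix using scikit-learn's SimpleImputer and KNNImputer; the library choice, the toy values and the parameter k=2 are our assumptions, not the paper's.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy matrix with two missing entries (np.nan).
X = np.array([[25.0, 120.0],
              [40.0, np.nan],
              [np.nan, 135.0],
              [60.0, 150.0]])

# Mean imputation (MMimpute for numeric attributes): each missing entry is
# replaced by the mean of the observed values in its column.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation (KNNimpute): each missing entry is replaced by the average of
# that attribute over the k most similar records with an observed value.
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)

print(mean_imputed)
print(knn_imputed)
```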

III. THE TREATMENT OF MISSING VALUES
There are several methods for treating missing data; some of them are described below. Missing-data treatment methods can be divided into three categories, as proposed in [6].
A. Ignoring and discarding data: There are two main ways to discard data with missing values. The first, known as complete-case analysis, is available in all statistical programs and is the default method in many of them; it consists of discarding all instances with missing data. The second, known as discarding instances and/or attributes, consists of determining the extent of missing data in each instance and attribute and deleting the instances and/or attributes with high levels of missing data. Unfortunately, relevant attributes should be kept even when they have a high degree of missing values.
B. Parameter estimation: Maximum-likelihood procedures are used to estimate the parameters of a model defined for the complete data. Maximum-likelihood procedures that use variants of the Expectation-Maximization algorithm can handle parameter estimation in the presence of missing data.
C. Imputation: Imputation is a class of procedures that aims to fill in the missing values with estimated ones. The objective is to employ the known relationships that can be identified in the valid values of the data set to assist in estimating the missing values. This work focuses on imputation of missing data.
IV. CLUSTERING AND MISSING VALUE IMPUTATION
In this phase, clusters are generated by passing every record in every block as the seed-point record. The records in the training set are formed into a number of clusters in which the records with missing attributes are the seed points. In most clustering models the concept of similarity is based on distance; here a new approach to measuring similarity is explored, in which a record is said to be similar to the seed-point record only if it has the maximum weight. The training dataset is scanned to measure the similarity, and the standard deviation of the attributes is taken into account. Weights are assigned to the records based on the number of similar attributes, and the cluster is then formed by grouping the records with the higher weights. Finally, the missing attribute value(s) in each cluster are obtained by computing the mean value of the respective attribute(s) in the cluster, and a complete dataset is generated without any missing data.[15]
Fig. 3: Flow of the clustering algorithm used to impute the values.
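The following loose sketch illustrates the weighting-and-clustering idea just described: a record counts as similar to the seed record on an attribute when the two values differ by no more than that attribute's standard deviation, the records with enough similar attributes form the cluster, and the cluster mean fills the gap. The one-standard-deviation test, the min_weight threshold and the column-mean fallback are assumptions added for illustration; this is not the authors' exact algorithm.

```python
import numpy as np

def impute_by_similarity_cluster(data, min_weight=2):
    """Impute NaNs from a cluster of records similar to the incomplete record.

    A record is "similar" on an attribute when its value lies within one
    standard deviation of the seed record's value; records with at least
    min_weight similar attributes form the cluster, and the cluster mean of
    the missing attribute becomes the imputed value.
    """
    data = np.asarray(data, dtype=float)
    std = np.nanstd(data, axis=0)
    filled = data.copy()
    for i, row in enumerate(data):
        missing_cols = np.where(np.isnan(row))[0]
        if missing_cols.size == 0:
            continue
        # Weight of every record = number of attributes similar to the seed record.
        diffs = np.abs(data - row)            # NaN wherever either value is missing
        weights = (diffs <= std).sum(axis=1)  # NaN comparisons count as "not similar"
        for j in missing_cols:
            cluster = (weights >= min_weight) & ~np.isnan(data[:, j])
            if cluster.any():
                filled[i, j] = data[cluster, j].mean()
            else:
                filled[i, j] = np.nanmean(data[:, j])  # fall back to the column mean
    return filled

# Tiny illustrative example: two records have one missing attribute each.
X = np.array([[1.0, 10.0, 0.5],
              [1.1, np.nan, 0.6],
              [5.0, 50.0, 2.5],
              [np.nan, 11.0, 0.4]])
print(impute_by_similarity_cluster(X))
```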

V. PROS AND CONS OF DIFFERENT TECHNIQUES TO HANDLE MISSING VALUES

Listwise Deletion
Short review: Deletes every case that contains a missing value (the entire row is removed).
Pros: Simple to use.
Cons: Huge loss of data and information, loss of precision, high effect on variability, and induced bias.

Pairwise Deletion
Short review: Deletes records only from the column containing the missing values, so all other available values are kept.
Pros: Simple; only the missing values are deleted, so there is less loss of information, less effect on variability, and less loss of precision and bias.[2]
Cons: Still loses data; not a better solution compared with the imputation methods.

Mean / Most Common Imputation (MCI)
Short review: Replaces the MVs with the arithmetic mean (or the mode for qualitative attributes) of the observed data.
Pros: Simple to use; built into most statistical packages.
Cons: The mean and standard deviation after imputation may differ considerably from those of the original data; not a good substitution method.[2]

Regression
Short review: Replaces the MVs with values predicted from the observed values using a regression equation, e.g. Y = α0 + α1X.
Pros: The imputed data preserve deviations from the mean and the shape of the distribution.
Cons: Degrees of freedom are distorted and relationships between attributes may be artificially inflated.

Expectation Maximization (EM)
Short review: An iterative method that finds maximum-likelihood estimates in two steps, Expectation (E-step) and Maximization (M-step); the iteration continues until the algorithm converges.
Pros: Accuracy increases if the assumed model is right.
Cons: Takes time to converge; very complex.

KNN
Short review: Uses similar records to fill the MVs: the top k nearest records for Y are chosen and the average of their values becomes the imputed value of Y.[6]
Pros: MVs are imputed with realistically observed values, which avoids distortion of the distribution.
Cons: Accuracy estimation is somewhat empirical; problems arise if no other sample in the dataset is closely related to the incomplete one.

Hot deck
Short review: Replaces each missing value with an observed value from the most similar complete record in the same dataset, which helps preserve the distribution.[3]
Pros: Imputes values that were actually observed in the data.
Cons: Requires a sufficiently similar donor record to exist in the dataset.

K-Means (KMI)
Short review: Groups the records into K clusters and then uses a nearest-neighbour algorithm to impute the MVs in the same way as KNNI.
Pros: Fast, and hence good for large datasets; reduces the intra-cluster variance to a minimum.
Cons: Does not guarantee the globally minimum variance; the value of K is difficult to choose.

Fuzzy K-means clustering (FKMI)
Short review: The missing attributes of each incomplete record are substituted on the basis of the membership degrees and the cluster centroid values.[8]
Pros: Best results for overlapping data, better than K-means imputation; data objects may belong to more than one cluster centre.
Cons: High computation time; sensitive to noise (noisy objects receive a low or zero membership degree).

Support Vector Machine (SVMI)
Short review: Uses the decision attribute and the complete condition attributes as inputs; SVM regression is then applied to predict the values of the condition attribute with missing entries.[10]
Pros: Efficient in high-dimensional spaces; efficient memory consumption.
Cons: Poor performance when the number of samples is much smaller than the number of features.

Artificial Neural Network with Rough Set Theory (ANNRST)
Short review: Divided into two stages: attribute reduction with rough set theory (RST) and construction of an ANN to predict the missing attribute values.[2]
Pros: Generally yields better accuracy than the CMCI and KNN classifiers.
Cons: Complex computations; time consuming.
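To illustrate the regression entry above (Y = α0 + α1X), here is a brief sketch that fits a linear model on the observed (X, Y) pairs and uses its predictions to fill the missing Y values; the toy data and the use of scikit-learn's LinearRegression are illustrative assumptions, not the paper's prescription.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: Y has two missing entries to be filled from X via Y = a0 + a1 * X.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, np.nan, 6.2, np.nan, 10.1])

observed = ~np.isnan(Y)
model = LinearRegression().fit(X[observed].reshape(-1, 1), Y[observed])

Y_imputed = Y.copy()
Y_imputed[~observed] = model.predict(X[~observed].reshape(-1, 1))
print(Y_imputed)  # missing entries replaced by values on the fitted line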

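In the same spirit, a minimal sketch of SVM-based imputation along the lines of SVMI: an SVM regressor is trained on the complete rows to predict the attribute that has missing entries from the remaining attributes. The RBF kernel, the C value and the toy matrix are assumptions for illustration, not settings prescribed by the paper.

```python
import numpy as np
from sklearn.svm import SVR

# Toy dataset: column 2 has missing entries; columns 0-1 (plus the class label
# in a full SVMI setting) serve as inputs to an SVM regressor.
X = np.array([[1.0, 2.0, 10.0],
              [2.0, 1.0, 12.0],
              [3.0, 4.0, np.nan],
              [4.0, 3.0, 18.0],
              [5.0, 5.0, np.nan]])

target_col = 2
missing = np.isnan(X[:, target_col])

# Fit SVM regression on the complete rows, then predict the missing entries.
svr = SVR(kernel="rbf", C=10.0)
svr.fit(X[~missing][:, :target_col], X[~missing, target_col])

X_imputed = X.copy()
X_imputed[missing, target_col] = svr.predict(X[missing][:, :target_col])
print(X_imputed)
```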
VI. CONCLUSIONS
Extracting useful knowledge from a dataset requires a complete dataset, containing as few missing values as possible, before mining. Imputation methods are therefore widely used to fill in the missing values of different kinds of datasets. In this survey, an overall view of the imputation methods and their categories has been given, and it can be clearly seen that many methods have been proposed for handling the missing values present in a dataset. Further, these imputation methods have been compared along with their advantages and disadvantages.

REFERENCES
[1] B. Mehala, P. Ranjit Jeba Thangaiah, and K. Vivekanandan, "Selecting Scalable Algorithms to Deal With Missing Values," International Journal of Recent Trends in Engineering, Vol. 1, No. 2, May 2009.
[2] Dipak V. Patil, "Multiple Imputation of Missing Data with Genetic Algorithm based Techniques," IJCA Special Issue on Evolutionary Computation for Optimization Techniques (ECOT), 2010.
[3] A. Sumathi, "Missing Value Techniques: Depth Survey and an Algorithm to Improve the Efficiency of Imputation," IEEE Fourth International Conference on Advanced Computing (ICoAC), 2012.
[4] Naresh Rameshrao Pimplikar, Asheesh Kumar, and Apurva Mohan Gupta, "Study of Missing Value Methods," International Journal of Advanced Research in Computer Science and Software Engineering, 4(3), March 2014, pp. 1487-1491.
[5] Luis E. Zarate, Bruno M. Nogueira, Tadeu R. A. Santos, and Mark A. J. Song, "Techniques for Missing Value Recovering in Imbalanced Databases: Application in a Marketing Database with Massive Missing Data," 2006 IEEE International Conference on Systems, Man, and Cybernetics.
[6] N. Poolsawad, L. Moore, C. Kambhampati, and J. G. F. Cleland, "Handling Missing Values in Data Mining - A Case Study of Heart Failure Dataset," 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2012).
[7] Preeti Patidar and Durga Toshniwal, "Handling Missing Value in Decision Tree Algorithm," ACEEE.
[8] Kaiser, "Algorithm for Missing Values in Categorical Data with Use of Association Rules," The Ninth International Conference on Web-Age Information Management, IEEE, 2008.
[9] T. V. Rajnikanth, "An Enhanced Approach on Handling Missing Values Using Bagging k-NN," 2013 International Conference on Computer Communication and Informatics (ICCCI-2013), Jan. 04-06, 2013.
[10] Sandeep Kumar Singh and Archana Purwar, "Empirical Evaluation of Algorithms to Impute Missing Values for Financial Dataset," 2014 IEEE International Conference.
[11] Y. Kou, C.-T. Lu, and D. Chen, "Spatial weighted outlier detection," in Proceedings of the Sixth SIAM International Conference on Data Mining, pp. 614-618, Bethesda, Maryland, USA, 2006.
[12] Jerzy W. Grzymala-Busse and Ming Hu, "A Comparison of Several Approaches to Missing Attribute Values in Data Mining," in W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 378-385, Springer-Verlag Berlin Heidelberg, 2001.
[13] Dan Li, Jitender Deogun, William Spaulding, and Bill Shuart, "Towards Missing Data Imputation: A Study of Fuzzy K-means Clustering Method," Springer-Verlag Berlin Heidelberg, 2004.
[14] Feng Honghai, Chen Guoshun, Yin Cheng, Yang Bingru, and Chen Yumei, "A SVM Regression Based Approach to Filling in Missing Values," Springer-Verlag Berlin Heidelberg, 2005.
[15] Shamsher Singh and Jagdish Prasad, "Estimation of Missing Values in the Data Mining and Comparison of Methods," W. Ziarko and Y. Yao (Eds.): RSCTC 2000, LNAI 2005, pp. 378-385, Springer-Verlag Berlin Heidelberg, 2001.