Imputation of Missing Data: A Semi-Supervised Clustering Methodology


Ilango Paramasivam 1, Hemalatha Thiagarajan 2, Poonkuntran Shanmugam 3, Nickolas Savarimuthu 4

1 PhD Research Scholar, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, India; Assistant Professor, School of Computing Sciences, VIT University, Vellore, India.
2 Professor, Department of Mathematics, National Institute of Technology, Tiruchirappalli, India.
3 Lecturer, School of Computing Sciences, VIT University, Vellore, India.
4 Assistant Professor, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, India.
{ilangosarojini@gmail.com} {hema@nitt.edu} {s_poonkuntran@yahoo.co.in} {nickolas@nitt.edu}

Abstract: Data mining is being applied with success in different fields of human endeavor, including marketing, customer relationship management and healthcare. Real-world datasets are invariably accompanied by missing data, a major factor affecting data quality. Missing data has been a pervasive problem in data analysis since the origin of data collection. In the knowledge discovery process, missing data introduces bias into model evaluation and leads to inaccurate mining results. The objective of this research is to propose a semi-supervised clustering methodology for the imputation of missing data in databases. For this purpose, missing data are simulated on the complete Pima Indians Type II Diabetes dataset in order to evaluate the performance of the proposed algorithm, and the performance is compared with that of other existing imputation methods. The comparative analysis shows that the proposed method produces stable results and exhibits less variance in the error rate over different percentages of missing data than the other methods.

Keywords: Data mining, Missing data, Imputation methods, Semi-supervised clustering, Average Imputation Error.

I. INTRODUCTION

The revolution in storage technologies has made the storage of huge amounts of data in databases feasible. Specialized tools for access, analysis, knowledge discovery and effective use of the stored data are required to handle such extensive collections. The relatively young and growing fields of data mining and knowledge discovery generate valuable information for business processes dealing with such data. A data mining process typically consists of six steps: understanding the problem domain, understanding the data, data preparation, data mining, evaluation of the discovered knowledge and, finally, using the knowledge thus obtained (Hair et al. 1998). Real-world datasets suffer from questionable quality owing to incomplete, redundant and inaccurate data. These problems, if not addressed, will certainly affect the performance of data mining processes and the validity of the discovered patterns or knowledge, leading to erroneous predictions.

Hence, a lot of effort and time are spent on pre-processing the data. It is estimated that about 20 per cent of the effort goes into determining the business objectives, about 60 per cent into data preparation and about 10 per cent each into data mining and into the analysis and assimilation of the discovered knowledge (Cabena et al. 1998). The application of efficient and sound data pre-processing procedures can improve the quality of the data without losing any critical information. The better the quality of the data, the higher is the quality of the knowledge extracted from it. Widely used data pre-processing techniques that have proved useful in practice include data cleaning, integration and transformation (Han and Kamber 2006). In addition, feature selection, feature extraction, model formulation and discretization are widely applied (Kantardzic 2003) to extract highly informative features of the database relevant to the application.

Data cleaning, an important preliminary step in data mining, seeks to improve the quality of the data, making it more reliable for an effective mining process. Data cleaning algorithms attempt to smooth noise in the data, identify and eliminate inconsistencies, and remove missing values or replace them with values imputed from the rest of the data. Handling missing data is one of the major tasks in data cleaning. Datasets are frequently incomplete: incompleteness results from malfunctioning equipment, inadvertent deletion of data instances (or records), inconsistency with other recorded data, data being considered unimportant at the time of entry, data excluded owing to misunderstanding, and refusal to respond to queries. The absence of complete data then hampers decision-making, because decisions depend on complete information (Stefanakos and Athanassoulis 2001; Marwala et al. 2006). Generally, the presence of 1% missing data in a database is considered trivial and 1% to 5% is manageable; however, sophisticated methods and tools are required to handle 5% to 15% missing data.

Many methods for handling missing data are described in the literature (Han and Kamber 2006). They fall into one of two categories: i) use of only the complete data, by deleting the incomplete instances, and ii) imputation of the missing data. The strategy of deleting incomplete instances from the database can be used when the amount of missing data is small, but it leads to significant information loss as the amount of missing data increases (Allison 2001; Little and Rubin 2002). Ignoring or deleting the cases containing missing data and concentrating only on the sample units for which complete data are available (complete case analysis) might seem reasonable at first sight, and indeed it might appear that there is no other option. In doing so, however, analysts often throw away a large part of their data, especially when a data set contains many variables and whole records are deleted because only one or two variables were not measured.

Hence, the knowledge derived by data mining algorithms from the database after the deletion of incomplete instances will be biased. In response to these issues, significant effort has been devoted to developing and evaluating various imputation methodologies. Imputation is a method of filling in missing values by attributing to them values derived from the other available data; it is defined as the process of estimating the missing data of an observation based on valid values of other variables (Hair et al. 1998). Imputation minimizes bias in the mining process and preserves expensive-to-collect data that would otherwise be discarded (Marvin et al. 2003). It is important that the estimates for the missing values be accurate, as even a small number of biased estimates may lead to inaccurate and misleading results in the mining process.

This paper proposes a semi-supervised clustering methodology to impute the missing values in a database. In order to combine the benefits of supervised and unsupervised learning methods, semi-supervised clustering (Watanabe 1985) is used to guide the clustering process towards the effective imputation of missing data in the incomplete records. The incomplete records are designated as centre points for generating clusters, and the missing attribute value(s) of each centre record is/are then imputed with the mean value of the respective attribute(s) over the complete records in the corresponding cluster. The performance of the proposed method is evaluated by estimating the Average Imputation Error (Mariso 2005), which measures the difference between the imputed value and the actual value. To ensure the reliability of the evaluation, the experiments are performed on the complete Pima Indians Type II Diabetes dataset. Since the complete dataset was used for the experimental study, different percentages of missing data were generated by randomly labeling feature values as missing, and twenty simulations were conducted to overcome the bias due to this artificial, random introduction of missing values. The random introduction of missing data and the multiple simulations provide an unbiased platform for evaluating the efficacy of the imputation process. For comparison, five popular imputation methods are investigated in this work: k-Nearest Neighbours (k-NN), mean-based imputation, two correlation-based methods known as LSImpute_Rows and EMImpute_Columns (Trond et al. 2004), and a multiple imputation (MI) method referred to as NORM (Schafer 1999), which is based on the expectation maximization (EM) algorithm. The performance of the proposed method is compared with these imputation methods in terms of their average imputation errors for different percentages of missing data.

A review of the literature on different methods of handling missing data in various applications is presented in Section II. Section III describes the semi-supervised clustering methodology for the imputation of missing data and the experimental set-up for evaluating its performance. Section IV presents the performance analysis and Section V concludes the research work.

II. LITERATURE REVIEW

The data mining process deals primarily with prediction, estimation, classification, pattern recognition and the development of association rules. The reliability of the outcome of the data mining process depends heavily on the quality of the data and on the sample data chosen for model training and testing (Brown and Kros 2003). The databases in real-life applications are usually large, so the problems of incomplete, inaccurate and inconsistent data are inevitable (Laurance 2006). The presence of missing data in a database negatively impacts the performance of the data mining process (Hair et al. 1998). Missing values may appear either in the conditional attributes or in the class (target) attribute of the dataset. Numerous case studies on the imputation of missing data are found in the literature (Barnard and Meng 1999; Van et al. 1999). More importantly, research (Batista and Monard 2003) indicates that a meaningful treatment of missing data should always be independent of the problem being investigated.

Many approaches to dealing with missing values have been described (Han and Kamber 2006), for instance: ignore objects containing missing values; fill in the missing values manually based on the dependent value; or substitute the missing values with a global constant or the mean of the objects. Though the first approach is simple, it may result in the loss of too much useful data, whereas the second is time-consuming and leads to logical inconsistencies, so these approaches are not feasible for many applications. The third approach assumes that all the missing data have the same value, which would probably lead to considerable distortion of the data distribution. Moreover, these methods can be used only when the percentage of missing data is below 5%. Another common technique is to create a new data value (a special label such as "missing") and use it to represent missing entries. However, this has the unfortunate side effect that data mining algorithms may try to use "missing" as a legal value, which is inappropriate; it also sometimes artificially inflates the accuracy of some data mining algorithms on some datasets (Friedman et al. 1996). The best solution for handling incomplete data is to find the most probable value with which to fill each missing value. Imputation is therefore a popular strategy, and it uses as much information as possible from the observed data to predict the missing values (Zhang 2005).

Imputation techniques range from fairly simple ideas, such as using the mean or mode of the attribute as a replacement for a missing value (Little and Rubin 1986; Clark and Niblett 1989), to more sophisticated ones that use statistical models such as regression (Hu et al. 1998) or machine learning approaches such as Bayesian networks (Beaumont 2000) and decision-tree induction (Coppola et al. 2000). Using the mean or mode is generally considered a poor choice, because it distorts other statistical properties of the data, such as the variance, and does not take dependencies between attributes into account. Hot-deck imputation (Lakshminarayan et al. 1999) fills in missing data using values from other rows of the database that are similar to the row with the missing data; its performance therefore depends entirely on correctly identifying those similar rows.

Traditionally, techniques for the imputation of missing values can be roughly classified into parametric imputation (e.g. linear regression) and non-parametric imputation (e.g. non-parametric kernel-based regression methods and the Nearest Neighbours (NN) method). Parametric regression imputation is superior if a dataset can be adequately modeled parametrically, or if users can correctly specify the parametric forms for the dataset. For instance, linear regression methods are effective when the continuous target attribute is a linear combination of the conditional attributes. However, when the actual relation between the conditional attributes and the target attribute is not known, the performance of linear regression for imputing missing values is very poor. In real applications, if the model is ill-specified owing to a lack of knowledge about the distribution of the real dataset, the estimates of a parametric method may be highly biased and the optimal control factor settings may be miscalculated. Non-parametric imputation can provide a superior fit by capturing the structure of the dataset and offers a good alternative when users have no idea about the actual distribution of the data. For example, the NN method is regarded as one of the non-parametric techniques used to compensate for missing values in sample surveys (Chen and Shao 2001) where the nature and distribution of the data are not known. Using a non-parametric algorithm is beneficial when the form of the relationship between the conditional attributes and the target attribute is not known a priori (Lall and Sharma 1996).

In the K Nearest Neighbours (KNN) method of estimating missing data, a distance is assigned between all pairs of points in a dataset, defined as the Euclidean distance between the two points. From these distances, a distance matrix is constructed over all possible pairings of points (x, y). Each data point in the dataset has a class label from the set C = {C1, ..., Cn}. A data point's K closest neighbours are found by analyzing the distance matrix, and these K points are then examined to determine which class label is the most common among them; that label is assigned to the data point under consideration. The disadvantage is that if two or more class labels occur an equal number of times among a data point's K closest neighbours, the KNN test is inconclusive.
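To make this concrete, the sketch below shows k-nearest-neighbour imputation of a single missing numeric attribute. It is a minimal illustration of our own in Python/NumPy; the function name knn_impute and the variable names are ours, not taken from the papers cited above, and NumPy arrays are assumed as input.

import numpy as np

def knn_impute(incomplete_row, complete_rows, missing_idx, k=10):
    # Attributes that are observed in the incomplete record
    observed = [j for j in range(len(incomplete_row)) if j != missing_idx]
    # Euclidean distance from the incomplete record to every complete record,
    # computed over the observed attributes only
    diffs = complete_rows[:, observed] - incomplete_row[observed]
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    # The k closest complete records supply the replacement value (their mean)
    nearest = np.argsort(dists)[:k]
    return complete_rows[nearest, missing_idx].mean()

For a categorical attribute the mean would be replaced by a majority vote over the k neighbours, which is where the tie-break problem mentioned above arises.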

A general approach to handling missing data is to design data mining algorithms that internally handle missing data and still produce good results. For example, the CART decision-tree learning algorithm (Breiman et al. 1984) handles missing data internally, essentially using an implicit form of imputation based on regression. Regression imputation (Beaumont 2000) imputes missing data with predicted values derived from a regression equation built on the variables in the dataset that contain no missing data. Regression, however, assumes a specific relationship between attributes that may not hold for all datasets.

The widely used missing data imputation techniques KNN, NORM, EMImpute_Columns, LSImpute_Rows and Mean Imputation are investigated for comparative analysis in this paper. KNN imputes missing data by analyzing the K closest data points and identifying the most common value among them as the replacement for the missing data. NORM implements missing value estimation based on the expectation maximization algorithm (Schafer 1999). EMImpute_Columns and LSImpute_Rows are feature-based methods: in LSImpute_Rows, the least squares principle, which is based on minimizing the sum of squared errors of a regression model, is used to estimate missing data from the correlations between the reference record and other records, while EMImpute_Columns performs imputation using the relevant columns of the records. The Mean Imputation method fills the missing value with the arithmetic mean of the respective attribute over the dataset; it may unduly affect the other statistical properties of the dataset. In this research work, the performance of the proposed method is compared with those of KNN, NORM, EMImpute_Columns, LSImpute_Rows and Mean Imputation, all evaluated on the same dataset.

III. SEMI-SUPERVISED CLUSTERING METHODOLOGY

In this paper, a semi-supervised clustering methodology for imputing missing values in a database is proposed. Clustering is an unsupervised data mining technique for discovering patterns in a database: it is the process of dividing a set of objects into previously unknown groups such that objects within a group are highly similar to each other and dissimilar to objects in any other group. In order to combine the benefits of supervised and unsupervised learning methods, semi-supervised clustering (Watanabe 1985) has been proposed, which incorporates knowledge specific to the problem under analysis to guide the clustering process. The incomplete record(s) is/are assigned as centre point(s) for generating clusters. This facilitates the detection of the instances that should be placed in the same cluster and of those that should be separated into different clusters (Han and Kamber 2006).

The methodology is referred to as semi-supervised because clusters are generated for every incomplete record by assigning it as the centre or seed point. As instances in the same cluster are similar to each other, they share certain properties. The value(s) of the missing attribute(s) in the centre point record of each cluster is/are imputed by computing the mean value of the respective attribute(s) of the records in the cluster.

Algorithm SESU_CLUST_IMPUTE ()
// D: Dataset, M: Number of records, N: Number of attributes,
// B: Block(s) of records with missing data, Attr: Attributes,
// C: Clusters, R: Record
Do IMPUTE (D)
{
    For i = 1 to N do
    {
        // formulate b_i as the set of records R_j whose i-th attribute is missing
        b_i = Ø;
        For j = 1 to M do
        {
            If Value(Attr_i(R_j)) = ? then b_i = b_i U {R_j};
        }
    }
    For i = 1 to N
    {
        CLUSTER (D, C, b_i);
        IMPUTE (C, b_i);
    }
}

Testing dataset

The database is searched to identify and group the records with the same set of missing attribute(s) into blocks. A maximum of (2^n - 2) blocks can be generated, where n is the total number of attributes in the dataset. In our experiments only one attribute at a time was considered missing; this was repeated for each attribute, resulting in eight experiments with 20 replications per attribute. The records with a single missing attribute are collectively treated as the testing dataset, and the incomplete records in these blocks are passed as centre points to the clustering process. The remaining complete records form the training dataset.
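A minimal sketch of this block-formation step is given below. It is our own Python illustration (the name form_blocks is hypothetical, not from the paper) and assumes the dataset is a NumPy array in which missing entries are encoded as NaN.

import numpy as np
from collections import defaultdict

def form_blocks(data):
    # Group record indices by their pattern of missing attributes; records
    # with no missing values make up the training dataset, the rest form blocks.
    blocks = defaultdict(list)
    training = []
    for idx, row in enumerate(data):
        pattern = tuple(np.where(np.isnan(row))[0])
        if pattern:                     # at most 2**n - 2 non-empty patterns
            blocks[pattern].append(idx)
        else:
            training.append(idx)
    return blocks, training

Each incomplete record in a block then serves as the seed (centre point) of its own cluster, as described below.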

Weight Generation and Normalization

Weight is used as the similarity measure for forming clusters. The weight counts the number of similar attributes between an incomplete record of the testing dataset and a complete record of the training dataset. The similarity of an attribute is measured using a function based on the dispersion of that attribute in the complete dataset, i.e. its standard deviation; the standard deviation (σ) is the statistical parameter that describes the spread of the values in a dataset (Ward 2004). An attribute in a complete record of the training dataset is considered similar to the corresponding attribute in the incomplete record of the testing dataset if and only if the attribute value of the complete record lies in the range (attribute value of the incomplete record ± 1σ), where σ is the standard deviation of the attribute under consideration. In our work, the search for similar records is carried out attribute by attribute with the incomplete record as the centre point, so an attribute of a training record is deemed similar to the corresponding attribute of the incomplete record only if it lies within this spread around the incomplete record's value. The weights are then assigned using equation (1):

Weight of record R in the training dataset:  W(R) = Σ_i W(R_i)   (1)

where W(R_i) = 1 if α - σ ≤ β ≤ α + σ and zero otherwise. The summation is taken over the non-missing attributes of the record in the testing dataset; α is the attribute value of the incomplete record, β is the attribute value of the complete record in the training dataset and σ is the standard deviation of the respective attribute.

A normalization step is carried out on the weights obtained above, so as to generalize the similarity measurement between records to any dataset, irrespective of the number of attributes. The weights are normalized using max-difference normalization: the weight defined in equation (1) is divided by (n - missing(a)), as given in equation (2).

Normalized weight of R with respect to a:  NW_a(R) = W(R) / (n - missing(a))   (2)

where W(R) is the weight generated for a complete record R of the training dataset with respect to the incomplete record a chosen as seed point, n is the number of attributes in the dataset and missing(a) is the number of missing attributes in the incomplete record a.
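The sketch below shows how equations (1) and (2) could be computed. It is our own illustration in Python/NumPy under the assumption that missing values are encoded as NaN; the function name normalized_weight is hypothetical and not from the paper.

import numpy as np

def normalized_weight(incomplete, complete, sigma):
    # incomplete: record with NaN for its missing attributes (the cluster seed)
    # complete:   a fully observed record from the training dataset
    # sigma:      per-attribute standard deviations of the complete dataset
    observed = ~np.isnan(incomplete)
    # Equation (1): count the attributes of the complete record that fall
    # within one standard deviation of the seed record's attribute value
    similar = np.abs(complete[observed] - incomplete[observed]) <= sigma[observed]
    weight = similar.sum()
    # Equation (2): divide by the number of non-missing attributes of the seed
    return weight / observed.sum()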

Cluster generation and Imputation of missing data

Finally, clusters are generated with every incomplete record of every block of the testing dataset as a centre point; the normalized weight plays the central role in cluster formation. If a is a member of the testing dataset (i.e. it has missing attributes), then a complete record R belongs to the cluster C_a centred at a if NW_a(R) ≥ 0.6. The threshold value of 0.6 was used so that the members of the cluster are highly similar to the centre point. The clusters thus formed have the incomplete record as centre point and a set of similar complete records as members. The value(s) of the missing attribute(s) in the centre point record of each cluster is/are imputed by computing the mean value of the respective attribute(s) of the records in the cluster. The difference between the Mean Imputation method and the proposed method is that here the mean is computed only from the set of similar records that are grouped into the cluster.
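Putting the pieces together, a minimal sketch of the cluster-formation and imputation step is given below. It is again our own Python illustration with hypothetical names; it relies on the normalized_weight function sketched above, assumes NaN-encoded missing values, and assumes each cluster is non-empty.

import numpy as np

def sesu_clust_impute(seed, training, sigma, threshold=0.6):
    # Cluster: all complete records whose normalized weight with respect to
    # the seed record reaches the 0.6 similarity threshold
    members = [row for row in training
               if normalized_weight(seed, row, sigma) >= threshold]
    imputed = seed.copy()
    for j in np.where(np.isnan(seed))[0]:
        # Impute each missing attribute with the mean of that attribute
        # over the cluster members
        imputed[j] = np.mean([row[j] for row in members])
    return imputed

With this threshold, a complete record joins the cluster only if, for at least 60 per cent of the seed record's observed attributes, the complete record's value lies within one standard deviation of the seed's value.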

Experimental Set up

Machine learning datasets are described by conditional attributes and a decision attribute. The conditional attributes characterize the different parameters of the entity and are interdependent, and the decision attribute is derived from the values of the conditional attributes for every entity in the dataset; the value of every attribute therefore plays a vital role in deriving the value of the decision attribute. The records of a dataset are classified into different target groups based on the values of the attributes. The objective of this research work is to evaluate the proposed methodology for the imputation of missing values against other known methods, and the reliability of the performance evaluation is very important. Hence, a complete dataset, the Pima Indians Type II Diabetes dataset from the UCI repository of machine learning databases (Asuncion and Newman 2007), was taken for the experimental study and performance evaluation. The dataset comprises 768 complete instances described by eight features: Number of times pregnant, Glucose tolerance test, Diastolic blood pressure, Triceps skin fold thickness, 2-hour serum insulin, Body Mass Index, Diabetes pedigree function and Age. The predictive class value 1 is interpreted as "tested positive for diabetes" and class value 0 as "tested negative for diabetes"; class value 0 was found in 500 instances and class value 1 in 268 instances. The dataset has only numerical attributes and each record is viewed as a multi-dimensional vector. All the attributes of the dataset are considered in turn in the experiments, as the decision attribute is derived from these attributes. Different percentages of missing data (from 5% to 80%) are generated on the complete dataset by randomly labeling feature values as missing; the affected records form the testing dataset and the remaining complete records collectively form the training dataset. Twenty simulations were conducted to overcome any bias in the performance of the proposed methodology due to the random introduction of missing values. The results are validated by estimating the average imputation error (E), defined as

E = (1/m) Σ_{k=1..m} (1/n) Σ_{i=1..n} |O_ij - I_ij| / (Max_j - Min_j)

where n is the number of imputed values, m is the number of random simulations for each missing value, O_ij is the original value and I_ij the imputed value of attribute j in the i-th imputation, Max_j and Min_j are the maximum and minimum values of attribute j, and j is the attribute under consideration. The performance of the proposed method is compared with other widely used imputation methods, namely 10-NN, NORM, EMImpute_Columns, LSImpute_Rows and Mean Imputation.
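As an illustration of this error measure, the sketch below computes E for a single attribute. It is our own Python rendering of the formula above (the function and variable names are ours), assuming the original and imputed values are collected per simulation run.

import numpy as np

def average_imputation_error(originals, imputations, max_j, min_j):
    # originals, imputations: arrays of shape (m simulations, n imputed values)
    # for one attribute j, whose observed range is [min_j, max_j]
    originals = np.asarray(originals, dtype=float)
    imputations = np.asarray(imputations, dtype=float)
    per_value = np.abs(originals - imputations) / (max_j - min_j)
    # Average over the n imputed values, then over the m random simulations
    return per_value.mean(axis=1).mean()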

IV. PERFORMANCE ANALYSIS

The performance of the semi-supervised clustering methodology on the imputation of missing data is shown in Table I, which reports the computed average imputation error for all eight attributes over different percentages of missing data. The average imputation error varies significantly from attribute to attribute owing to the nature and distribution of the respective attribute in the dataset. The overall average imputation error varies from a minimum of ... to a maximum of ... over the different percentages of missing data from 5% to 80%.

Table I: Average Imputation Error for the attributes of the Pima Indians Type II dataset (rows: sizes of missing data from 5% to 80%; columns: Number of times pregnant, Glucose tolerance test, Diastolic blood pressure, Triceps skin fold thickness, 2-hour serum insulin, Body Mass Index, Diabetes pedigree function, Age, and the Average over the attributes)

Figure 1: Performance of the Semi-Supervised Clustering methodology on the imputation of missing data

Figure 1 shows the impact of the size of the missing data on the estimated average imputation error. Within each attribute, the average imputation error shows very little variation with respect to the size of missing data. With an increase in the size of missing data, the availability of complete records in the training dataset is reduced. The method is robust in that it is capable of forming clusters of similar records when up to 50% of the data is missing, with no noticeable increase in error levels. With more than 50% missing data, cluster formation is less effective owing to the sparse availability of similar records in the training dataset; hence the imputation of missing data is less consistent with the observed values when the size of missing data exceeds 50%.

Comparative Analysis

The performance of the proposed method is compared with the other imputation methods, namely 10-NN, NORM, EMImpute_Columns, LSImpute_Rows and Mean Imputation, in Table II. As a sample of performance, up to 35% of missing data is taken up for comparison. The imputation process in all the methods uses the complete records of the dataset, and the accuracy of the imputed values normally depends on the size of the missing data and on the availability of complete instances for the imputation process. Hence, missing data is simulated up to about one-third of the dataset, i.e. a maximum of 35% missing data is taken for the comparative performance analysis. Table II shows the mean imputation error ± standard deviation across the simulations for each method and each size of missing data, correct to one decimal place.

Table II: Comparative performance of various imputation methods (mean imputation error ± standard deviation)

Method               5%        10%   15%   20%   25%   30%   35%
10-NN                10.2±...  ...   ...   ...   ...   ...   ...±10.5
NORM                 16.0±...  ...   ...   ...   ...   ...   ...±13.9
EMImpute_Columns     8.5±...   ...   ...   ...   ...   ...   ...±22.3
LSImpute_Rows        10.9±...  ...   ...   ...   ...   ...   ...±23.3
Mean Imputation      12.5±...  ...   ...   ...   ...   ...   ...±10.3
SESU_CLUST           11.0±...  ...   ...   ...   ...   ...   ...±7.4

Figure 2: Comparative performance of various methods of missing data imputation (Average Imputation Error)

Figure 3: Comparative performance of various methods of missing data imputation (Standard Deviation)

Table II shows the comparative performance of the imputation methods on the Pima Indians Type II Diabetes dataset for different percentages of missing data up to 35%. It is observed from Table II that the NORM method produces the highest mean error rate and hence the least accurate estimates. On the other hand, EMImpute_Columns shows the lowest mean imputation error rate among the existing methods, with stability in the imputation process across different sizes of missing data. The proposed methodology performs better than Mean Imputation and NORM, with an average imputation error ranging from 10.9 to ...

Though the 10-NN and LSImpute_Rows methods show better performance in terms of average imputation error than the proposed semi-supervised clustering methodology SESU_CLUST, a noticeable increase in their mean error rates is found as the size of the missing data increases. It is observed from Table II and Figure 2 that the proposed methodology SESU_CLUST achieves a stable error rate for increasing percentages of missing data.

Figures 2 and 3 present the comparative performance of the various missing data imputation methods in terms of average imputation error and standard deviation, respectively. The standard deviation is a valid measure of dispersion about the centre of a data series: the higher the standard deviation, the wider the interval over which the imputed values are distributed (Ward 2004). It is observed from Table II and Figure 3 that LSImpute_Rows shows the highest standard deviation, which varies from 23.3 to 24.0, and hence its mean error rate is widely dispersed over the range (mean error rate ± σ). EMImpute_Columns stands next to LSImpute_Rows, with its standard deviation varying from 22.3 to ...; though this method produced the lowest average error rate among the five existing methods, its range of mean error rate is wide because the standard deviation is high. In the NORM method the standard deviation varies from 13.4 to 13.9, with the mean error rate varying from 15.7 to .... Mean Imputation performs with a mean error rate varying from 12.3 to 12.5 and a standard deviation varying from 10.3 to 10.5, which is higher than that of the proposed methodology SESU_CLUST. The range (mean ± σ) for the proposed methodology SESU_CLUST varies from 3.8 to 18.2 (minimum range width 14.4), attained at 5% missing data, to 3.5 to 18.5 (maximum range width 15), reached in the 30% case; the range widths for the other sizes of missing data lie between 14.6 and 15. There is a significant reduction in the lengths of these intervals, which is due to the uniformly smaller standard deviation values of SESU_CLUST compared with those of 10-NN, even though its mean values are slightly higher.

The proposed methodology SESU_CLUST has the least standard deviation of all the methods compared. As the method imputes missing values using a set of similar complete records grouped into clusters, it results in the lowest standard deviation, and the lower standard deviation indicates that the imputed values are less dispersed from the centre of the data series. As the imputation process depends entirely on the clusters generated with the incomplete record as centre point, an imputed value that is little dispersed around the centre point (mean value) of the cluster is a valid representation or replacement for the missing attribute(s) in the incomplete record.

V. CONCLUSION

The effective use of information technology is crucial for organizations to stay competitive in today's complex, evolving environment. Organizations face many challenges when trying to deal with large, diverse and often complex databases, and they adopt several strategies to improve the quality of the data in their databases. Missing data, one of the pervasive problems in data analysis, is handled using various strategies depending on the problem context. The proposed SESU_CLUST method imputes missing values by exploiting the concept of semi-supervised clustering. The method performs relatively better than the other methods examined, producing stable results for up to 50% missing data with the lowest standard deviation of the error. The weight normalization facilitates the application of the method to any dataset, irrespective of the number of attributes. The performance of the method can be further improved by generating clusters with higher intra-cluster similarity for the imputation process.

REFERENCES

Allison, P.D. (2001). Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences. Sage, Thousand Oaks, CA.
Asuncion, A., Newman, D.J. (2007). UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine, CA.
Barnard, J., Meng, X. (1999). Applications of multiple imputations in medical studies: from AIDS to NHANES. Statistical Methods in Medical Research, Vol. 8.
Batista, G., Monard, M. (2003). An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17(5/6).
Beaumont, J.-F. (2000). On regression imputation in the presence of non-ignorable non-response. In: Proceedings of the Survey Research Methods Section, ASA.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and Regression Trees. Chapman and Hall.
Brown, M.L., Kros, J.F. (2003). The impact of missing data on data mining. In: Data Mining: Opportunities and Challenges, J. Wang, Ed. IGI Publishing, Hershey, PA.
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., Zanasi, A. (1998). Discovering Data Mining: From Concepts to Implementation. Prentice-Hall, Upper Saddle River, NJ.
Chen, J., Shao, J. (2001). Jackknife variance estimation for nearest-neighbor imputation. Journal of the American Statistical Association, Vol. 96.
Clark, P., Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3(4).

Friedman, J.H., Kohavi, R., Yun, Y. (1996). Lazy decision trees. In: Proceedings of the 13th AAAI and 8th IAAI.
Hair, J., Anderson, R., Tatham, R., Black, W. (1998). Multivariate Data Analysis. Prentice Hall, Upper Saddle River, NJ.
Han, J., Kamber, M. (2006). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Mateo, CA.
Hu, M., Salvucci, S.M., Cohen, M.P. (1998). Evaluation of some popular imputation algorithms. In: Proceedings of the Survey Research Methods Section, ASA.
Kantardzic, M. (2003). Data Mining: Concepts, Models, Methods and Algorithms. Wiley-IEEE Computer Society Press, New York, NY.
Cios, K.J., Kurgan, L. (2002). Trends in data mining and knowledge discovery. In: Knowledge Discovery in Advanced Information Systems. Springer, Berlin.
Coppola, L., Di Zio, M., Luzi, O., Ponti, A., Scanu, M. (2000). Bayesian networks for imputation in official statistics: a case study. In: DataClean Conference.
Lakshminarayan, K., Harp, S.A., Samad, T. (1999). Imputation of missing data in industrial databases. Applied Intelligence, 11(3).
Lall, U., Sharma, A. (1996). A nearest-neighbor bootstrap for re-sampling hydrologic time series. Water Resources Research, Vol. 32.
Laurance, J. (2006). Breast cancer cases rise 80% since Seventies. The Independent.
Little, R.J.A., Rubin, D.B. (1986). Statistical Analysis with Missing Data. John Wiley & Sons, Inc., USA.
Little, R.J.A., Rubin, D.B. (2002). Statistical Analysis with Missing Data, second edition. John Wiley and Sons, Hoboken, NJ.
Mariso Giardina, Yongyang Huo, Francisco Azuaje, Paul McCullagh, Roy Harper (2005). A missing data estimation analysis in Type II diabetes databases. In: Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems (CBMS'05). IEEE.
Marvin L. Brown, John F. Kros (2003). Data mining and the impact of missing data. Industrial Management & Data Systems, 103/8.
Marwala, T., Chakraverty, S., Mahola, U. (2006). Fault classification using multi-layer perceptrons and support vector machines. International Journal of Engineering Simulation, 7(1).
Schafer, J.L. (1999). NORM: multiple imputation of incomplete multivariate data under a normal model, version 2.03. Software for Windows 95/98/NT.
Stefanakos, C., Athanassoulis, G.A. (2001). A unified methodology for analysis, completion and simulation of non-stationary time series with missing values, with application to wave data. Applied Ocean Research, 23.

Trond, H., Bjarte, D., Inge, J. (2004). LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Research, 32(3).
Van Buuren, S., Boshuizen, H., Knook, D. (1999). Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine, Vol. 18.
Ward, B. (2004). The Best of Both Worlds: A Hybrid Statistics Course. Journal of Statistics Education [Online], 12(3).
Watanabe, S. (1985). Pattern Recognition: Human and Mechanical. John Wiley and Sons, Inc., New York, USA.
Zhang, S.C., et al. (2005). Missing is useful: missing values in cost-sensitive decision trees. IEEE Transactions on Knowledge and Data Engineering, Vol. 17(12).

Authors' Profiles

Ilango Paramasivam received his Master's degree in Computer Applications from Alagappa University, Karaikudi, India. He is pursuing a Ph.D. in Data Mining in the Department of Computer Applications, National Institute of Technology (NIT), Tiruchirappalli. He has 16 years of postgraduate teaching experience in Computer Science and Engineering. He is currently working as Assistant Professor in the Intelligent Systems Division, School of Computing Sciences, VIT University, Vellore. His major research interests include Data Mining, Machine Learning and Distributed Computing.

Hemalatha Thiagarajan received her Ph.D. degree from the University of Texas at Austin. She has 25 years of graduate teaching experience in Mathematics and Computer Science. She is currently working as Professor in the Department of Mathematics, National Institute of Technology (NIT), Tiruchirappalli. Her major research interests are Algorithms and Operations Research.

Poonkuntran Shanmugam received his B.E. in Information Technology from Bharathidasan University, Tiruchirappalli, India, and his M.Tech. in Computer and Information Technology from Manonmaniam Sundaranar University, Tirunelveli. He is currently pursuing a Ph.D. in the Department of Computer Science and Engineering, Manonmaniam Sundaranar University, Tirunelveli. He has 4 years of experience in teaching and research and is currently working on security models for medical image processing. His areas of interest are digital image processing, soft computing and energy-aware computing in computer vision.

Nickolas Savarimuthu received his M.E. and Ph.D. degrees in Computer Science, both from the National Institute of Technology (NIT), Tiruchirappalli. He is an Assistant Professor in the Department of Computer Applications, NIT, Tiruchirappalli, and has more than 20 years of teaching and research experience. His major research interests include data preprocessing and data mining in general, and applications of data mining in particular.


More information

Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations

Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations Binomol George, Ambily Balaram Abstract To analyze data efficiently, data mining systems are widely using datasets

More information

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University

More information

Impact of Boolean factorization as preprocessing methods for classification of Boolean data

Impact of Boolean factorization as preprocessing methods for classification of Boolean data Impact of Boolean factorization as preprocessing methods for classification of Boolean data Radim Belohlavek, Jan Outrata, Martin Trnecka Data Analysis and Modeling Lab (DAMOL) Dept. Computer Science,

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

More information

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,

More information

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM. DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,

More information

Financial Trading System using Combination of Textual and Numerical Data

Financial Trading System using Combination of Textual and Numerical Data Financial Trading System using Combination of Textual and Numerical Data Shital N. Dange Computer Science Department, Walchand Institute of Rajesh V. Argiddi Assistant Prof. Computer Science Department,

More information

ETL PROCESS IN DATA WAREHOUSE

ETL PROCESS IN DATA WAREHOUSE ETL PROCESS IN DATA WAREHOUSE OUTLINE ETL : Extraction, Transformation, Loading Capture/Extract Scrub or data cleansing Transform Load and Index ETL OVERVIEW Extraction Transformation Loading ETL ETL is

More information

HIGH DIMENSIONAL UNSUPERVISED CLUSTERING BASED FEATURE SELECTION ALGORITHM

HIGH DIMENSIONAL UNSUPERVISED CLUSTERING BASED FEATURE SELECTION ALGORITHM HIGH DIMENSIONAL UNSUPERVISED CLUSTERING BASED FEATURE SELECTION ALGORITHM Ms.Barkha Malay Joshi M.E. Computer Science and Engineering, Parul Institute Of Engineering & Technology, Waghodia. India Email:

More information

ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS

ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS Abstract D.Lavanya * Department of Computer Science, Sri Padmavathi Mahila University Tirupati, Andhra Pradesh, 517501, India lav_dlr@yahoo.com

More information

Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis

Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis Int. Journal of Math. Analysis, Vol. 5, 2011, no. 1, 1-13 Review of the Methods for Handling Missing Data in Longitudinal Data Analysis Michikazu Nakai and Weiming Ke Department of Mathematics and Statistics

More information

Neural Networks in Data Mining

Neural Networks in Data Mining IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V6 PP 01-06 www.iosrjen.org Neural Networks in Data Mining Ripundeep Singh Gill, Ashima Department

More information

A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms

A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms Ian Davidson 1st author's affiliation 1st line of address 2nd line of address Telephone number, incl country code 1st

More information

Effective Analysis and Predictive Model of Stroke Disease using Classification Methods

Effective Analysis and Predictive Model of Stroke Disease using Classification Methods Effective Analysis and Predictive Model of Stroke Disease using Classification Methods A.Sudha Student, M.Tech (CSE) VIT University Vellore, India P.Gayathri Assistant Professor VIT University Vellore,

More information

Enhanced Boosted Trees Technique for Customer Churn Prediction Model

Enhanced Boosted Trees Technique for Customer Churn Prediction Model IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V5 PP 41-45 www.iosrjen.org Enhanced Boosted Trees Technique for Customer Churn Prediction

More information

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical

More information

Grid Density Clustering Algorithm

Grid Density Clustering Algorithm Grid Density Clustering Algorithm Amandeep Kaur Mann 1, Navneet Kaur 2, Scholar, M.Tech (CSE), RIMT, Mandi Gobindgarh, Punjab, India 1 Assistant Professor (CSE), RIMT, Mandi Gobindgarh, Punjab, India 2

More information

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Mobile Phone APP Software Browsing Behavior using Clustering Analysis Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis

More information

Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy

Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy Astronomical Data Analysis Software and Systems XIV ASP Conference Series, Vol. XXX, 2005 P. L. Shopbell, M. C. Britton, and R. Ebert, eds. P2.1.25 Making the Most of Missing Values: Object Clustering

More information

The Probit Link Function in Generalized Linear Models for Data Mining Applications

The Probit Link Function in Generalized Linear Models for Data Mining Applications Journal of Modern Applied Statistical Methods Copyright 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 164-169 1538 9472/13/$95.00 The Probit Link Function in Generalized Linear Models for Data Mining Applications

More information

A Divided Regression Analysis for Big Data

A Divided Regression Analysis for Big Data Vol., No. (0), pp. - http://dx.doi.org/0./ijseia.0...0 A Divided Regression Analysis for Big Data Sunghae Jun, Seung-Joo Lee and Jea-Bok Ryu Department of Statistics, Cheongju University, 0-, Korea shjun@cju.ac.kr,

More information

Selection of Optimal Discount of Retail Assortments with Data Mining Approach

Selection of Optimal Discount of Retail Assortments with Data Mining Approach Available online at www.interscience.in Selection of Optimal Discount of Retail Assortments with Data Mining Approach Padmalatha Eddla, Ravinder Reddy, Mamatha Computer Science Department,CBIT, Gandipet,Hyderabad,A.P,India.

More information

A Survey on classification & feature selection technique based ensemble models in health care domain

A Survey on classification & feature selection technique based ensemble models in health care domain A Survey on classification & feature selection technique based ensemble models in health care domain GarimaSahu M.Tech (CSE) Raipur Institute of Technology,(R.I.T.) Raipur, Chattishgarh, India garima.sahu03@gmail.com

More information

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing

More information

Data Cleaning and Missing Data Analysis

Data Cleaning and Missing Data Analysis Data Cleaning and Missing Data Analysis Dan Merson vagabond@psu.edu India McHale imm120@psu.edu April 13, 2010 Overview Introduction to SACS What do we mean by Data Cleaning and why do we do it? The SACS

More information

Application of Data Mining Methods in Health Care Databases

Application of Data Mining Methods in Health Care Databases 6 th International Conference on Applied Informatics Eger, Hungary, January 27 31, 2004. Application of Data Mining Methods in Health Care Databases Ágnes Vathy-Fogarassy Department of Mathematics and

More information

A Hybrid Model of Hierarchical Clustering and Decision Tree for Rule-based Classification of Diabetic Patients

A Hybrid Model of Hierarchical Clustering and Decision Tree for Rule-based Classification of Diabetic Patients A Hybrid Model of Hierarchical Clustering and Decision Tree for Rule-based Classification of Diabetic Patients Norul Hidayah Ibrahim 1, Aida Mustapha 2, Rozilah Rosli 3, Nurdhiya Hazwani Helmee 4 Faculty

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information

A Pseudo Nearest-Neighbor Approach for Missing Data Recovery on Gaussian Random Data Sets

A Pseudo Nearest-Neighbor Approach for Missing Data Recovery on Gaussian Random Data Sets University of Nebraska at Omaha DigitalCommons@UNO Computer Science Faculty Publications Department of Computer Science -2002 A Pseudo Nearest-Neighbor Approach for Missing Data Recovery on Gaussian Random

More information

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.

More information

E-commerce Transaction Anomaly Classification

E-commerce Transaction Anomaly Classification E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce

More information

Real Time Data Analytics Loom to Make Proactive Tread for Pyrexia

Real Time Data Analytics Loom to Make Proactive Tread for Pyrexia Real Time Data Analytics Loom to Make Proactive Tread for Pyrexia V.Sathya Preiya 1, M.Sangeetha 2, S.T.Santhanalakshmi 3 Associate Professor, Dept. of Computer Science and Engineering, Panimalar Engineering

More information

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

More information

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

In this presentation, you will be introduced to data mining and the relationship with meaningful use. In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine

More information

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether

More information

Local outlier detection in data forensics: data mining approach to flag unusual schools

Local outlier detection in data forensics: data mining approach to flag unusual schools Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential

More information

DATA PREPARATION FOR DATA MINING

DATA PREPARATION FOR DATA MINING Applied Artificial Intelligence, 17:375 381, 2003 Copyright # 2003 Taylor & Francis 0883-9514/03 $12.00 +.00 DOI: 10.1080/08839510390219264 u DATA PREPARATION FOR DATA MINING SHICHAO ZHANG and CHENGQI

More information

Implementation of Data Mining Techniques to Perform Market Analysis

Implementation of Data Mining Techniques to Perform Market Analysis Implementation of Data Mining Techniques to Perform Market Analysis B.Sabitha 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, P.Balasubramanian 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information