An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang 2

1 School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai, 200433, P.R. China. liupeng@mail.shufe.edu.cn
2 Health Care Computing Group, School of Computer Science, University of Westminster, London, Northwick Park, HA1 3TP, UK. {eldarze, chounp, vasilc, huangw}@wmin.ac.uk

Abstract. It is well accepted that many real-life datasets are full of missing data. In this paper we introduce, analyze and compare several well-known treatment methods for handling missing data, and propose new methods based on the Naive Bayesian classifier to estimate and replace missing data. We conduct extensive experiments on datasets from the UCI repository to compare these methods. Finally, we apply these models to a geriatric hospital dataset in order to assess their effectiveness on a real-life dataset.

1 Introduction

Data Mining (DM) is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories [1]. According to [2], about 20% of the effort in a data mining project is spent on problem and data understanding, about 60% on data preparation, and about 10% each on data mining and on the analysis of the discovered knowledge. Why is more than half of the project effort spent on data preparation? Real-world datasets suffer from serious data quality problems; those often encountered include incomplete, redundant, inconsistent or noisy data [2]. If not addressed, these quality problems reduce the performance of data mining algorithms. Hence, in many cases considerable effort is spent on the data preparation phase in order to achieve a good result. Missing data is a common issue in many real-life datasets. Rates of less than 1% missing data are generally considered trivial, and 1-5% manageable.
However, 5-15% requires sophisticated methods to handle, and more than 15% may severely impact any kind of interpretation [3]. This paper discusses and evaluates several treatment methods for missing data. Missing data mechanisms and the guidelines for treatment are presented in Section 2. Section 3 introduces some popular treatment methods for missing data and proposes a new model based on the Naive Bayesian classifier and information gain. Experimental analysis and model comparison are described in Section 4. The proposed models are applied to a hospital dataset and the results are reported in Section 5. Conclusions and further work are discussed in Section 6.

X. Li, S. Wang, and Z.Y. Dong (Eds.): ADMA 2005, LNAI 3584, pp. 583-590, 2005. © Springer-Verlag Berlin Heidelberg 2005
2 Missing Mechanism and Guidelines for Treatment

The effect of missing data treatment methods depends mainly on the missing data mechanism. According to [4], missing data can be classified into three categories: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR). Some data mining approaches, such as C4.5, treat missing data with internal algorithms. However, it is still important to construct complete datasets with missing data treatment methods, for two reasons [5]. First, any data mining approach can be used with a complete dataset, and second, a complete dataset provides a baseline for the comparison of data mining approaches. In general, treatment methods for missing data can be divided into three common approaches [3]:

Ignore or Discard the Instances Which Contain Missing Data.

Parameter Estimation. In this approach, variants of the Expectation-Maximization algorithm are used in maximum likelihood procedures to estimate model parameters in the presence of missing data. These methods are normally superior to ignore-or-discard methods. However, they are not widely applied, mainly because the assumed variable distributions cannot be easily derived and the calculations involved are highly complex [6].

Imputation Techniques. These use the information present in the dataset to estimate and replace missing data. They aim to recognize the relationships among the values in the dataset and to estimate missing data based on these relationships.

In general, missing data treatment methods should satisfy three rules. First, a treatment method should not change the distribution of the dataset. Second, the relationships among the attributes should be retained. Third, a method must not be too complex or computationally costly to be applied in real life.
3 Missing Data Treatment Methods

In this section we introduce some popular missing data treatment methods and our proposed models, which are based on the concept of information gain and the Naive Bayesian classifier.

3.1 Popular Treatment Methods

Case Deletion. This method discards the cases with missing data for at least one attribute. A variation of this method is to delete the instances or attributes which have a high missing rate. However, before deleting any attribute, it is necessary to run relevance analysis.

Constant Replacement. The mean (for numeric data) or mode (for nominal data) is used to replace missing data. To reduce the influence of exceptional values, the median can also be used.

Internal Treatment Method of C4.5. This method uses a probabilistic approach to handle missing data [7].
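Constant replacement amounts to a one-pass fill. The following is a minimal sketch (the function name and the use of None to mark missing values are our illustrative conventions, not from the paper):

```python
from statistics import mean, median, mode

def constant_replace(column, numeric=True, robust=False):
    """Fill None entries with the column mean (numeric) or mode (nominal).
    With robust=True the median is used instead of the mean, which reduces
    the influence of exceptional values."""
    present = [v for v in column if v is not None]
    if numeric:
        fill = median(present) if robust else mean(present)
    else:
        fill = mode(present)
    return [fill if v is None else v for v in column]
```

For example, `constant_replace([1.0, None, 3.0])` fills the gap with the mean 2.0, while `constant_replace(['a', 'a', None, 'b'], numeric=False)` fills it with the mode 'a'.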
K-Nearest Neighbor Imputation. This method uses the k-nearest neighbor (KNN) algorithm to replace missing data. Computational efficiency is a significant problem for this method: each time the KNN algorithm looks for the most similar instances, the whole dataset must be searched. Moreover, the choice of the value of k and of the similarity measure greatly influences the results.

3.2 Naive Bayesian Imputation (NBI)

The Naive Bayesian classifier is a popular classifier, not only for its good performance but also for its simple form [6]. It is not sensitive to missing data; in this respect, the prediction accuracy of the Naive Bayesian classifier is generally higher than that of C4.5 [1]. Our missing data treatment methods, based on the Naive Bayesian classifier and the concept of information gain and named Naive Bayesian Imputation (NBI), consist of two phases.

In phase 1, the order in which the attributes are to be treated is defined. The process treats attributes with missing data one by one, iteratively, and the order in which they are treated affects the overall results, because each attribute's importance for classification is different. The performance of a random order of attributes is therefore weak and unreliable. To overcome this inherent deficiency of random ordering we propose several methods:

(1) Using the original data to estimate all the missing data. This is independent of the order in which attributes are estimated.
(2) In descending order of missing rate. Initially, missing data are replaced in the attribute with the highest missing rate; the modified dataset is then used to estimate and replace missing data in the next attribute.
(3) In descending order of information gain. For the classification task of data mining, information gain reflects the importance of each attribute for classification.
(4) In descending order of an index weighted by missing rate and information gain. This method combines the missing rate and information gain concepts.
It proved to be effective.
(5) Our experiments show that, when using methods 3 or 4, the performance of the algorithm is best once the first n (usually 3-4) attributes have been treated. Therefore, treating the first three or four attributes in this order is enough; method 1 can be applied to the remaining attributes. In this way, quality and efficiency are balanced.

In phase 2, a Naive Bayesian classifier is built to estimate and replace the missing data, using the attribute defined in the first phase as the class attribute and the whole dataset as the training set. According to the different orderings of the attributes, NBI has five different variants, namely Model 1, Model 2, Model 3, Model 4 and Model 5.

4 Experimental Analysis

To compare the methods introduced in Section 3, we use three datasets from the UCI repository [8], namely Nursery, Crx and German (Table 1).

Table 1. Datasets summary

Dataset   Instances   Attributes   Classes
Nursery   12960       8            5
Crx       690         15           2
German    1000        20           2
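The two phases of NBI (Section 3.2) can be sketched for nominal attributes as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: rows are dicts of attribute values with None marking missing data, phase 1 is represented by the information gain used to rank attributes, and phase 2 by a naive Bayesian classifier with Laplace smoothing that treats the attribute being filled as the class:

```python
import math
from collections import Counter

def info_gain(rows, attr, target):
    """Information gain of `attr` with respect to `target`, computed on the
    rows where `attr` is present (used for the phase-1 attribute ordering)."""
    complete = [r for r in rows if r[attr] is not None]
    def entropy(values):
        n = len(values)
        return -sum(c / n * math.log2(c / n) for c in Counter(values).values())
    cond = sum(cnt / len(complete) *
               entropy([r[target] for r in complete if r[attr] == v])
               for v, cnt in Counter(r[attr] for r in complete).items())
    return entropy([r[target] for r in complete]) - cond

def nb_impute(rows, attr):
    """Phase 2: treat `attr` as the class attribute and predict its missing
    values with a naive Bayesian classifier trained on the complete cases."""
    others = [a for a in rows[0] if a != attr]
    train = [r for r in rows if r[attr] is not None]
    class_counts = Counter(r[attr] for r in train)
    for r in rows:
        if r[attr] is not None:
            continue
        best, best_score = None, float("-inf")
        for c, nc in class_counts.items():
            score = math.log(nc / len(train))          # log prior P(class=c)
            for a in others:
                if r[a] is None:
                    continue                           # skip still-missing evidence
                n_vals = len({t[a] for t in train})
                match = sum(1 for t in train if t[attr] == c and t[a] == r[a])
                score += math.log((match + 1) / (nc + n_vals))  # Laplace smoothing
            if score > best_score:
                best, best_score = c, score
        r[attr] = best                                 # replace the missing value
    return rows
```

An NBI variant then amounts to sorting the attributes with missing data (e.g. by descending `info_gain`) and calling `nb_impute` on each in turn, re-using the progressively completed dataset as training data.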
In order to evaluate the performance of the missing data treatment methods, a decision tree classifier is built on the modified dataset: if the performance of the classifier turns out to be satisfactory, the performance of the missing data treatment model is considered satisfactory. The prediction accuracy of the model, the prediction accuracy per class and the prediction profit (against the internal method), used in this paper to measure the performance of the data mining algorithms, are defined as follows:

Prediction accuracy of model = (Number of correctly classified instances / Total number of instances) × 100%    (1)

Prediction accuracy of class = (Number of correctly classified instances in the class / Total number of instances in the class) × 100%    (2)

Prediction profit = ((Prediction accuracy − Prediction accuracy of internal method) / Prediction accuracy of internal method) × 100%    (3)

The numbers of correctly classified instances, overall and per class, are calculated on the modified dataset; the prediction accuracy of the internal method comes from the classifier built on the dataset completed by the internal missing data treatment method.

The experiments proceed as follows. First, each dataset is randomly divided into two subsets: 66% of the records form a training subset and the remaining 34% a testing subset. Missing data are artificially implanted at different rates, from 10% to 60% of the records, into the training subsets only, so that the integrity of the testing subsets is maintained. Since a decision tree is used as the classifier and the analysis of the most representative node attributes is desirable, attributes with high information gain are selected for the insertion of missing data. Finally, three methods are applied to the training subsets: the mean replacing method, the internal method and Model 4 of NBI. The experiments are repeated three times for each method and the average error rate is calculated.
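Equations (1)-(3) amount to the following simple calculations (a sketch; the function names are ours):

```python
def prediction_accuracy(n_correct, n_total):
    """Eqs. (1)/(2): percentage of correctly classified instances,
    either overall or within a single class."""
    return 100.0 * n_correct / n_total

def prediction_profit(accuracy, internal_accuracy):
    """Eq. (3): relative improvement of a treatment method over the
    classifier built with the internal missing data treatment."""
    return 100.0 * (accuracy - internal_accuracy) / internal_accuracy
```

For example, a method that lifts the per-class accuracy from 10% to 31% yields a prediction profit of 210%.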
Due to the lack of space, only part of the results is presented in Tables 2 and 3, and the comparative performances are graphed in Figures 1 and 2.

Table 2. Error rates for dataset Nursery

Missing       Att.: heal, par                      Att.: heal, par, fina
proportion    Mean       Internal   Model 4        Mean       Internal   Model 4
0%            3.8±0.3    -          -              3.8±0.3    -          -
10%           4.2±0.2    4.2±0.3    3.9±0.1        4.5±0.1    4.5±0.3    4.2±0.6
20%           4.7±0.3    5.3±0.6    5.0±0.6        5.1±0.4    5.5±0.4    5.3±0.2
30%           5.7±1.1    7.0±1.9    6.4±0.5        6.9±0.8    6.8±0.3    6.6±0.1
40%           7.4±1.0    7.0±0.5    7.1±0.4        8.1±0.6    7.4±0.2    7.5±0.8
50%           10.8±0.1   10.5±1.0   8.8±1.0        11.0±0.1   10.1±0.3   10.1±0.4
60%           11.0±0.3   13.6±6.3   9.8±1.2        11.8±0.4   15.4±6.9   11.3±0.9

From Figures 1 and 2 we can see that, in most cases, Model 4 of NBI is superior to the mean replacing method and the internal method. For the Nursery dataset, while the
proportion of missing data is small, the performances of the three methods are similar. As the missing rate goes beyond 40%, the differences among the three methods become more obvious. The performance of the mean replacing method worsens as the missing rate increases. However, in the tree structure, the node attributes and their levels do not change markedly: increases in the proportion of missing data do not influence the structure of the decision tree, because the information gain ratios of the attributes heal, par and fina are much larger than those of the other attributes. The performance of the internal method is very similar to that of Model 4.

Table 3. Error rates for dataset Crx

Missing       Att.: A5, A6                         Att.: A5, A6, A2
proportion    Mean       Internal   Model 4        Mean       Internal   Model 4
0%            13.5±0.3   -          -              13.5±0.3   -          -
11%           13.5±0.3   13.2±0.4   13.5±0.3       13.5±0.3   13.5±0.8   13.5±0.3
21%           13.4±2.3   12.3±0.9   13.5±0.3       13.7±2.2   12.3±0.9   12.1±1.1
35%           14.6±1.3   14.3±2.3   12.9±0.7       14.6±1.3   13.7±2.9   12.3±0.9
50%           13.8±2.2   13.4±0.7   13.2±1.2       14.3±1.6   13.0±0.3   13.5±1.1
60%           17.7±4.6   15.3±2.7   12.6±0.5       16.0±0.5   12.7±0.2   12.6±1.2

Results for the Crx dataset are illustrated in Figure 2. As the proportion of missing data increases, the classification error rates of both the mean replacing method and the internal method increase, whereas the results for Model 4 remain stable and very close to the error rate of the original dataset without missing data. In this experiment, attributes A5, A6 and A2 were found to be strongly dependent on the other attributes. To find the dependency relationships among attributes, each attribute, one by one, is predicted from the other attributes in the dataset. If the error rate of such a classifier is low, the attribute has a strong relationship with the other attributes. For Crx, the error rates of these three attributes are low, about 25%.
This makes the dataset completed by Model 4 very close to the original dataset without missing values; therefore, the predictive error rates of Model 4 always fluctuate around those of the original dataset. We also find that an increase in the proportion of missing data does not by itself affect the predictive error rates, but an increase in the number of attributes with missing data does influence the performance of the treatment methods. For the German dataset, the internal method of C4.5 performed well when missing data were inserted into one attribute. When missing data were inserted into three attributes, the performances of the three methods were similar to each other. For both the Crx and German datasets, as the proportion of missing data increases, the nodes of the decision tree change: attributes at high levels move to lower levels or are not even selected as nodes. Using attributes with weak classifying power reduces the performance of the decision tree [3]. Once the structure of the decision tree has changed, the performance of Model 4 is better than that of the internal method.
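The dependency analysis described above, predicting each attribute from the others and inspecting the error rate, can be sketched as follows. The paper builds a classifier per attribute; this illustration substitutes a hypothetical nearest-match predictor with leave-one-out evaluation (function name and fallback rule are our assumptions):

```python
from collections import Counter

def dependency_error(rows, attr):
    """Estimate how predictable `attr` is from the remaining attributes:
    each row's value of `attr` is predicted as the most common value among
    rows that agree on all other attributes (falling back to the global
    mode when no such row exists). A low error rate indicates that `attr`
    depends strongly on the other attributes."""
    others = [a for a in rows[0] if a != attr]
    global_mode = Counter(r[attr] for r in rows).most_common(1)[0][0]
    errors = 0
    for i, r in enumerate(rows):
        peers = [t[attr] for j, t in enumerate(rows)
                 if j != i and all(t[a] == r[a] for a in others)]
        pred = Counter(peers).most_common(1)[0][0] if peers else global_mode
        errors += pred != r[attr]
    return errors / len(rows)
```

An attribute that is fully determined by the others yields an error of 0.0, while one that is independent of them yields a high error, mirroring the criterion used for A5, A6 and A2.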
Fig. 1. Comparative results for Nursery (panels: missing data on attribute heal; on heal & par; on heal, par & fina; y-axis: error rate, 2-10%)

Fig. 2. Comparative results for Crx (panels: missing data on attribute A5; on A5 & A6; on A5, A6 & A2; y-axis: error rate, 13-18%)

5 Application in Healthcare

Data mining approaches are widely used in the healthcare domain. If inpatient length of stay (LOS) can be predicted efficiently, the planning of hospital resources can be greatly enhanced [9]. However, most healthcare datasets contain a lot of missing data. The missing data treatment methods discussed earlier in this paper are applied to a real-life dataset to improve the accuracy of predictive models of LOS.

5.1 Clinics Dataset

The Clinics dataset contains data from a clinical computer system that was in use between 1994 and 1997 for the management of patients in a geriatric medicine department of a metropolitan teaching hospital in the UK [10]. It contains 4722 patient records, including patient demographic details, admission reasons and LOS. For ease of analysis, LOS was categorized into three groups: a short-term group (0-14 days), a medium-term group (15-60 days) and a long-term group (61+ days) (variable LOS GROUP). These boundaries were chosen in agreement with clinical judgment to
help describe the stages of care in such a hospital department. Missing data account for a large part of the Clinics dataset: 3017 instances (63.89%) contain missing data, and the proportions of instances with missing data per LOS GROUP are 63.29%, 61.81% and 74.86%, respectively. There are 8 attributes with missing data.

5.2 Practice and Analysis

After applying the five models based on the Naive Bayesian classifier proposed in this paper to the Clinics dataset, we obtained the prediction accuracies and prediction profits shown in Table 4.

Table 4. Prediction accuracy of all models and classes, and prediction profits against the internal method

                 Accuracy   Accuracy of class        Profit     Prediction profit of class
Model            of model   Short   Medium   Long    of model   Short   Medium   Long
Internal (C4.5)  52.44%     44%     69%      10%     -          -       -        -
Model 1          55.62%     46%     69%      29%     6.10%      4.5%    0.0%     190.0%
Model 2          55.23%     46%     68%      30%     5.30%      4.5%    -1.4%    200.0%
Model 3          55.87%     48%     68%      31%     6.50%      9.1%    -1.4%    210.0%
Model 4          55.82%     47%     69%      31%     6.50%      6.8%    0.0%     210.0%
Model 5          55.53%     45%     70%      30%     5.90%      2.3%    1.5%     200.0%

From Table 4 we can see that the average prediction accuracy of NBI is higher than that of the internal model, especially for the long-term category. In the Clinics dataset, 6 attributes with missing data were treated; the missing proportion for two of them is about 20%, and for one it is above 40%. In this case, NBI outperforms the internal method. Among the five models, Model 4 and Model 5 performed better than the others. Furthermore, for Models 3, 4 and 5, treating three or four attributes was enough to obtain a good result. NBI improves the prediction accuracy of the whole model, especially for the long-term category, where the highest prediction profit reaches 210%. In cases where the proportion of missing data is large, many attributes contain missing data and a strong relationship among attributes is exhibited, the treatment methods based on the Naive Bayesian classifier perform well.
6 Conclusions

This paper presents a comparative analysis of several well-known missing data treatment methods and proposes an efficient and effective missing data predictive model, NBI. These methods were tested on the Nursery, Crx and German datasets from the UCI repository and on the Clinics dataset from a geriatric department. NBI performs better than the internal model. The type of the attributes with missing data affects the results of the treatment methods: when the attributes important for classification contain little or no missing data, the internal model performs very well. However, treatment methods based on the Naive Bayesian classifier are more generally applicable. Comparatively, in the case of a high missing data proportion and many attributes with missing data, NBI performs more satisfactorily.
References

1. Han, J. and Kamber, M., Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.
2. Cios, K.J. and Kurgan, L., Trends in Data Mining and Knowledge Discovery. In: Pal, N.R., Jain, L.C. and Teoderesku, N. (eds.), Knowledge Discovery in Advanced Information Systems. Springer, 2002.
3. Acuna, E. and Rodriguez, C., The treatment of missing values and its effect in the classifier accuracy. In: Banks, D., House, L., McMorris, F.R., Arabie, P. and Gaul, W. (eds.), Classification, Clustering and Data Mining Applications. Springer-Verlag, Berlin-Heidelberg, pp. 639-648, 2004.
4. Little, R.J. and Rubin, D.B., Statistical Analysis with Missing Data, Second Edition. John Wiley and Sons, New York, 2002.
5. Magnani, M., Techniques for Dealing with Missing Data in Knowledge Discovery Tasks. Accessed at http://magnanim.web.cs.unibo.it/index.html on Aug. 28, 2004.
6. Hand, D., Mannila, H. and Smyth, P., Principles of Data Mining. MIT Press, 2001.
7. Quinlan, J.R., C4.5: Programs for Machine Learning. Morgan Kaufmann, CA, 1993.
8. Merz, C.J. and Murphy, P.M., UCI Repository of Machine Learning Datasets, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html
9. Marshall, A., Vasilakis, C. and El-Darzi, E., Modelling Hospital Patient Flow: Recent Developments and Future Directions. Health Care Management Science (accepted).
10. Marshall, A., McClean, S., Shapcott, C., Hastie, I. and Millard, P., Developing a Bayesian Belief Network for the Management of Geriatric Hospital Care. Health Care Management Science, 4(1), pp. 25-30, 2001.