

Applying data mining algorithms to inpatient dataset with missing values

Peng Liu
School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai, China
Elia El-Darzi
School of Computer Science, University of Westminster, London, UK
Lei Lei
School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai, China, and
Christos Vasilakis, Panagiotis Chountas and Wei Huang
School of Computer Science, University of Westminster, London, UK

Journal of Enterprise Information Management, Vol. 21 No. 1, 2008. © Emerald Group Publishing Limited

Abstract
Purpose: Data preparation plays an important role in data mining, as most real-life data sets contain missing data. This paper aims to investigate different treatment methods for missing data.
Design/methodology/approach: This paper introduces, analyses and compares well-established treatment methods for missing data and proposes new methods based on the naïve Bayesian classifier. These methods have been implemented and compared using a real-life geriatric hospital dataset.
Findings: In the case where a large proportion of the data is missing and many attributes have missing data, treatment methods based on the naïve Bayesian classifier perform very well.
Originality/value: This paper proposes an effective missing data treatment method and offers a viable approach to predicting inpatient length of stay from a data set with many missing values.
Keywords: Health services, Patients, Data analysis
Paper type: Research paper

Introduction
In recent years data mining (DM) approaches have been widely applied in the field of healthcare (Ceglowski et al., 2005; Isken and Rajagopalan, 2002; Ridley et al., 1998). Typically, the DM process comprises six steps: understanding the problem domain, understanding the data, preparing the data, data mining, evaluating the discovered knowledge, and finally using the discovered knowledge (Krzysztof and Kurgan, 2002). Cabena et al. (1998) estimate that about 20 per cent of the effort is spent on business objective determination, about 60 per cent on data preparation, and about 10 per cent each on the data mining and analysis of knowledge step and the knowledge assimilation step.

Why is more than half of the project effort spent on data preparation? Real-world datasets suffer from many serious data quality problems. Problems often encountered include incomplete, redundant, inconsistent or noisy data. If not addressed, these quality problems will reduce the performance of data mining algorithms (Liu and Motoda, 1998). Hence, in many cases, a lot of effort and time is spent on the data pre-processing phase.

The application of efficient and sound data pre-processing procedures can reduce the amount of data to be analysed without losing any critical information, improve the quality of the data, enhance the performance of the actual data mining algorithms and reduce the execution time of mining algorithms (Liu and Motoda, 1998). Widely used data pre-processing techniques that have proved useful in practice include data cleaning, integration and transformation (Han and Kamber, 2000). In addition to these, feature selection, extraction, construction and discretisation are also widely applied (Han and Kamber, 2000; Kantardzic, 2003).

This paper focuses on the important issue of dealing with missing values in data pre-processing. The rest of the paper is organised as follows. In the next section several well-established methods for handling missing values are introduced. In the section that follows we propose several models to deal with missing values based on the naïve Bayesian classifier and the information gain method. All the models were applied to a geriatric hospital data set and computational results are reported in the experiments section. Finally, conclusions and further work are discussed in the final section.

Methods for handling missing values
Although some mining methods (e.g. naïve Bayesian) are robust to missing values, other methods such as decision trees and K-means clustering cannot be performed directly on data that have missing values (Troyanskaya et al., 2001). Handling missing values improperly may accumulate errors that propagate across subsequent runs. Methods for resolving missing values are therefore needed. The following methods are commonly used to address these issues:

- Eliminating instances that contain missing values (Kantardzic, 2003). Although this method is the simplest, it is only effective on datasets containing a small number of instances with missing values. Its performance is especially poor when the percentage of missing values per attribute varies considerably (Han and Kamber, 2000). Furthermore, if the instances with missing values account for a large part of the dataset, their omission will destroy the integrity of the original dataset.
- Replacing missing values with a global constant or attribute mean (Kantardzic, 2003). The global constant or attribute mean is essentially a new value for the attribute. However, in many cases the replaced values may not be correct. Poor substitution of missing values may introduce inaccuracies and bias to the data, and hence this method is not considered to be very effective.
- K nearest neighbours (KNN) algorithm (Troyanskaya et al., 2001). In this method the missing values in the dataset are first approximated with means. For each instance, the distances to all other instances are computed and the k instances with the smallest distances to it are selected. The missing value in the instance is then replaced with the weighted average of the k values belonging to the k nearest instances. The main advantage of this method is that it can provide estimates for both qualitative and quantitative attributes. The major drawback is that the whole dataset has to be searched for each instance.

- Prediction model (Kantardzic, 2003). These methods create a predictive model to estimate values that can be used to replace the missing values in the dataset. For example, using the data present in the dataset, a decision tree or a Bayesian classifier can be constructed to predict the missing values. In contrast to the previous methods, this approach uses the most information from the present data to predict missing values (Han and Kamber, 2000), and for this reason it is becoming increasingly popular.

Models for handling missing values

Model building
Based on the above methods for dealing with missing values, four models for handling missing values are explained and evaluated in this section. Estimating missing values using a naïve Bayesian classifier is also discussed.

Model 1: basic model. Missing values are left untreated and retained in this model. The original dataset, called dataset 0, which contains the missing values, is used as a training subset. A decision tree using the C4.5 algorithm (Quinlan, 1993) is built to evaluate the performance of the model. This model is used as the benchmark for comparing the performance of the other models.

Model 2: Bayesian estimation model. Missing values in this model are replaced by values that are estimated by a naïve Bayesian classifier. Dataset 0 is used as the training set to build the naïve Bayesian classifier (Hand et al., 2001). This classifier is used to estimate all the missing values in dataset 0. The new complete dataset is called dataset B.

Model 3: Bayesian estimation iteration model. The Bayesian estimation model uses dataset 0 to estimate all the missing values in the dataset without utilising the values that have been estimated earlier. For classifiers, the larger the training dataset, the more effective the classifier is. The iteration model therefore makes full use of the estimated values. The algorithm consists of the following steps:
(1) An attribute with missing values is chosen. The missing values of this attribute are estimated by a Bayesian classifier, which is built using dataset 0. A new dataset, called C1, is created from dataset 0 by replacing the missing values in the chosen attribute with the estimated values.
(2) Dataset C1 is used to estimate the missing values in the next attribute, and these missing values are replaced with the estimated values in dataset C1 to create a new dataset, called C2.
(3) Step 2 is repeated until all the missing values in dataset 0 have been replaced and a new complete dataset without any missing values, called dataset C, has been created.

In this model, missing values are estimated on the basis of earlier estimates, so the values that have been estimated earlier are fully utilised. However, a different order of estimation will result in a different dataset in the Bayesian estimation iteration model. Thus, ten groups of estimation orders are created randomly. Each order builds one Bayesian estimation iteration model. The prediction results of these ten models differ, and thus a carefully chosen order of estimation is needed to find a better solution for this prediction problem.
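The difference between Model 2 and Model 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the small categorical naïve Bayes classifier, the smoothing scheme, the use of `None` to mark a missing cell, and all function names (`train_nb`, `predict_nb`, `impute_model2`, `impute_model3`, `random_orders`) are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict

def train_nb(rows, target):
    """Fit a categorical naive Bayes model for column `target`.
    Rows whose target is missing (None) are skipped; missing predictors are ignored."""
    labelled = [r for r in rows if r[target] is not None]
    prior = Counter(r[target] for r in labelled)
    cond = defaultdict(Counter)   # (column, class) -> counts of observed values
    levels = defaultdict(set)     # column -> distinct observed values
    for r in labelled:
        for j, v in enumerate(r):
            if j != target and v is not None:
                cond[(j, r[target])][v] += 1
                levels[j].add(v)
    return prior, cond, levels

def predict_nb(model, row, target):
    """Return the most probable class for `row`, using only its observed cells."""
    prior, cond, levels = model
    total = sum(prior.values())
    best, best_score = None, -1.0
    for cls in sorted(prior):                     # sorted: deterministic tie-breaking
        score = prior[cls] / total
        for j, v in enumerate(row):
            if j != target and v is not None:
                counts = cond[(j, cls)]
                # Laplace smoothing; the extra +1 reserves mass for unseen values
                score *= (counts[v] + 1) / (sum(counts.values()) + len(levels[j]) + 1)
        if score > best_score:
            best, best_score = cls, score
    return best

def impute_model2(rows, columns):
    """Model 2 sketch: every classifier is trained on the original dataset 0."""
    filled = [list(r) for r in rows]
    for t in columns:
        model = train_nb(rows, t)
        for original, out in zip(rows, filled):
            if original[t] is None:
                out[t] = predict_nb(model, original, t)
    return filled

def impute_model3(rows, order):
    """Model 3 sketch: each classifier is trained on the dataset produced by the
    previous step (C1, C2, ...), so earlier estimates feed later ones."""
    current = [list(r) for r in rows]
    for t in order:
        model = train_nb(current, t)
        for r in current:
            if r[t] is None:
                r[t] = predict_nb(model, r, t)
    return current

def random_orders(columns, k, seed=0):
    """Draw k random estimation orders (the seed is an illustrative choice)."""
    rng = random.Random(seed)
    return [rng.sample(columns, len(columns)) for _ in range(k)]

# Toy data: three categorical columns, None marks a missing value.
rows = [("a", "x", "p"), ("a", "x", "p"), ("b", "y", "q"), ("b", "y", "q"),
        ("a", None, "p"), (None, "y", "q"), (None, None, "q")]
dataset_b = impute_model2(rows, [0, 1])
dataset_c = impute_model3(rows, [0, 1])
```

On a toy table this small both models fill the gaps identically; on real data the result of `impute_model3` generally depends on the order passed in, which is why several orders from `random_orders` would be compared.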

Model 4: Bayesian estimation iteration model based on information gain. In the Bayesian estimation model (model 2), only one dataset, dataset 0, is used as the training subset. Therefore the order of the estimation has no influence on the final dataset: regardless of which attribute is estimated first or second, the final dataset is the same. Model 4, however, differs from model 3 in that the order of the estimation is chosen deliberately and plays an important role in the handling of missing values. The aim is to overcome some inherent inefficiencies in model 3. More specifically:

- The accuracy of the values estimated early affects the accuracy of the values estimated later. If incorrect values are used to estimate the remaining missing values, the inaccuracy of the values estimated later is higher. If, for instance, the estimation accuracy of the first attribute is x per cent, then the estimation accuracy of the second attribute is (x per cent)^2, and the estimation accuracy of the nth attribute is (x per cent)^n. If, for example, there are nine attributes containing missing values in a dataset and the estimation accuracy of the first attribute is 90 per cent, then the estimation accuracy of the ninth attribute is roughly 38 per cent. That is to say, the more values we estimate, the lower their authenticity and reliability. Clearly this is not the best way to estimate and replace all the missing values in the dataset.
- In model 3, a different choice of attribute at any stage of the estimation leads to a different dataset. For every new dataset, a new naïve Bayesian classifier has to be built, and the values estimated afterwards will change accordingly. There are 362,880 (9!) different orders if nine attributes need to be estimated, and these will result in 362,880 different datasets. The computational cost of evaluating all these orders to find the optimal dataset is very high.
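The two figures quoted above can be checked with a short worked calculation. It adopts the text's own simplification that each estimation step multiplies the accuracy by the same factor x; this is an illustrative assumption, not a measured property of the classifier.

```latex
\[
\text{accuracy of the } n\text{th attribute} = \left(\frac{x}{100}\right)^{n},
\qquad
\left(\frac{90}{100}\right)^{9} = 0.9^{9} \approx 0.387 \;(\text{roughly } 38\ \text{per cent})
\]
\[
\text{number of estimation orders for nine attributes} = 9! = 362{,}880
\]
```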
In the light of these inherent inefficiencies, an effective criterion for choosing the next attribute that contains missing values is essential. Here we choose information gain as the criterion for attribute choice. Information theory (Witten and Frank, 2000) provides many effective heuristics for selecting such an attribute, such as gain ratio, distance measure and relevance. Information gain is a mature criterion, broadly accepted by many researchers. The widely applied decision tree algorithm C4.5 also uses information gain to choose nodes. Following the descending order of information gain, we use a Bayesian estimation iteration model to estimate and replace, one by one, each attribute that contains missing values. The final dataset built under this criterion is called dataset D. The above models are summarised in Table I.

Table I. Models comparison

Model     Handling of missing values   Estimate orders
Model 1   Keep                         N/A
Model 2   Estimate and replace         Random
Model 3   Estimate and replace         Random and iterated
Model 4   Estimate and replace         Ordered and iterated
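The ordering criterion of Model 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: rows are dictionaries, `None` marks a missing value, rows in which the attribute is missing are simply excluded when its gain is computed (the paper does not say how missing cells are treated at this step), and the names `entropy`, `information_gain` and `estimation_order` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Information gain of `attr` with respect to `target`.
    Assumption: rows where `attr` is missing (None) are left out."""
    known = [r for r in rows if r[attr] is not None]
    if not known:
        return 0.0
    groups = defaultdict(list)
    for r in known:
        groups[r[attr]].append(r[target])
    remainder = sum(len(g) / len(known) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in known]) - remainder

def estimation_order(rows, attrs, target):
    """Attributes sorted by descending information gain (Model 4 ordering)."""
    return sorted(attrs, key=lambda a: information_gain(rows, a, target), reverse=True)

# Toy data: attribute A separates the classes perfectly, B hardly at all.
rows = [
    {"A": "x", "B": "u", "LOS": "short"},
    {"A": "x", "B": "v", "LOS": "short"},
    {"A": "z", "B": "u", "LOS": "long"},
    {"A": "z", "B": "v", "LOS": "long"},
    {"A": None, "B": "u", "LOS": "short"},
]
order = estimation_order(rows, ["B", "A"], "LOS")
```

The resulting `order` would then be passed to an iterative imputation routine so that the most informative attributes are estimated first.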

Evaluation of the models
The naïve Bayesian classifier, the decision tree C4.5 algorithm and information gain are used in this paper. The performance of missing-value handling is evaluated by a decision tree C4.5 prediction model and ten-fold cross-validation (Witten and Frank, 2000). If the prediction model performs satisfactorily, then the performance of the missing-value handling model is considered to be acceptable. Decision trees have been shown to be particularly robust in healthcare applications compared with neural networks, regression models and discriminant analysis (Harper, 2005).

The two concepts used for measuring the performance of data mining algorithms, prediction accuracy and prediction profit (Liu et al., 2004), are defined as follows. Prediction accuracy can be applied at two levels, model level and class level:

$$\text{Prediction accuracy of model} = \frac{\text{Number of correctly categorised instances}}{\text{Total number of instances}} \times 100\%$$

$$\text{Prediction accuracy of class} = \frac{\text{Number of correctly categorised instances in the class}}{\text{Total number of instances in the class}} \times 100\%$$

The overall prediction accuracy must be judged relative to the class distribution of the original dataset. If all the instances in the dataset are predicted to be in the largest class, the prediction accuracy equals the proportion of that class. Let $a$ be this maximal class proportion of the original dataset. A prediction model is of no benefit if the absolute value of its prediction accuracy is very high but not higher than $a$. In contrast, if the absolute value of the prediction accuracy of the prediction model is very low, but still higher than $a$, then the model is considered acceptable.
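These measures are straightforward to compute. The Python sketch below is illustrative only: the function names are hypothetical, and the generic `profit` helper simply expresses an accuracy as a relative improvement over a reference value, which is how the text goes on to compare a model with the baseline $a$ and a class with the basic model.

```python
from collections import Counter

def model_accuracy(actual, predicted):
    """Percentage of instances whose predicted class equals the actual class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)

def class_accuracy(actual, predicted, cls):
    """Percentage of the instances of class `cls` that were predicted as `cls`."""
    members = [i for i, a in enumerate(actual) if a == cls]
    return 100.0 * sum(predicted[i] == cls for i in members) / len(members)

def majority_baseline(actual):
    """The value a: accuracy obtained by assigning every instance to the largest class."""
    return 100.0 * max(Counter(actual).values()) / len(actual)

def profit(value, reference):
    """Relative improvement of `value` over `reference`, in per cent."""
    return 100.0 * (value - reference) / reference

# Toy example with three length-of-stay classes.
actual    = ["short", "short", "medium", "medium", "medium", "long"]
predicted = ["short", "medium", "medium", "medium", "medium", "long"]

acc = model_accuracy(actual, predicted)
a = majority_baseline(actual)
gain_over_baseline = profit(acc, a)
```

A model is only worth considering when `acc` exceeds `a`, that is, when `gain_over_baseline` is positive.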
In order to express this concept, the prediction profit of the model is introduced as follows:

$$\text{Prediction profit of model} = \frac{\text{Prediction accuracy of the model} - a}{a} \times 100\%$$

The prediction profit can also be extended to class level, where the class accuracy of the basic model serves as the reference:

$$\text{Prediction profit of class} = \frac{\text{Prediction accuracy of class} - \text{Prediction accuracy of class of basic model}}{\text{Prediction accuracy of class of basic model}} \times 100\%$$

Model experiments
Data mining is widely used in the healthcare domain. One of the main concerns in healthcare is the measurement of the flow of patients through hospitals and other healthcare facilities. For instance, if the inpatient length of stay (LOS) can be predicted efficiently, the planning and management of hospital resources can be greatly enhanced (Marshall et al., 2005). Data mining algorithms have also been successfully applied to predict LOS (for example, see Harper, 2002; McClean and Glackin, 2000). However, most healthcare datasets contain a lot of missing values. The accuracy of prediction will benefit from an improvement of the dataset in the pre-processing stage. For these reasons we apply all the missing-value handling models discussed earlier to a real-life dataset to improve the accuracy of predictive models of LOS, especially for those patient spells that have a very long LOS. The hospital LOS of inpatients is frequently used as a proxy for measuring the consumption of hospital resources, and it is therefore essential to develop accurate models for the prediction of inpatient LOS.

Clinics dataset
The clinics dataset contains data from a clinical computer system that was used between 1994 and 1997 for the management of patients in a geriatric medicine department of a metropolitan teaching hospital in the UK (Marshall et al., 2001). It contains 4,722 patient records including patient demographic details, admission reasons, discharge details, outcome and LOS. Patient LOS has an average of 85 days and a median of 17 days.

For ease of analysis, the duration of stay variable was categorised into three groups: 0-14 days, 15-60 days and 61+ days (variable LOS GROUP). The boundaries of the LOS groups were chosen in agreement with clinical judgment to help describe the stages of care in such a hospital department. The first, short-stay group (0-14 days) roughly corresponds to patients receiving acute care, i.e. patients admitted in a critical condition and only staying in hospital for a short period of time. The second, medium-stay group (15-60 days) corresponds to patients who undergo a period of rehabilitation. The third, long-stay group (61+ days) refers to the 11 per cent of patients who stay in hospital for a long period of time.

Following Marshall et al. (2001), we extract 28 attributes for pre-processing. Attribute generalisation (Carter and Hamilton, 1998) is applied to the attributes SEASON (the season of admission) and AGE85 (patients over 85 years of age).
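The discretisation of length of stay into the LOS GROUP variable can be written as a small helper. The sketch below follows the boundaries stated in the text (0-14, 15-60 and 61+ days); the function name and the label strings are illustrative assumptions rather than identifiers from the clinics dataset.

```python
def los_group(days):
    """Map a length of stay in days to the three groups used in the paper."""
    if days < 0:
        raise ValueError("length of stay cannot be negative")
    if days <= 14:
        return "short-stay"    # acute care
    if days <= 60:
        return "medium-stay"   # rehabilitation
    return "long-stay"         # long-term care

groups = [los_group(d) for d in (0, 14, 15, 60, 61, 400)]
```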
A grouped Barthel score, explained below, is also introduced to replace the original ten Barthel scores. The Barthel score is composed of various indices that assess the patient's ability to carry out everyday activities and their dependence on others for support (Mahony and Barthel, 1965). The score consists of ten elements: feeding, grooming, bathing, mobility, stairs, dressing, transfer, toilet, bladder and bowels (Figure 1). Each patient is assessed using scales of dependency ranging from 0 to 3. A low score generally means a high level of patient dependency on others, while a high score reflects those patients who are independent and require little assistance from medical staff.

Figure 1. Attributes depicted in hierarchical structure

In order to simplify this scoring system for patient dependency, a grouped Barthel score is introduced as in Marshall et al. (2001). This revised scoring system is defined as follows: heavily dependent (Barthel score ≤ 1), very dependent (Barthel score 2-10), slightly dependent (Barthel score 11-19) and independent (Barthel score ≥ 20). These procedures have reduced the total number of attributes in the original data set from 28 to 19 without losing any critical information, producing clinics dataset 0 (Liu et al., 2004).

There are many missing values in the clinics data set. More specifically, 3,017 instances (63.89 per cent) contain missing values. The proportion of missing values for LOS GROUP is 63.29, and per cent, respectively, for the short-, medium- and long-stay groups. The accuracy of the prediction model, and especially the accuracy related to the long-stay group, can be increased by handling the missing values appropriately.

Practice and analysis
The prediction accuracies of the models for handling missing values described in the previous section, and the associated accuracy of each class, are listed in Table II, while the prediction profits are listed in Table III. When calculating the prediction profit of a class, the prediction accuracy of that class is compared with that of the basic model.

The clinics dataset consists of approximately per cent of patients who stay in hospital for a short period of time (short-stay), per cent of patients who stay in hospital for a medium period of time (medium-stay) and per cent of patients who stay in hospital for a long period of time (long-stay). That is to say, the prediction accuracy of attribute LOS will reach per cent if we conclude that all the patients are medium-stay. Obviously there is a lot of room for improvement.
In Table II, although the overall prediction accuracy of model 2 and the accuracy of its medium-stay class are a little lower than those of model 1, the prediction accuracy of the long-stay class has improved considerably. The prediction accuracy of the long-stay class is 10 per cent in model 1, before the missing values were estimated and replaced. After the missing values were estimated and replaced in model 2, the prediction accuracy of the long-stay class increased to 17 per cent, corresponding to a 70 per cent prediction profit. This means that model 2 (Bayesian estimation model) has a positive effect on the prediction of the long-stay class, which contains more than 60 per cent missing values.

Table II. The prediction accuracy of all the models and classes
(Columns: Model; Accuracy of model (%); Accuracy of class: Short-stay (%), Medium-stay (%), Long-stay (%))

Table III. Prediction profits of three missing values handling models
(Columns: Model; Prediction profits of model (%); Prediction profits of class: Short-stay (%), Medium-stay (%), Long-stay (%))

For model 3 (Bayesian estimation iteration model), ten attribute orders were created randomly. Using these orders, ten groups of model and class accuracies were obtained (see Table IV). From Table II, the average prediction accuracy of the Bayesian estimation iteration model (model 3) is found to be higher than that of model 1, especially for the long-stay category, i.e. per cent in Table III. Model 3 is therefore more effective than model 1 and model 2. However, the order of the attribute estimation is a big problem in this model, as already explained in section 3. There are 40,320 different orders for eight attributes and only ten of these are shown in Table IV. Hence it is computationally impractical to evaluate all the orders to find the optimal values.

As explained earlier, model 4 (Bayesian estimation iteration model based on information gain) was developed from model 3. In this model the attributes that are most important to the prediction of LOS are estimated first. In this way, the reliability of these important attributes can be assured. The prediction accuracy is also improved through the enhancement of the quality of the dataset. The information gain of the attributes in clinics dataset 0 is shown in Table V.
Following the order in Table V, the Bayesian estimation iteration model based on information gain was built, and the resulting prediction accuracies are listed in Table VI (note that 0* in Table VI denotes the prediction accuracies of the basic model without estimating any missing values). The prediction accuracy of the model is highest when the third attribute, ADMGRP, has been estimated (see Figure 2). The sequence of the estimation has a great effect on the results of the prediction model. A good sequence can improve the efficiency of the model and reduce the estimation time. Hence, model 4 stops when the first three attributes with missing data have been filled in, and this semi-complete dataset D is then used as the training subset for the predictive model. The prediction accuracies of the model and each category can be seen on the third line of Table VI.

Table IV. Results of ten Bayesian estimation iteration models
(Columns: No.; Accuracy of model (%); Accuracy of class: Short-stay (%), Medium-stay (%), Long-stay (%); final row: Average)

Table V. Information gain of attributes (descending)
(Columns: Attribute; Gain)
BARTHEL: Barthel grade
OUTGRP: destination, i.e. dead/home/transfer
ADMGRP: admission method
LIVEGRP: lives alone or not
YEARGRP: year of admission
KINGRP: next of kin
OTHERGRP: unknown
MARGRP: marital status
SEASON: season of admission

Table VI. Process of Bayesian estimation iteration model on information gain (descending)
(Columns: Order; Attribute; Accuracy of model (%); Accuracy of class: Short-stay (%), Medium-stay (%), Long-stay (%). Rows: 0*, BARTHEL, OUTGRP, ADMGRP, LIVEGRP, YEARGRP, KINGRP, MARGRP, SEASON)

Figure 2. Prediction accuracy of long stay for every stage in descending order of information gain

If the mining algorithm proves to be robust to missing values, then we only need to estimate the first three attributes instead of all the attributes containing missing values. If not, all the remaining missing data in the dataset should be replaced to produce a complete one. In order to reduce the computational cost, model 4 and model 2 are combined here. First, model 4 is applied to the attributes BARTHEL, OUTGRP and ADMGRP to make dataset D; then model 2 is applied to dataset D to substitute the remaining missing data and produce the final complete dataset D. Dataset D is the training set for building and evaluating the C4.5 decision tree. The prediction accuracy of the model is per cent and the accuracy of each class is 45.4, 69.7 and 29.9 per cent, respectively.

Compared with model 1, this method is more effective. The Bayesian estimation iteration model based on information gain achieves a greater improvement at a lower cost. For the long-stay class in particular, the improvement is significant, with the prediction profit reaching 211 per cent. The prediction accuracies of the model and of each class are higher than those of the Bayesian estimation iteration model based on random attribute order. This is because this model takes into account the classification contribution of each attribute, that is, its information gain.

All the accuracies of models and classes in Table II are summarised in Figure 3. The models for dealing with missing values proposed in this paper can improve the prediction accuracy of the whole model, especially the accuracy of the long-stay category. The biggest prediction profit for the long-stay class is 211 per cent, which shows that the missing-value handling model is very effective. Dealing with the missing values is critical for improving the prediction accuracy of the long-stay class, in which missing values account for more than half of the data.
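As a consistency check on the figures quoted in this section, the class-level prediction profit formula can be applied to the long-stay accuracies. The first line uses the rounded values 10 and 17 per cent reported for models 1 and 2. The second line back-solves the basic-model accuracy implied by a 211 per cent profit and a 29.9 per cent accuracy; that value is an inference from the formula, not a number taken from the tables.

```latex
\[
\frac{17 - 10}{10} \times 100\% = 70\%
\]
\[
\frac{29.9 - p_{1}}{p_{1}} \times 100\% = 211\%
\;\Longrightarrow\;
p_{1} = \frac{29.9}{3.11} \approx 9.6\%
\]
```

The implied value of about 9.6 per cent agrees with the rounded 10 per cent long-stay accuracy reported for model 1.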
Figure 3. The prediction accuracy of all the models and classes

Conclusions
Data mining procedures, pre-processing approaches and several different missing-value estimation methods are introduced in this paper. We apply data mining approaches to the healthcare domain in order to predict the inpatient length of stay (LOS).

In this paper, four models are developed to improve the efficiency of the prediction: the basic model, the Bayesian estimation model, the Bayesian estimation iteration model, and the Bayesian estimation iteration model based on information gain. The performance and suitability of these models are evaluated using a real-life dataset. To make full use of all the data at hand, the values that have been estimated earlier are used to enrich the present dataset and to estimate the remaining missing values, as in model 3 (Bayesian estimation iteration model). It is very important to decide the order of the missing value estimation. Drawing on the concept of entropy from information theory, information gain is used for selecting attributes, as in model 4 (Bayesian estimation iteration model based on information gain). This model gives a better solution in terms of prediction profits and computational cost.

Decision tree C4.5 and ten-fold cross-validation are used to estimate the performance of each model. It is shown that using a naïve Bayesian classifier to predict missing values iteratively, in descending order of information gain, is the most effective approach, especially for long-stay patients. The proportion of these patients is normally small; however, they tend to stay for a very long time in comparison with short-stay or medium-stay patients, and hence predicting their LOS can only improve the planning and management of hospital resources.

However, there are other elements besides information gain that may affect the efficiency of the methods for missing value handling. An example is the percentage of missing values for each attribute. If we take this into account, the sequence of attributes for estimation will change. Further work is needed to find a better ordering criterion for estimation in order to improve the models in this paper.

References
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J. and Zanasi, A. (1998), Discovering Data Mining: From Concepts to Implementation, Prentice-Hall, Upper Saddle River, NJ.
Carter, C. and Hamilton, H. (1998), Efficient attribute-oriented generalization for knowledge discovery from large databases, IEEE Transactions on Knowledge and Data Engineering, Vol. 10, pp.
Ceglowski, A., Churilov, L. and Wassertheil, J. (2005), Knowledge discovery through mining emergency department data, Proceedings of the 38th Hawaii International Conference on System Sciences.
Han, J. and Kamber, M. (2000), Data Mining Concepts and Techniques, Morgan Kaufmann Publishers, San Mateo, CA.
Hand, D., Mannila, H. and Smyth, P. (2001), Principles of Data Mining, MIT Press, Cambridge, MA.
Harper, P. (2002), A framework for operational modelling of hospital resources, Health Care Management Science, Vol. 5, pp.
Harper, P. (2005), A review and comparison of classification algorithms for medical decision making, Health Policy, Vol. 71, pp.
Isken, M. and Rajagopalan, B. (2002), Data mining to support simulation modeling of patient flow in hospitals, Journal of Medical Systems, Vol. 26, pp.
Kantardzic, M. (2003), Data Mining Concepts, Models, Methods and Algorithms, Wiley-IEEE Computer Society Press, New York, NY.
Krzysztof, C. and Kurgan, L. (2002), Trends in data mining and knowledge discovery, in Pal, N.R., Jain, L.C. and Teoderesku, N. (Eds), Knowledge Discovery in Advanced Information Systems, Springer, Berlin.
Liu, H. and Motoda, H. (1998), Feature Extraction, Construction and Selection: A Data Mining Perspective, Kluwer Academic, Boston, MA.
Liu, P., El-Darzi, E., Vasilakis, C., Chountas, P., Huang, W. and Lei, L. (2004), Comparative analysis of data mining algorithms for predicting inpatient length of stay, in Chih-Ping Wei (Ed.), Proceedings of the Eighth Pacific-Asia Conference on Information Systems, Information Systems Adoption and Business Productivity, pp.
McClean, S. and Glackin, M. (2000), Using genetic algorithms to discover event history rules, The Fourth International Conference and Exhibition on The Practical Application of Knowledge Discovery and Data Mining (PADD), pp.
Mahony, F. and Barthel, D. (1965), Functional evaluation: the Barthel index, Maryland State Medical Journal, Vol. 14, pp.
Marshall, A., McClean, S., Shapcott, C., Hastie, I. and Millard, P. (2001), Developing a Bayesian belief network for the management of geriatric hospital care, Health Care Management Science, Vol. 4, pp.
Marshall, A., Vasilakis, C. and El-Darzi, E. (2005), Length of stay-based patient flow models: recent developments and future directions, Health Care Management Science Journal, Vol. 8, pp.
Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA.
Ridley, S., Jones, S., Shahini, A., Brampton, W., Nielsen, M. and Rowan, K. (1998), Classification trees: a possible method for ISO-resource grouping in intensive care, Anaesthesia, Vol. 53, pp.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. and Altman, R. (2001), Missing value estimation methods for DNA, Bioinformatics, Vol. 17, pp.
Witten, I. and Frank, E. (2000), Data Mining Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, San Francisco, CA.

Corresponding author
Elia El-Darzi can be contacted at: eldarze@westminster.ac.uk