Elia El-Darzi School of Computer Science, University of Westminster, London, UK


Applying data mining algorithms to inpatient dataset with missing values

Peng Liu
School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai, China

Elia El-Darzi
School of Computer Science, University of Westminster, London, UK

Lei Lei
School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai, China, and

Christos Vasilakis, Panagiotis Chountas and Wei Huang
School of Computer Science, University of Westminster, London, UK

Abstract
Purpose: Data preparation plays an important role in data mining, as most real-life data sets contain missing data. This paper aims to investigate different treatment methods for missing data.
Design/methodology/approach: This paper introduces, analyses and compares well-established treatment methods for missing data and proposes new methods based on the naïve Bayesian classifier. These methods have been implemented and compared using a real-life geriatric hospital dataset.
Findings: In the case where a large proportion of the data is missing and many attributes have missing data, treatment methods based on the naïve Bayesian classifier perform very well.
Originality/value: This paper proposes an effective missing data treatment method and offers a viable approach to predicting inpatient length of stay from a data set with many missing values.
Keywords: Health services, Patients, Data analysis
Paper type: Research paper

Introduction
In recent years data mining (DM) approaches have been widely applied in the field of healthcare (Ceglowski et al., 2005; Isken and Rajagopalan, 2002; Ridley et al., 1998). Typically, the DM process comprises six steps: understanding the problem domain, understanding the data, preparing the data, data mining, evaluating the discovered knowledge, and finally using the discovered knowledge (Krzysztof and Kurgan, 2002). Cabena et al.
(1998) estimate that about 20 per cent of the effort is spent on business objective determination, about 60 per cent on data preparation, and about 10 per cent each on data mining and on the analysis of knowledge and knowledge assimilation steps. Why is more than half of the project effort spent on data preparation? Real-world datasets suffer from many serious data quality problems. Problems often encountered include incomplete, redundant, inconsistent, or noisy data. If not addressed, these quality problems will certainly reduce the performance of data mining algorithms (Liu and Motoda, 1998). Hence, in many cases, a lot of effort and time is spent on the data pre-processing phase. The application of efficient and sound data pre-processing procedures can reduce the amount of data to be analysed without losing any critical information, improve the quality of the data, enhance the performance of the actual data mining algorithms and reduce the execution time of mining algorithms (Liu and Motoda, 1998).

Journal of Enterprise Information Management, Vol. 21 No. 1, 2008. Emerald Group Publishing Limited.

Widely used data pre-processing techniques that have proved useful in practice include data cleaning, integration, and transformation (Han and Kamber, 2000). In addition to these, feature selection, extraction, construction and discretisation are also widely applied (Han and Kamber, 2000; Kantardzic, 2003).

This paper focuses on the important issue of dealing with missing values in data pre-processing. The rest of the paper is organised as follows. In the next section several well-established methods for handling missing values are introduced. In the section that follows we propose several models to deal with missing values based on the naïve Bayesian classifier and the information gain method. All the models were applied to a geriatric hospital data set and computational results are reported in the experiments section. Finally, conclusions and further work are discussed in the final section.

Methods for handling missing values
Although some mining methods (e.g. naïve Bayesian) are robust to missing values, other methods such as decision tree and K-means clustering cannot be performed directly on data that have missing values (Troyanskaya et al., 2001). Handling missing values improperly may introduce errors that accumulate and propagate through subsequent runs. Methods for resolving missing values are therefore needed. The following methods are commonly used to address these issues:

- Eliminating instances that contain missing values (Kantardzic, 2003). Although this method is the simplest, it is only effective on datasets containing a small number of instances with missing values. Its performance is especially poor when the percentage of missing values per attribute varies considerably (Han and Kamber, 2000). Further, if instances with missing values account for a large share of the dataset, their omission will destroy the integrity of the original dataset.
- Replacing missing values with a global constant or attribute mean (Kantardzic, 2003). The global constant or attribute mean is essentially a new value for the attribute. However, in many cases the replaced values may not be correct. Poor substitution of missing values may introduce inaccuracies and bias into the data, and hence this method is not considered to be very effective.
- K nearest neighbours (KNN) algorithm (Troyanskaya et al., 2001). In this method all the missing values in the dataset are approximated with means. For each instance, the distances to all other instances are computed and the k instances with the smallest distances to it are selected. The missing value in the instance is then replaced with the weighted average of the k values belonging to the k nearest instances. The main advantage of this method is that it can provide estimates for both qualitative and quantitative attributes. The major drawback is that the whole dataset has to be searched for each instance.

- Prediction model (Kantardzic, 2003). These methods create a predictive model to estimate values that can be used to replace the missing values in the dataset. For example, using the data present in the dataset, a decision tree or a Bayesian classifier can be constructed to predict the missing values. In contrast to the previous methods, this approach uses the most information from the present data to predict missing values (Han and Kamber, 2000), and for this reason it is becoming increasingly popular.

Models for handling missing values

Model building
Based on the above methods for dealing with missing values, four models for handling missing values are evaluated and explained in this section. Estimating missing values using a naïve Bayesian classifier is also discussed.

Model 1: basic model. Missing values are left untreated and retained in this model. The original dataset, called dataset 0, which contains the missing values, is used as the training set. A decision tree using the C4.5 algorithm (Quinlan, 1993) is built to evaluate the performance of the model. This model is used as the benchmark for comparing the performance of the other models.

Model 2: Bayesian estimation model. Missing values in this model are replaced by values estimated by a naïve Bayesian classifier. Dataset 0 is used as the training set to build the naïve Bayesian classifier (Hand et al., 2001). This classifier is used to estimate all the missing values in dataset 0. The new complete dataset is called dataset B.

Model 3: Bayesian estimation iteration model. The Bayesian estimation model uses dataset 0 to estimate all the missing values in the dataset without utilising the values that have been estimated earlier. For classifiers, the larger the training dataset, the more effective the classifier. Therefore this iteration model makes full use of the estimated values. The algorithm consists of the following steps:

(1) An attribute with missing values is chosen. The missing values of this attribute are estimated by a Bayesian classifier, which is built using dataset 0. A new dataset, called C1, is created from dataset 0 by replacing the missing values in the chosen attribute with the estimated values.
(2) Dataset C1 is used to estimate the missing values in the next attribute, and these missing values are replaced with the estimated values in dataset C1 to create a new dataset, called C2.
(3) Step 2 is repeated until all the missing values in dataset 0 have been replaced and a new complete dataset without any missing values, called dataset C, has been created.

In this model, missing values are estimated on the basis of earlier estimates, so the values that have been estimated earlier are fully utilised. However, a different order of estimation will result in a different dataset in the Bayesian estimation iteration model. Thus, ten estimation orders are generated randomly, and each order builds one Bayesian estimation iteration model. The prediction results of these ten models differ, so the order of estimation must be chosen carefully to find a better solution to this prediction problem.
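The difference between models 2 and 3 can be illustrated with a small sketch. The code below is not the authors' implementation: it is a minimal categorical naïve Bayes estimator written for illustration, and the record layout (a list of dictionaries with `None` marking a missing value), the Laplace smoothing, the choice to skip missing cells when counting, and all function and attribute names are assumptions.

```python
import math
from collections import Counter, defaultdict

MISSING = None  # assumed marker for a missing value


def nb_estimate(train, target, row):
    """Most probable value of `target` for `row` under a categorical naive
    Bayes model fitted on the rows of `train` where `target` is observed."""
    labelled = [r for r in train if r[target] is not MISSING]
    priors = Counter(r[target] for r in labelled)
    features = [a for a in row if a != target and row[a] is not MISSING]
    counts = defaultdict(lambda: defaultdict(Counter))  # attr -> label -> value
    domain = defaultdict(set)
    for r in labelled:
        for a in features:
            if r[a] is not MISSING:  # missing cells are simply not counted
                counts[a][r[target]][r[a]] += 1
                domain[a].add(r[a])
    best_label, best_score = None, -math.inf
    for label, n in priors.items():
        score = math.log(n / len(labelled))
        for a in features:
            seen = sum(counts[a][label].values())
            k = len(domain[a] | {row[a]})
            # Laplace smoothing keeps unseen values from zeroing the product
            score += math.log((counts[a][label][row[a]] + 1) / (seen + k))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def impute_model2(data, attrs):
    """Model 2: every estimate is made from the original data (dataset 0)."""
    result = [dict(r) for r in data]
    for a in attrs:
        for original, filled in zip(data, result):
            if original[a] is MISSING:
                filled[a] = nb_estimate(data, a, original)
    return result


def impute_model3(data, attrs):
    """Model 3: attributes are filled in the given order, and each completed
    attribute is available when the next one is estimated."""
    current = [dict(r) for r in data]
    for a in attrs:
        snapshot = [dict(r) for r in current]  # plays the role of C1, C2, ...
        for r in current:
            if r[a] is MISSING:
                r[a] = nb_estimate(snapshot, a, r)
    return current


rows = [
    {"barthel": "low", "outcome": "transfer", "admission": "emergency"},
    {"barthel": "low", "outcome": "transfer", "admission": "emergency"},
    {"barthel": "high", "outcome": "home", "admission": "planned"},
    {"barthel": "high", "outcome": "home", "admission": "planned"},
    {"barthel": None, "outcome": "home", "admission": "planned"},
    {"barthel": "low", "outcome": None, "admission": "emergency"},
]
filled = impute_model3(rows, ["barthel", "outcome"])
print(filled[4]["barthel"], filled[5]["outcome"])  # high transfer
```

In `impute_model3` the order of `attrs` matters, because each pass trains on the output of the previous one; in `impute_model2` it does not, since every estimate is made from the untouched input.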

Model 4: Bayesian estimation iteration model based on information gain. In the Bayesian estimation model (model 2), only one dataset, dataset 0, is used as the training set. Therefore the order of estimation has no influence on the final dataset: regardless of which attribute is estimated first or second, the final dataset is the same. However, model 4 differs from model 3 in that the order of estimation plays an important role in the missing-value handling process. The aim is to overcome some inherent inefficiencies of model 3. More specifically:

- The accuracy of the values estimated early affects the accuracy of the values estimated later. If incorrect values are used to estimate the remaining missing values, the inaccuracy of the values estimated later is higher. If, for instance, the estimation accuracy of the first attribute is x per cent, then the estimation accuracy of the second attribute is $(x\%)^2$, and the estimation accuracy of the nth attribute is $(x\%)^n$. If, for example, there are nine attributes containing missing values in a dataset and the estimation accuracy of the first attribute is 90 per cent, then the estimation accuracy of the ninth attribute is roughly 38.7 per cent ($0.9^9 \approx 0.387$). That is to say, the more values we estimate, the lower their authenticity and reliability. Clearly this is not the best way to estimate and replace all the missing values in the dataset.
- In model 3, a different choice of attribute at each stage of the estimation leads to a different dataset. For every new dataset, a new naïve Bayesian classifier has to be built, and the values estimated afterwards will change accordingly. There are therefore 362,880 (9!) different orders if nine attributes need to be estimated, and these will result in 362,880 different datasets. The computational cost of evaluating all these orders to find the optimal dataset is very high.
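The two figures in the list above are easy to check. The following lines are only a worked arithmetic sketch of the simplified assumption stated in the text (each stage multiplies in the same accuracy); the function name is hypothetical.

```python
import math


def compounded_accuracy(first_accuracy, position):
    """Accuracy at the given 1-based position in the estimation order,
    assuming each stage multiplies in the same accuracy as the first."""
    return first_accuracy ** position


print(round(compounded_accuracy(0.90, 9), 3))  # 0.387
print(math.factorial(9))                       # 362880
```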
In the light of these inherent inefficiencies, an effective criterion for choosing the next attribute that contains missing values is essential. Here we choose information gain as the criterion for attribute choice. Information theory (Witten and Frank, 2000) provides us with many effective heuristics, such as gain ratio, distance measure, and relevance, for selecting such an attribute. Information gain is a mature criterion, broadly accepted by many researchers. The widely applied decision tree algorithm C4.5 also bases its choice of nodes on information gain (through the gain ratio). Following the descending order of information gain, we use a Bayesian estimation iteration model to estimate and replace, one by one, each attribute that contains missing values. The final dataset built under this criterion is called dataset D. The above models are summarised in Table I.

Table I. Models comparison

Model      Handling of missing values   Estimate orders
Model 1    Keep                         N/A
Model 2    Estimate and replace         Random
Model 3    Estimate and replace         Random and iterated
Model 4    Estimate and replace         Ordered and iterated
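As a rough illustration of how an estimation order could be derived for model 4, the sketch below ranks attributes by their information gain with respect to a class attribute. It is not the authors' code: the record layout, the attribute names, and in particular the decision to compute the gain only over rows where both the attribute and the class are observed are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attr, target):
    """Gain of `attr` with respect to `target`, using only the rows in
    which both values are observed (an assumption of this sketch)."""
    known = [r for r in rows if r[attr] is not None and r[target] is not None]
    if not known:
        return 0.0
    groups = defaultdict(list)
    for r in known:
        groups[r[attr]].append(r[target])
    remainder = sum(len(g) / len(known) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in known]) - remainder


def estimation_order(rows, attrs, target):
    """Attributes sorted by descending information gain."""
    return sorted(attrs, key=lambda a: information_gain(rows, a, target),
                  reverse=True)


rows = [
    {"barthel": "low", "season": "winter", "los": "long"},
    {"barthel": "low", "season": "summer", "los": "long"},
    {"barthel": "high", "season": "winter", "los": "short"},
    {"barthel": "high", "season": "summer", "los": "short"},
    {"barthel": None, "season": "winter", "los": "short"},
]
print(estimation_order(rows, ["season", "barthel"], "los"))
# ['barthel', 'season']
```

The resulting list can then be passed as the attribute order to an iterated imputation routine, so that the attributes contributing most to the classification are estimated first.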

Evaluation of the models
The naïve Bayesian classifier, the decision tree C4.5 algorithm and information gain are used in this paper. The performance of missing-value handling is evaluated by a decision tree C4.5 prediction model and ten-fold cross-validation (Witten and Frank, 2000). If the prediction model performs satisfactorily, then the performance of the missing-value handling model is considered to be acceptable. Decision trees have been shown to be particularly robust in healthcare applications compared with neural networks, regression models, and discriminant analysis (Harper, 2005).

Two concepts used for measuring the performance of data mining algorithms, prediction accuracy and prediction profit (Liu et al., 2004), are defined as follows. Prediction accuracy can be applied at two levels, model level and class level:

$$\text{Prediction accuracy of model} = \frac{\text{Number of correctly categorised instances}}{\text{Total number of instances}} \times 100\%$$

$$\text{Prediction accuracy of class} = \frac{\text{Number of correctly categorised instances in the class}}{\text{Total number of instances in the class}} \times 100\%$$

The overall prediction accuracy has to be judged relative to the class distribution of the original dataset. If all the instances in the dataset are predicted to belong to the largest class, the prediction accuracy equals the proportion of that class in the dataset. Let $a$ be this maximal class proportion of the original dataset. A prediction model is of no benefit if its prediction accuracy is high in absolute terms but not higher than $a$. In contrast, if the prediction accuracy of the model is low in absolute terms but still higher than $a$, then the model is considered acceptable.
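The accuracy definitions and the baseline $a$ translate directly into code. The following is a minimal sketch for illustration only; the function names and the toy label lists are assumptions and do not come from the clinics dataset.

```python
from collections import Counter


def model_accuracy(actual, predicted):
    """Percentage of all instances that are correctly categorised."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)


def class_accuracy(actual, predicted, label):
    """Percentage of the instances of one class that are correctly categorised."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a == label]
    return 100.0 * sum(a == p for a, p in pairs) / len(pairs)


def majority_baseline(actual):
    """The value a: accuracy reached by assigning every instance to the
    largest class."""
    return 100.0 * Counter(actual).most_common(1)[0][1] / len(actual)


actual = ["short"] * 4 + ["medium"] * 5 + ["long"]
predicted = ["short", "short", "medium", "medium",
             "medium", "medium", "medium", "medium", "short",
             "medium"]
print(model_accuracy(actual, predicted))           # 60.0
print(class_accuracy(actual, predicted, "long"))   # 0.0
print(majority_baseline(actual))                   # 50.0
```

In this toy example the model beats the baseline overall (60 against 50 per cent) while never identifying the smallest class, which is why the paper reports accuracy at class level as well as model level.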
In order to express this concept, the prediction profit of the model is introduced as follows:

$$\text{Prediction profit of model} = \frac{\text{Prediction accuracy of the model} - a}{a} \times 100\%$$

The prediction profit can also be extended to class level as expressed below:

$$\text{Prediction profit of class} = \frac{\text{Prediction accuracy of class} - \text{Prediction accuracy of class of basic model}}{\text{Prediction accuracy of class of basic model}} \times 100\%$$

Model experiments
Data mining has wide use in the healthcare domain. One of the main concerns in the healthcare area is the measurement of the flow of patients through hospitals and other healthcare facilities. For instance, if the inpatient length of stay (LOS) can be predicted efficiently, the planning and management of hospital resources can be greatly enhanced (Marshall et al., 2005). Data mining algorithms have also been successfully applied to predict LOS (for example, see Harper, 2002; McClean and Glackin, 2000). However, most healthcare datasets contain a lot of missing values. The accuracy of prediction will benefit from an improvement of the dataset in the pre-processing stage. For these reasons we apply all the missing-value handling models discussed earlier to a real-life dataset to improve the accuracy of predictive models of LOS, especially for those patient spells that have a very long LOS. Hospital LOS of inpatients is frequently used as a proxy for measuring the consumption of hospital resources, and therefore it is essential to develop accurate models for the prediction of inpatient LOS.

Clinics dataset
The clinics dataset contains data from a clinical computer system that was used between 1994 and 1997 for the management of patients in a geriatric medicine department of a metropolitan teaching hospital in the UK (Marshall et al., 2001). It contains 4,722 patient records including patient demographic details, admission reasons, discharge details, outcome, and LOS. Patient LOS has an average of 85 days and a median of 17 days. For ease of analysis, the duration of stay variable was categorised into three groups: 0-14 days, 15-60 days and 61+ days (variable LOS GROUP). The boundaries of the LOS groups were chosen in agreement with clinical judgment to help describe the stages of care in such a hospital department. The first, short-stay group (0-14 days) roughly corresponds to patients receiving acute care, i.e. patients admitted in a critical condition and only staying in hospital for a short period of time. The second, medium-stay group (15-60 days) corresponds to patients who undergo a period of rehabilitation. The third, long-stay group (61+ days) refers to the 11 per cent of patients who stay in hospital for a long period of time.

Following Marshall et al. (2001), we extract 28 attributes for pre-processing. Attribute generalisation (Carter and Hamilton, 1998) is applied to the attributes SEASON (the season of admission) and AGE85 (patients over 85 years of age).
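The LOS GROUP categorisation and the AGE85 generalisation described above amount to simple discretisation rules. The sketch below shows one way they might be written; the function names and the returned labels are hypothetical, and only the boundaries (14, 60 and 85) come from the text.

```python
def los_group(days):
    """Map a length of stay in days to the three LOS GROUP categories."""
    if days < 0:
        raise ValueError("length of stay cannot be negative")
    if days <= 14:
        return "short-stay"   # 0-14 days
    if days <= 60:
        return "medium-stay"  # 15-60 days
    return "long-stay"        # 61+ days


def age85(age):
    """Generalised age attribute: True for patients over 85 years of age."""
    return age > 85


print([los_group(d) for d in (0, 14, 15, 60, 61)])
# ['short-stay', 'short-stay', 'medium-stay', 'medium-stay', 'long-stay']
```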
Additionally, a grouped Barthel score, which is explained below, is introduced to replace the original ten different Barthel scores. The Barthel score is composed of various indices that assess the patient's ability to carry out everyday activities and their dependence on others for support (Mahony and Barthel, 1965). The score consists of ten different elements: feeding, grooming, bathing, mobility, stairs, dressing, transfer, toilet, bladder, and bowels (Figure 1). Each patient is assessed using scales of dependency ranging from 0 to 3. A low score generally means a high level of patient dependency on others, while a high score reflects those patients who are independent and require little assistance from medical staff.

Figure 1. Attributes depicted in hierarchical structure

In order to simplify this scoring system for patient dependency, a grouped Barthel score is introduced as in Marshall et al. (2001). This revised scoring system is defined as follows: heavily dependent (Barthel score ≤ 1), very dependent (Barthel score: 2-10), slightly dependent (Barthel score: 11-19) and independent (Barthel score ≥ 20). These procedures reduced the total number of attributes in the original data set from 28 to 19 without losing any critical information, yielding clinics dataset 0 (Liu et al., 2004).

There are many missing values in the clinics data set. More specifically, 3,017 instances (63.89 per cent) contain missing values. The proportion of missing values for LOS GROUP is 63.29, and per cent, respectively, for the short-, medium- and long-stay groups. The accuracy of the prediction model, and especially the accuracy related to the long-stay group, can be increased by handling the missing values appropriately.

Practice and analysis
The prediction accuracies of the models for handling missing values described in the previous section, and the associated accuracy of each class, are listed in Table II, while the prediction profits are listed in Table III. When calculating the prediction profit of a class, the prediction accuracy of that class is compared with that of the basic model. The clinics dataset consists of approximately per cent of patients who stay in hospital for a short period of time (short-stay), per cent of patients who stay in hospital for a medium period of time (medium-stay) and per cent of patients who stay in hospital for a long period of time (long-stay). That is to say, the prediction accuracy of attribute LOS will reach per cent if we conclude that all the patients are medium-stay. Obviously there is a lot of room for improvement.
Table II. The prediction accuracy of all the models and classes. Columns: Model; Accuracy of model (%); Accuracy of class: Short-stay (%), Medium-stay (%), Long-stay (%).

Table III. Prediction profits of three missing values handling models. Columns: Model; Prediction profits of model (%); Prediction profits of class: Short-stay (%), Medium-stay (%), Long-stay (%).

In Table II, although the overall prediction accuracy of model 2 and that of the medium-stay class of model 2 are a little lower than those of model 1, the prediction accuracy of the long-stay class has improved considerably. The prediction accuracy of the long-stay class is 10 per cent in model 1, before the missing values were estimated and replaced. After the missing values were estimated and replaced in model 2, the prediction accuracy of the long-stay class increased to 17 per cent, a 70 per cent prediction profit. This means that model 2 (Bayesian estimation model) has a positive effect on the prediction of the long-stay class, which contains more than 60 per cent missing values.

For model 3, the Bayesian estimation iteration model, ten attribute orders were created randomly. These ten orders produced ten sets of model and class accuracies (see Table IV). From Table II, the average prediction accuracy of the Bayesian estimation iteration model (model 3) is found to be higher than that of model 1, especially for the long-stay category, i.e. per cent in Table III. Model 3 is therefore more effective than model 1 and model 2. However, the order of the attribute estimation is a big problem in this model, as already explained. There are 40,320 (8!) different orders for eight attributes and only ten of these are shown in Table IV. Hence it is computationally impractical to evaluate all the orders to find the optimal values.

As explained earlier, model 4 (Bayesian estimation iteration model based on information gain) was developed from model 3. In this model the attributes deemed important to the prediction of LOS are estimated first. In this way, the reliability of these important attributes can be assured. The prediction accuracy is also improved through the enhancement of the quality of the dataset. The information gain of the attributes in clinics dataset 0 is shown in Table V.
Following the order in Table V, the Bayesian estimation iteration model based on information gain was built, and the resulting prediction accuracies are listed in Table VI (note that 0* in Table VI denotes the prediction accuracies of the basic model, without estimating any missing values). The prediction accuracy of the model is highest when the third attribute, ADMGRP, has been estimated (see Figure 2). The sequence of estimation has a great effect on the results of the prediction model. A good sequence can improve the efficiency of the model and reduce the time to estimate performance. Hence, model 4 stops when the first three attributes with missing data have been filled in, and this semi-complete dataset D is then used as the training set for the predictive model. The prediction accuracies of the model and each category can be seen on the third line of Table VI.

Table IV. Results of ten Bayesian estimation iteration models. Columns: No.; Accuracy of model (%); Accuracy of class: Short-stay (%), Medium-stay (%), Long-stay (%). Final row: Average.

Table V. Information gain of attributes (descending). Columns: Attribute; Gain. Attributes, in descending order of gain:
BARTHEL: Barthel grade
OUTGRP: destination, i.e. dead/home/transfer
ADMGRP: admission method
LIVEGRP: lives alone or not
YEARGRP: year of admission
KINGRP: next of kin
OTHERGRP: unknown
MARGRP: marital status
SEASON: season of admission

Table VI. Process of Bayesian estimation iteration model on information gain (descending). Columns: Order; Attribute; Accuracy of model (%); Accuracy of class: Short-stay (%), Medium-stay (%), Long-stay (%). Rows, in order: 0*, BARTHEL, OUTGRP, ADMGRP, LIVEGRP, YEARGRP, KINGRP, MARGRP, SEASON.

Figure 2. Prediction accuracy of long stay for every stage in descending order of information gain

If the mining algorithm is robust to missing values, then we only need to estimate the first three attributes instead of all the attributes containing missing values. If not, we should replace all the missing data in the dataset to produce a complete one. In order to reduce the computational cost, model 4 and model 2 are combined here. First, model 4 is applied to the attributes BARTHEL, OUTGRP and ADMGRP to make dataset D; then model 2 is applied to dataset D to substitute the remaining missing data and produce the final complete dataset D. Dataset D is the training set for building and evaluating the C4.5 decision tree. The prediction accuracy of the model is per cent and the accuracy of each class is 45.4, 69.7, and 29.9 per cent, respectively. Compared with model 1, this method is more effective. The Bayesian estimation iteration model based on information gain achieves a greater improvement at a lower cost. For the long-stay class in particular, the improvement is significant, with the prediction profit reaching 211 per cent. The prediction accuracies of the model and of each class are higher than those of the Bayesian estimation iteration model based on random attribute order. This is because this model takes into account the classification contribution of each attribute, that is, its information gain. All the accuracies of models and classes in Table II are summarised in Figure 3.

The models for dealing with missing values proposed in this paper can improve the prediction accuracy of the whole model, especially the accuracy of the long-stay category. The biggest prediction profit for the long-stay class is 211 per cent, which shows that the missing-value handling model is very effective. Dealing with the missing values is critical for improving the prediction accuracy of the long-stay class, in which missing values account for more than half.
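The prediction profit figures quoted in this section follow from a single relative-improvement formula. As a minimal sketch (the function name is hypothetical), the 70 per cent profit reported for the long-stay class of model 2 can be reproduced from the rounded accuracies given in the text, 10 per cent for the basic model and 17 per cent for model 2:

```python
def prediction_profit(accuracy, reference):
    """Relative improvement of `accuracy` over `reference`, in per cent.

    At model level the reference is the maximal class proportion a; at
    class level it is the accuracy of the same class in the basic model.
    """
    return 100.0 * (accuracy - reference) / reference


print(round(prediction_profit(17.0, 10.0)))  # 70
```

A negative result indicates that the model performs worse than the reference it is compared with.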
Figure 3. The prediction accuracy of all the models and classes

Conclusions
Data mining procedures, pre-processing approaches and several different missing-value estimation methods are introduced in this paper. We apply data mining approaches to the healthcare domain in order to predict the inpatient length of stay (LOS).

In this paper, four models are developed to improve the efficiency of the prediction: the basic model; the Bayesian estimation model; the Bayesian estimation iteration model; and the Bayesian estimation iteration model based on information gain. The performance and suitability of these models are evaluated using a real-life dataset. To make full use of all the data at hand, the values that have been estimated earlier are used to enrich the present dataset and to estimate the remaining missing values, as in model 3 (Bayesian estimation iteration model). It is very important to decide the order of the missing-value estimation. Drawing on the concept of entropy from information theory, information gain is used for attribute selection, as in model 4 (Bayesian estimation iteration model based on information gain). This model gives a better solution in terms of prediction profit and computational cost. The C4.5 decision tree and ten-fold cross-validation are used to estimate the performance of each model. It is shown that using a naïve Bayesian classifier to predict missing values iteratively in descending order of information gain is the most effective approach, especially for long-stay patients. The proportion of these patients is normally small; however, they tend to stay for a very long time in comparison with short-stay or medium-stay patients, and hence predicting their LOS can only improve the planning and management of hospital resources.

However, there are other elements, besides information gain, that may affect the efficiency of the methods for missing-value handling. An example is the percentage of missing values for each attribute. If we take this into account, the sequence of attributes for estimation will change. Further work is needed to find a better ordering criterion for estimation in order to improve the models in this paper.

References
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J. and Zanasi, A. (1998), Discovering Data Mining: From Concepts to Implementation, Prentice-Hall, Upper Saddle River, NJ.
Carter, C. and Hamilton, H. (1998), "Efficient attribute-oriented generalization for knowledge discovery from large databases", IEEE Transactions on Knowledge and Data Engineering, Vol. 10, pp.
Ceglowski, A., Churilov, L. and Wassertheil, J. (2005), "Knowledge discovery through mining emergency department data", Proceedings of the 38th Hawaii International Conference on System Sciences.
Han, J. and Kamber, M. (2000), Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Mateo, CA.
Hand, D., Mannila, H. and Smyth, P. (2001), Principles of Data Mining, MIT Press, Cambridge, MA.
Harper, P. (2002), "A framework for operational modelling of hospital resources", Health Care Management Science, Vol. 5, pp.
Harper, P. (2005), "A review and comparison of classification algorithms for medical decision making", Health Policy, Vol. 71, pp.
Isken, M. and Rajagopalan, B. (2002), "Data mining to support simulation modeling of patient flow in hospitals", Journal of Medical Systems, Vol. 26, pp.
Kantardzic, M. (2003), Data Mining: Concepts, Models, Methods and Algorithms, Wiley-IEEE Computer Society Press, New York, NY.

Krzysztof, C. and Kurgan, L. (2002), "Trends in data mining and knowledge discovery", in Pal, N.R., Jain, L.C. and Teoderesku, N. (Eds), Knowledge Discovery in Advanced Information Systems, Springer, Berlin.
Liu, H. and Motoda, H. (1998), Feature Extraction, Construction and Selection: A Data Mining Perspective, Kluwer Academic, Boston, MA.
Liu, P., El-Darzi, E., Vasilakis, C., Chountas, P., Huang, W. and Lei, L. (2004), "Comparative analysis of data mining algorithms for predicting inpatient length of stay", in Chih-Ping Wei (Ed.), Proceedings of the Eighth Pacific-Asia Conference on Information Systems, Information Systems Adoption and Business Productivity, pp.
McClean, S. and Glackin, M. (2000), "Using genetic algorithms to discover event history rules", The Fourth International Conference and Exhibition on The Practical Application of Knowledge Discovery and Data Mining (PADD), pp.
Mahony, F. and Barthel, D. (1965), "Functional evaluation: the Barthel index", Maryland State Medical Journal, Vol. 14, pp.
Marshall, A., McClean, S., Shapcott, C., Hastie, I. and Millard, P. (2001), "Developing a Bayesian belief network for the management of geriatric hospital care", Health Care Management Science, Vol. 4, pp.
Marshall, A., Vasilakis, C. and El-Darzi, E. (2005), "Length of stay-based patient flow models: recent developments and future directions", Health Care Management Science Journal, Vol. 8, pp.
Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA.
Ridley, S., Jones, S., Shahini, A., Brampton, W., Nielsen, M. and Rowan, K. (1998), "Classification trees: a possible method for ISO-resource grouping in intensive care", Anaesthesia, Vol. 53, pp.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. and Altman, R. (2001), "Missing value estimation methods for DNA", Bioinformatics, Vol. 17, pp.
Witten, I. and Frank, E. (2000), Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, San Francisco, CA.

Corresponding author
Elia El-Darzi can be contacted at: eldarze@westminster.ac.uk


More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

Healthcare Measurement Analysis Using Data mining Techniques

Healthcare Measurement Analysis Using Data mining Techniques www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 03 Issue 07 July, 2014 Page No. 7058-7064 Healthcare Measurement Analysis Using Data mining Techniques 1 Dr.A.Shaik

More information

DATA MINING, DIRTY DATA, AND COSTS (Research-in-Progress)

DATA MINING, DIRTY DATA, AND COSTS (Research-in-Progress) DATA MINING, DIRTY DATA, AND COSTS (Research-in-Progress) Leo Pipino University of Massachusetts Lowell Leo_Pipino@UML.edu David Kopcso Babson College Kopcso@Babson.edu Abstract: A series of simulations

More information

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical

More information

How To Solve The Kd Cup 2010 Challenge

How To Solve The Kd Cup 2010 Challenge A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China catch0327@yahoo.com yanxing@gdut.edu.cn

More information

Predicting Student Performance by Using Data Mining Methods for Classification

Predicting Student Performance by Using Data Mining Methods for Classification BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, No 1 Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0006 Predicting Student Performance

More information

PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS

PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS Kalpesh Adhatrao, Aditya Gaykar, Amiraj Dhawan, Rohit Jha and Vipul Honrao ABSTRACT Department of Computer Engineering, Fr.

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Random forest algorithm in big data environment

Random forest algorithm in big data environment Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest

More information

Data Mining: A Preprocessing Engine

Data Mining: A Preprocessing Engine Journal of Computer Science 2 (9): 735-739, 2006 ISSN 1549-3636 2005 Science Publications Data Mining: A Preprocessing Engine Luai Al Shalabi, Zyad Shaaban and Basel Kasasbeh Applied Science University,

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

Subject Description Form

Subject Description Form Subject Description Form Subject Code Subject Title COMP417 Data Warehousing and Data Mining Techniques in Business and Commerce Credit Value 3 Level 4 Pre-requisite / Co-requisite/ Exclusion Objectives

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

Data Mining Solutions for the Business Environment

Data Mining Solutions for the Business Environment Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania ruxandra_stefania.petre@yahoo.com Over

More information

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Erkan Er Abstract In this paper, a model for predicting students performance levels is proposed which employs three

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH

EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH SANGITA GUPTA 1, SUMA. V. 2 1 Jain University, Bangalore 2 Dayanada Sagar Institute, Bangalore, India Abstract- One

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI Data Mining Knowledge Discovery, Data Warehousing and Machine Learning Final remarks Lecturer: JERZY STEFANOWSKI Email: Jerzy.Stefanowski@cs.put.poznan.pl Data Mining a step in A KDD Process Data mining:

More information

Clustering Marketing Datasets with Data Mining Techniques

Clustering Marketing Datasets with Data Mining Techniques Clustering Marketing Datasets with Data Mining Techniques Özgür Örnek International Burch University, Sarajevo oornek@ibu.edu.ba Abdülhamit Subaşı International Burch University, Sarajevo asubasi@ibu.edu.ba

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

A decision support system for bed-occupancy management and planning hospitals

A decision support system for bed-occupancy management and planning hospitals IMA Journal of Mathematics Applied in Medicine & Biology (1995) 12, 249-257 A decision support system for bed-occupancy management and planning hospitals SALLY MCCLEAN Division of Mathematics, School of

More information

Database Marketing, Business Intelligence and Knowledge Discovery

Database Marketing, Business Intelligence and Knowledge Discovery Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski

More information

Performance Study on Data Discretization Techniques Using Nutrition Dataset

Performance Study on Data Discretization Techniques Using Nutrition Dataset 2009 International Symposium on Computing, Communication, and Control (ISCCC 2009) Proc.of CSIT vol.1 (2011) (2011) IACSIT Press, Singapore Performance Study on Data Discretization Techniques Using Nutrition

More information

2.1. Data Mining for Biomedical and DNA data analysis

2.1. Data Mining for Biomedical and DNA data analysis Applications of Data Mining Simmi Bagga Assistant Professor Sant Hira Dass Kanya Maha Vidyalaya, Kala Sanghian, Distt Kpt, India (Email: simmibagga12@gmail.com) Dr. G.N. Singh Department of Physics and

More information

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE Kasra Madadipouya 1 1 Department of Computing and Science, Asia Pacific University of Technology & Innovation ABSTRACT Today, enormous amount of data

More information

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,

More information

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Jerzy B laszczyński 1, Krzysztof Dembczyński 1, Wojciech Kot lowski 1, and Mariusz Paw lowski 2 1 Institute of Computing

More information

Assessing Data Mining: The State of the Practice

Assessing Data Mining: The State of the Practice Assessing Data Mining: The State of the Practice 2003 Herbert A. Edelstein Two Crows Corporation 10500 Falls Road Potomac, Maryland 20854 www.twocrows.com (301) 983-3555 Objectives Separate myth from reality

More information

Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

More information

Data mining in the e-learning domain

Data mining in the e-learning domain Data mining in the e-learning domain The author is Education Liaison Officer for e-learning, Knowsley Council and University of Liverpool, Wigan, UK. Keywords Higher education, Classification, Data encapsulation,

More information

College information system research based on data mining

College information system research based on data mining 2009 International Conference on Machine Learning and Computing IPCSIT vol.3 (2011) (2011) IACSIT Press, Singapore College information system research based on data mining An-yi Lan 1, Jie Li 2 1 Hebei

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 12, December 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Towards applying Data Mining Techniques for Talent Mangement

Towards applying Data Mining Techniques for Talent Mangement 2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Towards applying Data Mining Techniques for Talent Mangement Hamidah Jantan 1,

More information

Cost Drivers of a Parametric Cost Estimation Model for Data Mining Projects (DMCOMO)

Cost Drivers of a Parametric Cost Estimation Model for Data Mining Projects (DMCOMO) Cost Drivers of a Parametric Cost Estimation Model for Mining Projects (DMCOMO) Oscar Marbán, Antonio de Amescua, Juan J. Cuadrado, Luis García Universidad Carlos III de Madrid (UC3M) Abstract Mining is

More information

USING DATA SCIENCE TO DISCOVE INSIGHT OF MEDICAL PROVIDERS CHARGE FOR COMMON SERVICES

USING DATA SCIENCE TO DISCOVE INSIGHT OF MEDICAL PROVIDERS CHARGE FOR COMMON SERVICES USING DATA SCIENCE TO DISCOVE INSIGHT OF MEDICAL PROVIDERS CHARGE FOR COMMON SERVICES Irron Williams Northwestern University IrronWilliams2015@u.northwestern.edu Abstract--Data science is evolving. In

More information

Introduction to Data Mining Techniques

Introduction to Data Mining Techniques Introduction to Data Mining Techniques Dr. Rajni Jain 1 Introduction The last decade has experienced a revolution in information availability and exchange via the internet. In the same spirit, more and

More information

Three Perspectives of Data Mining

Three Perspectives of Data Mining Three Perspectives of Data Mining Zhi-Hua Zhou * National Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China Abstract This paper reviews three recent books on data mining

More information

A Study of Web Log Analysis Using Clustering Techniques

A Study of Web Log Analysis Using Clustering Techniques A Study of Web Log Analysis Using Clustering Techniques Hemanshu Rana 1, Mayank Patel 2 Assistant Professor, Dept of CSE, M.G Institute of Technical Education, Gujarat India 1 Assistant Professor, Dept

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

Feature Construction for Time Ordered Data Sequences

Feature Construction for Time Ordered Data Sequences Feature Construction for Time Ordered Data Sequences Michael Schaidnagel School of Computing University of the West of Scotland Email: B00260359@studentmail.uws.ac.uk Fritz Laux Faculty of Computer Science

More information

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems 2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Impact of Feature Selection on the Performance of ireless Intrusion Detection Systems

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Robust Outlier Detection Technique in Data Mining: A Univariate Approach Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,

More information

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin Data Mining for Customer Service Support Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin Traditional Hotline Services Problem Traditional Customer Service Support (manufacturing)

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent

More information

Improving students learning process by analyzing patterns produced with data mining methods

Improving students learning process by analyzing patterns produced with data mining methods Improving students learning process by analyzing patterns produced with data mining methods Lule Ahmedi, Eliot Bytyçi, Blerim Rexha, and Valon Raça Abstract Employing data mining algorithms on previous

More information

IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES

IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES Bruno Carneiro da Rocha 1,2 and Rafael Timóteo de Sousa Júnior 2 1 Bank of Brazil, Brasília-DF, Brazil brunorocha_33@hotmail.com 2 Network Engineering

More information

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

More information

Data Mining based on Rough Set and Decision Tree Optimization

Data Mining based on Rough Set and Decision Tree Optimization Data Mining based on Rough Set and Decision Tree Optimization College of Information Engineering, North China University of Water Resources and Electric Power, China, haiyan@ncwu.edu.cn Abstract This paper

More information

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.

More information

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Mobile Phone APP Software Browsing Behavior using Clustering Analysis Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis

More information

Addressing the Class Imbalance Problem in Medical Datasets

Addressing the Class Imbalance Problem in Medical Datasets Addressing the Class Imbalance Problem in Medical Datasets M. Mostafizur Rahman and D. N. Davis the size of the training set is significantly increased [5]. If the time taken to resample is not considered,

More information

The Optimality of Naive Bayes

The Optimality of Naive Bayes The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most

More information

Evaluating Data Mining Models: A Pattern Language

Evaluating Data Mining Models: A Pattern Language Evaluating Data Mining Models: A Pattern Language Jerffeson Souza Stan Matwin Nathalie Japkowicz School of Information Technology and Engineering University of Ottawa K1N 6N5, Canada {jsouza,stan,nat}@site.uottawa.ca

More information

Prediction of Heart Disease Using Naïve Bayes Algorithm

Prediction of Heart Disease Using Naïve Bayes Algorithm Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,

More information

Dynamic Data in terms of Data Mining Streams

Dynamic Data in terms of Data Mining Streams International Journal of Computer Science and Software Engineering Volume 2, Number 1 (2015), pp. 1-6 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining

More information

Enhanced Boosted Trees Technique for Customer Churn Prediction Model

Enhanced Boosted Trees Technique for Customer Churn Prediction Model IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V5 PP 41-45 www.iosrjen.org Enhanced Boosted Trees Technique for Customer Churn Prediction

More information

The treatment of missing values and its effect in the classifier accuracy

The treatment of missing values and its effect in the classifier accuracy The treatment of missing values and its effect in the classifier accuracy Edgar Acuña 1 and Caroline Rodriguez 2 1 Department of Mathematics, University of Puerto Rico at Mayaguez, Mayaguez, PR 00680 edgar@cs.uprm.edu

More information

Management Science Letters

Management Science Letters Management Science Letters 4 (2014) 905 912 Contents lists available at GrowingScience Management Science Letters homepage: www.growingscience.com/msl Measuring customer loyalty using an extended RFM and

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

A Brief Tutorial on Database Queries, Data Mining, and OLAP

A Brief Tutorial on Database Queries, Data Mining, and OLAP A Brief Tutorial on Database Queries, Data Mining, and OLAP Lutz Hamel Department of Computer Science and Statistics University of Rhode Island Tyler Hall Kingston, RI 02881 Tel: (401) 480-9499 Fax: (401)

More information

Network Intrusion Detection Using a HNB Binary Classifier

Network Intrusion Detection Using a HNB Binary Classifier 2015 17th UKSIM-AMSS International Conference on Modelling and Simulation Network Intrusion Detection Using a HNB Binary Classifier Levent Koc and Alan D. Carswell Center for Security Studies, University

More information

College of Health and Human Services. Fall 2013. Syllabus

College of Health and Human Services. Fall 2013. Syllabus College of Health and Human Services Fall 2013 Syllabus information placement Instructor description objectives HAP 780 : Data Mining in Health Care Time: Mondays, 7.20pm 10pm (except for 3 rd lecture

More information

The Effect of Clustering in the Apriori Data Mining Algorithm: A Case Study

The Effect of Clustering in the Apriori Data Mining Algorithm: A Case Study WCE 23, July 3-5, 23, London, U.K. The Effect of Clustering in the Apriori Data Mining Algorithm: A Case Study Nergis Yılmaz and Gülfem Işıklar Alptekin Abstract Many organizations collect and store data

More information

DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA

DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA 315 DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA C. K. Lowe-Ma, A. E. Chen, D. Scholl Physical & Environmental Sciences, Research and Advanced Engineering Ford Motor Company, Dearborn, Michigan, USA

More information

Predicting Students Final GPA Using Decision Trees: A Case Study

Predicting Students Final GPA Using Decision Trees: A Case Study Predicting Students Final GPA Using Decision Trees: A Case Study Mashael A. Al-Barrak and Muna Al-Razgan Abstract Educational data mining is the process of applying data mining tools and techniques to

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining José Hernández ndez-orallo Dpto.. de Systems Informáticos y Computación Universidad Politécnica de Valencia, Spain jorallo@dsic.upv.es Horsens, Denmark, 26th September 2005

More information

Data Mining in the Application of Criminal Cases Based on Decision Tree

Data Mining in the Application of Criminal Cases Based on Decision Tree 8 Journal of Computer Science and Information Technology, Vol. 1 No. 2, December 2013 Data Mining in the Application of Criminal Cases Based on Decision Tree Ruijuan Hu 1 Abstract A briefing on data mining

More information

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

More information

A Mechanism for Selecting Appropriate Data Mining Techniques

A Mechanism for Selecting Appropriate Data Mining Techniques A Mechanism for Selecting Appropriate Data Mining Techniques Rose Tinabo School of Computing Dublin institute of technology Kevin Street Dublin 8 Ireland rose.tinabo@mydit.ie ABSTRACT: Due to an increase

More information

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET DATA MINING TECHNIQUES AND STOCK MARKET Mr. Rahul Thakkar, Lecturer and HOD, Naran Lala College of Professional & Applied Sciences, Navsari ABSTRACT Without trading in a stock market we can t understand

More information

Rule based Classification of BSE Stock Data with Data Mining

Rule based Classification of BSE Stock Data with Data Mining International Journal of Information Sciences and Application. ISSN 0974-2255 Volume 4, Number 1 (2012), pp. 1-9 International Research Publication House http://www.irphouse.com Rule based Classification

More information

Data Mining Analytics for Business Intelligence and Decision Support

Data Mining Analytics for Business Intelligence and Decision Support Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing

More information

Evaluating an Integrated Time-Series Data Mining Environment - A Case Study on a Chronic Hepatitis Data Mining -

Evaluating an Integrated Time-Series Data Mining Environment - A Case Study on a Chronic Hepatitis Data Mining - Evaluating an Integrated Time-Series Data Mining Environment - A Case Study on a Chronic Hepatitis Data Mining - Hidenao Abe, Miho Ohsaki, Hideto Yokoi, and Takahira Yamaguchi Department of Medical Informatics,

More information

DATA PREPARATION FOR DATA MINING

DATA PREPARATION FOR DATA MINING Applied Artificial Intelligence, 17:375 381, 2003 Copyright # 2003 Taylor & Francis 0883-9514/03 $12.00 +.00 DOI: 10.1080/08839510390219264 u DATA PREPARATION FOR DATA MINING SHICHAO ZHANG and CHENGQI

More information

Data Mining: Motivations and Concepts

Data Mining: Motivations and Concepts POLYTECHNIC UNIVERSITY Department of Computer Science / Finance and Risk Engineering Data Mining: Motivations and Concepts K. Ming Leung Abstract: We discuss here the need, the goals, and the primary tasks

More information
