An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset


Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang 2

1 School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai, P.R. China
2 Health Care Computing Group, School of Computer Science, University of Westminster, London, Northwick Park, HA1 3TP, UK
{eldarze, chounp, vasilc,

X. Li, S. Wang, and Z.Y. Dong (Eds.): ADMA 2005, LNAI 3584, Springer-Verlag Berlin Heidelberg 2005

Abstract. It is well accepted that many real-life datasets are full of missing data. In this paper we introduce, analyze and compare several well-known treatment methods for missing data handling and propose new methods based on the Naive Bayesian classifier to estimate and replace missing data. We conduct extensive experiments on datasets from UCI to compare these methods. Finally, we apply these models to a geriatric hospital dataset in order to assess their effectiveness on a real-life dataset.

1 Introduction

Data Mining (DM) is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories [1]. According to [2], about 20% of the effort is spent on problem and data understanding, about 60% on data preparation, and about 10% on data mining and analysis of knowledge. Why is more than half of the project effort spent on data preparation? Real-world datasets have many serious data quality problems. Problems often encountered include incomplete, redundant, inconsistent or noisy data [2]. If not addressed, these quality problems reduce the performance of data mining algorithms. Hence, in many cases a lot of effort is spent on the data preparation phase in order to achieve a good result.

Missing data is a common issue in many real-life datasets. Rates of less than 1% missing data are generally considered trivial, and 1-5% manageable. However, 5-15% requires sophisticated methods to handle, and more than 15% may severely impact any kind of interpretation [3].

This paper discusses and evaluates some treatment methods for missing data. The missing mechanism and the guidelines for treatment are presented in Section 2. Section 3 introduces some popular treatment methods for missing data and proposes a new model based on the Naive Bayesian Classifier and information gain. Experimental analysis and model comparison are described in Section 4. The proposed models are applied to a hospital dataset and the results are reported in Section 5. Conclusions and further work are discussed in Section 6.

2 Missing Mechanism and Guidelines for Treatment

The effect of a missing data treatment method depends mainly on the missing mechanism. According to [4], missing data can be classified into three types: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR). Some data mining approaches treat missing data with internal algorithms, for example C4.5. However, it is still important to construct complete datasets with treatment methods for missing data, for two reasons [5]. First, any data mining approach can be used with a complete dataset. Second, a complete dataset provides a common baseline for comparing data mining approaches.

In general, treatment methods for missing data can be divided into three common approaches [3]:

Ignore or Discard the Instances Which Contain Missing Data.

Parameter Estimation. In this approach, variants of the Expectation-Maximization algorithm are used in maximum likelihood procedures to estimate the parameters for missing data. These methods are normally superior to the ignore or discard methods. However, they are not widely applied, mainly because the assumed variable distributions cannot be easily derived and the calculations are highly complex [6].

Imputation Techniques. These use the information present in the dataset to estimate and replace missing data. They aim to recognize the relationships among the values in the dataset and to estimate missing data based on these relationships.

In general, missing data treatment methods should satisfy three rules. First, a missing data treatment method should not change the distribution of the dataset. Second, the relationships among the attributes should be retained. Third, the method must not be too complex or computationally costly to be applied in real life.
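The difference between the first two mechanisms can be illustrated with a small simulation. The sketch below is not from the paper: the helper names `implant_mcar` and `implant_mar` and the toy attributes `age_band` and `income` are hypothetical, and the code follows the usual reading of [4], where MCAR missingness is independent of all values and MAR missingness depends only on other, observed attributes. NMAR, where missingness depends on the unobserved value itself, is not simulated here.

```python
import random

def implant_mcar(rows, attr, rate, rng):
    """Blank `attr` in a fraction `rate` of rows chosen uniformly at random (MCAR)."""
    out = [dict(r) for r in rows]                 # copy so the input stays intact
    n_missing = round(rate * len(out))
    for i in rng.sample(range(len(out)), n_missing):
        out[i][attr] = None
    return out

def implant_mar(rows, attr, depends_on, trigger, rate, rng):
    """Blank `attr` only where the observed attribute `depends_on` equals `trigger` (MAR)."""
    out = [dict(r) for r in rows]
    for r in out:
        if r[depends_on] == trigger and rng.random() < rate:
            r[attr] = None
    return out

# Hypothetical toy data: two attributes, 100 records.
rng = random.Random(0)
rows = [{"age_band": "old" if i % 2 else "young", "income": i * 10} for i in range(100)]

mcar = implant_mcar(rows, "income", 0.2, rng)
mar = implant_mar(rows, "income", "age_band", "old", 0.4, rng)

print(sum(r["income"] is None for r in mcar), "values missing under MCAR")
print(sum(r["income"] is None for r in mar), "values missing under MAR, all in age_band == 'old'")
```

Under MAR the missing values are concentrated in one subgroup, so a treatment that ignores `age_band` would distort the distribution of `income`, which is the first of the three rules above.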
3 Missing Data Treatment Methods

In this section we introduce some popular missing data treatment methods and our proposed models, which are based on the concept of information gain and the Naive Bayesian Classifier.

3.1 Popular Treatment Methods

Case Deletion. This method discards the cases with missing data for at least one attribute. A variation of this method is to delete the instances or attributes which have a high missing rate. However, before deleting any attribute, it is necessary to run a relevance analysis.

Constant Replacement. The mean (for numeric data) or mode (for nominal data) is used to replace missing data. To reduce the influence of exceptional data, the median can also be used.

Internal Treatment Method for C4.5. This method uses a probabilistic approach to handle missing data [7].
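Case deletion and constant replacement are simple enough to sketch in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: records are plain dictionaries, a missing value is represented by `None`, and the function names and toy attributes are hypothetical.

```python
from statistics import mean, median, mode

def case_deletion(rows):
    """Keep only the records that have no missing value in any attribute."""
    return [r for r in rows if all(v is not None for v in r.values())]

def constant_replacement(rows, attr, kind="mean"):
    """Replace missing values of `attr` with the mean, median or mode of the observed ones."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[kind](observed)
    return [dict(r, **{attr: fill}) if r[attr] is None else dict(r) for r in rows]

# Hypothetical toy records with one numeric and one nominal attribute.
rows = [
    {"age": 70, "sex": "F"},
    {"age": None, "sex": "M"},
    {"age": 80, "sex": None},
    {"age": 90, "sex": "F"},
]

complete_cases = case_deletion(rows)                     # two records survive
age_filled = constant_replacement(rows, "age", "mean")   # numeric: mean of 70, 80, 90
sex_filled = constant_replacement(rows, "sex", "mode")   # nominal: most frequent value
```

Case deletion halves this toy dataset, which shows why it becomes impractical at high missing rates, while constant replacement keeps every record but assigns the same value to all gaps in an attribute.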

K-Nearest Neighbor Imputation. This method uses k-nearest neighbor (KNN) algorithms to replace missing data. Computational efficiency is a major problem for this method: to find the most similar instances, KNN has to search the whole dataset. In addition, the choice of the value k and of the similarity measure greatly influences the results.

3.2 Naive Bayesian Imputation (NBI)

The Naive Bayesian Classifier is a popular classifier, not only for its good performance but also for its simple form [6]. It is not sensitive to missing data. In this case, the prediction accuracy of the Naive Bayesian Classifier is always higher than that of C4.5 [1]. The missing data treatment method based on the Naive Bayesian Classifier and the concept of information gain, named Naive Bayesian Imputation (NBI), consists of two phases.

In phase 1, the order of the attributes to be treated is defined. The process treats attributes with missing data one by one, iteratively, and the order in which attributes are treated affects the overall results. Each attribute's importance for classification is different, so the performance of a random order of attributes is weak and unreliable. To overcome this deficiency of random ordering we propose several methods:

(1) Using the original data to estimate all the missing data. This is independent of the order in which attributes are estimated.
(2) Descending order of missing rate. Initially, missing data are replaced in the attribute with the highest missing rate; the modified dataset is then used to estimate and replace missing data in the next attribute.
(3) Descending order of information gain. For the classification task of data mining, information gain reflects the importance of the attributes for the classification task.
(4) Descending order of an index weighted by missing rate and information gain. This method combines the missing rate and information gain concepts. It proved to be effective.
(5) In experiments using methods 3 or 4, the performance of the algorithm is best once the first n (usually 3 or 4) attributes have been treated. Therefore, treating the first three or four attributes in that order is enough, and method 1 can be applied to the remaining attributes. In this way, quality and efficiency can be balanced.

In phase 2, the Naive Bayesian classifier is built to estimate and replace the missing data, using the attribute selected in the first phase as the class attribute and the whole dataset as the training subset. According to the different estimation orders of attributes, NBI has five variants, namely Model 1, Model 2, Model 3, Model 4, and Model 5.

4 Experimental Analysis

To compare the methods introduced in Section 3, we use three datasets from UCI [8], namely Nursery, Crx and German (Table 1).

Table 1. Datasets summary
Dataset   Inst.   Attr.   Class
Nursery
Crx
German
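The two phases of NBI described in Section 3.2 can be sketched for nominal attributes as follows. This is a minimal illustration rather than the authors' implementation. It follows the ordering by descending information gain (method 3), and it makes several assumptions that the paper does not specify: missing values are `None`, probabilities use Laplace smoothing, the class label is used as one of the predictors, and all function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Information gain of `attr` about `target`, over rows where `attr` is observed."""
    known = [r for r in rows if r[attr] is not None]
    groups = defaultdict(list)
    for r in known:
        groups[r[attr]].append(r[target])
    remainder = sum(len(g) / len(known) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in known]) - remainder

def nb_impute(rows, attr):
    """Phase 2: treat `attr` as the class and fill its gaps with Naive Bayes predictions."""
    train = [r for r in rows if r[attr] is not None]
    predictors = [a for a in rows[0] if a != attr]   # assumption: includes the class label
    prior = Counter(r[attr] for r in train)
    cond = defaultdict(Counter)   # (predictor, class value) -> counts of predictor values
    domain = defaultdict(set)     # predictor -> values seen in the training rows
    for r in train:
        for a in predictors:
            if r[a] is not None:
                cond[(a, r[attr])][r[a]] += 1
                domain[a].add(r[a])

    def predict(row):
        best, best_score = None, -math.inf
        for c, n_c in prior.items():
            score = math.log(n_c / len(train))
            for a in predictors:
                if row[a] is None or not domain[a]:
                    continue  # unobserved predictors are skipped
                counts = cond[(a, c)]
                # Laplace smoothing: an assumption, not stated in the paper
                score += math.log((counts[row[a]] + 1) / (sum(counts.values()) + len(domain[a])))
            if score > best_score:
                best, best_score = c, score
        return best

    return [dict(r, **{attr: predict(r)}) if r[attr] is None else dict(r) for r in rows]

def nbi_by_information_gain(rows, target):
    """Phase 1 (method 3): treat incomplete attributes in descending order of information gain."""
    data = [dict(r) for r in rows]
    incomplete = [a for a in data[0] if a != target and any(r[a] is None for r in data)]
    for attr in sorted(incomplete, key=lambda a: information_gain(data, a, target), reverse=True):
        data = nb_impute(data, attr)   # later attributes see the earlier imputations
    return data

# Hypothetical toy dataset with class attribute "play".
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
    {"outlook": None, "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": None, "play": "no"},
]
filled = nbi_by_information_gain(rows, "play")
```

Replacing the sort key with the missing rate of each attribute would give the ordering of method 2, and calling `nb_impute` on the original rows for every attribute would correspond to method 1.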

In order to evaluate the performance of the missing data treatment methods, a decision tree classifier is built on the modified dataset. If the performance of the classifier turns out to be satisfactory, the performance of the missing data treatment model is considered satisfactory. The measures used in this paper for the performance of a data mining algorithm are the prediction accuracy of the model, the prediction accuracy of a class, and the prediction profit (against the internal method), defined as follows:

\[ \text{Prediction accuracy of model} = \frac{\text{Number of correctly categorized instances}}{\text{Total number of instances}} \times 100\% \qquad (1) \]

\[ \text{Prediction accuracy of class} = \frac{\text{Number of correctly categorized instances in the class}}{\text{Total number of instances in the class}} \times 100\% \qquad (2) \]

\[ \text{Prediction profit} = \frac{\text{Prediction accuracy} - \text{Prediction accuracy of internal method}}{\text{Prediction accuracy of internal method}} \times 100\% \qquad (3) \]

The number of correctly categorized instances and the number of correctly categorized instances in the class are calculated using the modified dataset. The prediction accuracy of the internal method comes from the classifier that uses the dataset treated by the internal missing data treatment method.

The experiments are as follows. First, the datasets are randomly divided into two subsets: 66% of the records as a training subset and the remaining 33% as a testing subset. Missing data are artificially implanted at different rates, from 10% to 60% of records, into the training subsets only, in order to maintain the integrity of the testing subsets. A decision tree is used as the classifier in this paper, so it is desirable to analyze the most representative node attributes. Hence attributes with high information gain are selected to receive missing data. Finally, three methods are applied to the training subsets: the mean replacing method, the internal method, and Model 4 of NBI. The experiments are repeated three times for each method and the average error rate is calculated.
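Equations (1) to (3) translate directly into code. The sketch below is only an illustration: the function names and the toy label lists are hypothetical, and accuracies are expressed in percent as in the equations.

```python
def prediction_accuracy(actual, predicted):
    """Equation (1): share of correctly categorized instances, in percent."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)

def class_accuracy(actual, predicted, cls):
    """Equation (2): accuracy restricted to the instances whose true class is `cls`."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a == cls]
    return 100.0 * sum(a == p for a, p in pairs) / len(pairs)

def prediction_profit(accuracy, internal_accuracy):
    """Equation (3): relative gain over the internal method, in percent."""
    return 100.0 * (accuracy - internal_accuracy) / internal_accuracy

# Hypothetical labels for five test instances.
actual = ["short", "short", "medium", "long", "long"]
predicted = ["short", "medium", "medium", "long", "short"]

model_acc = prediction_accuracy(actual, predicted)    # 3 of 5 correct
long_acc = class_accuracy(actual, predicted, "long")  # 1 of 2 correct
profit = prediction_profit(model_acc, 50.0)           # against a hypothetical 50% baseline
```

Because the profit is a ratio, a class with a very low baseline accuracy can show a very large profit from a modest absolute improvement.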
Due to the lack of space, only part of the results is presented in Tables 2 and 3, and the comparative performances are graphed in Figures 1 and 2. From Figures 1 and 2 we can see that, in most cases, Model 4 of NBI is superior to the mean replacing method and the internal method.

Table 2. Error rates for dataset Nursery
Missing proportion   Att.: heal, par (%) (%) (%)   Att.: heal, par, fina (%) (%) (%)
0%    3.8±   ±
10%   4.2±   ±   ±   ±   ±   ±0.6
20%   4.7±   ±   ±   ±   ±   ±0.2
30%   5.7±   ±   ±   ±   ±   ±0.1
40%   7.4±   ±   ±   ±   ±   ±0.8
50%   10.8±  ±   ±   ±   ±   ±0.4
60%   11.0±  ±   ±   ±   ±   ±0.9

Table 3. Error rates for dataset Crx
Missing proportion   Att.: A5, A6 (%) (%) (%)   Att.: A5, A6, A2 (%) (%) (%)
0%    13.5±  ±
%     13.5±  ±   ±   ±   ±   ±0.3
21%   13.4±  ±   ±   ±   ±   ±1.1
35%   14.6±  ±   ±   ±   ±   ±0.9
50%   13.8±  ±   ±   ±   ±   ±1.1
60%   17.7±  ±   ±   ±   ±   ±1.2

For dataset Nursery, while the proportion of missing data is small, the performances of the three methods are similar. As the missing rate goes beyond 40%, the difference among the three methods becomes more obvious. The performance of the mean replacing method worsens as the missing rate increases. However, in the tree structure, the node attributes and their levels do not change markedly. Increases in the proportion of missing data do not influence the structure of the decision tree, because the information gain ratios of the attributes heal, par and fina are much larger than those of the other attributes. The performance of the internal method is very similar to the performance of Model 4.

Results for dataset Crx are illustrated in Figure 2. As the proportion of missing data increases, the classification error rates of both the mean replacing method and the internal method increase. However, the results for Model 4 remain stable and very close to the rate for the original dataset, which has no missing data. In this experiment, attributes A5, A6 and A2 were found to be strongly dependent on other attributes. To find the dependency relationships among attributes, each attribute in turn is predicted from the other attributes in the dataset. If the error rate of the resulting classifier is low, the attribute has a strong relationship with the other attributes. For Crx, the error rates for these three attributes are very low, about 25%. This makes the dataset completed by Model 4 very close to the original dataset without missing values. Therefore, the predictive error rates of Model 4 always fluctuate around that of the original dataset.
We also find that an increase in the proportion of missing data does not affect the predictive error rates, but an increase in the number of attributes with missing data does influence the performance of the missing data treatment methods. For dataset German, the C4.5 internal method performed well when missing data were inserted into one attribute. When missing data were inserted into three attributes, the performances of the three methods were similar to each other. For both the Crx and German datasets, the nodes of the decision tree changed as the proportion of missing data increased. Attributes at high levels of the tree moved to lower levels or were no longer selected as nodes. Using attributes with weak classifying power reduces the performance of the decision tree [3]. After the structure of the decision tree has changed, the performance of Model 4 is better than that of the internal method.

Fig. 1. Comparative results for Nursery (panels: missing data on attribute heal; on attributes heal & par; on attributes heal, par & fina)

Fig. 2. Comparative results for Crx (panels: missing data on attribute A5; on attributes A5 & A6; on attributes A5, A6 & A2)

5 Application in Healthcare

Data mining approaches are widely used in the healthcare domain. If the inpatient length of stay (LOS) can be predicted efficiently, the planning of hospital resources can be greatly enhanced [9]. However, most healthcare datasets contain a lot of missing data. The treatment methods for missing data discussed earlier in this paper are applied to a real-life dataset to improve the accuracy of predictive models of LOS.

5.1 Clinics Dataset

The Clinics dataset contains data from a clinical computer system that was in use between 1994 and 1997 for the management of patients in a Geriatric Medicine department of a metropolitan teaching hospital in the UK [10]. It contains 4722 patient records including patient demographic details, admission reasons and LOS. For ease of analysis, LOS was categorized into three groups: short-stay group (0-14 days), medium-stay group (15-60 days) and long-term group (61+ days) (variable LOS GROUP). These boundaries are chosen in agreement with clinical judgment to help describe the stages of care in such a hospital department.

Missing data account for a large share of the Clinics dataset. There are 3017 instances (63.89%) that contain missing data. The proportion of instances with missing data per LOS GROUP is 63.29%, 61.81% and 74.86%, respectively. There are 8 attributes with missing data.

5.2 Practice and Analysis

After applying the five models based on the Naive Bayesian Classifier proposed in this paper to the Clinics dataset, we obtained the prediction accuracies and prediction profits shown in Table 4.

Table 4. Prediction accuracy of all models and classes, and prediction profits against C4.5
Model | Accuracy of model | Accuracy of class: Short | Medium | Long | Prediction profit of model | Prediction profit of class: Short | Medium | Long
Internal method (C4.5) | 52.44% | 44% | 69% | 10% | | | |
Model | | 46% | 69% | 29% | 6.10% | 4.5% | 0.0% | 190.0%
Model | | 46% | 68% | 30% | 5.30% | 4.5% | -1.4% | 200.0%
Model | | 48% | 68% | 31% | 6.50% | 9.1% | -1.4% | 210.0%
Model | | 47% | 69% | 31% | 6.50% | 6.8% | 0.0% | 210.0%
Model | | 45% | 70% | 30% | 5.90% | 2.3% | 1.5% | 200.0%

From Table 4, we can see that the average prediction accuracy of NBI is higher than that of the internal model, especially for the long category. In the Clinics dataset, 6 attributes with missing data were treated. The missing proportion is about 20% for two of them and above 40% for one. In this case, NBI outperforms the internal method. Among the five models, Model 4 and Model 5 performed better than the others. Furthermore, for Models 3, 4 and 5, treating three or four attributes was enough to obtain a good result. NBI can improve the prediction accuracy of the whole model, especially for the long category, where the highest prediction profit reaches 210%. When the proportion of missing data is large, many attributes have missing data, and the attributes are strongly related to each other, treatment methods for missing data based on the Naive Bayesian Classifier perform well.
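As a check on Equation (3), the class-level profits in Table 4 can be recomputed from the class accuracies. The worked example below assumes that the first data row of the table (52.44%, 44%, 69%, 10%), whose label was lost, is the internal-method baseline, and it uses the first NBI row (46%, 69%, 29%):

```latex
\[
\text{Profit}_{\text{long}} = \frac{29\% - 10\%}{10\%} \times 100\% = 190\%,
\qquad
\text{Profit}_{\text{short}} = \frac{46\% - 44\%}{44\%} \times 100\% \approx 4.5\%
\]
```

Both values match the table. The very large profit for the long category therefore reflects a low baseline accuracy (10%) as much as the absolute improvement of 19 percentage points.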
6 Conclusions

This paper presents a comparative analysis of several well-known missing data treatments and proposes an efficient and effective missing data predictive model, NBI. These methods were tested on the Nursery, Crx and German datasets from UCI and on the Clinics dataset from a geriatric department. NBI performs better than the internal model. The type of the attributes with missing data affects the results of the treatment methods. When the attributes that are important for classification contain little or no missing data, the internal model performs very well. However, treatment methods based on the Naive Bayesian Classifier are more commonly used. Comparatively, when the proportion of missing data is high and many attributes have missing data, NBI performs more satisfactorily.

References

1. Han J. and Kamber M., Data Mining Concepts and Techniques, Morgan Kaufmann Publishers.
2. Cios K.J. and Kurgan L., Trends in Data Mining and Knowledge Discovery. In N.R. Pal, L.C. Jain, and Teoderesku N., editors, Knowledge Discovery in Advanced Information Systems. Springer.
3. Acuna E. and Rodriguez C., The treatment of missing values and its effect in the classifier accuracy. In D. Banks, L. House, F.R. McMorris, P. Arabie, W. Gaul (Eds.), Classification, Clustering and Data Mining Applications. Springer-Verlag Berlin-Heidelberg (2004).
4. Little R.J. and Rubin D.B., Statistical Analysis with Missing Data. Second Edition. John Wiley and Sons, New York (2002).
5. Magnani M., Techniques for Dealing with Missing Data in Knowledge Discovery Tasks, accessed on Aug. 28.
6. Hand D., Mannila H., and Smyth P., Principles of Data Mining. MIT Press.
7. Quinlan J.R., C4.5: Programs for Machine Learning. Morgan Kaufmann, CA.
8. Merz C.J. and Murphy P.M., UCI Repository of Machine Learning Datasets, mlearn/mlrepository.html.
9. Marshall A., Vasilakis C. and El-Darzi E., Modelling Hospital Patient Flow: Recent Developments and Future Directions. (accepted) Health Care Management Science.
10. Marshall A., McClean S., Shapcott C., Hastie I. and Millard P., Developing a Bayesian Belief Network for the Management of Geriatric Hospital Care. Health Care Management Science, 4(1), pp. 25-30, 2001.


DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

More information

Three Perspectives of Data Mining

Three Perspectives of Data Mining Three Perspectives of Data Mining Zhi-Hua Zhou * National Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China Abstract This paper reviews three recent books on data mining

More information

DATA MINING, DIRTY DATA, AND COSTS (Research-in-Progress)

DATA MINING, DIRTY DATA, AND COSTS (Research-in-Progress) DATA MINING, DIRTY DATA, AND COSTS (Research-in-Progress) Leo Pipino University of Massachusetts Lowell Leo_Pipino@UML.edu David Kopcso Babson College Kopcso@Babson.edu Abstract: A series of simulations

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

Missing Data Dr Eleni Matechou

Missing Data Dr Eleni Matechou 1 Statistical Methods Principles Missing Data Dr Eleni Matechou matechou@stats.ox.ac.uk References: R.J.A. Little and D.B. Rubin 2nd edition Statistical Analysis with Missing Data J.L. Schafer and J.W.

More information

On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

More information

Impact of Boolean factorization as preprocessing methods for classification of Boolean data

Impact of Boolean factorization as preprocessing methods for classification of Boolean data Impact of Boolean factorization as preprocessing methods for classification of Boolean data Radim Belohlavek, Jan Outrata, Martin Trnecka Data Analysis and Modeling Lab (DAMOL) Dept. Computer Science,

More information

Data Mining based on Rough Set and Decision Tree Optimization

Data Mining based on Rough Set and Decision Tree Optimization Data Mining based on Rough Set and Decision Tree Optimization College of Information Engineering, North China University of Water Resources and Electric Power, China, haiyan@ncwu.edu.cn Abstract This paper

More information

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

More information

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent

More information

MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS)

MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS) MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS) R.KAVITHA KUMAR Department of Computer Science and Engineering Pondicherry Engineering College, Pudhucherry, India DR. R.M.CHADRASEKAR Professor,

More information

New Ensemble Combination Scheme

New Ensemble Combination Scheme New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas However,

More information

College information system research based on data mining

College information system research based on data mining 2009 International Conference on Machine Learning and Computing IPCSIT vol.3 (2011) (2011) IACSIT Press, Singapore College information system research based on data mining An-yi Lan 1, Jie Li 2 1 Hebei

More information

Enhanced Boosted Trees Technique for Customer Churn Prediction Model

Enhanced Boosted Trees Technique for Customer Churn Prediction Model IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V5 PP 41-45 www.iosrjen.org Enhanced Boosted Trees Technique for Customer Churn Prediction

More information

A Lightweight Solution to the Educational Data Mining Challenge

A Lightweight Solution to the Educational Data Mining Challenge A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China catch0327@yahoo.com yanxing@gdut.edu.cn

More information

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Erkan Er Abstract In this paper, a model for predicting students performance levels is proposed which employs three

More information

ClusterOSS: a new undersampling method for imbalanced learning

ClusterOSS: a new undersampling method for imbalanced learning 1 ClusterOSS: a new undersampling method for imbalanced learning Victor H Barella, Eduardo P Costa, and André C P L F Carvalho, Abstract A dataset is said to be imbalanced when its classes are disproportionately

More information

Explanation-Oriented Association Mining Using a Combination of Unsupervised and Supervised Learning Algorithms

Explanation-Oriented Association Mining Using a Combination of Unsupervised and Supervised Learning Algorithms Explanation-Oriented Association Mining Using a Combination of Unsupervised and Supervised Learning Algorithms Y.Y. Yao, Y. Zhao, R.B. Maguire Department of Computer Science, University of Regina Regina,

More information

Towards applying Data Mining Techniques for Talent Mangement

Towards applying Data Mining Techniques for Talent Mangement 2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Towards applying Data Mining Techniques for Talent Mangement Hamidah Jantan 1,

More information

ANALYSIS OF VARIOUS CLUSTERING ALGORITHMS OF DATA MINING ON HEALTH INFORMATICS

ANALYSIS OF VARIOUS CLUSTERING ALGORITHMS OF DATA MINING ON HEALTH INFORMATICS ANALYSIS OF VARIOUS CLUSTERING ALGORITHMS OF DATA MINING ON HEALTH INFORMATICS 1 PANKAJ SAXENA & 2 SUSHMA LEHRI 1 Deptt. Of Computer Applications, RBS Management Techanical Campus, Agra 2 Institute of

More information

Improved Fuzzy C-means Clustering Algorithm Based on Cluster Density

Improved Fuzzy C-means Clustering Algorithm Based on Cluster Density Journal of Computational Information Systems 8: 2 (2012) 727 737 Available at http://www.jofcis.com Improved Fuzzy C-means Clustering Algorithm Based on Cluster Density Xiaojun LOU, Junying LI, Haitao

More information

PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS

PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS PREDICTING STUDENTS PERFORMANCE USING ID3 AND C4.5 CLASSIFICATION ALGORITHMS Kalpesh Adhatrao, Aditya Gaykar, Amiraj Dhawan, Rohit Jha and Vipul Honrao ABSTRACT Department of Computer Engineering, Fr.

More information

Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information

Introduction to Data Mining Techniques

Introduction to Data Mining Techniques Introduction to Data Mining Techniques Dr. Rajni Jain 1 Introduction The last decade has experienced a revolution in information availability and exchange via the internet. In the same spirit, more and

More information

Application of Data Mining Methods in Health Care Databases

Application of Data Mining Methods in Health Care Databases 6 th International Conference on Applied Informatics Eger, Hungary, January 27 31, 2004. Application of Data Mining Methods in Health Care Databases Ágnes Vathy-Fogarassy Department of Mathematics and

More information

Addressing the Class Imbalance Problem in Medical Datasets

Addressing the Class Imbalance Problem in Medical Datasets Addressing the Class Imbalance Problem in Medical Datasets M. Mostafizur Rahman and D. N. Davis the size of the training set is significantly increased [5]. If the time taken to resample is not considered,

More information

Overview Applications of Data Mining In Health Care: The Case Study of Arusha Region

Overview Applications of Data Mining In Health Care: The Case Study of Arusha Region International Journal of Computational Engineering Research Vol, 03 Issue, 8 Overview Applications of Data Mining In Health Care: The Case Study of Arusha Region 1, Salim Diwani, 2, Suzan Mishol, 3, Daniel

More information

Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble

Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble Dr. Hongwei Patrick Yang Educational Policy Studies & Evaluation College of Education University of Kentucky

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

Predicting Student Performance by Using Data Mining Methods for Classification

Predicting Student Performance by Using Data Mining Methods for Classification BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, No 1 Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0006 Predicting Student Performance

More information

Data Mining for Knowledge Management in Technology Enhanced Learning

Data Mining for Knowledge Management in Technology Enhanced Learning Proceedings of the 6th WSEAS International Conference on Applications of Electrical Engineering, Istanbul, Turkey, May 27-29, 2007 115 Data Mining for Knowledge Management in Technology Enhanced Learning

More information

Evaluation of Feature Selection Methods for Predictive Modeling Using Neural Networks in Credits Scoring

Evaluation of Feature Selection Methods for Predictive Modeling Using Neural Networks in Credits Scoring 714 Evaluation of Feature election Methods for Predictive Modeling Using Neural Networks in Credits coring Raghavendra B. K. Dr. M.G.R. Educational and Research Institute, Chennai-95 Email: raghavendra_bk@rediffmail.com

More information

Comparison of Classification Techniques for Heart Health Analysis System

Comparison of Classification Techniques for Heart Health Analysis System International Journal of Computer Sciences and Engineering Open Access Review Paper Volume-04, Issue-02 E-ISSN: 2347-2693 Comparison of Classification Techniques for Heart Health Analysis System Karthika

More information

Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

More information

Visualization of large data sets using MDS combined with LVQ.

Visualization of large data sets using MDS combined with LVQ. Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland. www.phys.uni.torun.pl/kmk

More information

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors Classification k-nearest neighbors Data Mining Dr. Engin YILDIZTEPE Reference Books Han, J., Kamber, M., Pei, J., (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann

More information

First Semester Computer Science Students Academic Performances Analysis by Using Data Mining Classification Algorithms

First Semester Computer Science Students Academic Performances Analysis by Using Data Mining Classification Algorithms First Semester Computer Science Students Academic Performances Analysis by Using Data Mining Classification Algorithms Azwa Abdul Aziz, Nor Hafieza IsmailandFadhilah Ahmad Faculty Informatics & Computing

More information

Effective Analysis and Predictive Model of Stroke Disease using Classification Methods

Effective Analysis and Predictive Model of Stroke Disease using Classification Methods Effective Analysis and Predictive Model of Stroke Disease using Classification Methods A.Sudha Student, M.Tech (CSE) VIT University Vellore, India P.Gayathri Assistant Professor VIT University Vellore,

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 12, December 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

DATA ANALYSIS USING BUSINESS INTELLIGENCE TOOL. A Thesis. Presented to the. Faculty of. San Diego State University. In Partial Fulfillment

DATA ANALYSIS USING BUSINESS INTELLIGENCE TOOL. A Thesis. Presented to the. Faculty of. San Diego State University. In Partial Fulfillment DATA ANALYSIS USING BUSINESS INTELLIGENCE TOOL A Thesis Presented to the Faculty of San Diego State University In Partial Fulfillment of the Requirements for the Degree Master of Science in Computer Science

More information

The Optimality of Naive Bayes

The Optimality of Naive Bayes The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most

More information

A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery

A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery Runu Rathi, Diane J. Cook, Lawrence B. Holder Department of Computer Science and Engineering The University of Texas at Arlington

More information

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and

More information

Rule based Classification of BSE Stock Data with Data Mining

Rule based Classification of BSE Stock Data with Data Mining International Journal of Information Sciences and Application. ISSN 0974-2255 Volume 4, Number 1 (2012), pp. 1-9 International Research Publication House http://www.irphouse.com Rule based Classification

More information

Dynamic Data in terms of Data Mining Streams

Dynamic Data in terms of Data Mining Streams International Journal of Computer Science and Software Engineering Volume 2, Number 1 (2015), pp. 1-6 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

Enhancing Quality of Data using Data Mining Method

Enhancing Quality of Data using Data Mining Method JOURNAL OF COMPUTING, VOLUME 2, ISSUE 9, SEPTEMBER 2, ISSN 25-967 WWW.JOURNALOFCOMPUTING.ORG 9 Enhancing Quality of Data using Data Mining Method Fatemeh Ghorbanpour A., Mir M. Pedram, Kambiz Badie, Mohammad

More information

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Sagarika Prusty Web Data Mining (ECT 584),Spring 2013 DePaul University,Chicago sagarikaprusty@gmail.com Keywords:

More information

Reducing multiclass to binary by coupling probability estimates

Reducing multiclass to binary by coupling probability estimates Reducing multiclass to inary y coupling proaility estimates Bianca Zadrozny Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92093-0114 zadrozny@cs.ucsd.edu

More information

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Predicting Students Final GPA Using Decision Trees: A Case Study

Predicting Students Final GPA Using Decision Trees: A Case Study Predicting Students Final GPA Using Decision Trees: A Case Study Mashael A. Al-Barrak and Muna Al-Razgan Abstract Educational data mining is the process of applying data mining tools and techniques to

More information

Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ

Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ David Cieslak and Nitesh Chawla University of Notre Dame, Notre Dame IN 46556, USA {dcieslak,nchawla}@cse.nd.edu

More information

Data Mining Approach For Subscription-Fraud. Detection in Telecommunication Sector

Data Mining Approach For Subscription-Fraud. Detection in Telecommunication Sector Contemporary Engineering Sciences, Vol. 7, 2014, no. 11, 515-522 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ces.2014.4431 Data Mining Approach For Subscription-Fraud Detection in Telecommunication

More information

Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data

Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data Fifth International Workshop on Computational Intelligence & Applications IEEE SMC Hiroshima Chapter, Hiroshima University, Japan, November 10, 11 & 12, 2009 Extension of Decision Tree Algorithm for Stream

More information