Imputation of Missing Data: A Semi-Supervised Clustering Methodology


Ilango Paramasivam 1, Hemalatha Thiagarajan 2, Poonkuntran Shanmugam 3, Nickolas Savarimuthu 4

1 PhD Research Scholar, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, India; Assistant Professor, School of Computing Sciences, VIT University, Vellore, India.
2 Professor, Department of Mathematics, National Institute of Technology, Tiruchirappalli, India.
3 Lecturer, School of Computing Sciences, VIT University, Vellore, India.
4 Assistant Professor, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, India.
{ilangosarojini@gmail.com} {hema@nitt.edu} {s_poonkuntran@yahoo.co.in} {nickolas@nitt.edu}

Abstract: Data mining is being applied with success in different fields of human endeavor, including marketing, customer relationship management and healthcare. Real-world datasets are invariably accompanied by missing data, a major factor affecting data quality. Missing data has been a pervasive problem in data analysis since the origin of data collection. In the knowledge discovery process, missing data introduces bias into model evaluation and leads to inaccurate mining results. The objective of this research is to propose a semi-supervised clustering methodology for the imputation of missing data in databases. For this purpose, missing data are simulated on the complete Pima Indians Type II Diabetes dataset in order to evaluate the performance of the proposed algorithm, and the performance is compared with that of other existing imputation methods. The comparative analysis shows that the proposed method produces stable results and exhibits less variance in the error rate over different percentages of missing data than the other methods.

Keywords: Data mining, Missing data, Imputation methods, Semi-supervised clustering, Average Imputation Error.

I. INTRODUCTION

The revolution in storage technologies has made the storage of huge amounts of data in databases feasible. Specialized tools for access, analysis, knowledge discovery and effective use of the stored data are required to handle such extensive collections. The relatively young and growing fields of data mining and knowledge discovery generate valuable information for business processes dealing with such data. A data mining process typically consists of six steps: understanding the problem domain, understanding the data, data preparation, data mining, evaluation of the discovered knowledge and, finally, using the knowledge thus obtained (Hair et al. 1998). Real-world datasets suffer from questionable quality owing to incomplete, redundant and inaccurate data. These problems, if not addressed, will certainly affect the performance of data mining processes and the validity of the discovered patterns or knowledge, leading to erroneous predictions.

Hence, a lot of effort and time are spent on pre-processing the data. It is estimated that about 20 per cent of the effort goes into determining the business objectives, about 60 per cent into data preparation and about 10 per cent each into data mining and into the analysis and assimilation of the discovered knowledge (Cabena et al. 1998). The application of efficient and sound data pre-processing procedures can improve the quality of the data without losing any critical information. The better the quality of the data, the higher is the quality of the knowledge extracted from it. Widely used data pre-processing techniques that have proved useful in practice include data cleaning, integration and transformation (Han and Kamber 2006). In addition, feature selection, feature extraction, model formulation and discretization are widely applied (Kantardzic 2003) to extract highly informative features of the database relevant to the application.

Data cleaning, an important preliminary step in data mining, seeks to improve the quality of the data, making it more reliable for an effective mining process. Data cleaning algorithms attempt to smooth noise in the data, identify and eliminate inconsistencies, and remove missing values or replace them with values imputed from the rest of the data. Handling missing data is one of the major tasks in data cleaning. Datasets are frequently incomplete: incompleteness results from malfunctioning equipment, inadvertent deletion of data instances (or records), inconsistency with other recorded data, data being considered unimportant at the time of entry, data excluded owing to misunderstanding, and refusal to respond to queries. The absence of complete data then hampers decision-making, because decisions depend on complete information (Stefanakos and Athanassoulis 2001; Marwala et al. 2006). Generally, the presence of 1% missing data in a database is considered trivial and 1% to 5% is manageable; however, sophisticated methods and tools are required to handle 5% to 15% missing data.

Many methods for handling missing data are described in the literature (Han and Kamber 2006). They fall into one of two categories: i) use of only the complete data, by deleting the incomplete instances, and ii) imputation of the missing data. The strategy of deleting incomplete instances from the database can be used when the amount of missing data is small, but it leads to significant information loss as the amount of missing data increases (Allison 2001; Little and Rubin 2002). Ignoring or deleting the cases containing missing data and concentrating only on the sample units for which complete data are available (complete case analysis) might seem reasonable at first sight, and indeed it might appear that there is no other option. In doing so, however, analysts often throw away a large part of their data, especially when a data set contains many variables and whole records are deleted because only one or two variables were not measured.

Hence, the knowledge derived by data mining algorithms from the database after the deletion of incomplete instances will be biased. In response to these issues, significant effort has been devoted to developing and evaluating various imputation methodologies. Imputation is a method of filling in missing values by attributing to them values derived from the other available data; it is defined as the process of estimating the missing data of an observation based on valid values of other variables (Hair et al. 1998). Imputation minimizes bias in the mining process and preserves expensive-to-collect data that would otherwise be discarded (Marvin et al. 2003). It is important that the estimates for the missing values be accurate, as even a small number of biased estimates may lead to inaccurate and misleading results in the mining process.

This paper proposes a semi-supervised clustering methodology to impute the missing values in a database. In order to combine the benefits of supervised and unsupervised learning methods, semi-supervised clustering (Watanabe 1985) is used to guide the clustering process towards the effective imputation of missing data in the incomplete records. The incomplete records are designated as centre points for generating clusters, and the missing attribute value(s) of each centre record is/are then imputed with the mean value of the respective attribute(s) over the complete records in the corresponding cluster. The performance of the proposed method is evaluated by estimating the Average Imputation Error (Mariso 2005), which measures the difference between the imputed value and the actual value. To ensure the reliability of the evaluation, the experiments are performed on the complete Pima Indians Type II Diabetes dataset. Since the complete dataset was used for the experimental study, different percentages of missing data were generated by randomly labeling feature values as missing, and twenty simulations were conducted to overcome the bias due to this artificial, random introduction of missing values. The random introduction of missing data and the multiple simulations provide an unbiased platform for evaluating the efficacy of the imputation process. For comparison, five popular imputation methods are investigated in this work: k-Nearest Neighbours (k-NN), mean-based imputation, two correlation-based methods known as LSImpute_Rows and EMImpute_Columns (Trond et al. 2004), and a multiple imputation (MI) method referred to as NORM (Schafer 1999), which is based on the expectation maximization (EM) algorithm. The performance of the proposed method is compared with these imputation methods in terms of their average imputation errors for different percentages of missing data.

A review of the literature on different methods of handling missing data in various applications is presented in Section II. Section III describes the semi-supervised clustering methodology for the imputation of missing data and the experimental set-up for evaluating its performance. Section IV presents the performance analysis and Section V concludes the research work.

II. LITERATURE REVIEW

The data mining process deals primarily with prediction, estimation, classification, pattern recognition and the development of association rules. The reliability of the outcome of the data mining process depends heavily on the quality of the data and on the sample data chosen for model training and testing (Brown and Kros 2003). The databases in real-life applications are usually large, so the problems of incomplete, inaccurate and inconsistent data are inevitable (Laurance 2006). The presence of missing data in a database negatively impacts the performance of the data mining process (Hair et al. 1998). Missing values may appear either in the conditional attributes or in the class (target) attribute of the dataset. Numerous case studies on the imputation of missing data are found in the literature (Barnard and Meng 1999; Van et al. 1999). More importantly, research (Batista and Monard 2003) indicates that a meaningful treatment of missing data should always be independent of the problem being investigated.

Many approaches to dealing with missing values have been described (Han and Kamber 2006), for instance: ignore objects containing missing values; fill in the missing values manually based on the dependent value; or substitute the missing values with a global constant or the mean of the objects. Though the first approach is simple, it may result in the loss of too much useful data, whereas the second is time-consuming and leads to logical inconsistencies, so these approaches are not feasible for many applications. The third approach assumes that all the missing data have the same value, which would probably lead to considerable distortion of the data distribution. Moreover, these methods can be used only when the percentage of missing data is below 5%. Another common technique is to create a new data value (a special label such as "missing") and use it to represent missing entries. However, this has the unfortunate side effect that data mining algorithms may try to use "missing" as a legal value, which is inappropriate; it also sometimes artificially inflates the accuracy of some data mining algorithms on some datasets (Friedman et al. 1996). The best solution for handling incomplete data is to find the most probable value with which to fill each missing value. Imputation is therefore a popular strategy, and it uses as much information as possible from the observed data to predict the missing values (Zhang 2005).

Imputation techniques range from fairly simple ideas, such as using the mean or mode of the attribute as a replacement for a missing value (Little and Rubin 1986; Clark and Niblett 1989), to more sophisticated ones that use statistical models such as regression (Hu et al. 1998) or machine learning approaches such as Bayesian networks (Beaumont 2000) and decision-tree induction (Coppola et al. 2000). Using the mean or mode is generally considered a poor choice, because it distorts other statistical properties of the data, such as the variance, and does not take dependencies between attributes into account. Hot-deck imputation (Lakshminarayan et al. 1999) fills in missing data using values from other rows of the database that are similar to the row with the missing data; its performance therefore depends entirely on correctly identifying those similar rows.

Traditionally, techniques for the imputation of missing values can be roughly classified into parametric imputation (e.g. linear regression) and non-parametric imputation (e.g. non-parametric kernel-based regression methods and the Nearest Neighbours (NN) method). Parametric regression imputation is superior if a dataset can be adequately modeled parametrically, or if users can correctly specify the parametric forms for the dataset. For instance, linear regression methods are effective when the continuous target attribute is a linear combination of the conditional attributes. However, when the actual relation between the conditional attributes and the target attribute is not known, the performance of linear regression for imputing missing values is very poor. In real applications, if the model is ill-specified owing to a lack of knowledge about the distribution of the real dataset, the estimates of a parametric method may be highly biased and the optimal control factor settings may be miscalculated. Non-parametric imputation can provide a superior fit by capturing the structure of the dataset and offers a good alternative when users have no idea about the actual distribution of the data. For example, the NN method is regarded as one of the non-parametric techniques used to compensate for missing values in sample surveys (Chen and Shao 2001) where the nature and distribution of the data are not known. Using a non-parametric algorithm is beneficial when the form of the relationship between the conditional attributes and the target attribute is not known a priori (Lall and Sharma 1996).

In the K Nearest Neighbours (KNN) method of estimating missing data, a distance is assigned between all pairs of points in a dataset, defined as the Euclidean distance between the two points. From these distances, a distance matrix is constructed over all possible pairings of points (x, y). Each data point in the dataset has a class label from the set C = {C1, ..., Cn}. A data point's K closest neighbours are found by analyzing the distance matrix, and these K points are then examined to determine which class label is the most common among them; that label is assigned to the data point under consideration. The disadvantage is that if two or more class labels occur an equal number of times among a data point's K closest neighbours, the KNN test is inconclusive.
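To make this concrete, the sketch below shows k-nearest-neighbour imputation of a single missing numeric attribute. It is a minimal illustration of our own in Python/NumPy; the function name knn_impute and the variable names are ours, not taken from the papers cited above, and NumPy arrays are assumed as input.

import numpy as np

def knn_impute(incomplete_row, complete_rows, missing_idx, k=10):
    # Attributes that are observed in the incomplete record
    observed = [j for j in range(len(incomplete_row)) if j != missing_idx]
    # Euclidean distance from the incomplete record to every complete record,
    # computed over the observed attributes only
    diffs = complete_rows[:, observed] - incomplete_row[observed]
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    # The k closest complete records supply the replacement value (their mean)
    nearest = np.argsort(dists)[:k]
    return complete_rows[nearest, missing_idx].mean()

For a categorical attribute the mean would be replaced by a majority vote over the k neighbours, which is where the tie-break problem mentioned above arises.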

A general approach to handling missing data is to design data mining algorithms that internally handle missing data and still produce good results. For example, the CART decision-tree learning algorithm (Breiman et al. 1984) handles missing data internally, essentially using an implicit form of imputation based on regression. Regression imputation (Beaumont 2000) imputes missing data with predicted values derived from a regression equation built on the variables in the dataset that contain no missing data. Regression, however, assumes a specific relationship between attributes that may not hold for all datasets.

The widely used missing data imputation techniques KNN, NORM, EMImpute_Columns, LSImpute_Rows and Mean Imputation are investigated for comparative analysis in this paper. KNN imputes missing data by analyzing the K closest data points and identifying the most common value among them as the replacement for the missing data. NORM implements missing value estimation based on the expectation maximization algorithm (Schafer 1999). EMImpute_Columns and LSImpute_Rows are feature-based methods: in LSImpute_Rows, the least squares principle, which is based on minimizing the sum of squared errors of a regression model, is used to estimate missing data from the correlations between the reference record and other records, while EMImpute_Columns performs imputation using the relevant columns of the records. The Mean Imputation method fills the missing value with the arithmetic mean of the respective attribute over the dataset; it may unduly affect the other statistical properties of the dataset. In this research work, the performance of the proposed method is compared with those of KNN, NORM, EMImpute_Columns, LSImpute_Rows and Mean Imputation, all evaluated on the same dataset.

III. SEMI-SUPERVISED CLUSTERING METHODOLOGY

In this paper, a semi-supervised clustering methodology for imputing missing values in a database is proposed. Clustering is an unsupervised data mining technique for discovering patterns in a database: it is the process of dividing a set of objects into previously unknown groups such that objects within a group are highly similar to each other and dissimilar to objects in any other group. In order to combine the benefits of supervised and unsupervised learning methods, semi-supervised clustering (Watanabe 1985) has been proposed, which incorporates knowledge specific to the problem under analysis to guide the clustering process. The incomplete record(s) is/are assigned as centre point(s) for generating clusters. This facilitates the detection of the instances that should be placed in the same cluster and of those that should be separated into different clusters (Han and Kamber 2006).

The methodology is referred to as semi-supervised because clusters are generated for every incomplete record by assigning it as the centre or seed point. As instances in the same cluster are similar to each other, they share certain properties. The value(s) of the missing attribute(s) in the centre point record of each cluster is/are imputed by computing the mean value of the respective attribute(s) of the records in the cluster.

Algorithm SESU_CLUST_IMPUTE ()
// D: Dataset, M: Number of records, N: Number of attributes,
// B: Block(s) of records with missing data, Attr: Attributes,
// C: Clusters, R: Record
Do IMPUTE (D)
{
    For i = 1 to N do
    {
        // formulate b_i as the set of records R_j whose i-th attribute is missing
        b_i = Ø;
        For j = 1 to M do
        {
            If Value(Attr_i(R_j)) = ? then b_i = b_i U {R_j};
        }
    }
    For i = 1 to N
    {
        CLUSTER (D, C, b_i);
        IMPUTE (C, b_i);
    }
}

Testing dataset

The database is searched to identify and group the records with the same set of missing attribute(s) into blocks. A maximum of (2^n - 2) blocks can be generated, where n is the total number of attributes in the dataset. In our experiments only one attribute at a time was considered missing; this was repeated for each attribute, resulting in eight experiments with 20 replications per attribute. The records with a single missing attribute are collectively treated as the testing dataset, and the incomplete records in these blocks are passed as centre points to the clustering process. The remaining complete records form the training dataset.
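A minimal sketch of this block-formation step is given below. It is our own Python illustration (the name form_blocks is hypothetical, not from the paper) and assumes the dataset is a NumPy array in which missing entries are encoded as NaN.

import numpy as np
from collections import defaultdict

def form_blocks(data):
    # Group record indices by their pattern of missing attributes; records
    # with no missing values make up the training dataset, the rest form blocks.
    blocks = defaultdict(list)
    training = []
    for idx, row in enumerate(data):
        pattern = tuple(np.where(np.isnan(row))[0])
        if pattern:                     # at most 2**n - 2 non-empty patterns
            blocks[pattern].append(idx)
        else:
            training.append(idx)
    return blocks, training

Each incomplete record in a block then serves as the seed (centre point) of its own cluster, as described below.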

Weight Generation and Normalization

Weight is used as the similarity measure for forming clusters. The weight counts the number of similar attributes between an incomplete record of the testing dataset and a complete record of the training dataset. The similarity of an attribute is measured using a function based on the dispersion of that attribute in the complete dataset, i.e. its standard deviation; the standard deviation (σ) is the statistical parameter that describes the spread of the values in a dataset (Ward 2004). An attribute in a complete record of the training dataset is considered similar to the corresponding attribute in the incomplete record of the testing dataset if and only if the attribute value of the complete record lies in the range (attribute value of the incomplete record ± 1σ), where σ is the standard deviation of the attribute under consideration. In our work, the search for similar records is carried out attribute by attribute with the incomplete record as the centre point, so an attribute of a training record is deemed similar to the corresponding attribute of the incomplete record only if it lies within this spread around the incomplete record's value. The weights are then assigned using equation (1):

Weight of record R in the training dataset:  W(R) = Σ_i W(R_i)   (1)

where W(R_i) = 1 if α - σ ≤ β ≤ α + σ and zero otherwise. The summation is taken over the non-missing attributes of the record in the testing dataset; α is the attribute value of the incomplete record, β is the attribute value of the complete record in the training dataset and σ is the standard deviation of the respective attribute.

A normalization step is carried out on the weights obtained above, so as to generalize the similarity measurement between records to any dataset, irrespective of the number of attributes. The weights are normalized using max-difference normalization: the weight defined in equation (1) is divided by (n - missing(a)), as given in equation (2).

Normalized weight of R with respect to a:  NW_a(R) = W(R) / (n - missing(a))   (2)

where W(R) is the weight generated for a complete record R of the training dataset with respect to the incomplete record a chosen as seed point, n is the number of attributes in the dataset and missing(a) is the number of missing attributes in the incomplete record a.
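The sketch below shows how equations (1) and (2) could be computed. It is our own illustration in Python/NumPy under the assumption that missing values are encoded as NaN; the function name normalized_weight is hypothetical and not from the paper.

import numpy as np

def normalized_weight(incomplete, complete, sigma):
    # incomplete: record with NaN for its missing attributes (the cluster seed)
    # complete:   a fully observed record from the training dataset
    # sigma:      per-attribute standard deviations of the complete dataset
    observed = ~np.isnan(incomplete)
    # Equation (1): count the attributes of the complete record that fall
    # within one standard deviation of the seed record's attribute value
    similar = np.abs(complete[observed] - incomplete[observed]) <= sigma[observed]
    weight = similar.sum()
    # Equation (2): divide by the number of non-missing attributes of the seed
    return weight / observed.sum()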

Cluster generation and Imputation of missing data

Finally, clusters are generated with every incomplete record of every block of the testing dataset as a centre point; the normalized weight plays the central role in cluster formation. If a is a member of the testing dataset (i.e. it has missing attributes), then a complete record R belongs to the cluster C_a centred at a if NW_a(R) ≥ 0.6. The threshold value of 0.6 was used so that the members of the cluster are highly similar to the centre point. The clusters thus formed have the incomplete record as centre point and a set of similar complete records as members. The value(s) of the missing attribute(s) in the centre point record of each cluster is/are imputed by computing the mean value of the respective attribute(s) of the records in the cluster. The difference between the Mean Imputation method and the proposed method is that here the mean is computed only from the set of similar records that are grouped into the cluster.
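Putting the pieces together, a minimal sketch of the cluster-formation and imputation step is given below. It is again our own Python illustration with hypothetical names; it relies on the normalized_weight function sketched above, assumes NaN-encoded missing values, and assumes each cluster is non-empty.

import numpy as np

def sesu_clust_impute(seed, training, sigma, threshold=0.6):
    # Cluster: all complete records whose normalized weight with respect to
    # the seed record reaches the 0.6 similarity threshold
    members = [row for row in training
               if normalized_weight(seed, row, sigma) >= threshold]
    imputed = seed.copy()
    for j in np.where(np.isnan(seed))[0]:
        # Impute each missing attribute with the mean of that attribute
        # over the cluster members
        imputed[j] = np.mean([row[j] for row in members])
    return imputed

With this threshold, a complete record joins the cluster only if, for at least 60 per cent of the seed record's observed attributes, the complete record's value lies within one standard deviation of the seed's value.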

Experimental Set up

Machine learning datasets are described by conditional attributes and a decision attribute. The conditional attributes characterize the different parameters of the entity and are interdependent, and the decision attribute is derived from the values of the conditional attributes for every entity in the dataset; the value of every attribute therefore plays a vital role in deriving the value of the decision attribute. The records of a dataset are classified into different target groups based on the values of the attributes. The objective of this research work is to evaluate the proposed methodology for the imputation of missing values against other known methods, and the reliability of the performance evaluation is very important. Hence, a complete dataset, the Pima Indians Type II Diabetes dataset from the UCI repository of machine learning databases (Asuncion and Newman 2007), was taken for the experimental study and performance evaluation. The dataset comprises 768 complete instances described by eight features: Number of times pregnant, Glucose tolerance test, Diastolic blood pressure, Triceps skin fold thickness, 2-hour serum insulin, Body Mass Index, Diabetes pedigree function and Age. The predictive class value 1 is interpreted as "tested positive for diabetes" and class value 0 as "tested negative for diabetes"; class value 0 was found in 500 instances and class value 1 in 268 instances. The dataset has only numerical attributes and each record is viewed as a multi-dimensional vector. All the attributes of the dataset are considered in turn in the experiments, as the decision attribute is derived from these attributes. Different percentages of missing data (from 5% to 80%) are generated on the complete dataset by randomly labeling feature values as missing; the affected records form the testing dataset and the remaining complete records collectively form the training dataset. Twenty simulations were conducted to overcome any bias in the performance of the proposed methodology due to the random introduction of missing values. The results are validated by estimating the average imputation error (E), defined as

E = (1/m) Σ_{k=1..m} (1/n) Σ_{i=1..n} |O_ij - I_ij| / (Max_j - Min_j)

where n is the number of imputed values, m is the number of random simulations for each missing value, O_ij is the original value and I_ij the imputed value of attribute j in the i-th imputation, Max_j and Min_j are the maximum and minimum values of attribute j, and j is the attribute under consideration. The performance of the proposed method is compared with other widely used imputation methods, namely 10-NN, NORM, EMImpute_Columns, LSImpute_Rows and Mean Imputation.
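As an illustration of this error measure, the sketch below computes E for a single attribute. It is our own Python rendering of the formula above (the function and variable names are ours), assuming the original and imputed values are collected per simulation run.

import numpy as np

def average_imputation_error(originals, imputations, max_j, min_j):
    # originals, imputations: arrays of shape (m simulations, n imputed values)
    # for one attribute j, whose observed range is [min_j, max_j]
    originals = np.asarray(originals, dtype=float)
    imputations = np.asarray(imputations, dtype=float)
    per_value = np.abs(originals - imputations) / (max_j - min_j)
    # Average over the n imputed values, then over the m random simulations
    return per_value.mean(axis=1).mean()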

IV. PERFORMANCE ANALYSIS

The performance of the semi-supervised clustering methodology on the imputation of missing data is shown in Table I, which reports the computed average imputation error for all eight attributes over different percentages of missing data. The average imputation error varies significantly from attribute to attribute owing to the nature and distribution of the respective attribute in the dataset. The overall average imputation error varies from a minimum of ... to a maximum of ... over the different percentages of missing data from 5% to 80%.

Table I: Average Imputation Error for the attributes of the Pima Indians Type II dataset (rows: sizes of missing data from 5% to 80%; columns: Number of times pregnant, Glucose tolerance test, Diastolic blood pressure, Triceps skin fold thickness, 2-hour serum insulin, Body Mass Index, Diabetes pedigree function, Age, and the Average over the attributes)

Figure 1: Performance of the Semi-Supervised Clustering methodology on the imputation of missing data

Figure 1 shows the impact of the size of the missing data on the estimated average imputation error. Within each attribute, the average imputation error shows very little variation with respect to the size of missing data. With an increase in the size of missing data, the availability of complete records in the training dataset is reduced. The method is robust in that it is capable of forming clusters of similar records when up to 50% of the data is missing, with no noticeable increase in error levels. With more than 50% missing data, cluster formation is less effective owing to the sparse availability of similar records in the training dataset; hence the imputation of missing data is less consistent with the observed values when the size of missing data exceeds 50%.

Comparative Analysis

The performance of the proposed method is compared with the other imputation methods, namely 10-NN, NORM, EMImpute_Columns, LSImpute_Rows and Mean Imputation, in Table II. As a sample of performance, up to 35% of missing data is taken up for comparison. The imputation process in all the methods uses the complete records of the dataset, and the accuracy of the imputed values normally depends on the size of the missing data and on the availability of complete instances for the imputation process. Hence, missing data is simulated up to about one-third of the dataset, i.e. a maximum of 35% missing data is taken for the comparative performance analysis. Table II shows the mean imputation error ± standard deviation across the simulations for each method and each size of missing data, correct to one decimal place.

Table II: Comparative performance of various imputation methods (mean imputation error ± standard deviation)

Method               5%        10%   15%   20%   25%   30%   35%
10-NN                10.2±...  ...   ...   ...   ...   ...   ...±10.5
NORM                 16.0±...  ...   ...   ...   ...   ...   ...±13.9
EMImpute_Columns     8.5±...   ...   ...   ...   ...   ...   ...±22.3
LSImpute_Rows        10.9±...  ...   ...   ...   ...   ...   ...±23.3
Mean Imputation      12.5±...  ...   ...   ...   ...   ...   ...±10.3
SESU_CLUST           11.0±...  ...   ...   ...   ...   ...   ...±7.4

Figure 2: Comparative performance of various methods of missing data imputation (Average Imputation Error)

Figure 3: Comparative performance of various methods of missing data imputation (Standard Deviation)

Table II shows the comparative performance of the imputation methods on the Pima Indians Type II Diabetes dataset for different percentages of missing data up to 35%. It is observed from Table II that the NORM method produces the highest mean error rate and hence the least accurate estimates. On the other hand, EMImpute_Columns shows the lowest mean imputation error rate among the existing methods, with stability in the imputation process across different sizes of missing data. The proposed methodology performs better than Mean Imputation and NORM, with an average imputation error ranging from 10.9 to ...

Though the 10-NN and LSImpute_Rows methods show better performance in terms of average imputation error than the proposed semi-supervised clustering methodology SESU_CLUST, a noticeable increase in their mean error rates is found as the size of the missing data increases. It is observed from Table II and Figure 2 that the proposed methodology SESU_CLUST achieves a stable error rate for increasing percentages of missing data.

Figures 2 and 3 present the comparative performance of the various missing data imputation methods in terms of average imputation error and standard deviation, respectively. The standard deviation is a valid measure of dispersion about the centre of a data series: the higher the standard deviation, the wider the interval over which the imputed values are distributed (Ward 2004). It is observed from Table II and Figure 3 that LSImpute_Rows shows the highest standard deviation, which varies from 23.3 to 24.0, and hence its mean error rate is widely dispersed over the range (mean error rate ± σ). EMImpute_Columns stands next to LSImpute_Rows, with its standard deviation varying from 22.3 to ...; though this method produced the lowest average error rate among the five existing methods, its range of mean error rate is wide because the standard deviation is high. In the NORM method the standard deviation varies from 13.4 to 13.9, with the mean error rate varying from 15.7 to .... Mean Imputation performs with a mean error rate varying from 12.3 to 12.5 and a standard deviation varying from 10.3 to 10.5, which is higher than that of the proposed methodology SESU_CLUST. The range (mean ± σ) for the proposed methodology SESU_CLUST varies from 3.8 to 18.2 (minimum range width 14.4), attained at 5% missing data, to 3.5 to 18.5 (maximum range width 15), reached in the 30% case; the range widths for the other sizes of missing data lie between 14.6 and 15. There is a significant reduction in the lengths of these intervals, which is due to the uniformly smaller standard deviation values of SESU_CLUST compared with those of 10-NN, even though its mean values are slightly higher.

The proposed methodology SESU_CLUST has the least standard deviation of all the methods compared. As the method imputes missing values using a set of similar complete records grouped into clusters, it results in the lowest standard deviation, and the lower standard deviation indicates that the imputed values are less dispersed from the centre of the data series. As the imputation process depends entirely on the clusters generated with the incomplete record as centre point, an imputed value that is little dispersed around the centre point (mean value) of the cluster is a valid representation or replacement for the missing attribute(s) in the incomplete record.

V. CONCLUSION

The effective use of information technology is crucial for organizations to stay competitive in today's complex, evolving environment. Organizations face many challenges when trying to deal with large, diverse and often complex databases, and they adopt several strategies to improve the quality of the data in their databases. Missing data, one of the pervasive problems in data analysis, is handled using various strategies depending on the problem context. The proposed SESU_CLUST method imputes missing values by exploiting the concept of semi-supervised clustering. The method performs relatively better than the other methods examined, producing stable results for up to 50% missing data with the lowest standard deviation of the error. The weight normalization facilitates the application of the method to any dataset, irrespective of the number of attributes. The performance of the method can be further improved by generating clusters with higher intra-cluster similarity for the imputation process.

REFERENCES

Allison, P.D. (2001). Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences. Sage, Thousand Oaks, CA.
Asuncion, A., Newman, D.J. (2007). UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine, CA.
Barnard, J., Meng, X. (1999). Applications of multiple imputations in medical studies: from AIDS to NHANES. Statistical Methods in Medical Research, Vol. 8.
Batista, G., Monard, M. (2003). An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17(5/6).
Beaumont, J.-F. (2000). On regression imputation in the presence of non-ignorable non-response. In: Proceedings of the Survey Research Methods Section, ASA.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and Regression Trees. Chapman and Hall.
Brown, M.L., Kros, J.F. (2003). The impact of missing data on data mining. In: Data Mining: Opportunities and Challenges, J. Wang, Ed. IGI Publishing, Hershey, PA.
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., Zanasi, A. (1998). Discovering Data Mining: From Concepts to Implementation. Prentice-Hall, Upper Saddle River, NJ.
Chen, J., Shao, J. (2001). Jackknife variance estimation for nearest-neighbor imputation. Journal of the American Statistical Association, Vol. 96.
Clark, P., Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3(4).

Friedman, J.H., Kohavi, R., Yun, Y. (1996). Lazy decision trees. In: Proceedings of the 13th AAAI and 8th IAAI.
Hair, J., Anderson, R., Tatham, R., Black, W. (1998). Multivariate Data Analysis. Prentice Hall, Upper Saddle River, NJ.
Han, J., Kamber, M. (2006). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Mateo, CA.
Hu, M., Salvucci, S.M., Cohen, M.P. (1998). Evaluation of some popular imputation algorithms. In: Proceedings of the Survey Research Methods Section, ASA.
Kantardzic, M. (2003). Data Mining: Concepts, Models, Methods and Algorithms. Wiley-IEEE Computer Society Press, New York, NY.
Cios, K.J., Kurgan, L. (2002). Trends in data mining and knowledge discovery. In: Knowledge Discovery in Advanced Information Systems. Springer, Berlin.
Coppola, L., Di Zio, M., Luzi, O., Ponti, A., Scanu, M. (2000). Bayesian networks for imputation in official statistics: a case study. In: DataClean Conference.
Lakshminarayan, K., Harp, S.A., Samad, T. (1999). Imputation of missing data in industrial databases. Applied Intelligence, 11(3).
Lall, U., Sharma, A. (1996). A nearest-neighbor bootstrap for re-sampling hydrologic time series. Water Resources Research, Vol. 32.
Laurance, J. (2006). Breast cancer cases rise 80% since Seventies. The Independent.
Little, R.J.A., Rubin, D.B. (1986). Statistical Analysis with Missing Data. John Wiley & Sons, Inc., USA.
Little, R.J.A., Rubin, D.B. (2002). Statistical Analysis with Missing Data, second edition. John Wiley and Sons, Hoboken, NJ.
Mariso Giardina, Yongyang Huo, Francisco Azuaje, Paul McCullagh, Roy Harper (2005). A missing data estimation analysis in Type II diabetes databases. In: Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems (CBMS'05). IEEE.
Marvin L. Brown, John F. Kros (2003). Data mining and the impact of missing data. Industrial Management & Data Systems, 103/8.
Marwala, T., Chakraverty, S., Mahola, U. (2006). Fault classification using multi-layer perceptrons and support vector machines. International Journal of Engineering Simulation, 7(1).
Schafer, J.L. (1999). NORM: multiple imputation of incomplete multivariate data under a normal model, version 2.03. Software for Windows 95/98/NT.
Stefanakos, C., Athanassoulis, G.A. (2001). A unified methodology for analysis, completion and simulation of non-stationary time series with missing values, with application to wave data. Applied Ocean Research, 23.

Trond, H., Bjarte, D., Inge, J. (2004). LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Research, 32(3).
Van Buuren, S., Boshuizen, H., Knook, D. (1999). Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine, Vol. 18.
Ward, B. (2004). The Best of Both Worlds: A Hybrid Statistics Course. Journal of Statistics Education [Online], 12(3).
Watanabe, S. (1985). Pattern Recognition: Human and Mechanical. John Wiley and Sons, Inc., New York, USA.
Zhang, S.C., et al. (2005). Missing is useful: missing values in cost-sensitive decision trees. IEEE Transactions on Knowledge and Data Engineering, Vol. 17(12).

Authors' Profiles

Ilango Paramasivam received his Master's degree in Computer Applications from Alagappa University, Karaikudi, India. He is pursuing a Ph.D. in Data Mining in the Department of Computer Applications, National Institute of Technology (NIT), Tiruchirappalli. He has 16 years of postgraduate teaching experience in Computer Science and Engineering. He is currently working as Assistant Professor in the Intelligent Systems Division, School of Computing Sciences, VIT University, Vellore. His major research interests include Data Mining, Machine Learning and Distributed Computing.

Hemalatha Thiagarajan received her Ph.D. degree from the University of Texas at Austin. She has 25 years of graduate teaching experience in Mathematics and Computer Science. She is currently working as Professor in the Department of Mathematics, National Institute of Technology (NIT), Tiruchirappalli. Her major research interests are Algorithms and Operations Research.

Poonkuntran Shanmugam received his B.E. in Information Technology from Bharathidasan University, Tiruchirappalli, India, and his M.Tech. in Computer and Information Technology from Manonmaniam Sundaranar University, Tirunelveli. He is currently pursuing a Ph.D. in the Department of Computer Science and Engineering, Manonmaniam Sundaranar University, Tirunelveli. He has 4 years of experience in teaching and research and is currently working on security models for medical image processing. His areas of interest are digital image processing, soft computing and energy-aware computing in computer vision.

Nickolas Savarimuthu received his M.E. and Ph.D. degrees in Computer Science, both from the National Institute of Technology (NIT), Tiruchirappalli. He is an Assistant Professor in the Department of Computer Applications, NIT, Tiruchirappalli, and has more than 20 years of teaching and research experience. His major research interests include data preprocessing and data mining in general, and applications of data mining in particular.


More information

Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations

Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations Dataset Preparation and Indexing for Data Mining Analysis Using Horizontal Aggregations Binomol George, Ambily Balaram Abstract To analyze data efficiently, data mining systems are widely using datasets

More information

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University

More information

Impact of Boolean factorization as preprocessing methods for classification of Boolean data

Impact of Boolean factorization as preprocessing methods for classification of Boolean data Impact of Boolean factorization as preprocessing methods for classification of Boolean data Radim Belohlavek, Jan Outrata, Martin Trnecka Data Analysis and Modeling Lab (DAMOL) Dept. Computer Science,

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

More information

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,

More information

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM. DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,

More information

Financial Trading System using Combination of Textual and Numerical Data

Financial Trading System using Combination of Textual and Numerical Data Financial Trading System using Combination of Textual and Numerical Data Shital N. Dange Computer Science Department, Walchand Institute of Rajesh V. Argiddi Assistant Prof. Computer Science Department,

More information

ETL PROCESS IN DATA WAREHOUSE

ETL PROCESS IN DATA WAREHOUSE ETL PROCESS IN DATA WAREHOUSE OUTLINE ETL : Extraction, Transformation, Loading Capture/Extract Scrub or data cleansing Transform Load and Index ETL OVERVIEW Extraction Transformation Loading ETL ETL is

More information

HIGH DIMENSIONAL UNSUPERVISED CLUSTERING BASED FEATURE SELECTION ALGORITHM

HIGH DIMENSIONAL UNSUPERVISED CLUSTERING BASED FEATURE SELECTION ALGORITHM HIGH DIMENSIONAL UNSUPERVISED CLUSTERING BASED FEATURE SELECTION ALGORITHM Ms.Barkha Malay Joshi M.E. Computer Science and Engineering, Parul Institute Of Engineering & Technology, Waghodia. India Email:

More information

ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS

ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS Abstract D.Lavanya * Department of Computer Science, Sri Padmavathi Mahila University Tirupati, Andhra Pradesh, 517501, India lav_dlr@yahoo.com

More information

Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis

Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis Int. Journal of Math. Analysis, Vol. 5, 2011, no. 1, 1-13 Review of the Methods for Handling Missing Data in Longitudinal Data Analysis Michikazu Nakai and Weiming Ke Department of Mathematics and Statistics

More information

Neural Networks in Data Mining

Neural Networks in Data Mining IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V6 PP 01-06 www.iosrjen.org Neural Networks in Data Mining Ripundeep Singh Gill, Ashima Department

More information

A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms

A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms Ian Davidson 1st author's affiliation 1st line of address 2nd line of address Telephone number, incl country code 1st

More information

Effective Analysis and Predictive Model of Stroke Disease using Classification Methods

Effective Analysis and Predictive Model of Stroke Disease using Classification Methods Effective Analysis and Predictive Model of Stroke Disease using Classification Methods A.Sudha Student, M.Tech (CSE) VIT University Vellore, India P.Gayathri Assistant Professor VIT University Vellore,

More information

Enhanced Boosted Trees Technique for Customer Churn Prediction Model

Enhanced Boosted Trees Technique for Customer Churn Prediction Model IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V5 PP 41-45 www.iosrjen.org Enhanced Boosted Trees Technique for Customer Churn Prediction

More information

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical

More information

Grid Density Clustering Algorithm

Grid Density Clustering Algorithm Grid Density Clustering Algorithm Amandeep Kaur Mann 1, Navneet Kaur 2, Scholar, M.Tech (CSE), RIMT, Mandi Gobindgarh, Punjab, India 1 Assistant Professor (CSE), RIMT, Mandi Gobindgarh, Punjab, India 2

More information

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Mobile Phone APP Software Browsing Behavior using Clustering Analysis Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis

More information

Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy

Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy Astronomical Data Analysis Software and Systems XIV ASP Conference Series, Vol. XXX, 2005 P. L. Shopbell, M. C. Britton, and R. Ebert, eds. P2.1.25 Making the Most of Missing Values: Object Clustering

More information

The Probit Link Function in Generalized Linear Models for Data Mining Applications

The Probit Link Function in Generalized Linear Models for Data Mining Applications Journal of Modern Applied Statistical Methods Copyright 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 164-169 1538 9472/13/$95.00 The Probit Link Function in Generalized Linear Models for Data Mining Applications

More information

A Divided Regression Analysis for Big Data

A Divided Regression Analysis for Big Data Vol., No. (0), pp. - http://dx.doi.org/0./ijseia.0...0 A Divided Regression Analysis for Big Data Sunghae Jun, Seung-Joo Lee and Jea-Bok Ryu Department of Statistics, Cheongju University, 0-, Korea shjun@cju.ac.kr,

More information

Selection of Optimal Discount of Retail Assortments with Data Mining Approach

Selection of Optimal Discount of Retail Assortments with Data Mining Approach Available online at www.interscience.in Selection of Optimal Discount of Retail Assortments with Data Mining Approach Padmalatha Eddla, Ravinder Reddy, Mamatha Computer Science Department,CBIT, Gandipet,Hyderabad,A.P,India.

More information

A Survey on classification & feature selection technique based ensemble models in health care domain

A Survey on classification & feature selection technique based ensemble models in health care domain A Survey on classification & feature selection technique based ensemble models in health care domain GarimaSahu M.Tech (CSE) Raipur Institute of Technology,(R.I.T.) Raipur, Chattishgarh, India garima.sahu03@gmail.com

More information

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing

More information

Data Cleaning and Missing Data Analysis

Data Cleaning and Missing Data Analysis Data Cleaning and Missing Data Analysis Dan Merson vagabond@psu.edu India McHale imm120@psu.edu April 13, 2010 Overview Introduction to SACS What do we mean by Data Cleaning and why do we do it? The SACS

More information

Application of Data Mining Methods in Health Care Databases

Application of Data Mining Methods in Health Care Databases 6 th International Conference on Applied Informatics Eger, Hungary, January 27 31, 2004. Application of Data Mining Methods in Health Care Databases Ágnes Vathy-Fogarassy Department of Mathematics and

More information

A Hybrid Model of Hierarchical Clustering and Decision Tree for Rule-based Classification of Diabetic Patients

A Hybrid Model of Hierarchical Clustering and Decision Tree for Rule-based Classification of Diabetic Patients A Hybrid Model of Hierarchical Clustering and Decision Tree for Rule-based Classification of Diabetic Patients Norul Hidayah Ibrahim 1, Aida Mustapha 2, Rozilah Rosli 3, Nurdhiya Hazwani Helmee 4 Faculty

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information

A Pseudo Nearest-Neighbor Approach for Missing Data Recovery on Gaussian Random Data Sets

A Pseudo Nearest-Neighbor Approach for Missing Data Recovery on Gaussian Random Data Sets University of Nebraska at Omaha DigitalCommons@UNO Computer Science Faculty Publications Department of Computer Science -2002 A Pseudo Nearest-Neighbor Approach for Missing Data Recovery on Gaussian Random

More information

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery Index Contents Page No. 1. Introduction 1 1.1 Related Research 2 1.2 Objective of Research Work 3 1.3 Why Data Mining is Important 3 1.4 Research Methodology 4 1.5 Research Hypothesis 4 1.6 Scope 5 2.

More information

E-commerce Transaction Anomaly Classification

E-commerce Transaction Anomaly Classification E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce

More information

Real Time Data Analytics Loom to Make Proactive Tread for Pyrexia

Real Time Data Analytics Loom to Make Proactive Tread for Pyrexia Real Time Data Analytics Loom to Make Proactive Tread for Pyrexia V.Sathya Preiya 1, M.Sangeetha 2, S.T.Santhanalakshmi 3 Associate Professor, Dept. of Computer Science and Engineering, Panimalar Engineering

More information

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

More information

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

In this presentation, you will be introduced to data mining and the relationship with meaningful use. In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine

More information

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell

THE HYBRID CART-LOGIT MODEL IN CLASSIFICATION AND DATA MINING. Dan Steinberg and N. Scott Cardell THE HYBID CAT-LOGIT MODEL IN CLASSIFICATION AND DATA MINING Introduction Dan Steinberg and N. Scott Cardell Most data-mining projects involve classification problems assigning objects to classes whether

More information

Local outlier detection in data forensics: data mining approach to flag unusual schools

Local outlier detection in data forensics: data mining approach to flag unusual schools Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential

More information

DATA PREPARATION FOR DATA MINING

DATA PREPARATION FOR DATA MINING Applied Artificial Intelligence, 17:375 381, 2003 Copyright # 2003 Taylor & Francis 0883-9514/03 $12.00 +.00 DOI: 10.1080/08839510390219264 u DATA PREPARATION FOR DATA MINING SHICHAO ZHANG and CHENGQI

More information

Implementation of Data Mining Techniques to Perform Market Analysis

Implementation of Data Mining Techniques to Perform Market Analysis Implementation of Data Mining Techniques to Perform Market Analysis B.Sabitha 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, P.Balasubramanian 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information