A Pseudo Nearest-Neighbor Approach for Missing Data Recovery on Gaussian Random Data Sets
|
|
|
- Neal McDonald
- 9 years ago
- Views:
Transcription
1 University of Nebraska at Omaha Computer Science Faculty Publications Department of Computer Science A Pseudo Nearest-Neighbor Approach for Missing Data Recovery on Gaussian Random Data Sets Xiaolu Huang Qiuming Zhu University of Nebraska at Omaha, [email protected] Follow this and additional works at: Part of the Computer Sciences Commons Recommended Citation Huang, Xiaolu and Zhu, Qiuming, "A Pseudo Nearest-Neighbor Approach for Missing Data Recovery on Gaussian Random Data Sets" (2002). Computer Science Faculty Publications. Paper This Article is brought to you for free and open access by the Department of Computer Science at DigitalCommons@UNO. It has been accepted for inclusion in Computer Science Faculty Publications by an authorized administrator of DigitalCommons@UNO. For more information, please contact [email protected].
2 A Pseudo Nearest-Neighbor Approach for Missing Data Recovery on Gaussian Random Data Sets Xiaolu Huang and Qiuming Zhu* Department of Computer Science University of Nebraska at Omaha Omaha, NE * Contact Author Voice: (402) Fax: (402) [email protected] November 7, 200
3 A Pseudo- Nearest Neighbor Approach for Missing Data Recovery on Gaussian Random Data Sets Abstract Missing data handling is an important preparation step for most data discrimination or mining tasks. Inappropriate treatment of missing data may cause large errors or false results. In this paper, we study the effect of a missing data recovery method, namely the pseudo- nearest neighbor substitution approach, on Gaussian distributed data sets that represent typical cases in data discrimination and data mining applications. The error rate of the proposed recovery method is evaluated by comparing the clustering results of the recovered data sets to the clustering results obtained on the originally complete data sets. The results are also compared with that obtained by applying two other missing data handling methods, the constant default value substitution and the missing data ignorance (non-substitution) methods. The experiment results provided a valuable insight to the improvement of the accuracy for data discrimination and knowledge discovery on large data sets containing missing values. Key words: Missing data, Missing data recovery, Data imputation, Data Clustering, Gaussian data distribution, Data mining. Introduction The ever-growing data sets stored in large amount of databases and data warehouses are treasure mines with precious information (knowledge) hidden in them. In order to retrieve those information, tools for data mining and knowledge discovery such as On-line Analysis Process (OLAP), statistical analyzers, and hierarchical clustering are widely used by businesses, government and scientific research institutions [2].
4 In most databases and data warehouses, row data are not ready to be processed by data mining tools because they may contain a lot of irrelevant, inconsistent, or missing data items. Therefore, data discrimination and mining is often a multistage process in which people use some formal or informal methods to evaluate the appropriateness of the problems, define processing stages and expected solutions, implement technical approaches and strategies, and produce measurable results. For example, in bio-informatics a typical process for gene expression data discrimination and mining involves roughly the stages of data collection and preparation, cleansing and filtering, clustering and synthesizing, and then the stages of knowledge extraction and representation. Each of these stages has a specific objective and a set of functions to perform. The data preparation stage aims to getting rid of erroneous data and find the most accurate ways to represent the uncertain information. The absence of certain values for relevant data attributes in data items can seriously affect the accuracy of data mining results. Missing data handling is one of the main issues often dealt with in the data preparation steps. In most cases, missing data should be pre-processed (recovered) so as to allow the whole data set to be processed by a datamining tool. It has also been known that data preparation and filtering steps take considerable amount of processing time in many data mining projects []. While attributes in most data sets can be distinguished in categories of randomly distributed or non-randomly distributed, the missing data can also be distinguished in these two categories: () non-randomly distributed, and (2) randomly distributed. That is, the mechanisms underlying the situations of certain data being missing can be characterized as either random or non-random. But this randomness is by no means related to the randomness of the attribute in the original data set, or at least we do not assume that in this study. Randomly distributed missing data are the most commonly encountered cases in scientific, economic and business data mining applications []. In this paper, we focus on the methods for handling the randomly distributed missing data only. We also focus on the kind of data sets that are in Gaussian random distributions. That is, we study the effects of randomly distributed missing data handling methods on randomly distributed data set. However, it must be pointed 2
5 out that the randomness of the missing data is unknown in our study and is possibly totally different from the randomness of the original data set. According to Little and Rubin [7], the procedures for treating the randomly missing data can be grouped into four categories in general: ). Ignorance-based Procedures. This is a non-recovery method and is the most trivial approach. When some variables are not recorded for some of the data attributes, a simple expedient is to discard the incompletely recorded units entirely and to analyze only the units with complete data. It is generally easy to carry out and may be satisfactory with certain data analysis tasks. However, it can lead to serious biases, especially when missing data are randomly distributed. Moreover, it is usually very difficult to evaluate the errors caused by the discarded data records []. Notice that this method is different from the non-substitution methods. It throws out the entire data point rather than just ignore the missing data values. 2). Weighting-based Procedures. This is also a non-substitution (also non-recovery) procedure and is most commonly used in the inferences from sample survey data that contains nonresponse answers. The weights are designed such that they are inversely proportional to the probability of data presence in selections according to some empirical results. The purpose of the method is to reduce the effect of attributes with large percentage of missing values. The procedure is more applicable to non-randomly distributed missing data [9]. 3). Model-Based Procedures. This is a missing data recovery method. A missing data replacement is generated by defining a model for the partially missing data and biasing inferences on the likelihood under that model, with parameters estimated by procedures such as maximum likelihood. Advantages of this approach are the flexibility and divergence. One example of the application of this approach is seen in Krishnamoorthy and Pannala s [4], where they proposed three simple exact tests as alternatives to the traditional likelihood ratio test to assess the accuracy of this missing data reconstruction procedures. However, the complexity of these procedures prevented their applications to data mining that deal with very large data sets. It has also been known that the Model-Based Procedures are more suitable to data that maintain certain non-static regularities, such as the time series data sets that are not common to most data mining applications [6]. 3
6 4). Imputation-Based Procedures. This is the type of missing data substitution methods we discuss in this paper. In this approach, the missing values are filled in by certain means of approximation and the resultant completed data are analyzed by standard statistical analytical methods [3]. Commonly used procedures for imputation include: (a) Hot deck imputation, where recorded units in the sample are substituted by a value obtained from the present data set following certain rules, for example the value from the nearest data record [2]. (b) Default value imputation, where a constant is used to substitute the missing values, for example all missing values being replaced by value zero or the median of the value range []. (c) Statistical imputation, where the missing values are substituted by a statistically inspired value that has a high likelihood for the true occurrence, for example the mean values computed from the set of non-missing data records [3]. (d) Regression imputation, where the missing variables for a unit are estimated by values derived from the known variables according to a given function or some functional forms [0]. One example of the application of regression imputation based missing data handling approach is V. Letfus s paper [5]. Problem with the regression imputation is that it raises another critical issue of how to verify the legitimacy of the underlying function assumed for the regression. Besides the above four procedures, there are also some other missing data handling methods, such as the induction substitution approaches and technically skipping missing data approaches. Strictly speaking, induction substitution approaches also belong to imputation-based procedures [2, 3]. However, induction substitution approaches are more individual data object specific among the missing data handling approaches. In this paper, we will focus our study on the above-mentioned imputation-based procedures for missing data recovery. In a previous study on the effect of missing data, Zhu has derived an analytic form for estimating the error probability of classifications made on the partially available data sets versus that on the complete sets by using the Bhattacharyya bounds [4]. It has been shown that an upper bound of minimum probability of error, under the condition that the data attributes are independently distributed, is established at 4
7 n R( xi ) i= k+ p( x ω i ) p( xi ω 2 ) dx i. (.) Where a complete data item is given as x = [x, x 2,, x k, x k+,, x n ] and a data item with missing attribute values is given as x k = [x, x 2,, x k ]. The ω and ω 2 are two symbols denoting two distinct classes of the data sets. In the cases that the data attributes are in independent Gaussian distributions, the minimum probability of error of classification with data item x k versus x is bounded by n i= k+ 2 σ σ 2 i σ 2 i σ 2 i2 i2 ( µ µ ) 4 σ e 2 i i i + σ i 2 (.2) Where (µ i, µ i2 ) and (σ i, σ i2 ) are the mean and variance values of the i th attributes for the data items in classes ω and ω 2 respectively. We need to mention that the assumption for attribute independence is quite strong in many data discrimination and mining applications. It is common that for data in real applications most of them have correlated features. So the above metric may not be measurable directly and precisely in real applications. As indicated, the metric only gives a theoretically minimum probability of error under the independence assumption. When dealing with real applications, the error rate varies depending on the correlativeness of the data attributes and methods used to handle the missing data, such as the proposed approaches discussed in this paper. In this paper, we study the practical effects of the data processing errors when some missing data recovery methods are applied to the data sets. Assumptions in our study include: () the locations of the missing data in the data set are random with an unknown distribution, (2) the values of the missing data are random with an unknown distribution, (3) the data records are not labeled, i.e., no categorical information about the data items is given, (4) the missing data are numerically valued, and (5) the data attributes of the data sets are uncorrelated. That is, each data attribute has its own distribution of possible values. However, the values of each attribute may be governed by a Gaussian or non-gaussian distribution (the exact parameters of these distributions are unknown). In other words, each data attribute is governed by a univariate Gaussian distribution 5
8 in case that the data set is Gaussian randomly distributed. The assumption (5) can be relaxed if we concentrate on the comparisons of the missing data handling methods, rather than a quantitative measurement of the probability of error of each method precisely. In fact, the correlativeness of data attributes could be used to assist the missing data recovery, especially for the model-based and regression methods. But these are not the main concentration of this paper. Our work is focused on the missing data substitution methods. In the methods we studied, the correlativeness has less effect. The main reason we list the assumption (5) is to indicate that our methods do not make use of the attribute correlations. The results of our missing data recovery methods are evaluated by comparing the clustering results to that obtained by employing certain other missing data recovery techniques. These methods include that making the use of constant default and statistical imputations, as well as skipping (ignoring) the attributes that have missing values. The method is also evaluated on the basis of the clustering results made on the complete set of the data items (non-missing data sets). Three major parameters are used to generate the testing data sets in the evaluation: () missing data rate, ranging from 5% to 40% measured on the data set, (2) number of classes (or clusters) in the data set, and (3) the ranges of Gaussian variances in the experimental Gaussian distributed data sets. The paper is organized as follows. Section two describes the basis of the pseudo- nearest neighbor substitution method and the procedure. Section three presents our experimental results of the proposed method and compares it with three other missing data imputation and recovery/non-recovery methods. Section four contains conclusion remarks. 2. Computation of pseudo- nearest neighbor To derive the pseudo- nearest neighbor method for missing data recovery, here we are going to first introduce the concept of pseudo-similarity (or dissimilarity measurement) between a data record x with missing values and a data record without missing values, as well as between two data records with different number of missing values. 6
9 Let x = [x, x 2,, x n ] be a data record in a data set {x}. A data record x with missing values means that some of the elements x i x, i =,, n, have no valid attribute values present at the time the vector x is to be processed as an input to a data mining system. To facilitate the expression and computation, we use a NULL symbol to represent the missing value of x i. That is, when the value of an attribute x i is missing, we say that it has a value NULL. Note that in random missing cases, any one of the n elements of x could have a NULL value. Without losing generality and for convenience of expression and computation, we use x k = [x, x 2,, x k, NULL k+,, NULL n ], k < n, to represent a data record x with (n-k) missing attribute values. That is, we always move the missing elements of x to its right end and assume that the missing values of x k with respect to x are [x k+,, x n ]. In brief, we write x k = [x, x 2,, x k ], which stands for a vector x with k non-missing elements. We also say x k is an incomplete data record. Let {c} be a set of categorical centers of data set {x}. That is, {c} is a set of complete data record, c = [c, c 2,, c n ]. With the same re-ordering of data elements as x k for c, a pseudosimilarity between a data record x with missing values (here actually the x k ) and a complete data record c can be defined as S p (x k, c) = k i= Φ(x i k, c i ). (2.) Where Φ(.) is a certain kind of similarity (or distance metric) measurement function. The S p (x k, c) is useful in data clustering. When performing clustering of the data set {x}, the S p (x k, c) with respect to each cluster center c in the collection is compared with each other to determine the belonging of x k. Note that the categorical center c always has a complete set of attributes that can be computed on the basis of the presented values of data records or by an initial random selection. Let x k and x l be two incomplete data records of data set {x}. Again, we can re-arrange the order of the data elements in x k and x l, in such a way respectively, that () if an element has its value missing in both x k and x l, then it is placed toward the right end of the vectors, (2) if an element has its value present (i.e, non-missing) in both x k and x l, then it is placed toward the left end of the vector. Let us use a symbol # to represent the value that is missing in one of the vectors x k 7
10 and x l but not in both, then we can have the x k and x l be expressed as x k = [x, x 2,, x d, # d+,, # k, NULL k+,, NULL n ] and x l = [x, x 2,, x d, # d+,, # l, NULL l+,, NULL n ], where d min(k, l). Note that it does not matter whether k equals to l or not. The pseudo-similarity between x k and x l is defined as S p (x k, x l ) = d i= Φ(x i k, x i l ) = d* d i= ( d x i= k i x k i x k i d x i= l i x x l i l i ). (2.2) The measurement is actually a weighted correlation value between the two vectors with partially missing element values. It takes count of () the number of commonly present elements, and gives more weight on the vectors having more present elements, and (2) the correlation on the present element values. Thus, if two vectors have the same correlation value, then a larger Pseudo similarity S p (x k, x l ) is given to the vectors having less missing elements. Table 2. shows some examples of the S p (x k, x l ) measurements. Nearest neighbor substitution is a typical hot deck imputation method to handle missing data. Let x k = [x, x 2,, x k, NULL k+,, NULL n ] be the data record with missing values to be recovered. The method first searches for a data record x l within the data set {x} such that () x l has the presence of value x k+, (2) x l has the largest Pseudo-similarity value, based on the present data attribute values, (3) the present value x k+ of x l is used to replace the NULL k+ in x k. Since the pseudo-similarity measurement is used in this evaluation, we call the x k and x l pseudonearest-neighbors, and thus the name for the missing data recovery method. It needs to point out that the term pseudo-nearest-neighbor was also used by Mojirsheibani [8] to describe an approach for combining different classifiers in order to construct more effective classification rules. The principle of the technique used there is actually the same as we use here, except that it is used here to identify the most similar data points for missing data recovery. A procedure of the pseudo-nearest-neighbors substitution method for missing data recovery is presented as follows. Procedure pseudo- nearest neighbor method for missing-data recovery 8
11 Pre-condition: a data set {x} with members in format of x k = [x, x2,, xk, NULLk+,, NULLn] Post-condition: a data set {x} with members in format of x k = [x, x2,, xn], i.e., the missing values being substituted by corresponding values of the pseudo-nearest-neighbors. Computation: For each vector x k { For each NULL valued element xi k of x k } { For each x l {x} - x k ( If the xi l value is non-missing Compute S p (x k, x l ) } Find the x l* that has the largest value of S p (x k, x l ) among all x l examined } Replace the element xi k of x k by the xi l value of x l* 3. Experimental results 3. Compared methods We experimented with three imputation methods for recovery of randomly distributed missing data. () In hot deck imputation, the randomly distributed missing data at a dimension of a data object is filled with the non-null value from the pseudo-nearest-neighbor of the data set, as described above. The procedure is named mneighbor. (2) In default value imputation, the randomly distributed missing data at a dimension of a data object is filled with the median value of the whole data set. A procedure is named mmedian for the median value substitution. The programs are composed of two steps: in the first part, the median values of the attributes are computed according to the present values in the data records. In the second step, it converts all the missing data in the data set into the median value of the data set. (3) In statistical imputation, the randomly distributed missing data at a dimension of a data object is filled with dimension mean of the whole data set, calculated as such: 9
12 M j = T i= V kj / n (3.) where T is the total number of data objects in a data set; V kj is the valid data value of the k s data object at dimension j; N j is the total number of data objects that have data missing at j s dimension, n =T-N j. A procedure is named mgmean to carry out this computation. The performances of above methods are also compared with a missing data ignorance (nonsubstitution) approach. In this approach, a set of categorical centers {c} for the data set {x} is assumed where, each member of {c} is a complete data record c = [c, c 2,, c n ] of a data set {x}. When performing clustering of the data set, the S p (x k, c) on the presented data values of the x are computed with respect to c instead of computing S p (x, c), where x k is a data point of {x} with n-k missing attribute values. That is, the missing values are skipped (ignored) in computing the similarity of the data point with respect to the cluster centers. A procedure named mskipping carries this computation Data sets We evaluate the missing data recovery schemes with respect to data sets having an underlying Gaussian distribution. The Gaussian distributed data sets are generated using random number generators with the following given parameters: (a) the number of clusters (2-50), (b) the number of data attributes (dimensions) in each cluster (2 500), (c) the number of data objects (points) to be generated for each cluster ( ), and (d) the ranges of the Gaussian mean and Gaussian variance values for each dimension of the cluster. Table 3. gives an example of the selected Gaussian Means and Gaussian Variances for a data set with 0 clusters and 20 attributes, where capital letters A, B, C, are class labels. The data set has the mean range from 0 to 50 and variance range from 0.02 to 0. The test data generation procedure then does the following: () Generate a data file that contains the original (no missing values) data sets generated according to the parameters above. Each data object also contains a label to indicate the original cluster the data object belongs to (e.g, A, B, C,, etc); and (2) Convert the data sets in the original file to data sets that contains 5%, 0%, 5%, 20%, 25%, 30%, 35%, and up to 40%, respectively, of randomly distributed null values as the data sets of missing values. 0
13 3.3 Results We used a k-means clustering algorithm as a means to qualitatively and quantitatively evaluate the above approaches. We first examine the experimental results on Gaussian distributed data sets with relatively larger mean value ranges so that the clusters in the data sets are relatively separated. That is, the distributions of the data sets of different clusters have only little overlapping regions in the data space. Each test case is done with respect to varying missing data rate that ranges from 5% up to 40%. Figure 3. shows the clustering errors for each of the missing data handling method with respect to the percentage of missing values. The horizontal axis denotes the percentage of data values absent in the records, while the vertical axis denotes the clustering error percentage compared with the original labels of the data set. We then examine the experimental results on Gaussian distributed data sets with relatively smaller mean value ranges so that the clusters in the data sets are relatively mixed, that is, there are some considerable amount of regions in the data space where data distributions of different clusters have overlapped. Again the tests are done with respect to varying missing data rate that ranges from 5% up to 40%. Figure 3.2 shows the clustering errors for each of the missing data handling method with respect to the percentage of missing values. The experimental results show that the median, neighbor and mean substitution methods all outperformed the skipping methods. Among the three substitution methods, the nearest neighbor substitution has the best performance. The median substitution and mean substitution has almost the same amount of clustering error. This is understandable because of the closeness of these two values in the data sets of Gaussian distributions. Figure 3.3 shows the experimental results of the missing data handling methods with respect to the percentage of missing values on Gaussian distributed data sets of the same mean positions but different variance values. The notation ug002_0b, ug5_0b, and ug0_0b stand for uniform Gaussian distributions of 0 clusters with variance value ranges of 2, 5 and 0, respectively. The missing data rate changes from 5% up to 40% for each of the test data set. Again, it shows that the pseudo-nearest neighbor method has the best performance than the other
14 methods because the pseudo-nearest neighbor method captures the essence of pattern similarities in the original data set. 4. Conclusions In this paper, we studied a new method, namely the pseudo- nearest neighbor substitution, for missing data handling in preparation of data sets for data discrimination and mining applications. Performance of the method is compared with other substitution and non-substitution approaches for dealing with data sets containing randomly missing data attribute values. The experimentation results have provided following insights: () there is a tendency of increasing classification error rate along the increase of the cluster number k in the data set for all the missing data handling approaches; (2) there is a tendency of increasing classification error rate along the increase of the Gaussian variance ratio for all the missing data handling approaches; (3) The non-substitution (ignorance by skipping the attribute) approach is an inferior missing data handling approach in dealing with Gaussian randomly distributed data sets, and (4) the pseudo- nearest neighbor approach provides the best results to Gaussian random data sets among the substitution and non-substitution methods evaluated in our experiments. The application of these results to data mining and knowledge discovery could help the selection of missing data handling method during the data preparation step for different data structures and enable a more reliable and efficient decision making under uncertainties and incompleteness of data collections presented. Acknowledgement This work described in this paper was partly supported by AFOSR Grant No: F The authors thank the anonymous reviewers for their valuable and very detailed comments that helped improve the presentation of this paper. References 2
15 [] A. Afifi and R. Elashoff, Missing observations in multivariate statistics I: Review of the literature, Journal of American Statistics Association. 6, pp [2] R. Agrawal, M. Metha, J. Shafer, R. Srikant, A. Arning, T. Bollkinger, The Quest Data Mining System, Proceedings of 996 International Conference on Data Mining and Knowledge Discovery (KDD 96), Portland, Oregon, pp , august 996. [3] H. Hartley and R. Hocking, The analysis of incomplete data, Biometrics 27, pp [4] K. Krishnamoorthy and K. Maruthy, Some simple test procedures for normal mean vector with incomplete data, Annual Institutional Statistics and Mathematics, 50, 3, pp , 998. [5] V. Letfus. Daily relative sunspot numbers : reconstruction of missing observations, Solar Physics, 84, pp. 20-2, 999. [6] R. Little, Models for Nonresponse in Sample Surveys, Journal of American Statistics Association, 77, pp , 982. [7] R. Little and R. Rubin, Statistical analysis with missing data, New York: Wiley, pp , 987. [8] M. Mojirsheibani, A Pseudo-Nearest-Neighbor Combined Classifier, Proceedings of the 3st Symposium on the Interface: Models, Predictions, and Computing, pp [9] T. Orchard, and M. Woodbury, A missing information principle: Theory and applications, Proc. 6 th Berkeley Symposium on Mathematics, Statistics, and Probability,, pp , 972. [0] M. Pawlak, Kernel Classification Rules from Missing Data, IEEE Transactions on Information Theory, Vol. 39, No. 3, May 993. [] D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann Publishers, 999. [2] I. Sande, Hot deck imputation procedures, Incomplete Data in Sample Surveys, Vol. III: Symposium on Incomplete Data, Proceedings, W.G. Madow and I. Olkin, Eds. pp ,996. [3] D. Titterington and J. Sedransk, Imputation of missing values using density estimation, Statist. Probab. Lett., Vol. 8, pp.4-48, 989. [4] Q. Zhu, On the Minimum Probability of error of Classification With Incomplete Patterns, Pattern Recognition, 23(), pp ,
16 Tables and Figures Table 2. Examples of Pseudo-similarity measurement S p (x k, x l ) x k x l S p (x k, x l ) # # # # # # # # # # # # 2 0 # # # # # # # # # # # #.4 0 # # # # # 0 # # # # #.5 0 # # # # # # # # # # # # # # 0 # # # # # # # # # # # # # # # 0 # # # # # # # # # 4.47 Table 3.. An example of Gaussain Means and Gaussain Variances of a data set Dimension µ σ 2 µ σ 2 µ σ 2 µ σ 2 µ σ 2 µ σ 2 µ σ 2 µ σ 2 µ σ 2 µ σ Cluster A Cluster B Cluster C Cluster D Cluster E Cluster F Cluster G Cluster H Cluster I Cluster J
17 clustering error rate kmedian kneighbor 20 kgmean 5 kskipping missing data rate clustering error rate missing data rate kmedian kneighbor kgmean kskipping Figure 3.. Experimental results showing clustering errors versus the percentage of missing values for the missing data handling methods on a data set with 0 clusters of fixed mean values but different variances: (a) variance range of 5, (b) variance range of 0. 5
18 classification error rate a mmedian mneighbor mgmean mskipping missing data rate classification error rate b mmedian mneighbor mgmean mskipping missing data rate Figure 3.2. Experimental results showing clustering errors versus the percentage of missing values for the missing data handling methods on a data set with different cluster numbers and varying mean and variance values. (a) 5 clusters with variance range of 5, (b) 0 clusters with variance range of 0. 6
19 Kneighbor method clustering error missing rate ug002_0b ug5_0b ug0_0b kgmean method clustering error missing rate ug002_0b ug5_0b ug0_0b kskipping method clustering error missing rate ug002_0b ug5_0b ug0_0b Figure 3.3. Experimental results showing clustering errors versus the percentage of missing values for the missing data handling methods on a data set with fixed cluster numbers and mean values but varying variance value ranges. 7
A pseudo-nearest-neighbor approach for missing data recovery on Gaussian random data sets
Pattern Recognition Letters 23(2002) 1613 1622 www.elsevier.com/locate/patrec A pseudo-nearest-neighbor approach for missing data recovery on Gaussian random data sets Xiaolu Huang, Qiuming Zhu * Digital
Problem of Missing Data
VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians;
Missing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random
[Leeuw, Edith D. de, and Joop Hox. (2008). Missing Data. Encyclopedia of Survey Research Methods. Retrieved from http://sage-ereference.com/survey/article_n298.html] Missing Data An important indicator
Predict the Popularity of YouTube Videos Using Early View Data
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
Data, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
An Analysis of Four Missing Data Treatment Methods for Supervised Learning
An Analysis of Four Missing Data Treatment Methods for Supervised Learning Gustavo E. A. P. A. Batista and Maria Carolina Monard University of São Paulo - USP Institute of Mathematics and Computer Science
How To Use Neural Networks In Data Mining
International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
A Review of Missing Data Treatment Methods
A Review of Missing Data Treatment Methods Liu Peng, Lei Lei Department of Information Systems, Shanghai University of Finance and Economics, Shanghai, 200433, P.R. China ABSTRACT Missing data is a common
The treatment of missing values and its effect in the classifier accuracy
The treatment of missing values and its effect in the classifier accuracy Edgar Acuña 1 and Caroline Rodriguez 2 1 Department of Mathematics, University of Puerto Rico at Mayaguez, Mayaguez, PR 00680 [email protected]
Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
6.4 Normal Distribution
Contents 6.4 Normal Distribution....................... 381 6.4.1 Characteristics of the Normal Distribution....... 381 6.4.2 The Standardized Normal Distribution......... 385 6.4.3 Meaning of Areas under
Data Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms
A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms Ian Davidson 1st author's affiliation 1st line of address 2nd line of address Telephone number, incl country code 1st
Chapter 4. Probability and Probability Distributions
Chapter 4. robability and robability Distributions Importance of Knowing robability To know whether a sample is not identical to the population from which it was selected, it is necessary to assess the
Support Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France [email protected] Massimiliano
Basics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu [email protected] Modern machine learning is rooted in statistics. You will find many familiar
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Christfried Webers. Canberra February June 2015
c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic
Introduction to General and Generalized Linear Models
Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby
Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy
Astronomical Data Analysis Software and Systems XIV ASP Conference Series, Vol. XXX, 2005 P. L. Shopbell, M. C. Britton, and R. Ebert, eds. P2.1.25 Making the Most of Missing Values: Object Clustering
1 Prior Probability and Posterior Probability
Math 541: Statistical Theory II Bayesian Approach to Parameter Estimation Lecturer: Songfeng Zheng 1 Prior Probability and Posterior Probability Consider now a problem of statistical inference in which
Supervised and unsupervised learning - 1
Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in
Customer Classification And Prediction Based On Data Mining Technique
Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor
Algebra 1 2008. Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard
Academic Content Standards Grade Eight and Grade Nine Ohio Algebra 1 2008 Grade Eight STANDARDS Number, Number Sense and Operations Standard Number and Number Systems 1. Use scientific notation to express
A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution
A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September
DATA MINING TECHNIQUES AND APPLICATIONS
DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,
MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS)
MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS) R.KAVITHA KUMAR Department of Computer Science and Engineering Pondicherry Engineering College, Pudhucherry, India DR. R.M.CHADRASEKAR Professor,
Missing data in randomized controlled trials (RCTs) can
EVALUATION TECHNICAL ASSISTANCE BRIEF for OAH & ACYF Teenage Pregnancy Prevention Grantees May 2013 Brief 3 Coping with Missing Data in Randomized Controlled Trials Missing data in randomized controlled
LOGISTIC REGRESSION ANALYSIS
LOGISTIC REGRESSION ANALYSIS C. Mitchell Dayton Department of Measurement, Statistics & Evaluation Room 1230D Benjamin Building University of Maryland September 1992 1. Introduction and Model Logistic
Context-aware taxi demand hotspots prediction
Int. J. Business Intelligence and Data Mining, Vol. 5, No. 1, 2010 3 Context-aware taxi demand hotspots prediction Han-wen Chang, Yu-chin Tai and Jane Yung-jen Hsu* Department of Computer Science and Information
Up/Down Analysis of Stock Index by Using Bayesian Network
Engineering Management Research; Vol. 1, No. 2; 2012 ISSN 1927-7318 E-ISSN 1927-7326 Published by Canadian Center of Science and Education Up/Down Analysis of Stock Index by Using Bayesian Network Yi Zuo
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic
Using Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht [email protected] 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht [email protected] 539 Sennott
Statistical Models in Data Mining
Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION
ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical
An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset
P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang
A Basic Introduction to Missing Data
John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item
Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out
Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out Sandra Taylor, Ph.D. IDDRC BBRD Core 23 April 2014 Objectives Baseline Adjustment Introduce approaches Guidance
Permutation Tests for Comparing Two Populations
Permutation Tests for Comparing Two Populations Ferry Butar Butar, Ph.D. Jae-Wan Park Abstract Permutation tests for comparing two populations could be widely used in practice because of flexibility of
Measuring Line Edge Roughness: Fluctuations in Uncertainty
Tutor6.doc: Version 5/6/08 T h e L i t h o g r a p h y E x p e r t (August 008) Measuring Line Edge Roughness: Fluctuations in Uncertainty Line edge roughness () is the deviation of a feature edge (as
Quality Assessment in Spatial Clustering of Data Mining
Quality Assessment in Spatial Clustering of Data Mining Azimi, A. and M.R. Delavar Centre of Excellence in Geomatics Engineering and Disaster Management, Dept. of Surveying and Geomatics Engineering, Engineering
A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic
A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic Report prepared for Brandon Slama Department of Health Management and Informatics University of Missouri, Columbia
Big Data & Scripting Part II Streaming Algorithms
Big Data & Scripting Part II Streaming Algorithms 1, 2, a note on sampling and filtering sampling: (randomly) choose a representative subset filtering: given some criterion (e.g. membership in a set),
Predict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons
Lecture 9: Introduction to Pattern Analysis
Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns
3/17/2009. Knowledge Management BIKM eclassifier Integrated BIKM Tools
Paper by W. F. Cody J. T. Kreulen V. Krishna W. S. Spangler Presentation by Dylan Chi Discussion by Debojit Dhar THE INTEGRATION OF BUSINESS INTELLIGENCE AND KNOWLEDGE MANAGEMENT BUSINESS INTELLIGENCE
A logistic approximation to the cumulative normal distribution
A logistic approximation to the cumulative normal distribution Shannon R. Bowling 1 ; Mohammad T. Khasawneh 2 ; Sittichai Kaewkuekool 3 ; Byung Rae Cho 4 1 Old Dominion University (USA); 2 State University
II. DISTRIBUTIONS distribution normal distribution. standard scores
Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,
Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza
Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and
Chapter 1 Introduction. 1.1 Introduction
Chapter 1 Introduction 1.1 Introduction 1 1.2 What Is a Monte Carlo Study? 2 1.2.1 Simulating the Rolling of Two Dice 2 1.3 Why Is Monte Carlo Simulation Often Necessary? 4 1.4 What Are Some Typical Situations
Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: [email protected] Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
International Journal of Advanced Computer Technology (IJACT) ISSN:2319-7900 PRIVACY PRESERVING DATA MINING IN HEALTH CARE APPLICATIONS
PRIVACY PRESERVING DATA MINING IN HEALTH CARE APPLICATIONS First A. Dr. D. Aruna Kumari, Ph.d, ; Second B. Ch.Mounika, Student, Department Of ECM, K L University, [email protected]; Third C.
Least-Squares Intersection of Lines
Least-Squares Intersection of Lines Johannes Traa - UIUC 2013 This write-up derives the least-squares solution for the intersection of lines. In the general case, a set of lines will not intersect at a
REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION
REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION Pilar Rey del Castillo May 2013 Introduction The exploitation of the vast amount of data originated from ICT tools and referring to a big variety
A Survey on Pre-processing and Post-processing Techniques in Data Mining
, pp. 99-128 http://dx.doi.org/10.14257/ijdta.2014.7.4.09 A Survey on Pre-processing and Post-processing Techniques in Data Mining Divya Tomar and Sonali Agarwal Indian Institute of Information Technology,
Tutorial for proteome data analysis using the Perseus software platform
Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du [email protected] University of British Columbia
Chapter 8 Hypothesis Testing Chapter 8 Hypothesis Testing 8-1 Overview 8-2 Basics of Hypothesis Testing
Chapter 8 Hypothesis Testing 1 Chapter 8 Hypothesis Testing 8-1 Overview 8-2 Basics of Hypothesis Testing 8-3 Testing a Claim About a Proportion 8-5 Testing a Claim About a Mean: s Not Known 8-6 Testing
Data Cleaning and Missing Data Analysis
Data Cleaning and Missing Data Analysis Dan Merson [email protected] India McHale [email protected] April 13, 2010 Overview Introduction to SACS What do we mean by Data Cleaning and why do we do it? The SACS
Statistics for BIG data
Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before
Environmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance
UNDERSTANDING THE TWO-WAY ANOVA
UNDERSTANDING THE e have seen how the one-way ANOVA can be used to compare two or more sample means in studies involving a single independent variable. This can be extended to two independent variables
Classification Problems
Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems
Categorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
How To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
Prediction of Heart Disease Using Naïve Bayes Algorithm
Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,
Financial Trading System using Combination of Textual and Numerical Data
Financial Trading System using Combination of Textual and Numerical Data Shital N. Dange Computer Science Department, Walchand Institute of Rajesh V. Argiddi Assistant Prof. Computer Science Department,
Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University
Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University 1 Outline Missing data definitions Longitudinal data specific issues Methods Simple methods Multiple
Prediction of Stock Performance Using Analytical Techniques
136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University
Simple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
Principles of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary
Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
ETL PROCESS IN DATA WAREHOUSE
ETL PROCESS IN DATA WAREHOUSE OUTLINE ETL : Extraction, Transformation, Loading Capture/Extract Scrub or data cleansing Transform Load and Index ETL OVERVIEW Extraction Transformation Loading ETL ETL is
Robust Outlier Detection Technique in Data Mining: A Univariate Approach
Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,
SPATIAL DATA CLASSIFICATION AND DATA MINING
, pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal
CHI-SQUARE: TESTING FOR GOODNESS OF FIT
CHI-SQUARE: TESTING FOR GOODNESS OF FIT In the previous chapter we discussed procedures for fitting a hypothesized function to a set of experimental data points. Such procedures involve minimizing a quantity
Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement
Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement Toshio Sugihara Abstract In this study, an adaptive
How To Identify A Churner
2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management
Handling attrition and non-response in longitudinal data
Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein
Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis
Int. Journal of Math. Analysis, Vol. 5, 2011, no. 1, 1-13 Review of the Methods for Handling Missing Data in Longitudinal Data Analysis Michikazu Nakai and Weiming Ke Department of Mathematics and Statistics
Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees
Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.
Protein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier
Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,
CLASSIFYING SERVICES USING A BINARY VECTOR CLUSTERING ALGORITHM: PRELIMINARY RESULTS
CLASSIFYING SERVICES USING A BINARY VECTOR CLUSTERING ALGORITHM: PRELIMINARY RESULTS Venkat Venkateswaran Department of Engineering and Science Rensselaer Polytechnic Institute 275 Windsor Street Hartford,
Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.
Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.38457 Accuracy Rate of Predictive Models in Credit Screening Anirut Suebsing
OUTLIER ANALYSIS. Data Mining 1
OUTLIER ANALYSIS Data Mining 1 What Are Outliers? Outlier: A data object that deviates significantly from the normal objects as if it were generated by a different mechanism Ex.: Unusual credit card purchase,
Missing data: the hidden problem
white paper Missing data: the hidden problem Draw more valid conclusions with SPSS Missing Data Analysis white paper Missing data: the hidden problem 2 Just about everyone doing analysis has some missing
Statistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
