Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza


1 Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

2 The problem
In official statistics we often have large data sets with many variables and many missing values. We cannot simply delete incomplete records, because this amounts to a substantial loss of costly collected data.
In some cases the data are missing completely at random (MCAR), i.e. the presence of missing values is unrelated to the values of the variables. In a genuine MCAR situation almost all methods for handling missing data work well.

3 Missing data: Selective loss
A more realistic hypothesis is that the data are missing at random (MAR): the probability that an observation is missing may depend on observed data, but not on missing values or on unobserved variables.
When we cannot assume MAR, the mechanism is missing not at random (MNAR).
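The three mechanisms can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings: the function name `make_missing`, the deletion rate and the toy variables `x` (always observed) and `y` (subject to loss) are hypothetical and do not come from the slides. It compares the mean of the surviving `y` values under each mechanism.

```python
import random

def make_missing(x, y, mechanism, rate=0.3, seed=0):
    """Return a copy of y with None where a value has been deleted."""
    rng = random.Random(seed)
    x_med = sorted(x)[len(x) // 2]
    y_med = sorted(y)[len(y) // 2]
    out = []
    for xi, yi in zip(x, y):
        if mechanism == "MCAR":      # deletion ignores all values
            p = rate
        elif mechanism == "MAR":     # deletion depends on the observed x only
            p = 2 * rate if xi > x_med else 0.0
        elif mechanism == "MNAR":    # deletion depends on the value being deleted
            p = 2 * rate if yi > y_med else 0.0
        else:
            raise ValueError(mechanism)
        out.append(None if rng.random() < p else yi)
    return out

def observed_mean(values):
    obs = [v for v in values if v is not None]
    return sum(obs) / len(obs)

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(5000)]
y = [xi + rng.gauss(0, 1) for xi in x]   # y is correlated with x

means = {m: observed_mean(make_missing(x, y, m)) for m in ("MCAR", "MAR", "MNAR")}
```

In this toy setting the complete-case mean stays close to the full-data mean only under MCAR. Under MAR the bias can in principle be corrected using `x`. Under MNAR it cannot be corrected from the observed data alone.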

4 Single and Multiple Imputation
Single imputation: we create one «completed data set». This is the obvious choice if you have to distribute the data set.
Multiple imputation: we create M «completed data sets».
If we want to estimate a model, a single imputation cannot reflect the uncertainty in the predictions of the unknown missing values, and consequently the variances of the parameter estimates will be biased downward.

5 Recommendations from Eurostat
In relation to imputation:
1. The procedure applied to the data should preserve the variation of, and the correlation between, variables. Methods that incorporate error components into the imputed values shall be preferable to those that simply impute a predicted value.
2. Methods which take into account the correlation structure (or other characteristics of the joint distribution of the variables) shall be preferable to the marginal or univariate approach.
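The first recommendation can be checked numerically. The following is a minimal sketch on assumed toy data, with hypothetical helper names `fit_line` and `regression_impute`. It imputes a variable by simple linear regression, once with the bare predicted value and once with a random residual added, and compares the resulting standard deviations.

```python
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept, slope and residual standard deviation."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    sigma = (sum(r * r for r in resid) / (len(x) - 2)) ** 0.5
    return intercept, slope, sigma

def regression_impute(x, y, add_error, seed=0):
    """Fill None entries of y from a regression on x, optionally adding noise."""
    rng = random.Random(seed)
    obs = [(a, b) for a, b in zip(x, y) if b is not None]
    b0, b1, sigma = fit_line([a for a, _ in obs], [b for _, b in obs])
    out = []
    for a, b in zip(x, y):
        if b is None:
            b = b0 + b1 * a + (rng.gauss(0, sigma) if add_error else 0.0)
        out.append(b)
    return out

rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(4000)]
y_true = [0.5 * a + rng.gauss(0, 1) for a in x]
y_miss = [None if rng.random() < 0.4 else b for b in y_true]   # MCAR deletion

sd_true = statistics.stdev(y_true)
sd_det = statistics.stdev(regression_impute(x, y_miss, add_error=False))
sd_sto = statistics.stdev(regression_impute(x, y_miss, add_error=True))
```

In this setting the deterministic version shrinks the standard deviation noticeably, while the version with an error component stays close to the true value.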

6 Properties of a good imputation method
A good imputation method should:
be general enough to handle general non-monotone patterns of missing data and mixed variable types;
preserve associations between variables having missing values;
preserve marginal distributions (means, variances and shape).
Moreover, the inference procedures on these data should take account of the uncertainty due to imputation.
[Figure: monotone vs. non-monotone missing-data patterns]

7 Multiple Imputation (Rubin 1987)
This method yields valid statistical inferences for parameters, tests and confidence intervals that properly reflect the uncertainty due to missing values.
Multiple imputation inference involves three distinct phases:
1. The missing data are filled in m times to generate m complete data sets.
2. The m complete data sets are analyzed by using standard procedures.
3. The results from the m complete data sets are combined for the inference, using rules that combine within-imputation and between-imputation variability.
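The combining step can be sketched with the standard pooling rules for a scalar parameter: the pooled estimate is the average of the m estimates, and the total variance is T = W + (1 + 1/m) B, where W is the average within-imputation variance and B the between-imputation variance. The function name `pool_rubin` and the numbers are made up for illustration.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances from m completed data sets."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)     # pooled point estimate
    w = statistics.fmean(variances)         # within-imputation variance
    b = statistics.variance(estimates)      # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b                 # total variance
    return q_bar, w, b, t

# Illustrative values only: three completed data sets, same within-variance.
q_bar, w, b, t = pool_rubin([10.0, 11.0, 12.0], [4.0, 4.0, 4.0])
```

The term (1 + 1/m) B is what a single imputation cannot supply: with one completed data set there is no between-imputation variance to estimate.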

8 Problems with Multiple Imputation
Absence of a single complete data matrix, which is convenient to have in many cases.
Difficulties in analysing a large number of variables.
Difficulties in analysing mixed measurement level data.
Multinormality may be unrealistic.
MI cannot consider constraints on the imputations.
MI cannot consider bounds or complex survey designs.

9 IVEware and MICE
Sequential Regression for Multiple Imputations (Raghunathan et al. 2001) is implemented by the IVEware software. A similar approach is used by MICE, Multivariate Imputation by Chained Equations (Van Buuren et al., 2000).
Both require specifying a conditional distribution for the missing data in each incomplete variable, under the assumption that a corresponding multivariate distribution exists. The procedure iterates over all conditionally specified imputation models.
Advantages with respect to MI: the univariate problems are simpler than multivariate ones, and it is possible to consider mixed measurement variables, bounds, constraints between variables and interactions.
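The iteration over conditional models can be sketched for two continuous variables. This is a simplified illustration, not the IVEware or MICE implementation. Each conditional model is an ordinary linear regression with a random residual, regression parameters are not redrawn (a proper multiple imputation would do that), and the names `chained_equations`, `fit_line` and `corr` are hypothetical.

```python
import random
import statistics

def fit_line(x, y):
    """Least-squares intercept, slope and residual standard deviation."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((p - mx) ** 2 for p in x)
    slope = sum((p - mx) * (q - my) for p, q in zip(x, y)) / sxx
    intercept = my - slope * mx
    sse = sum((q - intercept - slope * p) ** 2 for p, q in zip(x, y))
    return intercept, slope, (sse / (len(x) - 2)) ** 0.5

def chained_equations(a, b, n_iter=20, seed=0):
    """Impute None entries of a and b by cycling over a|b and b|a."""
    rng = random.Random(seed)
    miss_a = [v is None for v in a]
    miss_b = [v is None for v in b]
    obs_a = [v for v in a if v is not None]
    obs_b = [v for v in b if v is not None]
    # Crude starting fill: random draws from the observed values.
    a = [rng.choice(obs_a) if m else v for v, m in zip(a, miss_a)]
    b = [rng.choice(obs_b) if m else v for v, m in zip(b, miss_b)]
    for _ in range(n_iter):
        for target, t_miss, pred in ((a, miss_a, b), (b, miss_b, a)):
            # Fit the conditional model on cases where the target is observed.
            xs = [p for p, m in zip(pred, t_miss) if not m]
            ys = [t for t, m in zip(target, t_miss) if not m]
            c0, c1, sigma = fit_line(xs, ys)
            for i, m in enumerate(t_miss):
                if m:
                    target[i] = c0 + c1 * pred[i] + rng.gauss(0, sigma)
    return a, b

def corr(u, v):
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    suv = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    suu = sum((p - mu) ** 2 for p in u)
    svv = sum((q - mv) ** 2 for q in v)
    return suv / (suu * svv) ** 0.5

rng = random.Random(5)
a_full = [rng.gauss(0, 1) for _ in range(2000)]
b_full = [0.8 * v + rng.gauss(0, 0.6) for v in a_full]   # true correlation 0.8
a_miss = [None if rng.random() < 0.3 else v for v in a_full]
b_miss = [None if rng.random() < 0.3 else v for v in b_full]
a_imp, b_imp = chained_equations(a_miss, b_miss)
r_imputed = corr(a_imp, b_imp)
```

Because each step is a univariate regression, a different model type (logistic, bounded, with interactions) could be substituted per variable, which is the flexibility the slide refers to.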

10 Hot-deck imputation
Hot-deck is an imputation method where a pool of donors is defined for each recipient, and a donor is drawn from the pool at random. The pool is defined so that it contains subjects who are similar to the recipient.
Benefits of hot-deck imputation:
1) imputations tend to be realistic, since they are based on values observed elsewhere;
2) imputations will not be outside the range of possible values;
3) it is not necessary to define an explicit model for the distribution of the missing values;
4) it can handle mixed measurement level variables.
Because of the simplicity of the hot-deck approach and these desirable properties, it is a popular method of imputation, especially in large sample survey settings where there is a large pool of donors.
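A random hot deck within imputation classes can be sketched in a few lines. The example assumes that "similar" means "same values on a set of classification variables". The function name `hot_deck` and the toy records (region, sex, income) are invented for illustration.

```python
import random
from collections import defaultdict

def hot_deck(records, target, class_vars, seed=0):
    """Fill missing `target` values with a random donor from the same class."""
    rng = random.Random(seed)
    pools = defaultdict(list)
    for r in records:
        if r[target] is not None:
            pools[tuple(r[v] for v in class_vars)].append(r[target])
    completed = []
    for r in records:
        r = dict(r)                      # do not modify the input records
        if r[target] is None:
            pool = pools.get(tuple(r[v] for v in class_vars))
            if pool:                     # value stays missing if the class has no donor
                r[target] = rng.choice(pool)
        completed.append(r)
    return completed

records = [
    {"region": "N", "sex": "F", "income": 30},
    {"region": "N", "sex": "F", "income": 34},
    {"region": "N", "sex": "F", "income": None},
    {"region": "S", "sex": "M", "income": 20},
    {"region": "S", "sex": "M", "income": None},
]
completed = hot_deck(records, "income", ["region", "sex"])
```

Every imputed value is a value actually observed in the same class, which is why hot-deck imputations cannot fall outside the range of possible values.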

11 Hot-deck imputation: Weakness
Definition of a distance/dissimilarity between the units: the definition is very difficult with many variables (curse of dimensionality) and mixed measurement levels.
Relationships among the variables: to maintain multivariate relationships, the donor should supply the recipient with all of its missing variables. Some relationships can be distorted.

12 Predictive mean matching (PMM)
PMM is a hot-deck imputation method that tries to overcome the difficulty of defining a distance measure.
The observed values Y_obs are regressed on the set of observed variables, say X. Predicted values are calculated for all Y. Finally, the Y_mis values are imputed using Y_obs values whose predicted values are similar.
The bootstrap and the approximate Bayesian bootstrap are methods for incorporating parameter uncertainty into hot-deck imputation models.
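The matching step can be sketched as follows, assuming a single predictor, a simple linear regression, and a donor pool made of the k observed cases with the closest predicted values. The function name `pmm_impute`, the choice k = 5 and the toy data are assumptions made for illustration, and parameter uncertainty (bootstrap) is left out.

```python
import random
import statistics

def pmm_impute(x, y, k=5, seed=0):
    """Impute None entries of y by predictive mean matching on x."""
    rng = random.Random(seed)
    obs = [(a, b) for a, b in zip(x, y) if b is not None]
    mx = statistics.fmean([a for a, _ in obs])
    my = statistics.fmean([b for _, b in obs])
    slope = (sum((a - mx) * (b - my) for a, b in obs)
             / sum((a - mx) ** 2 for a, _ in obs))
    intercept = my - slope * mx

    def predict(a):
        return intercept + slope * a

    # Each donor is (predicted value, observed value).
    donors = [(predict(a), b) for a, b in obs]
    completed = []
    for a, b in zip(x, y):
        if b is None:
            target = predict(a)
            nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
            b = rng.choice(nearest)[1]   # impute the donor's observed value
        completed.append(b)
    return completed

rng = random.Random(7)
x = [rng.uniform(0, 10) for _ in range(500)]
y_full = [2 * a + rng.gauss(0, 1) for a in x]
y = [None if rng.random() < 0.3 else b for b in y_full]
completed = pmm_impute(x, y)
```

The regression is used only to measure closeness between cases. The imputed values themselves are always observed values of Y.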

13 MIDAS (Siddique & Belin 2008)
MIDAS (multiple imputation using distance-aided selection of donors) implements an iterative predictive mean matching hot-deck for imputing missing data. It can handle continuous and categorical data.

14 Imputation by Decision Trees
Decision trees split the sample into more homogeneous subsamples, and the variables analysed can be categorical or quantitative.
We don't need to define a distance: the units in a leaf can be assumed to be very close, especially if we don't prune the tree. Moreover, the leaves express the relationships between the target and the predictors.
We can use this property to apply a hot-deck approach without its weak points.
Many authors have proposed prediction-based imputation by decision trees. Di Ciaccio (2008) and Burgette & Reiter (2010) proposed multiple imputation via sequential regression trees.

15 Algorithm MultiTree for single/multiple imputation
1. For all variables: initialize missing data by random hot deck
2. Iterate (j = 1 to number of variables)
3. Set variable j as the target variable
4. Select the cases which do not have a missing value for variable j
5. Estimate a big decision tree without pruning
6. For each missing value determine the corresponding leaf
7. Estimate the missing values of variable j by random hot deck in the leaves
8. Update the missing values of variable j. Go to the next variable (step 3)
9. If d < 0.001 or iterations > T then STOP, else go to step 2
To introduce multiple imputation, as step 0, we can select several bootstrap samples and carry out the analysis for each sample.
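The steps above can be sketched in plain Python under simplifying assumptions. This is not the author's implementation. It handles numeric variables only, grows a regression tree by variance reduction with a minimum leaf size, and runs a fixed number of passes instead of the convergence criterion d, which the slide does not define. The names `grow`, `find_leaf`, `multitree_impute` and the toy data are hypothetical.

```python
import random

def grow(rows, targets, preds, min_leaf):
    """Grow an unpruned regression tree; each leaf keeps its donors' target values."""
    n = len(rows)
    best = None                                   # (sse, predictor, threshold)
    for p in preds:
        order = sorted(range(n), key=lambda i: rows[i][p])
        ys = [targets[i] for i in order]
        total, total_sq = sum(ys), sum(v * v for v in ys)
        s = sq = 0.0
        for cut in range(1, n):                   # left child = first `cut` cases
            s += ys[cut - 1]
            sq += ys[cut - 1] ** 2
            lo, hi = rows[order[cut - 1]][p], rows[order[cut]][p]
            if cut < min_leaf or n - cut < min_leaf or lo == hi:
                continue
            sse = (sq - s * s / cut) + (total_sq - sq - (total - s) ** 2 / (n - cut))
            if best is None or sse < best[0]:
                best = (sse, p, lo)
    if best is None:
        return ("leaf", targets)
    _, p, thr = best
    left = [i for i in range(n) if rows[i][p] <= thr]
    right = [i for i in range(n) if rows[i][p] > thr]
    return ("node", p, thr,
            grow([rows[i] for i in left], [targets[i] for i in left], preds, min_leaf),
            grow([rows[i] for i in right], [targets[i] for i in right], preds, min_leaf))

def find_leaf(tree, row):
    """Return the donor values stored in the leaf that `row` falls into."""
    while tree[0] == "node":
        _, p, thr, left, right = tree
        tree = left if row[p] <= thr else right
    return tree[1]

def multitree_impute(data, n_iter=3, min_leaf=5, seed=0):
    rng = random.Random(seed)
    n_vars = len(data[0])
    missing = [[v is None for v in row] for row in data]
    filled = [list(row) for row in data]
    for j in range(n_vars):                       # step 1: random hot deck start
        observed = [row[j] for row in data if row[j] is not None]
        for i, row in enumerate(filled):
            if missing[i][j]:
                row[j] = rng.choice(observed)
    for _ in range(n_iter):                       # fixed number of passes (assumption)
        for j in range(n_vars):                   # steps 2-3: each variable as target
            preds = [p for p in range(n_vars) if p != j]
            train = [i for i in range(len(filled)) if not missing[i][j]]   # step 4
            tree = grow([filled[i] for i in train],
                        [filled[i][j] for i in train], preds, min_leaf)    # step 5
            for i, row in enumerate(filled):
                if missing[i][j]:                 # steps 6-8: hot deck inside the leaf
                    row[j] = rng.choice(find_leaf(tree, row))
    return filled

rng = random.Random(3)
data = []
for _ in range(300):
    x0 = rng.uniform(0, 10)
    x1 = 2 * x0 + rng.gauss(0, 1)
    x2 = x0 - x1 + rng.gauss(0, 1)
    data.append([None if rng.random() < 0.2 else v for v in (x0, x1, x2)])
completed = multitree_impute(data)
```

Because donors are drawn inside a leaf, no explicit distance between units is needed, and every imputed value is an observed value of the same variable.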

16 Simulation (single imputation)
5000 units, 8 variables: X1-X6 quantitative; A, B categorical.
A\B b1 b2 b3 b4 tot a (300) 1000 (300) (400) 3000 a tot
(cluster 1) A=a1, B=b1 or b2: X1-X3 generated by Uniform(50;100), X4-X6 linear combinations of X1-X3.
(cluster 2) (A=a2, B=b3 or b4) or (A=a1, B=b4): X1-X6 generated by a multinormal with covariance matrix given by $E(X_i X_j) = \sigma^2 \rho^{|i-j|}$ with $\rho = 0.5$ and $\sigma = 30$.
We inserted missing values randomly in the variables X1-X3 and B. The table shows the number of missing values for each combination of the categorical variables (in red). Moreover, 1000 observations of B were set to missing (MNAR).
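The covariance structure of cluster 2 can be written out explicitly. The sketch below assumes that the formula on this slide reads sigma^2 * rho^|i-j| with rho = 0.5 and sigma = 30. The function name `ar1_covariance` is hypothetical.

```python
def ar1_covariance(n_vars, sigma, rho):
    """Covariance matrix with entries sigma^2 * rho^|i-j|."""
    return [[sigma ** 2 * rho ** abs(i - j) for j in range(n_vars)]
            for i in range(n_vars)]

# Six quantitative variables X1-X6, under the assumed reading of the slide.
cov = ar1_covariance(6, sigma=30.0, rho=0.5)
```

Under this reading adjacent variables have correlation 0.5, and the correlation halves with each further step between indices.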

17 Results for the categorical variable B
[Figures: distribution of the categorical variable B; prediction of B]

18 Results for the variable X1
[Table: mean and standard deviation of X1 by combination of A and B: true values, values imputed by MultiTree, values imputed by IVEware]
[Figure: distribution of X1 given A=a1 & B=b1]

19 Comparison of Results: correlations
[Tables: correlations among X1-X6 given A=a1 & B=b1: true correlations, IVEware correlations, MultiTree correlations]

28 0.17 0.40 0.40 0.07 X2 1 0.32 0.47 0.24 0.43 X3 1 0.36 0.55 0.49 X4 1 0.55 0.41 X5 1 0.
