Imputation Strategies and their Evaluation

Transcription

1 Imputation Strategies and their Evaluation
Seppo Laaksonen, Statistics Finland and University of Tampere
Presentation for the Intermediate Chintex Workshop, 30 November 2001, Statistics Finland, Helsinki
- General Points on Surveys
- What is Imputation?
- Why Imputation?
- Use of Imputed Data
- Auxiliary Data Service
- Imputation Process and Imputation Methods (Pre-Imputation vs. Final Imputation)
- Imputation Software
- Future and Conclusions
All comments, criticism and proposals are welcome. Seppo.Laaksonen@ Stat.Fi

2 [Figure; only the label "m missing" is recoverable]

3 Tasks for Providing Survey Data for Users
- Users' Needs
- Survey Design
- Sampling Design
- Data Collection
- Editing and Imputation
- Initial Weighting (Design Weights, Basic Weights)
- Re-Weighting (Post-stratification, Response Propensity Modelling, G-Weighting (Ratio, Regression), Outlier Weighting, Calibration (aggregate level))
- Output Data: Aggregated Macro Data, and Micro Data for Special Users (data are flagged if imputed)
- Dissemination

4 What is imputation?
Replacing a missing, incomplete or strange (outlier etc.) value with a more or less artificial value
- If only one replacement: Single Imputation
- If several replacements (either more units for one data set or several completed data sets): Multiple Imputation

5 Scheme of a Typical Statistical Micro File (Seppo Laaksonen, 2001)
[Figure: layout of a micro file; recoverable labels follow]
Columns:
- Statistical Units
- Identifiers: cross-sectional, longitudinal, protected
- X-Variables: for sample selection; other auxiliary variables
- Y-Variables (Outcome Variables), several types such as: based on various scalings; flag variables (initial, imputed, ...); variously confidential
- Sampling and Other Weights (*Basic & *Comparison GRP Adj. Calibrated Purposes)
Rows (units): -N(D), -n(d), 1, i, r, n, N(real), N(true)
Row groups: Frame Overcoverage; Sample Overcoverage; Item Nonresponse; Sub-sample of unit nonrespondents (Option: Short Questionnaire); Unit Nonresponse; Excluded from the Sample Survey; Undercoverage
Symbols: r = number of respondents; N(D) and n(d) = numbers of overcoverage units in the frame and in the sample (the latter may need to be estimated); n = initial sample size excluding overcoverage; N's = population sizes (true = target population; real = frame population)

6 Why use Imputation?
Unfortunately, I have no definite answer. But in many situations this operation should be considered carefully, such as
- when the item missingness rate is high or significant (key variables)
- when the number of units available for multivariate analysis would be reduced greatly without imputation; see e.g. the page [page number lost], where the reduction is not dramatic at all (but the analysis is in danger if the imputation is not done well)
- partially known values, e.g.
  - if an interval (or a rounded value) within which the correct value lies is known
  - if the value must be chosen from certain categories (some having been excluded, as in the show "Who Wants to Be a Millionaire?")

7
- helping editing procedures (pre-imputation)
- harmonising purposes (Y* = f(x, Y))
- confidentiality purposes
- linking/matching cross-sectional/longitudinal files together: new holes may appear, and these may be best filled by imputing

8 Number of missing values for some variables in the Finnish ECHP 1996
[Table: variable names and counts are not recoverable; the last row is labelled "Any of those"]

9 Use of Imputed Data
Imputation may be done at macro level, but here we discuss micro-level imputations. It is important to note that partially imputed micro data may be used at several levels:
1. The requirements are naturally most demanding when the further use takes place at micro level. This requires that the real (normally unknown) values are reasonably well preserved at this level, or at least that the interrelationships between the variables used in multivariate analysis are reasonably well preserved.
2. Somewhat less demanding is the use of imputed data at distributional level. E.g. my old exercises (Laaksonen 1991) on Finnish income distributions gave very promising results, but a good reweighting procedure may be quite good as well (note that at the margins of the distribution imputation is very often superior to reweighting).
3. Imputed data for tabular use (incl. constructing good time series) is even less demanding. When imputation has been done at micro level, it is very convenient to use such data for any tabulations. Hence it has even been proposed that all missing data could be imputed (incl. unit missingness).

10 But not just any imputation method is recommendable. Imputation should be done optimally for each level of use. The user should know for which purpose the imputed data are best suited and where problems may arise.
We now go on to look at the methods usable for good imputation.
The first basic point is how good the available auxiliary data are. If such data are poor, it may be best to forget micro-level use.

11 A Typology of Auxiliary Variables in Surveys, examples from business surveys
(For each type of auxiliary data: examples with period, and use)
1. Sampling design variables from population level
   Examples: Sizeband (t-1), Industry class (t-1), Region (t-1).
   Use: Designing; design weighting for sampled units
2. Non-updated sampling design variables from population level
   Examples: The same as in type 1; new strata may be formed (poststrata); in ABI: AWEIGHTBAND
   Use: Initial or post-stratified weights for respondents, excl. overcoverage based on sample information
3. Updated sampling design variables from population level
   Examples: The same as the previous but from period t
   Use: Better weights than in the previous; sample and population overcoverage, undercoverage, deaths, births, mergers, splits, re-constructions
4. Other population level data from registers or recent surveys (estimated)
   Examples: Aggregated register turnover, employment (t-1, t); aggregated turnover from RSI (around t)
   Use: Macro editing; macro imputation; G-Weights (for each GWEIGHTBAND) based on ratio estimation or advanced (modelling) methods (Calibration)

12
5. Micro data at sample level (respondents, overcoverage, nonrespondents) from registers, independent surveys and other external sources
   Examples: Categorical: sizeband and industry (t, t-1). Continuous: register turnover (t, t-1), register employment (t, t-1), RSI turnover (around t). These are available early (designing time), but some others perhaps later (estimation time)
   Use: Micro editing: error localization, selective editing. Imputation: modelling and task for crucial variables with missingness. Re-weighting: GREG, response propensity modelling
6. Micro data at respondents level from internal sources (same survey)
   Examples: In addition to group 5: any survey variables from t, e.g. survey turnover, survey employment, survey value added, total output, imputed y value
   Use: Editing incl. selective editing using a best guess (preliminary imputed value, previous value). Imputation: modelling using auxiliary variables either independently for each imputation task or sequentially (imputing first the missing values of one variable, then the next)
7. Micro data as a subsample of non-respondents or respondents
   Examples: In addition to standard variables: key variables of the survey concerned
   Use: Quality checking; re-weighting; imputation

13
8. Micro data from the previous waves of the same repeated survey (panel)
   Examples: Any categorical and continuous variables for the same unit (if the unit has changed, this should be taken into account) from t-1, t-2, ... Note: also changes in weights
   Use: Micro editing; imputation; re-weighting if needed for longitudinal analysis (longitudinal weighting)
9. Super-auxiliary variables for specific small groups at micro level if possible
   Examples: Big and other unique businesses are often so special that reasonable observations for modelling, or donors, cannot be found in the same survey. Hence multi-national data or other super data should be used
   Use: Micro editing: plausibility checking; imputation; outlier weights
10. Hypotheses on the behaviour of variables, based on previous experience from the same survey, international harmonisation purposes, etc.
   Examples: Distributions (normal, lognormal, binomial, Poisson), link functions, conditions (MCAR, MAR, NMAR), sensitivity, bounds, relevant time series
   Use: Models for editing, imputation, weighting, outlier detection

14 NEED for AUXILIARY DATA SERVICE
Although this need is recognised, THIS ACTION IS NOT USUALLY IN FOCUS in NSIs
- auxiliary data are used too much ad hoc or following the traditions of the particular statistics
- easily available data are mostly exploited, but there are problems
  - in using updated data (for period t, or close to it)
  - data from other surveys or registers are not used to a reasonable extent
  - data from previous periods of the same survey could be used better
  - changes in businesses or households could be taken into account better than is done.
FECHP: register data have been exploited, but I am not sure whether in the best way.

15 Key variables should be recognised and used extensively, both for pre-imputation and for final imputation

16 Imputation Process
Step 1. The data editing process precedes the imputation process, but it should be well integrated with the actual imputation process. In any case, the pre-editing process has identified the values that need to be imputed. A new round of editing, post-editing, may be needed later at the estimation stage. Note that pre-imputation, as described earlier, is an essential part of editing, especially if selective or significance editing is to be used.
Step 2. All auxiliary information potentially helpful for imputation must be collected and validated for each imputation task. This job continues during the following tasks if reasonable results have not been achieved with the available variables in their initial forms.

17 Step 3. The imputation model is extremely important in the whole process. Examples:
- good guess
- known function (logical imputation)
- linear regression model with constant term
- linear regression model with noise term
- linear regression model with constant and noise term
- linear regression model with slope (and noise)
- linear regression model with constant and slopes (and noise)
- logistic regression with different alternatives as above (categorical variables)
- multi-level modelling
- generalised linear models
- non-parametric regression models (including estimation of median and other quantiles)
- regression tree, classification tree (WAID software is available but not good for standard business surveys because it requires categorical auxiliary variables)
- multi-dimensional non-parametric surface
- neural nets: self-organising maps (SOM), MLP, AURA (the Euredit project is working with these; results expected in 1-2 years)
- rules from editing

18 [Figure: WAID OLS Tree for DRINKS with 4 Explanatory Variables. Root node: ALL 1498; splits on GENDER, ADULTS, SAUNA1 and MOBILE. The right tree (gender=2) is truncated.]

19 Step 4. Imputation itself
Two basic alternatives:
1. In model-donor imputation, the imputed values are derived directly from a (behavioural) model.
2. In real-donor imputation, the imputed values are derived directly from a set of observed values, from a real donor respondent, but are still derived indirectly from a more or less exactly defined model.

20 Group 1: the imputed value is a predicted value of the model, with a noise term added if necessary.
Group 2: how to choose a donor is the big issue:
- Generalising: the value always comes from the neighbourhood, even from the nearest neighbour, based on rules derived from the model
- Many names are used, such as random hot decking (random draw with or without replacement), sequential hot decking, nearest neighbour, near neighbour
In practice, the best solution may be to use both techniques, one for one part of the data and the other for the rest.
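To make the two groups concrete, here is a minimal sketch under stated assumptions: one auxiliary variable x, a straight-line model, and a purely random choice of donor. The function names (`fit_simple_ols`, `model_donor_impute`, `random_hot_deck`) and the toy data are hypothetical illustrations, not taken from the presentation or from any imputation package.

```python
import random
from statistics import mean


def fit_simple_ols(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def model_donor_impute(x, y):
    """Group 1: replace each missing y (None) by the model prediction a + b*x."""
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b = fit_simple_ols([o[0] for o in observed], [o[1] for o in observed])
    return [a + b * xi if yi is None else yi for xi, yi in zip(x, y)]


def random_hot_deck(y, rng):
    """Group 2: replace each missing y by an observed value drawn with replacement."""
    donors = [yi for yi in y if yi is not None]
    return [rng.choice(donors) if yi is None else yi for yi in y]


# Toy data (assumed): units 3 and 5 are item nonrespondents.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, None, 8.0, None]

model_values = model_donor_impute(x, y)            # predictions from the fitted line
donor_values = random_hot_deck(y, random.Random(0))  # values copied from respondents
```

The model-donor values can fall anywhere on the fitted line, including outside the observed range, whereas the hot-deck values are always values that some respondent actually reported.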

21 Other classifications:
A.
- Deterministic (model without random noise)
- Stochastic
These may be included within the previous classification.
B.
- Single (1 imputed value)
- Multiple (3-8 imputed values). This requires some type of stochastic procedure.
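The two classifications can be illustrated with one small sketch. It assumes a straight-line model with a single auxiliary variable, and the stochastic variant adds a randomly drawn observed residual; all names and data are hypothetical. Repeating the stochastic step gives several completed data sets, but note that this simple repetition is only an illustration and does not by itself amount to a "proper" multiple imputation technique, since it ignores the uncertainty in the fitted parameters.

```python
import random
from statistics import mean


def fit_and_residuals(x, y):
    """Fit y = a + b*x on the respondents; return a, b and the observed residuals."""
    resp = [i for i, yi in enumerate(y) if yi is not None]
    xs, ys = [x[i] for i in resp], [y[i] for i in resp]
    mx, my = mean(xs), mean(ys)
    b = sum((u - mx) * (v - my) for u, v in zip(xs, ys)) / sum((u - mx) ** 2 for u in xs)
    a = my - b * mx
    return a, b, [v - (a + b * u) for u, v in zip(xs, ys)]


def impute(x, y, rng=None):
    """Deterministic if rng is None; otherwise stochastic (prediction + drawn residual)."""
    a, b, residuals = fit_and_residuals(x, y)
    completed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            completed.append(yi)                     # observed values are kept
        else:
            noise = rng.choice(residuals) if rng is not None else 0.0
            completed.append(a + b * xi + noise)     # imputed value
    return completed


# Toy data (assumed): units 4 and 6 are missing.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.5, 1.9, 3.2, None, 5.1, None]

single = impute(x, y)                                # deterministic single imputation
rng = random.Random(2001)
multiple = [impute(x, y, rng) for _ in range(5)]     # 5 stochastically completed data sets
```

The number of completed data sets (5) is chosen from within the 3-8 range mentioned on the slide.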

22 General example: Simple linear regression model
$y(t) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \gamma\, y(t-1) + \varepsilon$
(demographic changes, for example, may be added as dummy variables)
($\varepsilon$ = random noise term, $y$ = survey income, $x_1$ = domain (e.g. social group), $x_2$ = register income, $t$ = survey period)
The estimates of the parameters are denoted $a$, $b_1$, $b_2$, $c$ and $e$. If the model is reduced so that the estimated equation is $y(t) = a$, this is called mean imputation or, if the variable is of ratio type, ratio imputation (the median is also possible). If the model is reduced to $y(t) = \varepsilon$, imputation may be done using observed residuals, or theoretical residuals assumed to follow a certain distribution such as the normal distribution; the imputation may thus be done using either a real-donor or a model-donor technique.

23 If only these theoretical values are used, the technique is a model-donor one, whereas with observed residuals it is a real-donor one. In the latter case, the observed values may be used only once (without replacement) or several times (with replacement). The latter is usually called random hot decking.
If the term $\beta_1 x_1$ is added to the model, methods such as
- cell/domain mean imputation, or
- cell hot decking
may be applied.
The predicted values of the estimated regression model may be used
(i) directly as imputed values, or
(ii) with (theoretical or observed) residuals added, or
(iii) the values in (i) and (ii) may be used to construct a near(est) neighbour technique, in which they serve to find a donor for each missing value (regression-based nearest neighbour hot decking).
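Option (iii) can be sketched as follows. This is a minimal illustration, assuming a single auxiliary variable and distance measured on the predicted values from option (i); it is not the exact procedure of Laaksonen (2000), and the function names and data are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    mx, my = mean(x), mean(y)
    b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sum((u - mx) ** 2 for u in x)
    return my - b * mx, b


def regression_nn_hot_deck(x, y):
    """For each missing y, donate the observed y of the respondent whose
    predicted value is nearest to the predicted value of the missing unit."""
    resp = [i for i, yi in enumerate(y) if yi is not None]
    a, b = fit_line([x[i] for i in resp], [y[i] for i in resp])
    pred = [a + b * xi for xi in x]          # predictions for all units
    completed, donors = list(y), {}
    for i, yi in enumerate(y):
        if yi is None:
            donor = min(resp, key=lambda j: abs(pred[j] - pred[i]))
            completed[i] = y[donor]          # a real, observed value
            donors[i] = donor                # keep track of who donated
    return completed, donors


# Toy data (assumed): the last two units are item nonrespondents.
x = [10.0, 20.0, 30.0, 40.0, 22.0, 37.0]
y = [12.0, 19.0, 33.0, 41.0, None, None]

completed, donors = regression_nn_hot_deck(x, y)
```

The model is used only to define "nearness"; the imputed values themselves are observed values, so this is a real-donor method in the sense used above.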

24 Some other features of careful imputation in the case of a regression model
- It is useful to impute special values (e.g. extreme cases), but not necessarily to use these values in the final data set (except for non-key variables).
- Final imputation is often best done within homogeneous imputation cells.
- Sampling weights should be taken into account in final imputation.
- Sequential imputation is becoming more common: for example, the key variables are imputed first, and their imputed values are then used as explanatory variables when imputing non-key variables.
- Make the results consistent with each other, including with the edit rules.
- Check the completed results against available benchmarking data (aggregate level).
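Two of the points above, imputation cells and sampling weights, can be combined in a short sketch of weighted cell mean imputation, which also flags the imputed values. The record layout (`cell`, `weight`, `y`) and the function name are assumptions made for illustration only.

```python
from collections import defaultdict


def weighted_cell_mean_impute(records):
    """records: list of dicts with keys 'cell', 'weight' and 'y' (None if missing).
    Returns new records with y filled by the weighted respondent mean of the
    same cell, plus an 'imputed' flag."""
    wy_sum = defaultdict(float)   # sum of weight * y per cell (respondents only)
    w_sum = defaultdict(float)    # sum of weights per cell (respondents only)
    for r in records:
        if r["y"] is not None:
            wy_sum[r["cell"]] += r["weight"] * r["y"]
            w_sum[r["cell"]] += r["weight"]

    completed = []
    for r in records:
        if r["y"] is None:
            if w_sum[r["cell"]] == 0:
                raise ValueError(f"no respondents in cell {r['cell']!r}")
            value = wy_sum[r["cell"]] / w_sum[r["cell"]]
            completed.append({**r, "y": value, "imputed": True})
        else:
            completed.append({**r, "imputed": False})
    return completed


# Toy data (assumed): two imputation cells, one missing value in each.
records = [
    {"cell": "A", "weight": 1.0, "y": 10.0},
    {"cell": "A", "weight": 3.0, "y": 20.0},
    {"cell": "A", "weight": 2.0, "y": None},
    {"cell": "B", "weight": 2.0, "y": 100.0},
    {"cell": "B", "weight": 5.0, "y": None},
]
completed = weighted_cell_mean_impute(records)
```

A cell without any respondent raises an error here; in practice such cells would need to be merged with neighbouring cells or handled by a model-donor method.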

25 Which method is best, and how do we know? Excellent question.
Experience helps a lot:
- A good imputation model, or its predictability over the whole distribution, helps whether the final imputation is based on real or model donors.
- If the model is not good, real-donor methods are preferable in my experience, but these will only include observed values (like weighting methods); hence they are not good in areas where there are no donors at all, or not reasonably many. This leads to model-donor techniques if values cannot be found.
- Multiple imputation has some advantages.

26 Step 5. When the imputations have been made, the point estimates and the ordinary sampling variance estimates may be computed. Moreover, it is necessary to go on to the additional variance due to imputation, called imputation variance.
- Analytical formulas may be developed for certain standard situations
- Replication methods are often useful
- Multiple imputation is a method for this purpose; it requires a proper technique.
Step 6. There are several outputs from imputation. Standard estimation results are enough for most users, but many wish to analyse further the micro data file with imputed values. Hence it is essential to tell these users exactly which values are imputed and which are not.
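For the multiple imputation route in Step 5, the standard way to combine results is Rubin's combining rules (Rubin 1987, listed in the references); the slides do not spell them out, so the following is a sketch under that assumption. Each of the m completed data sets gives a point estimate and an ordinary sampling variance; the between-imputation variance then carries the extra variance due to imputation. The function name and the numbers are hypothetical.

```python
from statistics import mean, variance


def combine_multiple_imputations(estimates, variances):
    """Rubin's rules for m completed data sets.
    estimates: point estimate from each completed data set
    variances: ordinary sampling variance estimate from each completed data set
    Returns (combined estimate, within variance, between variance, total variance)."""
    m = len(estimates)
    q_bar = mean(estimates)              # combined point estimate
    within = mean(variances)             # average ordinary sampling variance
    between = variance(estimates)        # variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, within, between, total


# Toy results (assumed) from m = 3 completed data sets.
q_bar, within, between, total = combine_multiple_imputations(
    estimates=[10.0, 12.0, 11.0],
    variances=[1.0, 1.2, 0.8],
)
```

The total variance exceeds the average ordinary sampling variance whenever the estimates differ across the completed data sets, which is the reduction in accuracy caused by imputation.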

27 Software for imputation
* No extremely good software exists
* SOLAS for missing values (Statistical Solutions):
  - Missing data patterns are easy to inspect
  - Some rather simple single imputation methods, plus two techniques for multiple imputation
* WAID (AutImp/CBS)
  - Tree-based methods; continuous explanatory variables cannot be used in the imputation model
* SAS: simple imputation methods are easy to implement in SAS, and more demanding ones can be implemented by a sophisticated user.
* Multiple imputation software is available but perhaps not very good for NSI data (biometricians have used it, I suppose)

28 In the future: Some types of EUREDIT developments too
EUREDIT: EU/FP5 Project for
Website:
Workplan: EUREDIT is investigating neural networks, support vector machines, model-based and donor-based editing and imputation methods. Of these, the first two represent technologies that have only recently become readily available, and offer promise for providing accurate imputation methods in situations where traditional methods run into difficulties. Model-based methods, particularly those based on multivariate models for the data generation process, will be investigated and further developed. Real-donor-based methods are already in place in a number of NSIs (National Statistical Institutes), and the project includes this methodology in order to provide a baseline for comparing the performance of the other, more novel methods.
DataClean 2002 Conference in Jyväskylä, Finland
Website:

29 A Technique: Self-Organising Maps (SOM) for editing and imputation The Finnish WP: University of Jyväskylä and Statistics Finland Tree-Structured Self-Organizing Maps, TS-SOM

30 CONCLUSIONS
* Some imputation methods are to be used in close connection with editing
* Imputation is already used in all surveys, perhaps not much and not always recognised; unfortunately the methods used are sometimes subjective (good guesses) and hence not documented. Thus there is a need for objective methods, although these are not used much.
* Imputation could be used much more than it is today, because missingness is growing in surveys and censuses.
* Good new techniques are under development, also in EU projects including EUREDIT, which is looking at neural nets, outlier detection and classical techniques. It does not seem to be a very simple job.

31
* A user should recognise that imputed data are not real data and lead to additional variance (a reduction in accuracy), which should be estimated. On the other hand, without good imputation a user will lose much more information.
* Imputation is more necessary in economic surveys (e.g. for business and income variables with nonignorable missingness) but is also needed when dealing with social phenomena.
* Good auxiliary data are needed for successful imputation

32 Some References
Heeringa, S.G., Little, R.J. and Raghunathan, T.E. (1997). Bayesian Estimation and Inference for Multivariate Coarsened Data on U.S. Household Income and Wealth. Invited Paper for the 51st Session of the ISI, Istanbul.
Kalton, G. and Kasprzyk, D. (1986). The Treatment of Missing Survey Data. Survey Methodology 12.
Laaksonen, S. (1991). Adjustments for Non-response in Two-year Panel Data. The Statistician (Great Britain) 40.
Laaksonen, S. (1999). Weighting and Auxiliary Variables in Sample Surveys. In: G. Brossier and A-M. Dussaix (eds). "Enquêtes et Sondages. Méthodes, modèles, applications, nouvelles approches". Dunod, Paris.
Laaksonen, S. (2000). Regression-Based Nearest Neighbour Hot Decking. Computational Statistics 15, 1.
Lawrence, D. and McKenzie, R. (2000). The General Application of Significance Editing. Journal of Official Statistics 16, 3.
Little, R. (1988). Missing-Data Adjustments in Large Surveys. Journal of Business & Economic Statistics 6.
Little, R. and Rubin, D. (1987). Statistical Analysis with Missing Data. John Wiley & Sons.
Rao, J.N.K. and Shao, J. (1992). Jackknife Variance Estimation With Survey Data Under Hot Deck Imputation. Biometrika 79.
Rubin, D. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.
Rubin, D. (1996). Multiple Imputation After 18+ Years (with papers and discussion by B. Fay, J. Rao, D. Binder, J. Eltinge and D. Judkins). Journal of the American Statistical Association 91.
Särndal, C-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. Springer.
Särndal, C-E. (1996). For a Better Understanding of Imputation. In: S. Laaksonen (ed.). International Perspectives on Non-response. Statistics Finland Research Reports 219.
Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. Chapman & Hall.
Schulte Nordholt, E. (1998). Imputation: Methods, Simulation, Experiments and Practical Examples. International Statistical Review 66.
Solas (1999). Solas for Missing Data Analysis 2.0. Statistical Solutions, Ltd. Cork, Ireland.
West, S.A., Kratzke, D-T. and Robertson, K.W. (1996). Alternative Imputation Procedures for Item-Non-response from New Establishments in the Universe. ASA Proceedings of the Section on Survey Research Methods.