IAB-Discussion Paper 6/2010
Articles on labour market issues

Multiple imputation of missing values in the wave 2007 of the IAB Establishment Panel

Jörg Drechsler (IAB)
The IAB Discussion Paper is published by the research institute of the German Federal Employment Agency in order to intensify the dialogue with the scientific community. The prompt publication of the latest research results via the internet is intended to stimulate criticism and to ensure research quality at an early stage before printing.
Contents

Abstract
Zusammenfassung
1 Introduction
2 The Concept of Multiple Imputation
3 Two Approaches to Generate Imputations for Missing Values
4 Real Data Problems and Possible Ways to Handle Them
4.1 Imputation of Skewed Continuous Variables
4.2 Imputation of Semi-Continuous Variables
4.3 Imputation Under Non-Negativity Constraints
4.4 Imputation Under Linear Constraints
4.5 Skip Patterns
5 Multiple Imputation of Missing Values in the IAB Establishment Panel
5.1 The Dataset
5.2 The Imputation Task
5.3 Imputation Models
5.4 Evaluating the Quality of the Imputations
6 Concluding Remarks
Appendix
References
Abstract

The basic concept of multiple imputation is straightforward and easy to understand, but its application to real data raises many implementation problems. Defining useful imputation models for a dataset that consists of categorical and continuous variables with distributions that are anything but normal, and that contains skip patterns and all sorts of logical constraints, is a challenging task. In this paper, we review different approaches to handle these problems and illustrate their successful implementation for a complex imputation project at the German Institute for Employment Research (IAB): the imputation of missing values in one wave of the IAB Establishment Panel.

Zusammenfassung

The basic idea of multiple imputation is easy to understand, but applying the method to real datasets confronts the user with a number of additional challenges. Many datasets consist of both categorical and continuous variables, the latter being anything but normally distributed. In addition, filter questions and various logical restrictions complicate the modeling. In this paper we present several ways of dealing with these challenges and illustrate a successful implementation with a complex imputation project at the Institute for Employment Research (IAB): the imputation of the missing values in one wave of the IAB Establishment Panel.

JEL classification: C52, C81, C83

Keywords: multiple imputation; joint modeling; fully conditional specification; IAB Establishment Panel

Acknowledgements: I would like to thank Hans Kiesl, Susanne Kohaut and Jerome P. Reiter for valuable comments.
1 Introduction

For many datasets, especially for non-mandatory surveys, missing data are a common problem. Deleting units that are not fully observed and using only the remaining units is a popular, easy-to-implement approach in this case. However, it can lead to severe bias if the strong assumption of a missingness pattern that is missing completely at random (MCAR; Rubin, 1987) is not fulfilled. Imputing missing values can help to handle this problem. However, ad hoc methods such as mean imputation can destroy the correlation between the variables. Furthermore, imputing missing values only once (single imputation) generally doesn't account for the fact that the imputed values are only estimates of the true values. After the imputation process, they are often treated like truly observed values, leading to an underestimation of the variance in the data and thereby to p-values that overstate significance. Multiple imputation as proposed by Rubin (1978) overcomes these problems. With multiple imputation, the missing values in a dataset are replaced by m > 1 simulated versions, generated according to a probability distribution for the missing values given the observed data. More precisely, let $Y_{obs}$ be the observed and $Y_{mis}$ the missing part of a dataset $Y$, with $Y = (Y_{mis}, Y_{obs})$; then missing values are drawn from the Bayesian posterior predictive distribution of $(Y_{mis}|Y_{obs})$, or an approximation thereof.

But even though the general concept of multiple imputation is easy to implement, and software packages that will automatically generate multiply imputed datasets from the original data exist for most standard statistical software, application to real datasets often poses additional challenges that need to be considered and often cannot be handled with off-the-shelf imputation programs. Maintaining all skip patterns and logical constraints in the data is difficult and cumbersome. Besides, depending on the variable to be imputed, it might be necessary to define different imputation models for different subsets of the data to increase the quality of the model, and to exclude some variables from the imputation model to avoid multicollinearity problems. These model specifications usually cannot be included in the standard software. Furthermore, the quality of the imputations needs to be monitored, even though the implicit assumption of a missingness pattern that is missing at random (MAR) cannot be tested with the observed data. This does not mean the imputer cannot test the quality of his or her imputations at all; Abayomi/Gelman/Levy (2008) suggest several ways of evaluating model-based imputation procedures.

In this paper we suggest some adjustments to the standard multiple imputation routines to handle the real data problems mentioned above and illustrate a successful implementation of multiple imputation for complex surveys by describing the multiple imputation project for a German establishment survey, the IAB Establishment Panel. The remainder of the paper is organized as follows: In Section 2 we recapitulate the basic concept of multiple imputation. In Section 3 we briefly introduce the two main approaches for multiple imputation, joint modeling and sequential regression, and discuss their advantages and disadvantages. In Section 4 we present some adjustments to standard multiple imputation routines to handle problems that often arise with real data. We don't claim that the ideas presented in this section are new; they have been suggested in several other
papers. The main aim of this section is to give an overview of potential problems that are likely to arise in real data applications and to provide a summary of possible solutions in one paper, freeing the potential multiple imputation user from the burden of a complete literature review in hopes of finding a solution to his or her specific problem. In Section 5 we describe results from the multiple imputation project at the German Institute for Employment Research (IAB) that heavily relies on the methods described in Section 4: the multiple imputation of missing values in the German IAB Establishment Panel. In this section we also discuss the methods we used to evaluate the quality of the imputations. The paper concludes with some final remarks.

2 The Concept of Multiple Imputation

Multiple imputation, introduced by Rubin (1978) and discussed in detail in Rubin (1987, 2004), is an approach that retains the advantages of imputation while allowing the uncertainty due to imputation to be directly assessed. With multiple imputation, the missing values in a dataset are replaced by m > 1 simulated versions, generated according to a probability distribution for the true values given the observed data. More precisely, let $Y_{obs}$ be the observed and $Y_{mis}$ the missing part of a dataset $Y$, with $Y = (Y_{mis}, Y_{obs})$; then missing values are drawn from the Bayesian posterior predictive distribution of $(Y_{mis}|Y_{obs})$, or an approximation thereof. Typically, m is small, such as m = 5. Each of the imputed (and thus completed) datasets is first analyzed by standard methods designed for complete data; the results of the m analyses are then combined to produce estimates, confidence intervals, and test statistics that reflect the missing-data uncertainty properly. In this paper, we discuss analysis with scalar parameters only; for multidimensional quantities see Little/Rubin (2002).

To understand the procedure of analyzing multiply imputed datasets, think of an analyst interested in an unknown scalar parameter Q, where Q could be, e.g., the population mean or a regression coefficient from a linear regression. Inferences for this parameter for datasets with no missing values usually are based on a point estimate q, a variance estimate u, and a normal or Student's t reference distribution. For the analysis of the imputed datasets, let $q_i$ and $u_i$ for $i = 1, 2, \ldots, m$ be the point and variance estimates obtained from each of the m completed datasets. To get a final estimate over all imputations, these estimates have to be combined using the combining rules first described by Rubin (1978). For the point estimate, the final estimate simply is the average of the m point estimates, $\bar{q}_m = \frac{1}{m}\sum_{i=1}^{m} q_i$. Its variance is estimated by $T = \bar{u}_m + (1 + m^{-1})b_m$, where $\bar{u}_m = \frac{1}{m}\sum_{i=1}^{m} u_i$ is the "within-imputation" variance, $b_m = \frac{1}{m-1}\sum_{i=1}^{m} (q_i - \bar{q}_m)^2$ is the "between-imputation" variance, and the factor $(1 + m^{-1})$ reflects the fact that only a finite number of completed-data estimates $q_i$ are averaged together to obtain the final point estimate. The quantity $\hat{\gamma} = (1 + m^{-1})b_m/T$ estimates the fraction of information about Q that is missing due to nonresponse.
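To make the combining rules concrete, the following minimal R sketch (a hypothetical helper, not part of the original project code) computes the combined point estimate, the total variance, and the estimated fraction of missing information from the m completed-data estimates; the degrees of freedom use the Rubin/Schenker formula introduced in the next paragraph.

# Rubin's combining rules for a scalar parameter Q.
# q: vector of m completed-data point estimates
# u: vector of m completed-data variance estimates
mi_combine <- function(q, u, alpha = 0.05) {
  m    <- length(q)
  qbar <- mean(q)                 # combined point estimate
  ubar <- mean(u)                 # "within-imputation" variance
  b    <- var(q)                  # "between-imputation" variance (1/(m-1) sum)
  tvar <- ubar + (1 + 1/m) * b    # total variance T
  fmi  <- (1 + 1/m) * b / tvar    # estimated fraction of missing information
  df   <- (m - 1) / fmi^2         # Rubin/Schenker degrees of freedom
  half <- qt(1 - alpha/2, df) * sqrt(tvar)
  list(estimate = qbar, variance = tvar, fmi = fmi, df = df,
       ci = c(qbar - half, qbar + half))
}

# Example with m = 5 hypothetical completed-data estimates:
# mi_combine(q = c(2.1, 2.4, 2.2, 2.6, 2.3), u = rep(0.04, 5))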
Inferences from multiply imputed data are based on $\bar{q}_m$, $T$, and a Student's t reference distribution. Thus, for example, interval estimates for Q have the form $\bar{q}_m \pm t_{(1-\alpha/2)}\sqrt{T}$, where $t_{(1-\alpha/2)}$ is the $(1-\alpha/2)$ quantile of the t distribution. Rubin/Schenker (1986) provide the approximate value $\nu_{RS} = (m-1)\hat{\gamma}^{-2}$ for the degrees of freedom of the t distribution, under the assumption that with complete data, a normal reference distribution would have been appropriate. Barnard/Rubin (1999) relax this assumption to allow for a t reference distribution with complete data, and suggest the value $\nu_{BR} = (\nu_{RS}^{-1} + \hat{\nu}_{obs}^{-1})^{-1}$ for the degrees of freedom in the multiple-imputation analysis, where $\hat{\nu}_{obs} = (1-\hat{\gamma})\,\nu_{com}(\nu_{com}+1)/(\nu_{com}+3)$ and $\nu_{com}$ denotes the complete-data degrees of freedom.

3 Two Approaches to Generate Imputations for Missing Values

Over the years, two different methods have emerged to generate draws from $P(Y_{mis}|Y_{obs})$: joint modeling and fully conditional specification (FCS), often also referred to as sequential regression multivariate imputation (SRMI) or chained equations. The first assumes that the data follow a specific multivariate distribution, e.g. a multivariate normal distribution. Under this assumption a parametric multivariate density $P(Y|\theta)$ can be specified, with $\theta$ representing the parameters of the assumed underlying distribution. Within the Bayesian framework, this distribution can be used to generate draws from $(Y_{mis}|Y_{obs})$. Methods to create multivariate imputations using this approach have been described in detail by Schafer (1997), e.g., for the multivariate normal, the log-linear, and the general location model.

FCS, on the other hand, does not depend on an explicit assumption for the joint distribution of the dataset. Instead, conditional distributions $P(Y_j|Y_{-j}, \theta_j)$ are specified for each variable separately, where $Y_{-j}$ denotes all columns of $Y$ except $Y_j$. Thus imputations are based on univariate distributions, allowing for a different model for each variable. Missing values in $Y_j$ can be imputed, for example, by a linear or a logistic regression of $Y_j$ on $Y_{-j}$, depending on the scale of measurement of $Y_j$. The process of iteratively drawing from the conditional distributions can be viewed as a Gibbs sampler that will converge to draws from the theoretical joint distribution of the data, if this joint distribution exists.

In general, imputing missing values by joint modeling is faster and the imputation algorithms are simpler to implement. Furthermore, if the underlying joint distribution can be specified correctly, joint modeling will guarantee valid results with the imputed dataset. However, empirical data will seldom follow a standard multivariate distribution, especially if they consist of a mix of numerical and categorical variables. Besides, FCS provides a flexible tool to account for bounds, interactions, skip patterns, or constraints between different variables. None of these restrictions, which are very common in survey data, can be handled by joint modeling. In practice, the imputation task is often centralized at the methodological department of the statistical agency, and imputation experts will fill in missing values for all the surveys conducted by the agency. Imputed datasets that don't fulfill simple restrictions like non-negativity or other logical constraints will never be accepted by subject-matter analysts from other departments. Thus, preserving these constraints is a central element of the imputation task.
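To illustrate the FCS idea and the convergence check discussed below, here is a deliberately simplified R sketch (all names hypothetical, continuous variables only; the regression parameters are fixed at their estimates, whereas a proper Gibbs sampler would draw them from their posterior): each variable with missing values is regressed on all others, missing entries are replaced by draws from the estimated predictive distribution, and the mean of each variable is stored per iteration so that it can be plotted.

# Simplified FCS / chained-equations loop for a numeric data frame 'dat'.
fcs_impute <- function(dat, n_iter = 20) {
  miss <- lapply(dat, function(x) which(is.na(x)))   # missing positions
  for (j in names(dat))                               # crude starting values
    dat[[j]][miss[[j]]] <- mean(dat[[j]], na.rm = TRUE)
  trace <- matrix(NA, n_iter, ncol(dat), dimnames = list(NULL, names(dat)))
  for (it in seq_len(n_iter)) {
    for (j in names(dat)) {
      idx <- miss[[j]]
      if (length(idx) == 0) next
      fit <- lm(reformulate(setdiff(names(dat), j), response = j),
                data = dat[-idx, ])                   # fit on observed part
      mu  <- predict(fit, newdata = dat[idx, ])
      dat[[j]][idx] <- rnorm(length(idx), mu, summary(fit)$sigma)
    }
    trace[it, ] <- colMeans(dat)                      # store imputed means
  }
  matplot(trace, type = "l", xlab = "iteration", ylab = "mean")  # trend check
  dat
}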
Overall, joint modeling will be preferable if only a limited number of variables need to be imputed, no restrictions have to be maintained, and the joint distribution can be approximated reasonably well with a standard multivariate distribution. For more complex imputation tasks, only fully conditional specification will enable the imputer to preserve the constraints inherent in the data. In this case, convergence of the Gibbs sampler should be carefully monitored. A simple way to detect problems with the iterative imputation procedure is to store the mean of every imputed variable for every iteration of the Gibbs sampler. A plot of the imputed means over the iterations can indicate whether there is only the expected random variation between the iterations or whether there is a trend across iterations, indicating problems with the model. Of course, the absence of an observable trend does not guarantee convergence, since the monitored estimates can stay stable for hundreds of iterations before drifting off to infinity. Nevertheless, this is a straightforward method to identify flawed imputation models. More complex methods to monitor convergence are discussed in Arnold/Castillo/Sarabia (1999) and Gelman et al. (2004).

4 Real Data Problems and Possible Ways to Handle Them

The basic concept of multiple imputation is straightforward to apply, and multiple imputation software like IVEware in SAS (Raghunathan/Solenberger/van Hoewyk, 2002), mice (Van Buuren/Oudshoorn, 2000) and mi (Su et al., 2009) in R, ice in Stata (Royston, 2005) (for FCS), and the stand-alone packages NORM, CAT, MIX, and PAN (Schafer, 1997) (for joint modeling) further reduce the modeling burden for the imputer. However, simply applying standard imputation procedures to real data can lead to biased or inconsistent imputations. Several additional aspects have to be considered in practice when imputing real data. Unfortunately, most of the standard software, with the positive exceptions of IVEware and the new mi package in R, can only handle some of these aspects.

4.1 Imputation of Skewed Continuous Variables

One problem that arises especially when modeling business data is that most of the continuous variables, like turnover or number of employees, are heavily skewed. To control for this skewness, we suggest transforming each continuous variable by taking the cube root before the imputation. We prefer the cube root transformation over the log transformation that is often used in the economic literature to model skewed variables like turnover, because the cube root transformation is less sensitive to deviations between the imputed and the original values in the right tail of the distribution. Since the slope of the exponential function increases exponentially whereas the slope of $f(x) = x^3$ increases only quadratically, a small deviation in the right tail of the imputed transformed variable has more severe consequences after back-transformation for the log-transformed variable than for the variable transformed by taking the cube root.
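A minimal sketch of this transformation step in R (the variable turnover and the impute_continuous() call are hypothetical placeholders for the data at hand and for whatever imputation routine is used; only the transform and back-transform are the point here):

# Cube-root transform before imputation, cube after imputation.
cube_root <- function(x) sign(x) * abs(x)^(1/3)  # signed, in case of negatives

y_trans <- cube_root(turnover)            # skewed variable, e.g. turnover
y_trans <- impute_continuous(y_trans)     # hypothetical imputation step
turnover_imputed <- y_trans^3             # back-transformation

# With a log transform, the back-transformation exp() would amplify
# imputation errors in the right tail far more strongly than cubing does.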
4.2 Imputation of Semi-Continuous Variables

Another problem with modeling continuous variables that often arises in surveys is the fact that many of these variables are in fact semi-continuous, i.e. they have a spike at one point of the distribution, while the rest of the distribution can be treated as continuous. For most variables, this spike will occur at zero. To give an example, in our dataset the establishments are asked how many of their employees obtained a college degree; most of the small establishments do not require such highly skilled workers. In this case, we suggest adopting the two-step imputation approach proposed by Raghunathan et al. (2001): In the first step we impute whether the missing value is zero or not. For that, missing values are imputed using a logit model with outcome 1 for all units with a positive value for that variable. In the second step, a standard linear model is fitted only to the units with observed positive values and used to predict the actual value for the units with a predicted positive outcome in step one. All values for units with outcome zero in step one are set to zero.

4.3 Imputation Under Non-Negativity Constraints

Many survey variables can never be negative in reality. This has to be taken into account during the imputation process. A simple way to achieve this goal is to redraw from the imputation model for those units with negative imputed values until all values fulfill the non-negativity constraint. In practice, an upper bound z usually has to be defined for the number of redraws per unit, since the probability of drawing a positive value for a given unit from the specified model may be very low. The value for such a unit is set to zero if z draws from the model never produce a positive value. However, there is a caveat with this approach. Redrawing from the model for negative values is equivalent to drawing from a truncated distribution. If the truncation point is not at the very far end of the distribution, i.e. if the model is misspecified, even simple descriptive statistics like the mean of the imputed variable will differ significantly from the true value of the complete data. For this reason, this approach should only be applied if the probability of drawing negative values from the specified model is very low and we merely want to prevent a few very unlikely, unrealistic values from being imputed. If the fraction of units that would have to be corrected with this approach is too high, the model needs to be revised. Usually it is helpful to define different models for different subgroups of the data; to overcome the problem of generating too many negative values, a separate model for the units with small values should be defined. A code sketch combining the two-step approach of Section 4.2 with this capped redraw strategy follows.
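The following R sketch (hypothetical names; simple lm/glm fits standing in for proper Bayesian draws) illustrates both ideas for a semi-continuous, non-negative variable y with predictors in a data frame X: a logit model decides between zero and positive, and positive values are drawn from a linear model fitted to the positive cases, redrawing at most z times before falling back to zero.

# Two-step imputation for a semi-continuous, non-negative variable y.
two_step_impute <- function(y, X, z = 50) {
  obs <- !is.na(y)
  pos <- obs & y > 0

  # Step 1: impute zero vs. positive with a logit model.
  step1  <- glm(I(y[obs] > 0) ~ ., data = X[obs, ], family = binomial)
  p_pos  <- predict(step1, newdata = X[!obs, ], type = "response")
  is_pos <- rbinom(sum(!obs), 1, p_pos) == 1

  # Step 2: linear model fitted to observed positive values only.
  step2 <- lm(y[pos] ~ ., data = X[pos, ])
  sig   <- summary(step2)$sigma
  mu    <- predict(step2, newdata = X[!obs, ][is_pos, , drop = FALSE])

  draw_nonneg <- function(m) {            # capped redraws (Section 4.3)
    for (k in seq_len(z)) {
      d <- rnorm(1, m, sig)
      if (d >= 0) return(d)
    }
    0                                     # fall back to zero after z draws
  }
  y[!obs][!is_pos] <- 0
  y[!obs][is_pos]  <- vapply(mu, draw_nonneg, numeric(1))
  y
}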
4.4 Imputation Under Linear Constraints

In many surveys, the outcome of one variable by definition has to be equal to or above the outcome of another variable. For example, the total number of employees always has to be at least as high as the number of part-time employees. When imputing missing values in this situation, Schenker et al. (2006) suggest the following approach: Variables that define a subgroup of another variable are always expressed as a proportion, i.e. all values of the subgroup variable are divided by the total before the imputation and thus are bounded between zero and one. A logit transformation then maps these proportions onto the full range $]-\infty, \infty[$. Missing values for the transformed variables can be imputed with a standard imputation approach based on linear regressions. After the imputation, all values are transformed back to proportions, and finally all values are multiplied by the totals to recover the absolute values. To avoid problems at the bounds of the proportions, we suggest capping proportions just below one before the logit transformation and using the two-step imputation approach described in Section 4.2 to determine the zero values. A sketch of these transformations follows at the end of this section.

4.5 Skip Patterns

Skip patterns, e.g. a battery of questions that is only asked if applicable, are very common in surveys. Although it is obvious that they are necessary and can significantly reduce the response burden for the survey participant, they are a nightmare for anybody involved in data editing, imputation, or statistical disclosure control. Especially if the skip patterns are hierarchical, it is very difficult to guarantee that imputed values are consistent with these patterns. With fully conditional specification, it is straightforward to generate imputed datasets that are consistent with all these rules. The two-step approach described in Section 4.2 can be applied to decide whether the questions under consideration are applicable; values are imputed only for the units selected in step one. Nevertheless, correctly implementing all filtering rules is a labor-intensive task that can be more cumbersome than defining good imputation models. Furthermore, skip patterns can lead to variables that are answered by only a small fraction of the respondents, and it can be difficult to develop good models based on a small number of observations.
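A minimal R sketch of the transformation for the linear-constraint case (hypothetical names; the cap value eps and the impute_continuous() call are placeholders, and zeros are assumed to be handled beforehand by the two-step approach of Section 4.2):

# Impute a subgroup variable 'part' under the constraint part <= total.
eps  <- 1e-5                                   # cap near the bounds, placeholder
prop <- pmin(pmax(part / total, eps), 1 - eps) # proportion in (0, 1)

logit_prop <- log(prop / (1 - prop))           # map to ]-Inf, Inf[
logit_prop <- impute_continuous(logit_prop)    # hypothetical imputation step

prop_back    <- 1 / (1 + exp(-logit_prop))     # back to (0, 1)
part_imputed <- prop_back * total              # part <= total holds by design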
5 Multiple Imputation of Missing Values in the IAB Establishment Panel

In this section we describe results from the multiple imputation project at the German Institute for Employment Research: the imputation of missing values in the wave 2007 of the IAB Establishment Panel.

5.1 The Dataset

The IAB Establishment Panel is based on the German employment register, aggregated via the establishment number as of 30 June of each year. The basis of the register, the German Social Security Data (GSSD), is the integrated notification procedure for the health, pension and unemployment insurances, which was introduced in January 1973. This procedure requires employers to notify the social security agencies about all employees covered by social security. As by definition the German Social Security Data only include employees covered by social security - civil servants and unpaid family workers, for example, are not included - approximately 80% of the German workforce are represented. However, the degree of coverage varies considerably across occupations and industries.

Since the register only contains information on employees covered by social security, the panel includes establishments with at least one employee covered by social security. The sample is drawn using a stratified sampling design. The stratification cells are defined by ten classes for the size of the establishment, 16 classes for the region, and 17 classes for the industry. These cells are also used for weighting and extrapolation of the sample. The survey is conducted by interviewers from TNS Infratest Sozialforschung. For the first wave, 4,265 establishments were interviewed in West Germany in the third quarter of 1993. Since then the Establishment Panel has been conducted annually - since 1996 with over 4,700 establishments in East Germany in addition. In the wave 2007, more than 15,000 establishments participated in the survey. Each year, the panel is accompanied by supplementary samples and follow-up samples to include new or reviving establishments and to compensate for panel mortality. The questionnaire contains detailed information about the establishments' personnel structure, development, and personnel policy. For a detailed description of the dataset we refer to Fischer et al. (2008) or Kölling (2000).

5.2 The Imputation Task

Most of the 284 variables included in the wave 2007 of the panel are subject to nonresponse; only 26 variables are fully observed. However, missing rates vary considerably between variables and are modest for most of them: 65.8% of the variables have missing rates below 1%, 20.4% have missing rates between 1% and 2%, 15.1% have rates between 2% and 5%, and only 12 variables have missing rates above 5%. The five variables with missing rates above 10% are subsidies for investment and material expenses (13.6%), payroll (14.4%), intermediate inputs as proportion of turnover (17.4%), turnover in the last fiscal year (18.6%), and number of workers who left the establishment due to restructuring measures (37.5%). Obviously, the variables with the highest missing rates contain information that is either difficult to provide, like the number of workers who left the establishment due to restructuring measures, or considered sensitive, like turnover in the last fiscal year. The variable number of workers who left the establishment due to restructuring measures is only applicable to the 626 establishments in the dataset that declared they had restructuring measures in the last year. Of these 626, only 391 establishments provided information on the number of workers that left the establishment due to these measures. Clearly, it is often difficult to tell exactly which workers left as a result of the measures and which left for other reasons; this might be the reason for the high missing rate. The low number of observed values is also problematic for the modeling task, so this variable should be used with caution in the imputed dataset.

5.3 Imputation Models

Since the dataset contains a mixture of categorical variables and continuous variables with skewed distributions, along with a variety of often hierarchical skip patterns and logical constraints, it is impossible to apply the joint modeling approach. We apply the fully conditional specification approach, iteratively imputing one variable at a time, conditioning on the other variables available in the dataset. For the imputation we basically rely on three different imputation models:
a linear model for the continuous variables, a logit model for binary variables, and a multinomial logit model for categorical variables with more than two categories. Multiple imputation procedures for these models are described in Raghunathan et al. (2001). In general, all variables that don't contain any structural missings are used as predictors in the imputation models, in hopes of reducing problems from uncongeniality (Meng, 1994). In the multinomial logit model for the categorical variables, the number of explanatory variables is limited to 30 variables found by stepwise regression, to speed up the imputation process. To improve the quality of the imputation we define several separate models for the variables with high missing rates, like turnover or payroll: independent models are fit for East and West Germany and for different establishment size classes. All continuous variables are subject to non-negativity constraints, and the outcome of many variables is further restricted by linear constraints. To complicate the imputation process, most variables have huge spikes at zero and, as mentioned before, the skip patterns are often hierarchical. We therefore rely on a mixture of the adjustments presented in Section 4. Since the package mi was not available at the beginning of this project and other standard packages could not deal with all these problems or did not allow detailed model specification, we use our own R code for the imputation routines to generate m = 5 datasets.

5.4 Evaluating the Quality of the Imputations

It is difficult to evaluate the quality of imputations for missing values, since information about the missing values by definition is usually not available, and the assumption that the response mechanism is ignorable (Rubin, 1987), necessary for obtaining valid imputations if the response mechanism is not modeled directly, cannot be tested with the observed data. A response mechanism is considered ignorable if, given that the sampling mechanism is ignorable, the response probability only depends on the observed information. The additional requirement that the sampling mechanism is also ignorable (Rubin, 1987), i.e. the sampling probability only depends on observed data, is usually fulfilled in scientific surveys; the stratified sampling design of the IAB Establishment Panel also satisfies this requirement. If these conditions are fulfilled, the missing data are said to be missing at random (MAR), and imputation models only need to be based on the observed information. As a special case, the missing data are said to be missing completely at random (MCAR) if the response mechanism does not depend on the data (observed or unobserved), which implies that the distribution of the observed data and the distribution of the missing data are identical. If the above requirements are not fulfilled, the missing data are said to be missing not at random (MNAR), and the response mechanism needs to be modeled explicitly. Little/Rubin (2002) provide examples of non-ignorable missing-data models.

As noted before, it is not possible to check with the observed data whether the missing data are MAR. But even if the MAR assumption cannot be tested, this does not mean the imputer cannot test the quality of his or her imputations at all. Abayomi/Gelman/Levy (2008) suggest several ways of evaluating model-based imputation procedures. Basically, their ideas can be divided into two categories: On the one hand, the imputed data can be checked for reasonability.
Simple distributional and outlier checks can be evaluated by subject-matter experts for each variable to avoid implausible imputed values like a turnover of $10 million for a small establishment in the social sector.
On the other hand, since imputations usually are model based, the fit of these models can and indeed should be tested. Abayomi/Gelman/Levy (2008) label the former external diagnostic techniques, since the imputations are evaluated using outside knowledge, and the latter internal diagnostic techniques, since they evaluate the modeling based on model fit without the need for external information.

To automate the external diagnostics to some extent, Abayomi/Gelman/Levy (2008) suggest using the Kolmogorov-Smirnov test to flag any imputations for which the distribution of the imputed values differs significantly from the distribution of the observed values. Of course, a significant difference in the distributions does not necessarily indicate problems with the imputation; indeed, if the missing data mechanism is MAR but not MCAR, we would expect the two distributions to differ. The test is only intended to decrease the number of variables that need to be checked manually, implicitly assuming that no significant difference between the original and the imputed data indicates no problem with the imputation model. However, we are skeptical about this automated selection method, since the test is sensitive to the sample size, so the chance of rejecting the null hypothesis will be lower for variables with lower missing rates and for variables that are answered only by a subset of the respondents. Furthermore, it is unclear what significance level to choose, and, as noted above, rejection of the null hypothesis does not necessarily indicate an imputation problem; but not rejecting the null hypothesis is no guarantee that we found a good imputation model either, although this is implicitly assumed by this procedure.

For the continuous variables, we searched for possible flaws in the imputations by plotting the distributions of the original and imputed values for every variable. We checked whether any notable differences between these distributions could be justified by differences in the distributions of the covariates. Figure 1 displays the distributions for two representative variables based on kernel density estimation; original values are represented by a solid line, imputed values by a dashed line. Both variables are reported on the log scale. The left variable (payroll) represents a candidate that we did not investigate further, since the distributions match almost exactly. The right variable (number of participants in further education, NB.PFE (footnote 1)) is an example of a variable for which we tried to understand the difference between the distribution of the observed values and the distribution of the imputed values before accepting the imputation model. Obviously, most of the imputed values for the variable NB.PFE are larger than the observed values for this variable. To understand this difference, we examined the dependence between the missing rate and the establishment size. In Table 1 we present the percentage of missing units in ten establishment size classes defined by quantiles and the mean of NB.PFE within these quantiles. (Footnote 1: In the questionnaire the respondents can indicate whether the number they are reporting represents the number of cases or the number of participating persons. The graph presented here only includes those establishments that reported the number of participating persons.
The distributions for establishments that reported cases did not differ significantly between observed and imputed values.) The missing rates are low up to the sixth establishment
size class. Beyond that point, the missing rates increase significantly with every class. The average number of further education participants increases steadily with every establishment size class, with the largest increases in the second half of the table. With these results in mind, it is not surprising that the imputed values for this variable are often larger than the observed values.

Figure 1: Observed (solid line) and imputed (dashed line) data for payroll and number of participants in further education (NB.PFE). Both variables are reported on the log scale.

Table 1: Missing rates and means per quantile for NB.PFE (columns: establishment size quantile, missing rate in %, mean NB.PFE per quantile).

We inspected several continuous variables by comparing the distributions of the observed and imputed values in our dataset and did not find any differences in the distributions that could not be explained by the missingness pattern. However, these comparisons are only meaningful if enough observations are imputed; otherwise the distributions of observed and imputed data might look completely different only because kernel density estimation is not an appropriate way to produce a smooth distribution graph in this context. For this reason we restricted the density comparisons to variables with more than 200 imputed values above zero. For the remaining variables we plotted histograms to check for differences between the observed and imputed values and to detect univariate outliers in the imputed data.
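A sketch of how such a comparison might be automated in R (hypothetical names; obs and imp are the observed and imputed values of one positive-valued variable), combining the density overlay used for Figure 1 with the Kolmogorov-Smirnov flagging discussed above, with all the caveats about that test kept in mind:

# Overlay observed vs. imputed densities and flag large differences.
compare_imputed <- function(obs, imp, varname = "", alpha = 0.05) {
  plot(density(log(obs)), lty = 1, main = varname)  # observed: solid line
  lines(density(log(imp)), lty = 2)                 # imputed: dashed line
  ks <- ks.test(obs, imp)                           # two-sample KS test
  ks$p.value < alpha                                # TRUE = inspect manually
}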
We also investigated whether any weighted imputed value for any variable lay above the maximum weighted observed value for that variable. (Footnote 2: Weights were available for all records, since the stratifying variables necessary for the generation of the weights were always fully observed.) Again, this would not necessarily be problematic, but we did not want to produce any unrealistic influential outliers. However, we did not find any weighted imputed value that was higher than the maximum of its weighted observed counterpart.

For the internal diagnostics, we used three graphics to evaluate the model fit: a normal Q-Q plot, a plot of the residuals from the regression against the fitted values, and a binned residual plot (Gelman/Hill, 2006). The normal Q-Q plot indicates whether the assumption of a normal distribution for the residuals is justified by plotting the theoretical quantiles of a normal distribution against the empirical quantiles of the residuals. The residual plot visualizes any unwanted dependencies between the fitted values and the residuals. For the binned residual plot, the average of the fitted values is calculated within several predefined bins, and these average fitted values are plotted against the average of the residuals within these bins. This is especially helpful for categorical variables, since the output of a simple residual plot is difficult to interpret if the outcome is discrete. A minimal sketch of such a binned residual plot is given below.

Figure 2 again provides an example of one model (one of the models for the variable turnover) that we did not inspect any further, and one model (for the variable number of participants in further education with college degree, NB.PFE.COL) for which we checked whether adjustments were necessary. For both variables, the assumption that the residuals are more or less normally distributed seems justified. For the variable turnover, the two residual plots further confirm the quality of the model: only a small number of residuals fall outside the grey dotted 95% confidence bands in the residual plot, and none of the averaged residuals falls outside the grey 95% confidence bands in the binned residual plot. This is different for NB.PFE.COL. Although most of the points are still inside the 95% confidence bands, we see a clear relationship between the fitted values and the residuals for the small values, and the binned residuals for these small values all fall outside the confidence bands. However, this phenomenon can be explained if we inspect the variable further. Most establishments don't have any participants in further training with a college degree, and we fitted the model only to the 3,426 units that reported having at least one participant. Of these units, 648 reported that they had only one participant, leading to a spike at 1 in the original data. Since we simply fit a linear model to the observed data, the almost vertical line in the residual plot is not surprising: it contains the residuals for all the units with only one participant in the original data. The binned residual plot indicates that the small fitted values sometimes severely underestimate the original values. The reason for this, again, is the fact that the original data are truncated at 1, whereas the fitted values are predictions from a standard linear model that would even allow negative fitted values, since we computed the fitted values before the adjustments for non-negativity described in Section 4.3. The consequence is a slight overestimation for the larger fitted values. We found similar patterns in some other variables that had huge spikes at 1.
We could have tried to model the data with a truncated distribution, or we could have applied the semi-continuous approach described in Section 4.2 to model the spike at 1 separately. But since we expect the non-negativity adjustments to reduce this effect, we decided not to make the already complex modeling task even more difficult.
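For completeness, a minimal R sketch of a binned residual plot as described above (a hand-rolled version for illustration; the arm package accompanying Gelman/Hill (2006) provides a ready-made binnedplot() function):

# Binned residual plot: average residuals vs. average fitted values per bin.
binned_residual_plot <- function(fitted, resid, n_bins = 40) {
  breaks <- unique(quantile(fitted, probs = seq(0, 1, length.out = n_bins + 1)))
  bins   <- cut(fitted, breaks = breaks, include.lowest = TRUE)
  x  <- tapply(fitted, bins, mean)           # average fitted value per bin
  y  <- tapply(resid, bins, mean)            # average residual per bin
  se <- tapply(resid, bins, sd) / sqrt(tapply(resid, bins, length))
  plot(x, y, xlab = "average fitted value", ylab = "average residual")
  lines(x,  2 * se, lty = 3, col = "grey")   # rough 95% bands
  lines(x, -2 * se, lty = 3, col = "grey")
  abline(h = 0)
}

# Usage, e.g. for a hypothetical turnover model:
# fit <- lm(turnover ~ ., data = predictors)
# binned_residual_plot(fitted(fit), resid(fit))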
Figure 2: Model checks (normal Q-Q plot, residual plot of residuals vs. predicted values, and binned residual plot) for turnover and for the number of participants in further education with college degree (NB.PFE.COL).

Missing rates are substantially lower for the categorical variables. Only 59 of the close to 200 categorical variables in the dataset have missing rates above 1%, and we limited our evaluation to these variables. We compared the relative number of responses in each category for the observed and the imputed values and flagged a variable for closer inspection if the relative number of responses in an imputed category differed by more than 20% from the relative number in the observed category. We further limited our search to categories that contained at least 25 units, since small changes in categories with fewer units would lead to significant changes in the relative differences for these categories. All 15 variables that were flagged by this procedure had a missing rate below 5%, and the differences between the imputed and original response rates could be explained by the missingness pattern for all of them.

We select one variable here to illustrate the significant differences between observed and imputed values that can arise from a missingness pattern that is definitely not missing completely at random. The variable under consideration asks for the expectations about the investments in 2007 compared to 2006. Table 2 provides some summary statistics for this variable.
Table 2: Expectations for the investments in 2007 compared to 2006 (response rates in % for each category). Rows: will stay the same; increase expected; decrease expected; don't know yet. Columns: observed data; imputed data; observed units with investment in 2006; observed units without investment in 2006.

We find a substantial difference for the second and the third category if we simply compare the observed response rates (column 1) with the imputed response rates (column 2). But the missing rate is only 0.2% for this variable for units with investments in 2006, while it soars to 10.5% for units without investments in 2006. Thus, the response rates across categories for the imputed values will be influenced by the expectations of those units that had no investments in 2006 (column 4), even though only 12.9% of the participants who planned investments for 2007 reported no investments in 2006. These response rates differ completely from the response rates for units that reported investments in 2006 (column 3). Thus the percentage of establishments that expect an increase in investments is significantly larger in the imputed data than in the original data.

For categorical data, the normal Q-Q plot is not appropriate as an internal diagnostic tool, and the residual plot is difficult to interpret if the outcome is discrete. Therefore, we only examined the binned residual plots for the 59 categorical variables with missing rates above 1%. All plots indicate a good model fit. For brevity, all graphics are moved to the Appendix.

To check for possible problems with the iterative imputation procedure, we stored the mean of several continuous variables after every imputation round. We did not find any inherent trend in the imputed means for any of the variables. Of course, this is no guarantee of convergence. A possible strategy for measuring the convergence of the algorithm that we did not implement in our imputation routines is discussed in Su et al. (2009).

6 Concluding Remarks

More than 30 years after its initial proposal by Rubin (1978), multiple imputation has been widely accepted as best practice for dealing with item nonresponse in surveys. Nevertheless, the implementation of good imputation routines is a difficult and cumbersome task, and an unsupervised imputation using standard imputation software can harm the results more than a simple complete-case analysis. In this paper we discussed several problems that often arise when imputing real datasets, severely complicating the imputation process. We summarized several ideas for handling these problems, and the detailed discussion of the imputation of missing values in the German IAB Establishment Panel illustrates that generating multiply imputed datasets with high data quality for complex surveys is not an impossible task.

However, there is still room for improvement and there are open research questions. We found that defining several independent models for each variable can lead to significant improvements in the imputations. To reduce the runtime of the procedure and to keep the programming burden low, we applied this strategy only to a few variables with very
high missing rates. Using this strategy for all variables whenever possible would further improve the quality of the imputations. A definite shortcoming of our implementation is the limitation to only 30 explanatory variables in the multinomial models, which was necessary due to multicollinearity problems and to speed up the imputation process. This might lead to uncongeniality problems (Meng, 1994) if the analyst uses explanatory variables in his or her model that were not included in the imputation model. A possible strategy to overcome these limitations would be to use CART models for the categorical variables. Reiter (2005) suggests this strategy in the context of multiple imputation for statistical disclosure control. Further research is needed to investigate under which circumstances this approach is also applicable for multiple imputation for nonresponse.

Appendix

Figures 3-9 present the binned residual plots for all 59 categorical variables with missing rates above 1%. For variables with more than two categories, we present a graph for each category (the first category is always defined as the reference category in the multinomial imputation model). For readability, we use the internal labeling for the variables. A detailed description of all variables can be obtained from the author upon request.
Figures 3-9: Binned residual plots for all categorical variables with missing rates above 1%, one panel per variable and category (internal variable labels). Figure 3: o04, o07a-o07c, o07e-o07h, o11 (categories 2-5), o22 (category 2). Figure 4: o22 (categories 3-5), o29aa-o29ag, o29aj, o29ak, o37b. Figure 5: o37c, o47a, o57a-o57c (categories 2-4), o57d (category 2). Figure 6: o57d (categories 3-4), o57e-o57g (categories 2-4), o57h (categories 2-3). Figure 7: o57h (category 4), o73a-o73e, o79a-o79e. Figure 8: o79f-o79h, o84a (categories 2-4), o86b, o88 (categories 2-4), o90 (categories 2-3). Figure 9: o90 (category 4), o92 (categories 2-6), o95a-o95c, o99 (categories 2-5).
References

Abayomi, K.; Gelman, A.; Levy, M. (2008): Diagnostics for multivariate imputations. In: Journal of the Royal Statistical Society, Series C, Vol. 57.

Arnold, B. C.; Castillo, E.; Sarabia, J. M. (1999): Conditional Specification of Statistical Models. Springer.

Barnard, J.; Rubin, D. B. (1999): Small-sample degrees of freedom with multiple imputation. In: Biometrika, Vol. 86.

Fischer, G.; Janik, F.; Müller, D.; Schmucker, A. (2008): The IAB Establishment Panel - from Sample to Survey to Projection. FDZ-Methodenreport No. 1/2008.

Gelman, A.; Carlin, J. B.; Stern, H. S.; Rubin, D. B. (2004): Bayesian Data Analysis: Second Edition. London: Chapman & Hall.

Gelman, A.; Hill, J. (2006): Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

Kölling, A. (2000): The IAB-Establishment Panel. In: Journal of Applied Social Science Studies, Vol. 120.

Little, R. J. A.; Rubin, D. B. (2002): Statistical Analysis with Missing Data: Second Edition. New York: John Wiley & Sons.

Meng, X.-L. (1994): Multiple-imputation inferences with uncongenial sources of input (with discussion). In: Statistical Science, Vol. 9.

Raghunathan, T. E.; Lepkowski, J. M.; van Hoewyk, J.; Solenberger, P. (2001): A multivariate technique for multiply imputing missing values using a series of regression models. In: Survey Methodology, Vol. 27.

Raghunathan, T. E.; Solenberger, P.; van Hoewyk, J. (2002): IVEware: Imputation and Variance Estimation Software. Survey Research Center, University of Michigan.

Reiter, J. P. (2005): Using CART to Generate Partially Synthetic, Public Use Microdata. In: Journal of Official Statistics, Vol. 21.

Royston, P. (2005): Multiple imputation of missing values: Update of ice. In: The Stata Journal, Vol. 5.

Rubin, D. B. (2004): The Design of a General and Flexible System for Handling Nonresponse in Sample Surveys. In: The American Statistician, Vol. 58.

Rubin, D. B. (1987): Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.

Rubin, D. B. (1978): Multiple imputations in sample surveys. In: Proceedings of the Section on Survey Research Methods of the American Statistical Association.

Rubin, D. B.; Schenker, N. (1986): Multiple Imputation for Interval Estimation From Simple Random Samples With Ignorable Nonresponse. In: Journal of the American Statistical Association, Vol. 81.

Schafer, J. L. (1997): Analysis of Incomplete Multivariate Data. London: Chapman & Hall.

Schenker, N.; Raghunathan, T. E.; Chiu, P.-L.; Makuc, D. M.; Zhang, G.; Cohen, A. J. (2006): Multiple Imputation of Missing Income Data in the National Health Interview Survey. In: Journal of the American Statistical Association, Vol. 101.

Su, Y.; Gelman, A.; Hill, J.; Yajima, M. (2009): Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box. In: Journal of Statistical Software (forthcoming).

Van Buuren, S.; Oudshoorn, C. G. M. (2000): MICE V1.0 User's manual. Tech. Rep., TNO Prevention and Health, Leiden.
Imprint

IAB-Discussion Paper 6/2010

Editorial address: Institute for Employment Research of the Federal Employment Agency, Regensburger Str. 104, Nuremberg

Editorial staff: Regina Stoll, Jutta Palm-Nowak

Technical completion: Jutta Sebald

All rights reserved. Reproduction and distribution in any form, also in parts, requires the permission of IAB Nuremberg.

For further inquiries contact the author: Jörg Drechsler
BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I
BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential
Missing-data imputation
CHAPTER 25 Missing-data imputation Missing data arise in almost all serious statistical analyses. In this chapter we discuss a variety of methods to handle missing data, including some relatively simple
Gerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
Imputing Missing Data using SAS
ABSTRACT Paper 3295-2015 Imputing Missing Data using SAS Christopher Yim, California Polytechnic State University, San Luis Obispo Missing data is an unfortunate reality of statistics. However, there are
Missing data and net survival analysis Bernard Rachet
Workshop on Flexible Models for Longitudinal and Survival Data with Applications in Biostatistics Warwick, 27-29 July 2015 Missing data and net survival analysis Bernard Rachet General context Population-based,
SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing [email protected]
SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing [email protected] IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way
Organizing Your Approach to a Data Analysis
Biost/Stat 578 B: Data Analysis Emerson, September 29, 2003 Handout #1 Organizing Your Approach to a Data Analysis The general theme should be to maximize thinking about the data analysis and to minimize
Missing Data. Katyn & Elena
Missing Data Katyn & Elena What to do with Missing Data Standard is complete case analysis/listwise dele;on ie. Delete cases with missing data so only complete cases are le> Two other popular op;ons: Mul;ple
Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com
SPSS-SA Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Training Brochure 2009 TABLE OF CONTENTS 1 SPSS TRAINING COURSES FOCUSING
Normality Testing in Excel
Normality Testing in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. [email protected]
International Statistical Institute, 56th Session, 2007: Phil Everson
Teaching Regression using American Football Scores Everson, Phil Swarthmore College Department of Mathematics and Statistics 5 College Avenue Swarthmore, PA198, USA E-mail: [email protected] 1. Introduction
2013 MBA Jump Start Program. Statistics Module Part 3
2013 MBA Jump Start Program Module 1: Statistics Thomas Gilbert Part 3 Statistics Module Part 3 Hypothesis Testing (Inference) Regressions 2 1 Making an Investment Decision A researcher in your firm just
APPLIED MISSING DATA ANALYSIS
APPLIED MISSING DATA ANALYSIS Craig K. Enders Series Editor's Note by Todd D. little THE GUILFORD PRESS New York London Contents 1 An Introduction to Missing Data 1 1.1 Introduction 1 1.2 Chapter Overview
Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University
Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University 1 Outline Missing data definitions Longitudinal data specific issues Methods Simple methods Multiple
CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS
Examples: Regression And Path Analysis CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS Regression analysis with univariate or multivariate dependent variables is a standard procedure for modeling relationships
Applications of R Software in Bayesian Data Analysis
Article International Journal of Information Science and System, 2012, 1(1): 7-23 International Journal of Information Science and System Journal homepage: www.modernscientificpress.com/journals/ijinfosci.aspx
Sensitivity Analysis in Multiple Imputation for Missing Data
Paper SAS270-2014 Sensitivity Analysis in Multiple Imputation for Missing Data Yang Yuan, SAS Institute Inc. ABSTRACT Multiple imputation, a popular strategy for dealing with missing values, usually assumes
Analysis of Bayesian Dynamic Linear Models
Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main
Missing data in randomized controlled trials (RCTs) can
EVALUATION TECHNICAL ASSISTANCE BRIEF for OAH & ACYF Teenage Pregnancy Prevention Grantees May 2013 Brief 3 Coping with Missing Data in Randomized Controlled Trials Missing data in randomized controlled
5. Multiple regression
5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful
Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13
Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Overview Missingness and impact on statistical analysis Missing data assumptions/mechanisms Conventional
Bayesian Statistics: Indian Buffet Process
Bayesian Statistics: Indian Buffet Process Ilker Yildirim Department of Brain and Cognitive Sciences University of Rochester Rochester, NY 14627 August 2012 Reference: Most of the material in this note
Data Cleaning and Missing Data Analysis
Data Cleaning and Missing Data Analysis Dan Merson [email protected] India McHale [email protected] April 13, 2010 Overview Introduction to SACS What do we mean by Data Cleaning and why do we do it? The SACS
From the help desk: Bootstrapped standard errors
The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution
Statistics in Retail Finance. Chapter 6: Behavioural models
Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural
7 Time series analysis
7 Time series analysis In Chapters 16, 17, 33 36 in Zuur, Ieno and Smith (2007), various time series techniques are discussed. Applying these methods in Brodgar is straightforward, and most choices are
Getting Correct Results from PROC REG
Getting Correct Results from PROC REG Nathaniel Derby, Statis Pro Data Analytics, Seattle, WA ABSTRACT PROC REG, SAS s implementation of linear regression, is often used to fit a line without checking
2. Simple Linear Regression
Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according
IBM SPSS Forecasting 22
IBM SPSS Forecasting 22 Note Before using this information and the product it supports, read the information in Notices on page 33. Product Information This edition applies to version 22, release 0, modification
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
Integrating Financial Statement Modeling and Sales Forecasting
Integrating Financial Statement Modeling and Sales Forecasting John T. Cuddington, Colorado School of Mines Irina Khindanova, University of Denver ABSTRACT This paper shows how to integrate financial statement
Introduction to time series analysis
Introduction to time series analysis Margherita Gerolimetto November 3, 2010 1 What is a time series? A time series is a collection of observations ordered following a parameter that for us is time. Examples
Fairfield Public Schools
Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity
Workpackage 11 Imputation and Non-Response. Deliverable 11.2
Workpackage 11 Imputation and Non-Response Deliverable 11.2 2004 II List of contributors: Seppo Laaksonen, Statistics Finland; Ueli Oetliker, Swiss Federal Statistical Office; Susanne Rässler, University
2. Making example missing-value datasets: MCAR, MAR, and MNAR
Lecture 20 1. Types of missing values 2. Making example missing-value datasets: MCAR, MAR, and MNAR 3. Common methods for missing data 4. Compare results on example MCAR, MAR, MNAR data 1 Missing Data
MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS)
MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS) R.KAVITHA KUMAR Department of Computer Science and Engineering Pondicherry Engineering College, Pudhucherry, India DR. R.M.CHADRASEKAR Professor,
South Carolina College- and Career-Ready (SCCCR) Probability and Statistics
South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready Mathematical Process Standards The South Carolina College- and Career-Ready (SCCCR)
BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS
BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 [email protected] Genomics A genome is an organism s
NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
CONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE
1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 3 Problems with small populations 8 II. Why Random Sampling is Important 9 A myth,
Bootstrapping Big Data
Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu
Visualization of Complex Survey Data: Regression Diagnostics
Visualization of Complex Survey Data: Regression Diagnostics Susan Hinkins 1, Edward Mulrow, Fritz Scheuren 3 1 NORC at the University of Chicago, 11 South 5th Ave, Bozeman MT 59715 NORC at the University
Teaching Multivariate Analysis to Business-Major Students
Teaching Multivariate Analysis to Business-Major Students Wing-Keung Wong and Teck-Wong Soon - Kent Ridge, Singapore 1. Introduction During the last two or three decades, multivariate statistical analysis
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
Penalized regression: Introduction
Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood
Imputing Values to Missing Data
Imputing Values to Missing Data In federated data, between 30%-70% of the data points will have at least one missing attribute - data wastage if we ignore all records with a missing value Remaining data
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data
Approaches for Analyzing Survey Data: a Discussion
Approaches for Analyzing Survey Data: a Discussion David Binder 1, Georgia Roberts 1 Statistics Canada 1 Abstract In recent years, an increasing number of researchers have been able to access survey microdata
A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution
A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September
The frequency of visiting a doctor: is the decision to go independent of the frequency?
Discussion Paper: 2009/04 The frequency of visiting a doctor: is the decision to go independent of the frequency? Hans van Ophem www.feb.uva.nl/ke/uva-econometrics Amsterdam School of Economics Department
Credit scoring and the sample selection bias
Credit scoring and the sample selection bias Thomas Parnitzke Institute of Insurance Economics University of St. Gallen Kirchlistrasse 2, 9010 St. Gallen, Switzerland this version: May 31, 2005 Abstract
Example G Cost of construction of nuclear power plants
1 Example G Cost of construction of nuclear power plants Description of data Table G.1 gives data, reproduced by permission of the Rand Corporation, from a report (Mooz, 1978) on 32 light water reactor
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance
Multivariate Normal Distribution
Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues
Statistical modelling with missing data using multiple imputation. Session 4: Sensitivity Analysis after Multiple Imputation
Statistical modelling with missing data using multiple imputation Session 4: Sensitivity Analysis after Multiple Imputation James Carpenter London School of Hygiene & Tropical Medicine Email: [email protected]
Survival Analysis of Left Truncated Income Protection Insurance Data. [March 29, 2012]
Survival Analysis of Left Truncated Income Protection Insurance Data [March 29, 2012] 1 Qing Liu 2 David Pitt 3 Yan Wang 4 Xueyuan Wu Abstract One of the main characteristics of Income Protection Insurance
Simple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
Multivariate Analysis of Ecological Data
Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology
I. Introduction. II. Background. KEY WORDS: Time series forecasting, Structural Models, CPS
Predicting the National Unemployment Rate that the "Old" CPS Would Have Produced Richard Tiller and Michael Welch, Bureau of Labor Statistics Richard Tiller, Bureau of Labor Statistics, Room 4985, 2 Mass.
Elements of statistics (MATH0487-1)
Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium December 10, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -
Regression Modeling Strategies
Frank E. Harrell, Jr. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis With 141 Figures Springer Contents Preface Typographical Conventions
SPSS Explore procedure
SPSS Explore procedure One useful function in SPSS is the Explore procedure, which will produce histograms, boxplots, stem-and-leaf plots and extensive descriptive statistics. To run the Explore procedure,
Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP
Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation
Local outlier detection in data forensics: data mining approach to flag unusual schools
Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential
IBM SPSS Missing Values 20
IBM SPSS Missing Values 20 Note: Before using this information and the product it supports, read the general information under Notices on p. 87. This edition applies to IBM SPSS Statistics 20 and to all
