# Multiple Imputation for Missing Data: A Cautionary Tale

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Multiple Imputation for Missing Data: A Cautionary Tale Paul D. Allison University of Pennsylvania Address correspondence to Paul D. Allison, Sociology Department, University of Pennsylvania, 3718 Locust Walk, Philadelphia, PA , (voice), (fax).

2 Biographical Sketch Paul D. Allison is Professor of Sociology at the University of Pennsylvania. His recently published books include Multiple Regression: A Primer and Logistic Regression Using the SAS System: Theory and Applications. He is currently writing a book on missing data. Each summer he teaches five-day workshops on event history analysis and categorical data analysis. 2

3 Multiple Imputation for Missing Data: A Cautionary Tale Abstract: Two algorithms for producing multiple imputations for missing data are evaluated with simulated data. Software using a propensity score classifier with the approximate Bayesian boostrap produces badly biased estimates of regression coefficients when data on predictor variables are missing at random or missing completely at random. On the other hand, a regression-based method employing the data augmentation algorithm produces estimates with little or no bias. 3

4 Multiple imputation (MI) appears to be one of the most attractive methods for generalpurpose handling of missing data in multivariate analysis. The basic idea, first proposed by Rubin (1977) and elaborated in his (1987) book, is quite simple: 1. Impute missing values using an appropriate model that incorporates random variation. 2. Do this M times (usually 3-5 times), producing M complete data sets. 3. Perform the desired analysis on each data set using standard complete-data methods. 4. Average the values of the parameter estimates across the M samples to produce a single point estimate. 5. Calculate the standard errors by (a) averaging the squared standard errors of the M estimates (b) calculating the variance of the M parameter estimates across samples, and (c) combining the two quantities using a simple formula (given below). Additional discussions of multiple imputation in SMR can be found in Little and Rubin (1989) and Landerman, Land and Pieper (1997). Multiple imputation has several desirable features: Introducing appropriate random error into the imputation process makes it possible to get approximately unbiased estimates of all parameters. No deterministic imputation method can do this in general settings. Repeated imputation allows one to get good estimates of the standard errors. Single imputation methods don t allow for the additional error introduced by imputation (without specialized software of very limited generality). MI can be used with any kind of data and any kind of analysis without specialized software. 4

5 Of course, certain requirements must be met for MI to have these desirable properties. First, the data must be missing at random (MAR), meaning that the probability of missing data on a particular variable Y can depend on other observed variables, but not on Y itself (controlling for the other observed variables). Second, the model used to generate the imputed values must be correct in some sense. Third, the model used for the analysis must match up, in some sense, with the model used in the imputation. All these conditions have been rigorously described by Rubin (1987, 1996). The problem is that it s easy to violate these conditions in practice. There are often strong reasons to suspect that the data are not MAR. Unfortunately, not much can be done about this. While it s possible to formulate and estimate models for data that are not MAR, such models are complex, untestable, and require specialized software. Hence, any general-purpose method will necessarily invoke the MAR assumption. Even when the MAR condition is satisfied, producing random imputations that yield unbiased estimates of the desired parameters is not always easy or straightforward. In this paper I examine two approaches to multiple imputation that have been incorporated into widely available software. We will see that one of them (embodied in software currently retailing for \$895) does a terrible job at producing imputations for missing data on predictor variables in multiple regression analysis. The other (available free on the Web) does an excellent job. To compare these two algorithms, I generated 10,000 observations on three variables X, Y, and Z. Because the primary concern is with bias, the large sample size is designed to minimize sampling variations. X and Z were drawn from a bivariate standard normal distribution with a correlation of.50. Thus, in the observed data, X and Z each have means of about 0, standard deviations of about 1.0, and a correlation of about.5. Y was then generated from the equation 5

6 Y = X + Z + U where U was drawn from a standard normal distribution, independent of X and Z. mechanisms: I then caused about half of the Z values to be missing according to four different 1. Missing completely at random: Z missing with probability.5, independent of Y and X. 2. Missing at random, dependent on X: Z missing if X < Missing at random, dependent on Y: Z missing if Y < Nonignorable: Z missing if Z < 0. For each missing data mechanism, new draws were made from the underlying distribution of X, Z and Y. Column 1 of Table 1 shows results from OLS regression using the original data with no missing cases on Z Table 1 about here Column 2 of Table 1 shows the results of an OLS regression using listwise deletion (complete case analysis) on the data set with missing values. Listwise deletion shows little or no bias for all missing data mechanisms except where missingness on Z is dependent on Y. In that case, both regression coefficients are about 30% lower than the true values. Not surprisingly, the standard errors are always somewhat larger than those based on the full data set. Remarkably, listwise deletion does well even when missingness on Z depends on Z itself (the nonignorable case). More generally, it can be shown that listwise deletion produces unbiased regression estimates whenever the missing data mechanism depends only on the predictor variables, not on the response variable (Little and Rubin, forthcoming) Next we ll use multiple imputation as implemented by the Windows program SOLAS (Statistical Solutions, SOLAS (version 1.1) incorporates a number 6

7 of imputation methods, the most sophisticated of which is multiple imputation using what is described in promotional material as the latest industry-accepted techniques. In fact, SOLAS uses the algorithm proposed by Lavori, Dawson and Shera (1995). For our example, the algorithm amounts to this: 1. Do a logistic regression in which the dependent variable is whether or not Z is missing. Independent variables can be chosen by the analyst; I chose both Y and X, although the results are not much different if only one of the two is chosen. 2. Use the estimated logistic model to calculate a predicted probability of being missing. This is called the propensity score, following the work of Rosenbaum and Rubin (1983). 3. Sort the observations by propensity scores and group them into five groups (quintiles). 4. Within each quintile, there are r cases with Z observed and m cases with Z missing. From among the r fully observed cases, draw a simple random sample of r cases with replacement. For each missing case, randomly draw one value (with replacement) from the random sample of r cases and use the observed value of Z as the imputed value. This method is called the approximate Bayesian bootstrap (Rubin and Schenker 1986) Steps 1-4 produce a single imputed data set. Step 4 is repeated M times to produce M complete data sets. For this example, I chose M = 5. I then performed OLS multiple regressions of Y on X and Z in each of the five imputed data sets. The coefficients shown in column 3 of Table 1 are the means of the five estimates. The standard error is obtained from Rubin s (1987) formula. Let b k be the estimated regression coefficient in sample k of the M samples, and let s k be its estimated standard error. The mean of the b k s is b ; its estimated standard error is given by 7

8 1 M k s M M k k ( b k b ) 2. In words, this is the square root of the average of the sampling variances plus the variance of coefficient estimates (multiplied by a correction factor 1 + 1/M). Results for SOLAS are shown in column 3 of Table 1. All coefficient estimates are clearly biased and, in most cases, the biases are quite severe. The worst case is when missingness on Z depends on X, where both coefficients are about 50% off, either too high or too low. Next we turn to a quite different method for generating multiple imputations, described by Schafer (1997) and embodied in NORM, his freeware Windows program ( To understand the method consider the following simplified algorithm for random regression imputation: 1. Regress Z on X and Y using all cases with data present on Z. Based on this regression, let Ẑ represent the predicted values for all cases including those with data missing, and let σˆ be the root mean squared error. 2. For cases with data missing on Z, compute Z* = Zˆ + σˆ U where U represents a random draw from a standard normal distribution. For cases with no missing data, Z* = Z. 3. Regress Y on X and Z* for all cases. Steps 2 and 3 may then be repeated M times. While this algorithm does a pretty good job, it is not proper in Rubin s (1987) sense because it does not adjust for the fact that σˆ and the coefficients used to produce Ẑ are only estimates, not the true values. This is also true of the multiple imputation methods available in the recently released missing data module for SPSS or the predicting mean matching method described by Landerman et al. (1997). NORM goes the 8

9 extra mile by using a Bayesian method known as data augmentation to iterate between random imputations under a specified set of parameter values and random draws from the posterior distribution of the parameters (given the observed and imputed data). The algorithm is based on the assumptions that the data come from a multivariate normal distribution and are missing at random. A similar algorithm that uses a different method for generating random draws of the parameters is described by King, Honaker, Joseph and Sheve (1999) and incorporated into their freeware program AMELIA ( Rather than using NORM, I have coded Schafer s algorithms into two macros for the SAS system, MISS and COMBINE ( The MISS macro was used to produce five complete data sets. After performing regression runs in each data set, the COMBINE macro produced the final estimates shown in column 5 of Table 1. All the estimates are close to the true values except, as expected, for the nonignorable case in which both coefficient show some upward bias. Somewhat surprisingly, the estimated standard errors for multiple imputation in column 5 are only slightly smaller than those produced by listwise deletion in column 2 which discarded half the cases. To further investigate the relative performance of listwise deletion and multiple imputation (using the data augmentation algorithm), I simulated 500 random samples, each of size 500, from the model described earlier. In each sample, I made values of Z missing by mechanism 2 missing at random whenever X < 0, a condition under which both listwise deletion and multiple imputation are at least approximately unbiased. Both methods were essentially unbiased across the repeated samples. However, the sampling variance of the multiple imputation estimates was considerably smaller: 39% less for the coefficient of X and 28% less for the coefficient of Z. Within-sample estimation of the standard errors for multiple 9

10 imputation were also good. The standard deviations across the 500 samples for the two coefficients were.0853 for X and.0600 for Z. By comparison, the average of the within-sample standard error estimates were.0835 for X and.0664 for Z. Nominal 95 percent confidence intervals for the multiple imputation estimates were computed by taking the point estimate plus or minus 1.96 times the estimated standard error. For the X coefficient, 96.4 percent of the confidence intervals included the true value. For Z, the coverage rate was 96.6 percent DISCUSSION We have seen that multiple imputation using the data augmentation algorithm produces reasonably good results, but those produced by SOLAS are severely biased, even when the data are missing completely at random. Why? There s nothing intrinsically wrong with the approximate Bayesian bootstrap algorithm that SOLAS uses to generate the missing values. The problem stems from the fact that when forming groups of similar cases, SOLAS only uses information from covariates that are associated with whether or not the data are missing. It does not use information from the associations among the variables themselves. For example, under the MCAR condition, missingness on Z is not related to either X or Y. Consequently, the constructed quintiles of similar cases are essentially random groupings of the observations. One might as well have imputed Z with random draws from all the observed values of Z. By contrast, the data augmentation method does use information on both X and Y to generate the imputed values. In short, SOLAS violates Rubin s principle that one must use a correct model in producing the imputed values. The failure of SOLAS under the missing at random conditions points out another crucial requirement for correct imputation. Under condition 2, when missingness on Z is highly associated with X, the imputed values of Z generated by SOLAS do make use of the correlation 10

11 between X and Z, but Y contributes nothing to the imputations. To get unbiased estimates in a regression analysis, it is essential to use the dependent variable to impute values for missing data on the predictor variables (Schafer 1997). Failure to include the dependent variable implies that the imputed values for the predictors will not be associated with the dependent variable, net of other variables in the imputation model. Consequently, regression coefficients for predictors with large fractions of imputed data will show substantial biases toward zero (Landerman et al. 1997). While it might seem that using the dependent variable to impute predictors would produce inflated coefficient estimates, this is avoided by the addition of the random component to the imputed value. Are there any conditions under which SOLAS lives up to its claims? Its imputation algorithm, as originally as originally described by Lavori et al (1995), was designed for a randomized experiment with repeated measures on the response variable. However, some subjects dropped out of the study before all response measurements could be made. The goal was to impute the missing responses based on previous response measurements, as well as baseline covariates. While I have not done any simulations for this kind of design, the results reported by Lavori et al. indicate that their method performs well in this situation. What is clear is that the method is not helpful in imputing predictor variables. Indeed, listwise deletion is clearly better in all the conditions studied here. Executives at Statistical Solutions Limited, the originator of SOLAS, have been made aware of this problem and have stated that a new version with improved algorithms is currently in preparation (personal communication). Nevertheless, the current version (1.1) of the software is still heavily marketed, with no warning about the dangers of imputing missing predictors. 11

12 REFERENCES Arbuckle, James L. (1995) Amos User s Guide. Chicago: Smallwaters. King, Gary, James Honaker, Anne Joseph and Kenneth Scheve (1999) Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation. Unpublished but available on the Web at Landerman, Lawrence R., Kenneth C. Land and Carl F. Pieper (1997) An Empirical Evaluation of the Predictive Mean Matching Method for Imputing Missing Values. Sociological Methods & Research 26: Lavori, P., R. Dawson and D. Shera (1995) A Multiple Imputation Strategy for Clinical Trials with Truncation of Patient Data. Statistics in Medicine 14: Little, Roderick J. A. and Donald B. Rubin (1989) The Analysis of Social Science Data with Missing Values. Sociological Methods and Research 18: Little, Roderick J. A. and Donald B. Rubin (forthcoming) Statistical Analysis with Missing Data, 2 nd edition. New York: Wiley. Rosenbaum, Paul R. and Donald B. Rubin (1983) The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika 70: Rubin, Donald B. (1976) Inference and Missing Data. Biometrika 63: Rubin, Donald B. (1977) Formalizing Subjective Notion about the Effect of Nonrespondents in Sample Surveys. Journal of the American Statistical Association 72: Rubin, Donald B. (1987) Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons. Rubin, Donald B. (1996) Multiple Imputation after 18+ Years (with discussion. Journal of the American Statistical Association 91:

13 Rubin, Donald B. and Nathaniel Schenker (1986) Multiple Imputation for Interval Estimation from Simple Random Samples with Ignorable Nonresponse. Journal of the American Statistical Association 81: Schafer, Joseph L. (1997) Analysis of Incomplete Multivariate Data. London: Chapman and Hall. 13

14 Table 1. Estimates for Coefficients in Regression of Y on X and Z Missing Data Mechanism MCAR 1 OLS on original data 2 Listwise Deletion 3 MI with SOLAS 5 MI with data augmentation X.979 (.011).969 (.016) (.016).976 (.012) Z (.012) (.017).667 (.020) (.016) MAR X (.012).986 (.025) (.013) (.025) (dep. on X) Z (.012) (.017).448 (.015).997 (.016) MAR X.993 (.012).695 (.015) (.013).985 (.021) (dep. on Y) Z (.012).708 (.015).746 (.023).997 (.013) Non-Ignorable X (.012).995 (.016) (.013) (.015) (dep. on Z) Z (.012) (.024) (.027) (.020) 14

### Missing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random

[Leeuw, Edith D. de, and Joop Hox. (2008). Missing Data. Encyclopedia of Survey Research Methods. Retrieved from http://sage-ereference.com/survey/article_n298.html] Missing Data An important indicator

### Dealing with Missing Data

Res. Lett. Inf. Math. Sci. (2002) 3, 153-160 Available online at http://www.massey.ac.nz/~wwiims/research/letters/ Dealing with Missing Data Judi Scheffer I.I.M.S. Quad A, Massey University, P.O. Box 102904

### Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Overview Missingness and impact on statistical analysis Missing data assumptions/mechanisms Conventional

### Missing Data Dr Eleni Matechou

1 Statistical Methods Principles Missing Data Dr Eleni Matechou matechou@stats.ox.ac.uk References: R.J.A. Little and D.B. Rubin 2nd edition Statistical Analysis with Missing Data J.L. Schafer and J.W.

### A Basic Introduction to Missing Data

John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item

### Handling attrition and non-response in longitudinal data

Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein

### Problem of Missing Data

VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians;

### MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

### MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS)

MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS) R.KAVITHA KUMAR Department of Computer Science and Engineering Pondicherry Engineering College, Pudhucherry, India DR. R.M.CHADRASEKAR Professor,

### A REVIEW OF CURRENT SOFTWARE FOR HANDLING MISSING DATA

123 Kwantitatieve Methoden (1999), 62, 123-138. A REVIEW OF CURRENT SOFTWARE FOR HANDLING MISSING DATA Joop J. Hox 1 ABSTRACT. When we deal with a large data set with missing data, we have to undertake

### Data Cleaning and Missing Data Analysis

Data Cleaning and Missing Data Analysis Dan Merson vagabond@psu.edu India McHale imm120@psu.edu April 13, 2010 Overview Introduction to SACS What do we mean by Data Cleaning and why do we do it? The SACS

### HANDLING DROPOUT AND WITHDRAWAL IN LONGITUDINAL CLINICAL TRIALS

HANDLING DROPOUT AND WITHDRAWAL IN LONGITUDINAL CLINICAL TRIALS Mike Kenward London School of Hygiene and Tropical Medicine Acknowledgements to James Carpenter (LSHTM) Geert Molenberghs (Universities of

### Statistical Analysis with Missing Data

Statistical Analysis with Missing Data Second Edition RODERICK J. A. LITTLE DONALD B. RUBIN WILEY- INTERSCIENCE A JOHN WILEY & SONS, INC., PUBLICATION Contents Preface PARTI OVERVIEW AND BASIC APPROACHES

### Discussion. Seppo Laaksonen 1. 1. Introduction

Journal of Official Statistics, Vol. 23, No. 4, 2007, pp. 467 475 Discussion Seppo Laaksonen 1 1. Introduction Bjørnstad s article is a welcome contribution to the discussion on multiple imputation (MI)

### CHOOSING APPROPRIATE METHODS FOR MISSING DATA IN MEDICAL RESEARCH: A DECISION ALGORITHM ON METHODS FOR MISSING DATA

CHOOSING APPROPRIATE METHODS FOR MISSING DATA IN MEDICAL RESEARCH: A DECISION ALGORITHM ON METHODS FOR MISSING DATA Hatice UENAL Institute of Epidemiology and Medical Biometry, Ulm University, Germany

### Analyzing Structural Equation Models With Missing Data

Analyzing Structural Equation Models With Missing Data Craig Enders* Arizona State University cenders@asu.edu based on Enders, C. K. (006). Analyzing structural equation models with missing data. In G.

ALLISON 1 TABLE OF CONTENTS 1. INTRODUCTION... 3 2. ASSUMPTIONS... 6 MISSING COMPLETELY AT RANDOM (MCAR)... 6 MISSING AT RANDOM (MAR)... 7 IGNORABLE... 8 NONIGNORABLE... 8 3. CONVENTIONAL METHODS... 10

### Dealing with Missing Data

Dealing with Missing Data Roch Giorgi email: roch.giorgi@univ-amu.fr UMR 912 SESSTIM, Aix Marseille Université / INSERM / IRD, Marseille, France BioSTIC, APHM, Hôpital Timone, Marseille, France January

### Missing Data. Paul D. Allison INTRODUCTION

4 Missing Data Paul D. Allison INTRODUCTION Missing data are ubiquitous in psychological research. By missing data, I mean data that are missing for some (but not all) variables and for some (but not all)

### Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis

Int. Journal of Math. Analysis, Vol. 5, 2011, no. 1, 1-13 Review of the Methods for Handling Missing Data in Longitudinal Data Analysis Michikazu Nakai and Weiming Ke Department of Mathematics and Statistics

### A Review of Methods for Missing Data

Educational Research and Evaluation 1380-3611/01/0704-353\$16.00 2001, Vol. 7, No. 4, pp. 353±383 # Swets & Zeitlinger A Review of Methods for Missing Data Therese D. Pigott Loyola University Chicago, Wilmette,

### Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and

### Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University

Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University 1 Outline Missing data definitions Longitudinal data specific issues Methods Simple methods Multiple

### Bayesian Multiple Imputation of Zero Inflated Count Data

Bayesian Multiple Imputation of Zero Inflated Count Data Chin-Fang Weng chin.fang.weng@census.gov U.S. Census Bureau, 4600 Silver Hill Road, Washington, D.C. 20233-1912 Abstract In government survey applications,

### Missing Data & How to Deal: An overview of missing data. Melissa Humphries Population Research Center

Missing Data & How to Deal: An overview of missing data Melissa Humphries Population Research Center Goals Discuss ways to evaluate and understand missing data Discuss common missing data methods Know

### Nonrandomly Missing Data in Multiple Regression Analysis: An Empirical Comparison of Ten Missing Data Treatments

Brockmeier, Kromrey, & Hogarty Nonrandomly Missing Data in Multiple Regression Analysis: An Empirical Comparison of Ten s Lantry L. Brockmeier Jeffrey D. Kromrey Kristine Y. Hogarty Florida A & M University

### Imputing Missing Data using SAS

ABSTRACT Paper 3295-2015 Imputing Missing Data using SAS Christopher Yim, California Polytechnic State University, San Luis Obispo Missing data is an unfortunate reality of statistics. However, there are

### Modern Methods for Missing Data

Modern Methods for Missing Data Paul D. Allison, Ph.D. Statistical Horizons LLC www.statisticalhorizons.com 1 Introduction Missing data problems are nearly universal in statistical practice. Last 25 years

### Longitudinal Studies, The Institute of Education, University of London. Square, London, EC1 OHB, U.K. Email: R.D.Wiggins@city.ac.

A comparative evaluation of currently available software remedies to handle missing data in the context of longitudinal design and analysis. Wiggins, R.D 1., Ely, M 2. & Lynch, K. 3 1 Department of Sociology,

### Sensitivity Analysis in Multiple Imputation for Missing Data

Paper SAS270-2014 Sensitivity Analysis in Multiple Imputation for Missing Data Yang Yuan, SAS Institute Inc. ABSTRACT Multiple imputation, a popular strategy for dealing with missing values, usually assumes

### , then the form of the model is given by: which comprises a deterministic component involving the three regression coefficients (

Multiple regression Introduction Multiple regression is a logical extension of the principles of simple linear regression to situations in which there are several predictor variables. For instance if we

### Item Imputation Without Specifying Scale Structure

Original Article Item Imputation Without Specifying Scale Structure Stef van Buuren TNO Quality of Life, Leiden, The Netherlands University of Utrecht, The Netherlands Abstract. Imputation of incomplete

### Missing Data Techniques for Structural Equation Modeling

Journal of Abnormal Psychology Copyright 2003 by the American Psychological Association, Inc. 2003, Vol. 112, No. 4, 545 557 0021-843X/03/\$12.00 DOI: 10.1037/0021-843X.112.4.545 Missing Data Techniques

### Best Practices for Missing Data Management in Counseling Psychology

Journal of Counseling Psychology 2010 American Psychological Association 2010, Vol. 57, No. 1, 1 10 0022-0167/10/\$12.00 DOI: 10.1037/a0018082 Best Practices for Missing Data Management in Counseling Psychology

### Combining Multiple Imputation and Inverse Probability Weighting

Combining Multiple Imputation and Inverse Probability Weighting Shaun Seaman 1, Ian White 1, Andrew Copas 2,3, Leah Li 4 1 MRC Biostatistics Unit, Cambridge 2 MRC Clinical Trials Unit, London 3 UCL Research

### Empirical Methods in Applied Economics

Empirical Methods in Applied Economics Jörn-Ste en Pischke LSE October 2005 1 Observational Studies and Regression 1.1 Conditional Randomization Again When we discussed experiments, we discussed already

### Handling missing data in Stata a whirlwind tour

Handling missing data in Stata a whirlwind tour 2012 Italian Stata Users Group Meeting Jonathan Bartlett www.missingdata.org.uk 20th September 2012 1/55 Outline The problem of missing data and a principled

### PATTERN MIXTURE MODELS FOR MISSING DATA. Mike Kenward. London School of Hygiene and Tropical Medicine. Talk at the University of Turku,

PATTERN MIXTURE MODELS FOR MISSING DATA Mike Kenward London School of Hygiene and Tropical Medicine Talk at the University of Turku, April 10th 2012 1 / 90 CONTENTS 1 Examples 2 Modelling Incomplete Data

### Association Between Variables

Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi

### APPLIED MISSING DATA ANALYSIS

APPLIED MISSING DATA ANALYSIS Craig K. Enders Series Editor's Note by Todd D. little THE GUILFORD PRESS New York London Contents 1 An Introduction to Missing Data 1 1.1 Introduction 1 1.2 Chapter Overview

### A Mixed Model Approach for Intent-to-Treat Analysis in Longitudinal Clinical Trials with Missing Values

Methods Report A Mixed Model Approach for Intent-to-Treat Analysis in Longitudinal Clinical Trials with Missing Values Hrishikesh Chakraborty and Hong Gu March 9 RTI Press About the Author Hrishikesh Chakraborty,

### PARTIAL LEAST SQUARES IS TO LISREL AS PRINCIPAL COMPONENTS ANALYSIS IS TO COMMON FACTOR ANALYSIS. Wynne W. Chin University of Calgary, CANADA

PARTIAL LEAST SQUARES IS TO LISREL AS PRINCIPAL COMPONENTS ANALYSIS IS TO COMMON FACTOR ANALYSIS. Wynne W. Chin University of Calgary, CANADA ABSTRACT The decision of whether to use PLS instead of a covariance

### IBM SPSS Missing Values 22

IBM SPSS Missing Values 22 Note Before using this information and the product it supports, read the information in Notices on page 23. Product Information This edition applies to version 22, release 0,

### Workpackage 11 Imputation and Non-Response. Deliverable 11.2

Workpackage 11 Imputation and Non-Response Deliverable 11.2 2004 II List of contributors: Seppo Laaksonen, Statistics Finland; Ueli Oetliker, Swiss Federal Statistical Office; Susanne Rässler, University

### Title: Categorical Data Imputation Using Non-Parametric or Semi-Parametric Imputation Methods

Masters by Coursework and Research Report Mathematical Statistics School of Statistics and Actuarial Science Title: Categorical Data Imputation Using Non-Parametric or Semi-Parametric Imputation Methods

### HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate

### When to Use Which Statistical Test

When to Use Which Statistical Test Rachel Lovell, Ph.D., Senior Research Associate Begun Center for Violence Prevention Research and Education Jack, Joseph, and Morton Mandel School of Applied Social Sciences

### Introduction to mixed model and missing data issues in longitudinal studies

Introduction to mixed model and missing data issues in longitudinal studies Hélène Jacqmin-Gadda INSERM, U897, Bordeaux, France Inserm workshop, St Raphael Outline of the talk I Introduction Mixed models

### Manual of Fisheries Survey Methods II: with periodic updates. Chapter 6: Sample Size for Biological Studies

January 000 Manual of Fisheries Survey Methods II: with periodic updates : Sample Size for Biological Studies Roger N. Lockwood and Daniel B. Hayes Suggested citation: Lockwood, Roger N. and D. B. Hayes.

### Dealing with missing data: Key assumptions and methods for applied analysis

Technical Report No. 4 May 6, 2013 Dealing with missing data: Key assumptions and methods for applied analysis Marina Soley-Bori msoley@bu.edu This paper was published in fulfillment of the requirements

### Missing Data: Our View of the State of the Art

Psychological Methods Copyright 2002 by the American Psychological Association, Inc. 2002, Vol. 7, No. 2, 147 177 1082-989X/02/\$5.00 DOI: 10.1037//1082-989X.7.2.147 Missing Data: Our View of the State

### 6/15/2005 7:54 PM. Affirmative Action s Affirmative Actions: A Reply to Sander

Reply Affirmative Action s Affirmative Actions: A Reply to Sander Daniel E. Ho I am grateful to Professor Sander for his interest in my work and his willingness to pursue a valid answer to the critical

### An Alternative Route to Performance Hypothesis Testing

EDHEC-Risk Institute 393-400 promenade des Anglais 06202 Nice Cedex 3 Tel.: +33 (0)4 93 18 32 53 E-mail: research@edhec-risk.com Web: www.edhec-risk.com An Alternative Route to Performance Hypothesis Testing

### SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way

### Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

### Imputation and Analysis. Peter Fayers

Missing Data in Palliative Care Research Imputation and Analysis Peter Fayers Department of Public Health University of Aberdeen NTNU Det medisinske fakultet Missing data Missing data is a major problem

### Missing data in randomized controlled trials (RCTs) can

EVALUATION TECHNICAL ASSISTANCE BRIEF for OAH & ACYF Teenage Pregnancy Prevention Grantees May 2013 Brief 3 Coping with Missing Data in Randomized Controlled Trials Missing data in randomized controlled

### Applied Missing Data Analysis in the Health Sciences. Statistics in Practice

Brochure More information from http://www.researchandmarkets.com/reports/2741464/ Applied Missing Data Analysis in the Health Sciences. Statistics in Practice Description: A modern and practical guide

### How To Use A Monte Carlo Study To Decide On Sample Size and Determine Power

How To Use A Monte Carlo Study To Decide On Sample Size and Determine Power Linda K. Muthén Muthén & Muthén 11965 Venice Blvd., Suite 407 Los Angeles, CA 90066 Telephone: (310) 391-9971 Fax: (310) 391-8971

### The General Linear Model: Theory

Gregory Carey, 1998 General Linear Model - 1 The General Linear Model: Theory 1.0 Introduction In the discussion of multiple regression, we used the following equation to express the linear model for a

### Health 2011 Survey: An overview of the design, missing data and statistical analyses examples

Health 2011 Survey: An overview of the design, missing data and statistical analyses examples Tommi Härkänen Department of Health, Functional Capacity and Welfare The National Institute for Health and

### Missing data: the hidden problem

white paper Missing data: the hidden problem Draw more valid conclusions with SPSS Missing Data Analysis white paper Missing data: the hidden problem 2 Just about everyone doing analysis has some missing

### What s Missing? Handling Missing Data with SAS

What s Missing? Handling Missing Data with SAS Timothy B. Gravelle Principal Scientist & Director, Insights Lab September 13, 2013 2000-2013 PriceMetrix Inc. Patents granted and pending. Missing data Survey

### SAMPLE SIZE ESTIMATION USING KREJCIE AND MORGAN AND COHEN STATISTICAL POWER ANALYSIS: A COMPARISON. Chua Lee Chuan Jabatan Penyelidikan ABSTRACT

SAMPLE SIZE ESTIMATION USING KREJCIE AND MORGAN AND COHEN STATISTICAL POWER ANALYSIS: A COMPARISON Chua Lee Chuan Jabatan Penyelidikan ABSTRACT In most situations, researchers do not have access to an

### Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

### Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Many economic models involve endogeneity: that is, a theoretical relationship does not fit

### MS&E 226: Small Data. Lecture 17: Additional topics in inference (v1) Ramesh Johari

MS&E 226: Small Data Lecture 17: Additional topics in inference (v1) Ramesh Johari ramesh.johari@stanford.edu 1 / 34 Warnings 2 / 34 Modeling assumptions: Regression Remember that most of the inference

### Mode and Patient-mix Adjustment of the CAHPS Hospital Survey (HCAHPS)

Mode and Patient-mix Adjustment of the CAHPS Hospital Survey (HCAHPS) April 30, 2008 Abstract A randomized Mode Experiment of 27,229 discharges from 45 hospitals was used to develop adjustments for the

### A Comparison of Three Brand Evaluation Procedures

A Comparison of Three Brand Evaluation Procedures YORAM WIND, JOSEPH DENNY, AND ARTHUR CUNNINGHAM im his 1977 AAPOR presidential address, Irving Crespi (1977) concluded his observations on the current

### MULTIPLE REGRESSION WITH CATEGORICAL DATA

DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 86 MULTIPLE REGRESSION WITH CATEGORICAL DATA I. AGENDA: A. Multiple regression with categorical variables. Coding schemes. Interpreting

### Overview. Longitudinal Data Variation and Correlation Different Approaches. Linear Mixed Models Generalized Linear Mixed Models

Overview 1 Introduction Longitudinal Data Variation and Correlation Different Approaches 2 Mixed Models Linear Mixed Models Generalized Linear Mixed Models 3 Marginal Models Linear Models Generalized Linear

### Multicollinearity in Regression Models

00 Jeeshim and KUCC65. (003-05-09) Multicollinearity.doc Introduction Multicollinearity in Regression Models Multicollinearity is a high degree of correlation (linear dependency) among several independent

### Checklists and Examples for Registering Statistical Analyses

Checklists and Examples for Registering Statistical Analyses For well-designed confirmatory research, all analysis decisions that could affect the confirmatory results should be planned and registered

### Statistical modelling with missing data using multiple imputation. Session 4: Sensitivity Analysis after Multiple Imputation

Statistical modelling with missing data using multiple imputation Session 4: Sensitivity Analysis after Multiple Imputation James Carpenter London School of Hygiene & Tropical Medicine Email: james.carpenter@lshtm.ac.uk

### Pooling and Meta-analysis. Tony O Hagan

Pooling and Meta-analysis Tony O Hagan Pooling Synthesising prior information from several experts 2 Multiple experts The case of multiple experts is important When elicitation is used to provide expert

### Chapter 7 Factor Analysis SPSS

Chapter 7 Factor Analysis SPSS Factor analysis attempts to identify underlying variables, or factors, that explain the pattern of correlations within a set of observed variables. Factor analysis is often

### NCEE 2009-0049. What to Do When Data Are Missing in Group Randomized Controlled Trials

NCEE 2009-0049 What to Do When Data Are Missing in Group Randomized Controlled Trials What to Do When Data Are Missing in Group Randomized Controlled Trials October 2009 Michael J. Puma Chesapeake Research

### CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

### Multiple Imputation of Missing Income Data in the National Health Interview Survey

Multiple Imputation of Missing Income Data in the National Health Interview Survey Nathaniel SCHENKER, Trivellore E. RAGHUNATHAN, Pei-LuCHIU, Diane M. MAKUC, Guangyu ZHANG, and Alan J. COHEN The National

### Advances in Missing Data Methods and Implications for Educational Research. Chao-Ying Joanne Peng, Indiana University-Bloomington

Advances in Missing Data 1 Running head: MISSING DATA METHODS Advances in Missing Data Methods and Implications for Educational Research Chao-Ying Joanne Peng, Indiana University-Bloomington Michael Harwell,

### Simple Regression Theory II 2010 Samuel L. Baker

SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the

### A Review of Methods. for Dealing with Missing Data. Angela L. Cool. Texas A&M University 77843-4225

Missing Data 1 Running head: DEALING WITH MISSING DATA A Review of Methods for Dealing with Missing Data Angela L. Cool Texas A&M University 77843-4225 Paper presented at the annual meeting of the Southwest

### Simple Linear Regression Chapter 11

Simple Linear Regression Chapter 11 Rationale Frequently decision-making situations require modeling of relationships among business variables. For instance, the amount of sale of a product may be related

### Part 2: Analysis of Relationship Between Two Variables

Part 2: Analysis of Relationship Between Two Variables Linear Regression Linear correlation Significance Tests Multiple regression Linear Regression Y = a X + b Dependent Variable Independent Variable

### The Basics of Regression Analysis. for TIPPS. Lehana Thabane. What does correlation measure? Correlation is a measure of strength, not causation!

The Purpose of Regression Modeling The Basics of Regression Analysis for TIPPS Lehana Thabane To verify the association or relationship between a single variable and one or more explanatory One explanatory

### Calculating Interval Forecasts

Calculating Chapter 7 (Chatfield) Monika Turyna & Thomas Hrdina Department of Economics, University of Vienna Summer Term 2009 Terminology An interval forecast consists of an upper and a lower limit between

### 2. Making example missing-value datasets: MCAR, MAR, and MNAR

Lecture 20 1. Types of missing values 2. Making example missing-value datasets: MCAR, MAR, and MNAR 3. Common methods for missing data 4. Compare results on example MCAR, MAR, MNAR data 1 Missing Data

### arxiv:1301.2490v1 [stat.ap] 11 Jan 2013

The Annals of Applied Statistics 2012, Vol. 6, No. 4, 1814 1837 DOI: 10.1214/12-AOAS555 c Institute of Mathematical Statistics, 2012 arxiv:1301.2490v1 [stat.ap] 11 Jan 2013 ADDRESSING MISSING DATA MECHANISM

### Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out

Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out Sandra Taylor, Ph.D. IDDRC BBRD Core 23 April 2014 Objectives Baseline Adjustment Introduce approaches Guidance

### Econometrics Simple Linear Regression

Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight

### How to choose an analysis to handle missing data in longitudinal observational studies

How to choose an analysis to handle missing data in longitudinal observational studies ICH, 25 th February 2015 Ian White MRC Biostatistics Unit, Cambridge, UK Plan Why are missing data a problem? Methods:

### Software Cost Estimation with Incomplete Data

890 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 27, NO. 10, OCTOBER 2001 Software Cost Estimation with Incomplete Data Kevin Strike, Khaled El Emam, and Nazim Madhavji AbstractÐThe construction of

### Summary of Probability

Summary of Probability Mathematical Physics I Rules of Probability The probability of an event is called P(A), which is a positive number less than or equal to 1. The total probability for all possible

### Lecture 18 Linear Regression

Lecture 18 Statistics Unit Andrew Nunekpeku / Charles Jackson Fall 2011 Outline 1 1 Situation - used to model quantitative dependent variable using linear function of quantitative predictor(s). Situation

### Statistics & Regression: Easier than SAS

ST003 Statistics & Regression: Easier than SAS Vincent Maffei, Anthem Blue Cross and Blue Shield, North Haven, CT Michael Davis, Bassett Consulting Services, Inc., North Haven, CT ABSTRACT In this paper

### Epidemiology-Biostatistics Exam Exam 2, 2001 PRINT YOUR LEGAL NAME:

Epidemiology-Biostatistics Exam Exam 2, 2001 PRINT YOUR LEGAL NAME: Instructions: This exam is 30% of your course grade. The maximum number of points for the course is 1,000; hence this exam is worth 300

### Published online: 25 Sep 2014.

This article was downloaded by: [Cornell University Library] On: 23 February 2015, At: 11:23 Publisher: Routledge Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: