Analyzing Structural Equation Models With Missing Data
Craig Enders*
Arizona State University
cenders@asu.edu

Based on Enders, C. K. (2006). Analyzing structural equation models with missing data. In G. R. Hancock & R. O. Mueller (Eds.), Structural Equation Modeling: A Second Course (pp. -4). Greenwich, CT: Information Age Publishing.

* AERA Extended Course, San Francisco, April 6 & 7, 2006

Overview
- Missing data theory, and assumptions pertaining to missing data analyses
- Full information maximum likelihood (FIML), and multiple imputation (MI)
- Incorporating information from auxiliary variables
Running Example
- Concepts are illustrated using a single-factor CFA model with seven manifest indicators
- The indicators are a subset of items from the Eating Attitudes Test (EAT), a widely used inventory for assessing eating disorder risk
- The data also include body mass index (BMI) scores (BMI = weight(kg)/[height(m)]²)
- The data are available upon request from Dr. Enders

Running Example
- [Figure: single-factor CFA — DRIVE FOR THINNESS measured by:]
  EAT24: Like stomach to be empty
  EAT14: Am preoccupied with... fat on body
  EAT12: Think about burning calories...
  EAT11: Desire to be thinner
  EAT10: Feel extremely guilty after eating
  EAT2: Avoid eating when I am hungry
  EAT1: Am terrified about being overweight
Missing Data Mechanisms: Overview
- Rubin (1976) provided a taxonomy for missing data problems: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR)
- Missing data mechanisms describe how the missing values are related to the data, if at all
- Rubin's mechanisms can be viewed as assumptions that dictate the performance of a particular missing data technique

Missing Data Mechanisms: Missing Completely At Random (MCAR)
- Missing values on Y are unrelated to other variables, as well as to the unobserved values of Y
- The observed data are a random sample of the hypothetically complete data
- For example, the EAT was not administered due to a clerical error or scheduling difficulties, or respondents randomly failed to make dark enough marks in places on the response sheet
- This is an unusually strict assumption in most cases
- FIML and MI are unbiased, as are most traditional deletion methods (listwise; pairwise)
Missing Data Mechanisms: Missing At Random (MAR)
- Missing values on Y can depend on other variables, but not on the unobserved values of Y
- For example, missing EAT item responses are related to BMI, such that individuals with very low BMI scores are more likely to skip certain items
- FIML and MI assume MAR, and are unbiased
- Traditional deletion (pairwise; listwise) methods are biased
- Warning: MAR is defined relative to the variables in the analysis. If the cause of the missing data is a measured variable (e.g., BMI) that does not appear in the analysis model, MAR does not hold

Missing Data Mechanisms: Missing Not At Random (MNAR)
- Missing values on Y depend on the unobserved values of Y
- For example, missing EAT scores are related to the respondent's level of eating disorder risk, such that individuals who are highly preoccupied with food tend to skip those items
- FIML and MI will yield biased estimates under MNAR, although less biased than traditional methods
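The three mechanisms are easy to see in simulation. The sketch below (hypothetical data, not the EAT data set) generates correlated BMI-like and EAT-like scores, deletes the outcome under each mechanism, and compares the complete-case mean of the outcome:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Correlated standardized scores (population correlation .60, as in the example)
bmi = rng.standard_normal(n)
eat = 0.6 * bmi + 0.8 * rng.standard_normal(n)

# MCAR: deletion is unrelated to any variable
mcar = rng.random(n) < 0.3

# MAR: low-BMI respondents are more likely to skip the EAT item
mar = rng.random(n) < np.where(bmi < -1.0, 0.7, 0.1)

# MNAR: deletion depends on the unobserved EAT score itself
mnar = rng.random(n) < np.where(eat > 1.0, 0.7, 0.1)

for label, miss in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(label, round(eat[~miss].mean(), 2))
```

Under MCAR the complete-case mean stays near the population value of zero; under MAR it is pulled upward (the deleted low-BMI cases tend to have low scores), and under MNAR it is pulled downward. The MAR distortion is recoverable by a method that conditions on BMI, which is exactly why FIML and MI remain unbiased there.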
The Multivariate Normal Probability Density Function (PDF)
- The shape of the multivariate normal curve is defined by the familiar equation

  L_i = (2π)^(-p/2) |Σ|^(-1/2) e^(-(y_i - µ)' Σ⁻¹ (y_i - µ)/2)

- Plugging in a person's raw data values gives us the likelihood (i.e., the relative probability) of that set of scores, given the values of µ and Σ
- Suppose that µ = 0, σ = 1, and r = .60

Bivariate Normal Distribution
- Two cases: X = -.5, Y = 0 (l = .164) and X = -1.5, Y = -1 (l = .064)
- Case 1 is closer to the parameter values, and thus has a higher likelihood
- [Figure: bivariate normal surface over X and Y, with the two cases marked]
The Log Likelihood
- Likelihoods tend to be very small numbers; taking the natural log of the likelihood makes the math a bit more tractable

  log L_i = -(p/2) log(2π) - (1/2) log|Σ| - (1/2)(y_i - µ)' Σ⁻¹ (y_i - µ)

- The log likelihood still quantifies the same thing: the relative probability of a person's scores, given a normal distribution with a particular µ and Σ

Casewise Log Likelihoods
- Two cases: X = -.5, Y = 0 (l = .164, LL = -1.81) and X = -1.5, Y = -1 (l = .064, LL = -2.75)
- Case 1 has a higher log likelihood, given these parameters
- [Figure: bivariate normal surface over X and Y, with the two cases marked]
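The likelihood and log-likelihood calculations above can be verified numerically. A minimal sketch, assuming the standardized bivariate normal from the example (µ = 0, σ = 1, r = .60) and two illustrative cases:

```python
import numpy as np

def mvn_loglik(y, mu, sigma):
    """Casewise log likelihood under a multivariate normal with mean mu, covariance sigma."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    p = len(y)
    diff = y - mu
    _, logdet = np.linalg.slogdet(sigma)          # log |Sigma|
    mahal = diff @ np.linalg.solve(sigma, diff)   # (y - mu)' Sigma^-1 (y - mu)
    return -0.5 * (p * np.log(2 * np.pi) + logdet + mahal)

mu = np.zeros(2)
sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])  # unit variances, correlation .60

for y in ([-0.5, 0.0], [-1.5, -1.0]):
    ll = mvn_loglik(y, mu, sigma)
    # Case 1: LL ≈ -1.81, l ≈ .164; Case 2: LL ≈ -2.75, l ≈ .064
    print(y, round(ll, 2), round(float(np.exp(ll)), 3))
```

Exponentiating each log likelihood recovers the raw likelihood, confirming that the case nearer the means has the higher relative probability.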
Log Likelihood Functions
- With complete data, each case's contribution to the log likelihood is

  log L_i = -(p/2) log(2π) - (1/2) log|Σ| - (1/2)(y_i - µ)' Σ⁻¹ (y_i - µ)

- In the missing data context, each ith case's contribution to the log likelihood is

  log L_i = -(p_i/2) log(2π) - (1/2) log|Σ_i| - (1/2)(y_i - µ_i)' Σ_i⁻¹ (y_i - µ_i)

The Missing Data Log Likelihood
- The sample log likelihood is obtained by summing over the N cases

  log L = K - (1/2) Σ_{i=1}^{N} log|Σ_i| - (1/2) Σ_{i=1}^{N} (y_i - µ_i)' Σ_i⁻¹ (y_i - µ_i)

- In the current CFA example, µ and Σ are functions of the model parameters: Σ = ΛΦΛ' + Θ and µ = τ + Λκ
How Does ML Differ With Missing Data?
- The only difference between the complete data and missing data log likelihood is the i subscript
- This allows the size and content of the data and parameter arrays to vary across cases
- The log likelihood is based only on those variables (and parameters) for which case i has complete data

Example (1)
- Consider a case with three variables, Y1, Y2, and Y3
- The contribution to the log likelihood for cases who have complete data is

  log L_i = K_i - (1/2) log|Σ| - (1/2)(y_i - µ)' Σ⁻¹ (y_i - µ)

  where Σ = [[σ11, σ12, σ13], [σ21, σ22, σ23], [σ31, σ32, σ33]], y_i = (y1, y2, y3)', and µ = (µ1, µ2, µ3)'
Example (2)
- The contribution to the log likelihood for cases who are missing Y2 is

  log L_i = K_i - (1/2) log|Σ_i| - (1/2)(y_i - µ_i)' Σ_i⁻¹ (y_i - µ_i)

  where Σ_i = [[σ11, σ13], [σ31, σ33]], y_i = (y1, y3)', and µ_i = (µ1, µ3)'

- The rows and columns for Y2 were removed from the full 3 × 3 arrays

Notes on FIML
- FIML is not much different from complete-data ML estimation, but incorporates a mean structure
- Missing values are not imputed! Parameters and standard errors are estimated directly using all the observed data
- Including the partially complete cases serves to steer the estimation algorithm toward a more accurate set of parameters, via the relations among the variables
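The row-and-column deletion described above is straightforward to sketch. This is a minimal illustration of the idea (not any package's actual FIML routine): sum the casewise log likelihoods, keeping only the elements of µ and Σ for each case's observed variables.

```python
import numpy as np

def fiml_loglik(data, mu, sigma):
    """Sum casewise normal log likelihoods; NaN marks a missing value, and the
    corresponding rows/columns of mu and sigma are dropped for that case."""
    total = 0.0
    for row in np.asarray(data, dtype=float):
        obs = ~np.isnan(row)                   # observed-variable indicator
        y_i, mu_i = row[obs], mu[obs]
        sigma_i = sigma[np.ix_(obs, obs)]      # delete rows/columns for missing vars
        p_i = int(obs.sum())
        diff = y_i - mu_i
        _, logdet = np.linalg.slogdet(sigma_i)
        mahal = diff @ np.linalg.solve(sigma_i, diff)
        total += -0.5 * (p_i * np.log(2 * np.pi) + logdet + mahal)
    return total

mu = np.zeros(3)
sigma = np.eye(3)                      # toy parameter values for illustration
data = [[0.2, np.nan, -0.1],           # case missing Y2: uses the 2 x 2 submatrix
        [0.5, 1.0, 0.0]]               # complete case: uses the full 3 x 3 matrix
print(fiml_loglik(data, mu, sigma))
```

An optimizer that maximizes this function over the model-implied µ and Σ is, in essence, what FIML estimation does.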
Incorporating Auxiliary Variables
- In order for MAR to hold, the cause of the missing data must appear in the analysis model
- An auxiliary variable (AV) such as BMI is ancillary to one's substantive hypotheses
- Adopting an inclusive analysis strategy that incorporates AVs can make MAR more plausible
- A useful AV is either a potential cause or correlate of missingness, or a correlate of the variable that is missing

The Saturated Correlates Model
- Graham (2003) outlined a saturated correlates model for including AVs in an SEM analysis
- Three rules apply:
  1. AVs should be correlated with observed (not latent) predictors in the model
  2. AVs should be correlated with the residual terms from observed (not latent) dependent variables
  3. AVs should be correlated with one another
The Saturated Correlates Model: EAT Analysis CFA Model
- [Figure: single-factor CFA with DRIVE FOR THINNESS measured by EAT24, EAT14, EAT12, EAT11, EAT10, EAT2, and EAT1; BMI enters as a saturated correlate]

The Saturated Correlates Model: Path Model Example
- [Figure: path model with X predicting Y1 and Y2 (residuals ζ1, ζ2); auxiliary variables AV1 and AV2 are correlated according to the saturated correlates rules]
The Saturated Correlates Model: General Structural Model Example
- [Figure: structural model with latent X (indicators x1, x2) predicting latent Y (indicators y1, y2) and observed Z (residual ζ); auxiliary variables AV1 and AV2 are correlated according to the saturated correlates rules]

FIML Miscellanea
- When given the choice, compute standard errors using the observed information matrix; the observed information assumes MAR, while the expected information requires MCAR
- Rescaled test statistics for nonnormal data are available in Mplus and EQS (see Yuan and Bentler, 2000)
- Sandwich estimator (i.e., "robust") standard errors are also available, but assume MCAR data
Overview Of Multiple Imputation
- Multiple copies (e.g., m = 20) of the incomplete data set are created
- Each data set is imputed with different estimates of the missing values using multiple regression
- The SEM analysis is performed on each of the filled-in data sets
- A single point estimate and standard error are obtained by averaging over the m data sets

Imputation And Analysis Phase
- Unlike FIML, missing data are handled in an imputation phase that is distinct from the analysis
- The imputation phase is used to create the m imputed data sets
- Once the data sets are constructed, any number of different analyses can be performed using the same set of imputations
Imputation Phase: Imputation (I) Step
- Begin with an initial estimate of µ and Σ
- Use µ and Σ to construct regression equations for each missing data pattern
- Impute missing values with predicted scores, and add a normally distributed residual to each to restore lost variability

Imputation Phase: Posterior (P) Step
- Compute new estimates of µ and Σ using the imputed data
- Conditional on the updated estimates of µ and Σ, randomly sample new values for µ and Σ from a simulated sampling distribution
- Carry the new values of µ and Σ forward to the next I step, and impute missing values using a new set of regression equations
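The I and P steps can be sketched for a single incomplete variable. This toy version (hypothetical data) performs the I step exactly as described; the P step is simplified to a point update of the parameters, whereas a proper MI algorithm would instead draw new parameter values from their simulated posterior at that point.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy bivariate data: x is complete, y is missing (NaN) for roughly 30% of cases
n = 200
x = rng.standard_normal(n)
y = 0.6 * x + 0.8 * rng.standard_normal(n)
y[rng.random(n) < 0.3] = np.nan
miss = np.isnan(y)

# Initial parameter estimates from the complete cases
b1, b0 = np.polyfit(x[~miss], y[~miss], 1)
res_sd = np.std(y[~miss] - (b0 + b1 * x[~miss]))

for step in range(10):
    # I step: fill in predicted scores plus a normal residual to restore variability
    y_imp = y.copy()
    y_imp[miss] = b0 + b1 * x[miss] + rng.normal(0.0, res_sd, miss.sum())

    # P step (simplified): re-estimate the parameters from the imputed data.
    # Proper MI would *draw* new parameters from a simulated sampling
    # distribution here, so each data set reflects parameter uncertainty.
    b1, b0 = np.polyfit(x, y_imp, 1)
    res_sd = np.std(y_imp - (b0 + b1 * x))

print(round(b1, 2), round(res_sd, 2))
```

Repeating this cycle with the full parameter draw, and saving the imputed data set every so many cycles, is what produces the m imputed data sets.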
EAT Example
- [Table: imputed EAT item scores for selected cases across the m = 1, 2, ..., 20 data sets; each missing value (?) receives a different imputed value in each data set]

Analysis Phase: Combining Estimates
- After the imputations are created, the model of interest is fit to each of the m data sets
- A point estimate for any parameter of interest is obtained by averaging the estimates from the m analyses

  θ̄ = (1/m) Σ_{i=1}^{m} θ̂_i
Analysis Phase: Combining S.E.s
- Between-imputation variance is the variance of the parameter estimate across the m imputations

  B = (1/(m - 1)) Σ_{i=1}^{m} (θ̂_i - θ̄)²

- Within-imputation variance is the mean of the m squared S.E.s

  V̄ = (1/m) Σ_{i=1}^{m} V_i

- The total variance (i.e., squared S.E.) is

  T = V̄ + (1 + 1/m) B

Single Parameter Significance Tests
- Single parameter tests are obtained via a t statistic

  t = θ̄ / √T

- The degrees of freedom are complex, and depend on the between- and within-imputation variance and m
- Use adjusted degrees of freedom (Barnard & Rubin, 1999) whenever possible
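The pooling formulas collect into a short routine. A sketch with made-up estimates and standard errors (the values are purely illustrative):

```python
import numpy as np

def pool(estimates, ses):
    """Combine m parameter estimates and standard errors with Rubin's rules."""
    estimates = np.asarray(estimates, float)
    ses = np.asarray(ses, float)
    m = len(estimates)
    theta_bar = estimates.mean()            # pooled point estimate
    v_bar = (ses ** 2).mean()               # within-imputation variance
    b = estimates.var(ddof=1)               # between-imputation variance
    t_var = v_bar + (1 + 1 / m) * b         # total variance (squared pooled S.E.)
    se = np.sqrt(t_var)
    return theta_bar, se, theta_bar / se    # estimate, pooled S.E., t statistic

# Five hypothetical estimates of one loading and their standard errors
est, se, t = pool([0.52, 0.48, 0.55, 0.50, 0.45],
                  [0.10, 0.11, 0.10, 0.12, 0.11])
print(round(est, 3), round(se, 3), round(t, 2))
```

Note that the pooled squared S.E. exceeds the average within-imputation variance by (1 + 1/m)B, which is how MI propagates the uncertainty due to the missing data into the significance test.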
Incorporating Auxiliary Variables
- AVs are incorporated as additional predictor variables in the imputation phase
- The AVs need not be included in the subsequent analyses, because the imputed values have already been conditioned on the AVs
- As a general rule, the imputation phase should include all variables and design effects (e.g., interaction terms) that appear in the subsequent analysis model

MI Miscellanea
- Mplus and LISREL have functions that automate the process of estimating and combining parameter estimates and standard errors
- χ² statistics can be combined, but this is more complex than a simple average (e-mail Dr. Enders if you want a SAS program to do this)
- It is unclear how to combine fit indices at this point
- Assessing MI convergence requires special graphical techniques (see the book chapter)
Selected Analysis Results
- [Table: factor loadings (with standard errors) for EAT1, EAT2, EAT10, EAT11, EAT12, EAT14, and EAT24 estimated by FIML and MI]

A Final Comparison of FIML and MI
- Both procedures assume MAR data and multivariate normality
- FIML and MI should yield very similar estimates if the same set of input variables is used
- MI is more complex to use, but the resulting analyses don't require specialized software
- If an estimation routine exists, it is probably more straightforward to use FIML and incorporate AVs into the analysis