Using Mplus individual residual plots for diagnostics and model evaluation in SEM

Tihomir Asparouhov and Bengt Muthén

Mplus Web Notes: No. 20

October 6, 2014

1 Introduction

A variety of plots are available in Mplus 7.2 that can be used to visually evaluate the fit of the model. We can inspect the fit at a more detailed level, where individual observations are plotted. Such a level of detail usually cannot be obtained from standard SEM fit statistics such as chi-square statistics and the CFI/TLI indices. These standard SEM fit statistics are usually based on global model fit, i.e., fit for the entire population as a whole rather than fit for individual observations. In this paper we provide several examples where the plots can be used to quickly discover patterns in the data that are not accounted for in the model.

Consider, for example, a SEM model that contains the following relationships between the dependent variables Y, the latent variables η, and the covariates X:

η̂ = E(η | Y, X)

Y = ν + λη + βX + ε
η = α + γX + ζ

where ν, α, λ, γ, and β are model parameters; ν and α are vectors, and λ, γ, and β are matrices. Mplus can be used to compute the factor scores η̂ = E(η | Y, X) and the model estimated/predicted values for Y, which we denote by Ŷ. Mplus computes two versions of the predicted values of Y. The first one uses both the factor scores and the covariates, i.e.,

Ŷ₁ = ν̂ + λ̂η̂ + β̂X.

The second method for evaluating the predicted Y values is

Ŷ₂ = E(Y | X).

The second version of the predicted value is commonly used in statistical practice, while the first version is more useful in a SEM setting because it is more directly related to the actual estimated model and also uses the factor scores to predict the Y values. The first method, however, has a slight caveat. Because the factor score η̂ already incorporates the information contained in the dependent variable Y, Ŷ₁ will also contain information from Y. Therefore we will be using Y to predict Y, which can in itself be problematic as it lacks the rigor of a pure statistical conditional expectation. However, when the latent variables η are measured well, with a sufficient number of accurate measurements, the dependence between η̂ and a single Y variable, i.e., a particular indicator, will be sufficiently small that it can be ignored. On a more technical level, the purity of the Ŷ₂ definition guarantees that Cov(Ŷ₂, ε) = 0, while Cov(Ŷ₁, ε) ≠ 0. In most cases we assume that the systematic/predicted part of a variable is independent of its residual. Note, however, that if the measurement instrument is good, Cov(Ŷ₁, ε) ≈ 0, so any substantial violation of the above equation could be interpreted as a model misspecification. Estimates for the individual-level residuals ε can also be formed using either of the estimated Y values:

Y res,1 = Y − Ŷ₁
Y res,2 = Y − Ŷ₂.

Thus a common check that can be done using the graphing utilities is to inspect whether Y res and Ŷ are independent.
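As a minimal sketch of how such quantities could be requested in an Mplus run (the data file and variable names here are hypothetical, not taken from this note), TYPE = PLOT3 makes the scatter plot facilities available in the plot menu and SAVE = FSCORES writes the estimated factor scores:

  TITLE:     One-factor model with individual residual diagnostics (sketch)
  DATA:      FILE = example.dat;     ! hypothetical data file
  VARIABLE:  NAMES = y1-y10 x;
  MODEL:     f BY y1-y10;            ! measurement part: Y = nu + lambda*eta + epsilon
             f ON x;                 ! structural part: eta = alpha + gamma*X + zeta
  PLOT:      TYPE = PLOT3;           ! enables scatter plots of observed values,
                                     ! estimated values, and residuals
  SAVEDATA:  FILE = fscores.dat;     ! hypothetical file for saved scores
             SAVE = FSCORES;         ! saves the estimated factor scores (eta-hat)
  OUTPUT:    RESIDUAL;               ! estimated vs. observed moments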

In this paper we provide some simple examples where visually inspecting the scatter plots between the dependent variables Y, the estimated values Ŷ, the residual values Y res, the estimated factor scores η̂, and the covariates X can lead to useful discoveries and model modifications that could otherwise be overlooked. For our examples we use simulated data, so that the illustration plots are somewhat clearer than those that might be found in real practical examples; the methodology, however, applies to real examples as well. The plots do not establish statistical significance; that should come, as usual, from formal statistical testing. The plots, however, can be used to adjust the model more meaningfully. Model modification indices can sometimes produce large values for model parameters without clear justification, while model modifications that originate from a scatter plot are usually quite easy to interpret and explain. The power to detect model misspecifications with plots naturally depends on the person analyzing the scatter plot, so a formal statement regarding the power of a plot is not possible; presumably, however, that power will be lower than with purely numerical statistical tools. Usually the sample size does little to improve the effectiveness of a particular scatter plot. For example, doubling the number of plotted points by replicating each plotted point will make no difference in the plot, while any statistical test will increase its power. Our ability to detect a missing effect visually depends on the size of the effect rather than its statistical significance. The examples provided in this paper are by no means exhaustive; they are just a few simple illustrations, and other scatter plots can be examined to cross-validate the estimated model.

2 Non-linear trend

In this example we generate data according to a factor analysis model with one factor, 10 indicator variables, and one covariate. All loadings λ are set to 1, all intercept parameters are set to 0, all residual variances are set to 1, and all direct effect coefficients β are set to 0. We generate the data with a non-linear effect on the factor,

η = γX + F(X) + ζ,

where X is a standard normal variable, γ = 1, and F(X) takes one of these three forms: 0, X², exp(X). As we vary the function F we obtain three different samples, each generated with 500 observations. Next we estimate a standard factor analysis model in which the non-linear trend is not accounted for, i.e., we estimate the same model as the data-generating model but without the predictor F(X). We plot the estimated factor score η̂ against the predictor X. The plots for the three samples are given in Figures 1-3.

The results are very clear. In the first sample the model is adequate: the linear trend between η and X is correct and the plot reflects the linear trend that is in the model. In the second sample, Figure 2 shows a quadratic relationship between η and X; thus the model is insufficient and there is a need to add X² as a predictor of η. In the third sample, Figure 3 shows an exponential relationship between η and X; thus the model is insufficient and there is a need to add exp(X) as a predictor of η.

Figure 1: Sample 1 - linear trend, F(X) = 0

Figure 2: Sample 2 - quadratic trend, F(X) = X²

Figure 3: Sample 3 - exponential trend, F(X) = exp(X)

None of the standard fit statistics, such as the chi-square test, CFI, and TLI, detected the misfit/inadequacy of the model in the second and third samples. In addition, the plots can actually provide guidance regarding the type of non-linearity that should be explored: the quadratic and the exponential trends are clearly distinguishable.
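When the η̂ versus X plot reveals a quadratic trend as in Figure 2, the non-linear term can be constructed with the DEFINE command and added as a predictor of the factor. The following is a minimal input sketch under the setup of this section (the data file name is hypothetical):

  TITLE:     Factor model with a quadratic effect of X (sketch)
  DATA:      FILE = sample2.dat;          ! hypothetical data file
  VARIABLE:  NAMES = y1-y10 x;
             USEVARIABLES = y1-y10 x xsq; ! variables created in DEFINE go last
  DEFINE:    xsq = x**2;                  ! quadratic term; EXP(x) could be used
                                          ! instead for an exponential trend
  MODEL:     f BY y1-y10;
             f ON x xsq;                  ! eta = alpha + gamma*X + gamma2*X^2 + zeta
  PLOT:      TYPE = PLOT3;                ! re-inspect the eta-hat vs. x plot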

3 Using residual values to detect direct effects

In this section we generate a sample as in the previous section, with 10 indicator variables, one factor, and 5 standard normal covariates X₁, ..., X₅.

All γᵢ coefficients are set to 1. We also add one direct effect to the data-generating model, from X₁ to Y₁, i.e., β₁₁ = 1. We then estimate the MIMIC model without the direct effect, i.e., assuming that β₁₁ = 0. We consider 4 different scatter plots: Y₁ vs. X₁, given in Figure 4; Y₂ vs. X₁, given in Figure 5; Y₁,res vs. X₁, given in Figure 6; and Y₂,res vs. X₁, given in Figure 7.

Both Figure 4 and Figure 5 show a linear trend between Yᵢ and X₁. These trends do not indicate any misspecification because they are implied by the estimated model. However, according to the estimated model, which did not account for the direct effect from X₁ to Y₁, the residual variables should be independent of X₁. Figure 6 shows a linear dependence between Y₁,res and X₁, which implies an unaccounted direct effect from X₁ to Y₁. Figure 7 shows that Y₂,res is independent of X₁, as expected, and implies no misspecification. Therefore it is important to examine scatter plots for the residual variables and not just for the observed variables.

This type of misspecification is, however, also easily detectable with other methods, such as modification indices, as well as by comparing the model-estimated variance-covariance matrix for Yᵢ and Xᵢ with the observed variance-covariance matrix for these variables. This comparison is obtained in Mplus with the RESIDUAL option of the OUTPUT command.

Figure 4: Y₁ vs. X₁

Figure 5: Y₂ vs. X₁

Figure 6: Y₁,res vs. X₁

Figure 7: Y₂,res vs. X₁
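A dependence like the one in Figure 6 suggests freeing the corresponding direct effect. A minimal sketch of the re-estimated MIMIC model with the suspected path added, and with the numerical cross-checks mentioned above requested in the OUTPUT command (data file name hypothetical):

  TITLE:     MIMIC model with the suggested direct effect (sketch)
  DATA:      FILE = mimic.dat;        ! hypothetical data file
  VARIABLE:  NAMES = y1-y10 x1-x5;
  MODEL:     f BY y1-y10;
             f ON x1-x5;
             y1 ON x1;                ! direct effect suggested by the
                                      ! Y1,res vs. X1 scatter plot
  OUTPUT:    RESIDUAL MODINDICES;     ! estimated vs. observed covariances and
                                      ! modification indices as numerical checks
  PLOT:      TYPE = PLOT3;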

4 Using residual scatter plots to detect residual covariances

In this section we generate a sample as in the previous section, with 10 indicator variables, one factor, and no covariates. We introduce into the data-generating model a residual correlation of 0.5 between Y₁ and Y₂. The data, however, are analyzed without the residual correlation. We consider the scatter plot between Y₁,res and Y₂,res to see if the misspecification will be detected. This scatter plot is presented in Figure 8, while in Figure 9 we present the scatter plot between Y₁,res and Y₃,res for comparative purposes.

Figure 8 shows that Y₁,res and Y₂,res are positively correlated: there is a linear trend between the two with a positive slope. This clearly suggests that the residual correlation between Y₁ and Y₂ should be added to the model. On the other hand, Figure 9 shows that Y₁,res and Y₃,res appear to be independent, which is correctly reflected in the model.

In this example the misspecification was detected by using scatter plots between the residual variables. A scatter plot between the observed variables would not be able to detect this misspecification, because both the model and the scatter plot indicate a positive correlation between any pair of indicator variables. This type of misspecification can also be detected using modification indices or by comparing the estimated and observed variance-covariance matrices given in the residual output.

Figure 8: Y₁,res vs. Y₂,res

Figure 9: Y₁,res vs. Y₃,res
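The modification suggested by Figure 8 amounts to adding a WITH statement for the two residuals; a minimal sketch (data file name hypothetical):

  TITLE:     Factor model with a residual covariance (sketch)
  DATA:      FILE = faccov.dat;       ! hypothetical data file
  VARIABLE:  NAMES = y1-y10;
  MODEL:     f BY y1-y10;
             y1 WITH y2;              ! residual covariance suggested by the
                                      ! Y1,res vs. Y2,res scatter plot
  OUTPUT:    RESIDUAL MODINDICES;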

5 Using scatter plots to detect heterogeneity in the population

For this example we generate the data set as in the previous section, using a factor analysis model with 10 indicators and one factor. The parameters are as in the previous sections, except for the residual variances of the indicator variables, which we now set at 0.1. In addition, we generate the data from a two-class model with two equal classes, where the factor loading for Y₂ in the second class is 0; all other loadings are 1. We estimate a model in which the latent class variable is ignored, i.e., a simple factor analysis model with the 10 indicators and a single factor.

We consider the scatter plots of Yᵢ vs. Ŷᵢ. Figure 10 contains the scatter plot of Y₁ vs. Ŷ₁ and Figure 11 contains the scatter plot of Y₂ vs. Ŷ₂. In Figure 10 we see a good match between the observed and the estimated values. In Figure 11, however, that is not the case: we can clearly see the two patterns in the data, indicating an underlying heterogeneity in the population and the potential for mixture modeling to improve the model and provide a better fit to the data. In this example as well, the usual fit statistics such as the chi-square test, CFI, and TLI are misleading, because they indicate a nearly perfect fit and overlook the drastic discrepancy illustrated in Figure 11.

Figure 10: Y₁ vs. Ŷ₁

Figure 11: Y₂ vs. Ŷ₂
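A two-pattern plot like Figure 11 points toward a factor mixture model with a class-varying loading. A minimal sketch of such a model under the setup of this section (data file name hypothetical; in practice the number of random starts would also need attention):

  TITLE:     Two-class factor mixture model (sketch)
  DATA:      FILE = hetero.dat;       ! hypothetical data file
  VARIABLE:  NAMES = y1-y10;
             CLASSES = c(2);          ! latent class variable with two classes
  ANALYSIS:  TYPE = MIXTURE;
  MODEL:     %OVERALL%
             f BY y1-y10;             ! common one-factor measurement model
             %c#2%
             f BY y2@0;               ! class 2: loading of y2 fixed at 0,
                                      ! matching the data-generating model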