Latent Class Regression Part II

Similar documents

Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén Table Of Contents

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

STATISTICA Formula Guide: Logistic Regression. Table of Contents

VI. Introduction to Logistic Regression

CHAPTER 8 EXAMPLES: MIXTURE MODELING WITH LONGITUDINAL DATA

11. Analysis of Case-control Studies Logistic Regression

Use of the Chi-Square Statistic. Marie Diener-West, PhD Johns Hopkins University

Ordinal Regression. Chapter

Statistical Machine Learning

CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS

Penalized regression: Introduction

Introduction to Path Analysis

Simple Predictive Analytics Curtis Seare

What do we mean by cause in public health?

Overview Classes Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Linear Classification. Volker Tresp Summer 2015

Simple Linear Regression Inference

Lecture 25. December 19, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University.

Regression III: Advanced Methods

Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Chapter 7: Simple linear regression Learning Objectives

Elements of statistics (MATH0487-1)

Statistics in Psychosocial Research Lecture 8 Factor Analysis I. Lecturer: Elizabeth Garrett-Mayer

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541

The Proportional Odds Model for Assessing Rater Agreement with Multiple Modalities

CHAPTER 12 EXAMPLES: MONTE CARLO SIMULATION STUDIES

Section B. Some Basic Economic Concepts

5. Multiple regression

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

Multivariate Logistic Regression

Some Essential Statistics The Lure of Statistics

Multinomial and Ordinal Logistic Regression

Developing Risk Adjustment Techniques Using the System for Assessing Health Care Quality in the

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Lecture 14: GLM Estimation and Logistic Regression

APPLIED MISSING DATA ANALYSIS

An Introduction to Latent Class Growth Analysis and Growth Mixture Modeling

CHAPTER 4 EXAMPLES: EXPLORATORY FACTOR ANALYSIS

Washoe County Senior Services 2013 Survey Data: Service User

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Univariate Regression

Logistic regression modeling the probability of success

Survey Research: Choice of Instrument, Sample. Lynda Burton, ScD Johns Hopkins University

Examining a Fitted Logistic Model

Generalized Linear Models

Statistical Models in R

Statistics, Data Analysis & Econometrics

Additional sources Compilation of sources:

Logistic Regression (1/24/13)

Nominal and ordinal logistic regression

Categorical Data Analysis

Simple linear regression

Lies, damned lies and latent classes: Can factor mixture models allow us to identify the latent structure of common mental disorders?

Social Media Mining. Data Mining Essentials

Logit and Probit. Brad Jones 1. April 21, University of California, Davis. Bradford S. Jones, UC-Davis, Dept. of Political Science

Statistics in Retail Finance. Chapter 2: Statistical models of default

7 Time series analysis

[This document contains corrections to a few typos that were found on the version available through the journal s web page]

Goodness of fit assessment of item response theory models

PREFERENCE ELICITATION METHODS. Lecture 9. Kevin Frick

Gerry Hobbs, Department of Statistics, West Virginia University

Correlation Coefficient The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables 2

Lecture 19: Conditional Logistic Regression

Economic and Financial Considerations in Health Policy. Gerard F. Anderson, PhD Johns Hopkins University

Introduction to Predictive Modeling Using GLMs

Data Mining Lab 5: Introduction to Neural Networks

Lecture 3: Linear methods for classification

MTH 140 Statistics Videos

GLM I An Introduction to Generalized Linear Models

Examples of Using R for Modeling Ordinal Data

Data analysis process

II. DISTRIBUTIONS distribution normal distribution. standard scores

Organizing Your Approach to a Data Analysis

The Big Picture. Correlation. Scatter Plots. Data

Section A. Index. Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques Page 1 of 11. EduPristine CMA - Part I

lavaan: an R package for structural equation modeling

Logistic Regression. BUS 735: Business Decision Making and Research

Basic Probability and Statistics Review. Six Sigma Black Belt Primer

Life Tables. Marie Diener-West, PhD Sukon Kanchanaraksa, PhD

Lecture 18: Logistic Regression Continued

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

Fitting Subject-specific Curves to Grouped Longitudinal Data

Section 3 Part 1. Relationships between two numerical variables

Introduction to Longitudinal Data Analysis

Section 6: Model Selection, Logistic Regression and more...

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Biostatistics: Types of Data Analysis

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Economics and Economic Evaluation

Module 3: Correlation and Covariance

Marginal Effects for Continuous Variables Richard Williams, University of Notre Dame, Last revised February 21, 2015

Analysis Issues II. Mary Foulkes, PhD Johns Hopkins University

Power and sample size in multilevel modeling

Statistics in Retail Finance. Chapter 6: Behavioural models

The Competitive Environment: Sales and Marketing

Transcription:

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this site. Copyright 2007, The Johns Hopkins University and Qian-Li Xue. All rights reserved. Use of these materials permitted only in accordance with license rights granted. Materials provided AS IS ; no representations or warranties provided. User assumes all responsibility for use, and all liability related thereto, and must independently review all materials for accuracy and efficacy. May contain materials owned by others. User is responsible for obtaining permissions for use from third parties as needed.

Latent Class Regression Part II Statistics for Psychosocial Research II: Structural Models Qian-Li Xue

Outline Model Checking Check model fit Check model assumption Conditional independence Non-differential measurement Identifiability

Model Checking Analog = residual checking in linear regression IT S CRITICALLY IMPORTANT! Can give misleading findings if measurement model assumptions are unwarranted Philosophical opinion: we learn primarily by specifying how simple models fail to fit, not by observing that complex models happen to fit Two types of checking Whether the model fits (e.g. observed vs. expected) How a model may fail to fit (ASSUMPTIONS)

Checking Whether the Model Fits Means 1) Do Y s aggregate as expected given the model? [Conditional Independence] 2) Do Y s relate to the X s as expected given the model [Non-Differential Measurement]

Checking Whether the Model Fits 1) Do Y s aggregate as expected given the model? [Conditional Independence] Compare observed pattern frequencies to predicted pattern frequencies

Frailty Example: Observed versus Expected Response Patterns: Ignoring Covariates 5 most frequently observed patterns among the non-frail Criterion Weight loss Weak Slow Exhaustion Low Activity Pattern Frequencies Observed Expected 1-class 2-class 3-class N N N N N 310 247.8 339.5 341.2 N N Y N N 76 107.2 74.9 72.4 N N N N Y 40 61.9 36.0 34.1 N Y N N N 36 63.9 39.7 38.5 N N Y N Y 32 26.8 22.8 28.7 5 most frequently observed patterns among the frail (Bandeen-Roche et al. 2006) N Y Y N Y 16 6.9 18.8 14.5 N Y Y Y Y 12 1.1 9.5 8.2 N N Y Y Y 10 4.3 9.5 7.3 Y Y Y Y Y 10 0.2 3.3 7.6 Y Y Y N Y 9 1 6.4 7 Latent Class Model Fit statistics Pearson Chi-Square 568 (p<.0001) 24.4 (p=.22) 13.1 (p=.52) AIC 3560 3389 3390 BIC 3583 3440 3467

MPLUS Input TITLE: Weighted Latent Class Analysis of Frailty Components Using Combined WHAS I and II Data Age 70-79 DATA: FILE IS "C:\teaching\140.658.2007\lcr.dat"; VARIABLE: NAMES ARE baseid shrink weak slow exhaust kcal sweight age educ disease; USEVARIABLES ARE shrink weak slow exhaust kcal sweight; MISSING ARE ALL (999999); CATEGORICAL ARE shrink-kcal; CLASSES = frailty(2); WEIGHT IS sweight;analysis: TYPE IS MIXTURE; MODEL: %OVERALL% %frailty#1% [shrink$1*-1 weak$1*-1 slow$1*0 exhaust$1*-1 kcal$1*-1]; %frailty#2% [shrink$1*1 weak$1*1 slow$1*1 exhaust$1*1 kcal$1*2]; OUTPUT: TECH1 TECH10 SAVEDATA: FILE IS "C:\teaching\140.658.2007\lcasave.out"; SAVE=CPROB; Contains observed vs. expected frequencies of response patterns Save estimated posterior probabilities of class membership

MPLUS Output TECHNICAL 10 OUTPUT MODEL FIT INFORMATION FOR THE LATENT CLASS INDICATOR MODEL PART RESPONSE PATTERNS No. Pattern No. Pattern No. Pattern No. Pattern 1 00000 2 10000 3 01000 4 11000 5 00100 6 10100 7 01100 8 11100 9 00010 10 10010 11 01010 12 11010 13 00110 14 10110 15 01110 16 11110 RESPONSE PATTERN FREQUENCIES AND CHI-SQUARE CONTRIBUTIONS Response Frequency Standardized Chi-square Contribution Pattern Observed Estimated Residual Pearson Loglikelihood Deleted (z-score) 1 344.81 331.36 0.99 0.11 3.01 2 30.14 28.91 0.23 0.05 3.31 3 37.13 38.02-0.15-0.06-5.15 4 7.95 5.36 1.12 1.25 5.99 5 73.46 76.67-0.39 0.12-3.94 6 8.27 11.27-0.90 0.73-3.63 Recall: standardized residual = O - E / (E) 1/2

Do Y patterns behave as model predicts? Compare observed pattern frequencies to expected pattern frequencies How does addition of regression change interpretation? Evaluating fit of measurement piece Will be same as in standard LCA model unless..

Do Y patterns behave as model predicts? With Categorical Covariates Easier than continuous (computationally) Example Calculate: Predicted whites with weight loss Observed whites with weight loss Predicted blacks with weight loss Observed blacks with weight loss

Do Y patterns behave as model predicts? Categorical Covariates Assume LC regression model with only race as covariate Race (x) = 0 if white, 1 if black Item of interest: weight loss (y m =1 yes; 0 no) Want find how many class 2 whites we would expect to report weight loss based on the model Pr( y im = 1, η = i = Pr( y im j, x i = 0) = 1 η = i j, x i = 0) Pr( η = = 0) Pr( x = 0) Calculate this for each of the classes and sum up: Will tell us the expected number of whites reporting weight loss i j x = Pr( yim = 1 ηi = j) Pr( ηi = j xi = 0) Pr( xi = 0) β0 e Expected(weight loss, whites, and class = 2) = N π mj Prop.( whites) β0 1+ e i i

Checking Whether the Model Fits 2) Do Y s relate to the X s as expected given the model [Non-Differential Measurement] Idea: focus on one item at a time Recall: J M yim (1 y i1 = yi 1,..., YiM = yim xi ) = pij ( xi ) π mj (1 mj ) j= 1 m= 1 P( Y π If interested in item m, ignore ( marginalize over ) other items: yim P( Y = y x ) = p ( x ) π (1 π im im i J j= 1 ij i mj mj ) (1 y im ) im )

Comparing Fitted to Observed Construct the predicted curve by plotting this conditional probability versus any given x Add a smooth spline to reveal systematic trend m Superimpose it with an observed item response curve by Plot item response (0 or 1) by x Add smooth spline to reveal systematic trend

Checking How the Model Fails to Fit Check Assumptions non-differential measurement conditional independence Non-differential Measurement: P(y im x i, η i ) = P(y im η i ) In words, within a class, there is no association between y s and x s. Check this using logistic regression approach

Checking How the Model Fails to Fit Basic ideas: Suppose the model is true If we know persons latent class memberships, we would check directly: Stratify into classes, then, within classes: Check correlations or pairwise odds ratios among the item responses (Conditional Independence) Regress item responses or observed patterns on covariates (non-differential measurement) Regress class memberships on covariates, hope for Similar findings re regression coefficients No strong effects of outliers Identify strongly nonlinear covariates effects

Checking Conditional Independence In words, within a class, there is no association between y m and y n, m n. Define: OR mn j = P( y P( y m m = 1, y = 1, y n n = 1 η = = 0 η = j) / j) / P( y P( y m m = = 0, 0, y y n n = 1 η = = 0 η = j) j) OR~1

Checking Non-differential Measurement For binary covariates and for each class j and item m consider OR m jx = P( y P( y m m = 1 x = 1, η = = 1 x = 0, η = j) / j) / P( y P( y m m = = 0 0 x x = 1, η = = 0, η = j) j) If assumption holds, this OR will be approximately equal to 1 What about continuous covariates? Use same general idea, but estimate the logor within classes by logistic regression Example: age

Checking How the Model Fails to Fit But in reality, we don t know the true latent class membership! Latent class memberships must be estimated Randomize people into pseudo classes using their posterior probabilities Recall: posterior probability is defined as Pr( yi ηi = j) Pr( ηi = j xi ) Pr( ηi = j xi, yi ) = J Pr( y η = j) Pr( η = j x ) j= 1 i i Different from latent class probabilities Pr( η = j x )!! Analyze as described before, except using pseudo class membership rather than true ones i i i i

MPLUS Output: Estimated Posterior Probabilities SAVEDATA: FILE IS "C:\teaching\140.658.2007\lcasave.out"; SAVE=CPROB; Response Data Sampling weights Posterior Class Prob. 0.000 1.000 1.000 0.000 0.000 0.845 0.211 0.789 2.000 1.000 1.000 1.000 1.000 1.000 0.737 0.001 0.999 2.000 0.000 1.000 1.000 0.000 1.000 0.737 0.022 0.978 2.000 1.000 1.000 1.000 1.000 1.000 0.737 0.001 0.999 2.000 0.000 0.000 1.000 0.000 1.000 0.766 0.215 0.785 2.000 * 1.000 1.000 0.000* 0.718 0.102 0.898 2.000 1.000 1.000 1.000 1.000 1.000 0.742 0.001 0.999 2.000 0.000 1.000 0.000 0.000 0.000 0.791 0.786 0.214 1.000 0.000 1.000 0.000 0.000 0.000 0.791 0.786 0.214 1.000 Most likely Class

Checking Conditional Independence 1) Assign individuals to pseudo-classes based on posterior probability of class membership e.g. individual with 0.20, 0.05, 0.75 better chance of being in class 3 not necessarily in class 3 2) Calculate OR s within classes. 3) Repeat 1) and 2) at least a few times 4) Compare OR s to 1

Utility of Model Checking May modify interpretation to incorporate lack of fit/violation of assumption May help elucidate a transformation that that would be more appropriate (e.g. log(age) versus age) May suggest how to improve measurement (e.g. better survey instrument) May lead to believe that LCR is not appropriate

Select Number of Classes Method 1: analyze for a theorized number of classes, check for violation of conditional dependence, refit with more classes if clear patterns of dependence emerge Labor-intensive! Method 2: Conduct a LCA ignoring covariates; select classes as discussed last term This works assuming non-differential measurement Y s are latent class distributed ignoring covarites With same number of classes and same πs as in regression model

Identifiability General Idea: different parameters can lead to the same model fit 2-step rule: If (a) polytomous logistic regression is identified (b) standard LCM is identified Then model is identified T-rule: need more data cells than parameters Recall: for binary indicators: # of parameters (unknowns) = [M*J] πs + [(J-1)*(H+1)] β s, where H is # of covariates Data cell count is 2 n -1 per covariate combination Each possible covariate combination creates a stratum For continuous covariates, t-rule likely satisfied

Empirical Testing of Identifiability Must run model more than once using different starting values to check identifiability! Mplus input: ANALYSIS: TYPE IS MIXTURE; STARTS = 500 50; STITERATIONS=20; Number of initial stage random sets of starting values and the number of final stage optimizations to use Maximum number of iteration allowing in the initial stage

MPLUS Output: Good Stability RANDOM STARTS RESULTS RANKED FROM THE BEST TO THE WORST LOGLIKELIHOOD VALUES Final stage loglikelihood values at local maxima, seeds, and initial stage start numbers: -2750.059 373505 88-2750.059 642909 251-2750.059 569131 26-2750.059 467339 66-2750.059 903369 134-2750.059 652266 490-2750.059 765392 382-2750.059 315029 471-2750.059 22089 143-2750.059 127215 9-2750.059 252949 487

MPLUS Output: Bad Stability RANDOM STARTS RESULTS RANKED FROM THE BEST TO THE WORST LOGLIKELIHOOD VALUES Final stage loglikelihood values at local maxima, seeds, and initial stage start numbers: -2750.059 373505 192-2759.012 642909 425-2755.426 852154 19-2750.059 467339 66-2743.785 158965 111-2750.059 965321 336-2743.785 765392 382-2743.785 178526 189-2750.059 56325 258-2750.059 128963 41-2759.012 56833 401 Identifiability is in question!!

Identifiability To best assure identification Incorporate a priori theory as much as possible Set πs to 0 or 1 where it makes sense to do so Set πs equal to each other If program fails to converge Run the program longer Re-initilize in very different place Add constraints (e.g. set πs to 0 or 1 where sensible)

Cautions Identifiability Even if no warning message, some other solution than the one you ve identified may fit as well or better than yours Try several initializations Models that appeared identified at the latent class analysis stage may generate an error message at the regression stage Estimability vs. global identifiability May need to add further measurement model constraints If fit very unstable: should reconsider using LCR at all

Parameter Constraints: Mplus Example TITLE: this is an example of a LCA with binary latent class indicators and parameter constraints DATA: FILE IS ex7.13.dat; VARIABLE: NAMES ARE u1-u4; CLASSES = c (2); CATEGORICAL = u1-u4; ANALYSIS: TYPE = MIXTURE; MODEL: %OVERALL% %c#1% [u1$1*-1]; [u2$1-u3$1*-1] (1); [u4$1*-1] (p1); %c#2% u1 has π equal to 1 in class 2 [u1$1@-15]; [u2$1-u3$1*1] (2); [u4$1*1] (p2); MODEL CONSTRAINT: p2 = - p1; OUTPUT: TECH1 TECH8; MPLUS User s Guide p. 139 u2 and u3 have same π in class 1 u2 and u3 have same π in class 2 The threshold of u4 in class 1 is equal to - 1*threshold of u4 in class 2 (i.e. same error rate)