Mplus Short Courses
Topic 2

Regression Analysis, Exploratory Factor Analysis, Confirmatory Factor Analysis, And Structural Equation Modeling For Categorical, Censored, And Count Outcomes

Linda K. Muthén
Bengt Muthén

Copyright 2008 Muthén & Muthén
www.statmodel.com

Table Of Contents

General Latent Variable Modeling Framework
Analysis With Categorical Observed And Latent Variables
Categorical Observed Variables
Logit And Probit Regression
British Coal Miner Example
Logistic Regression And Adjusted Odds Ratios
Latent Response Variable Formulation Versus Probability Curve Formulation
Ordered Polytomous Regression
Alcohol Consumption Example
Unordered Polytomous Regression
Censored Regression
Count Regression
Poisson Regression
Negative Binomial Regression
Path Analysis With Categorical Outcomes
Occupational Destination Example
Table Of Contents (Continued)

Categorical Observed And Continuous Latent Variables
Item Response Theory
Exploratory Factor Analysis
Practical Issues
CFA With Covariates
Antisocial Behavior Example
Multiple Group Analysis With Categorical Outcomes
Technical Issues For Weighted Least Squares Estimation
References

Mplus Background

Inefficient dissemination of statistical methods:
  Many good methods contributions from biostatistics, psychometrics, etc. are underutilized in practice
Fragmented presentation of methods:
  Technical descriptions in many different journals
  Many different pieces of limited software
Mplus: Integration of methods in one framework
  Easy to use: Simple, non-technical language, graphics
  Powerful: General modeling capabilities
Mplus versions
  V1: November 1998
  V2: February 2001
  V3: March 2004
  V4: February 2006
  V5: November 2007
Mplus team: Linda & Bengt Muthén, Thuy Nguyen, Tihomir Asparouhov, Michelle Conn, Jean Maninger
Statistical Analysis With Latent Variables
A General Modeling Framework

Statistical Concepts Captured By Latent Variables

Continuous Latent Variables
  Measurement errors
  Factors
  Random effects
  Frailties, liabilities
  Variance components
  Missing data
Categorical Latent Variables
  Latent classes
  Clusters
  Finite mixtures
  Missing data

Statistical Analysis With Latent Variables
A General Modeling Framework (Continued)

Models That Use Latent Variables

Continuous Latent Variables
  Factor analysis models
  Structural equation models
  Growth curve models
  Multilevel models
Categorical Latent Variables
  Latent class models
  Mixture models
  Discrete-time survival models
  Missing data models

Mplus integrates the statistical concepts captured by latent variables into a general modeling framework that includes not only all of the models listed above but also combinations and extensions of these models.
General Latent Variable Modeling Framework

Observed variables
  x  background variables (no model structure)
  y  continuous and censored outcome variables
  u  categorical (dichotomous, ordinal, nominal) and count outcome variables
Latent variables
  f  continuous variables
     interactions among f's
  c  categorical variables
     multiple c's

Mplus

Several programs in one
  Exploratory factor analysis
  Structural equation modeling
  Item response theory analysis
  Latent class analysis
  Latent transition analysis
  Survival analysis
  Growth modeling
  Multilevel analysis
  Complex survey data analysis
  Monte Carlo simulation
Fully integrated in the general latent variable framework
Overview Of Mplus Courses

Topic 1. March 18, 2008, Johns Hopkins University: Introductory - advanced factor analysis and structural equation modeling with continuous outcomes
Topic 2. March 19, 2008, Johns Hopkins University: Introductory - advanced regression analysis, IRT, factor analysis and structural equation modeling with categorical, censored, and count outcomes
Topic 3. August 20, 2008, Johns Hopkins University: Introductory and intermediate growth modeling
Topic 4. August 21, 2008, Johns Hopkins University: Advanced growth modeling, survival analysis, and missing data analysis

Overview Of Mplus Courses (Continued)

Topic 5. November 10, 2008, University of Michigan, Ann Arbor: Categorical latent variable modeling with cross-sectional data
Topic 6. November 11, 2008, University of Michigan, Ann Arbor: Categorical latent variable modeling with longitudinal data
Topic 7. March 17, 2009, Johns Hopkins University: Multilevel modeling of cross-sectional data
Topic 8. March 18, 2009, Johns Hopkins University: Multilevel modeling of longitudinal data
Analysis With Categorical Observed And Latent Variables

Categorical Variable Modeling
  Categorical observed variables
  Categorical observed variables, continuous latent variables
  Categorical observed variables, categorical latent variables
Categorical Observed Variables

Two Examples

Alcohol Dependence And Gender In The NLSY

                          Female    Male     Total
n                          4573     4603     9176
Not Dep                    4317     3904     8221
Dep                         256      699      955
Prop                       .056     .152
Odds (Prop/(1-Prop))       .059     .179

Odds Ratio = .179/.059 = 3.019

Example wording: Males are three times more likely than females to be alcohol dependent.

Colds And Vitamin C

               n     No Cold    Cold    Prop    Odds
Placebo       140      109       31     .221    .284
Vitamin C     139      122       17     .122    .139
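The proportions, odds, and odds ratios in the two examples can be recomputed from the raw counts. The sketch below is illustrative only: it takes the counts to be 256 dependent out of 4,573 females and 699 out of 4,603 males for the NLSY example, and 31 colds out of 140 (placebo) and 17 out of 139 (vitamin C). The helper name `odds_summary` is made up for this example, and the odds ratio for colds is derived here rather than quoted from the slide.

```python
def odds_summary(n_event, n_total):
    """Return (proportion, odds) for n_event events out of n_total cases."""
    prop = n_event / n_total
    return prop, prop / (1 - prop)

# NLSY alcohol dependence (counts assumed as described in the lead-in)
p_f, odds_f = odds_summary(256, 4573)   # females
p_m, odds_m = odds_summary(699, 4603)   # males
or_nlsy = odds_m / odds_f               # males relative to females

# Colds and vitamin C
p_pl, odds_pl = odds_summary(31, 140)   # placebo
p_vc, odds_vc = odds_summary(17, 139)   # vitamin C
or_cold = odds_pl / odds_vc             # placebo relative to vitamin C

print("NLSY:  props", round(p_f, 3), round(p_m, 3), "OR", round(or_nlsy, 3))
print("Colds: props", round(p_pl, 3), round(p_vc, 3), "OR", round(or_cold, 3))
```

Working from unrounded odds gives an NLSY odds ratio slightly different from dividing the rounded odds .179 by .059 by hand, which is why the two can disagree in the second decimal.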
Categorical Outcomes: Probability Concepts

Probabilities:
  Joint: P(u, x)
  Marginal: P(u)
  Conditional: P(u | x)

Alcohol Example

              Joint                Conditional
              Not Dep    Dep       Dep
Female        .47        .03       .06
Male          .43        .08       .15
Marginal      .90        .10

Distributions:
  Bernoulli: u = 0/1; E(u) = π
  Binomial: sum or prop. (u = 1), E(prop.) = π, V(prop.) = π(1 − π)/n, $\hat{\pi}$ = prop
  Multinomial (#parameters = #cells − 1)
  Independent multinomial (product multinomial)
  Poisson

Categorical Outcomes: Probability Concepts (Continued)

            x = 0         x = 1
u = 0       $\pi_{00}$    $\pi_{01}$
u = 1       $\pi_{10}$    $\pi_{11}$

Cross-product ratio (odds ratio), with $\pi_{jk} = P(u = j, x = k)$:

$\frac{\pi_{00}\,\pi_{11}}{\pi_{01}\,\pi_{10}} = \frac{\pi_{11}/\pi_{01}}{\pi_{10}/\pi_{00}} = \frac{P(u = 1, x = 1)\,/\,P(u = 0, x = 1)}{P(u = 1, x = 0)\,/\,P(u = 0, x = 0)}$

Tests:
  Log odds ratio (approx. normal)
  Test of proportions (approx. normal)
  Pearson $\chi^2 = \sum (O - E)^2 / E$ (e.g. independence)
  Likelihood Ratio $\chi^2 = 2 \sum O \log(O / E)$
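As an illustration of the tests listed above, the following sketch applies the Pearson and likelihood-ratio chi-square formulas to the NLSY alcohol table under independence of gender and dependence, and forms a z statistic for the log odds ratio. The cell counts are assumed to be 4,317 and 256 for females and 3,904 and 699 for males. The standard error sqrt(sum of 1/count) for the log odds ratio is the usual large-sample formula and is not taken from the slides.

```python
import math

# Rows: female, male. Columns: not dependent, dependent (assumed counts).
observed = [[4317, 256],
            [3904, 699]]

row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]
n = sum(row_tot)

pearson = 0.0
lr = 0.0
for i in range(2):
    for j in range(2):
        o = observed[i][j]
        e = row_tot[i] * col_tot[j] / n      # expected count under independence
        pearson += (o - e) ** 2 / e
        lr += 2 * o * math.log(o / e)

# Log odds ratio and its large-sample standard error
log_or = math.log(observed[0][0] * observed[1][1]
                  / (observed[0][1] * observed[1][0]))
se_log_or = math.sqrt(sum(1 / o for row in observed for o in row))
z = log_or / se_log_or

print("Pearson chi-square:", round(pearson, 1))
print("LR chi-square:     ", round(lr, 1))
print("log OR, SE, z:     ", round(log_or, 3), round(se_log_or, 3), round(z, 1))
```

Both chi-square statistics have 1 degree of freedom for a 2 x 2 table.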
Further Readings On Categorical Variable Analysis

Agresti, A. (2002). Categorical data analysis. Second edition. New York: John Wiley & Sons.
Agresti, A. (1996). An introduction to categorical data analysis. New York: Wiley.
Hosmer, D. W. & Lemeshow, S. (2000). Applied logistic regression. Second edition. New York: John Wiley & Sons.
Long, S. (1997). Regression models for categorical and limited dependent variables. Thousand Oaks: Sage.

Logit And Probit Regression

  Dichotomous outcome
  Adjusted log odds
  Ordered, polytomous outcome
  Unordered, polytomous outcome
  Multivariate categorical outcomes
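Logs

[Figure: four panels. Logarithmic Function (log x against x); Logistic Distribution Function (P(u = 1 | x) against x); Logit (logit[P(u = 1 | x)] against x); Logistic Density (density against u*).]

Binary Outcome: Logistic Regression

The logistic function

$P(u = 1 \mid x) = F(\beta_0 + \beta_1 x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$.

[Figure: logistic distribution function $F(\beta_0 + \beta_1 x)$ and logistic density, each plotted against the logistic score $\beta_0 + \beta_1 x$.]

Logistic density: $\partial F / \partial z = F(1 - F) = f(z; 0, \pi^2/3)$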
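The two properties of the logistic density stated above can be checked numerically. This is a minimal sketch and not part of the course material: it compares F(1 − F) with a central-difference derivative of F, and approximates the variance of the standard logistic distribution by a midpoint sum, which should be close to π²/3.

```python
import math

def logistic_cdf(z):
    """Logistic distribution function F(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_pdf(z):
    """Logistic density written as F(z) * (1 - F(z))."""
    f = logistic_cdf(z)
    return f * (1.0 - f)

# Central-difference derivative of F compared with F(1 - F)
h = 1e-6
max_gap = 0.0
for z in (-3.0, -1.0, 0.0, 0.5, 2.0):
    numeric = (logistic_cdf(z + h) - logistic_cdf(z - h)) / (2 * h)
    max_gap = max(max_gap, abs(numeric - logistic_pdf(z)))

# Variance by a midpoint sum over [-40, 40]; the mean is 0 by symmetry
step = 0.001
variance = 0.0
for k in range(80000):
    z = -40.0 + (k + 0.5) * step
    variance += z * z * logistic_pdf(z) * step

print("largest derivative gap:", max_gap)
print("variance:", round(variance, 4), " pi^2/3:", round(math.pi ** 2 / 3, 4))
```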
Binary Outcome: Probit Regression

Probit regression considers

$P(u = 1 \mid x) = \Phi(\beta_0 + \beta_1 x)$, (60)

where Φ is the standard normal distribution function. Using the inverse normal function $\Phi^{-1}$ gives a linear probit equation

$\Phi^{-1}[P(u = 1 \mid x)] = \beta_0 + \beta_1 x$. (61)

[Figure: normal distribution function $\Phi(\beta_0 + \beta_1 x)$ and normal density, each plotted against the z score $\beta_0 + \beta_1 x$.]

Interpreting Logit And Probit Coefficients

  Sign and significance
  Odds and odds ratios
  Probabilities
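Logistic Regression And Log Odds

$\text{Odds}(u = 1 \mid x) = P(u = 1 \mid x) / P(u = 0 \mid x) = P(u = 1 \mid x) / (1 - P(u = 1 \mid x))$.

The logistic function

$P(u = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$

gives a log odds linear in x,

$\text{logit} = \log[\text{odds}(u = 1 \mid x)] = \log[P(u = 1 \mid x) / (1 - P(u = 1 \mid x))]$
$= \log\left[\frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \Big/ \left(1 - \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}\right)\right]$
$= \log\left[\frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \cdot \frac{1 + e^{-(\beta_0 + \beta_1 x)}}{e^{-(\beta_0 + \beta_1 x)}}\right]$
$= \log\left[e^{(\beta_0 + \beta_1 x)}\right] = \beta_0 + \beta_1 x$

Logistic Regression And Log Odds (Continued)

logit = log odds = $\beta_0 + \beta_1 x$

When x changes one unit, the logit (log odds) changes by $\beta_1$ units
When x changes one unit, the odds are multiplied by $e^{\beta_1}$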
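A small numeric check of the two statements above. The coefficient values are hypothetical and chosen only for illustration; they are not estimates from the course data.

```python
import math

# Hypothetical coefficients, for illustration only
b0, b1 = -1.0, 0.5

def prob(x):
    """P(u = 1 | x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(x):
    p = prob(x)
    return p / (1.0 - p)

for x in (0.0, 1.0, 2.0, 3.0):
    log_odds = math.log(odds(x))
    ratio = odds(x + 1.0) / odds(x)      # effect of a one-unit change in x
    print(x, round(log_odds, 6), round(ratio, 6))

print("b1 =", b1, " exp(b1) =", round(math.exp(b1), 6))
```

Each printed log odds equals b0 + b1*x, successive log odds differ by b1, and every ratio equals exp(b1) regardless of the starting value of x.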
British Coal Miner Data

"Have you experienced breathlessness?"

[Figure: proportion answering yes plotted against age.]

Plot Of Sample Logits

[Figure: sample logit plotted against age.]

Sample logit = log [proportion / (1 − proportion)]
British Coal Miner Data (Continued)

Age (x)    N         N Yes    Proportion    OLS Estimated    Logit Estimated    Probit Estimated
                              Yes           Probability      Probability        Probability
22          1,952       16    .008          -.053            .013               .009
27          1,791       32    .018          -.004            .022               .018
32          2,113       73    .035           .045            .036               .034
37          2,783      169    .061           .094            .059               .060
42          2,274      223    .098           .143            .095               .100
47          2,393      357    .149           .192            .148               .156
52          2,090      521    .249           .241            .225               .23
57          1,750      558    .319           .290            .327               .322
62          1,136      478    .421           .339            .448               .425
Total      18,282    2,427    .13

SOURCE: Ashford & Sowden (1970), Muthén (1993)

Logit model: $\chi^2_{LRT}(7) = 17.13$ (p > .01)
Probit model: $\chi^2_{LRT}(7) = 5.19$

Coal Miner Data

x     u     w
22    0     1936
22    1       16
27    0     1759
27    1       32
32    0     2040
32    1       73
37    0     2614
37    1      169
42    0     2051
42    1      223
47    0     2036
47    1      357
52    0     1569
52    1      521
57    0     1192
57    1      558
62    0      658
62    1      478
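The grouped frequencies above are enough to fit the logistic regression directly. The following is a minimal sketch of a Newton-Raphson maximum-likelihood fit in plain Python, not the algorithm Mplus uses internally. It assumes the frequencies as listed (for example 1,936 "no" and 16 "yes" at age 22) and divides age by 10 so that one unit of x is ten years; the function name `fit_logistic` is made up for this example.

```python
import math

# (age, count with u = 0, count with u = 1), assumed from the listing above
data = [(22, 1936, 16), (27, 1759, 32), (32, 2040, 73),
        (37, 2614, 169), (42, 2051, 223), (47, 2036, 357),
        (52, 1569, 521), (57, 1192, 558), (62, 658, 478)]

def fit_logistic(rows, iterations=25):
    """Newton-Raphson ML for P(u = 1 | x) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for age, n_no, n_yes in rows:
            x = age / 10.0                      # age in decades
            n = n_no + n_yes
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = n_yes - n * p               # score contribution
            w = n * p * (1.0 - p)               # information weight
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det       # solve the 2 x 2 system
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0_hat, b1_hat = fit_logistic(data)
p62 = 1.0 / (1.0 + math.exp(-(b0_hat + b1_hat * 6.2)))
print("intercept:", round(b0_hat, 3), " slope:", round(b1_hat, 3))
print("estimated P(u = 1 | age 62):", round(p62, 3))
```

The intercept found this way is the negative of the threshold that Mplus reports for a categorical outcome.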
Mplus Input For Categorical Outcomes

Specifying dependent variables as categorical: use the CATEGORICAL option
    CATEGORICAL ARE u1 u2 u3;
Thresholds used instead of intercepts: only different in sign
Referring to thresholds in the model: use $ number added to a variable name; the number of thresholds is equal to the number of categories minus 1
    u1$1 refers to threshold 1 of u1
    u1$2 refers to threshold 2 of u1

Mplus Input For Categorical Outcomes (Continued)

    u2$1 refers to threshold 1 of u2
    u2$2 refers to threshold 2 of u2
    u2$3 refers to threshold 3 of u2
    u3$1 refers to threshold 1 of u3
Referring to scale factors: use { } to refer to scale factors
    {u1@1 u2 u3};
Input For Logistic Regression Of Coal Miner Data

TITLE:      Logistic regression of coal miner data
DATA:       FILE = coalminer.dat;
VARIABLE:   NAMES = x u w;
            CATEGORICAL = u;
            FREQWEIGHT = w;
DEFINE:     x = x/10;
ANALYSIS:   ESTIMATOR = ML;
MODEL:      u ON x;
OUTPUT:     TECH1 SAMPSTAT STANDARDIZED;

Input For Probit Regression Of Coal Miner Data

TITLE:      Probit regression of coal miner data
DATA:       FILE = coalminer.dat;
VARIABLE:   NAMES = x u w;
            CATEGORICAL = u;
            FREQWEIGHT = w;
DEFINE:     x = x/10;
MODEL:      u ON x;
OUTPUT:     TECH1 SAMPSTAT STANDARDIZED;
Output Excerpts Logistic Regression Of Coal Miner Data

Model Results
                 Estimates     S.E.     Est./S.E.     Std       StdYX
U ON
  X                1.025       0.025     41.758       1.025     0.556
Thresholds
  U$1              6.564       0.124     52.873

Odds: $e^{1.025} = 2.79$
As x increases 1 unit (10 years), the odds of breathlessness increase by a factor of 2.79.

Estimated Logistic Regression Probabilities For Coal Miner Data

$P(u = 1 \mid x) = \frac{1}{1 + e^{-L}}$, where $L = -6.564 + 1.025\,x$

For x = 6.2 (age 62):
$L = -6.564 + 1.025 \times 6.2 = -0.209$
$P(u = 1 \mid \text{age } 62) = \frac{1}{1 + e^{0.209}} = 0.448$
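The arithmetic on this page can be reproduced in a few lines. The sketch assumes the reported threshold 6.564 (so the intercept is −6.564) and slope 1.025, with x measured as age divided by 10; the function name `logistic_prob` is made up for this example.

```python
import math

threshold = 6.564        # Mplus threshold; the intercept is its negative
slope = 1.025            # change in log odds per 10 years of age

def logistic_prob(age):
    """Estimated P(u = 1 | age) from the reported logistic estimates."""
    x = age / 10.0
    logit = -threshold + slope * x
    return 1.0 / (1.0 + math.exp(-logit))

odds_multiplier = math.exp(slope)
p62 = logistic_prob(62)

print("odds multiplier per 10 years:", round(odds_multiplier, 2))
for age in range(22, 63, 5):
    print(age, round(logistic_prob(age), 3))
```

The loop prints the estimated probability for each age group from 22 to 62, which can be compared with the logit column of the coal miner table.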
Output Excerpts Probit Regression Of Coal Miner Data

Model Results
                 Estimates     S.E.     Est./S.E.     Std       StdYX
U ON
  X                0.548       0.013     43.75        0.548     0.545
Thresholds
  U$1              3.581       0.062     57.866       3.581     3.581

R-Square
  Observed Variable     Residual Variance     R-Square
  U                     1.000                 0.297

Estimated Probit Regression Probabilities For Coal Miner Data

$P(u = 1 \mid \text{age } 62) = \Phi(\beta_0 + \beta_1 x) = 1 - \Phi(\tau - \beta_1 x) = \Phi(-\tau + \beta_1 x)$

$\Phi(-3.581 + 0.548 \times 6.2) = \Phi(-0.1834) \approx 0.427$

Note: logit ≈ probit * c, where $c = \sqrt{\pi^2 / 3} = 1.81$
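The probit probability and the logit-probit scaling constant can be checked with the standard library alone. The sketch assumes the reported threshold 3.581 and slope 0.548, and writes the standard normal distribution function in terms of `math.erf`.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

tau = 3.581              # reported threshold
beta1 = 0.548            # reported probit slope (x = age / 10)

z62 = -tau + beta1 * 6.2
p62 = normal_cdf(z62)

c = math.sqrt(math.pi ** 2 / 3)          # ratio of logistic to normal SD
rescaled_slope = beta1 * c               # probit slope on the logit scale

print("z at age 62:", round(z62, 4))
print("P(u = 1 | age 62):", round(p62, 3))
print("c:", round(c, 2), " rescaled slope:", round(rescaled_slope, 3))
```

The rescaled probit slope comes out near, but not equal to, the logistic slope of 1.025, which is what the approximate relation logit ≈ probit * c leads one to expect.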
Categorical Outcomes: Logit And Probit Regression With One Binary And One Continuous X

$P(u = 1 \mid x_1, x_2) = F[\beta_0 + \beta_1 x_1 + \beta_2 x_2]$, (22)
$P(u = 0 \mid x_1, x_2) = 1 - P[u = 1 \mid x_1, x_2]$,

where F[z] is either the standard normal ($\Phi[z]$) or logistic ($1/[1 + e^{-z}]$) distribution function.

Example: Lung cancer and smoking among coal miners
  u     lung cancer (u = 1) or not (u = 0)
  x1    smoker (x1 = 1), non-smoker (x1 = 0)
  x2    years spent in coal mine

[Figure: two panels plotted against x2, each with separate curves for x1 = 0 and x1 = 1: the probability P(u = 1 | x1, x2), on a scale from 0 through .5 to 1, and the probit / logit.]
Logistic Regression And Adjusted Odds Ratios

Binary u variable regression on a binary x1 variable and a continuous x2 variable:

$P(u = 1 \mid x_1, x_2) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}$, (62)

which implies

$\text{log odds} = \text{logit}[P(u = 1 \mid x_1, x_2)] = \beta_0 + \beta_1 x_1 + \beta_2 x_2$. (63)

This gives

$\text{log odds}\{x_1 = 0\} = \text{logit}[P(u = 1 \mid x_1 = 0, x_2)] = \beta_0 + \beta_2 x_2$, (64)

and

$\text{log odds}\{x_1 = 1\} = \text{logit}[P(u = 1 \mid x_1 = 1, x_2)] = \beta_0 + \beta_1 + \beta_2 x_2$. (65)

Logistic Regression And Adjusted Odds Ratios (Continued)

The log odds ratio for u and x1 adjusted for x2 is

$\log OR = \log\left[\frac{\text{odds}_1}{\text{odds}_0}\right] = \log \text{odds}_1 - \log \text{odds}_0 = \beta_1$ (66)

so that $OR = \exp(\beta_1)$, constant for all values of x2. If an interaction term for x1 and x2 is introduced, the constancy of the OR no longer holds.

Example wording:
  The odds of lung cancer adjusted for years is OR times higher for smokers than for nonsmokers
  The odds ratio adjusted for years is OR
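The constancy of the adjusted odds ratio, and its loss once an interaction is added, can be seen in a short numeric sketch. All coefficient values below are hypothetical and chosen only for illustration; `b3` denotes the coefficient of an assumed x1*x2 interaction term that does not appear in equations (62) to (66).

```python
import math

# Hypothetical coefficients, for illustration only
b0, b1, b2 = -3.0, 0.7, 0.05

def odds(x1, x2, b3=0.0):
    """Odds of u = 1; b3 is the coefficient of an optional x1*x2 interaction."""
    return math.exp(b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2)

def odds_ratio(x2, b3=0.0):
    """Odds ratio for x1 = 1 versus x1 = 0 at a given value of x2."""
    return odds(1, x2, b3) / odds(0, x2, b3)

for years in (5, 15, 30):
    print(years,
          round(odds_ratio(years), 4),             # main effects only
          round(odds_ratio(years, b3=0.02), 4))    # with interaction

print("exp(b1) =", round(math.exp(b1), 4))
```

Without the interaction every odds ratio equals exp(b1); with it the odds ratio becomes exp(b1 + b3*x2) and changes with x2.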
Analysis Of NLSY Data: Odds Ratios For Alcohol Dependence And Gender

Adjusting for Age First Started Drinking (n=9176)

Observed Frequencies, Proportions, and Odds Ratios

              Frequency            Proportion Dependent
Age 1st       Female     Male      Female     Male        OR
12 or <         85        223      .071       .233        3.98
13             105        180      .133       .256        2.24
14             198        308      .086       .253        3.60
15              33        534      .106       .185        1.91
16               8         99      .079       .152        2.09
17             725        777      .070       .170        2.72
18 or >       2329         59      .030       .089        3.16

Analysis Of NLSY Data: Odds Ratios For Alcohol Dependence And Gender (Continued)

Estimated Probabilities and Odds Ratios

              Logit                          Probit
Age 1st       Female    Male     OR          Female    Male     OR
12 or <       .141      .304     2.66        .152      .298     2.37
13            .117      .260     2.66        .125      .257     2.42
14            .096      .220     2.66        .102      .220     2.48
15            .078      .185     2.66        .082      .186     2.55
16            .064      .154     2.66        .065      .155     2.63
17            .052      .127     2.66        .051      .128     2.72
18 or >       .042      .105     2.66        .040      .104     2.82

Logit model: $\chi^2_p(2) = 54.2$
Probit model: $\chi^2_p(2) = 46.8$
Analysis Of NLSY Data: Odds Ratios For Alcohol Dependence And Gender (Continued)

Dependence on Gender and Age First Started Drinking

              Logit Regression                      Probit Regression                     Unstd. Coeff.
              Unstd.                                Unstd.                                Rescaled
              Coeff.    s.e.    t       Std.        Coeff.    s.e.    t       Std.        To Logit
Intercept      .84      .32     2.6                 -.42      .18     -2.4
Male           .98      .08     12.7    .5           .50      .04     3.      .48         .9
Age 1st       -.22      .02     -.6     -.9         -.12      .01     -.      -.9         -.22
R²             .2                                    .8

$OR = e^{.98} = 2.66$
logit ≈ probit * c, where $c = \sqrt{\pi^2 / 3} = 1.81$

NELS 88

Table 2.2 Odds ratios of eighth-grade students in 1988 performing below basic levels of reading and mathematics in 1988 and dropping out of school, 1988 to 1990, by basic demographics

Variable                         Below basic     Below basic     Dropped out
                                 mathematics     reading
Sex
  Female vs. male                .8*             .73**           .92
Race-ethnicity
  Asian vs. white                .82             .42**           .59
  Hispanic vs. white             2.9**           2.29**          2.**
  Black vs. white                2.23**          2.64**          2.23**
  Native American vs. white      2.43**          3.5**           2.5**
Socioeconomic status
  Low vs. middle                 .9**            .9**            3.95**
  High vs. middle                .46**           .4**            .39*

SOURCE: U.S. Department of Education, National Center for Education Statistics, National Education Longitudinal Study of 1988 (NELS:88), Base Year and First Follow-Up surveys.
NELS 88

Table 2.3 Adjusted odds ratios of eighth-grade students in 1988 performing below basic levels of reading and mathematics in 1988 and dropping out of school, 1988 to 1990, by basic demographics

Variable                         Below basic     Below basic     Dropped out
                                 mathematics     reading
Sex
  Female vs. male                .77**           .7**            .86
Race-ethnicity
  Asian vs. white                .84             .46**           .6
  Hispanic vs. white             .6**            .74**           .2
  Black vs. white                .77**           2.9**           .45
  Native American vs. white      2.2**           2.87**          .64
Socioeconomic status
  Low vs. middle                 .68**           .66**           3.74**
  High vs. middle                .49**           .44**           .4*

Latent Response Variable Formulation Versus Probability Curve Formulation

Probability curve formulation in the binary u case:

$P(u = 1 \mid x) = F(\beta_0 + \beta_1 x)$, (67)

where F is the standard normal or logistic distribution function.

Latent response variable formulation defines a threshold τ on a continuous u* variable so that u = 1 is observed when u* exceeds τ while otherwise u = 0 is observed,

$u^* = \gamma x + \delta$, (68)

where $\delta \sim N(0, V(\delta))$.

[Figure: distribution of u* with the threshold τ marked; u = 0 below τ and u = 1 above τ.]
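The two formulations can be compared by simulation. The sketch below is illustrative only: it assumes V(δ) = 1 (the probit case) and borrows τ = 3.581 and γ = 0.548 from the coal miner probit output with x = 6.2, draws u* = γx + δ, and compares the share of draws exceeding τ with the probability curve value Φ(−τ + γx). The seed and the number of draws are arbitrary choices.

```python
import math
import random

# Assumed values for illustration: coal miner probit estimates, V(delta) = 1
tau, gamma, x = 3.581, 0.548, 6.2

def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

random.seed(12345)
n_draws = 200_000
n_ones = 0
for _ in range(n_draws):
    u_star = gamma * x + random.gauss(0.0, 1.0)   # latent response
    if u_star > tau:                              # u = 1 when u* exceeds tau
        n_ones += 1

simulated = n_ones / n_draws
exact = normal_cdf(-tau + gamma * x)              # probability curve value

print("simulated P(u = 1 | x):", round(simulated, 3))
print("probability curve:     ", round(exact, 3))
```

With a standard normal δ, P(u* > τ | x) = 1 − Φ(τ − γx) = Φ(−τ + γx), so the latent response formulation reproduces the probit probability curve with β0 = −τ and β1 = γ.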