Multilevel Analysis and Complex Surveys Alan Hubbard UC Berkeley - Division of Biostatistics 1
Outline
- Multilevel data analysis
  - Estimating specific parameters of the data-generating distribution (GEE)
  - Estimating the whole (latent variable) distribution (multilevel mixed models and MLE)
- Complex survey (estimation and inference)
- Estimating multilevel mixed models with complex survey data
Schedule
  Beginning Time   Ending Time   Topic
  8:00             9:15          Introduction/Overview, GEE
  9:15             9:45          GEE Exer
  9:45             11:30         Multilevel Models
  11:30                          MLM Exercise
  1:00             2:00          Complex Survey
  2:00             2:30          Survey Exer
  2:30             3:30          Combined
  3:30             4:00          Combined Exer
  4:15             5:00          Causality Issues (Michael Oakes, Ecological Effects, etc.)
Multilevel Analysis and Complex Surveys Part 1: Parameters and inference from mixed models (MLE) and estimating equation (GEE) approaches Alan Hubbard UC Berkeley - Division of Biostatistics 4
Models For Multilevel Data: References
- Analysis of Longitudinal Data, by Diggle, Liang and Zeger.
- Applied Longitudinal Analysis, by Fitzmaurice, Laird and Ware.
- Generalized Latent Variable Modeling: Multilevel, Longitudinal and Structural Equation Models, by Skrondal, A. and Rabe-Hesketh, S.
- To GEE or not to GEE: comparing population average and mixed models for estimating the associations between neighborhood risk factors and health (with commentary and reply). Epidemiology, 21:467-474 (2010).
Generalized Estimation Equation (GEE) Approach to Clustered Data Alan Hubbard UC Berkeley - Division of Biostatistics 6
Clustered Data Regressions
- Ignore clustering: ordinary regressions, assuming that outcomes are conditionally independent.
- Multilevel (mixed effects) models: explicit model of the sources of random variability at the cluster level, $E(Y_{ijk} \mid X_{ijk}, \alpha_i, \alpha_{ij})$, $\alpha_i \sim N(0, \sigma^2_\alpha)$, ...
- Generalized Estimating Equation (GEE) approach: only specify relatively simple parameters (e.g., $E(Y_{ijk} \mid X_{ijk})$).
Issues with Clustered Data (Estimation) Covariates of Interest and identifiability: higher level (ecological) vs. individual level covariates. Targeting contributions of both. Defining effect of interest (e.g., direct effect of ecological covariates apart from individual level covariates). Causal inference challenges with clustered data (can one ever measure impact of composite variables vs. contextual variables?). Much work on mechanical implementation, less on what are the appropriate parameters of interest and necessary (but sometimes dubious) identifiability assumptions (Oakes). 8
Issues with Clustered Data (Correlation)
- Dealing with correlated data: general repeated measures issues.
- Model based inference (inference based on the proposed data-generating distribution).
- Empirical inference: use the form of the estimating equation to get a simple, robust empirical variance of the sampling distribution:
  $$\hat{\theta} = \theta + \frac{1}{n}\sum_{i=1}^{n} IC(O_i; \theta, \gamma) + o_p\!\left(\frac{1}{\sqrt{n}}\right), \qquad \mathrm{var}(\hat{\theta}) \approx \frac{\mathrm{var}(IC)}{n}$$
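To make the influence-curve variance formula concrete, here is a minimal sketch for the simplest estimator, the sample mean, whose influence curve is $IC(O_i) = O_i - \theta$. The data values and the function name `ic_variance_of_mean` are made up for illustration; with clustered data the independent unit $O_i$ would be the whole cluster.

```python
import math

def ic_variance_of_mean(obs):
    """Estimate var(theta_hat) for the sample mean from its influence curve."""
    n = len(obs)
    theta_hat = sum(obs) / n
    # Influence curve of the mean, evaluated at the estimate: IC(O_i) = O_i - theta_hat
    ic = [o - theta_hat for o in obs]
    # Empirical variance of the IC (its mean is zero by construction)
    var_ic = sum(v * v for v in ic) / n
    # var(theta_hat) is approximated by var(IC) / n
    return theta_hat, var_ic / n

obs = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]  # hypothetical independent units
theta_hat, var_hat = ic_variance_of_mean(obs)
se_hat = math.sqrt(var_hat)
print(theta_hat, var_hat, se_hat)
```

The same recipe applies to any estimator with a known influence curve: evaluate the curve on each independent unit, take its empirical variance, and divide by n.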
Example: observations within subjects: The Effect of Drug and Alcohol Use on Teenage Sexual Activity Minnis & Padian (2001) conducted a longitudinal study of teenagers in San Rafael, California to investigate the association between drug and alcohol use and sexual activity on the same day. Participants were asked to keep track of their activities over approximately one month and binary indicator variables were created to show whether drug/alcohol use and/or sexual activity were reported for each 24 hour period. 10
Example of Binary Outcome: Sex, Drugs and Teenagers A longitudinal study of the effects of drug-use on sexual activity. Let X ij, the only explanatory variable of interest for now, indicate whether or not subject i reported drug-use (1=yes, 0=no) on day j. Let Y ij denote whether subject had sex (1=yes, 0=no), i.e., Y ij is a binary outcome and thus its expectation can be modeled via the logit transform. 11
Data
        eid     today       drgalcoh   sx24hrs
   1.   10122   03 Jun 98   yes        no
   2.   10123   04 Jun 98   no         no
   3.   10123   05 Jun 98   no         no
   4.   10123   06 Jun 98   yes        no
   5.   10123   07 Jun 98   no         no
   6.   10123   08 Jun 98   no         no
   7.   10123   09 Jun 98   no         no
   8.   10123   12 Jun 98   no         no
   9.   10123   14 Jun 98   yes        no
  10.   10123   16 Jun 98   no         no
  11.   10123   17 Jun 98   no         no
  12.   10123   18 Jun 98   no         yes
  13.   10123   19 Jun 98   no         no
  14.   10123   20 Jun 98   no         no
  15.   10123   21 Jun 98   no         no
  16.   10123   23 Jun 98   no         no
  17.   10123   25 Jun 98   no         yes
  18.   10123   28 Jun 98   no         no
  19.   10123   29 Jun 98   no         yes
  20.   10123   01 Jul 98   no         yes
  21.   10123   02 Jul 98   no         no
  22.   10123   03 Jul 98   no         no
  23.   10123   04 Jul 98   no         no
  24.   10123   05 Jul 98   no         no
  25.   10124   04 Jun 98   no         no
  26.   10124   07 Jun 98   no         no
  27.   10124   08 Jun 98   no         no
Sexual Activity and drug/alcohol use among teenagers revisited
Main Variables
- sx24hrs: sex in last 24 hrs. (0=no, 1=yes)
- drgalcoh: drug or alcohol use in last 24 hrs.
- tues-sun: dummy variables designating day of week
Random Effects Models
- Uses a random effect to model the relative similarity of observations made on the same statistical unit (e.g., person).
- Assumes $Y_{ij}$ and $Y_{ik}$, $j \neq k$, are independent given some realized value of a random effect ($\beta_{0i}$) and the covariates: $Y_{ij} \perp Y_{ik} \mid X_{ij}, \beta_{0i}$.
- The model assumes these random effects are randomly drawn from a known distribution.
Random Effects Model for Teenage Sex and Drug-Use
  $$\mathrm{logit}[P(Y_{ij}=1 \mid \beta_{0i}, X_{ij}=x_{ij})] = \log\frac{P(Y_{ij}=1 \mid \beta_{0i}, X_{ij}=x_{ij})}{P(Y_{ij}=0 \mid \beta_{0i}, X_{ij}=x_{ij})} = \beta_0^{RE} + \beta_{0i} + \beta_1^{RE} x_{ij}$$
- Assume that the repeated observations for the ith teenager are independent of one another given $\beta_{0i}$ and $X_{ij}$.
- Must assume a parametric distribution for the $\beta_{0i}$, usually $\beta_{0i} \sim N(0, \tau^2)$.
- $\exp(\beta_1^{RE})$ is the odds ratio for having sex when subject i reports drug-use relative to when the same subject does not report drug-use.
Motivation for This Approach
- Natural for modeling heterogeneity across individuals in their regression coefficients.
- This heterogeneity can be represented by a probability distribution.
- Most useful when the object is to make inferences about individuals rather than population averages.
Motivation for This Approach Also useful to estimate the contributions to variability from different sources (e.g., within and among individuals). Can be extended to hierarchy of units (multilevel modeling), such as repeated longitudinal measures of a person, within a household, within a community... 17
Some available software for random effects models
- Linear Models
  - Proc Mixed in SAS
  - xtreg in STATA (only simple random effects models)
  - xtmixed in STATA 10
  - lme in R
- Logistic and Poisson Models
  - xtlogit and xtpoisson in STATA for simple random effects
  - xtmelogit and xtmepoisson for general mixed models in STATA version 10
  - gllamm for general mixed models is a STATA add-on
Random effects using xtlogit in STATA

  . xtlogit sx24hrs drgalcoh, or i(eid) re

  Random-effects logit                 Number of obs      =   1708
  Group variable (i) : eid             Number of groups   =    109
  Random effects u_i ~ Gaussian        Obs per group: min =      1
                                                      avg =   15.7
                                                      max =     33
                                       Wald chi2(1)       =   5.48
  Log likelihood = -921.39213          Prob > chi2        = 0.0192

  sx24hrs    |       OR   Std. Err.      z   P>|z|   [95% Conf. Interval]
  -----------+-----------------------------------------------------------
  drgalcoh   | 1.447266   .2284893    2.34   0.019   1.062096   1.972119
  -----------+-----------------------------------------------------------
  /lnsig2u   | .5483488   .2428238                   .0724228   1.024275
  -----------+-----------------------------------------------------------
  sigma_u    | 1.315444   .1597106                   1.036875   1.668854
  rho        | .3446819   .0166718                   .2463036   .4584528

  Likelihood ratio test of rho=0: chibar2(01) = 184.17   Prob >= chibar2 = 0.000

  Slide annotations: the drgalcoh row is $\exp(\beta_1^{RE})$; sigma_u is $\tau$.
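The variance rows of the xtlogit output are transformations of one another, which a short check makes visible. This sketch assumes the usual latent-response convention for a logit model, where the level-1 residual variance is $\pi^2/3$; the input value is copied from the output above.

```python
import math

lnsig2u = 0.5483488                    # /lnsig2u row: log of the random-intercept variance
sigma_u = math.exp(lnsig2u / 2)        # tau, the SD of the random intercepts
# Share of the latent-scale variance that is due to the random intercept
rho = sigma_u ** 2 / (sigma_u ** 2 + math.pi ** 2 / 3)
print(round(sigma_u, 6), round(rho, 7))
```

Both values agree with the sigma_u and rho rows of the output to the displayed precision.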
Estimation of Marginal Models (GEE)
- Estimate the marginal mean model. A marginal model is a population, not individual, model.
- The marginal $E[Y_{ij} \mid X_{ij} = x_{ij}]$ is defined as the mean value of an observation $Y_{ij}$ in the theoretical experiment where one randomly draws an observation from a population where everyone has $X_{ij} = x_{ij}$.
Marginal Models (GEE)
- For instance, let $Y_{ij}$ be cholesterol and $X_{ij}$ = yes if one smokes, no otherwise.
- In a marginal model, $E[Y_{ij} \mid X_{ij} = \text{yes}]$ will be the mean of a randomly drawn $Y_{ij}$ from the subpopulation where everyone smokes.
Parameter Interpretation in a marginal model
- Parameters in an equivalent random effects and GEE model have subtly different interpretations.
- Coefficients in a random effects model represent expected differences (odds ratios, relative risks, etc.) within an individual, given a change in their X from one value to another.
- Coefficients in a marginal model represent expected differences (odds ratios, relative risks, etc.) within a population, given a change in everyone's X from one value to another.
Parameter Interpretation in a GEE model, cont.
- In linear and log-linear models, the random effects and marginal regression parameters are the same.
- In logistic regression, they are different (more on this later).
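A small numerical sketch of why the two logistic parameters differ: averaging the subject-specific probabilities over the random-intercept distribution and then forming an odds ratio gives a value closer to 1 than the within-subject odds ratio. The slope and $\tau$ are taken from the xtlogit output shown earlier; the intercept `beta0 = -1.0` is a hypothetical value (the output does not report it), and the integral is done with a plain trapezoid rule.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def marginal_prob(eta, tau, grid=4001, width=8.0):
    """P(Y=1) averaged over b ~ N(0, tau^2), trapezoid rule on z = b / tau."""
    h = 2 * width / (grid - 1)
    total = 0.0
    for k in range(grid):
        z = -width + k * h
        w = 0.5 if k in (0, grid - 1) else 1.0   # trapezoid end-point weights
        total += w * expit(eta + tau * z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2 * math.pi)

beta0 = -1.0                      # hypothetical intercept, for illustration only
beta1 = math.log(1.447266)        # conditional (random effects) log odds ratio
tau = 1.315444                    # SD of the random intercepts

p0 = marginal_prob(beta0, tau)            # population proportion when x = 0
p1 = marginal_prob(beta0 + beta1, tau)    # population proportion when x = 1
marginal_or = (p1 / (1 - p1)) / (p0 / (1 - p0))
conditional_or = math.exp(beta1)
print(conditional_or, marginal_or)
```

With an identity or log link the averaging step passes straight through the slope, which is why the two parameters coincide in those models.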
Marginal Models (GEE) GEE software typically allows several different working correlation models (e.g., exchangeable, auto-regressive, unstructured, etc.). These correlation models are used to build weight matrices, which are used in a weighted regression. When deriving inferences for the coefficients, though, it calculates robust standard errors. 25
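Examples of Correlation Models
  $$V = \sigma^2 \begin{pmatrix} R_{01} & 0 & 0 & \cdots & 0 \\ 0 & R_{02} & 0 & \cdots & 0 \\ 0 & 0 & R_{03} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & R_{0n} \end{pmatrix}$$
- Each individual is independent of all others.
- Correlation within individuals across longitudinal observations has the same structure.
Structure for $R_0$
General structure:
  $$R_0 = \begin{pmatrix} 1 & \rho_{12} & \rho_{13} & \cdots & \rho_{1n} \\ \rho_{12} & 1 & \rho_{23} & \cdots & \rho_{2n} \\ \rho_{13} & \rho_{23} & 1 & \cdots & \rho_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho_{1n} & \rho_{2n} & \rho_{3n} & \cdots & 1 \end{pmatrix}$$
- A lot of unknown parameters.
Correlation Models (contd): Uniform correlation (compound symmetry or exchangeable)
  $$R_0 = \begin{pmatrix} 1 & \rho & \rho & \cdots & \rho \\ \rho & 1 & \rho & \cdots & \rho \\ \rho & \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \rho & \cdots & 1 \end{pmatrix}$$
- Arises from the random effects model $Y_{ij} = \alpha + \alpha_i + \beta x_{ij} + e_{ij}$.
- Errors $e_{ij}$ are uncorrelated, and independent of $\alpha_i$ and $x_{ij}$.
  $$\rho = \frac{\mathrm{Var}(\alpha_i)}{\mathrm{Var}(\alpha_i) + \mathrm{Var}(e_{ij})}$$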
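The link between the random intercept model and the exchangeable correlation can be checked by simulation. This is a sketch with made-up variance components (`var_alpha`, `var_e`), pairs of observations per subject, and no covariate; the empirical within-subject correlation should land near $\mathrm{Var}(\alpha_i) / (\mathrm{Var}(\alpha_i) + \mathrm{Var}(e_{ij}))$.

```python
import math
import random

random.seed(0)
var_alpha, var_e = 2.0, 1.0                     # hypothetical variance components
rho_theory = var_alpha / (var_alpha + var_e)

pairs = []
for _ in range(20000):
    a = random.gauss(0.0, math.sqrt(var_alpha))       # shared random intercept alpha_i
    y1 = a + random.gauss(0.0, math.sqrt(var_e))      # two observations on subject i
    y2 = a + random.gauss(0.0, math.sqrt(var_e))
    pairs.append((y1, y2))

n = len(pairs)
m1 = sum(p[0] for p in pairs) / n
m2 = sum(p[1] for p in pairs) / n
cov = sum((p[0] - m1) * (p[1] - m2) for p in pairs) / n
v1 = sum((p[0] - m1) ** 2 for p in pairs) / n
v2 = sum((p[1] - m2) ** 2 for p in pairs) / n
rho_hat = cov / math.sqrt(v1 * v2)
print(rho_theory, rho_hat)
```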
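Correlation Models (contd): time-decaying correlations (auto-regressive)
  $$R_0 = \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\ \rho & 1 & \rho & \cdots & \rho^{n-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{n-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1 \end{pmatrix}$$
- Auto-regressive: $e_{ij} = \rho\, e_{i,j-1} + \eta_{ij}$.
- Not great for unequally spaced longitudinal data.
- The exponential correlation model generalizes this to $\mathrm{corr}(Y_{ij}, Y_{ik}) = \rho^{|t_j - t_k|}$ rather than $\rho^{|j-k|}$.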
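The difference between the two time-decaying structures is easiest to see by building them. The helper names `ar1_corr` and `exponential_corr` and the measurement times are illustrative assumptions, not part of any package.

```python
def ar1_corr(rho, n):
    """AR(1): correlation depends on the number of visits apart, |j - k|."""
    return [[rho ** abs(j - k) for k in range(n)] for j in range(n)]

def exponential_corr(rho, times):
    """Exponential: correlation depends on elapsed time, |t_j - t_k|."""
    return [[rho ** abs(tj - tk) for tk in times] for tj in times]

R_ar1 = ar1_corr(0.5, 4)
R_even = exponential_corr(0.5, [0.0, 1.0, 2.0, 3.0])    # equally spaced visits
R_uneven = exponential_corr(0.5, [0.0, 0.5, 3.0])       # unequally spaced visits
for row in R_uneven:
    print([round(v, 4) for v in row])
```

With equally spaced unit intervals the two structures coincide; with unequal spacing only the exponential form lets closely spaced visits be more correlated than distant ones.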
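Examples of var-cov. models (Description, Abbrev., Var-Cov. Matrix)
- Compound Symmetry, CS:
  $$\begin{pmatrix} \sigma^2+\sigma_0^2 & \sigma_0^2 & \sigma_0^2 & \sigma_0^2 \\ \sigma_0^2 & \sigma^2+\sigma_0^2 & \sigma_0^2 & \sigma_0^2 \\ \sigma_0^2 & \sigma_0^2 & \sigma^2+\sigma_0^2 & \sigma_0^2 \\ \sigma_0^2 & \sigma_0^2 & \sigma_0^2 & \sigma^2+\sigma_0^2 \end{pmatrix}$$
- Unstructured, UN:
  $$\begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} & \sigma_{14} \\ \sigma_{12} & \sigma_2^2 & \sigma_{23} & \sigma_{24} \\ \sigma_{13} & \sigma_{23} & \sigma_3^2 & \sigma_{34} \\ \sigma_{14} & \sigma_{24} & \sigma_{34} & \sigma_4^2 \end{pmatrix}$$
- Autoregressive, AR(1):
  $$\begin{pmatrix} \sigma^2 & \rho\sigma^2 & \rho^2\sigma^2 & \rho^3\sigma^2 \\ \rho\sigma^2 & \sigma^2 & \rho\sigma^2 & \rho^2\sigma^2 \\ \rho^2\sigma^2 & \rho\sigma^2 & \sigma^2 & \rho\sigma^2 \\ \rho^3\sigma^2 & \rho^2\sigma^2 & \rho\sigma^2 & \sigma^2 \end{pmatrix}$$
- Banded Diagonal, UN(1):
  $$\begin{pmatrix} \sigma_1^2 & 0 & 0 & 0 \\ 0 & \sigma_2^2 & 0 & 0 \\ 0 & 0 & \sigma_3^2 & 0 \\ 0 & 0 & 0 & \sigma_4^2 \end{pmatrix}$$
- Spatial Power, SP(POW)(c):
  $$\begin{pmatrix} \sigma^2 & \sigma^2\rho^{d_{12}} & \sigma^2\rho^{d_{13}} & \sigma^2\rho^{d_{14}} \\ \sigma^2\rho^{d_{12}} & \sigma^2 & \sigma^2\rho^{d_{23}} & \sigma^2\rho^{d_{24}} \\ \sigma^2\rho^{d_{13}} & \sigma^2\rho^{d_{23}} & \sigma^2 & \sigma^2\rho^{d_{34}} \\ \sigma^2\rho^{d_{14}} & \sigma^2\rho^{d_{24}} & \sigma^2\rho^{d_{34}} & \sigma^2 \end{pmatrix}$$
The GEE Algorithm
- The algorithm is similar to the one used for non-repeated measures problems (e.g., OLS for continuous data, logistic regression for binary data and Poisson regression for counts).
- Let $R(\alpha)$ be an $n_i \times n_i$ "working" correlation matrix that is fully characterized by a vector of parameters, $\alpha$.
- $V_i$ is again the variance-covariance of the observations, which will be a function of the mean ($E(Y_i \mid X_i)$), a scale parameter, $\phi$, and $R(\alpha)$.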
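How $V_i$ is assembled from the mean, $\phi$ and $R(\alpha)$ can be sketched for binary outcomes. This assumes the standard GEE form $V_i = \phi A_i^{1/2} R(\alpha) A_i^{1/2}$, with $A_i$ the diagonal matrix of variance functions $\mu_{ij}(1-\mu_{ij})$, and an exchangeable working correlation; the fitted means below are made up, and the function name is illustrative.

```python
import math

def working_cov(mu, rho, phi=1.0):
    """V_i = phi * A^(1/2) R(alpha) A^(1/2), binary outcomes, exchangeable R."""
    sd = [math.sqrt(m * (1 - m)) for m in mu]   # square roots of the variance function
    n = len(mu)
    return [[phi * sd[j] * sd[k] * (1.0 if j == k else rho)
             for k in range(n)] for j in range(n)]

mu_i = [0.2, 0.5, 0.5]               # hypothetical fitted means for one subject
V_i = working_cov(mu_i, rho=0.1614)
for row in V_i:
    print([round(v, 4) for v in row])
```

The inverse of this matrix supplies the weights in the weighted regression step; if the guess for $R(\alpha)$ is wrong, the weights are merely suboptimal rather than invalid.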
Standard Errors of Coefficients GEE will normally return two estimates of the variance of the coefficient estimates, 1) naive and 2) robust. Naive assumes that the chosen model for R(α), such as compound symmetry, is correct. Robust is a more nonparametric estimate that does not assume your guess for R(α) is correct. However, its variance estimates can be more variable. 32
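The contrast between the two variance estimates can be sketched for the simplest case: an intercept-only model with an independence working correlation, where the estimate is the overall mean. The naive variance treats every observation as independent; the robust (sandwich) variance sums residuals within each cluster before squaring. The data are made up and no small-sample correction is applied, so this will not necessarily reproduce any package's output.

```python
import math

def mean_with_ses(clusters):
    ys = [y for c in clusters for y in c]
    N = len(ys)
    mu = sum(ys) / N
    # Naive: all N observations treated as independent
    naive_var = sum((y - mu) ** 2 for y in ys) / N / N
    # Robust: cluster-level sums of residuals form the "meat" of the sandwich
    robust_var = sum(sum(y - mu for y in c) ** 2 for c in clusters) / N ** 2
    return mu, math.sqrt(naive_var), math.sqrt(robust_var)

# Hypothetical data: three clusters whose members resemble each other
clusters = [[1.0, 1.2, 0.9], [3.0, 3.1, 2.8], [5.0, 5.3, 4.9]]
mu, se_naive, se_robust = mean_with_ses(clusters)
print(mu, se_naive, se_robust)
```

Because observations within a cluster are alike, the residuals add up rather than cancel, and the robust standard error comes out larger than the naive one.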
GEE Marginal Model for Teenage Sex and Drug-Use
  $$\mathrm{logit}[P(Y_{ij}=1 \mid X_{ij}=x_{ij})] = \log\frac{\mu_{ij}}{1-\mu_{ij}} = \log\frac{P(Y_{ij}=1 \mid X_{ij}=x_{ij})}{P(Y_{ij}=0 \mid X_{ij}=x_{ij})} = \beta_0^M + \beta_1^M x_{ij}$$
- $\mathrm{var}(Y_{ij}) = \mu_{ij}(1-\mu_{ij})$*, $\mathrm{corr}(Y_{ij}, Y_{ik}) = \rho$ (i.e., assume compound symmetry).
- $\exp(\beta_1^M)$ is a ratio of population frequencies, i.e., it is a population averaged parameter. It is the odds ratio of the probabilities (proportions) of teenagers who would engage in sexual activity in populations reporting drug use vs. populations not reporting drug-use.
* Semi-robust inference: can you tell why?
Results using xtgee in STATA

Robust SE

  . xtgee sx24hrs drgalcoh, eform i(id) family(binomial) cor(ind) robust

  GEE population-averaged model        Number of obs      =   1708
  Group variable: id                   Number of groups   =    109
  Link: logit                          Obs per group: min =      1
  Family: binomial                                    avg =   15.7
  Correlation: independent                            max =     33
  (standard errors adjusted for clustering on id)

             |              Semi-robust
  sx24hrs    | Odds Ratio   Std. Err.      z   P>|z|   [95% Conf. Interval]
  -----------+-------------------------------------------------------------
  drgalcoh   |   1.739521   .3149874    3.06   0.002   1.219823   2.480635

  Slide annotation: the drgalcoh row is $\exp(\beta_1^M)$.

Non-robust (naive) SE

  . xtgee sx24hrs drgalcoh, eform i(eid) family(binomial) cor(ind)

  sx24hrs    | Odds Ratio   Std. Err.      z   P>|z|   [95% Conf. Interval]
  -----------+-------------------------------------------------------------
  drgalcoh   |   1.739521     .20244    4.76   0.000   1.384744   2.185194
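When reading eform output it helps to know how the columns relate: the test statistic and confidence interval are computed on the log odds ratio scale, and the reported standard error of the odds ratio follows from the delta method, SE(OR) = OR x SE(log OR). The sketch below recomputes z and the interval from the robust row above; the numbers are copied from that output.

```python
import math

or_hat, se_or = 1.739521, 0.3149874    # robust row: odds ratio and its SE
log_or = math.log(or_hat)
se_log = se_or / or_hat                # delta method run backwards: SE of the log odds ratio
z = log_or / se_log
z_crit = 1.959964                      # 97.5th percentile of the standard normal
ci_lo = math.exp(log_or - z_crit * se_log)
ci_hi = math.exp(log_or + z_crit * se_log)
print(round(z, 2), round(ci_lo, 6), round(ci_hi, 6))
```

This is why the interval is not symmetric around the odds ratio: it is symmetric around the log odds ratio and then exponentiated.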
xtgee Options
- family(?), link(?): identify the outcome distribution and the link function. The defaults give linear regression with a continuous outcome; family(binomial) is used for binary outcomes, as in these examples.
- corr(ind): identifies that we will assume independence for our working correlation structure (some other possibilities include exchangeable and autoregressive structures).
- i(?): identifies which variable identifies the individual (or cluster).
- ro (robust): identifies that we wish robust estimates of variability.
Model 2: same marginal model, different working correlation
  $$\mathrm{logit}[P(Y_{ij}=1 \mid X_{ij}=x_{ij})] = \log\frac{\mu_{ij}}{1-\mu_{ij}} = \log\frac{P(Y_{ij}=1 \mid X_{ij}=x_{ij})}{P(Y_{ij}=0 \mid X_{ij}=x_{ij})} = \beta_0^M + \beta_1^M x_{ij}$$
- $x_{ij}$ = 0 if drug/alcohol use is no, 1 if yes
- $y_{ij}$ = 0 if no sex in last 24 hours, 1 if yes
- $\mathrm{cor}(Y_{ij}, Y_{ij'}) = \rho$ (compound symmetry or exchangeable correlation structure)
Results of Model 2 using STATA

Robust SE

  . xtgee sx24hrs drgalcoh, eform i(id) family(binomial) cor(exc) robust

  GEE population-averaged model        Number of obs      =   1708
  Group variable: id                   Number of groups   =    109
  Link: logit                          Obs per group: min =      1
  Family: binomial                                    avg =   15.7
  Correlation: exchangeable                           max =     33
  (standard errors adjusted for clustering on id)

             |              Semi-robust
  sx24hrs    | Odds Ratio   Std. Err.      z   P>|z|   [95% Conf. Interval]
  -----------+-------------------------------------------------------------
  drgalcoh   |   1.393705   .1919735    2.41   0.016   1.063956   1.825653

Non-robust (naive) SE

  . xtgee sx24hrs drgalcoh, eform i(eid) family(binomial) cor(exc)

  sx24hrs    | Odds Ratio   Std. Err.      z   P>|z|   [95% Conf. Interval]
  -----------+-------------------------------------------------------------
  drgalcoh   |   1.393705   .1701631    2.72   0.007   1.097095   1.770507
Estimated Working Correlation

  . xtcorr

         c1      c2      c3      c4      c5      c6      c7      c8      c9
  r1   1.0000
  r2   0.1614  1.0000
  r3   0.1614  0.1614  1.0000
  r4   0.1614  0.1614  0.1614  1.0000
  r5   0.1614  0.1614  0.1614  0.1614  1.0000
  r6   0.1614  0.1614  0.1614  0.1614  0.1614  1.0000
  r7   0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  1.0000
  r8   0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  1.0000
  r9   0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  1.0000
  r10  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614
  r11  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614
  r12  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614
  r13  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614
  r14  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614
  r15  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614
  r16  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614
  r17  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614
  r18  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614
  r19  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614  0.1614
Model 3: adjusting for day of week
  $$\mathrm{logit}[P(Y_{ij}=1 \mid x_{ij}, \mathrm{day}_{ij})] = \beta_0 + \beta_1 x_{ij} + \gamma_1 z_{1ij} + \gamma_2 z_{2ij} + \cdots + \gamma_6 z_{6ij}$$
- $x_{ij}$ = 1 if drug/alcohol use is yes, 0 if no
- $z_{1ij}$ = 1 if interview day is Tuesday, 0 if not
- $z_{2ij}$ = 1 if interview day is Wed., 0 if not
- ...
- $z_{6ij}$ = 1 if interview day is Sunday, 0 if not
- $y_{ij}$ = 1 if sex in last 24 hours, 0 if no
- $\mathrm{cor}(Y_{ij}, Y_{ij'}) = \rho$ (compound symmetry or exchangeable correlation structure)
Results of Model 3 using STATA

  . xtgee sx24hrs drgalcoh tues wed thur fri sat sun, eform i(id) family(binomial) cor(exc) robust

  GEE population-averaged model        Number of obs      =   1708
  Group variable: id                   Number of groups   =    109
  Link: logit                          Obs per group: min =      1
  Family: binomial                                    avg =   15.7
  Correlation: exchangeable                           max =     33
                                       Wald chi2(7)       =  11.40
  Scale parameter: 1                   Prob > chi2        = 0.1220
  (standard errors adjusted for clustering on id)

             |              Semi-robust
  sx24hrs    | Odds Ratio   Std. Err.      z   P>|z|   [95% Conf. Interval]
  -----------+-------------------------------------------------------------
  drgalcoh   |   1.373029   .1845197    2.36   0.018   1.055086   1.786782
  tues       |   1.239246   .2320747    1.15   0.252   .8585234   1.788804
  wed        |   1.234437   .2523307    1.03   0.303    .826942   1.842734
  thur       |   1.099757    .233122    0.45   0.654   .7258761   1.666215
  fri        |   .9833647   .1933837   -0.09   0.932   .6688388   1.445799
  sat        |   1.277403   .2490991    1.26   0.209   .8716457   1.872043
  sun        |   1.577958    .306514    2.35   0.019   1.078331    2.30908
Model for drug/alcohol use vs. day of week
  $$\mathrm{logit}[P(X_{ij}=1 \mid \mathrm{day}_{ij})] = \gamma_0^* + \gamma_1^* z_{1ij} + \gamma_2^* z_{2ij} + \cdots + \gamma_6^* z_{6ij}$$
- $X_{ij}$ = 1 if drug/alcohol use is yes, 0 if no
- $z_{1ij}$ = 1 if interview day is Tuesday, 0 if not
- $z_{2ij}$ = 1 if interview day is Wed., 0 if not
- ...
- $z_{6ij}$ = 1 if interview day is Sunday, 0 if not
- $\mathrm{cor}(X_{ij}, X_{ij'}) = \rho$ (compound symmetry or exchangeable correlation structure; the outcome in this model is drug/alcohol use)
Results of drug/alcohol use Model using STATA

  . xtgee drgalcoh tues wed thur fri sat sun, eform i(id) family(binomial) cor(exc) robust

  GEE population-averaged model        Number of obs      =   1708
  Group variable: id                   Number of groups   =    109
  Link: logit                          Obs per group: min =      1
  Family: binomial                                    avg =   15.7
  Correlation: exchangeable                           max =     33
                                       Wald chi2(6)       =  28.91
  Scale parameter: 1                   Prob > chi2        = 0.0001
  (standard errors adjusted for clustering on id)

             |              Semi-robust
  drgalcoh   | Odds Ratio   Std. Err.      z   P>|z|   [95% Conf. Interval]
  -----------+-------------------------------------------------------------
  tues       |   .7484218   .1301296   -1.67   0.096   .5322875   1.052317
  wed        |   .7043399   .1440654   -1.71   0.087   .4717131   1.051687
  thur       |   .9226514    .171617   -0.43   0.665   .6407825   1.328509
  fri        |   1.197263   .2206008    0.98   0.329    .834357   1.718015
  sat        |   1.666645   .3147173    2.71   0.007   1.151088   2.413115
  sun        |   1.371219    .205994    2.10   0.036   1.021488   1.840688
Covariate and Cluster size issues We examine a simple example to look at how estimation and inference with clustered data are impacted by various changes in the data distribution. Cluster constant (e.g., county level) versus cluster varying (e.g., individual-level) covariates. Balanced versus unbalanced data (number of subunits within clusters). 44
Longitudinal Data on HIV+ patients
- Deeks, et al. (1999) report the results from a longitudinal study of HIV-infected adults undergoing Highly Active Anti-Retroviral Therapy (HAART) at San Francisco General Hospital (SFGH).
- Patients were included in this analysis if they received at least 16 weeks of continuous therapy with an anti-retroviral regimen.
- The following data were obtained during the initial review: date of birth, sex and length of previous exposure to each individual anti-retroviral agent.
- Once patients were identified, their medical records were reviewed every 3-4 months until November 1998.
- Plasma HIV RNA assays were performed using a branched DNA (bDNA) assay.
- Repeated and irregular measurements of CD4 and viral load (time-structured repeated measures).
- Data not always matched in time.
- Goal is to find how CD4 varies with viral load and how this pattern varies in the population.
Sample of HIV+ Data 47
[Figure: CD4 versus Time. Horizontal axis: etime. 30 evenly spaced subjects ranked by slope (CD4 vs. T).]
HIV+ (CD4 Count) Data some simple analyses using only 2 observations per person Purpose is to illustrate the effects on estimates and inference of both different working correlation matrices and robust vs. naive inference: Consider two scenarios: baseline (time-independent) covariate, time-dependent covariate. 49
Association of Baseline Covariate (Age) on CD4 count
- Binary age ($X_{ij}$) = 0 (<40) or 1 (>40)
- Fit simple linear model: $E[Y_{ij} \mid X_{ij} = x_i] = \beta_0 + \beta_1 x_i$
- Compare results of Models A-D:

                     Naive   Robust
    Unweighted OLS     A       B
    Weighted LS        C       D
Association of Baseline Covariate (Age) on CD4 count

Model A
  . xtgee cd4 binage, i(id) cor(ind)

  cd4      |     Coef.   Std. Err.      z   P>|z|   [95% Conf. Interval]
  ---------+------------------------------------------------------------
  binage   |   24.2404   14.17075    1.71   0.087   -3.533768   52.01457
  _cons    |   225.902   9.867247   22.89   0.000    206.5625   245.2414

Model B
  . xtgee cd4 binage, i(id) cor(ind) robust
  (standard errors adjusted for clustering on id)

           |             Semi-robust
  cd4      |     Coef.   Std. Err.      z   P>|z|   [95% Conf. Interval]
  ---------+------------------------------------------------------------
  binage   |   24.2404   19.26181    1.26   0.208   -13.51206   61.99286
  _cons    |   225.902   12.62139   17.90   0.000    201.1645   250.6394
Association of Baseline Covariate (Age) on CD4 count

Model C
  . xtgee cd4 binage, i(id) cor(exc)

  cd4      |     Coef.   Std. Err.      z   P>|z|   [95% Conf. Interval]
  ---------+------------------------------------------------------------
  binage   |   24.2404   19.16452    1.26   0.206   -13.32137   61.80217
  _cons    |   225.902   13.34446   16.93   0.000    199.7473   252.0566

Model D
  . xtgee cd4 binage, i(id) cor(exc) robust
  (standard errors adjusted for clustering on id)

           |             Semi-robust
  cd4      |     Coef.   Std. Err.      z   P>|z|   [95% Conf. Interval]
  ---------+------------------------------------------------------------
  binage   |   24.2404   19.26181    1.26   0.208   -13.51206   61.99286
  _cons    |   225.902   12.62139   17.90   0.000    201.1645   250.6394
Summary of Results of Association of Baseline Covariate (Age) on CD4 count

  β0 (SE)            Naive           Robust
  Unweighted OLS     225.9 (9.9)     225.9 (12.6)
  Weighted LS        225.9 (13.3)    225.9 (12.6)

  β1 (SE)            Naive           Robust
  Unweighted OLS     24.24 (14.2)    24.24 (19.3)
  Weighted LS        24.24 (19.2)    24.24 (19.3)
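One way to read the gap between the naive independence SE and the exchangeable SE for a cluster-constant covariate is through the design effect $1 + (m-1)\rho$ for clusters of size m. The sketch below backs out the within-person correlation implied by the two model-based standard errors for binage, copied from Models A and C (14.17075 and 19.16452). It assumes the two fits differ only by this variance inflation, which is a simplification and not something the output itself states.

```python
se_naive_ind = 14.17075    # Model A: independence, model-based SE for binage
se_naive_exc = 19.16452    # Model C: exchangeable, model-based SE for binage
m = 2                      # observations per person in this analysis

variance_ratio = (se_naive_exc / se_naive_ind) ** 2   # implied design effect
rho_implied = (variance_ratio - 1) / (m - 1)
print(round(variance_ratio, 3), round(rho_implied, 3))
```

A strongly positive implied correlation is consistent with the pattern in the table: for a covariate that does not vary within person, ignoring the clustering understates the standard error.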
Association of Time (within cluster) Varying Covariate (Viral Load) on CD4 count
- Binary VL: $X_{ij}$ = 0 (<2000) or 1 (>2000); all subjects included have one low and one high VL.
- Fit simple linear model: $E[Y_{ij} \mid X_{ij} = x_{ij}] = \beta_0 + \beta_1 x_{ij}$
- Compare results of Models A-D:

                     Naive   Robust
    Unweighted OLS     A       B
    Weighted LS        C       D
Association of Within-Cluster-Varying Covariate (VL) on CD4 count

Model A
  . xtgee cd4 medvl, i(id) cor(ind)

  cd4      |      Coef.   Std. Err.      z   P>|z|   [95% Conf. Interval]
  ---------+-------------------------------------------------------------
  medvl    |   -98.3494   30.29324   -3.25   0.001   -157.7231  -38.97574
  _cons    |   377.3735   21.42055   17.62   0.000      335.39    419.357

Model B
  . xtgee cd4 medvl, i(id) cor(ind) robust
  (standard errors adjusted for clustering on id)

           |              Semi-robust
  cd4      |      Coef.   Std. Err.      z   P>|z|   [95% Conf. Interval]
  ---------+-------------------------------------------------------------
  medvl    |   -98.3494   16.51035   -5.96   0.000   -130.7091  -65.98971
  _cons    |   377.3735   22.92943   16.46   0.000    332.4326   422.3143
Association of Within-Cluster-Varying Covariate (VL) on CD4 count

Model C
  . xtgee cd4 medvl, i(id) cor(exc)

  cd4      |      Coef.   Std. Err.      z   P>|z|   [95% Conf. Interval]
  ---------+-------------------------------------------------------------
  medvl    |   -98.3494   16.41059   -5.99   0.000   -130.5136  -66.18523
  _cons    |   377.3735   21.42055   17.62   0.000      335.39    419.357

Model D
  . xtgee cd4 medvl, i(id) cor(exc) robust
  (standard errors adjusted for clustering on id)

           |              Semi-robust
  cd4      |      Coef.   Std. Err.      z   P>|z|   [95% Conf. Interval]
  ---------+-------------------------------------------------------------
  medvl    |   -98.3494   16.51035   -5.96   0.000   -130.7091  -65.98971
  _cons    |   377.3735   22.92943   16.46   0.000    332.4326   422.3143
Robust Equivalent to Paired T-test

Paired T-test
  . keep id cd4 medvl etime
  . sort cd4 medvl
  . reshape wide cd4 etime, i(id) j(medvl)
  . ttest cd40 = cd41

  Paired t test

  Variable |   Obs       Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
  ---------+----------------------------------------------------------------
  cd40     |    83   377.3735   22.92943     208.897    331.7596   422.9874
  cd41     |    83   279.0241   20.07767    182.9163    239.0832    318.965
  ---------+----------------------------------------------------------------
  diff     |    83    98.3494   16.51035    150.4164    65.50505   131.1937

  Ho: mean(cd40 - cd41) = mean(diff) = 0

  Ha: mean(diff) < 0      Ha: mean(diff) != 0     Ha: mean(diff) > 0
  t = 5.9568              t = 5.9568              t = 5.9568
  P < t = 1.0000          P > |t| = 0.0000        P > t = 0.0000
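The paired t-test standard error can be recomputed by hand from the diff row, which shows where the number 16.51035 comes from and why it matches the robust GEE standard error for medvl. The inputs are copied from the output above.

```python
import math

n = 83                  # number of subjects (pairs)
mean_diff = 98.3494     # mean of cd40 - cd41
sd_diff = 150.4164      # SD of the within-person differences

se_diff = sd_diff / math.sqrt(n)    # SE of the mean difference
t_stat = mean_diff / se_diff
print(round(se_diff, 5), round(t_stat, 4))
```

Both approaches work with within-person differences, so the large between-person variability in CD4 drops out of the standard error.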
Summary of Results of Association of Time Varying Covariate (VL) on CD4 count

  β0 (SE)               Naive           Robust
  Unweighted OLS        377.4 (21.4)    377.4 (22.9)
  Weighted LS           377.4 (21.4)    377.4 (22.9)

  β1 (SE)               Naive           Robust
  Unweighted OLS        -98.3 (30.3)    -98.3 (16.5)
  Weighted LS           -98.3 (16.4)    -98.3 (16.5)
  t-test (difference)   -98.3 (16.5)
Multiple and varying observations per person
- CD4 (Y) vs. continuous (log) Viral Load (X):
  $$E[Y_{ij} \mid X_{i1} = x_{i1}, X_{ij} = x_{ij}] = \beta_0 + \beta_1 x_{i1} + \beta_2 (x_{ij} - x_{i1})$$
- $\beta_2$ represents the expected change in Y given a change in $X_{ij}$ relative to the baseline value ($X_{i1}$): the longitudinal effect.
- $\beta_1$ represents the expected difference in average Y across two sub-populations that differ by their baseline values, $X_{i1}$: the cross-sectional effect.
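The two regressors in this model are simple transformations of the long-format data: each person's first measurement, and the change from it. Here is a minimal sketch of that bookkeeping, with made-up rows of (id, time, log viral load); the function name `add_base_and_change` is illustrative, and the two derived columns play the role of the baseline and change covariates in the model.

```python
def add_base_and_change(rows):
    """rows: (id, time, x). Returns (id, time, x_base, x_change) per row."""
    first = {}
    # Walk through each person's records in time order and keep the earliest x
    for pid, t, x in sorted(rows, key=lambda r: (r[0], r[1])):
        first.setdefault(pid, x)
    return [(pid, t, first[pid], x - first[pid]) for pid, t, x in rows]

# Hypothetical long-format data: (id, days since baseline, log viral load)
rows = [(1, 0, 4.0), (1, 30, 3.5), (1, 90, 2.0), (2, 0, 5.0), (2, 45, 5.5)]
out = add_base_and_change(rows)
for r in out:
    print(r)
```

The baseline column is constant within person (a cluster-level covariate), while the change column varies within person, so the two coefficients separate cross-sectional from longitudinal information.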
Association of Within-Cluster-Varying Covariate (VL) on CD4 count: multiple observations per person

Model A
  . xtgee cd4 logvlbase logvlchange, i(id) cor(ind)

  GEE population-averaged model        Number of obs      =   7053
  Group variable: id                   Number of groups   =    406
  Link: identity                       Obs per group: min =      1
  Family: Gaussian                                    avg =   17.4
  Correlation: independent                            max =     58

  cd4          |       Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
  -------------+---------------------------------------------------------------
  logvlbase    |   -83.74371   2.960401   -28.29   0.000   -89.54599  -77.94143
  logvlchange  |     -99.194   2.453052   -40.44   0.000   -104.0019   -94.3861
  _cons        |    618.9555   11.61598    53.28   0.000    596.1886   641.7224
Association of Time-Varying Covariate (VL) on CD4 count: multiple observations per person

Model B
  . xtgee cd4 logvlbase logvlchange, i(id) cor(ind) robust

  GEE population-averaged model        Number of obs      =   7053
  Group variable: id                   Number of groups   =    406
  Link: identity                       Obs per group: min =      1
  Family: Gaussian                                    avg =   17.4
  Correlation: independent                            max =     58
                                       Wald chi2(2)       = 225.39
  (standard errors adjusted for clustering on id)

               |               Semi-robust
  cd4          |       Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
  -------------+---------------------------------------------------------------
  logvlbase    |   -83.74371   8.296962   -10.09   0.000   -100.0055  -67.48196
  logvlchange  |     -99.194   6.831102   -14.52   0.000   -112.5827  -85.80528
  _cons        |    618.9555   35.19853    17.58   0.000    549.9677   687.9434
Association of Time-Varying Covariate (VL) on CD4 count: multiple observations per person

Model C
  . xtgee cd4 logvlbase logvlchange, i(id) cor(exc)

  cd4          |       Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
  -------------+---------------------------------------------------------------
  logvlbase    |   -52.75548   7.402832    -7.13   0.000   -67.26477   -38.2462
  logvlchange  |    -54.7488   2.172512   -25.20   0.000   -59.00684  -50.49075
  _cons        |    509.1174   31.23263    16.30   0.000    447.9026   570.3322

Model D
  (standard errors adjusted for clustering on id)

               |               Semi-robust
  cd4          |       Coef.   Std. Err.       z   P>|z|   [95% Conf. Interval]
  -------------+---------------------------------------------------------------
  logvlbase    |   -52.75548    7.72342    -6.83   0.000   -67.89311  -37.61786
  logvlchange  |    -54.7488   3.158417   -17.33   0.000   -60.93918  -48.55841
  _cons        |    509.1174   32.95307    15.45   0.000    444.5305   573.7042
Summary of Results of Association of Time-Varying Covariate (VL) on CD4 count: multiple observations per person

  β0 (SE)            Naive           Robust
  Unweighted OLS     618.9 (11.6)    618.9 (35.2)
  Weighted LS        509.1 (31.2)    509.1 (32.9)

  β1 (SE)            Naive           Robust
  Unweighted OLS     -83.7 (3.0)     -83.7 (8.3)
  Weighted LS        -52.7 (7.4)     -52.7 (7.7)

  β2 (SE)            Naive           Robust
  Unweighted OLS     -99.2 (2.4)     -99.2 (6.8)
  Weighted LS        -54.7 (2.2)     -54.7 (3.2)
MultiLevel Models Alan Hubbard UC Berkeley - Division of Biostatistics 64
Many Names for variants of Same Statistical Model
- Hierarchical Linear Models (HLMs)
- Random Coefficient Models
- Mixed Models (most general)
- Multilevel models (MLMs)
- Nested modeling
Typical Data Structures for MLMs
- Distinctive feature is the hierarchical nature of statistical units, e.g.
  - Neighborhoods
    - people
      - Measurements made over time on the people
  - Classrooms
    - students
      - Measurements made over time on the students
- Different sources of variation:
  - Between classrooms
  - Between students within classrooms
  - Within students
Motivation for using MLMs (mixed models)
- Procedure estimates the fixed effects of interest.
- Dissects the sources of variation.
- Accounts for residual correlation among statistically dependent units when deriving inference.
- Permits one to specify a rich set of correlation models and allows for heteroskedasticity.
Motivation for using MLM s (mixed models) It allows different subjects to have different responses to a treatment, risk variable, etc., thus has intuitive appeal. Rarely interesting, but can also provide postestimation estimates of the random effects. You get the entire data-generating distribution. Use the virtues of having a likelihood. Can simulate data from the resulting parameter estimates In contrast with other approaches that only target a specific aspect of the data-generating distribution. 68
What's Being Mixed?
- A mixed model has two types of effects, fixed and random.
- A fixed effect means that all levels of the variable are contained in the data and the effect is universal to everyone in the target population.
- A random effect means that the levels (effects) of the variable are a random sample of the levels (effects) in the target population.
- Consider a risk factor effect: is it fixed, random, or both?

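The Simplest Example
The model: $Y_{ij} = \mu + \alpha_i + e_{ij}$
$E(\alpha_i) = 0$, $E(e_{ij}) = 0$, $E[\alpha_i e_{ij}] = 0$.
$\mathrm{Var}(\alpha_i) = \sigma^2_\alpha$, $\mathrm{Var}(e_{ij}) = \sigma^2_e$.
More specifically, $\alpha_i \sim N(0, \sigma^2_\alpha)$ and $e_{ij} \sim N(0, \sigma^2_e)$.
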
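The moment assumptions of this model already fix the correlation structure of the outcomes. A short derivation, assuming in addition that the errors $e_{ij}$ and $e_{ik}$ of two different observations ($j \neq k$) in the same cluster $i$ are uncorrelated:

```latex
\begin{aligned}
\operatorname{Var}(Y_{ij}) &= \sigma^2_\alpha + \sigma^2_e, \\
\operatorname{Cov}(Y_{ij}, Y_{ik}) &= \operatorname{Cov}(\alpha_i + e_{ij},\ \alpha_i + e_{ik}) = \sigma^2_\alpha \quad (j \neq k), \\
\operatorname{Corr}(Y_{ij}, Y_{ik}) &= \frac{\sigma^2_\alpha}{\sigma^2_\alpha + \sigma^2_e}.
\end{aligned}
```

Every pair of observations in a cluster therefore shares the same correlation, which is the exchangeable (compound symmetry) structure.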
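Likelihood
Write $\phi(y;\, m, v)$ for the normal density with mean $m$ and variance $v$. Given $\alpha_i$:
$f(Y_{ij} \mid \alpha_i) = \phi(Y_{ij};\ \mu + \alpha_i,\ \sigma^2_e)$, and $f(Y_i \mid \alpha_i) = \prod_{j=1}^{n_i} \phi(Y_{ij};\ \mu + \alpha_i,\ \sigma^2_e)$.
The likelihood of the observed data (for one unit) is:
$f(Y_i) = \int f(Y_i \mid \alpha)\, f(\alpha)\, d\alpha = \int \prod_{j=1}^{n_i} \phi(Y_{ij};\ \mu + \alpha,\ \sigma^2_e)\ \phi(\alpha;\ 0,\ \sigma^2_\alpha)\, d\alpha$
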
Estimation of Fixed Effects Using Mixed Models
- Random effects models imply certain variance-covariance structures. For instance, a simple random effects model results in equal correlation (exchangeable or compound symmetry) among all observations measured on the same subject.
- We know that if the variance-covariance matrix (V) is known, then the most efficient estimate of the coefficients is weighted least squares:
$\hat\beta = (X^T W X)^{-1} X^T W Y$, where $W = V^{-1}$.

Estimation of Coefficients Using Mixed Models, cont.
The mixed model procedure works by:
- converting the random effects model into its implied variance-covariance matrix, V;
- starting with the independence model (OLS), getting residuals and then estimating V based on this model;
- creating the weight matrix as $W = \hat V^{-1}$;
- doing weighted least squares and getting residuals;
- repeating until convergence.
The SEs the procedure returns come from:
$\widehat{\mathrm{var}}(\hat\beta) = (X^T \hat W X)^{-1} = (X^T \hat V^{-1} X)^{-1}$

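To make the weighted least squares step concrete, here is a minimal sketch for the intercept-only random-intercept model $Y_{ij} = \mu + \alpha_i + e_{ij}$. It assumes the variance components are known (the real procedure iterates to estimate them), and the function name and data are hypothetical. For an exchangeable $V_i$, $(X^T V^{-1} X)$ reduces to a sum of per-cluster weights $n_i / (\sigma^2_e + n_i \sigma^2_\alpha)$.

```python
import math

def gls_mean(clusters, var_between, var_within):
    """GLS (weighted least squares) estimate of mu, treating the variance
    components as known. `clusters` is a list of lists of outcomes.
    Returns (estimate, model_based_se)."""
    num = 0.0
    den = 0.0
    for y in clusters:
        n_i = len(y)
        # 1' V_i^{-1} 1 for V_i = var_within * I + var_between * J
        w_i = n_i / (var_within + n_i * var_between)
        num += w_i * (sum(y) / n_i)   # weight times cluster mean
        den += w_i
    # (X' V^{-1} X)^{-1} is simply 1/den in the intercept-only case
    return num / den, math.sqrt(1.0 / den)

# Hypothetical data: three clusters of unequal size
clusters = [[10.0, 12.0, 11.0, 13.0], [20.0], [15.0, 17.0]]
ols_est, ols_se = gls_mean(clusters, var_between=0.0, var_within=4.0)
gls_est, gls_se = gls_mean(clusters, var_between=9.0, var_within=4.0)
print(ols_est, ols_se)
print(gls_est, gls_se)
```

With no between-cluster variance the estimate is the pooled (OLS) mean; as the between-cluster variance grows, each cluster mean is weighted more equally regardless of cluster size.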
Model-Based Inference
When deriving the inference on coefficients, the estimating procedure assumes that the variance-covariance model of the outcome implied by the model IS CORRECT (i.e., its $SE(\hat\beta)$ is always "naive", not "robust").

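The naive versus robust distinction can be sketched in the simplest case: estimating a mean by OLS from clustered data. This is an illustration under stated assumptions, not the mixed model procedure itself. The working model is independence, the robust variance is the basic cluster sandwich without the small-sample correction that software typically applies, and the data are hypothetical.

```python
import math

def mean_with_naive_and_robust_se(clusters):
    """OLS mean with (a) a naive SE assuming independent observations and
    (b) a cluster-robust (sandwich) SE."""
    y_all = [y for c in clusters for y in c]
    n = len(y_all)
    ybar = sum(y_all) / n
    # Naive: residual variance divided by n
    s2 = sum((y - ybar) ** 2 for y in y_all) / (n - 1)
    naive_se = math.sqrt(s2 / n)
    # Robust: sum residuals within each cluster first, then square
    meat = sum(sum(y - ybar for y in c) ** 2 for c in clusters)
    robust_se = math.sqrt(meat) / n
    return ybar, naive_se, robust_se

# Hypothetical data with strong within-cluster correlation
clusters = [[10.0, 11.0, 12.0], [20.0, 21.0, 22.0], [30.0, 31.0, 32.0]]
ybar, naive_se, robust_se = mean_with_naive_and_robust_se(clusters)
print(ybar, naive_se, robust_se)
```

When observations within a cluster move together, the robust SE exceeds the naive one because the naive formula overstates the amount of independent information.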
Virtues of MultiLevel Models Diez-Roux 75
Provides Road Map for Accounting for Systematic and Random Variation at Various Levels (individuals, counties, states, ...) Diez-Roux
Stage 2 Diez-Roux 77
Put it together, just a mixed model Diez-Roux 78
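Random Intercepts and Random Associations
The model: $Y_{ij} = (\beta_0 + \beta_{0i}) + (\beta_1 + \beta_{1i})\, x_{ij} + e_{ij}$
$E(\beta_{0i}) = 0$, $E(\beta_{1i}) = 0$, $E(e_{ij}) = 0$.
$\mathrm{Var}(\beta_{0i}) = \sigma^2_0$, $\mathrm{Var}(\beta_{1i}) = \sigma^2_1$, $\mathrm{Var}(e_{ij}) = \sigma^2$.
$\mathrm{cov}(\beta_{0i}, \beta_{1i}) = \sigma_{12}$, $\mathrm{cov}(\beta_{0i}, e_{ij}) = 0$, $\mathrm{cov}(\beta_{1i}, e_{ij}) = 0$.
What are the fixed and random effects in this model?
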
Simple Example (Individual is Cluster)
- Orthodontic study (Potthoff and Roy, 1964).
- 16 boys and 11 girls between the ages of 8 and 14 years.
- The response variable is the distance (in millimeters) between the pituitary and the pterygomaxillary fissure.

Dental Data

obsno   child   age   distance   gender
    1       1     8       21         0
    2       1    10       20         0
    3       1    12       21.5       0
    4       1    14       23         0
    5       2     8       21         0
    6       2    10       21.5       0
    7       2    12       24         0
    8       2    14       25.5       0
    9       3     8       20.5       0
   10       3    10       24         0
   11       3    12       24.5       0
   12       3    14       26         0
   13       4     8       23.5       0
   14       4    10       24.5       0
   15       4    12       25         0
   16       4    14       26.5       0
   17       5     8       21.5       0
   18       5    10       23         0
   19       5    12       22.5       0
   20       5    14       23.5       0
   21       6     8       20         0

Dental Data [Figure: distance (mm) versus age (yrs), ages 8 to 14]
Mixed Model I for Dental Data
Model: $Y_{ij} = (\beta_0 + \beta_{0i}) + \beta_1 x_{ij} + e_{ij}$
where $x_{ij}$ is the jth age of the ith child and $Y_{ij}$ is the distance.
$\beta_{0i}$ i.i.d. $N(0, \sigma^2_0)$, $e_{ij}$ i.i.d. $N(0, \sigma^2)$.

Why Is This Model Called MultiLevel?
The model can be written in two steps (as two levels):
$Y_{ij} = \beta^*_{0i} + \beta_1 x_{ij} + e_{ij}$
and then a model for the individual coefficients:
$\beta^*_{0i} = \beta_0 + \beta_{0i}$
This implies that each individual has their own random intercept, drawn from a population with mean intercept $\beta_0$, but all subjects have the same slope, $\beta_1$.
$\beta^*_{0i}$ can also be a function of, say, baseline covariates z:
$\beta^*_{0i}(z) = \beta_0 + \beta_{0i} + \alpha_1 z_1 + \alpha_2 z_2 + \dots$

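Mixed (MultiLevel) Model II for Dental Data
The model (called a random coefficients model):
$Y_{ij} = (\beta_0 + \beta_{0i}) + (\beta_1 + \beta_{1i})\, x_{ij} + e_{ij}$
$E(\beta_{0i}) = 0$, $E(\beta_{1i}) = 0$, $E(e_{ij}) = 0$.
$\mathrm{Var}(\beta_{0i}) = \sigma^2_0$, $\mathrm{Var}(\beta_{1i}) = \sigma^2_1$, $\mathrm{Var}(e_{ij}) = \sigma^2$.
$\mathrm{cov}(\beta_{0i}, \beta_{1i}) = \sigma_{12}$, $\mathrm{cov}(\beta_{0i}, e_{ij}) = \mathrm{cov}(\beta_{1i}, e_{ij}) = 0$.
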
STATA for Model II: xtmixed

. xtmixed distance age || child: age, cov(uns)

Mixed-effects REML regression                   Number of obs      =       108
Group variable: child                           Number of groups   =        27
                                                Obs per group: min =         4
                                                               avg =       4.0
                                                               max =         4
                                                Wald chi2(1)       =     85.85
Log restricted-likelihood = -221.31834          Prob > chi2        =    0.0000
------------------------------------------------------------------------------
    distance |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .6601852   .0712533     9.27   0.000     .5205314     .799839
       _cons |   16.76111    .775246    21.62   0.000     15.24166    18.28057
------------------------------------------------------------------------------

STATA for Model II: xtmixed Variance Components Estimates

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
child: Unstructured          |
                     sd(age) |   .2264277   .0915312      .1025274    .5000568    (estimate of σ1)
                   sd(_cons) |   2.327034   1.065366      .9486467    5.708223    (estimate of σ0)
            corr(age,_cons)  |  -.6093328   .3255995     -.9382101    .2978612    (σ12 on the correlation scale)
-----------------------------+------------------------------------------------
                sd(residual) |    1.31004   .1260584       1.08487    1.581944
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(3) =    66.53   Prob > chi2 = 0.0000
Note: LR test is conservative and provided only for reference

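Because Stata reports standard deviations and a correlation, it can help to convert these back to the covariance scale and see what marginal variances and correlations Model II implies at different ages. The sketch below uses only the estimates printed above. The function name is hypothetical, and it assumes residual errors are independent across a child's observations.

```python
import math

# Estimates copied from the xtmixed output (SDs and a correlation)
sd_slope, sd_int, corr, sd_resid = 0.2264277, 2.327034, -0.6093328, 1.31004
cov_int_slope = corr * sd_int * sd_slope   # sigma_12 on the covariance scale

def marginal_cov(age_a, age_b):
    """Cov(Y_ij, Y_ik) implied by the random coefficients model for one
    child measured at age_a and age_b."""
    c = (sd_int ** 2
         + (age_a + age_b) * cov_int_slope
         + age_a * age_b * sd_slope ** 2)
    if age_a == age_b:
        # Same observation: add the residual variance
        # (assumes each age is observed once per child)
        c += sd_resid ** 2
    return c

var_8 = marginal_cov(8, 8)
var_14 = marginal_cov(14, 14)
corr_8_14 = marginal_cov(8, 14) / math.sqrt(var_8 * var_14)
print(var_8, var_14, corr_8_14)
```

Unlike the random-intercept model, the implied variance changes with age and the implied correlation depends on which two ages are compared.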
STATA for Model II: xtmixed random coefficient estimates

. predict b*, reffects
. list b*

     +-----------------------+
     |   β1i (b1)   β0i (b2) |
     |-----------------------|
  1. |  -.1782096  -.4859597 |
  5. |    .009858  -1.011849 |
  9. |   .0506423  -.7727903 |
 13. |  -.0298621   1.069163 |

Summary of Results: Association of Baseline Covariate (Age) with CD4 Count

Coefficient   Estimator         Naive, Est. (SE)   Robust, Est. (SE)
β0            Unweighted OLS    225.9 (9.9)        225.9 (12.6)
β0            Weighted LS       225.9 (13.3)       225.9 (12.6)
β1            Unweighted OLS    24.24 (14.2)       24.24 (19.3)
β1            Weighted LS       24.24 (19.2)       24.24 (19.3)

Re-visit Model of CD4 vs. Baseline Age
Fit simple random effects model: $Y_{ij} = \beta_0 + \beta_{0i} + \beta_1 X_{ij} + e_{ij}$

. xtreg cd4 binage, i(id) re

Random-effects GLS regression                   Number of obs      =       594
Group variable (i): id                          Number of groups   =       297
R-sq:  within  =      .                         Obs per group: min =         2
       between = 0.0054                                        avg =       2.0
       overall = 0.0049                                        max =         2
Random effects u_i ~ Gaussian                   Wald chi2(1)       =      1.59
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.2075
------------------------------------------------------------------------------
         cd4 |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
      binage |    24.2404   19.22938     1.26   0.207    -13.44848    61.92928    (est. of β1)
       _cons |    225.902   13.38962    16.87   0.000     199.6588    252.1451    (est. of β0)
-------------+----------------------------------------------------------------
     sigma_u |  157.74218    (estimate of σ0, the SD of the random intercept)
     sigma_e |  71.379705    (estimate of σ, the residual SD)
         rho |  .83003801    (fraction of variance due to u_i)
------------------------------------------------------------------------------

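The `rho` line in this output is the intraclass correlation implied by the two standard deviations. A minimal sketch of that arithmetic (the function name is hypothetical; the inputs are the `sigma_u` and `sigma_e` values printed above):

```python
def intraclass_correlation(sigma_u, sigma_e):
    """Fraction of total variance due to the random intercept, given the
    SDs (not variances) of the random intercept and the residual."""
    return sigma_u ** 2 / (sigma_u ** 2 + sigma_e ** 2)

rho = intraclass_correlation(157.74218, 71.379705)
print(round(rho, 4))
```

The inputs are squared because `xtreg` reports `sigma_u` and `sigma_e` on the standard deviation scale.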
Association of Time-Varying Covariate (Viral Load) with CD4 Count
- Binary VL: $X_{ij}$ = 0 (<2000) or 1 (>2000); all subjects included have one low and one high VL.
- Fit simple linear model: $E[Y_{ij} \mid X_{ij} = x_{ij}] = \beta_0 + \beta_1 x_{ij}$
- Compare results of Models A-D:

                  Naive   Robust
Unweighted OLS      A       B
Weighted LS         C       D

Summary of Results: Association of Time-Varying Covariate (VL) with CD4 Count (note: different data than last lecture)

Coefficient   Estimator             Naive, Est. (SE)   Robust, Est. (SE)
β0            Unweighted OLS        355.1 (21.7)       355.1 (23.6)
β0            Weighted LS           355.1 (21.7)       377.4 (22.9)
β1            Unweighted OLS        -79.3 (30.7)       -79.3 (17.1)
β1            Weighted LS           -79.3 (17.0)       -79.3 (17.1)
β1            t-test (difference)   -79.3 (17.1)

Random Effects Model of CD4 vs. log10(viral load)
Fit simple random effects model: $Y_{ij} = \beta_0 + \beta_{0i} + \beta_1 X_{ij} + e_{ij}$

. xtreg cd4 medvl, i(id) re

Random-effects GLS regression                   Number of obs      =       174
Group variable (i): id                          Number of groups   =        87
R-sq:  within  = 0.0000                         Obs per group: min =         2
       between = 0.0000                                        avg =       2.0
       overall = 0.0370                                        max =         2
Random effects u_i ~ Gaussian                   Wald chi2(1)       =     21.44
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000
------------------------------------------------------------------------------
         cd4 |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       medvl |  -79.34483   17.13523    -4.63   0.000    -112.9293   -45.76039
       _cons |   355.0805   21.80987    16.28   0.000     312.3339     397.827
-------------+----------------------------------------------------------------
     sigma_u |    169.148    (estimate of σ0, the SD of the random intercept)
     sigma_e |   113.0146    (estimate of σ, the residual SD)
         rho |  .69136617    (fraction of variance due to u_i)
------------------------------------------------------------------------------

Multiple and Varying Observations per Person
CD4 (Y) vs. continuous (log) viral load (X):
$E[Y_{ij} \mid X_{i1} = x_{i1}, X_{ij} = x_{ij}] = \beta_0 + \beta_1 x_{i1} + \beta_2 (x_{ij} - x_{i1})$
- $\beta_2$ represents the expected change in Y given a change in $X_{ij}$ relative to the baseline value ($X_{i1}$): the longitudinal effect.
- $\beta_1$ represents the expected difference in average Y across two sub-populations that differ by their baseline values, $X_{i1}$: the cross-sectional effect.

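The two regressors in this model, the baseline value and the change from baseline, are derived from the long-format data. A minimal sketch of that construction follows, assuming each record is a hypothetical `(id, visit_time, logvl)` tuple and that baseline means the earliest visit; the output names mirror the `logvlbase` and `logvlchange` variables used in the Stata fits.

```python
def split_baseline_and_change(records):
    """records: list of (id, visit_time, logvl) tuples.
    Returns a list of (id, visit_time, logvlbase, logvlchange) tuples
    in the same order as the input."""
    baseline = {}
    # Visit records in time order so the first value seen is the baseline
    for pid, t, x in sorted(records, key=lambda r: (r[0], r[1])):
        baseline.setdefault(pid, x)
    return [(pid, t, baseline[pid], x - baseline[pid]) for pid, t, x in records]

# Hypothetical data, deliberately not sorted by time for person 2
records = [(1, 0, 4.0), (1, 1, 3.5), (2, 0, 5.0), (2, 2, 5.5), (2, 1, 4.0)]
rows = split_baseline_and_change(records)
for row in rows:
    print(row)
```

The baseline column varies only between people (cross-sectional information), while the change column varies only within a person (longitudinal information).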
Summary of Results: Association of Time-Varying Covariate (VL) with CD4 Count, multiple observations per person

Coefficient   Estimator         Naive, Est. (SE)   Robust, Est. (SE)
β0            Unweighted OLS    618.9 (11.6)       618.9 (35.2)
β0            Weighted LS       509.1 (31.2)       509.1 (32.9)
β1            Unweighted OLS    -83.7 (3.0)        -83.7 (8.3)
β1            Weighted LS       -52.7 (7.4)        -52.7 (7.7)
β2            Unweighted OLS    -99.2 (2.4)        -99.2 (6.8)
β2            Weighted LS       -54.7 (2.2)        -54.7 (3.2)

Random Effects Model of CD4 vs. log10(viral load)
Fit simple random effects model: $Y_{ij} = \beta_0 + \beta_{0i} + \beta_1 X_{i1} + \beta_2 (X_{ij} - X_{i1}) + e_{ij}$

Random-effects GLS regression                   Number of obs      =      7053
Group variable (i): id                          Number of groups   =       406
R-sq:  within  = 0.1024                         Obs per group: min =         1
       between = 0.2423                                        avg =      17.4
       overall = 0.1841                                        max =        58
Random effects u_i ~ Gaussian                   Wald chi2(2)       =    834.72
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000
------------------------------------------------------------------------------
         cd4 |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
   logvlbase |  -52.16759   7.733616    -6.75   0.000     -67.3252   -37.00998
 logvlchange |  -53.67891   1.859443   -28.87   0.000    -57.32335   -50.03447
       _cons |   506.9675   32.73488    15.49   0.000     442.8084    571.1267
-------------+----------------------------------------------------------------
     sigma_u |  176.47579
     sigma_e |  108.38743
         rho |  .72610367    (fraction of variance due to u_i)
------------------------------------------------------------------------------

Random Coefficients Model of CD4 vs. log10(viral load)
Fit random coefficients model: $Y_{ij} = (\beta_0 + \beta_{0i}) + (\beta_1 + \beta_{1i}) X_{i1} + \beta_2 (X_{ij} - X_{i1}) + e_{ij}$
Fixed effects (coefficient) estimates:

. xtmixed cd4 logvlbase logvlchange || id: logvlbase

Mixed-effects REML regression                   Number of obs      =      7053
Group variable: id                              Number of groups   =       406
                                                Obs per group: min =         1
                                                               avg =      17.4
                                                               max =        58
                                                Wald chi2(2)       =    828.31
Log restricted-likelihood = -43817.853          Prob > chi2        =    0.0000
------------------------------------------------------------------------------
         cd4 |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
   logvlbase |  -51.98204   8.176032    -6.36   0.000    -68.00677   -35.95731
 logvlchange |  -53.37247     1.8559   -28.76   0.000    -57.00997   -49.73498
       _cons |   506.2758   33.67954    15.03   0.000     440.2651    572.2865
------------------------------------------------------------------------------

Random Coefficients Model of CD4 vs. log10(viral load)
Fit random coefficients model: $Y_{ij} = (\beta_0 + \beta_{0i}) + (\beta_1 + \beta_{1i}) X_{i1} + \beta_2 (X_{ij} - X_{i1}) + e_{ij}$
Variance components estimates (reported as standard deviations, i.e., square roots of the variances):

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id: Independent              |
                sd(logvlb~e) |    19.6903   6.997558      9.811865     39.5142    (square root of var(β1i))
                   sd(_cons) |   167.2211   15.38626      139.6274    200.2679    (square root of var(β0i))
-----------------------------+------------------------------------------------
                sd(residual) |   108.3978   .9403761      106.5702    110.2566    (square root of var(e_ij))
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(2) =  7444.62   Prob > chi2 = 0.0000

Random Effects Model for Teenage Sex and Drug Use
$\mathrm{logit}[P(Y_{ij} = 1 \mid \beta_{0i}, X_{ij} = x_{ij})] = \log \dfrac{P(Y_{ij} = 1 \mid \beta_{0i}, X_{ij})}{P(Y_{ij} = 0 \mid \beta_{0i}, X_{ij})} = \beta^*_{0i} + \beta_1 X_{ij}$, with $\beta^*_{0i} = \beta_0 + \beta_{0i}$ and $\beta_{0i} \sim N(0, \tau^2)$.
- Assume that the repeated observations for the ith teenager are independent of one another given $\beta_{0i}$ and $X_{ij}$.
- Must assume a parametric distribution for the $\beta_{0i}$, usually $\beta_{0i} \sim N(0, \tau^2)$.
- $\exp(\beta_1)$ is the odds ratio for having sex when subject i reports drug use relative to when the same subject does not report drug use.

Random Effects for Teenage Sex vs. Drug Use

. xtlogit sx24hrs drgalcoh, or i(eid) re

Random-effects logit                            Number of obs      =      1708
Group variable (i) : eid                        Number of groups   =       109
Random effects u_i ~ Gaussian                   Obs per group: min =         1
                                                               avg =      15.7
                                                               max =        33
                                                Wald chi2(1)       =      5.48
Log likelihood  = -921.39213                    Prob > chi2        =    0.0192
------------------------------------------------------------------------------
     sx24hrs |         OR   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
    drgalcoh |   1.447266   .2284893     2.34   0.019     1.062096    1.972119    (exp(β1))
-------------+----------------------------------------------------------------
    /lnsig2u |   .5483488   .2428238                      .0724228    1.024275
-------------+----------------------------------------------------------------
     sigma_u |   1.315444   .1597106                      1.036875    1.668854    (τ)
         rho |   .3446819   .0166718                      .2463036    .4584528
------------------------------------------------------------------------------
Likelihood ratio test of rho=0: chibar2(01) = 184.17   Prob >= chibar2 = 0.000

Random Effects Model for Diarrhea Study in Children
$\log \dfrac{P(Y_{ijk} = 1)}{1 - P(Y_{ijk} = 1)} = \beta_0 + \beta_{0i} + \beta_{0ij}$
- Measurements are made on children (k) within households (j) within villages (i).
- Want to know the greatest source of variation: households, $\mathrm{var}(\beta_{0ij})$, or villages, $\mathrm{var}(\beta_{0i})$.
- Assumes children in the same household have the same probability of diarrhea.
- Use gllamm in STATA (also xtmelogit).

Random Effects Model for Diarrhea Study in Children

. gllamm diarrhea, i(hhid vilid) nip(5) family(binomial)

number of level 1 units = 4736
number of level 2 units = 18
number of level 3 units = 4

Condition Number = 7.6690811

gllamm model

log likelihood = -1612.2175
------------------------------------------------------------------------------
    diarrhea |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |  -2.375147   .1062966   -22.34   0.000    -2.583484   -2.166809
------------------------------------------------------------------------------

Random Effects Model for Diarrhea Study in Children

Variances and covariances of random effects
----------------------------------------------------------------------
***level 2 (hhid)     var(1): .4257083 (.11126704)     (estimate of var(β0ij))
***level 3 (vilid)    var(1): .39405088 (.33582967)    (estimate of var(β0i))

Cluster correlation coefficients (based on latent response model):
$\rho_{house} = \dfrac{\mathrm{var}(\beta_{0ij})}{\mathrm{var}(\beta_{0ij}) + \mathrm{var}(\beta_{0i}) + \pi^2/3} = \dfrac{0.43}{0.43 + 0.39 + 3.29} = 0.10$
$\rho_{village} = \dfrac{\mathrm{var}(\beta_{0i})}{\mathrm{var}(\beta_{0ij}) + \mathrm{var}(\beta_{0i}) + \pi^2/3} = \dfrac{0.39}{0.43 + 0.39 + 3.29} = 0.09$

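The arithmetic behind these two coefficients can be sketched as follows. The function name is hypothetical, and the inputs are the variance estimates rounded to two decimals as on the slide; the residual variance on the latent logistic scale is fixed at $\pi^2/3 \approx 3.29$.

```python
import math

def latent_variance_shares(var_household, var_village):
    """Share of latent-scale variance attributable to households and to
    villages in a three-level random-intercept logistic model."""
    resid = math.pi ** 2 / 3          # variance of the standard logistic
    total = var_household + var_village + resid
    return var_household / total, var_village / total

rho_house, rho_village = latent_variance_shares(0.43, 0.39)
print(round(rho_house, 2), round(rho_village, 2))
```

Most of the latent-scale variance sits at the child (residual) level, with households and villages each contributing roughly a tenth.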
Bangladesh Fertility Study 104
Bangladesh Fertility Study Slides from: Roberto G. Gutierrez, Director of Statistics, StataCorp LP - 2007 North American Stata Users Group Meeting, Boston - www.stat.columbia.edu/~gelman/stuff_for_blog/gutierrez.pdf. 105
Tower of London Task Study Slides from: Roberto G. Gutierrez, Director of Statistics, StataCorp LP - 2007 North American Stata Users Group Meeting, Boston - www.stat.columbia.edu/~gelman/stuff_for_blog/gutierrez.pdf. 111
References Slides from: Roberto G. Gutierrez, Director of Statistics, StataCorp LP - 2007 North American Stata Users Group Meeting, Boston - www.stat.columbia.edu/~gelman/stuff_for_blog/gutierrez.pdf. 114
Using conditional logistic regression for estimating within unit OR in logistic regression models 115
Treat Individual as a Stratification Variable for Teen Sex and Drugs
For the teen sex and drugs example, we can represent the data on each individual, i, as a simple 2x2 table:

                    Sex
                 yes     no
Drugs   yes      a_i     b_i
        no       c_i     d_i
                               total: n_i

Can get the OR for every subject: $\widehat{OR}_i = \dfrac{a_i d_i}{b_i c_i}$
Because our model
$\mathrm{logit}[P(Y_{ij} = 1 \mid \beta_{0i}, X_{ij} = x_{ij})] = \log \dfrac{P(Y_{ij} = 1 \mid \beta_{0i}, X_{ij} = x_{ij})}{P(Y_{ij} = 0 \mid \beta_{0i}, X_{ij} = x_{ij})} = \beta^*_0 + \beta_{0i} + \beta^*_1 x_{ij}$
assumes every person has the same OR, we can average the estimated ORs to get the estimate.

Mantel-Haenszel Average of Stratified ORs
Then the MH estimate is:
$\widehat{OR}_{MH} = \exp(\hat\beta^*_1) = \dfrac{\sum_{i=1}^m w_i \widehat{OR}_i}{\sum_{i=1}^m w_i} = \dfrac{\sum_{i=1}^m (a_i d_i)/n_i}{\sum_{i=1}^m (b_i c_i)/n_i}$
The second equality holds with weights $w_i = b_i c_i / n_i$.
Note that for any subject who has identical exposure (drug use) or identical outcomes (sex) for all observations, the OR is undefined and that person does not contribute to the estimate (their 2x2 table is dropped).

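A minimal sketch of the Mantel-Haenszel computation, using the cell labels of the per-subject 2x2 table (a, b, c, d). The function name and the example counts are hypothetical.

```python
def mantel_haenszel_or(tables):
    """tables: list of (a, b, c, d) counts, one 2x2 table per subject,
    where a = exposed with outcome, b = exposed without outcome,
    c = unexposed with outcome, d = unexposed without outcome."""
    num = 0.0
    den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue                  # empty table carries no information
        num += a * d / n
        den += b * c / n
    if den == 0:
        raise ValueError("no informative tables: MH odds ratio undefined")
    return num / den

# Hypothetical subjects; the second never reports drug use (a = b = 0)
tables = [(2, 1, 1, 2), (0, 0, 3, 4), (1, 2, 2, 1)]
or_mh = mantel_haenszel_or(tables)
or_mh_without_second = mantel_haenszel_or([tables[0], tables[2]])
print(or_mh, or_mh_without_second)
```

A subject with no variation in exposure or in outcome adds zero to both sums, which is why such tables are effectively dropped.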
Conditional Logistic Regression
To illustrate, use the teenage sex and drugs example. Assume just two observations for a person: one had the outcome ($Y_{i1} = 1$) with drugs ($X_{i1} = 1$), and the other had neither ($Y_{i2} = 0$, $X_{i2} = 0$). Then the conditional likelihood contribution for this subject is:
$CondLik_i = \dfrac{P(Y_{i1} = 1 \mid X_{i1} = 1)\, P(Y_{i2} = 0 \mid X_{i2} = 0)}{P(Y_{i1} = 1 \mid X_{i1} = 1)\, P(Y_{i2} = 0 \mid X_{i2} = 0) + P(Y_{i1} = 1 \mid X_{i1} = 0)\, P(Y_{i2} = 0 \mid X_{i2} = 1)}$
After plugging in the model for $Y_{ij}$:
$\mathrm{logit}[P(Y_{ij} = 1 \mid \beta_{0i}, X_{ij} = x_{ij})] = \log \dfrac{P(Y_{ij} = 1 \mid \beta_{0i}, X_{ij} = x_{ij})}{P(Y_{ij} = 0 \mid \beta_{0i}, X_{ij} = x_{ij})} = \beta^*_0 + \beta_{0i} + \beta^*_1 x_{ij}$
and doing some algebra, one gets:
$CondLik_i = \dfrac{1}{1 + \exp(\beta^*_1 (X_{i2} - X_{i1}))}$
Notice that the individual-level intercept (whether random or not) drops out.

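The claim that the intercept drops out can be checked numerically. The sketch below computes the conditional likelihood for a subject with $Y_{i1} = 1$, $Y_{i2} = 0$ directly from the logistic probabilities and compares it to the closed form. Function names and the coefficient value are hypothetical.

```python
import math

def expit(v):
    return 1.0 / (1.0 + math.exp(-v))

def cond_lik(intercept, beta1, x1, x2):
    """P(Y1 = 1, Y2 = 0 | exactly one of the two outcomes is 1) under a
    logistic model with a subject-specific intercept."""
    p1 = expit(intercept + beta1 * x1)
    p2 = expit(intercept + beta1 * x2)
    observed = p1 * (1.0 - p2)        # outcome on observation 1 only
    swapped = (1.0 - p1) * p2         # outcome on observation 2 only
    return observed / (observed + swapped)

beta1, x1, x2 = 0.4, 1, 0
closed_form = 1.0 / (1.0 + math.exp(beta1 * (x2 - x1)))
values = [cond_lik(b0, beta1, x1, x2) for b0 in (-2.0, 0.0, 3.0)]
print(closed_form, values)
```

Whatever value the subject's intercept takes, the conditional likelihood is the same, so no distributional assumption on the random effect is needed.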
Conditional Logistic Regression
- This means that the estimate of the within-subject OR no longer depends on assumptions about the distribution of the random effect.
- Can only use this to estimate the association of time-varying covariates.
- Subjects with identical outcomes on all observations will be dropped from the analysis.
- Subjects in whom a covariate does not change will not contribute to estimation of the OR for that covariate.

Conditional Logistic Regression
- More generally, you might want to estimate the within-subject OR for several variables simultaneously and/or the OR for a unit change in a continuous variable.
- Can still do so by using the conditional likelihood, a method used to estimate ORs for matched case-control studies.
- The conditional likelihood (in the example of a cohort) is the probability of observing that the cases have the covariates they have and the controls have their observed covariates, given the distribution of covariates observed over all the repeated measurements.
- To define the likelihood, one normalizes the probability of observing the outcomes conditional on the covariates by the summed probabilities over all possible combinations of covariates and outcomes.

Teenage Sex and Drug Use Using M-H Summary OR

. cs sx24hrs drgalcoh, by(eid) or

             eid |       OR       [95% Conf. Interval]   M-H Weight
-----------------+-------------------------------------------------
               1 |        .              .          .            0 (Cornfield)
               2 |        0              0   10.56942     .3478261 (Cornfield)
               3 |        .              0          .            0 (Cornfield)
               4 |        .              .          .            0 (Cornfield)
               5 |        .              .          .            0 (Cornfield)
               6 | 1.333333       .2058078    8.53481     .8823529 (Cornfield)
               7 |        .              .          .            0 (Cornfield)
               8 |        .              0          .            0 (Cornfield)
               9 |      1.5       .1778039   12.91562     .5714286 (Cornfield)
              10 |        .              0          .            0 (Cornfield)
             ... |
             105 |        .              .          .            0 (Cornfield)
             106 |        0              0          .     .6363636 (Cornfield)
             107 |       .8       .1388054     4.9008          1.2 (Cornfield)
             108 |        0              0          .         .125 (Cornfield)
             109 |        .              .          .            0 (Cornfield)
             110 |        .              0          .            0 (Cornfield)
-----------------+-------------------------------------------------
    M-H combined | 1.315498       .9584698   1.805519
-------------------------------------------------------------------

Conditional Logistic Estimate

. clogit sx24hrs drgalcoh, or group(eid)
note: multiple positive outcomes within groups encountered.
note: 23 groups (161 obs) dropped due to all positive or all negative outcomes.

Iteration 0:   log likelihood = -664.37829
Iteration 1:   log likelihood = -663.20668
Iteration 2:   log likelihood = -663.20668

Conditional (fixed-effects) logistic regression   Number of obs   =       1547
                                                  LR chi2(1)      =       2.93
                                                  Prob > chi2     =     0.0867
Log likelihood = -663.20668                       Pseudo R2       =     0.0022
------------------------------------------------------------------------------
     sx24hrs | Odds Ratio   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
    drgalcoh |   1.323141   .2158621     1.72   0.086     .9610325    1.821689
------------------------------------------------------------------------------

Pitfalls of Latent Variable Models in General (including MultiLevel Mixed Models) 123
General Contrast of Mixed and GEE Models General Mixed Effects Model Specific Mixed Model (logistic) 124
Mixed Models General Likelihood of Observed Data based on Mixed Model. Specific from example 125
Latent Variable Models Nonparametric Nonidentifiability Point 1 for Mixed Models 126
Parameter returned by GEE: Population Average Models Parameter specific estimating function Estimating Function 127
Special Case: MultiLevel model coefficients have interpretations in both latent variable and observed data-generating worlds 128
Usually, different interpretations Logistic Case 129
Often difference in interpretations (Pop. Ave. vs. Mixed) (near) meaningless 130
It gets worse: What if Model is mis-specified? Parameter of Interest as Projection 131
Interpretation of coefficients in mis-specified multilevel mixed models can get wacky True model 132
Summary of MultiLevel Mixed vs. GEE (Population Ave.) Models 133
New Golden Rules of Estimation with Latent Variable Models? 134
Should avoid relying solely on unverifiable assumptions for inferences 135
Multilevel Analysis and Complex Surveys Part 2: Estimation and Inference from Complex Surveys Alan Hubbard UC Berkeley - Division of Biostatistics 136
Foundation
- Finite population $U = \{1, \dots, N\}$.
- Sample s, a subset of U.
- $V = (Y, X)$: observations of outcome and covariates on each unit.
- Values in the fixed finite population are $v_U = (v_1, \dots, v_N)$; the process by which these are drawn will be called the observation process.
- Parameters can be defined with regard to a finite population or a superpopulation; that is, there is some data-generating model of interest, $P_V$, and we want to estimate parameters of it, $\theta(P_V)$.

Sampling Mechanism
- $\delta = (\delta_t,\ t = 1, \dots, N)$ with $\delta_t = 1$ if $t \in s$, 0 otherwise.
- $g(\delta \mid V = v, Z)$ is the sampling mechanism, where Z are the so-called design variables, so $z_U = (z_1, \dots, z_N)$.
- Thus, the observed data O are generated by a combination of the mechanisms $P_{V,Z}$ and g: $O = (\delta * v_U * z_U)$. This defines the joint distribution of the observed data, $P_0(O)$, with $O = (V, Z \mid \delta = 1)$.
- Special cases:
  - $g(\delta \mid V = v, Z) = g(\delta \mid Z)$ (noninformative sampling).
  - Parameters of interest from the distribution of $V \mid Z = z$ (disaggregated analysis).
  - Parameters of interest from the distribution $P_V$ (aggregated): $P_V(v; \theta) = \sum_z P(V = v \mid Z = z)\, p(Z = z)$.

Types of Inference Design-based inference Model-based inference 139
Full Likelihood
- Consider one more source of missingness: nonresponse (say R = 1 for a respondent).
- The full likelihood is then: $f(r \mid \gamma, z, v)\, f(\delta \mid z, v)\, f(v \mid z)\, f(z)$
- Missingness is caused by the mechanisms for both δ and r.
- Different assumptions imply different conditional independences.

Pseudo-Likelihood: Estimating Equation Approaches
- Most practical applications of survey data do not contain enough information to define the entire likelihood of the joint missingness/sampling mechanisms and the distribution of interest.
- In addition, the parameter of interest can often be identified without the entire joint likelihood being identifiable, using just some of the design elements.
- Thus, most survey analyses rely on pseudo-likelihood estimation.

Simple Example
Assume an exponential distribution: $f(y; \theta) = \theta e^{-\theta y}$, $E(Y) = 1/\theta$.
If the entire population were a simple random draw from this distribution (and you observed everyone's value), then the score equation based on the likelihood would be:
$s(\theta; V) = \sum_{t=1}^{N} \left( Y_t - \dfrac{1}{\theta} \right)$
with obvious solution $\hat\theta = \dfrac{1}{\bar Y} = \dfrac{1}{\mathrm{ave}(Y)}$.

Pseudo-Likelihood, Continued
However, let's now assume all we have is the usual subset of the population U defined by s, and the probability that an observation was sampled given its observed values (for now, no Z): $\pi_t = P(\delta_t = 1 \mid Y_t)$.
An unbiased estimating function for the population average (and thus the parameter of $f(y; \theta)$) is:
$s(\theta; O, \pi) = \sum_{t \in s} \dfrac{1}{\pi_t} \left( Y_t - \dfrac{1}{\theta} \right) = \sum_{t=1}^{N} \dfrac{\delta_t}{\pi_t} \left( Y_t - \dfrac{1}{\theta} \right)$
Thus, just treat this as a general missingness problem and use inverse weighting.

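It Works (in This Case)!
The estimating equation is consistent:
$E\left[ \dfrac{\delta}{\pi} \left( Y - \dfrac{1}{\theta} \right) \right] = E_Y\left[ \left( Y - \dfrac{1}{\theta} \right) E\left( \dfrac{\delta}{\pi} \,\Big|\, Y \right) \right]$
Note that $\pi = E(\delta \mid Y)$, so we get $E_Y\left( Y - \dfrac{1}{\theta} \right) = 0$.
Solving the estimating equation gives the estimator:
$\hat\mu_s = \dfrac{1}{\hat\theta_s} = \dfrac{\sum_{t \in s} \pi_t^{-1} Y_t}{\sum_{t \in s} \pi_t^{-1}}$
So, a re-weighted score equation provides a consistent, pseudo-likelihood estimate.
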
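The weighted estimator above is straightforward to compute from a sample and its inclusion probabilities. A minimal sketch, with a hypothetical function name and hypothetical data:

```python
def ipw_rate_estimate(y, pi):
    """Solve sum_t (1/pi_t) * (Y_t - 1/theta) = 0 for theta, where y holds
    the sampled outcomes and pi their sampling probabilities."""
    sum_w = sum(1.0 / p for p in pi)
    sum_wy = sum(yt / p for yt, p in zip(y, pi))
    return sum_w / sum_wy             # 1 / (inverse-weighted mean of Y)

# Hypothetical sample: the largest outcome was the least likely to be sampled
y = [1.0, 2.0, 4.0]
pi = [0.5, 0.5, 0.25]
theta_weighted = ipw_rate_estimate(y, pi)
theta_unweighted = ipw_rate_estimate(y, [1.0, 1.0, 1.0])
print(theta_weighted, theta_unweighted)
```

With equal sampling probabilities the estimator reduces to one over the sample mean; with unequal probabilities, under-sampled values are weighted up.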
Inference From Estimating Equation
- Design-based approach: parametric estimation uncertainty comes from repeated samplings (of the type done) from a fixed target population ($Y_U$ fixed).
- Model-based approach: uncertainty comes from repeated draws from the underlying data-generating distribution ($Y_U$ random).
- Both: variance comes from both the underlying data-generating mechanism and the sampling mechanism; if the finite population is large, the model-based portion contributes almost nothing.
$\mathrm{var}(\hat\theta_s) = \underbrace{\mathrm{var}(E(\hat\theta_s \mid U))}_{\text{model source}} + \underbrace{E(\mathrm{var}(\hat\theta_s \mid U))}_{\text{design source}}$

Design-Based Inference, cont.
- All uncertainty comes from the sampling mechanism.
- Need a simple empirical estimate derived from the estimating equation, often called the sandwich estimator.
- It can be generally derived as the variance-covariance of the influence curve of the estimator:
$\hat\theta \approx \theta + \dfrac{1}{n} \sum_{t \in s} IC(O_t; \gamma, \theta)$, so $\mathrm{var}(\hat\theta) \approx \dfrac{\mathrm{var}(IC(O_t; \gamma, \theta))}{n}$

Design-Based Inference
In this general framework, the things to account for in inference are:
- stratified design (not a simple random sample);
- finite population correction (sometimes samples are not from an infinite population);
- clustered (correlated) data.

Multilevel Analysis and Complex Surveys Part 3: Putting MultiLevel Models and Complex Survey Data Together Alan Hubbard UC Berkeley - Division of Biostatistics Slides are from Sophia Rabe-Hesketh, UC Berkeley, School of Education and Division of Biostatistics 148