Multilevel Analysis and Complex Surveys. Alan Hubbard UC Berkeley - Division of Biostatistics


1 Multilevel Analysis and Complex Surveys Alan Hubbard UC Berkeley - Division of Biostatistics 1

2 Outline
Multilevel data analysis
  Estimating specific parameters of the data-generating distribution (GEE).
  Estimating the whole (latent variable) distribution (multilevel mixed models and MLE).
Complex survey (estimation and inference)
Estimating multilevel mixed models with complex survey data

3 Schedule
Beginning Time   Ending Time   Topic
8:00             9:15          Introduction/Overview, GEE
9:15             9:45          GEE Exer
9:45             11:30         Multilevel Models
11:30                          MLM Exercise
1:00             2:00          Complex Survey
2:00             2:30          Survey Exer
2:30             3:30          Combined
3:30             4:00          Combined Exer
4:15             5:00          Causality Issues (Michael Oakes, Ecological Effects, etc.)

4 Multilevel Analysis and Complex Surveys Part 1: Parameters and inference from mixed models (MLE) and estimating equation (GEE) approaches Alan Hubbard UC Berkeley - Division of Biostatistics 4

5 Models For Multilevel Data: References
Analysis of Longitudinal Data, by Diggle, Liang and Zeger.
Applied Longitudinal Analysis, by Fitzmaurice, Laird and Ware.
Generalized Latent Variable Modeling: Multilevel, Longitudinal and Structural Equation Models, by Skrondal, A. and Rabe-Hesketh, S.
To GEE or not to GEE: comparing population average and mixed models for estimating the associations between neighborhood risk factors and health (with commentary and reply). Epidemiology, 21 (2010).

6 Generalized Estimation Equation (GEE) Approach to Clustered Data Alan Hubbard UC Berkeley - Division of Biostatistics 6

7 Clustered Data Regressions
Ignore clustering: ordinary regressions, assuming that outcomes are conditionally independent.
Multilevel (Mixed Effects) Models: explicit model of sources of random variability at the cluster level, $E(Y_{ijk} \mid X_{ijk}, \alpha_i, \alpha_{ij})$, $\alpha_i \sim N(0, \sigma^2_\alpha)$, ...
Generalized Estimating Equation (GEE) approach: only specify relatively simple parameters (e.g., $E(Y_{ijk} \mid X_{ijk})$).

8 Issues with Clustered Data (Estimation) Covariates of Interest and identifiability: higher level (ecological) vs. individual level covariates. Targeting contributions of both. Defining effect of interest (e.g., direct effect of ecological covariates apart from individual level covariates). Causal inference challenges with clustered data (can one ever measure impact of composite variables vs. contextual variables?). Much work on mechanical implementation, less on what are the appropriate parameters of interest and necessary (but sometimes dubious) identifiability assumptions (Oakes). 8

9 Issues with Clustered Data (Correlation)
Dealing with correlated data: general repeated measures issues.
Model based inference (inference based on the proposed data-generating distribution).
Empirical inference: use the form of the estimating equation to get a simple, robust empirical variance of the sampling distribution:
$$\hat{\theta} = \theta + \frac{1}{n}\sum_{i=1}^{n} IC(O_i; \theta, \gamma) + o_p\!\left(\frac{1}{\sqrt{n}}\right), \qquad \mathrm{var}(\hat{\theta}) \approx \frac{\mathrm{var}(IC)}{n}$$

10 Example: observations within subjects: The Effect of Drug and Alcohol Use on Teenage Sexual Activity Minnis & Padian (2001) conducted a longitudinal study of teenagers in San Rafael, California to investigate the association between drug and alcohol use and sexual activity on the same day. Participants were asked to keep track of their activities over approximately one month and binary indicator variables were created to show whether drug/alcohol use and/or sexual activity were reported for each 24 hour period. 10

11 Example of Binary Outcome: Sex, Drugs and Teenagers
A longitudinal study of the effects of drug-use on sexual activity. Let $X_{ij}$, the only explanatory variable of interest for now, indicate whether or not subject i reported drug-use (1=yes, 0=no) on day j. Let $Y_{ij}$ denote whether the subject had sex (1=yes, 0=no), i.e., $Y_{ij}$ is a binary outcome and thus its expectation can be modeled via the logit transform.

12 Data
eid   today    drgalcoh   sx24hrs
      Jun 98   yes        no
      Jun 98   no         no
      Jun 98   no         no
      Jun 98   yes        no
      Jun 98   no         no
      Jun 98   no         no
      Jun 98   no         no
      Jun 98   no         no
      Jun 98   yes        no
      Jun 98   no         no
      Jun 98   no         no
      Jun 98   no         yes
      Jun 98   no         no
      Jun 98   no         no
      Jun 98   no         no
      Jun 98   no         no
      Jun 98   no         yes
      Jun 98   no         no
      Jun 98   no         yes
      Jul 98   no         yes
      Jul 98   no         no
      Jul 98   no         no
      Jul 98   no         no
      Jul 98   no         no
      Jun 98   no         no
      Jun 98   no         no
      Jun 98   no         no

13 Sexual Activity and drug/alcohol use among teenagers revisited
Main Variables
sx24hrs - sex in last 24 hrs. (0=no, 1=yes)
drgalcoh - drug or alcohol use in last 24 hrs.
tues-sun - dummy variables designating day of week

14 Random Effects Models
Uses a random effect to model the relative similarity of observations made on the same statistical unit (e.g., person).
Assumes $Y_{ij}$ and $Y_{ik}$, $j \neq k$, are independent given some realized value of a random effect ($\beta_{0i}$) and the covariates:
$$Y_{ij} \perp Y_{ik} \mid X_{ij}, \beta_{0i}$$
The model assumes these random effects are randomly drawn from a known distribution.

15 Random Effects Model for Teenage Sex and Drug-Use
$$\mathrm{logit}[P(Y_{ij}=1 \mid \beta_{0i}, X_{ij}=x_{ij})] = \log\frac{P(Y_{ij}=1 \mid \beta_{0i}, X_{ij}=x_{ij})}{P(Y_{ij}=0 \mid \beta_{0i}, X_{ij}=x_{ij})} = \beta_0^{RE} + \beta_{0i} + \beta_1^{RE} x_{ij}$$
Assume that the repeated observations for the ith teenager are independent of one another given $\beta_{0i}$ and $X_{ij}$.
Must assume a parametric distribution for the $\beta_{0i}$, usually $\beta_{0i} \sim N(0, \tau^2)$.
$\exp(\beta_1^{RE})$ is the odds ratio for having sex when subject i reports drug-use relative to when the same subject does not report drug-use.

16 Motivation for This Approach Natural for modeling heterogeneity across individuals in their regression coefficients. This heterogeneity can be represented by a probability distribution Most useful when object is to make inferences about individuals rather than population averages. 16

17 Motivation for This Approach Also useful to estimate the contributions to variability from different sources (e.g., within and among individuals). Can be extended to hierarchy of units (multilevel modeling), such as repeated longitudinal measures of a person, within a household, within a community... 17

18 Some available software for random effects models
Linear Models
  Proc Mixed in SAS
  xtreg in STATA (only simple random effects models)
  xtmixed in STATA 10
  lme in R
Logistic and Poisson Models
  xtlogit and xtpoisson in STATA for simple random effects
  xtmelogit and xtmepoisson for general mixed models in STATA version 10
  gllamm for general mixed models is a STATA add-on

19 Random effects using xtlogit in STATA
. xtlogit sx24hrs drgalcoh, or i(eid) re
Random-effects logit                Number of obs      = 1708
Group variable (i): eid             Number of groups   = 109
Random effects u_i ~ Gaussian       Obs per group: min = 1
                                                   avg = 15.7
                                                   max = 33
                                    Wald chi2(1)       = 5.48
Log likelihood =                    Prob > chi2        =
sx24hrs    OR    Std. Err.    z    P>|z|    [95% Conf. Interval]
$\exp(\beta_1^{RE})$
/lnsig2u
sigma_u ($\tau$)
rho
Likelihood ratio test of rho=0: chibar2(01) =      Prob >= chibar2 =

20 Estimation of Marginal Models (GEE)
Estimate the marginal mean model. A marginal model is a population, not individual, model. The marginal $E[Y_{ij} \mid X_{ij} = x_{ij}]$ is defined as the mean value of an observation $Y_{ij}$ in the theoretical experiment where one randomly draws an observation from a population where everyone has $X_{ij} = x_{ij}$.

21 Marginal Models (GEE)
For instance, let $Y_{ij}$ be cholesterol and $X_{ij}$ = yes if one smokes, no otherwise. In a marginal model, $E[Y_{ij} \mid X_{ij} = \text{yes}]$ is the mean of a randomly drawn $Y_{ij}$ from the subpopulation where everyone smokes.

22 Parameter Interpretation in a marginal model
Parameters in equivalent random effects and GEE models have subtly different interpretations.
Coefficients in a random effects model represent expected differences (odds ratios, relative risks, etc.) within an individual, given a change in their X from one value to another.
Coefficients in a marginal model represent expected differences (odds ratios, relative risks, etc.) within a population, given a change in everyone's X from one value to another.

23 Parameter Interpretation in a GEE model, cont.
In linear and log-linear models, the random effects and marginal regression parameters are the same. In logistic regression, they are different (more later).
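A minimal numerical sketch of why the logistic case differs, assuming a random-intercept logistic model. The values of beta0, beta1 and tau are hypothetical and are not estimates from the teenage study. The marginal probability is obtained by averaging the subject-specific probability over the random-intercept distribution, and the resulting population-averaged odds ratio is compared with the subject-specific one.

```python
import math

def expit(v):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-v))

def normal_pdf(b, tau):
    """Density of N(0, tau^2) at b."""
    return math.exp(-0.5 * (b / tau) ** 2) / (tau * math.sqrt(2.0 * math.pi))

def marginal_prob(eta, tau, half_width=8.0, steps=4000):
    """P(Y=1) averaged over b ~ N(0, tau^2), using the trapezoid rule."""
    lo = -half_width * tau
    h = 2.0 * half_width * tau / steps
    total = 0.0
    for k in range(steps + 1):
        b = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * expit(eta + b) * normal_pdf(b, tau)
    return total * h

def odds(p):
    return p / (1.0 - p)

# Hypothetical subject-specific (random effects) parameters.
beta0, beta1, tau = -2.0, 0.5, 2.0

conditional_or = math.exp(beta1)  # within-subject odds ratio
marginal_or = odds(marginal_prob(beta0 + beta1, tau)) / odds(marginal_prob(beta0, tau))
print("conditional OR:", conditional_or)
print("marginal OR:   ", marginal_or)
```

Under these assumptions the population-averaged odds ratio is closer to 1 than the subject-specific one; with an identity link the same averaging would leave the slope unchanged.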


25 Marginal Models (GEE) GEE software typically allows several different working correlation models (e.g., exchangeable, auto-regressive, unstructured, etc.). These correlation models are used to build weight matrices, which are used in a weighted regression. When deriving inferences for the coefficients, though, it calculates robust standard errors. 25

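26 Examples of Correlation Models
$$V = \sigma^2 \begin{pmatrix} R_0 & 0 & \cdots & 0 \\ 0 & R_0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & R_0 \end{pmatrix}$$
Each individual is independent of all others.
Correlation within individuals across longitudinal observations has the same structure.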

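27 Structure for $R_0$
General structure:
$$R_0 = \begin{pmatrix} 1 & \rho_{12} & \rho_{13} & \cdots & \rho_{1n} \\ \rho_{12} & 1 & \rho_{23} & \cdots & \rho_{2n} \\ \rho_{13} & \rho_{23} & 1 & \cdots & \rho_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho_{1n} & \rho_{2n} & \rho_{3n} & \cdots & 1 \end{pmatrix}$$
A lot of unknown parameters.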

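28 Correlation Models (contd): Uniform correlation (compound symmetry or exchangeable)
$$R_0 = \begin{pmatrix} 1 & \rho & \rho & \cdots & \rho \\ \rho & 1 & \rho & \cdots & \rho \\ \rho & \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \rho & \cdots & 1 \end{pmatrix}$$
Arises from the random effects model
$$Y_{ij} = \alpha + \alpha_i + \beta x_{ij} + e_{ij}$$
Errors $e_{ij}$ uncorrelated, and independent of $\alpha_i$ and $x_{ij}$.
$$\rho = \frac{\mathrm{Var}(\alpha_i)}{\mathrm{Var}(\alpha_i) + \mathrm{Var}(e_{ij})}$$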
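A small sketch of how the random-intercept model implies the exchangeable structure. The variance components below are hypothetical; the function names are illustrative and not from any package.

```python
def icc(var_alpha, var_e):
    """Intra-cluster correlation rho = Var(alpha_i) / (Var(alpha_i) + Var(e_ij))."""
    return var_alpha / (var_alpha + var_e)

def exchangeable(n, rho):
    """n x n exchangeable (compound symmetry) correlation matrix."""
    return [[1.0 if j == k else rho for k in range(n)] for j in range(n)]

def implied_cov(n, var_alpha, var_e):
    """Covariance of (Y_i1, ..., Y_in) given x under the random-intercept model."""
    return [[var_alpha + (var_e if j == k else 0.0) for k in range(n)]
            for j in range(n)]

# Hypothetical variance components.
var_alpha, var_e = 4.0, 12.0
rho = icc(var_alpha, var_e)

# Dividing the implied covariance by the total variance gives the correlation matrix.
total_var = var_alpha + var_e
R = [[v / total_var for v in row] for row in implied_cov(3, var_alpha, var_e)]
print(rho)
print(R)
```

Every off-diagonal entry of R equals rho, which is the compound symmetry pattern shown on the slide.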

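29 Correlation Models (contd): Time-decaying Correlations (Auto-regressive)
$$R_0 = \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\ \rho & 1 & \rho & \cdots & \rho^{n-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{n-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1 \end{pmatrix}$$
Auto-regressive: $e_{ij} = \rho e_{i,j-1} + \eta_{ij}$
Not great for unequally spaced longitudinal data.
Exponential correlation model generalizes this to $\mathrm{corr}(y_{ij}, y_{ik}) = \rho^{|t_j - t_k|}$ rather than $\rho^{|j - k|}$.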
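The two structures can be sketched side by side. The measurement times and the value of rho are hypothetical; with equally spaced unit times the exponential model reduces to AR(1).

```python
def ar1(n, rho):
    """AR(1) correlation: rho ** |j - k| for observation indices j, k."""
    return [[rho ** abs(j - k) for k in range(n)] for j in range(n)]

def exponential_corr(times, rho):
    """Exponential correlation: rho ** |t_j - t_k| for measurement times t."""
    return [[rho ** abs(tj - tk) for tk in times] for tj in times]

R_index = ar1(4, 0.5)                                # depends only on visit number
R_equal = exponential_corr([0, 1, 2, 3], 0.5)        # equally spaced times
R_uneven = exponential_corr([0.0, 0.5, 3.0], 0.5)    # unequally spaced times

for row in R_uneven:
    print(row)
```

With unequal spacing, observations taken close together in time keep a high correlation while distant ones decay, which an index-based AR(1) matrix cannot express.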

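30 Examples of var-cov. models
Description, Abbrev., Var-Cov. Matrix
Compound Symmetry, CS:
$$\begin{pmatrix} \sigma^2+\sigma_0^2 & \sigma_0^2 & \sigma_0^2 & \sigma_0^2 \\ \sigma_0^2 & \sigma^2+\sigma_0^2 & \sigma_0^2 & \sigma_0^2 \\ \sigma_0^2 & \sigma_0^2 & \sigma^2+\sigma_0^2 & \sigma_0^2 \\ \sigma_0^2 & \sigma_0^2 & \sigma_0^2 & \sigma^2+\sigma_0^2 \end{pmatrix}$$
Unstructured, UN:
$$\begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} & \sigma_{14} \\ \sigma_{12} & \sigma_2^2 & \sigma_{23} & \sigma_{24} \\ \sigma_{13} & \sigma_{23} & \sigma_3^2 & \sigma_{34} \\ \sigma_{14} & \sigma_{24} & \sigma_{34} & \sigma_4^2 \end{pmatrix}$$
Autoregressive, AR(1):
$$\begin{pmatrix} \sigma^2 & \rho\sigma^2 & \rho^2\sigma^2 & \rho^3\sigma^2 \\ \rho\sigma^2 & \sigma^2 & \rho\sigma^2 & \rho^2\sigma^2 \\ \rho^2\sigma^2 & \rho\sigma^2 & \sigma^2 & \rho\sigma^2 \\ \rho^3\sigma^2 & \rho^2\sigma^2 & \rho\sigma^2 & \sigma^2 \end{pmatrix}$$
Banded Diagonal, UN(1):
$$\begin{pmatrix} \sigma_1^2 & 0 & 0 & 0 \\ 0 & \sigma_2^2 & 0 & 0 \\ 0 & 0 & \sigma_3^2 & 0 \\ 0 & 0 & 0 & \sigma_4^2 \end{pmatrix}$$
Spatial Power, SP(POW)(c):
$$\begin{pmatrix} \sigma^2 & \sigma^2\rho^{d_{12}} & \sigma^2\rho^{d_{13}} & \sigma^2\rho^{d_{14}} \\ \sigma^2\rho^{d_{12}} & \sigma^2 & \sigma^2\rho^{d_{23}} & \sigma^2\rho^{d_{24}} \\ \sigma^2\rho^{d_{13}} & \sigma^2\rho^{d_{23}} & \sigma^2 & \sigma^2\rho^{d_{34}} \\ \sigma^2\rho^{d_{14}} & \sigma^2\rho^{d_{24}} & \sigma^2\rho^{d_{34}} & \sigma^2 \end{pmatrix}$$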

31 The GEE Algorithm
Algorithm is similar to the one used for non-repeated measures problems (e.g., OLS for continuous data, logistic regression for binary data and Poisson regression for counts).
Let $R(\alpha)$ be an $n_i \times n_i$ "working" correlation matrix that is fully characterized by a vector of parameters, $\alpha$.
$V_i$ is again the variance-covariance of the observations, which will be a function of the mean ($E(Y_i \mid X_i)$), a scale parameter, $\phi$, and $R(\alpha)$.
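For reference, the quantities named above are usually combined as follows in the standard Liang and Zeger formulation. The slide itself does not show these equations, so the symbols $A_i$ (diagonal matrix of variance-function values $v(\mu_{ij})$) and $D_i$ (derivative of the mean with respect to the coefficients) are introduced here as assumptions about notation.

```latex
\begin{align*}
V_i &= \phi \, A_i^{1/2} \, R(\alpha) \, A_i^{1/2},
  \qquad A_i = \mathrm{diag}\{ v(\mu_{i1}), \ldots, v(\mu_{i n_i}) \}, \\
0 &= \sum_{i=1}^{n} D_i^{T} V_i^{-1} \bigl( Y_i - \mu_i(\beta) \bigr),
  \qquad D_i = \frac{\partial \mu_i(\beta)}{\partial \beta^{T}} .
\end{align*}
```

With $R(\alpha)$ set to the identity, the second line reduces to the usual score equations of OLS, logistic or Poisson regression, which is the sense in which the algorithm resembles the non-repeated measures case.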

32 Standard Errors of Coefficients GEE will normally return two estimates of the variance of the coefficient estimates, 1) naive and 2) robust. Naive assumes that the chosen model for R(α), such as compound symmetry, is correct. Robust is a more nonparametric estimate that does not assume your guess for R(α) is correct. However, its variance estimates can be more variable. 32
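A minimal sketch of the two variance estimates for the simplest possible case: an intercept-only linear model fitted with an independence working correlation. The data are hypothetical, and the robust estimate is the plain empirical sandwich form without any small-sample correction (software may apply one).

```python
import math

# Hypothetical clustered outcomes: one inner list per cluster (e.g., person).
clusters = [
    [10.0, 11.0, 12.0],
    [20.0, 21.0],
    [15.0, 14.0, 16.0, 15.0],
    [30.0, 29.0, 31.0],
]

def naive_and_robust_se(clusters):
    """Return (mean, naive SE, robust SE) for the overall mean."""
    values = [y for cluster in clusters for y in cluster]
    n_obs = len(values)
    mean = sum(values) / n_obs

    # Naive: treats all observations as independent with common variance.
    resid_ss = sum((y - mean) ** 2 for y in values)
    naive_var = resid_ss / (n_obs - 1) / n_obs

    # Robust: sums residuals within each cluster before squaring, so
    # within-cluster correlation is reflected in the variance.
    meat = sum(sum(y - mean for y in cluster) ** 2 for cluster in clusters)
    robust_var = meat / n_obs ** 2

    return mean, math.sqrt(naive_var), math.sqrt(robust_var)

mean, naive_se, robust_se = naive_and_robust_se(clusters)
print(mean, naive_se, robust_se)
```

In this made-up data set the observations within a cluster are very similar, so the robust standard error is larger than the naive one; the point estimate is the same in both cases.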

33 GEE Marginal Model for Teenage Sex and Drug-Use
$$\mathrm{logit}[P(Y_{ij}=1 \mid X_{ij}=x_{ij})] = \log\frac{\mu_{ij}}{1-\mu_{ij}} = \log\frac{P(Y_{ij}=1 \mid X_{ij}=x_{ij})}{P(Y_{ij}=0 \mid X_{ij}=x_{ij})} = \beta_0^{M} + \beta_1^{M} x_{ij}$$
$\mathrm{var}(Y_{ij}) = \mu_{ij}(1-\mu_{ij})$*, $\mathrm{corr}(Y_{ij}, Y_{ik}) = \rho$ (i.e., assume compound symmetry).
$\exp(\beta_1^{M})$ is a ratio of population frequencies, i.e., it is a population-averaged parameter. It is the odds ratio of the probabilities (proportions) of teenagers who would engage in sexual activity in populations reporting drug use vs. populations not reporting drug-use.
* Semi-robust inference: can you tell why?

34 Sexual Activity and drug/alcohol use among teenagers revisited
Main Variables
sx24hrs - sex in last 24 hrs. (0=no, 1=yes)
drgalcoh - drug or alcohol use in last 24 hrs.
tues-sun - dummy variables designating day of week

35 Results using xtgee in STATA
Robust SE:
. xtgee sx24hrs drgalcoh, eform i(id) family(binomial) cor(ind) robust
GEE population-averaged model       Number of obs      = 1708
Group variable: id                  Number of groups   = 109
Link: logit                         Obs per group: min = 1
Family: binomial                                   avg = 15.7
Correlation: independent                           max = 33
(standard errors adjusted for clustering on id)
sx24hrs    Odds Ratio    Semi-robust Std. Err.    z    P>|z|    [95% Conf. Interval]
drgalcoh ($\exp(\beta_1^{M})$)
Non-robust (naive) SE:
. xtgee sx24hrs drgalcoh, eform i(eid) family(binomial) cor(ind)
sx24hrs    Odds Ratio    Std. Err.    z    P>|z|    [95% Conf. Interval]
drgalcoh

36 xtgee Options
family(?), link(?) -- identify the outcome distribution and link function (e.g., Gaussian with identity link for linear regression with a continuous outcome, as compared to, say, binomial with logit link for binary outcomes; more later)
corr(ind) -- identify that we will assume independence for our working correlation structure (some other possibilities include exchangeable and autoregressive structures)
i(?) -- identify which variable identifies the individual (or cluster)
ro -- identifies that we wish robust estimates of variability

37 Model 2: same marginal model, different working correlation.
$$\mathrm{logit}[P(Y_{ij}=1 \mid X_{ij}=x_{ij})] = \log\frac{\mu_{ij}}{1-\mu_{ij}} = \log\frac{P(Y_{ij}=1 \mid X_{ij}=x_{ij})}{P(Y_{ij}=0 \mid X_{ij}=x_{ij})} = \beta_0^{M} + \beta_1^{M} x_{ij}$$
$x_{ij}$ = 0 if drug/alcohol use is no, 1 if yes
$y_{ij}$ = 0 if no sex in last 24 hours, 1 if yes
$\mathrm{corr}(Y_{ij}, Y_{ij'}) = \rho$ (compound symmetry or exchangeable correlation structure)

38 Results of Model 2 using STATA
Robust SE:
. xtgee sx24hrs drgalcoh, eform i(id) family(binomial) cor(exc) robust
GEE population-averaged model       Number of obs      = 1708
Group variable: id                  Number of groups   = 109
Link: logit                         Obs per group: min = 1
Family: binomial                                   avg = 15.7
Correlation: exchangeable                          max = 33
(standard errors adjusted for clustering on id)
sx24hrs    Odds Ratio    Semi-robust Std. Err.    z    P>|z|    [95% Conf. Interval]
drgalcoh
Non-robust (naive) SE:
. xtgee sx24hrs drgalcoh, eform i(eid) family(binomial) cor(exc)
sx24hrs    Odds Ratio    Std. Err.    z    P>|z|    [95% Conf. Interval]
drgalcoh

39 Estimated Working Correlation
. xtcorr
[Estimated within-subject correlation matrix, columns c1-c9; entries not preserved in the transcription]

40 Model 3: adjusting for day of week
$$\mathrm{logit}[P(Y_{ij}=1 \mid x_{ij}, day_{ij})] = \beta_0 + \beta_1 x_{ij} + \gamma_1 z_{1ij} + \gamma_2 z_{2ij} + \cdots + \gamma_6 z_{6ij}$$
$x_{ij}$ = 1 if drug/alcohol use is yes, 0 if no
$z_{1ij}$ = 1 if interview day is Tuesday, 0 if not
$z_{2ij}$ = 1 if interview day is Wed., 0 if not
...
$z_{6ij}$ = 1 if interview day is Sunday, 0 if not
$y_{ij}$ = 1 if sex in last 24 hours, 0 if no
$\mathrm{corr}(Y_{ij}, Y_{ij'}) = \rho$ (compound symmetry or exchangeable correlation structure)

41 Results of Model 3 using STATA
. xtgee sx24hrs drgalcoh tues wed thur fri sat sun, eform i(id) family(binomial) cor(exc) robust
GEE population-averaged model       Number of obs      = 1708
Group variable: id                  Number of groups   = 109
Link: logit                         Obs per group: min = 1
Family: binomial                                   avg = 15.7
Correlation: exchangeable                          max = 33
                                    Wald chi2(7)       =
Scale parameter: 1                  Prob > chi2        =
(standard errors adjusted for clustering on id)
sx24hrs    Odds Ratio    Semi-robust Std. Err.    z    P>|z|    [95% Conf. Interval]
drgalcoh
tues
wed
thur
fri
sat
sun

42 Model for drug/alcohol use vs. day of week
$$\mathrm{logit}[P(X_{ij}=1 \mid day_{ij})] = \gamma_0^{*} + \gamma_1^{*} z_{1ij} + \gamma_2^{*} z_{2ij} + \cdots + \gamma_6^{*} z_{6ij}$$
$X_{ij}$ = 1 if drug/alcohol use is yes, 0 if no
$z_{1ij}$ = 1 if interview day is Tuesday, 0 if not
$z_{2ij}$ = 1 if interview day is Wed., 0 if not
...
$z_{6ij}$ = 1 if interview day is Sunday, 0 if not
$\mathrm{corr}(X_{ij}, X_{ij'}) = \rho$ (compound symmetry or exchangeable correlation structure)

43 Results of drug/alcohol use Model using STATA
. xtgee drgalcoh tues wed thur fri sat sun, eform i(id) family(binomial) cor(exc) robust
GEE population-averaged model       Number of obs      = 1708
Group variable: id                  Number of groups   = 109
Link: logit                         Obs per group: min = 1
Family: binomial                                   avg = 15.7
Correlation: exchangeable                          max = 33
                                    Wald chi2(6)       =
Scale parameter: 1                  Prob > chi2        =
(standard errors adjusted for clustering on id)
drgalcoh    Odds Ratio    Semi-robust Std. Err.    z    P>|z|    [95% Conf. Interval]
tues
wed
thur
fri
sat
sun

44 Covariate and Cluster size issues We examine a simple example to look at how estimation and inference with clustered data are impacted by various changes in the data distribution. Cluster constant (e.g., county level) versus cluster varying (e.g., individual-level) covariates. Balanced versus unbalanced data (number of subunits within clusters). 44

45 Longitudinal Data on HIV+ patients
Deeks, et al. (1999) report the results from a longitudinal study of HIV-infected adults undergoing Highly Active Anti-Retroviral Therapy (HAART) at San Francisco General Hospital (SFGH). Patients were included in this analysis if they received at least 16 weeks of continuous therapy with an anti-retroviral regimen. The following data were obtained during the initial review: date of birth, sex and length of previous exposure to each individual anti-retroviral agent.

46 Once patients were identified, their medical records were reviewed every 3-4 months until November. Plasma HIV RNA assays were performed using a branched DNA (bDNA) assay. Repeated and irregular measurements of CD4 and viral load (time-structured repeated measures). Data not always matched in time. Goal is to find how CD4 varies with viral load and how this pattern varies in the population.

47 Sample of HIV+ Data 47

48 CD4 versus Time
[Figure: CD4 vs. etime for 30 evenly spaced subjects ranked by slope (CD4 vs. T)]

49 HIV+ (CD4 Count) Data: some simple analyses using only 2 observations per person
Purpose is to illustrate the effects on estimates and inference of both different working correlation matrices and robust vs. naive inference. Consider two scenarios: a baseline (time-independent) covariate and a time-dependent covariate.

50 Association of Baseline Covariate (Age) on CD4 count
Binary age ($X_{ij}$) = 0 (<40) or 1 (>40)
Fit simple linear model:
$$E[Y_{ij} \mid X_{ij} = x_i] = \beta_0 + \beta_1 x_i$$
Compare results of Models A-D:
                  Naive   Robust
Unweighted OLS    A       B
Weighted LS       C       D

51 Association of Baseline Covariate (Age) on CD4 count
Model A
. xtgee cd4 binage, i(id) cor(ind)
cd4    Coef.    Std. Err.    z    P>|z|    [95% Conf. Interval]
binage
_cons
Model B
. xtgee cd4 binage, i(id) cor(ind) robust
(standard errors adjusted for clustering on id)
cd4    Coef.    Semi-robust Std. Err.    z    P>|z|    [95% Conf. Interval]
binage
_cons

52 Association of Baseline Covariate (Age) on CD4 count
Model C
. xtgee cd4 binage, i(id) cor(exc)
cd4    Coef.    Std. Err.    z    P>|z|    [95% Conf. Interval]
binage
_cons
Model D
. xtgee cd4 binage, i(id) cor(exc) robust
(standard errors adjusted for clustering on id)
cd4    Coef.    Semi-robust Std. Err.    z    P>|z|    [95% Conf. Interval]
binage
_cons

53 Summary of Results of Association of Baseline Covariate (Age) on CD4 count
$\beta_0$ (SE)      Naive     Robust
Unweighted OLS    (9.9)     (12.6)
Weighted LS       (13.3)    (12.6)
$\beta_1$ (SE)      Naive     Robust
Unweighted OLS    (14.2)    (19.3)
Weighted LS       (19.2)    (19.3)

54 Association of Time (within cluster) Varying Covariate (Viral Load) on CD4 count
Binary VL: $X_{ij}$ = 0 (<2000) or 1 (>2000); all subjects included have one low and one high VL.
Fit simple linear model:
$$E[Y_{ij} \mid X_{ij} = x_{ij}] = \beta_0 + \beta_1 x_{ij}$$
Compare results of Models A-D:
                  Naive   Robust
Unweighted OLS    A       B
Weighted LS       C       D

55 Association of Within-Cluster-Varying Covariate (VL) on CD4 count
Model A
. xtgee cd4 medvl, i(id) cor(ind)
cd4    Coef.    Std. Err.    z    P>|z|    [95% Conf. Interval]
medvl
_cons
Model B
. xtgee cd4 medvl, i(id) cor(ind) robust
(standard errors adjusted for clustering on id)
cd4    Coef.    Semi-robust Std. Err.    z    P>|z|    [95% Conf. Interval]
medvl
_cons

56 Association of Within-Cluster-Varying Covariate (VL) on CD4 count
Model C
. xtgee cd4 medvl, i(id) cor(exc)
cd4    Coef.    Std. Err.    z    P>|z|    [95% Conf. Interval]
medvl
_cons
Model D
. xtgee cd4 medvl, i(id) cor(exc) robust
(standard errors adjusted for clustering on id)
cd4    Coef.    Semi-robust Std. Err.    z    P>|z|    [95% Conf. Interval]
medvl
_cons

57 Robust Equivalent to Paired T-test
Paired t-test:
. keep id cd4 medvl etime
. sort cd4 medvl
. reshape wide cd4 etime, i(id) j(medvl)
. ttest cd40 = cd41
Paired t test
Variable    Obs    Mean    Std. Err.    Std. Dev.    [95% Conf. Interval]
cd40
cd41
diff
Ho: mean(cd40 - cd41) = mean(diff) = 0
Ha: mean(diff) < 0      Ha: mean(diff) != 0      Ha: mean(diff) > 0
t =                     t =                      t =
P < t =                 P > |t| =                P > t =

58 Summary of Results of Association of Time Varying Covariate (VL) on CD4 count
$\beta_0$ (SE)         Naive          Robust
Unweighted OLS       377.4 (21.4)   377.4 (22.9)
Weighted LS          377.4 (21.4)   377.4 (22.9)
$\beta_1$ (SE)         Naive          Robust
Unweighted OLS       -98.3 (30.3)   -98.3 (16.5)
Weighted LS          -98.3 (16.4)   -98.3 (16.5)
t-test (difference)  -98.3 (16.5)
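The agreement between the robust standard error and the paired t-test can be checked directly. The sketch below uses hypothetical (low-VL CD4, high-VL CD4) pairs, fits OLS with a binary covariate, and computes the cluster-robust sandwich variance of the slope without small-sample correction. Under these assumptions the robust SE equals the paired t-test SE multiplied by sqrt((n-1)/n), so the two agree closely for moderate n.

```python
import math

# Hypothetical pairs: (CD4 when VL is low, CD4 when VL is high) for each person.
pairs = [(420.0, 310.0), (350.0, 300.0), (500.0, 380.0),
         (280.0, 260.0), (390.0, 250.0)]

def paired_t_se(pairs):
    """Mean within-person difference and its usual paired t-test SE."""
    diffs = [high - low for low, high in pairs]
    n = len(diffs)
    dbar = sum(diffs) / n
    s2 = sum((d - dbar) ** 2 for d in diffs) / (n - 1)
    return dbar, math.sqrt(s2 / n)

def ols_robust_slope_se(pairs):
    """OLS slope for x = 0 (low VL) / 1 (high VL) with cluster-robust SE."""
    n = len(pairs)
    b0 = sum(low for low, _ in pairs) / n
    b1 = sum(high for _, high in pairs) / n - b0

    # Bread: inverse of X'X, where X'X = [[2n, n], [n, n]] for this design.
    det = 2 * n * n - n * n
    bread = [[n / det, -n / det], [-n / det, 2 * n / det]]

    # Meat: sum over persons of (X_i' r_i)(X_i' r_i)'.
    meat = [[0.0, 0.0], [0.0, 0.0]]
    for low, high in pairs:
        r0, r1 = low - b0, high - b0 - b1
        score = (r0 + r1, r1)
        for u in range(2):
            for v in range(2):
                meat[u][v] += score[u] * score[v]

    var_b1 = sum(bread[1][u] * meat[u][v] * bread[v][1]
                 for u in range(2) for v in range(2))
    return b1, math.sqrt(var_b1)

dbar, t_se = paired_t_se(pairs)
b1, robust_se = ols_robust_slope_se(pairs)
print(dbar, t_se)
print(b1, robust_se)
```

The naive OLS standard error ignores the pairing and can be much larger, which is the pattern in the summary table.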

59 Multiple and varying observations per person
CD4 (Y) vs. continuous (log) Viral Load (X)
$$E[Y_{ij} \mid X_{i1} = x_{i1}, X_{ij} = x_{ij}] = \beta_0 + \beta_1 x_{i1} + \beta_2 (x_{ij} - x_{i1})$$
$\beta_2$ represents the expected change in Y given a change in $X_{ij}$ relative to the baseline value ($X_{i1}$): the longitudinal effect.
$\beta_1$ represents the expected difference in average Y across two sub-populations that differ by their baseline values, $X_{i1}$: the cross-sectional effect.
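The two covariates in this model can be derived from long-format data as sketched below. The record layout (id, time, log viral load) and the values are hypothetical; the two derived columns play the role of the logvlbase and logvlchange variables used in the Stata commands.

```python
def baseline_and_change(records):
    """records: list of (id, time, x) tuples, one per visit.

    Returns (id, time, x_base, x_change) sorted by id and time, where
    x_base is the person's first observed x and x_change = x - x_base.
    """
    ordered = sorted(records)
    baseline = {}
    for pid, _, x in ordered:
        baseline.setdefault(pid, x)  # keeps only the earliest visit per person
    return [(pid, t, baseline[pid], x - baseline[pid]) for pid, t, x in ordered]

# Hypothetical visits, deliberately out of order.
records = [(2, 0.0, 4.0), (1, 3.0, 3.5), (1, 0.0, 5.0), (2, 5.0, 4.5)]
for row in baseline_and_change(records):
    print(row)
```

Regressing the outcome on the third and fourth columns then separates the cross-sectional coefficient from the longitudinal one.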

60 Association of Within-Cluster-Varying Covariate (VL) on CD4 count: multiple observations per person
Model A
. xtgee cd4 logvlbase logvlchange, i(id) cor(ind)
GEE population-averaged model       Number of obs      = 7053
Group variable: id                  Number of groups   = 406
Link: identity                      Obs per group: min = 1
Family: Gaussian                                   avg = 17.4
Correlation: independent                           max =
cd4    Coef.    Std. Err.    z    P>|z|    [95% Conf. Interval]
logvlbase
logvlchange
_cons

61 Association of Time-Varying Covariate (VL) on CD4 count: multiple observations per person
Model B
. xtgee cd4 logvlbase logvlchange, i(id) cor(ind) robust
GEE population-averaged model       Number of obs      = 7053
Group variable: id                  Number of groups   = 406
Link: identity                      Obs per group: min = 1
Family: Gaussian                                   avg = 17.4
Correlation: independent                           max = 58
                                    Wald chi2(2)       =
(standard errors adjusted for clustering on id)
cd4    Coef.    Semi-robust Std. Err.    z    P>|z|    [95% Conf. Interval]
logvlbase
logvlchange
_cons

62 Association of Time-Varying Covariate (VL) on CD4 count: multiple observations per person
Model C
. xtgee cd4 logvlbase logvlchange, i(id) cor(exc)
cd4    Coef.    Std. Err.    z    P>|z|    [95% Conf. Interval]
logvlbase
logvlchange
_cons
Model D
(standard errors adjusted for clustering on id)
cd4    Coef.    Semi-robust Std. Err.    z    P>|z|    [95% Conf. Interval]
logvlbase
logvlchange
_cons

63 Summary of Results of Association of Time-Varying Covariate (VL) on CD4 count: multiple observations per person
$\beta_0$ (SE)      Naive          Robust
Unweighted OLS    618.9 (11.6)   618.9 (35.2)
Weighted LS       509.1 (31.2)   509.1 (32.9)
$\beta_1$ (SE)      Naive          Robust
Unweighted OLS    -83.7 (3.0)    -83.7 (8.3)
Weighted LS       -52.7 (7.4)    -52.7 (7.7)
$\beta_2$ (SE)      Naive          Robust
Unweighted OLS    -99.2 (2.4)    -99.2 (6.8)
Weighted LS       -54.7 (2.2)    -54.7 (3.2)

64 MultiLevel Models Alan Hubbard UC Berkeley - Division of Biostatistics 64

65 Many Names for Variants of the Same Statistical Model
Hierarchical Linear Models (HLMs)
Random Coefficient Models
Mixed Models (most general)
Multilevel Models (MLMs)
Nested modeling

66 Typical Data Structures for MLMs
Distinctive feature is the hierarchical nature of statistical units, e.g.
  Neighborhoods -> people -> measurements made over time on the people
  Classrooms -> students -> measurements made over time on the students
Different sources of variation:
  Between classrooms
  Between students within classrooms
  Within students

67 Motivation for using MLMs (mixed models)
Procedure estimates the fixed effects of interest.
Dissects the sources of variation.
Accounts for residual correlation among statistically dependent units when deriving inference.
Permits one to specify a rich set of correlation models and allows for heteroskedasticity.

68 Motivation for using MLMs (mixed models)
It allows different subjects to have different responses to a treatment, risk variable, etc., and thus has intuitive appeal.
Rarely interesting, but it can also provide post-estimation estimates of the random effects.
You get the entire data-generating distribution:
  Use the virtues of having a likelihood.
  Can simulate data from the resulting parameter estimates.
  In contrast with other approaches that only target a specific aspect of the data-generating distribution.

69 What's being Mixed?
A mixed model has two types of effects, fixed and random.
A fixed effect means that all levels of the variable are contained in the data and the effect is universal to all in the target population.
A random effect means that the levels (effects) of the variable comprise random samples of the levels (effects) in the target population.
Consider a risk factor effect. Fixed, Random, Both?

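70 The Simplest Example
The Model:
$$Y_{ij} = \mu + \alpha_i + e_{ij}$$
$E(\alpha_i) = 0$, $E(e_{ij}) = 0$, $E[\alpha_i e_{ij}] = 0$.
$\mathrm{Var}(\alpha_i) = \sigma^2_\alpha$. $\mathrm{Var}(e_{ij}) = \sigma^2_e$.
More specifically, $\alpha_i \sim N(0, \sigma^2_\alpha)$, $e_{ij} \sim N(0, \sigma^2_e)$.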

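71 Likelihood
Given $\alpha_i$ (writing $\phi(u; \sigma^2)$ for the $N(0, \sigma^2)$ density evaluated at $u$):
$$f(Y_{ij} \mid \alpha_i) = \phi(Y_{ij} - \alpha_i - \mu;\ \sigma^2_e), \qquad f(Y_i \mid \alpha_i) = \prod_{j=1}^{n_i} \phi(Y_{ij} - \alpha_i - \mu;\ \sigma^2_e)$$
Likelihood of observed data (for one unit) is:
$$f(Y_i) = \int f(Y_i \mid \alpha) f(\alpha)\, d\alpha = \int \prod_{j=1}^{n_i} \phi(Y_{ij} - \alpha - \mu;\ \sigma^2_e)\ \phi(\alpha;\ \sigma^2_\alpha)\, d\alpha$$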
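As a check on the integral above, the following sketch evaluates the marginal density of one unit with two observations in two ways: by numerically integrating out the random intercept, and from the closed-form bivariate normal with variance $\sigma^2_\alpha + \sigma^2_e$ and covariance $\sigma^2_\alpha$. All numeric values are hypothetical.

```python
import math

def norm_pdf(u, var):
    """Density of N(0, var) at u."""
    return math.exp(-0.5 * u * u / var) / math.sqrt(2.0 * math.pi * var)

def marginal_density_numeric(y, mu, var_a, var_e, half_width=10.0, steps=4000):
    """Integrate prod_j phi(y_j - a - mu; var_e) * phi(a; var_a) over a (trapezoid rule)."""
    sd_a = math.sqrt(var_a)
    lo = -half_width * sd_a
    h = 2.0 * half_width * sd_a / steps
    total = 0.0
    for k in range(steps + 1):
        a = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        conditional = 1.0
        for yj in y:
            conditional *= norm_pdf(yj - a - mu, var_e)
        total += weight * conditional * norm_pdf(a, var_a)
    return total * h

def marginal_density_closed_form(y1, y2, mu, var_a, var_e):
    """Bivariate normal density with Var = var_a + var_e and Cov = var_a."""
    v, c = var_a + var_e, var_a
    det = v * v - c * c
    u1, u2 = y1 - mu, y2 - mu
    quad = (v * u1 * u1 - 2.0 * c * u1 * u2 + v * u2 * u2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

# Hypothetical unit with two observations.
y, mu, var_a, var_e = (5.0, 6.5), 5.0, 4.0, 1.0
numeric = marginal_density_numeric(y, mu, var_a, var_e)
closed = marginal_density_closed_form(y[0], y[1], mu, var_a, var_e)
print(numeric, closed)
```

For Gaussian outcomes the integral has this closed form, which is why linear mixed models need no numerical integration; for the logistic random effects models later in the deck no closed form exists and software relies on quadrature.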

72 Estimation of fixed effects using mixed models
Random effects models imply certain variance-covariance structures. For instance, a simple random effects model results in equal correlation (exchangeable or compound symmetry) among all observations measured on the same subject.
We know that if the variance-covariance matrix (V) is known, then the most efficient estimate of the coefficients is weighted least squares:
$$\hat{\beta} = (X^T W X)^{-1} X^T W Y, \quad \text{where } W = V^{-1}$$
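A minimal sketch of this weighted least squares estimate for the simplest case: an intercept-only model with a known exchangeable covariance, $V_i = \sigma^2_e I + \sigma^2_\alpha J$, and unbalanced clusters. For this special case $1^T V_i^{-1} 1 = n_i / (\sigma^2_e + n_i \sigma^2_\alpha)$ and $1^T V_i^{-1} Y_i = \sum_j Y_{ij} / (\sigma^2_e + n_i \sigma^2_\alpha)$, so no matrix inversion is needed. The data and variance components are hypothetical.

```python
def gls_mean(clusters, var_a, var_e):
    """GLS estimate of mu and its model-based variance, (X'WX)^{-1}.

    Uses the closed form of V_i^{-1} for V_i = var_e * I + var_a * J.
    """
    xtwy = 0.0  # X' W Y
    xtwx = 0.0  # X' W X
    for cluster in clusters:
        n = len(cluster)
        scale = var_e + n * var_a
        xtwy += sum(cluster) / scale
        xtwx += n / scale
    return xtwy / xtwx, 1.0 / xtwx

# Hypothetical unbalanced clusters.
clusters = [[10.0, 12.0], [20.0], [30.0, 31.0, 29.0, 30.0]]

est_independent, _ = gls_mean(clusters, var_a=0.0, var_e=1.0)
est_moderate, var_moderate = gls_mean(clusters, var_a=4.0, var_e=1.0)
est_dominant, _ = gls_mean(clusters, var_a=1e9, var_e=1.0)
print(est_independent, est_moderate, est_dominant)
```

With no between-cluster variance the estimate is the pooled mean of all observations; as the between-cluster variance grows it moves toward the unweighted mean of the cluster means. This is one reason point estimates can change with the working correlation when cluster sizes are unbalanced.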

73 Estimation of coefficients using mixed models, cont.
The Mixed Model procedure works by:
  converting the random effects model into its implied variance-covariance matrix, V;
  starting with the independence model (OLS), getting residuals and estimating V from them;
  creating the weight matrix as $W = \hat{V}^{-1}$;
  doing weighted least squares and getting new residuals;
  repeating until convergence.
The SEs the procedure returns come from:
$$\widehat{\mathrm{var}}(\hat{\beta}) = (X^T W X)^{-1} = (X^T \hat{V}^{-1} X)^{-1}$$

74 Model Based Inference
When deriving the inference on coefficients, the estimating procedure assumes that the variance-covariance model of the outcome implied by the model IS CORRECT (i.e., its $SE(\hat{\beta})$ is always naive, not robust).

75 Virtues of MultiLevel Models Diez-Roux 75

76 Provides Road Map for Accounting for Systematic and Random Variation at Various Levels (individuals, counties, states, ...) Diez-Roux

77 Stage 2 Diez-Roux 77

78 Put it together, just a mixed model Diez-Roux 78

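79 Random Intercepts and Random Associations
The Model:
$$Y_{ij} = (\beta_0 + \beta_{0i}) + (\beta_1 + \beta_{1i}) x_{ij} + e_{ij}$$
$E(\beta_{0i}) = 0$, $E(\beta_{1i}) = 0$, $E(e_{ij}) = 0$.
$\mathrm{Var}(\beta_{0i}) = \sigma^2_0$, $\mathrm{Var}(\beta_{1i}) = \sigma^2_1$, $\mathrm{Var}(e_{ij}) = \sigma^2$.
$\mathrm{cov}(\beta_{0i}, \beta_{1i}) = \sigma_{12}$, $\mathrm{cov}(\beta_{0i}, e_{ij}) = 0$, $\mathrm{cov}(\beta_{1i}, e_{ij}) = 0$.
What are the fixed and random effects in this model?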

80 Simple Example (Individual is Cluster)
Orthodontic study (Potthoff and Roy, 1964): 16 boys and 11 girls between the ages of 8 and 14 years. Response variable is the distance (in millimeters) between the pituitary and the pterygomaxillary fissure.

81 Dental Data obsno child age distance gender

82 Dental Data
[Figure: distance vs. age (yrs)]

83 Mixed Model I for Dental Data
Model:
$$Y_{ij} = (\beta_0 + \beta_{0i}) + \beta_1 x_{ij} + e_{ij}$$
where $x_{ij}$ is the jth age of the ith child and $Y_{ij}$ is the distance.
$\beta_{0i}$ i.i.d. $N(0, \sigma^2_0)$, $e_{ij}$ i.i.d. $N(0, \sigma^2)$

84 Why is this model called MultiLevel?
Can write the model in two steps (as two levels):
$$Y_{ij} = \beta^{*}_{0i} + \beta_1 x_{ij} + e_{ij}$$
then model for individual coefficients:
$$\beta^{*}_{0i} = \beta_0 + \beta_{0i}$$
Implies that each individual has their own random intercept, drawn from a population with mean intercept $\beta_0$, but all subjects have the same slope, $\beta_1$.
$\beta^{*}_{0i}$ can also be a function of, say, baseline covariates $z$:
$$\beta^{*}_{0i}(z) = \beta_0 + \beta_{0i} + \alpha_1 z_1 + \alpha_2 z_2$$

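85 Mixed (MultiLevel) Model II for Dental Data
The Model (called a random coefficients model):
$$Y_{ij} = (\beta_0 + \beta_{0i}) + (\beta_1 + \beta_{1i}) x_{ij} + e_{ij}$$
$E(\beta_{0i}) = 0$, $E(\beta_{1i}) = 0$, $E(e_{ij}) = 0$.
$\mathrm{Var}(\beta_{0i}) = \sigma^2_0$, $\mathrm{Var}(\beta_{1i}) = \sigma^2_1$, $\mathrm{Var}(e_{ij}) = \sigma^2$.
$\mathrm{cov}(\beta_{0i}, \beta_{1i}) = \sigma_{12}$, $\mathrm{cov}(\beta_{0i}, e_{ij}) = \mathrm{cov}(\beta_{1i}, e_{ij}) = 0$.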

86 STATA for Model II: xtmixed
. xtmixed distance age || child: age, cov(uns)
Mixed-effects REML regression       Number of obs      = 108
Group variable: child               Number of groups   = 27
                                    Obs per group: min = 4
                                                   avg = 4.0
                                                   max = 4
                                    Wald chi2(1)       =
Log restricted-likelihood =         Prob > chi2        =
distance    Coef.    Std. Err.    z    P>|z|    [95% Conf. Interval]
age
_cons

87 STATA for Model II: xtmixed Variance Components Estimates
Random-effects Parameters    Estimate    Std. Err.    [95% Conf. Interval]
child: Unstructured
  sd(age)            ($\sigma_1$)
  sd(_cons)          ($\sigma_0$)
  corr(age,_cons)    (correlation implied by $\sigma_{12}$)
  sd(residual)
LR test vs. linear regression: chi2(3) =       Prob > chi2 =
Note: LR test is conservative and provided only for reference

88 STATA for Model II: xtmixed random coefficient estimates
. predict b*, reffects
. list b*
b1 ($\beta_{1i}$)    b2 ($\beta_{0i}$)

89 Summary of Results of Association of Baseline Covariate (Age) on CD4 count
$\beta_0$ (SE)      Naive     Robust
Unweighted OLS    (9.9)     (12.6)
Weighted LS       (13.3)    (12.6)
$\beta_1$ (SE)      Naive     Robust
Unweighted OLS    (14.2)    (19.3)
Weighted LS       (19.2)    (19.3)

90 Re-visit Model of CD4 vs. Baseline Age
Fit simple random effects model: $Y_{ij} = \beta_0 + \beta_{0i} + \beta_1 X_{ij} + e_{ij}$
. xtreg cd4 binage, i(id) re
Random-effects GLS regression       Number of obs      = 594
Group variable (i): id              Number of groups   = 297
R-sq: within  =                     Obs per group: min = 2
      between =                                    avg = 2.0
      overall =                                    max = 2
Random effects u_i ~ Gaussian       Wald chi2(1)       = 1.59
corr(u_i, X) = 0 (assumed)          Prob > chi2        =
cd4    Coef.    Std. Err.    z    P>|z|    [95% Conf. Interval]
binage     (est. of $\beta_1$)
_cons      (est. of $\beta_0$)
sigma_u    (estimate of $\sigma_0$, the standard deviation of $\beta_{0i}$)
sigma_e    (estimate of $\sigma$, the standard deviation of $e_{ij}$)
rho        (fraction of variance due to u_i)

91 Association of Time-Varying Covariate (Viral Load) on CD4 count
Binary VL: $X_{ij}$ = 0 (<2000) or 1 (>2000); all subjects included have one low and one high VL.
Fit simple linear model:
$$E[Y_{ij} \mid X_{ij} = x_{ij}] = \beta_0 + \beta_1 x_{ij}$$
Compare results of Models A-D:
                  Naive   Robust
Unweighted OLS    A       B
Weighted LS       C       D

92 Summary of Results of Association of Time Varying Covariate (VL) on CD4 count (note: different data than last lecture)
$\beta_0$ (SE)         Naive          Robust
Unweighted OLS       355.1 (21.7)   355.1 (23.6)
Weighted LS          355.1 (21.7)   377.4 (22.9)
$\beta_1$ (SE)         Naive          Robust
Unweighted OLS       -79.3 (30.7)   -79.3 (17.1)
Weighted LS          -79.3 (17.0)   -79.3 (17.1)
t-test (difference)  -79.3 (17.1)

93 Random Effects Model of CD4 vs. log10(viral load)
Fit simple random effects model: $Y_{ij} = \beta_0 + \beta_{0i} + \beta_1 X_{ij} + e_{ij}$
. xtreg cd4 medvl, i(id) re
Random-effects GLS regression       Number of obs      = 174
Group variable (i): id              Number of groups   = 87
R-sq: within  =                     Obs per group: min = 2
      between =                                    avg = 2.0
      overall =                                    max = 2
Random effects u_i ~ Gaussian       Wald chi2(1)       =
corr(u_i, X) = 0 (assumed)          Prob > chi2        =
cd4    Coef.    Std. Err.    z    P>|z|    [95% Conf. Interval]
medvl
_cons
sigma_u    (estimate of $\sigma_0$, the standard deviation of $\beta_{0i}$)
sigma_e    (estimate of $\sigma$, the standard deviation of $e_{ij}$)
rho        (fraction of variance due to u_i)

94 Multiple and varying observations per person
CD4 (Y) vs. continuous (log) Viral Load (X)
$$E[Y_{ij} \mid X_{i1} = x_{i1}, X_{ij} = x_{ij}] = \beta_0 + \beta_1 x_{i1} + \beta_2 (x_{ij} - x_{i1})$$
$\beta_2$ represents the expected change in Y given a change in $X_{ij}$ relative to the baseline value ($X_{i1}$): the longitudinal effect.
$\beta_1$ represents the expected difference in average Y across two sub-populations that differ by their baseline values, $X_{i1}$: the cross-sectional effect.

95 Summary of Results of Association of Time-Varying Covariate (VL) on CD4 count: multiple observations per person
$\beta_0$ (SE)      Naive          Robust
Unweighted OLS    618.9 (11.6)   618.9 (35.2)
Weighted LS       509.1 (31.2)   509.1 (32.9)
$\beta_1$ (SE)      Naive          Robust
Unweighted OLS    -83.7 (3.0)    -83.7 (8.3)
Weighted LS       -52.7 (7.4)    -52.7 (7.7)
$\beta_2$ (SE)      Naive          Robust
Unweighted OLS    -99.2 (2.4)    -99.2 (6.8)
Weighted LS       -54.7 (2.2)    -54.7 (3.2)

96 Random Effects Model of CD4 vs. log10(viral load)
Fit simple random effects model: $Y_{ij} = \beta_0 + \beta_{0i} + \beta_1 X_{i1} + \beta_2 (X_{ij} - X_{i1}) + e_{ij}$
Random-effects GLS regression       Number of obs      = 7053
Group variable (i): id              Number of groups   = 406
R-sq: within  =                     Obs per group: min = 1
      between =                                    avg = 17.4
      overall =                                    max = 58
Random effects u_i ~ Gaussian       Wald chi2(2)       =
corr(u_i, X) = 0 (assumed)          Prob > chi2        =
cd4    Coef.    Std. Err.    z    P>|z|    [95% Conf. Interval]
logvlbase
logvlchange
_cons
sigma_u
sigma_e
rho        (fraction of variance due to u_i)

97 Random Coefficients Model of CD4 vs. log10(viral load)
Fit random coef. model: $Y_{ij} = (\beta_0 + \beta_{0i}) + (\beta_1 + \beta_{1i}) X_{i1} + \beta_2 (X_{ij} - X_{i1}) + e_{ij}$
Fixed Effects (Coefficient) Estimates
. xtmixed cd4 logvlbase logvlchange || id: logvlbase
Mixed-effects REML regression       Number of obs      = 7053
Group variable: id                  Number of groups   = 406
                                    Obs per group: min = 1
                                                   avg = 17.4
                                                   max = 58
                                    Wald chi2(2)       =
Log restricted-likelihood =         Prob > chi2        =
cd4    Coef.    Std. Err.    z    P>|z|    [95% Conf. Interval]
logvlbase
logvlchange
_cons

98 Random Coefficients Model of CD4 vs. log10(viral load)
Fit random coefficients model: $Y_{ij} = (\beta_0 + \beta_{0i}) + (\beta_1 + \beta_{1i}) X_{i1} + \beta_2 (X_{ij} - X_{i1}) + e_{ij}$
Variance Components Estimates
Random-effects Parameters table (Estimate, Std. Err., 95% Conf. Interval), id: Independent
sd(logvlb~e): square root of $\mathrm{var}(\beta_{1i})$
sd(_cons): square root of $\mathrm{var}(\beta_{0i})$
sd(Residual): square root of $\mathrm{var}(e_{ij})$
Also reported: LR test vs. linear regression, chi2(2), Prob > chi2

99 Random Effects Model for Teenage Sex and Drug-Use
$\mathrm{logit}[P(Y_{ij}=1 \mid \beta_{0i}, X_{ij}=x_{ij})] = \log \frac{P(Y_{ij}=1 \mid \beta_{0i}, X_{ij})}{P(Y_{ij}=0 \mid \beta_{0i}, X_{ij})} = \beta^*_{0i} + \beta_1 X_{ij}$, with $\beta^*_{0i} = \beta_0 + \beta_{0i}$ and $\beta_{0i} \sim N(0, \tau^2)$
Assume that the repeated observations for the ith teenager are independent of one another given $\beta_{0i}$ and $X_{ij}$.
Must assume a parametric distribution for the $\beta_{0i}$, usually $\beta_{0i} \sim N(0, \tau^2)$.
$\exp(\beta_1)$ is the odds ratio for having sex when subject i reports drug use relative to when the same subject does not report drug use.

100 Random effects for teenage sex vs. drug use
. xtlogit sx24hrs drgalcoh, or i(eid) re
Random-effects logit; Group variable (i): eid
Number of obs = 1708; Number of groups = 109
Random effects u_i ~ Gaussian
Obs per group: min = 1, avg = 15.7, max = 33
Wald chi2(1) = 5.48
Also reported: Log likelihood, Prob > chi2
Coefficient table (OR, Std. Err., z, P>|z|, 95% Conf. Interval) with row drgalcoh: estimate of $\exp(\beta_1)$
/lnsig2u; sigma_u: estimate of $\tau$; rho
Likelihood ratio test of rho=0: chibar2(01), Prob >= chibar2

101 Random Effects Model for Diarrhea Study in Children
$\log \frac{P(Y_{ijk}=1)}{1 - P(Y_{ijk}=1)} = \beta_0 + \beta_{0i} + \beta_{0ij}$
Measurements made on children (k) within households (j) within villages (i).
Want to know the greatest sources of variation: households, $\mathrm{var}(\beta_{0ij})$, or villages, $\mathrm{var}(\beta_{0i})$.
Assumes children in the same household have the same probability of diarrhea.
Use gllamm in Stata (also xtmelogit).

102 Random Effects Model for Diarrhea Study in Children
. gllamm diarrhea, i(hhid vilid) nip(5) family(binomial)
number of level 1 units = 4736
number of level 2 units = 18
number of level 3 units = 4
Also reported: Condition Number, gllamm model log likelihood
Coefficient table (Coef., Std. Err., z, P>|z|, 95% Conf. Interval) with row _cons

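103 Random Effects Model for Diarrhea Study in Children
Variances and covariances of random effects
***level 2 (hhid): var(1) is the estimate of $\mathrm{var}(\beta_{0ij})$
***level 3 (vilid): var(1) is the estimate of $\mathrm{var}(\beta_{0i})$
Cluster correlation coefficient (based on latent response model):
$\rho_{house} = \frac{\mathrm{var}(\beta_{0ij}) + \mathrm{var}(\beta_{0i})}{\mathrm{var}(\beta_{0ij}) + \mathrm{var}(\beta_{0i}) + \pi^2/3}$
$\rho_{village} = \frac{\mathrm{var}(\beta_{0i})}{\mathrm{var}(\beta_{0ij}) + \mathrm{var}(\beta_{0i}) + \pi^2/3}$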
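
The cluster correlations on the latent response scale can be computed directly from the two estimated variance components. The sketch below is a Python illustration under the assumption that the level-1 residual of the latent response has the standard logistic variance $\pi^2/3$; the function and argument names are hypothetical, and no values from the gllamm output are used.

```python
import math


def latent_icc(var_household, var_village):
    """Intraclass correlations for a three-level random-intercept logit model.

    Assumes the latent-response formulation, in which the level-1 residual
    follows a standard logistic distribution with variance pi^2 / 3.
    Returns (rho_house, rho_village).
    """
    residual = math.pi ** 2 / 3
    total = var_household + var_village + residual
    # Two children in the same household share both random intercepts.
    rho_house = (var_household + var_village) / total
    # Two children in the same village but different households share only
    # the village-level intercept.
    rho_village = var_village / total
    return rho_house, rho_village


if __name__ == "__main__":
    # Made-up variance components, for illustration only.
    print(latent_icc(var_household=0.5, var_village=0.2))
```

Comparing the two returned values shows how much of the latent-scale correlation is attributable to villages alone versus households within villages.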

104 Bangladesh Fertility Study

105-110 Bangladesh Fertility Study. Slides from: Roberto G. Gutierrez, Director of Statistics, StataCorp LP, North American Stata Users Group Meeting, Boston

111-113 Tower of London Task Study. Slides from: Roberto G. Gutierrez, Director of Statistics, StataCorp LP, North American Stata Users Group Meeting, Boston

114 References Slides from: Roberto G. Gutierrez, Director of Statistics, StataCorp LP North American Stata Users Group Meeting, Boston

115 Using conditional logistic regression for estimating within-unit OR in logistic regression models

116 Treat Individual as a Stratification Variable for Teen Sex and Drugs
For the teen sex and drugs example, we can represent the data on each individual, i, as a simple 2x2 table:
              Sex: yes   Sex: no
Drugs: yes    a_i        b_i
Drugs: no     c_i        d_i
with total n_i = a_i + b_i + c_i + d_i.
Can get the OR for every subject: $\widehat{OR}_i = \frac{a_i d_i}{b_i c_i}$
Because our model,
$\mathrm{logit}[P(Y_{ij}=1 \mid \beta_{0i}, X_{ij}=x_{ij})] = \log \frac{P(Y_{ij}=1 \mid \beta_{0i}, X_{ij}=x_{ij})}{P(Y_{ij}=0 \mid \beta_{0i}, X_{ij}=x_{ij})} = \beta^*_0 + \beta_{0i} + \beta^*_1 x_{ij}$,
assumes every person has the same OR, we can average the subject-specific estimated ORs to get a single estimate.

117 Mantel-Haenszel Average of Stratified ORs
Then the MH estimate is:
$\widehat{OR}_{MH} = \exp(\hat\beta^*_1) = \frac{\sum_{i=1}^m w_i \widehat{OR}_i}{\sum_{i=1}^m w_i} = \frac{\sum_{i=1}^m (a_i d_i)/n_i}{\sum_{i=1}^m (b_i c_i)/n_i}$, with weights $w_i = b_i c_i / n_i$.
Note that for any subject who has identical exposure (drug use) or identical outcomes (sex) for all observations, the OR is undefined and that person does not contribute to the estimate (their 2x2 table is dropped).
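
The Mantel-Haenszel average is simple enough to compute by hand from the per-subject 2x2 tables. The following Python sketch illustrates the formula only and is not the deck's Stata `cs` call; the function name and the `(a, b, c, d)` tuple layout are assumptions, with a, b, c, d as in the 2x2 table for each subject.

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel summary odds ratio over per-subject 2x2 tables.

    tables: iterable of (a, b, c, d) counts, where a = exposed with outcome,
    b = exposed without outcome, c = unexposed with outcome,
    d = unexposed without outcome.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # empty table carries no information
        # A subject with constant exposure or constant outcome has
        # a*d == 0 and b*c == 0, so it adds nothing to either sum.
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("MH odds ratio undefined: no discordant information")
    return numerator / denominator


if __name__ == "__main__":
    # Made-up tables; the second subject always reported drug use.
    print(mantel_haenszel_or([(2, 1, 1, 2), (3, 2, 0, 0)]))
```

Because uninformative subjects contribute zero to both sums, there is no need to filter them out before calling the function.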

118 Conditional Logistic Regression
To illustrate, use the teenage sex and drugs example. Assume just two observations for a person: one had the outcome ($Y_{i1}=1$) with drugs ($X_{i1}=1$), and the other had neither ($Y_{i2}=0$, $X_{i2}=0$). Then the conditional likelihood contribution for this person is:
$\mathrm{CondLik}_i = \frac{P(Y_{i1}=1 \mid X_{i1}=1)\,P(Y_{i2}=0 \mid X_{i2}=0)}{P(Y_{i1}=1 \mid X_{i1}=1)\,P(Y_{i2}=0 \mid X_{i2}=0) + P(Y_{i1}=1 \mid X_{i1}=0)\,P(Y_{i2}=0 \mid X_{i2}=1)}$
After plugging in the model for $Y_{ij}$:
$\mathrm{logit}[P(Y_{ij}=1 \mid \beta_{0i}, X_{ij}=x_{ij})] = \log \frac{P(Y_{ij}=1 \mid \beta_{0i}, X_{ij}=x_{ij})}{P(Y_{ij}=0 \mid \beta_{0i}, X_{ij}=x_{ij})} = \beta^*_0 + \beta_{0i} + \beta^*_1 x_{ij}$
and doing some algebra, one gets:
$\mathrm{CondLik}_i = \frac{1}{1 + \exp(\beta^*_1 (X_{i2} - X_{i1}))}$
Notice that the individual-level intercept (whether random or not) drops out.
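
That the subject-level intercept drops out can be checked numerically. The Python sketch below uses the equivalent formulation that conditions on exactly one of the two observations having the outcome (swapping outcomes rather than covariates in the second denominator term, which gives the same value under this logistic model); all function and parameter names are assumptions for illustration.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def cond_lik_pair(intercept, beta1, x1, x2):
    """P(obs 1 is the one with the outcome | exactly one of two has it).

    intercept stands for the subject-specific intercept (fixed plus random).
    """
    p1 = expit(intercept + beta1 * x1)
    p2 = expit(intercept + beta1 * x2)
    observed = p1 * (1.0 - p2)   # Y1 = 1, Y2 = 0
    swapped = (1.0 - p1) * p2    # Y1 = 0, Y2 = 1
    return observed / (observed + swapped)


def cond_lik_closed_form(beta1, x1, x2):
    """Closed form after the algebra: no intercept appears."""
    return 1.0 / (1.0 + math.exp(beta1 * (x2 - x1)))


if __name__ == "__main__":
    for intercept in (-2.0, 0.0, 3.0):
        print(intercept, cond_lik_pair(intercept, 0.7, 1, 0),
              cond_lik_closed_form(0.7, 1, 0))
```

Every intercept gives the same conditional likelihood, which is why the estimate of the within-subject OR does not depend on the random-effect distribution.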

119 Conditional Logistic Regression
What this means is that the estimate of the within-subject OR no longer depends on assumptions about the distribution of the random effect.
Can only use this to estimate the association of time-varying covariates.
Subjects with identical outcomes on all observations will be dropped from the analysis.
Subjects whose value of a covariate never changes do not contribute to estimation of the OR for that covariate.

120 Conditional Logistic Regression
More generally, you might want to estimate the within-subject OR for several variables simultaneously and/or the OR for a unit change in a continuous variable.
Can still do so by using the conditional likelihood, a method used to estimate ORs for matched case-control studies.
The conditional likelihood (in the example of a cohort) is the probability of observing that the cases have the covariates they have and the controls have their observed covariates, given the distribution of covariates observed over all the repeated measurements.
To define the likelihood, one normalizes the probability of observing the outcomes conditional on the covariates by the summed probabilities over all possible combinations of covariates and outcomes.

121 Teenage Sex and Drug-Use Using M-H Summary OR
. cs sx24hrs drgalcoh, by(eid) or
Output columns: eid, OR, [95% Conf. Interval], M-H Weight. One row per subject (eid), each with a Cornfield interval, followed by the M-H combined row.

122 Conditional Logistic Estimate
. clogit sx24hrs drgalcoh, or group(eid)
note: multiple positive outcomes within groups encountered.
note: 23 groups (161 obs) dropped due to all positive or all negative outcomes.
Conditional (fixed-effects) logistic regression
Number of obs = 1547
LR chi2(1) = 2.93
Also reported: iteration log (Iterations 0 to 2), Prob > chi2, Log likelihood, Pseudo R2
Coefficient table (Odds Ratio, Std. Err., z, P>|z|, 95% Conf. Interval) with row drgalcoh

123 Pitfalls of Latent Variable Models in General (including Multilevel Mixed Models)

124 General Contrast of Mixed and GEE Models: General Mixed Effects Model; Specific Mixed Model (logistic)

125 Mixed Models: General Likelihood of Observed Data based on Mixed Model; Specific from example

126 Latent Variable Models: Nonparametric Nonidentifiability. Point 1 for Mixed Models

127 Parameter returned by GEE: Population Average Models. Parameter-specific estimating function; Estimating Function

128 Special Case: Multilevel model coefficients have interpretations in both latent variable and observed data-generating worlds

129 Usually, different interpretations: Logistic Case

130 Often difference in interpretations (Pop. Ave. vs. Mixed) (near) meaningless

131 It gets worse: What if Model is mis-specified? Parameter of Interest as Projection

132 Interpretation of coefficients in mis-specified multilevel mixed models can get wacky: True model

133 Summary of Multilevel Mixed vs. GEE (Population Ave.) Models

134 New Golden Rules of Estimation with Latent Variable Models?

135 Should avoid relying solely on unverifiable assumptions for inferences

136 Multilevel Analysis and Complex Surveys Part 2: Estimation and Inference from Complex Surveys Alan Hubbard UC Berkeley - Division of Biostatistics

137 Foundation
Finite population U = {1, ..., N}.
Sample s, a subset of U.
V = (Y, X): observations of outcome and covariates on each unit.
Values in the fixed finite population are $v_U = v_1, \ldots, v_N$, and the process by which one draws these will be called the observation process.
Parameters can be defined with regard to a finite population or a superpopulation; that is, there is some data-generating model of interest, $P_V$, and we want to estimate parameters of it, $\theta(P_V)$.

138 Sampling Mechanism
$\delta = (\delta_t, t = 1, \ldots, N)$ with $\delta_t = 1$ if $t \in s$, 0 otherwise.
$g(\delta \mid V = v, Z)$: sampling mechanism, where Z are so-called design variables, so $z_U = z_1, \ldots, z_N$.
Thus, the observed data O are generated by a combination of the mechanisms $P_{V,Z}$ and g: $O = (\delta * v_U * z_U)$ defines the joint distribution of the observed data, $P_0(O)$, $O = (V, Z \mid \delta = 1)$.
Special cases:
$g(\delta \mid V = v, Z) = g(\delta \mid Z)$ (noninformative sampling).
Parameters of interest from the distribution of $V \mid Z = z$ (disaggregated analysis).
Parameters of interest from the distribution $P_V$ (aggregated analysis): $P_V(v; \theta) = \sum_z P(V = v \mid Z = z)\, p(Z = z)$

139 Types of Inference: design-based inference; model-based inference

140 Full Likelihood
Consider one more source of missingness, nonresponse (say R = 1 for a respondent).
Full likelihood is then: $f(r \mid \gamma, z, v)\, f(\delta \mid z, v)\, f(v \mid z)\, f(z)$
Missingness is caused by the mechanisms for both $\delta$ and r.
Different assumptions imply different conditional independences.

141 Pseudo-Likelihood Estimating Equation Approaches
Most practical applications of survey data do not contain enough information to define the entire likelihood of the joint missingness/sampling mechanisms and the distribution of interest.
In addition, the parameter of interest can often be identified without having the entire joint likelihood identifiable, using just some of the design elements.
Thus, most survey analyses rely on pseudo-likelihood estimation.

142 Simple Example
Assume an exponential distribution: $f(y; \theta) = \theta e^{-\theta y}$, $E(Y) = 1/\theta$
If the entire population were a simple random draw from this distribution (and you observed everyone's value), then the score equation based on the likelihood would be:
$s(\theta; V) = \sum_{t=1}^{N} \left( Y_t - \frac{1}{\theta} \right) = 0$
with obvious solution $\frac{1}{\hat\theta} = \bar{Y} = \mathrm{ave}(Y)$

143 Pseudo-Likelihood, Continued
However, let's now assume all we have is the usual subset of the population U defined by s, and the probability that an observation was sampled, given its observed values (for now, no Z): $\pi_t = P(\delta_t = 1 \mid Y_t)$.
An unbiased estimating function for the population average (and thus the parameter of $f(y; \theta)$) is:
$s(\theta; O, \pi) = \sum_{t \in s} \frac{1}{\pi_t} \left( Y_t - \frac{1}{\theta} \right) = \sum_{t=1}^{N} \frac{\delta_t}{\pi_t} \left( Y_t - \frac{1}{\theta} \right)$
Thus, just treat this as a general missingness problem and use inverse weighting.

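144 It works (in this case)!
Consistent estimating equation:
$E\left[ \frac{\delta}{\pi} \left( Y - \frac{1}{\theta} \right) \right] = E_Y\left[ \left( Y - \frac{1}{\theta} \right) E\left( \frac{\delta}{\pi} \mid Y \right) \right]$
Note $\pi = E(\delta \mid Y)$, so get $E_Y\left( Y - \frac{1}{\theta} \right) = 0$,
which of course results in the estimator, when solved:
$\hat\mu_s \equiv \frac{1}{\hat\theta_s} = \frac{\sum_{t \in s} \pi_t^{-1} Y_t}{\sum_{t \in s} \pi_t^{-1}}$
So, a re-weighted score equation provides a consistent, pseudo-likelihood estimate.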
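
The inverse-probability-weighted estimator can be tried out on simulated data. The Python sketch below is an illustration under stated assumptions: the population is drawn from an exponential distribution with rate `theta`, and the sampling probability `sampling_prob` (which increases with the outcome) is a made-up mechanism chosen only to make the unweighted sample mean visibly biased.

```python
import random


def ipw_mean(ys, pis):
    """Weighted mean with weights 1/pi_t (solves the re-weighted score)."""
    weights = [1.0 / p for p in pis]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


def sampling_prob(y):
    """Hypothetical informative sampling: larger outcomes are sampled more."""
    return min(1.0, 0.2 + 0.5 * y)


def simulate(seed=0, n_pop=50000, theta=2.0):
    """Draw a population, sample it informatively, return (naive, weighted)."""
    rng = random.Random(seed)
    ys, pis = [], []
    for _ in range(n_pop):
        y = rng.expovariate(theta)  # population value, E(Y) = 1/theta
        p = sampling_prob(y)
        if rng.random() < p:        # delta_t = 1: unit enters the sample
            ys.append(y)
            pis.append(p)
    naive = sum(ys) / len(ys)
    return naive, ipw_mean(ys, pis)


if __name__ == "__main__":
    naive, weighted = simulate()
    print("naive sample mean:", naive)
    print("weighted mean (estimate of 1/theta):", weighted)
```

With `theta = 2.0` the target is $1/\theta = 0.5$; the weighted mean should land close to it, while the naive mean overstates it because large outcomes are over-sampled.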

145 Inference From Estimating Equation
Design-based approach: parameter estimation uncertainty comes from repeated samplings (of the type done) from a fixed target population ($Y_U$ fixed).
Model-based: uncertainty comes from repeated draws from the underlying data-generating distribution ($Y_U$ random).
Both: variance comes from both the underlying data-generating mechanism and the sampling mechanism; if the finite population is large, the model-based portion contributes almost nothing.
$\mathrm{var}(\hat\theta_s) = \mathrm{var}(E(\hat\theta_s \mid U)) + E(\mathrm{var}(\hat\theta_s \mid U))$, where the first term is the model source and the second term is the design source.

146 Design-based Inference, cont.
All from the sampling mechanism.
Need a simple empirical estimate derived from the estimating equation.
Often called the sandwich estimator.
Can be generally derived as the variance-covariance of the influence curve of the estimator:
$\hat\theta \approx \theta + \frac{1}{n} \sum_{t \in s} IC(O_t; \gamma, \theta)$, so $\mathrm{var}(\hat\theta) \approx \frac{\mathrm{var}(IC(O_t; \gamma, \theta))}{n}$
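
For the weighted mean, the influence-curve variance estimate has a simple closed form. The Python sketch below is one possible illustration, assuming the sampled units can be treated as independent draws; it deliberately ignores stratification, clustering, and the finite population correction, and the function name is hypothetical.

```python
import math


def weighted_mean_with_se(ys, weights):
    """Weighted mean and an influence-curve (sandwich-type) standard error.

    Uses IC_t = w_t * (y_t - mu) / mean(w), the influence curve of the ratio
    estimator sum(w*y) / sum(w), and var(mu_hat) ~ mean(IC_t^2) / n.
    """
    n = len(ys)
    w_sum = sum(weights)
    mu = sum(w * y for w, y in zip(weights, ys)) / w_sum
    w_bar = w_sum / n
    ic = [w * (y - mu) / w_bar for w, y in zip(weights, ys)]
    variance = sum(v * v for v in ic) / (n * n)
    return mu, math.sqrt(variance)


if __name__ == "__main__":
    # Made-up outcomes and inverse-probability weights.
    print(weighted_mean_with_se([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 4.0, 4.0]))
```

With equal weights the result reduces to the usual standard error of a mean (with divisor n in the variance), which is a quick sanity check on the formula.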

147 Design-based Inference
In this general framework, the things to account for in inference:
stratified design (not a simple random sample)
finite population correction (sometimes samples are not drawn from an infinite population)
clustered (correlated) data

148 Multilevel Analysis and Complex Surveys Part 3: Putting Multilevel Models and Complex Survey Data Together Alan Hubbard UC Berkeley - Division of Biostatistics. Slides are from Sophia Rabe-Hesketh, UC Berkeley, School of Education and Division of Biostatistics


More information

Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY

Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY Statistics 104 Final Project A Culture of Debt: A Study of Credit Card Spending in America TF: Kevin Rader Anonymous Students: LD, MH, IW, MY ABSTRACT: This project attempted to determine the relationship

More information

FIXED EFFECTS AND RELATED ESTIMATORS FOR CORRELATED RANDOM COEFFICIENT AND TREATMENT EFFECT PANEL DATA MODELS

FIXED EFFECTS AND RELATED ESTIMATORS FOR CORRELATED RANDOM COEFFICIENT AND TREATMENT EFFECT PANEL DATA MODELS FIXED EFFECTS AND RELATED ESTIMATORS FOR CORRELATED RANDOM COEFFICIENT AND TREATMENT EFFECT PANEL DATA MODELS Jeffrey M. Wooldridge Department of Economics Michigan State University East Lansing, MI 48824-1038

More information

Introducing the Multilevel Model for Change

Introducing the Multilevel Model for Change Department of Psychology and Human Development Vanderbilt University GCM, 2010 1 Multilevel Modeling - A Brief Introduction 2 3 4 5 Introduction In this lecture, we introduce the multilevel model for change.

More information

Statistical Models in R

Statistical Models in R Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

More information

A Panel Data Analysis of Corporate Attributes and Stock Prices for Indian Manufacturing Sector

A Panel Data Analysis of Corporate Attributes and Stock Prices for Indian Manufacturing Sector Journal of Modern Accounting and Auditing, ISSN 1548-6583 November 2013, Vol. 9, No. 11, 1519-1525 D DAVID PUBLISHING A Panel Data Analysis of Corporate Attributes and Stock Prices for Indian Manufacturing

More information

Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software

Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software STATA Tutorial Professor Erdinç Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software 1.Wald Test Wald Test is used

More information

Individual Growth Analysis Using PROC MIXED Maribeth Johnson, Medical College of Georgia, Augusta, GA

Individual Growth Analysis Using PROC MIXED Maribeth Johnson, Medical College of Georgia, Augusta, GA Paper P-702 Individual Growth Analysis Using PROC MIXED Maribeth Johnson, Medical College of Georgia, Augusta, GA ABSTRACT Individual growth models are designed for exploring longitudinal data on individuals

More information

Statistical Rules of Thumb

Statistical Rules of Thumb Statistical Rules of Thumb Second Edition Gerald van Belle University of Washington Department of Biostatistics and Department of Environmental and Occupational Health Sciences Seattle, WA WILEY AJOHN

More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

I n d i a n a U n i v e r s i t y U n i v e r s i t y I n f o r m a t i o n T e c h n o l o g y S e r v i c e s

I n d i a n a U n i v e r s i t y U n i v e r s i t y I n f o r m a t i o n T e c h n o l o g y S e r v i c e s I n d i a n a U n i v e r s i t y U n i v e r s i t y I n f o r m a t i o n T e c h n o l o g y S e r v i c e s Linear Regression Models for Panel Data Using SAS, Stata, LIMDEP, and SPSS * Hun Myoung Park,

More information

Module 14: Missing Data Stata Practical

Module 14: Missing Data Stata Practical Module 14: Missing Data Stata Practical Jonathan Bartlett & James Carpenter London School of Hygiene & Tropical Medicine www.missingdata.org.uk Supported by ESRC grant RES 189-25-0103 and MRC grant G0900724

More information

Multivariate Logistic Regression

Multivariate Logistic Regression 1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

More information

SYSTEMS OF REGRESSION EQUATIONS

SYSTEMS OF REGRESSION EQUATIONS SYSTEMS OF REGRESSION EQUATIONS 1. MULTIPLE EQUATIONS y nt = x nt n + u nt, n = 1,...,N, t = 1,...,T, x nt is 1 k, and n is k 1. This is a version of the standard regression model where the observations

More information

Multivariate Normal Distribution

Multivariate Normal Distribution Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

Logistic Regression (a type of Generalized Linear Model)

Logistic Regression (a type of Generalized Linear Model) Logistic Regression (a type of Generalized Linear Model) 1/36 Today Review of GLMs Logistic Regression 2/36 How do we find patterns in data? We begin with a model of how the world works We use our knowledge

More information

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F

E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F Random and Mixed Effects Models (Ch. 10) Random effects models are very useful when the observations are sampled in a highly structured way. The basic idea is that the error associated with any linear,

More information

HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009

HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 HURDLE AND SELECTION MODELS Jeff Wooldridge Michigan State University BGSE/IZA Course in Microeconometrics July 2009 1. Introduction 2. A General Formulation 3. Truncated Normal Hurdle Model 4. Lognormal

More information

Average Redistributional Effects. IFAI/IZA Conference on Labor Market Policy Evaluation

Average Redistributional Effects. IFAI/IZA Conference on Labor Market Policy Evaluation Average Redistributional Effects IFAI/IZA Conference on Labor Market Policy Evaluation Geert Ridder, Department of Economics, University of Southern California. October 10, 2006 1 Motivation Most papers

More information

UNIVERSITY OF WAIKATO. Hamilton New Zealand

UNIVERSITY OF WAIKATO. Hamilton New Zealand UNIVERSITY OF WAIKATO Hamilton New Zealand Can We Trust Cluster-Corrected Standard Errors? An Application of Spatial Autocorrelation with Exact Locations Known John Gibson University of Waikato Bonggeun

More information

Two Tools for the Analysis of Longitudinal Data: Motivations, Applications and Issues

Two Tools for the Analysis of Longitudinal Data: Motivations, Applications and Issues Two Tools for the Analysis of Longitudinal Data: Motivations, Applications and Issues Vern Farewell Medical Research Council Biostatistics Unit, UK Flexible Models for Longitudinal and Survival Data Warwick,

More information

Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses

Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses G. Gordon Brown, Celia R. Eicheldinger, and James R. Chromy RTI International, Research Triangle Park, NC 27709 Abstract

More information

Technical report. in SPSS AN INTRODUCTION TO THE MIXED PROCEDURE

Technical report. in SPSS AN INTRODUCTION TO THE MIXED PROCEDURE Linear mixedeffects modeling in SPSS AN INTRODUCTION TO THE MIXED PROCEDURE Table of contents Introduction................................................................3 Data preparation for MIXED...................................................3

More information

CHAPTER 12 EXAMPLES: MONTE CARLO SIMULATION STUDIES

CHAPTER 12 EXAMPLES: MONTE CARLO SIMULATION STUDIES Examples: Monte Carlo Simulation Studies CHAPTER 12 EXAMPLES: MONTE CARLO SIMULATION STUDIES Monte Carlo simulation studies are often used for methodological investigations of the performance of statistical

More information

Lecture 18: Logistic Regression Continued

Lecture 18: Logistic Regression Continued Lecture 18: Logistic Regression Continued Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Chapter 29 The GENMOD Procedure. Chapter Table of Contents

Chapter 29 The GENMOD Procedure. Chapter Table of Contents Chapter 29 The GENMOD Procedure Chapter Table of Contents OVERVIEW...1365 WhatisaGeneralizedLinearModel?...1366 ExamplesofGeneralizedLinearModels...1367 TheGENMODProcedure...1368 GETTING STARTED...1370

More information

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples

More information

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Overview Missingness and impact on statistical analysis Missing data assumptions/mechanisms Conventional

More information

Introduction to Hierarchical Linear Modeling with R

Introduction to Hierarchical Linear Modeling with R Introduction to Hierarchical Linear Modeling with R 5 10 15 20 25 5 10 15 20 25 13 14 15 16 40 30 20 10 0 40 30 20 10 9 10 11 12-10 SCIENCE 0-10 5 6 7 8 40 30 20 10 0-10 40 1 2 3 4 30 20 10 0-10 5 10 15

More information

Clustering in the Linear Model

Clustering in the Linear Model Short Guides to Microeconometrics Fall 2014 Kurt Schmidheiny Universität Basel Clustering in the Linear Model 2 1 Introduction Clustering in the Linear Model This handout extends the handout on The Multiple

More information

data visualization and regression

data visualization and regression data visualization and regression Sepal.Length 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 I. setosa I. versicolor I. virginica I. setosa I. versicolor I. virginica Species Species

More information

Introduction to Data Analysis in Hierarchical Linear Models

Introduction to Data Analysis in Hierarchical Linear Models Introduction to Data Analysis in Hierarchical Linear Models April 20, 2007 Noah Shamosh & Frank Farach Social Sciences StatLab Yale University Scope & Prerequisites Strong applied emphasis Focus on HLM

More information

SUMAN DUVVURU STAT 567 PROJECT REPORT

SUMAN DUVVURU STAT 567 PROJECT REPORT SUMAN DUVVURU STAT 567 PROJECT REPORT SURVIVAL ANALYSIS OF HEROIN ADDICTS Background and introduction: Current illicit drug use among teens is continuing to increase in many countries around the world.

More information

Lecture 14: GLM Estimation and Logistic Regression

Lecture 14: GLM Estimation and Logistic Regression Lecture 14: GLM Estimation and Logistic Regression Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South

More information

Panel Data Analysis Fixed and Random Effects using Stata (v. 4.2)

Panel Data Analysis Fixed and Random Effects using Stata (v. 4.2) Panel Data Analysis Fixed and Random Effects using Stata (v. 4.2) Oscar Torres-Reyna [email protected] December 2007 http://dss.princeton.edu/training/ Intro Panel data (also known as longitudinal

More information

Power and sample size in multilevel modeling

Power and sample size in multilevel modeling Snijders, Tom A.B. Power and Sample Size in Multilevel Linear Models. In: B.S. Everitt and D.C. Howell (eds.), Encyclopedia of Statistics in Behavioral Science. Volume 3, 1570 1573. Chicester (etc.): Wiley,

More information