Missing Data in Longitudinal Studies: Dropout, Causal Inference, and Sensitivity Analysis


Michael J. Daniels, University of Florida
Joseph W. Hogan, Brown University


Contents

PART I Regression and Inference

1 Datasets
    Schizophrenia trial
    Growth hormone trial
    Smoking cessation trials
    HERS: HIV Natural History Study
    OASIS Study
    Pediatric AIDS Trial

2 Regression Models
    Introduction
    Generalized linear models
    Conditionally-specified models
    Directly specified (marginal) models
    Nonlinear and semiparametric regression
    Interpreting covariate effects
    Further reading

3 Bayesian inference
    Likelihood
    Prior Distributions
    Computation of the Posterior Distribution
    Model comparison and Model fit
    Semiparametric Bayes
    Further reading

4 Data Analysis
    Analysis of GH study using complete data (continuation of Data Example 1.2)
    Analysis of schizophrenia using complete data (continuation of Data Example 1.1)
    Analysis of CTQ I using complete data (continuation of Data Example 1.3)
    Analysis of HERS CD4 data (continuation of Data Example 1.4)

PART II Missing Data

5 Missing data mechanisms
    Introduction
    Full versus observed data
    Full-data models and missing data mechanism
    Assumptions about missing data mechanism
    MAR and dropout
    Posterior inference
    Posterior inference under ignorability
    Posterior inference under non-ignorability
    Summary
    Further reading

6 Inference under MAR
    General issues in model specification
    Posterior sampling using data augmentation
    Covariance structures for univariate longitudinal processes
    Covariate dependent covariance structures
    Multivariate processes
    Model comparison and fit
    Further reading

7 Data examples: Ignorability
    Re-analysis of GH study under MAR (cont. of Example 4.1)
    Analysis of schizophrenia clinical trial under MAR (cont. of Example 4.2)
    Analysis of CTQ I using all the data under MAR (cont. of Example 4.3)
    Analysis of weekly smoking and weight change from CTQ II using all the data under MAR

8 Models under nonignorable dropout
    Introduction
    Selection models
    Mixture models
    Shared parameter models (SPM)
    Model comparisons and assessing model fit
    Further reading

9 Informative Priors and Sensitivity analysis
    Introduction
    Pattern mixture models
    Selection models
    Elicitation of expert opinion and formulation of sensitivity analyses
    A note on sensitivity analysis in fully parametric models
    Further reading

10 Case studies: Missing not at random
    Growth hormone study
    Pediatric AIDS Trial
    OASIS Study: Analysis via pattern mixture models
    OASIS Study: Analysis via selection models

11 Application of missing data methods to problems in causal inference
    Introduction
    Framework
    Instrumental variables (IV)
    Principal stratification
    Further reading

A Notation
References
Index

PART I Regression and Inference


General introduction goes here.


CHAPTER 1 Datasets

This chapter describes, in detail, several datasets that will be used throughout the book to motivate the material and to illustrate data analysis methods. For each dataset, we describe the associated study, the primary research goals to be met by analyses, and the key complications presented by missing data. Empirical summaries are given here; for detailed analyses we provide references to subsequent chapters.

These datasets derive primarily from our own collaborations, introducing the potential for selection bias in the coverage of topics. However, we have selected them because they cover a range of study designs and missing data issues. Although the data are primarily from clinical trials, we have XX observational studies. The datasets have both continuous and discrete longitudinal endpoints for analysis, and both continuous-time and discrete-time dropout processes are covered. In XX of the studies, the goal is to capture the joint distribution of two processes evolving simultaneously. For motivating and illustrating issues related to causal inference, we use an observational study and a clinical trial with noncompliance.

It is our hope that readers can find within these examples connections to their own work or data-analytic problem of interest. The literature on missing data continues to grow, and it is impossible to address every situation with representative examples, so we have included at the end of each chapter a list of supplementary reading to assist readers in their navigation of current research in this field.

1.1 Dose-finding trial of an experimental treatment for schizophrenia

Study and data

These data were collected as part of a randomized, double-blind clinical trial of a new pharmacologic treatment of schizophrenia (Lapierre, Nair, Chouinard et al., 1990). Other published analyses of these data can be found in Hogan and Laird (1997) and Cnaan, Laird and Slasor (1997). The trial compares three doses of the new treatment (low, medium, high) to the standard dose of haloperidol, an effective antipsychotic that has known side effects. At the time of the trial, the experimental therapy was thought to have similar antipsychotic effectiveness with fewer side effects; the trial was designed to find the appropriate dosing level. The study enrolled 242 patients at 13 different centers and randomized them to one of the four treatment arms. The intended length of follow up was six weeks, with measures taken weekly except for week 5. Schizophrenia severity was assessed using the Brief Psychiatric Rating Scale, or BPRS, a sum of scores of 18 items that reflect behaviors, mood and feelings (Overall and Gorham, 1988). The scores ranged from 0 to 108, with higher scores indicating higher severity. A minimum score of 20 was required for entry into the study.

Questions of interest

The primary objective is to compare mean change from baseline to week 6 among the four treatment groups.

Missing data

Dropout in this study was substantial; only 139 of the 245 participants had a measurement at week 6. The mean BPRS on each treatment arm showed differences between dropouts and completers. Reasons for dropout included adverse experience (e.g., side effects), lack of treatment effect, and withdrawal for unspecified reasons. Reasons such as lack of observed treatment effect are clearly related to the primary efficacy outcome, but others such as adverse events and participant withdrawal may not be.

Figure 1: Graphs (4 panels), separating completers from dropouts
Figure 2: Kaplan-Meier (KM) curve of dropout by treatment arm
Table 1: Dropout by reason, stratified by treatment arm

Need to insert material in Chapters 5 and 7 about handling combinations of informative and noninformative dropout.
Reference to appropriate material on extrapolations.
Re-add some material on intermittent missingness in the presence of dropout, in Chapters 5 or 7.
What about centers? Include as covariate?

Data analyses

In Section XXX we analyze these data using a standard random effects model for longitudinal data, fitted under the missing at random (MAR) assumption. Because valid inference under MAR depends on correct model specification, we emphasize the role of variance-covariance specification under MAR. For some individuals, dropout occurs for reasons that are clearly related to outcome; for others, the connection between dropout and outcome is less clear (e.g., dropout due to adverse side effects). Hence the MAR assumption may not be entirely appropriate, and in Section 9.X we show how to handle combinations of MAR and missing not at random (MNAR) dropout using a pattern mixture model; some attention is given in this example to assumptions being made about intermittently missing values.

Need to do MAR analysis
Need to do Ch 9 analysis

1.2 Clinical trial of recombinant human growth hormone (rhGH) for increasing muscle strength in the elderly

Study and data

The data come from a randomized clinical trial conducted to examine the effects of recombinant human growth hormone (rhGH) therapy for building and maintaining muscle strength in the elderly (Kiel, Puhl, Rosen et al., 1998). The study enrolled 161 participants and randomized them to one of four treatment arms: placebo (P), growth hormone only (GH), exercise plus placebo (E), and exercise plus growth hormone (E+GH). Various muscle strength measures were recorded at baseline, six months, and 12 months. Here, we focus on mean quadriceps strength (QS), measured as the maximum foot-pounds of torque that can be exerted against resistance provided by a mechanical device.

Questions of interest

The primary objective of our analyses is to compare mean QS at month 12 in the four treatment arms among all those randomized (i.e., to draw inference about the intention to treat effect).

Missing data

Roughly 75% of randomized individuals completed all 12 months of follow-up, and most of the dropout was thought to be related to the unobserved responses at the dropout times. Table 1.1 summarizes the mean and standard deviation of QS for available follow-up data, both aggregated and stratified by dropout time.

Data analyses

These data are used several times throughout the book to illustrate various models. In Section ?? we analyze data on completers only to illustrate multivariate normal regression; in Section ?? we fit pattern mixture models under the missing at random (MAR) constraint and illustrate the use of interior family constraints (Molenberghs et al., 1998; Kenward and Molenberghs, 2000; need cite); in Section ?? we use more general pattern mixture models that permit missing not at random (MNAR) mechanisms, together with sensitivity analyses.

1.3 Clinical Trials of Exercise as an Aid to Smoking Cessation in Women: The Commit to Quit Studies

Studies and data

The Commit to Quit studies were randomized clinical trials to examine the impact of exercise on the ability to quit smoking. The women in each study were aged 18-65, smoked 5 or more cigarettes per day for at least one year, and participated in moderate or vigorous intensity activity for less than ninety minutes per week. The first trial (hereafter CTQ I) enrolled 281 women and tested the effect on smoking cessation of supervised vigorous exercise versus equivalent staff contact time (Marcus, Albrecht, King et al., 1999); the second trial (hereafter CTQ II) enrolled XXX female smokers and was designed to examine the effect of moderate, partially supervised exercise (Marcus, Lewis, Hogan et al., 2005). Other analyses of these data can be found in Hogan, Roy and Korkontzelou (2004) (add ref), who illustrate weighted regression using inverse propensity scores, and Roy and Hogan (2007), who use principal stratification methods to infer the causal effect of compliance with vigorous exercise.
In each study, smoking cessation was assessed weekly using self-report, with confirmation via laboratory testing of saliva and exhaled carbon monoxide. As is typical in short-term intervention trials for smoking cessation, the target date for quitting smoking followed an initial run-in period during which the intervention was administered but the participants were not asked to quit smoking. In CTQ I, measurements on smoking status were taken weekly for 12 weeks; women were asked to quit at week 5. In CTQ II, total follow up lasted 8 weeks, and women were asked to quit in week XX.

Table 1.1 (need treatment labels) Growth hormone trial: Sample means (standard deviations) stratified by treatment group and dropout pattern k. Patterns are defined by last visit (0: baseline; 1: 3 months; 2: 6 months), and n_k is the number in pattern k.

    Month
    Treatment   k   n_k
                    (26)
                    (15)   68 (26)
                    (24)   90 (32)   88 (32)
                All (25)   87 (32)   88 (32)

                    (17)
                    (33)   81 (42)
                    (22)   64 (21)   63 (20)
                All (23)   66 (25)   63 (20)

                    (32)
                    (52)   86 (51)
                    (24)   81 (25)   73 (21)
                All (26)   82 (26)   73 (21)

                    (29)
                    (19)   62 (31)
                    (23)   62 (20)   63 (19)
                All (24)   62 (22)   63 (19)

Defining treatment effects

In each study, the question of interest was whether the intervention under study reduced the rate of smoking. This can be answered in terms of the effect of randomization to treatment versus randomization to control, or in terms of the effect of complying with treatment versus not complying. The former can be answered using an intention to treat analysis, contrasting outcomes based on treatment arm assignment. The latter poses additional challenges, but can be addressed using methods for inferring causal effects; these include instrumental variables, propensity score methods, and principal stratification. These topics are addressed in some detail in Chapter ??; key references include Angrist, Imbens and Rubin (1996) (need cite), Robins (XX) (need cite) and Frangakis and Rubin (need cite).

In our analyses, we frame the treatment effect in terms of either (a) time-averaged weekly cessation rate following the target quit date or (b) cessation rate at the final week of follow up. More details about analyses are given below.

Missing data

Each of the studies had substantial dropout; in CTQ I, XX percent (XX/XX) dropped out on the exercise arm, and XX percent (XX/XX) on the control arm; for CTQ II the proportions were XX on exercise and XX on control. Figure XX shows, for CTQ I, weekly cessation rates for all observed data, and then stratified by dropout status (yes/no), making clear that dropout is related at least to observed smoking status during the study. There also exists some empirical support for the notion that dropout is related to missing outcomes in smoking cessation studies, in the sense that dropouts are more likely to be smoking once they have withdrawn from the study (Lichtenstein et al., 19XX) (need citation).

Data analyses

Analysis of CTQ I under standard MAR assumptions

In Chapters XX, we use the CTQ I data to infer the effect of being randomized to either exercise or control under the standard MAR assumption; inferences are compared to other approaches such as complete-case analysis and the common practice of assuming dropouts are smokers.

Analysis of CTQ II using auxiliary variables

Weight change is generally associated with smoking cessation. In Chapter XX, we illustrate the use of auxiliary information on longitudinal weight changes to inform the distribution of smoking cessation outcomes in making treatment comparisons in CTQ II. The weight change data are incorporated through a joint model for longitudinal smoking cessation and weight, and the marginal distribution of smoking cessation is used for treatment comparisons.

Inferring the causal effect of exercise in CTQ I

An attractive feature of many behavioral intervention trials is that compliance with the intervention is directly observed; in CTQ I, participants attended on-site sessions to participate in exercise and counseling. In Chapter XX, we use the method of principal stratification to estimate the causal effect, on week-12 cessation rate, of attending the exercise sessions versus attending the educational sessions (see also Roy and Hogan, 2007). This effect is estimated for the subpopulation (stratum) who would comply with either intervention, if offered.

Graph of observed smoking rates for each study
Graph stratified by dropouts and non-dropouts for each study
Graph of weight and smoking status in CTQ II?

1.4 Natural history of HIV infection in women: HIV Epidemiology Research Study (HERS) Cohort

Have to decide what we will do with these data. So far we have summarized CD4 using spline models. I suppose that is useful for the discussion about MAR but I'm not sure where else that gets us.

Study and data

The HIV Epidemiology Research Study (HERS) was a longitudinal cohort study of the natural history of HIV in women. Between 1993 and 1996, the HERS enrolled 1310 women who were either HIV-positive or at high risk for infection; 871 were HIV-positive at study entry. Every six months for up to five years, several outcomes were recorded for each participant, including standard measures of immunologic function and viral burden, plus a comprehensive set of measures characterizing health status and behavioral patterns (e.g., body mass index, depression status, drug use behavior).
Our analyses of HERS data will focus on modeling CD4 progression in the presence of dropout and mortality, and on estimating the effect of treatment on CD4 count. Many analyses of HERS, addressing a variety of topics, have been published in both the medical and statistical literature. A general study of CD4 and viral load progression can be found in Mayer, Hogan, Smith et al. (2003); covariation of CD4 and body mass index is investigated in Jones, Hogan, Snyder et al. (2003); development and application of methods for estimating the effect of time-varying treatment can be found in Ko, Hogan and Mayer (2003), Hogan and Lee (2004), Hogan and Lancaster (2004), and Roy et al. (2006).

Check biblio for citations of Hogan/Lancaster and Roy et al.
CD4 or depression over time, with population smoother? Should align with HAART? Here, could potentially look at depression as a function of race within CD4 < 200, where dropout seems to matter.
KM plots of dropout, or plots showing dropouts vs non-dropouts.
Need something on mortality... (a) KM plot of time to death, treating dropout and non-HIV death as censoring??

Study objectives

The HERS is a multi-site study with substudies numbering in the hundreds; relative to their scope, our objectives for illustrating data analyses are necessarily simplified. We carry out two analyses of data from HERS: in the first, our interest is in characterizing the trajectory of CD4 (or depression or something else...?); in the second, we are interested in quantifying the causal effect of highly active antiretroviral therapy (HAART) on CD4 count. The latter question requires methodology for handling a time-varying nonrandomized treatment.

Missing data issues

All individuals enrolled in HERS were scheduled for five-year follow up (12 visits); of the 871 HIV-positive women, XXX completed all 12 visits, XXX dropped out, and XXX died before completing the study. Reason for death is classified as HIV-related or not. Having dropout due to death necessitates careful definition of the target quantities for inference, which is given more detailed attention in Chapter XX (need to put in placeholder for section - put it in the section).

Data analyses

In Chapter XX, we illustrate the use of regression splines under MAR, with emphasis on the importance of choosing an appropriate variance-covariance model. Regression splines are used because the actual dates of the measurements, rather than just visit number, are available. The analysis in Chapter XX uses the method of instrumental variables to draw inferences about the effect of time-varying HAART on CD4 cell count. A similar analysis using moment-based methods can be found in Hogan and Lancaster (2004).

1.5 Clinical trial of smoking cessation among substance abusers: OASIS Study

Data description

The OASIS trial compared standard versus enhanced counseling interventions for various behaviors such as smoking and alcohol abuse among substance abusers (check - or alc abusers?); the focus for our analyses is on smoking cessation. The trial enrolled XX individuals, randomized to standard versus more intensive counseling (details from JYL paper). Follow up occurred at one, three and six months following randomization.

Table of cessation rates under completers only and filling in dropouts as smokers; include percent dropout; from JYL paper

Analysis objectives

The primary goal of our analysis is comparison, by treatment randomization, of smoking cessation rates at 12 months post baseline (i.e., the intention to treat effect).

Missing data issues

The dropout rate was relatively high in the OASIS study (XX percent on standard, XX percent on the enhanced intervention). In our analyses we do not distinguish between dropout reasons or types.

Data analyses

These data are analyzed in detail in Chapter XX, using models that allow for MNAR dropout. The first analysis uses a pattern mixture model where, conditional on dropout time, the longitudinal smoking outcomes follow a Markov transition model. The model is fit under MAR assumptions, then elaborated to allow for MNAR mechanisms. Sensitivity analyses and the use of informative priors are illustrated. The second analysis uses a semiparametric selection model approach, also allowing for MNAR dropout. The two models are compared in terms of treatment effect inference and qualitative characteristics.

1.6 Equivalence trial of competing doses of AZT in HIV-infected children: Protocol 128 of the AIDS Clinical Trials Group

Data description

Study of XX children randomized to two interventions. Dropout rate is XX percent. Data have been analyzed in other papers, including Hogan and Laird (1996), Hogan and Daniels (2000), and Hogan, Lin and Herman (2004).

Tables and figures:
Graph of all data, with highlighted profiles showing dropouts tend to have lower slopes etc.
Graph of OLS slopes vs dropout time

Inferential objectives

Compare difference in CD4 change by end of study.

Missing data issues

Dropout for various reasons. Here we will treat dropouts as the same, but for an analysis that considers reasons for dropout see Hogan and Laird (1996) and Hogan and Daniels (2000).

Data analyses

These data are analyzed in Chapter XX using a mixture of varying coefficient models, which are compared to the standard random effects approach and to the conditional linear model of Wu and Bailey (1988).


CHAPTER 2 Regression Models for Longitudinal Data

2.1 Introduction

Longitudinal data

The material in this book is organized around regression models for repeated measurements. Appealing to first principles, one can think of longitudinal data as arising from the joint evolution of response and covariates,

$$\{Y_i(t), \mathbf{x}_i(t) : t \ge 0\}.$$

If the process is observed at a discrete set of time points $\mathbf{t} = (t_1, \ldots, t_J)^T$ that is common to all individuals, the resulting response data can be written in terms of the $J \times 1$ vector

$$\mathbf{Y}_i = \{Y_i(t) : t \in \mathbf{t}\} = (Y_{i1}, \ldots, Y_{iJ})^T.$$

The covariate process $\{\mathbf{x}_i(t)\}$ is $p \times 1$. At time $t_j$, the observed covariates are collected in the vector $\mathbf{x}_{ij} = (x_{ij1}, \ldots, x_{ijp})^T$. Hence the full collection of observed covariates is contained in the $J \times p$ matrix

$$\mathbf{X}_i = \begin{pmatrix} \mathbf{x}_{i1}^T \\ \mathbf{x}_{i2}^T \\ \vdots \\ \mathbf{x}_{iJ}^T \end{pmatrix}.$$

When the set of observation times is common to all individuals, we say the responses are balanced or temporally aligned. It is sometimes the case that observation times are unbalanced, or temporally misaligned, in that they vary by subject, in which case the times are $t_{i1}, \ldots, t_{iJ_i}$ and the dimensions of $\mathbf{Y}_i$ and $\mathbf{X}_i$ are $J_i \times 1$ and $J_i \times p$, respectively.

In regression, we are interested in characterizing the effect of covariates $\mathbf{X}$ on a longitudinal dependent variable $\mathbf{Y}$. Formally, we wish to draw inference about the joint distribution of the vector $\mathbf{Y}_i$ of responses, conditionally on $\mathbf{X}_i$,

$$[\mathbf{Y}_i \mid \mathbf{X}_i] = [Y_{i1}, \ldots, Y_{iJ_i} \mid \mathbf{X}_i].$$

Likelihood-based regression models for longitudinal data require a specification of this joint distribution using a model $f(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta})$. The parameter $\boldsymbol{\theta}$ is a finite-dimensional vector of parameters indexing the model; it might include regression coefficients, variance components, and parameters indexing serial correlation.

The joint distribution of responses can be specified directly or indirectly. Directly specified models are written in terms of the marginal mean at each measurement occasion or time point $t$, together with a model for the variance-covariance structure. Indirectly specified models typically use a multilevel format, for example involving subject-specific random effects or latent variables $\mathbf{b}_i$ to partition within- and between-subject variation. The usual strategy is to specify the joint distribution of responses and random effects, factored as

$$[\mathbf{Y}_i, \mathbf{b}_i \mid \mathbf{X}_i] = [\mathbf{Y}_i \mid \mathbf{b}_i, \mathbf{X}_i]\,[\mathbf{b}_i \mid \mathbf{X}_i].$$

The distribution of interest, $[\mathbf{Y}_i \mid \mathbf{X}_i]$, is obtained by integrating over the $\mathbf{b}_i$. Both directly- and indirectly-specified models are common for modeling longitudinal data, and in our review we will give several examples.

Regression models

The literature on regression models for longitudinal data is vast, and we make no attempt to be comprehensive here. Our review is designed to highlight predominant approaches to regression modeling, emphasizing those models used in later chapters. Readers are referred to Diggle et al. [DHLZ02b], Fitzmaurice et al. [FLW04], Laird [Lai04], Jones [Jon93], Davidian and Giltinan [DG98], Crowder and Hand [CH90], Verbeke and Molenberghs [VM00], and Lindsey [Lin99] for a variety of perspectives.
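The balanced and unbalanced layouts described at the start of this section can be made concrete with a small sketch. The code below is illustrative only and assumes NumPy is available; the observation times, the two-column design (intercept and time), and the helper name `make_subject` are hypothetical choices, and the responses are simulated placeholders rather than data from any study in this book.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only


def make_subject(times):
    """Return (Y_i, X_i) for one subject observed at the given times.

    X_i has one row per observation time and p = 2 columns
    (intercept, time); Y_i holds placeholder responses.
    """
    times = np.asarray(times, dtype=float)
    X = np.column_stack([np.ones_like(times), times])  # J_i x p
    Y = rng.normal(size=times.shape[0])                # J_i-vector
    return Y, X


# Balanced case: every subject shares the same J = 3 times.
common_times = [0.0, 6.0, 12.0]
Y1, X1 = make_subject(common_times)

# Unbalanced case: subject-specific times, here J_2 = 2.
Y2, X2 = make_subject([0.0, 5.5])

print(Y1.shape, X1.shape)  # J x 1 responses, J x p covariates
print(Y2.shape, X2.shape)  # J_i x 1 responses, J_i x p covariates
```

With balanced data the subject-level arrays can be stacked into a single n x J response array; with unbalanced data a list of per-subject arrays (or a long-format table with a subject identifier) is the more natural container.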
As we review several different regression models, the intent is to give the reader a sense of the rich variety of models that can be used to characterize longitudinal data, and to illustrate that these fit coherently into a single framework. As a result, missing data strategies described in later chapters can be applied very generally. Specific models described here will be familiar to those with experience analyzing longitudinal data (e.g., multivariate normal regression models, random effects models), but others represent fairly new developments (e.g., marginalized transition models [Hea02], regression splines [EM96, LZ99, RWC03]). Here we focus on specification and interpretation; Chapter 3 covers various aspects of inference.

Because many regression models for longitudinal data have their foundation in the generalized linear model (GLM) for cross-sectional data [MN89], our review begins with a concise description of GLMs. Coverage of models for longitudinal data begins with random effects models; these build directly on the GLM structure by introducing individual-level random effects to capture between-subject variation. Conditionally on the random effects, within-subject variation can be described by a simpler model, such as a GLM. Random effects models are very attractive in that they naturally partition variation in the dependent variable into its between- and within-subject components, and they can be used to model both balanced and unbalanced data. At the same time, they sometimes have the disadvantage that the implied marginal distribution of responses is opaque.

An alternative to random effects models is directly-specified models of the joint marginal distribution of responses (Section 2.4). Frequently referred to as marginal models, directly-specified models have a natural construction when the error distribution is multivariate normal, but for binary, count, and other discrete data, the choice of an appropriate joint distribution is less obvious. Our review touches on some recent developments for discrete longitudinal responses, such as the marginalized transition model [Hea02] and others. For a detailed review of likelihood-based models of multivariate discrete responses, see Chapter 11 of Diggle et al. [DHLZ02a] and Chapter 7 of Laird [Lai04].

For all models covered in the first part of the chapter, the regression function is linear in covariates and takes a known functional form.
Section 2.5 describes models in which the regression function can be nonlinear, either through a known function of the covariates or through an unspecified smooth function. The latter type of model is typically called semiparametric, because the regression function is left unspecified but distributional assumptions are made about the error structure. Nonlinear and semiparametric models have a close connection to the GLM structure; our discussion of these models emphasizes that connection and illustrates that regression models as a whole can be very generally characterized [HTF01, RWC03].

The final element of our review concerns interpretation of covariate effects in longitudinal models. Because the response and covariates change with time, models of longitudinal data afford the opportunity to infer both within- and between-subject covariate effects; however, the importance of underlying assumptions to the interpretation of covariate effects should not be underestimated. Section 2.6 discusses three key aspects of interpretation and specification for longitudinal models: cross-sectional versus longitudinal effects of a time-varying covariate, marginal (population-averaged) versus conditional (subject-specific) covariate effects, and the assumptions governing the use of time-varying covariates.

Full vs. observed data

Throughout Chapters 2 and 3, the models refer to a full-data distribution. The distinction between full and observed data is particularly important when drawing inference from incomplete longitudinal data. We define the full data as those observations intended to be collected on a pre-specified interval, such as $[0, T]$. For example, if intended collection times $t_1, \ldots, t_J$ are common to all individuals, then the full response and covariate data are $(Y_{i1}, \mathbf{X}_{i1}), \ldots, (Y_{iJ}, \mathbf{X}_{iJ})$, where $Y_{ij} = Y_i(t_j)$ and $\mathbf{X}_{ij} = \mathbf{X}_i(t_j)$.

In most applications, interest lies in the effect of covariates on the mean structure. When data are fully observed, the variance and covariance models can frequently be treated as nuisance parameters. Correct specification of variance and covariance allows more efficient use of the data, but it is not always necessary for obtaining proper inferences about mean parameters. When data are not fully observed, variance-covariance specification takes on heightened importance because missing data will effectively be imputed or extrapolated from observed data, based on modeling assumptions. For longitudinal data, unobserved responses will be imputed from observed responses for the same individual; the assumed correlation structure will usually have considerable influence on the imputation.
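The influence of the assumed correlation on imputation can be seen in a minimal numerical sketch. It assumes, purely for illustration, that a subject's two responses are bivariate normal with known mean and covariance, and uses the standard conditional-mean formula for the multivariate normal, E(Y_mis | Y_obs) = mu_mis + Sigma_mo Sigma_oo^{-1} (y_obs - mu_obs). The function name `conditional_mean` and all numerical values are hypothetical and are not taken from the datasets in Chapter 1.

```python
import numpy as np


def conditional_mean(y_obs, mu, Sigma, obs_idx, mis_idx):
    """E[Y_mis | Y_obs = y_obs] when Y ~ N(mu, Sigma)."""
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]  # covariance among observed
    S_mo = Sigma[np.ix_(mis_idx, obs_idx)]  # missing-by-observed block
    resid = np.asarray(y_obs, dtype=float) - mu[obs_idx]
    return mu[mis_idx] + S_mo @ np.linalg.solve(S_oo, resid)


mu = [50.0, 50.0]   # hypothetical means at visits 1 and 2
y1 = 60.0           # observed visit-1 response; visit 2 is missing

for rho in (0.0, 0.5, 0.9):
    # common variance 100 at both visits, correlation rho
    Sigma = 100.0 * np.array([[1.0, rho], [rho, 1.0]])
    imputed = conditional_mean([y1], mu, Sigma, [0], [1])[0]
    print(f"rho = {rho:.1f}: conditional mean of Y2 = {imputed:.1f}")
```

Under these assumptions the conditional mean of the missing response moves from the marginal mean (when the correlation is zero) toward the observed value as the assumed correlation increases, so two analysts using the same observed data but different covariance models would effectively impute different values.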
This theme recurs throughout the book, and therefore our review pays particular attention to aspects of variance-covariance specification.

Note: In Chapter 5, we expand the definition of full data to include random variables, such as dropout time, that characterize the missing data process.

Additional notation

Random variables and their realizations are denoted by Roman letters (e.g., X, x), and parameters are represented by Greek letters (e.g., α, θ).

Vector- and matrix-valued random variables and parameters are represented using boldface (e.g., $\mathbf{x}$, $\mathbf{Y}$, $\boldsymbol{\beta}$, $\boldsymbol{\Sigma}$). For any matrix or vector $\mathbf{A}$, we use $\mathbf{A}^T$ to denote transpose. If $\mathbf{A}$ is invertible, then $\mathbf{A}^{-1}$ is its inverse; if $\mathbf{A}$ is also symmetric and positive definite, $\mathbf{S} = \mathbf{A}^{1/2}$ is the lower triangular matrix square root such that $\mathbf{S}\mathbf{S}^T = \mathbf{A}$. A full listing of notational conventions appears in the Appendix.

2.2 Generalized linear models for cross sectional data

The generalized linear model (GLM) forms the foundation for many approaches to regression with multivariate responses, such as longitudinal or clustered data. Models such as random effects or mixed effects models, latent variable and latent class models, and regression splines, all highly flexible and general, are based on the GLM framework. Moment-based methods such as generalized estimating equations (GEE) also follow directly from the GLM for cross-sectional data [LZ86].

The GLM is a regression model for a dependent variable arising from the exponential family of distributions,

$$f(y \mid \theta, \psi) = \exp\{(y\theta - b(\theta))/a(\psi) + c(y, \psi)\},$$

where $a$, $b$ and $c$ are known functions, $\theta$ is the canonical parameter, and $\psi$ is a scale parameter. The exponential family includes several commonly used distributions, such as normal, Poisson, binomial, and gamma. It can be readily shown that

$$E(Y) = b'(\theta), \qquad \mathrm{var}(Y) = a(\psi)\, b''(\theta)$$

(see McCullagh and Nelder [MN89] for details).

The effect of covariates $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})^T$ can be modeled by introducing the linear predictor

$$\eta_i(\mathbf{x}_i, \boldsymbol{\beta}) = \mathbf{x}_i^T \boldsymbol{\beta},$$

where $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^T$ is a vector of regression coefficients. Now define $\mu_i = \mu(\mathbf{x}_i, \boldsymbol{\beta}) = E(Y_i \mid \mathbf{x}_i)$. A smooth, monotone function $g$ links the mean $\mu_i$ to the linear predictor $\eta_i$ via

$$g(\mu_i) = \eta_i = \mathbf{x}_i^T \boldsymbol{\beta}. \qquad (2.1)$$

In many exponential family distributions, it is possible to identify a link function $g$ such that $\mathbf{X}^T \mathbf{Y}$ is the sufficient statistic for $\boldsymbol{\beta}$ (here, $\mathbf{X}$ is the $n \times p$ design matrix and $\mathbf{Y} = (Y_1, \ldots, Y_n)^T$ is the $n \times 1$ vector of responses).
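As a worked instance of the exponential-family form above, consider the Poisson distribution with mean $\mu$. The derivation below is a standard calculation included for illustration; it simply matches the Poisson probability mass function to the functions $a$, $b$ and $c$ and then applies the mean and variance identities.

```latex
% Poisson pmf with mean \mu > 0, rewritten in exponential-family form
f(y \mid \theta) = \frac{\mu^{y} e^{-\mu}}{y!}
  = \exp\{\, y\theta - e^{\theta} - \log y! \,\},
  \qquad \theta = \log \mu ,

% matching terms: b(\theta) = e^{\theta}, \; a(\psi) = 1, \; c(y, \psi) = -\log y!
E(Y) = b'(\theta) = e^{\theta} = \mu ,
\qquad
\mathrm{var}(Y) = a(\psi)\, b''(\theta) = e^{\theta} = \mu .
```

With the log link $g(\mu) = \log \mu$, the linear predictor $\eta = \mathbf{x}^T \boldsymbol{\beta}$ coincides with the canonical parameter $\theta$, which recovers the familiar equality of mean and variance in Poisson regression.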
In this case, the canonical parameter is $\theta = \eta$. Examples are well-known and widespread: for the Poisson distribution, the canonical parameter is $\log(\mu)$; for the binomial distribution, it is the log odds (logit), $\log\{\mu/(1-\mu)\}$.

Although canonical links are sometimes convenient, their use is not necessary to form a GLM. In general, a GLM requires only the specification of a mean function and a variance function, conditional on covariates. The mean follows (2.1), and the variance is given by $v(\mu_i, \phi) = \phi h(\mu_i)$, where $h(\cdot)$ is some function of the mean and $\phi > 0$ is a scale factor. Certain choices of $g$ and $h$ will yield likelihood score equations for common parametric regression models based on exponential family distributions. For example, setting $g(\mu) = \log\{\mu/(1-\mu)\}$, $h(\mu) = \mu(1-\mu)$ and $\phi = 1$ yields logistic regression under a Bernoulli distribution for $[Y_i \mid x_i]$. Similarly, Poisson regression can be specified by setting $g(\mu) = \log \mu$, $h(\mu) = \mu$ and $\phi = 1$.

2.3 Conditionally specified (random effects) models

Conditionally specified models using random effects or latent variables provide a highly flexible class of models for handling longitudinal data. A defining characteristic of these models is that they impose structure on marginal variance and correlation using individual-specific random effects or latent variables. The models can be applied either to balanced or unbalanced response patterns, and can be used to capture key features of both between- and within-subject variation using relatively few parameters.

A standard approach is to specify a regression model that includes subject-level random effects or latent variables $b$, and then to assume that conditionally on the latent variables, the distribution $[Y \mid X, b]$ has a simple form (e.g., its elements are independent). Integrating out the random effects yields marginal correlations between the $\{Y_{ij}\}$ within subject [BK99]. Many models, regression and otherwise, can be represented using a random effects or latent variable formulation.
These include standard random effects regression models for responses that are continuous [LW82, Dig88], or discrete [SLW84, GH97, HG94, NMK00]; see Breslow and Clayton [BC93] and Daniels and Gatsonis [DG99] for an overview. This class of models also includes regression models with factor-analytic and latent class structures [SR96, AW00, SL96, RLR99, RA01]. See Bartholomew and Knott [BK99] for a full account.

Here we briefly review conditionally-specified regression models where conditioning is done on random effects; these models also are known by a variety of names, including mixed effects models, random effects models, and random coefficient models. We use the term random effects models.

The most common random effects models for longitudinal data specify the joint distribution $[Y_i, b_i \mid X_i, \theta]$ as

$$ [Y_i \mid b_i, X_i, \theta_1] \, [b_i \mid X_i, \theta_2]. $$

The parameter $\theta_1$ captures the conditional effect of $X$ on $Y$. The marginal distribution $[Y_i \mid X_i]$ is obtained by integrating $b_i$ out of the joint distribution, and is indexed by the full set of parameters $\theta = (\theta_1, \theta_2)$.

Random effects models based on GLMs

By including random effects, generalized linear models can be used to model longitudinal and clustered data. For common distributions such as Bernoulli and Poisson, the GLM with random effects can be written in terms of the conditional mean and variance. The conditional mean takes the form

$$ g\{E(Y_{ij} \mid x_{ij}, z_{ij}, b_i)\} = g(\mu^b_{ij}) = x_{ij}^T \beta + z_{ij}^T b_i, $$

where $g(\cdot)$ is a link function and $z_{ij}$ is a design vector for the subject-specific random effects. This representation of the conditional mean motivates the term mixed-effects model because the coefficients quantify both population-level ($\beta$) and individual-level ($b_i$) effects. The conditional variance is

$$ V^b_{ij} = \mathrm{var}(Y_{ij} \mid x_{ij}, z_{ij}, b_i) = \phi h(\mu^b_{ij}). $$

Finally, within-subject correlation is specified through a covariance function

$$ C^b_{ijk}(\gamma) = \mathrm{cov}(Y_{ij}, Y_{ik} \mid x_{ij}, x_{ik}, b_i, \gamma). $$

In many cases it is assumed that $C^b_{ijk} = 0$ for $j \neq k$; i.e., that the random effects capture relevant within-subject correlation (after averaging over their distribution), but this assumption may not always be appropriate for longitudinal responses. At the second level, the random effects $b_i$ follow some distribution such as the multivariate normal.
The model for the marginal joint distribution of $(Y_{i1}, \ldots, Y_{iJ} \mid X_i)$ is obtained by integrating over $b_i$,

$$ f(y_1, \ldots, y_J \mid X_i, \theta) = \int f(y_1, \ldots, y_J \mid X_i, b_i, \theta_1) \, dF(b_i \mid X_i, \theta_2). $$
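The integral above rarely has a closed form outside the normal case, so it is usually evaluated numerically. The following is a minimal sketch, not the authors' implementation, that approximates the marginal probability of a binary response pattern by Gauss-Hermite quadrature. It assumes a logit link, a scalar random intercept $b_i \sim N(0, \sigma_b^2)$, and conditional independence of the $Y_{ij}$ given $b_i$; the function name `marginal_pattern_prob` and the numerical values are hypothetical.

```python
import itertools
import numpy as np

def marginal_pattern_prob(y, eta_fixed, sigma_b, n_nodes=40):
    """Approximate P(Y_i = y) when logit P(Y_ij = 1 | b_i) = eta_fixed[j] + b_i
    and b_i ~ N(0, sigma_b^2), integrating b_i out by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2.0) * sigma_b * nodes              # change of variables for N(0, sigma_b^2)
    eta = np.asarray(eta_fixed, dtype=float)[:, None] + b[None, :]   # J x n_nodes
    mu = 1.0 / (1.0 + np.exp(-eta))                 # conditional means given b_i
    y = np.asarray(y)[:, None]
    # Conditional independence: the conditional likelihood is a product over j.
    cond_lik = np.prod(mu**y * (1.0 - mu)**(1 - y), axis=0)
    return float(np.sum(weights * cond_lik) / np.sqrt(np.pi))

# Hypothetical fixed-effects linear predictors x_ij^T beta for J = 3 occasions.
eta_fixed = np.array([-0.5, 0.0, 0.5])
sigma_b = 1.5

# The marginal joint distribution over all 2^J response patterns.
patterns = list(itertools.product([0, 1], repeat=len(eta_fixed)))
probs = {p: marginal_pattern_prob(p, eta_fixed, sigma_b) for p in patterns}
total = sum(probs.values())

# Marginal association induced by the shared random intercept (occasions 1 and 2).
p1 = sum(v for k, v in probs.items() if k[0] == 1)
p2 = sum(v for k, v in probs.items() if k[1] == 1)
p12 = sum(v for k, v in probs.items() if k[0] == 1 and k[1] == 1)
print(total, p12 - p1 * p2)
```

Although the responses are independent given $b_i$, the joint probability of two successes exceeds the product of the marginal probabilities after integration, which is the sense in which the random effect induces marginal correlation.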

The relationship between marginal and conditional (random effects) models is important to understand, particularly as it relates to interpreting covariate effects. In what follows we give several examples to illustrate.

Random effects models for continuous response

A natural choice for modeling continuous or measured responses is the normal distribution. In random effects models, allowing both within- and between-subject variation to follow a normal distribution, or more generally a Gaussian process, affords considerable modeling flexibility while retaining interpretability.

Example 2.1. Normal random effects model for continuous responses.

A common model for continuous longitudinal responses is the normal random effects model. This model illustrates well the concept of an indirectly-specified joint distribution because the variance-covariance structure in $[Y_i \mid X_i, \theta]$ is a by-product of the assumed random effects distribution. Like many random effects models, it is easiest to describe in two stages. At the first stage, the responses $Y_i$ are normal conditionally on a $q \times 1$ vector of random effects $b_i$,

$$ [Y_i \mid X_i, b_i, \theta_1] \sim N(\mu^b_i, \Sigma^b_i), $$

where superscript $b$ is added to emphasize that the mean and covariance are conditional on $b_i$. To incorporate covariate effects, let

$$ \mu^b_i = X_i \beta + Z_i b_i, $$

where $Z_i$ is the design matrix for the random effects. The variance matrix $\Sigma^b_i = \Sigma^b_i(\phi)$ captures within-subject variation and is parameterized by the $r \times 1$ vector $\phi$ of nonredundant parameters. Hence $\theta_1 = (\beta, \phi)$.

When $Z_i \subseteq X_i$ (the columns of $Z_i$ are a subset of those of $X_i$), as is usually the case, the $b_i$ can be thought of as error terms for regression coefficients, which gives rise to the term random coefficient model. For example, if $X_i = Z_i$, we obtain a random coefficients model,

$$ \mu^b_i = X_i \beta_i = X_i(\beta + b_i), \tag{2.2} $$

where the random effects $b_i$ can be interpreted as individual-specific deviations from $\beta$.
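The two-stage description can be made concrete by simulation. The sketch below is illustrative only: it assumes $X_i = Z_i$ with an intercept and a linear time trend, independent within-subject errors ($\Sigma^b_i = \sigma^2 I$), and hypothetical parameter values. By the usual rules for iterated expectation and variance, the simulated responses should have mean $X_i \beta$ and covariance $Z_i \Omega Z_i^T + \sigma^2 I$, where $\Omega$ denotes the variance of $b_i$, and the code checks this empirically.

```python
import numpy as np

def simulate_random_coefficients(n, times, beta, Omega, sigma2, seed=0):
    """Draw n response vectors from the two-stage model with X_i = Z_i = [1, t]."""
    rng = np.random.default_rng(seed)
    t = np.asarray(times, dtype=float)
    X = np.column_stack([np.ones_like(t), t])        # J x 2 design matrix
    # Second stage: subject-specific deviations b_i ~ N(0, Omega).
    b = rng.multivariate_normal(np.zeros(len(beta)), Omega, size=n)
    # First stage: conditional mean X (beta + b_i), plus independent normal errors.
    mu_b = (beta + b) @ X.T                          # n x J
    Y = mu_b + rng.normal(scale=np.sqrt(sigma2), size=mu_b.shape)
    return X, Y

# Hypothetical parameter values chosen for illustration.
beta = np.array([1.0, 0.5])
Omega = np.array([[1.0, 0.2], [0.2, 0.3]])
sigma2 = 0.5

X, Y = simulate_random_coefficients(20000, [0, 1, 2, 3], beta, Omega, sigma2)
marginal_mean = X @ beta
marginal_cov = X @ Omega @ X.T + sigma2 * np.eye(X.shape[0])

# Differences between empirical and implied marginal moments (should be near 0).
print(np.round(Y.mean(axis=0) - marginal_mean, 2))
print(np.round(np.cov(Y, rowvar=False) - marginal_cov, 2))
```

Because the random slope enters through the time column of $Z_i$, the implied marginal variance grows with time even though the within-subject error variance is constant.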
The within-subject variance $\Sigma_i(\phi)$ usually has a simplified structure, parameterized through a covariance function $C_{ijk}(\phi)$. For example, an exponential structure takes the form

$$ C_{ijk}(\phi) = \sigma^2 \rho^{|t_{ij} - t_{ik}|}, $$

where $\phi = (\sigma^2, \rho)$ and $0 \le \rho \le 1$.

At the second level, the random effects are assigned a distribution that can depend on covariates. The (multivariate) normal is a common choice,

$$ [b_i \mid X_i] \sim N(0, \Omega), $$

where $\Omega = \Omega(\eta)$ is a $q \times q$ variance matrix indexed by $\eta$ (hence $\theta_2 = \eta$). It also is possible to allow $\eta$ to depend on individual-level covariates through appropriate specifications [DZ03]; this is covered in more detail in Chapter 6.

The marginal distribution of $Y_i$ follows the multivariate normal distribution

$$ [Y_i \mid X_i, Z_i, \theta] \sim N(X_i \beta, \; Z_i \Omega Z_i^T + \Sigma_i(\phi)). \tag{2.3} $$

The marginal variance $\mathrm{var}(Y_i \mid X_i)$ is indirectly specified because it depends on parameters from both $[Y_i \mid X_i, b_i]$ and $[b_i \mid X_i]$. Moreover, we see by comparing (2.2) and (2.3) that $\beta$ can be interpreted both as a marginal and a conditional effect of $X_i$ on $Y_i$.

A version of this model is used to analyze data described in Example 1.1, a longitudinal clinical trial comparing three doses of an antipsychotic to the standard of care in schizophrenia patients. The analysis appears in Chapter 4 (Data Analysis).

Random effects models for discrete responses

Random effects specifications can be very useful for modeling longitudinal discrete responses, where the joint distribution rarely takes an obvious form and principles from generalized linear models are not easily applied. In the case of longitudinal binary data, for example, it is straightforward to show that the joint distribution of a $J$-dimensional response variable can be represented by a multinomial distribution with $2^J$ categories. When $J$ is appreciably large, however, parameter constraints must be imposed to make modeling practical. See Laird [Lai04], Chapter 7 for a more detailed discussion.
Compared to direct specification of the joint distribution, random effects models offer the advantage of being parsimonious, providing a natural decomposition of sources of variation, and applying equally well to balanced and unbalanced response profiles. The regression parameters represent covariate effects in the conditional rather than marginal joint distribution of $Y$, however, and because the link functions are nonlinear transformations of the mean (e.g., log, logit), these do not generally coincide. Therefore care must be taken when interpreting regression effects. The logistic regression with normal random effects illustrates several of these points rather well.

Example 2.2. Logistic regression with random effects.

As in Example 2.1, a logistic random effects model is specified in terms of the joint distribution

$$ [Y_i, b_i \mid X_i, \theta] = [Y_i \mid X_i, b_i, \theta_1] \, [b_i \mid X_i, \theta_2], $$

where $\theta = (\theta_1, \theta_2)$. The conditional distribution of each component in $Y_i$ follows the Bernoulli model,

$$ [Y_{ij} \mid x_{ij}, b_i, \theta_1] \sim \mathrm{Ber}(\mu^b_{ij}), $$

where

$$ g(\mu^b_{ij}) = x_{ij}^T \beta + z_{ij}^T b_i \tag{2.4} $$

and $g$ is the logit link (hence $\theta_1 = \beta$). The random effects distribution follows

$$ [b_i \mid X_i, \theta_2] \sim N(0, \Omega), $$

so $\theta_2 = \Omega$.

The parameter $\beta$ characterizes the conditional, or subject-specific, effect of $X_i$ on $Y_i$. By contrast, the marginal or population-averaged distribution $[Y_i \mid X_i, \theta]$ must be obtained by integrating over $b_i$. The marginal mean $\mu_{ij}(\beta, \Omega) = E(Y_{ij} \mid x_{ij}, \beta, \Omega)$ is

$$ \mu_{ij}(\beta, \Omega) = \int \mu^b_{ij}(\beta) \, dF(b_i \mid x_{ij}, \Omega) = \int \frac{\exp(x_{ij}^T \beta + z_{ij}^T b_i)}{1 + \exp(x_{ij}^T \beta + z_{ij}^T b_i)} \, dF(b_i \mid x_{ij}, \Omega). $$

The marginal effect of $X_i$ differs from the conditional effect in that it is a function of both $\beta$ and $\Omega$, and on the logit scale, it is no longer linear. Zeger and Liang [ZL92] show that in some cases, the marginal effect in the logit-normal model is approximately linear on the logit scale, and differs from the conditional effect by a scale factor that depends on $\Omega$; i.e., $g(\mu_{ij}) \approx x_{ij}^T \beta^*$, where $\beta^*$ denotes the marginal coefficient, $\beta^* = \beta k(\Omega)$, and $k : R^q \to R^p$ is a known function.
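The marginal mean integral can be evaluated numerically to see how the marginal and conditional effects differ. The sketch below is a hedged illustration rather than the method of [ZL92]: it assumes a random intercept ($q = 1$) with $b_i \sim N(0, \sigma_b^2)$, a single binary covariate, and hypothetical parameter values. For comparison it uses a commonly quoted approximation to the scale factor, $\beta^* \approx \beta / \sqrt{1 + c^2 \sigma_b^2}$ with $c = 16\sqrt{3}/(15\pi)$; this constant is an assumption brought in for illustration and is not stated in the text.

```python
import numpy as np

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def marginal_mean(eta_fixed, sigma_b, n_nodes=60):
    """E{expit(eta_fixed + b)} for b ~ N(0, sigma_b^2), by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2.0) * sigma_b * nodes
    return float(np.sum(weights * expit(eta_fixed + b)) / np.sqrt(np.pi))

# Hypothetical subject-specific (conditional) model: logit(mu^b) = beta0 + beta1 * x + b.
beta0, beta1, sigma_b = -0.5, 1.0, 2.0

# Population-averaged log odds ratio comparing x = 1 with x = 0.
beta_marginal = (logit(marginal_mean(beta0 + beta1, sigma_b))
                 - logit(marginal_mean(beta0, sigma_b)))

# Assumed approximation to the attenuation factor (not taken from the text).
c = 16.0 * np.sqrt(3.0) / (15.0 * np.pi)
beta_approx = beta1 / np.sqrt(1.0 + c**2 * sigma_b**2)

print(beta1, beta_marginal, beta_approx)
```

In this setup the marginal log odds ratio is smaller in magnitude than the conditional coefficient, and the gap widens as `sigma_b` increases.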
In many cases the population-averaged effect $\beta^*$ is attenuated relative to the subject-specific effect $\beta$; for example, when $q = 1$ (random intercept model), $b_i$ is normally distributed, $b_i$ is independent of $X_i$, and $X_i$ has a single time-constant covariate $x_i$, then $|\beta^*| \le |\beta|$, with the difference $|\beta| - |\beta^*|$ increasing with $\mathrm{var}(b_i)$. Interpreting the marginal and conditional effects is considered further in Section 2.6.

In Chapter 4, we use this model to characterize the effect of a behavioral intervention on weekly smoking cessation status using longitudinal binary data from a recent clinical trial. The data are described in Dataset ?? and analyzed in Data Analysis ??.

Examples 2.1 and 2.2 assume the random effects $b_i$ follow a normal distribution; this is not necessary, and in many cases it may be inappropriate or incorrect. Zhang and Davidian [ZD01] describe models where the random effects distribution belongs to a flexible class of densities that includes the normal as a special case. Verbeke and Lesaffre [VL96] describe random effects distributions that follow discrete mixtures of normal distributions. For some simple models, exploratory analysis can help to ascertain whether a normal or other symmetric distribution is suitable for describing the random effects. In other cases more formal methods of model choice may be needed.

2.4 Directly specified (marginal) models

This section reviews the family of models in which the joint distribution $[Y \mid X]$ is directly specified by a model $f(y \mid x, \theta)$. Usually the most challenging aspect of model specification is finding a suitable parameterization for the correlation and/or covariance, particularly when observations are unbalanced in time or when the number of observations per subject is large relative to sample size. In these cases, sensible decisions about dimension reduction must be made.

For continuous data that can be characterized using a normal distribution or Gaussian process, model specification (though not necessarily selection) can be reasonably straightforward, owing to the natural separation of mean and variance parameters in the normal distribution.
The analyst can focus efforts separately on sensible models for mean and variance/covariance structures.

Other types of data pose more significant challenges to the process of direct specification due to a lack of obvious choices for joint distribution models. Unlike the normal distribution, which generalizes naturally to the multivariate and even stochastic process setting, common distributions like gamma, binomial and Poisson do not have obvious multivariate analogues. The main problems are that the mean shares parameters with the variance and, even for simple specifications, with the covariance. Another potential problem is that unlike with the normal model, higher order associations do not necessarily follow from pairwise associations, hence they need to be specified or explicitly constrained [FL93]. The joint distribution of $J$ binary responses, for example, has $2^J - J - 1$ parameters governing the association structure. With count data, it is not immediately obvious how to specify even a simple correlation structure appropriately.

This section describes various approaches to direct model specification, illustrated with examples from the normal and binomial distributions. The first examples use the normal distribution. For longitudinal binary data, we describe an extension of the log-linear model that allows transparent interpretation of both the mean and serial correlation. Another useful approach to modeling association in binary data is the multivariate probit model, which exploits properties of the normal distribution by assuming the binary variables are manifestations of an underlying normally-distributed latent process.

Multivariate normal and Gaussian process models

The multivariate normal distribution provides a highly flexible starting point for modeling continuous response data, both temporally aligned and misaligned. It also is useful for handling situations where the number of observation times is large relative to the number of units being followed. The most straightforward situation is where data are temporally aligned and $n \gg J$, allowing both the mean and variance to be unstructured. When responses are temporally misaligned, or when $J$ is large relative to $n$, structure must be imposed. A key characteristic of the normal distribution that allows for flexible modeling across a wide variety of settings is that the mean and variance have separate parameters.

The next two examples illustrate a variety of model specifications using the normal distribution. Each assumes that the response variable $Y$, or a suitable transformation, is well-characterized by the normal model.

Example 2.3.
Multivariate normal regression for temporally aligned observations.

Assume that observations on the primary response variable are taken at a fixed schedule of times $t_1, \ldots, t_J$. For a response vector $Y_i = (Y_{i1}, \ldots, Y_{iJ})^T$ with associated $J \times p$ covariate matrix $X_i = (x_{i1}^T, \ldots, x_{iJ}^T)^T$, the multivariate normal regression is written as

$$ [Y_i \mid X_i, \theta] \sim N(\mu_i, \Sigma_i), $$

where $\mu_i$ is $J \times 1$ and $\Sigma_i$ is $J \times J$. The mean $E(Y_i \mid X_i)$ follows a regression

$$ \mu_i(\beta) = \mu(X_i, \beta) = X_i \beta, $$

where $\beta$ is a $p \times 1$ vector of regression coefficients. The covariance is parameterized with a vector of nonredundant parameters $\phi$. To emphasize that the covariance $\mathrm{var}(Y_i \mid X_i)$ may depend on $X_i$ through $\phi$, we sometimes write $\Sigma_i(\phi) = \Sigma(X_i, \phi)$.

If $\Sigma$ is assumed constant across individuals, it has $J(J+1)/2$ unique parameters, but structure can be imposed to reduce this number [JS86]. As an alternative to leaving $\Sigma$ fully parameterized, common structures for longitudinal data include banded or Toeplitz (with a common parameter along each off-diagonal), and autoregressive correlations of pre-specified order. The model also can be extended to allow $\Sigma$ to depend on covariates [DP02, NnANAZ00a, Pou00].

This model is written in general terms, and the $X_i$ matrix can be arbitrary. For example, it can include information about measurement time, baseline covariates and the like. If we set $x_{ij} = (1, t_j)^T$ and $\beta = (\beta_0, \beta_1)^T$, then $\beta_1$ corresponds to the average slope over time, where the average is taken over the population from which the sample of individuals is drawn. When $J$ is small enough, $x_{ij}$ can include a vector of time indicators, allowing the mean to be unstructured in time.

In Data Analysis 4.1, this model is used to analyze data from a clinical trial of recombinant human growth hormone for increasing muscle strength in the elderly. The data are fully described in Dataset 1.2.

In the previous example, it is sometimes possible to allow both the mean and variance to remain unstructured in time because of the relatively few time points and covariate levels. When time points are temporally misaligned, or when the number of observation times is large relative to the sample size, information at the unique measurement times will be sparse and additional structure needs to be imposed. Our focus in the next example is on covariance parameterization in terms of a covariance function.
Further details can be found in Diggle et al. [DHLZ02a], Chapter 4.

Example 2.4. Multivariate normal regression model for temporally misaligned observations.

When observations are temporally misaligned, the main difference in model specification concerns the covariance parameterization. As with Example 2.3, a normal distribution may be assumed, but with covariance $\Sigma_i$ whose dimension and structure depend on the number and timing of observations for individual $i$. The joint distribution follows

$$ [Y_i \mid X_i, \theta] \sim N(\mu_i, \Sigma_i). $$
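One way to obtain a $\Sigma_i$ whose dimension follows the observation times of individual $i$ is to evaluate a covariance function at those times. The sketch below is an assumption-laden illustration, not the specification used in this example: it borrows the exponential structure $\sigma^2 \rho^{|t_{ij} - t_{ik}|}$ from Example 2.1, and the function name `exponential_cov`, the subject labels, and the observation times are all hypothetical.

```python
import numpy as np

def exponential_cov(times, sigma2, rho):
    """Covariance matrix with entries sigma2 * rho**|t_j - t_k| at the given times."""
    t = np.asarray(times, dtype=float)
    lags = np.abs(t[:, None] - t[None, :])     # matrix of pairwise time lags
    return sigma2 * rho**lags

# Hypothetical, temporally misaligned observation times for three individuals.
subject_times = {
    "A": [0.0, 1.0, 2.0, 3.0],
    "B": [0.0, 2.5],
    "C": [0.5, 1.0, 4.0],
}

# Two shared parameters (sigma2, rho) determine every subject's covariance matrix.
sigma2, rho = 2.0, 0.5
Sigma = {sid: exponential_cov(t, sigma2, rho) for sid, t in subject_times.items()}

for sid, S in Sigma.items():
    print(sid, S.shape)
```

The same two parameters index every $\Sigma_i$ regardless of how many observations an individual contributes, which is the kind of dimension reduction needed when information at any single measurement time is sparse.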