Missing Data in Longitudinal Studies: Dropout, Causal Inference, and Sensitivity Analysis
Michael J. Daniels, University of Florida
Joseph W. Hogan, Brown University
Contents

I Regression and Inference

1 Datasets
   Schizophrenia trial
   Growth hormone trial
   Smoking cessation trials
   HERS: HIV Natural History Study
   OASIS Study
   Pediatric AIDS Trial

2 Regression Models
   Introduction
   Generalized linear models
   Conditionally-specified models
   Directly specified (marginal) models
   Nonlinear and semiparametric regression
   Interpreting covariate effects
   Further reading

3 Bayesian inference
   Likelihood
   Prior Distributions
   Computation of the Posterior Distribution
   Model comparison and Model fit
   Semiparametric Bayes
   Further reading

4 Data Analysis
   Analysis of GH study using complete data (continuation of Data Example 1.2)
   Analysis of schizophrenia trial using complete data (continuation of Data Example 1.1)
   Analysis of CTQ I using complete data (continuation of Data Example 1.3)
   Analysis of HERS CD4 data (continuation of Data Example 1.4)

II Missing Data

5 Missing data mechanisms
   Introduction
   Full versus observed data
   Full-data models and missing data mechanism
   Assumptions about missing data mechanism
   MAR and dropout
   Posterior inference
   Posterior inference under ignorability
   Posterior inference under non-ignorability
   Summary
   Further reading

6 Inference under MAR
   General issues in model specification
   Posterior sampling using data augmentation
   Covariance structures for univariate longitudinal processes
   Covariate-dependent covariance structures
   Multivariate processes
   Model comparison and fit
   Further reading

7 Data examples: Ignorability
   Re-analysis of GH study under MAR (cont. of Example 4.1)
   Analysis of schizophrenia clinical trial under MAR (cont. of Example 4.2)
   Analysis of CTQ I using all the data under MAR (cont. of Example 4.3)
   Analysis of weekly smoking and weight change from CTQ II using all the data under MAR

8 Models under nonignorable dropout
   Introduction
   Selection models
   Mixture models
   Shared parameter models (SPM)
   Model comparisons and assessing model fit
   Further reading

9 Informative Priors and Sensitivity analysis
   Introduction
   Pattern mixture models
   Selection models
   Elicitation of expert opinion and formulation of sensitivity analyses
   A note on sensitivity analysis in fully parametric models
   Further reading

10 Case studies: Missing not at random
   Growth hormone study
   Pediatric AIDS Trial
   OASIS Study: Analysis via pattern mixture models
   OASIS Study: Analysis via selection models

11 Application of missing data methods to problems in causal inference
   Introduction
   Framework
   Instrumental variables (IV)
   Principal stratification
   Further reading

A Notation
References
Index
PART I
Regression and Inference
General introduction goes here.
CHAPTER 1
Datasets

This chapter describes, in detail, several datasets that will be used throughout the book to motivate the material and to illustrate data analysis methods. For each dataset, we describe the associated study, the primary research goals to be met by analyses, and the key complications presented by missing data. Empirical summaries are given here; for detailed analyses we provide references to subsequent chapters.

These datasets derive primarily from our own collaborations, introducing the potential for selection bias in the coverage of topics. However, we have selected these because they cover a range of different types of study design and missing data issues. Although the data are primarily from clinical trials, we also include XX observational studies. The datasets have both continuous and discrete longitudinal endpoints for analysis. Both continuous-time and discrete-time dropout processes are covered here. In XX of the studies, the goal is to capture the joint distribution of two processes evolving simultaneously. For motivating and illustrating issues related to causal inference, we use an observational study and a clinical trial with noncompliance.

It is our hope that readers can find within these examples connections to their own work or data-analytic problem of interest. The literature on missing data continues to grow, and it is impossible to address every situation with representative examples, so we have included at the end of each chapter a list of supplementary reading to assist readers in their navigation of current research in this field.

1.1 Dose-finding trial of an experimental treatment for schizophrenia

Study and data

These data were collected as part of a randomized, double-blind clinical trial of a new pharmacologic treatment of schizophrenia (Lapierre, Nair,
Chouinard et al., 1990). Other published analyses of these data can be found in Hogan and Laird (1997) and Cnaan, Laird and Slasor (1997). The trial compares three doses of the new treatment (low, medium, high) to the standard dose of haloperidol, an effective antipsychotic with known side effects. At the time of the trial, the experimental therapy was thought to have similar antipsychotic effectiveness with fewer side effects; the trial was designed to find the appropriate dosing level.

The study enrolled 242 patients at 13 different centers and randomized them to one of the four treatment arms. The intended length of follow-up was six weeks, with measures taken weekly except for week 5. Schizophrenia severity was assessed using the Brief Psychiatric Rating Scale (BPRS), a sum of scores of 18 items that reflect behaviors, mood, and feelings (Overall and Gorham, 1988). Scores range from 0 to 108, with higher scores indicating greater severity. A minimum score of 20 was required for entry into the study.

Questions of interest

The primary objective is to compare mean change in BPRS from baseline to week 6 between the four treatment groups.

Missing data

Dropout in this study was substantial; only 139 of the 245 participants had a measurement at week 6. The mean BPRS on each treatment arm showed differences between dropouts and completers. Reasons for dropout included adverse experience (e.g., side effects), lack of treatment effect, and withdrawal for unspecified reasons. Reasons such as lack of observed treatment effect are clearly related to the primary efficacy outcome, but others, such as adverse events and participant withdrawal, may not be.

Figure 1: Graphs (4 panels), separating completers from dropouts
Figure 2: KM curve of dropout by tx arm
Table 1: Dropout by reason, stratified by treatment arm.

Need to insert material in chapters 5 and 7 about handling combinations of informative and noninformative dropout. Reference to appropriate material on extrapolations. Re-add some material on intermittent missingness in the presence of dropout, in Chapters 5 or 7. What about centers? Include as covariate?
Data analyses

In Section XXX we analyze these data using a standard random effects model for longitudinal data, fitted under the missing at random (MAR) constraints. Because valid inference under MAR depends on correct model specification, we emphasize the role of variance-covariance specification under MAR. For some individuals, dropout occurs for reasons that are clearly related to outcome, and for others the connection between dropout and outcome is less clear (e.g., dropout due to adverse side effects). Hence the MAR assumption may not be entirely appropriate, and in Section 9.X we show how to handle combinations of MAR and MNAR dropout using a pattern mixture model; some attention is given in this example to assumptions being made about intermittently missing values.

Need to do MAR analysis. Need to do Ch 9 analysis.

1.2 Clinical trial of recombinant human growth hormone (rhGH) for increasing muscle strength in the elderly

Study and data

The data come from a randomized clinical trial conducted to examine the effects of recombinant human growth hormone (rhGH) therapy for building and maintaining muscle strength in the elderly (Kiel, Puhl, Rosen et al., 1998). The study enrolled 161 participants and randomized them to one of four treatment arms: placebo (P), growth hormone only (GH), exercise plus placebo (E), and exercise plus growth hormone (E+GH). Various muscle strength measures were recorded at baseline, six months, and 12 months. Here, we focus on mean quadriceps strength (QS), measured as the maximum foot-pounds of torque that can be exerted against resistance provided by a mechanical device.

Questions of interest

The primary objective of our analyses is to compare mean QS at month 12 in the four treatment arms among all those randomized (i.e., to draw inference about the intention-to-treat effect).
Missing data

Roughly 75% of randomized individuals completed all 12 months of follow-up, and most of the dropout was thought to be related to the unobserved responses at the dropout times. Table 1.1 summarizes the mean and standard deviation of QS for available follow-up data, both aggregated and stratified by dropout time.

Data analyses

These data are used several times throughout the book to illustrate various models. In Section ?? we analyze data on completers only to illustrate multivariate normal regression; in Section ?? we fit pattern mixture models under the missing at random (MAR) constraint and illustrate the use of interior family constraints (Molenberghs et al., 1998; Kenward and Molenberghs, 2000 need cite); in Section ?? we use more general pattern mixture models that permit missingness not at random (MNAR) and sensitivity analyses.

1.3 Clinical Trials of Exercise as an Aid to Smoking Cessation in Women: The Commit to Quit Studies

Studies and data

The Commit to Quit studies were randomized clinical trials to examine the impact of exercise on the ability to quit smoking. The women in each study were aged 18-65, smoked 5 or more cigarettes per day for at least one year, and participated in moderate or vigorous intensity activity for less than ninety minutes per week.

The first trial (hereafter CTQ I) enrolled 281 women and tested the effect on smoking cessation of supervised vigorous exercise versus equivalent staff contact time (Marcus, Albrecht, King et al., 1999); the second trial (hereafter CTQ II) enrolled XXX female smokers and was designed to examine the effect of moderate partially supervised exercise (Marcus, Lewis, Hogan et al., 2005). Other analyses of these data can be found in Hogan, Roy and Korkontzelou (2004) add ref, who illustrate weighted regression using inverse propensity scores, and Roy and Hogan (2007), who use principal stratification methods to infer the causal effect of compliance with vigorous exercise.

In each study, smoking cessation was assessed weekly using self-report,
Table 1.1 (need treatment labels): Growth hormone trial: Sample means (standard deviations) stratified by treatment group and dropout pattern k. Patterns defined by last visit (0: baseline; 1: 3 months; 2: 6 months), and n_k is the number in pattern k. [Columns: month, treatment, pattern k, n_k; cell entries garbled in the source and not reproduced.]

with confirmation via laboratory testing of saliva and exhaled carbon monoxide. As is typical in short-term intervention trials for smoking cessation, the target date for quitting smoking followed an initial run-in period during which the intervention was administered but the participants were not asked to quit smoking. In CTQ I, measurements on smoking status were taken weekly for 12 weeks; women were asked to quit at week 5. In CTQ II, total follow-up lasted 8 weeks, and women were asked to quit in week XX.

Defining treatment effects

In each study, the question of interest was whether the intervention under study reduced the rate of smoking. This can be answered in terms of the effect of randomization to treatment versus randomization to control, or in terms of the effect of complying with treatment versus not complying.
The former can be answered using an intention-to-treat analysis, contrasting outcomes based on treatment arm assignment. The latter poses additional challenges, but can be addressed using methods for inferring causal effects; these include instrumental variables, propensity score methods, and principal stratification. These topics are addressed in some detail in Chapter ??; key references include Angrist, Imbens and Rubin (1996) need cite, Robins (XX) need cite, and Frangakis and Rubin need cite.

In our analyses, we frame the treatment effect in terms of either (a) time-averaged weekly cessation rate following the target quit date or (b) cessation rate at the final week of follow-up. More details about analyses are given below.

Missing data

Each of the studies had substantial dropout; in CTQ I, XX percent (XX/XX) dropped out on the exercise arm and XX percent (XX/XX) on the control arm; for CTQ II the proportions were XX on exercise and XX on control. Figure XX shows, for CTQ I, weekly cessation rates for all observed data, and then stratified by dropout status (yes/no), making clear that dropout is related at least to observed smoking status during the study. There also exists some empirical support for the notion that dropout is related to missing outcomes in smoking cessation studies, in the sense that dropouts are more likely to be smoking once they have withdrawn from the study (Lichtenstein et al., 19XX) need citation.

Data analyses

Analysis of CTQ I under standard MAR assumptions. In Chapter XX, we use the CTQ I data to infer the effect of being randomized to either exercise or control under the standard MAR assumption; inferences are compared to other approaches such as complete-case analysis and the common practice of assuming dropouts are smokers.

Analysis of CTQ II using auxiliary variables. Weight change is generally associated with smoking cessation. In Chapter XX, we illustrate the use of auxiliary information on longitudinal
weight changes to inform the distribution of smoking cessation outcomes in making treatment comparisons in CTQ II. The weight change data are incorporated through a joint model for longitudinal smoking cessation and weight, and the marginal distribution of smoking cessation is used for treatment comparisons.

Inferring the causal effect of exercise in CTQ I. An attractive feature of many behavioral intervention trials is that compliance with the intervention is directly observed; in CTQ I, participants attended on-site sessions to participate in exercise and counseling. In Chapter XX, we use the method of principal stratification to estimate the causal effect, on week-12 cessation rate, of attending the exercise sessions versus attending the educational sessions (see also Roy and Hogan, 2007). This effect is estimated for the subpopulation (stratum) who would comply with either intervention, if offered.

Graph of observed smoking rates each study. Graph stratified by dropouts and non-dropouts for each study. Graph of weight and smoking status in CTQ II?

1.4 Natural history of HIV infection in women: HIV Epidemiology Research Study (HERS) Cohort

Have to decide what we will do with these data. So far we have summarized CD4 using spline models. I suppose that is useful for the discussion about MAR but I'm not sure where else that gets us.

Study and data

The HIV Epidemiology Research Study (HERS) was a longitudinal cohort study of the natural history of HIV in women. Between 1993 and 1996, the HERS enrolled 1310 women who were either HIV-positive or at high risk for infection; 871 were HIV-positive at study entry. Every six months for up to five years, several outcomes were recorded for each participant, including standard measures of immunologic function and viral burden, plus a comprehensive set of measures characterizing health status and behavioral patterns (e.g., body mass index, depression status, drug use behavior). Our analyses of HERS data will focus on modeling
CD4 progression in the presence of dropout and mortality, and on estimating the effect of treatment on CD4 count.

Many analyses of HERS, addressing a variety of topics, have been published in both the medical and statistical literature. A general study of CD4 and viral load progression can be found in Mayer, Hogan, Smith et al. (2003); covariation of CD4 and body mass index is investigated in Jones, Hogan, Snyder et al. (2003); development and application of methods for estimating the effect of time-varying treatment can be found in Ko, Hogan and Mayer (2003), Hogan and Lee (2004), Hogan and Lancaster (2004), and Roy et al. (2006). check biblio for citations of hogan/lancaster and Roy et al

CD4 or depression over time, with population smoother? Should align with HAART? Here, could potentially look at depression as a function of race within CD4 < 200, where dropout seems to matter. KM plots of dropout, or plots showing dropouts vs. non-dropouts. Need something on mortality... (a) KM plot of time to death, treating dropout and non-HIV death as censoring??

Study objectives

The HERS is a multi-site study with substudies numbering in the hundreds; relative to their scope, our objectives for illustrating data analyses are necessarily simplified. We carry out two analyses of data from HERS: in the first, our interest is in characterizing the trajectory of CD4 (or depression or something else...?); in the second, we are interested in quantifying the causal effect of highly active antiretroviral therapy (HAART) on CD4 count. The latter question requires methodology for handling a time-varying nonrandomized treatment.

Missing data issues

All individuals enrolled in HERS were scheduled for five-year follow-up (12 visits); of the 871 HIV-positive women, XXX completed all 12 visits; XXX dropped out and XXX died before completing the study. Reason for death is classified as HIV-related or not. Having dropout due to death necessitates careful definition of the target quantities for inference, which is given more detailed attention in Chapter XX. need to put in placeholder for section - put it in the section
Data analyses

In Chapter XX, we illustrate the use of regression splines under MAR, with emphasis on the importance of choosing an appropriate variance-covariance model. Regression splines are used because the actual dates of the measurements, rather than just visit number, are available. The analysis in Chapter XX uses the method of instrumental variables to draw inferences about the effect of time-varying HAART on CD4 cell count. A similar analysis using moment-based methods can be found in Hogan and Lancaster (2004).

1.5 Clinical trial of smoking cessation among substance abusers: OASIS Study

Data description

The OASIS trial compared standard versus enhanced counseling intervention for various behaviors such as smoking and alcohol abuse among substance abusers check - or alc abusers?; the focus for our analyses is on smoking cessation. The trial enrolled XX individuals, randomized to standard versus more intensive counseling details from JYL paper. Follow-up occurred at one, three, and six months following randomization.

Table of cessation rates under completers only and filling in dropouts as smokers; include percent dropout; from JYL paper

Analysis objectives

The primary goal of our analysis is comparison, by treatment randomization, of smoking cessation rates at 12 months post baseline (i.e., the intention-to-treat effect).

Missing data issues

The dropout rate was relatively high in the OASIS study (XX percent on standard, XX percent on the enhanced intervention). In our analyses we do not distinguish between dropout reason or type.
Data analyses

These data are analyzed in detail in Chapter XX, using models that allow for MNAR dropout. The first analysis uses a pattern mixture model where, conditional on dropout time, the longitudinal smoking outcomes follow a Markov transition model. The model is fit under MAR assumptions, then elaborated to allow for MNAR mechanisms. Sensitivity analyses and the use of informative priors are illustrated. The second analysis uses a semiparametric selection model approach, also allowing for MNAR dropout. The two models are compared in terms of treatment effect inference and qualitative characteristics.

1.6 Equivalence trial of competing doses of AZT in HIV-infected children: Protocol 128 of the AIDS Clinical Trials Group

Data description

Study of XX children randomized to two interventions. Dropout rate is XX percent. Data have been analyzed in other papers, including Hogan and Laird (1996), Hogan and Daniels (2000), and Hogan, Lin and Herman (2004).

Tables and figures: graph of all data, with highlighted profiles showing dropouts tend to have lower slopes etc.; graph of OLS slopes vs. dropout time

Inferential objectives

Compare difference in CD4 change by end of study.

Missing data issues

Dropout occurred for various reasons. Here we will treat all dropouts as the same, but for an analysis that considers reasons for dropout see Hogan and Laird (1996) and Hogan and Daniels (2000).
Data analyses

These data are analyzed in Chapter XX using mixtures of varying coefficient models, which are compared to the standard random effects approach and to the conditional linear model of Wu and Bailey (1988).
CHAPTER 2
Regression Models for Longitudinal Data

2.1 Introduction

Longitudinal data

The material in this book is organized around regression models for repeated measurements. Appealing to first principles, one can think of longitudinal data as arising from the joint evolution of response and covariates,
$$\{Y_i(t), x_i(t) : t \geq 0\}.$$
If the process is observed at a discrete set of time points $\boldsymbol{t} = (t_1, \ldots, t_J)^T$ that is common to all individuals, the resulting response data can be written in terms of the $J \times 1$ vector $Y_i = \{Y_i(t) : t \in \boldsymbol{t}\} = (Y_{i1}, \ldots, Y_{iJ})^T$. The covariate process $\{x_i(t)\}$ is $p \times 1$. At time $t_j$, the observed covariates are collected in the vector $x_{ij} = (x_{ij1}, \ldots, x_{ijp})^T$. Hence the full collection of observed covariates is contained in the $J \times p$ matrix
$$X_i = \begin{pmatrix} x_{i1}^T \\ x_{i2}^T \\ \vdots \\ x_{iJ}^T \end{pmatrix}.$$
When the set of observation times is common to all individuals, we say the responses are balanced or temporally aligned. It is sometimes the case that observation times are unbalanced, or temporally misaligned, in that they vary by subject, in which case the times are $t_{i1}, \ldots, t_{iJ_i}$ and the dimensions of $Y_i$ and $X_i$ are $J_i \times 1$ and $J_i \times p$, respectively.
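As a concrete illustration of these data structures (an addition of ours, using hypothetical values rather than data from any of the book's studies), the balanced and unbalanced layouts map directly onto numeric arrays:

```python
import numpy as np

# Balanced case: J = 4 time points shared by all subjects, p = 2 covariates.
t = np.array([0.0, 1.0, 2.0, 3.0])         # common measurement times
J = len(t)
Y_i = np.array([10.1, 11.3, 12.0, 12.8])   # J x 1 response vector for subject i
X_i = np.column_stack([np.ones(J), t])     # J x p covariate matrix (intercept, time)

# Unbalanced case: subject-specific times t_{i1}, ..., t_{iJ_i}, so Y_i and
# X_i have subject-specific dimensions J_i x 1 and J_i x p.
t_i = np.array([0.0, 0.9, 2.4])            # J_i = 3 visits for this subject
Y_i_unbal = np.array([9.8, 10.9, 12.2])
X_i_unbal = np.column_stack([np.ones(len(t_i)), t_i])
```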
In regression, we are interested in characterizing the effect of covariates $X$ on a longitudinal dependent variable $Y$. Formally, we wish to draw inference about the joint distribution of the vector $Y_i$ of responses, conditionally on $X_i$,
$$[Y_i \mid X_i] = [Y_{i1}, \ldots, Y_{iJ_i} \mid X_i].$$
Likelihood-based regression models for longitudinal data require a specification of this joint distribution using a model $f(y \mid x, \theta)$. The parameter $\theta$ is a finite-dimensional vector of parameters indexing the model; it might include regression coefficients, variance components, and parameters indexing serial correlation.

The joint distribution of responses can be specified directly or indirectly. Directly specified models are written in terms of the marginal mean at each measurement occasion or time point $t$, together with a model for the variance-covariance structure. Indirectly specified models typically use a multilevel format, for example involving subject-specific random effects or latent variables $b_i$ to partition within- and between-subject variation. The usual strategy is to specify the joint distribution of responses and random effects, factored as
$$[Y_i, b_i \mid X_i] = [Y_i \mid b_i, X_i]\,[b_i \mid X_i].$$
The distribution of interest, $[Y_i \mid X_i]$, is obtained by integrating over the $b_i$. Both directly- and indirectly-specified models are common for modeling longitudinal data, and in our review we will give several examples.

Regression models

The literature on regression models for longitudinal data is vast, and we make no attempt to be comprehensive here. Our review is designed to highlight predominant approaches to regression modeling, emphasizing those models used in later chapters. Readers are referred to Diggle et al. [DHLZ02b], Fitzmaurice et al. [FLW04], Laird [Lai04], Jones [Jon93], Davidian and Giltinan [DG98], Crowder and Hand [CH90], Verbeke and Molenberghs [VM00], and Lindsey [Lin99] for a variety of perspectives.

As we review several different regression models, the intent is to give the reader a sense of the rich variety of models that can be used to characterize longitudinal data, and to illustrate that these fit coherently into a single framework. As a result, missing data strategies described in later chapters can be applied very generally. Specific models described here will be familiar to those with experience analyzing longitudinal data (e.g., multivariate normal regression models, random effects models), but
others represent fairly new developments (e.g., marginalized transition models [Hea02], regression splines [EM96, LZ99, RWC03]). Here we focus on specification and interpretation; Chapter 3 covers various aspects of inference.

Because many regression models for longitudinal data have their foundation in the generalized linear model (GLM) for cross-sectional data [MN99], our review begins with a concise description of GLMs. Coverage of models for longitudinal data begins with random effects models; these build directly on the GLM structure by introducing individual-level random effects to capture between-subject variation. Conditionally on the random effects, within-subject variation can be described by a simpler model, such as a GLM. Random effects models are very attractive in that they naturally partition variation in the dependent variable into its between- and within-subject components, and they can be used to model both balanced and unbalanced data. At the same time, there is sometimes the disadvantage that the implied marginal distribution of responses is opaque.

An alternative to random effects models is directly-specified models of the joint marginal distribution of responses (Section 2.4). Frequently referred to as marginal models, directly-specified models have a natural construction when the error distribution is multivariate normal, but for binary, count, and other discrete data, the choice of an appropriate joint distribution is less obvious. Our review touches on some recent developments for discrete longitudinal responses, such as the marginalized transition model [Hea02] and others. For a detailed review of likelihood-based models of multivariate discrete responses, see Chapter 11 of Diggle et al. [DHLZ02a] and Chapter 7 of Laird [Lai04].

For all models covered in the first part of the chapter, the regression function is linear in covariates and takes a known functional form. Section 2.5 describes models in which the regression function can be nonlinear, either through a known function of the covariates or through an unspecified smooth function. The latter type of model is typically called semiparametric, because the regression is left unspecified but distributional assumptions are made about the error structure. Nonlinear and semiparametric models have a close connection to the GLM structure; our discussion of these models emphasizes that connection and illustrates that regression models as a whole can be very generally characterized [HTF01, RWC03].

The final element of our review concerns interpretation of covariate effects in longitudinal models. Because the response and covariates change with time, models of longitudinal data afford the opportunity to infer
both within- and between-subject covariate effects; however, the importance of underlying assumptions to the interpretation of covariate effects should not be underestimated. Section 2.6 discusses three key aspects of interpretation and specification for longitudinal models: cross-sectional versus longitudinal effects of a time-varying covariate, marginal (population-averaged) versus conditional (subject-specific) covariate effects, and the assumptions governing the use of time-varying covariates.

Full vs. observed data

Throughout Chapters 2 and 3, the models refer to a full-data distribution. The distinction between full and observed data is particularly important when drawing inference from incomplete longitudinal data. We define the full data as those observations intended to be collected on a pre-specified interval, such as $[0, T]$. For example, if intended collection times $t_1, \ldots, t_J$ are common to all individuals, then the full response and covariate data are $(Y_{i1}, X_{i1}), \ldots, (Y_{iJ}, X_{iJ})$, where $Y_{ij} = Y_i(t_j)$ and $X_{ij} = X_i(t_j)$.

In most applications, interest lies in the effect of covariates on the mean structure. When data are fully observed, the variance and covariance models can frequently be treated as nuisance parameters. Correct specification of variance and covariance allows more efficient use of the data, but it is not always necessary for obtaining proper inferences about mean parameters. When data are not fully observed, variance-covariance specification takes on heightened importance because missing data will effectively be imputed or extrapolated from observed data, based on modeling assumptions. For longitudinal data, unobserved responses will be imputed from observed responses for the same individual; the assumed correlation structure will usually have considerable influence on the imputation. This theme recurs throughout the book, and therefore our review pays particular attention to aspects of variance-covariance specification.

Additional notation

Random variables and their realizations are denoted by Roman letters (e.g., $X$, $x$), and parameters are represented by Greek letters (e.g., $\alpha$, $\theta$). In Chapter 5, we expand the definition of full data to include random variables such as dropout time that characterize the missing data process.
Vector- and matrix-valued random variables and parameters are represented using boldface (e.g., $\mathbf{x}$, $\mathbf{Y}$, $\boldsymbol{\beta}$, $\boldsymbol{\Sigma}$). For any matrix or vector $A$, we use $A^T$ to denote transpose. If $A$ is invertible, then $A^{-1}$ is its inverse and $S = A^{1/2}$ is the lower triangular matrix square root such that $SS^T = A$. A full listing of notational conventions appears in the Appendix.

2.2 Generalized linear models for cross-sectional data

The generalized linear model (GLM) forms the foundation for many approaches to regression with multivariate responses, such as longitudinal or clustered data. Models such as random effects or mixed effects models, latent variable and latent class models, and regression splines, all highly flexible and general, are based on the GLM framework. Moment-based methods such as generalized estimating equations (GEE) also follow directly from the GLM for cross-sectional data [LZ86].

The GLM is a regression model for a dependent variable arising from the exponential family of distributions,
$$f(y \mid \theta, \psi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\psi)} + c(y, \psi) \right\},$$
where $a$, $b$, and $c$ are known functions, $\theta$ is the canonical parameter, and $\psi$ is a scale parameter. The exponential family includes several commonly-used distributions, such as the normal, Poisson, binomial, and gamma. It can be readily shown that
$$E(Y) = b'(\theta), \qquad \mathrm{var}(Y) = a(\psi)\, b''(\theta)$$
(see McCullagh and Nelder [MN89] for details).

The effect of covariates $x_i = (x_{i1}, \ldots, x_{ip})^T$ can be modeled by introducing the linear predictor $\eta_i(x_i, \beta) = x_i^T\beta$, where $\beta = (\beta_1, \ldots, \beta_p)^T$ is a vector of regression coefficients. Now define $\mu_i = \mu(x_i, \beta) = E(Y \mid x_i)$. A smooth, monotone function $g$ links the mean $\mu_i$ to the linear predictor $\eta_i$ via
$$g(\mu_i) = \eta_i = x_i^T\beta. \qquad (2.1)$$
In many exponential family distributions, it is possible to identify a link function $g$ such that $X^TY$ is the sufficient statistic for $\beta$ (here, $X$ is the $n \times p$ design matrix and $Y = (Y_1, \ldots, Y_n)^T$ is the $n \times 1$ vector of responses). In this case, the canonical parameter is $\theta = \eta$. Examples are well-known and widespread: for the Poisson distribution, the canonical parameter is $\log(\mu)$; for the binomial distribution, it is the log odds (logit), $\log\{\mu/(1 - \mu)\}$.
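To make the link and variance functions concrete, the following sketch (ours, using simulated data) fits a Bernoulli GLM with the canonical logit link by iteratively reweighted least squares, the standard algorithm for maximizing a GLM likelihood; the weights are the variance function $h(\mu) = \mu(1-\mu)$ with $\phi = 1$:

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    """Fit a Bernoulli GLM with canonical (logit) link by iteratively
    reweighted least squares; g(mu) = log(mu/(1-mu)), h(mu) = mu(1-mu)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                      # linear predictor eta = x'beta
        mu = 1.0 / (1.0 + np.exp(-eta))     # inverse link g^{-1}(eta)
        W = mu * (1.0 - mu)                 # variance function h(mu), phi = 1
        z = eta + (y - mu) / W              # working response
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
true_beta = np.array([-0.5, 1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_beta))))
print(irls_logistic(X, y))                  # roughly recovers [-0.5, 1.0]
```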
Although canonical links are sometimes convenient, their use is not necessary to form a GLM. In general, forming a GLM only requires specification of a mean and a variance function, conditionally on covariates. The mean follows (2.1), and the variance is given by $v(\mu_i, \phi) = \phi h(\mu_i)$, where $h(\cdot)$ is some function of the mean and $\phi > 0$ is a scale factor. Certain choices of $g$ and $h$ will yield likelihood score equations for common parametric regression models based on exponential family distributions. For example, setting $g(\mu) = \log\{\mu/(1 - \mu)\}$, $h(\mu) = \mu(1 - \mu)$, and $\phi = 1$ yields logistic regression under a Bernoulli distribution $[Y_i \mid x_i]$. Similarly, Poisson regression can be specified by setting $g(\mu) = \log \mu$, $h(\mu) = \mu$, and $\phi = 1$.

2.3 Conditionally specified (random effects) models

Conditionally specified models using random effects or latent variables provide a highly flexible class of models for handling longitudinal data. A defining characteristic of these models is that they impose structure on marginal variance and correlation using individual-specific random effects or latent variables. The models can be applied either to balanced or unbalanced response patterns, and can be used to capture key features of both between- and within-subject variation using relatively few parameters.

A standard approach is to specify a regression model that includes subject-level random effects or latent variables $b$, and then to assume that conditionally on the latent variables, the distribution $[Y \mid X, b]$ has a simple form (e.g., its elements are independent). Integrating out the random effects yields marginal correlations between the $\{Y_{ij}\}$ within subject [BK99].

Many models, regression and otherwise, can be
represented using a random effects or latent variable formulation. These include standard random effects regression models for responses that are continuous [LW82, Dig88] or discrete [SLW84, GH97, HG94, NMK00]; see Breslow and Clayton [BC93] and Daniels and Gatsonis [DG99] for an overview. This class of models also includes regression models with factor-analytic and latent class structures [SR96, AW00, SL96, RLR99, RA01]. See Bartholomew and Knott [BK99] for a full account.

Here we briefly review conditionally-specified regression models where conditioning is done on random effects; these models also are known by a variety of names, including mixed effects models, random effects models, and random coefficient models. We use the term random effects models. The most common random effects models for longitudinal data specify the joint distribution $[Y_i, b_i \mid X_i, \theta]$ as
$$[Y_i \mid b_i, X_i, \theta_1]\,[b_i \mid X_i, \theta_2].$$
The parameter $\theta_1$ captures the conditional effect of $X$ on $Y$. The marginal distribution $[Y_i \mid X_i]$ is obtained by integrating $b_i$ out of the joint distribution, and is indexed by the full set of parameters $\theta = (\theta_1, \theta_2)$.

Random effects models based on GLMs

By including random effects, generalized linear models can be used to model longitudinal and clustered data. For common distributions such as the Bernoulli and Poisson, the GLM with random effects can be written in terms of the conditional mean and variance. The conditional mean takes the form
$$g\{E(Y_{ij} \mid x_{ij}, z_{ij}, b_i)\} = g(\mu^b_{ij}) = x_{ij}^T\beta + z_{ij}^T b_i,$$
where $g(\cdot)$ is a link function and $z_{ij}$ is a design matrix for the subject-specific random effects. This representation of the conditional mean motivates the term mixed-effects model because the coefficients quantify both population-level ($\beta$) and individual-level ($b_i$) effects. The conditional variance is
$$V^b_{ij} = \mathrm{var}(Y_{ij} \mid x_{ij}, z_{ij}, b_i) = \phi h(\mu^b_{ij}).$$
Finally, within-subject correlation is specified through a covariance function
$$C^b_{ijk}(\gamma) = \mathrm{cov}(Y_{ij}, Y_{ik} \mid x_{ij}, x_{ik}, b_i, \gamma).$$
In many cases it is assumed that $C^b_{ijk} = 0$, i.e., that the random effects capture relevant within-subject correlation (after averaging over their distribution), but this assumption may not always be appropriate for longitudinal responses.

At the second level, the random effects $b_i$ follow some distribution such as the multivariate normal. The model for the marginal joint distribution of $(Y_{i1}, \ldots, Y_{iJ} \mid X_i)$ is obtained by integrating over $b_i$,
$$f(y_1, \ldots, y_J \mid X_i, \theta) = \int f(y_1, \ldots, y_J \mid X_i, b_i, \theta_1)\, dF(b_i \mid X_i, \theta_2).$$
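This integral rarely has a closed form outside the normal case, but its effect is easy to see by simulation. The sketch below (ours, with arbitrary parameter values) generates data from a logistic model with a random intercept; outcomes are independent given $b_i$, yet integrating over $b_i$ induces positive within-subject correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([-0.5, 0.3])                 # fixed effects (intercept, time)
omega = 1.0                                  # var(b_i) for the random intercept
J, n = 4, 20000
x_t = np.arange(J)                           # within-subject time covariate

b = rng.normal(0.0, np.sqrt(omega), size=n)            # b_i ~ N(0, omega)
eta = beta[0] + beta[1] * x_t[None, :] + b[:, None]    # conditional linear predictor
mu_b = 1 / (1 + np.exp(-eta))                          # E(Y_ij | b_i)
Y = rng.binomial(1, mu_b)                              # conditionally independent given b_i

# Marginally (i.e., after integrating over b_i), repeated outcomes on the
# same subject are positively correlated:
print(np.corrcoef(Y[:, 0], Y[:, 1])[0, 1])             # noticeably > 0
print(Y.mean(axis=0))                                  # Monte Carlo marginal means
```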
The relationship between marginal and conditional (random effects) models is important to understand, particularly as it relates to interpreting covariate effects. In what follows we give several examples to illustrate.

Random effects models for continuous responses

A natural choice for modeling continuous or measured responses is the normal distribution. In random effects models, allowing both within- and between-subject variation to follow a normal distribution, or more generally a Gaussian process, affords considerable modeling flexibility while retaining interpretability.

Example 2.1. Normal random effects model for continuous responses.
A common model for continuous longitudinal responses is the normal random effects model. This model illustrates well the concept of an indirectly-specified joint distribution because the variance-covariance structure in $[Y_i \mid X_i, \theta]$ is a by-product of the assumed random effects distribution. Like many random effects models, it is easiest to describe in two stages.

At the first stage, the responses $Y_i$ are normal conditionally on a $q \times 1$ vector of random effects $b_i$,
$$[Y_i \mid X_i, b_i, \theta_1] \sim N(\mu^b_i, \Sigma^b_i),$$
where the superscript $b$ is added to emphasize that the mean and covariance are conditional on $b_i$. To incorporate covariate effects, let
$$\mu^b_i = X_i\beta + Z_i b_i,$$
where $Z_i$ is the design matrix for the random effects. The variance matrix $\Sigma^b_i = \Sigma^b_i(\phi)$ captures within-subject variation and is parameterized by the $r \times 1$ vector $\phi$ of nonredundant parameters. Hence $\theta_1 = (\beta, \phi)$.

When $Z_i \subseteq X_i$, as is usually the case, the $b_i$ can be thought of as error terms for regression coefficients, which gives rise to the term random coefficient model. For example, if $X_i = Z_i$, we obtain a random-coefficients model,
$$\mu^b_i = X_i\beta_i = X_i(\beta + b_i), \qquad (2.2)$$
where the random effects $b_i$ can be interpreted as individual-specific deviations from $\beta$.

The within-subject variance $\Sigma_i(\phi)$ usually has a simplified structure, parameterized through a covariance function $C_{ijk}(\phi)$.
For example, an exponential structure takes the form
$$C_{ijk}(\phi) = \sigma^2 \rho^{|t_{ij} - t_{ik}|},$$
where $\phi = (\sigma^2, \rho)$ and $0 \leq \rho \leq 1$.

At the second level, the random effects are assigned a distribution that can depend on covariates. The (multivariate) normal is a common choice,
$$[b_i \mid X_i] \sim N(0, \Omega),$$
where $\Omega = \Omega(\eta)$ is a $q \times q$ variance matrix indexed by $\eta$ (hence $\theta_2 = \eta$). It also is possible to allow $\eta$ to depend on individual-level covariates through appropriate specifications [DZ03]; this is covered in more detail in Chapter 6.

The marginal distribution of $Y_i$ follows the multivariate normal distribution
$$[Y_i \mid X_i, Z_i, \theta] \sim N(X_i\beta,\; Z_i\Omega Z_i^T + \Sigma). \qquad (2.3)$$
The marginal variance $\mathrm{var}(Y_i \mid X_i)$ is indirectly specified because it depends on parameters from both $[Y_i \mid X_i, b_i]$ and $[b_i \mid X_i]$. Moreover, we see by comparing (2.2) and (2.3) that $\beta$ can be interpreted both as a marginal and as a conditional effect of $X_i$ on $Y_i$.

A version of this model is used to analyze data described in Example 1.1, a longitudinal clinical trial comparing three doses of an antipsychotic to the standard of care in schizophrenia patients. The analysis appears in Data Analysis ??.
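As a numerical illustration of the indirect specification in (2.3) (ours, with arbitrary parameter values), the marginal covariance for a random intercept-and-slope model with exponential within-subject covariance can be assembled directly:

```python
import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0])
Z = np.column_stack([np.ones_like(t), t])    # random intercept and slope (q = 2)
Omega = np.array([[4.0, 0.5],
                  [0.5, 1.0]])               # var(b_i)
sigma2, rho = 2.0, 0.6
Sigma = sigma2 * rho ** np.abs(np.subtract.outer(t, t))   # exponential C_ijk

V = Z @ Omega @ Z.T + Sigma                  # marginal var(Y_i | X_i), eq. (2.3)
sd = np.sqrt(np.diag(V))
print(np.round(V, 2))
print(np.round(V / np.outer(sd, sd), 2))     # implied marginal correlations
```

The marginal correlations depend on both $\Omega$ and $(\sigma^2, \rho)$, which is precisely the sense in which the joint distribution is indirectly specified.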
Random effects models for discrete responses

Random effects specifications can be very useful for modeling longitudinal discrete responses, where the joint distribution rarely takes an obvious form and principles from generalized linear models are not easily applied. In the case of longitudinal binary data, for example, it is straightforward to show that the joint distribution of a $J$-dimensional response variable can be represented by a multinomial distribution with $2^J$ categories. When $J$ is appreciably large, however, parameter constraints must be imposed to make modeling practical. See Laird [Lai04], Chapter 7, for a more detailed discussion.

Compared to direct specification of the joint distribution, random effects models offer the advantage of being parsimonious, providing a natural decomposition of sources of variation, and applying equally well to balanced and unbalanced response profiles. The regression parameters represent covariate effects in the conditional rather than the marginal joint distribution of $Y$, however, and because the link functions are nonlinear transformations of the mean (e.g., log, logit), these do not generally coincide. Therefore care must be taken when interpreting regression effects. Logistic regression with normal random effects illustrates several of these points rather well.

Example 2.2. Logistic regression with random effects.
As in Example 2.1, a logistic random effects model is specified in terms of the joint distribution
$$[Y_i, b_i \mid X_i, \theta] = [Y_i \mid X_i, b_i, \theta_1]\,[b_i \mid X_i, \theta_2],$$
where $\theta = (\theta_1, \theta_2)$. The conditional distribution of each component of $Y_i$ follows the Bernoulli model
$$[Y_{ij} \mid x_{ij}, b_i, \theta_1] \sim \mathrm{Ber}(\mu^b_{ij}), \quad \text{where} \quad g(\mu^b_{ij}) = x_{ij}^T\beta + z_{ij}^T b_i \qquad (2.4)$$
(hence $\theta_1 = \beta$). The random effects distribution follows
$$[b_i \mid X_i, \theta_2] \sim N(0, \Omega),$$
so $\theta_2 = \Omega$.

The parameter $\beta$ characterizes the conditional, or subject-specific, effect of $X_i$ on $Y_i$. By contrast, the marginal or population-averaged distribution $[Y_i \mid X_i, \theta]$ must be obtained by integrating over $b_i$. The marginal mean $\mu_{ij}(\beta, \Omega) = E(Y_{ij} \mid x_{ij}, \beta, \Omega)$ is
$$\mu_{ij}(\beta, \Omega) = \int \mu^b_{ij}(\beta)\, dF(b_i \mid x_{ij}, \Omega) = \int \frac{\exp(x_{ij}^T\beta + z_{ij}^T b_i)}{1 + \exp(x_{ij}^T\beta + z_{ij}^T b_i)}\, dF(b_i \mid x_{ij}, \Omega).$$
The marginal effect of $X_i$ differs from the conditional effect in that it is a function of both $\beta$ and $\Omega$, and on the logit scale it is no longer linear. Zeger and Liang [ZL92] show that in some cases, the marginal effect in the logit-normal model is approximately linear on the logit scale and differs from the conditional effect by a scale factor that depends on $\Omega$; i.e., $g(\mu_{ij}) = x_{ij}^T\beta^*$, where $\beta^* = \beta\, k(\Omega)$ for a known function $k$. In many cases the population-averaged effect $\beta^*$ is attenuated relative to the subject-specific effect $\beta$; for example, when $q = 1$ (random intercept model), $b_i$ is normally distributed, $b_i$ is independent of $X_i$, and $X_i$ has a single time-constant covariate $x_i$, then $|\beta^*| \leq |\beta|$, with the difference increasing with $\mathrm{var}(b_i)$.
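This attenuation is easy to verify numerically. The sketch below (ours, with arbitrary values) computes the marginal mean of a random-intercept logistic model by Gauss-Hermite quadrature and compares the slope of the marginal logit to the approximation $\beta^* \approx \beta(1 + c^2\omega)^{-1/2}$ with $c^2 \approx 0.346$, associated with the result of Zeger and Liang cited above:

```python
import numpy as np

def marginal_mean(eta, omega, deg=40):
    """E{expit(eta + b)} with b ~ N(0, omega), via Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(deg)
    b = np.sqrt(2.0 * omega) * x
    return np.sum(w / (1 + np.exp(-(eta + b)))) / np.sqrt(np.pi)

beta0, beta1, omega = -1.0, 1.0, 4.0          # conditional (subject-specific) effects
xs = np.linspace(-2, 2, 5)
mu = np.array([marginal_mean(beta0 + beta1 * x, omega) for x in xs])
logit_mu = np.log(mu / (1 - mu))

# Slope of the marginal logit is attenuated relative to beta1 = 1:
print(np.round(np.gradient(logit_mu, xs), 3))            # roughly constant, < 1
print(round(beta1 / np.sqrt(1 + 0.346 * omega), 3))      # approximate beta*
```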
Interpreting the marginal and conditional effects is considered further in Section 2.6.

In Chapter 4, we use this model to characterize the effect of a behavioral intervention on weekly smoking cessation status using longitudinal binary data from a recent clinical trial. The data are described in Dataset ?? and analyzed in Data Analysis ??.

Examples 2.1 and 2.2 assume the random effects $b_i$ follow a normal distribution; this is not necessary, and in many cases it may be inappropriate or incorrect. Zhang and Davidian [ZD01] describe models where the random effects distribution belongs to a flexible class of densities that includes the normal as a special case. Verbeke and Lesaffre [VL96] describe random effects distributions that follow discrete mixtures of normal distributions. For some simple models, it is sometimes possible to use exploratory analysis to ascertain whether a normal or other symmetric distribution is suitable for describing the random effects. In other cases more formal methods of model choice may be needed.

2.4 Directly specified (marginal) models

This section reviews the family of models in which the joint distribution $[Y \mid X]$ is directly specified by a model $f(y \mid x, \theta)$. Usually the most challenging aspect of model specification is finding a suitable parameterization for the correlation and/or covariance, particularly when observations are unbalanced in time or when the number of observations per subject is large relative to the sample size. In these cases, sensible decisions about dimension reduction must be made.

For continuous data that can be characterized using a normal distribution or Gaussian process, model specification (though not necessarily selection) can be reasonably straightforward, owing to the natural separation of mean and variance parameters in the normal distribution. The analyst can focus efforts separately on sensible models for the mean and the variance/covariance structures.

Other types of data pose more significant challenges to the process of direct specification due to a lack of obvious choices for joint distribution models. Unlike the normal distribution, which generalizes naturally to the multivariate and even stochastic process setting, common distributions like the gamma, binomial, and Poisson do not have obvious multivariate analogues. The main problems are that the mean shares parameters with the variance and, even for simple specifications, with the covariance. Another potential problem is that, unlike with the normal model,
higher-order associations do not necessarily follow from pairwise associations; hence they need to be specified or explicitly constrained [FL93]. The joint distribution of $J$ binary responses, for example, has $2^J - J - 1$ parameters governing the association structure. With count data, appropriately specifying even a simple correlation structure is not immediately obvious.

This section describes various approaches to direct model specification, illustrated with examples from the normal and binomial distributions. The first examples use the normal distribution. For longitudinal binary data, we describe an extension of the log-linear model that allows transparent interpretation of both the mean and the serial correlation. Another useful approach to modeling association in binary data is the multivariate probit model, which exploits properties of the normal distribution by assuming the binary variables are manifestations of an underlying normally-distributed latent process.

Multivariate normal and Gaussian process models

The multivariate normal distribution provides a highly flexible starting point for modeling continuous response data, both temporally aligned and misaligned. It also is useful for handling situations where the number of observation times is large relative to the number of units being followed. The most straightforward situation is where data are temporally aligned and $n \gg J$, allowing both the mean and variance to be unstructured. When responses are temporally misaligned, or when $J$ is large relative to $n$, structure must be imposed. A key characteristic of the normal distribution that allows for flexible modeling across a wide variety of settings is that the mean and variance have separate parameters.

The next two examples illustrate a variety of model specifications using the normal distribution. Each assumes that the response variable $Y$, or a suitable transformation, is well-characterized by the normal model.

Example 2.3. Multivariate normal regression for temporally aligned observations.
Assume that observations on the primary response variable are taken at a fixed schedule of times $t_1, \ldots, t_J$. For a response vector $Y_i = (Y_{i1}, \ldots, Y_{iJ})^T$ with associated $J \times p$ covariate matrix $X_i = (x_{i1}^T, \ldots, x_{iJ}^T)^T$, the multivariate normal regression is written as
$$[Y_i \mid X_i, \theta] \sim N(\mu_i, \Sigma_i),$$
where $\mu_i$ is $J \times 1$ and $\Sigma_i$ is $J \times J$. The mean $E(Y_i \mid X_i)$ follows a regression
$$\mu_i(\beta) = \mu(X_i, \beta) = X_i\beta,$$
where $\beta$ is a $p \times 1$ vector of regression coefficients.
The covariance is parameterized by a vector of nonredundant parameters $\phi$. To emphasize that the covariance $\mathrm{var}(Y_i \mid X_i)$ may depend on $X_i$ through $\phi$, we sometimes write $\Sigma_i(\phi) = \Sigma(X_i, \phi)$. If $\Sigma$ is assumed constant across individuals, it has $J(J+1)/2$ unique parameters, but structure can be imposed to reduce this number [JS86]. As an alternative to leaving $\Sigma$ fully parameterized, common structures for longitudinal data include banded or Toeplitz (with a common parameter along each off-diagonal), and autoregressive correlations of pre-specified order. The model also can be extended to allow $\Sigma$ to depend on covariates [DP02, NnANAZ00a, Pou00].

This model is written in general terms, and the $X_i$ matrix can be arbitrary. For example, it can include information about measurement time, baseline covariates, and the like. If we set $x_{ij} = (1, t_j)^T$ and $\beta = (\beta_0, \beta_1)^T$, then $\beta_1$ corresponds to the average slope over time, where the average is taken over the population from which the sample of individuals is drawn. When $J$ is small enough, $x_{ij}$ can include a vector of time indicators, allowing the mean to be unstructured in time.

In Data Analysis 4.1, this model is used to analyze data from a clinical trial of recombinant human growth hormone for increasing muscle strength in the elderly. The data are fully described in Dataset 1.2.

In the previous example, it is sometimes possible to allow both the mean and variance to remain unstructured in time because of the relatively few time points and covariate levels. When time points are temporally misaligned, or when the number of observation times is large relative to the sample size, information at the unique measurement times will be sparse and additional structure needs to be imposed. Our focus in the next example is on covariance parameterization in terms of a covariance function. Further details can be found in Diggle et al. [DHLZ02a], Chapter 4.
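The sketch below (ours, with arbitrary parameter values) illustrates two such structures for $J = 4$ aligned times, an AR(1)-type covariance and a banded/Toeplitz covariance, and evaluates the multivariate normal log-likelihood of simulated responses under each:

```python
import numpy as np
from scipy.stats import multivariate_normal

J = 4
t = np.arange(J, dtype=float)
X = np.column_stack([np.ones(J), t])          # x_ij = (1, t_j)
beta = np.array([80.0, -2.0])                 # hypothetical intercept and slope
mu = X @ beta

sigma2, rho = 25.0, 0.7
ar1 = sigma2 * rho ** np.abs(np.subtract.outer(t, t))      # AR(1)-type structure
toeplitz = sigma2 * np.array([[1.0, 0.6, 0.4, 0.2],
                              [0.6, 1.0, 0.6, 0.4],
                              [0.4, 0.6, 1.0, 0.6],
                              [0.2, 0.4, 0.6, 1.0]])       # banded/Toeplitz

rng = np.random.default_rng(2)
Y = rng.multivariate_normal(mu, ar1, size=200)             # n = 200 subjects

# Log-likelihood of the sample under each covariance structure:
print(multivariate_normal(mu, ar1).logpdf(Y).sum())
print(multivariate_normal(mu, toeplitz).logpdf(Y).sum())
```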
Example 2.4. Multivariate normal regression model for temporally misaligned observations.
When observations are temporally misaligned, the main difference in model specification has to do with the covariance parameterization. As in Example 2.3, a normal distribution may be assumed, but with covariance $\Sigma_i$ whose dimension and structure depend on the number and timing of observations for individual $i$. The joint distribution follows
$$[Y_i \mid X_i, \theta] \sim N(\mu_i, \Sigma_i).$$
Also as in Example 2.3, $\mu_i = X_i\beta$. The covariance $\Sigma_i(\phi)$ has dimension $J_i \times J_i$, and structure is imposed by specifying a model for its elements $\sigma_{ikl}(\phi)$. For example, the model
$$\sigma_{ikl} = C_{ikl}(\phi_1, \phi_2) = \mathrm{cov}(Y_{ik}, Y_{il} \mid X_i, \phi) = \phi_1 \phi_2^{|t_{ik} - t_{il}|}$$
requires only $\phi_1$ and $\phi_2$ to fully parameterize $\Sigma_i(\phi)$ (with the constraints $\phi_1 > 0$ and $0 \leq \phi_2 \leq 1$). This covariance model implies:
(i) $\mathrm{var}(Y_{ij} \mid X_i, \phi) = \phi_1$ (setting $t_{ik} = t_{il}$);
(ii) for any two observations $Y_{ik}$, $Y_{il}$ separated by lag $\Delta_{kl} = |t_{ik} - t_{il}|$, $\mathrm{corr}(Y_{ik}, Y_{il} \mid X_i, \phi) = \phi_2^{\Delta_{kl}}$.
This process is stationary because, for any given lag, the correlation is a constant function of $t$. Further discussion of stationarity is given in Chapter 6.

The variance and correlation parameters may depend on covariates through an appropriately-specified model [DP02]. Let $Z_i \subseteq X_i$. A simple model for the variance is $\log \phi_{1i} = Z_i\alpha$. (Though this model is linear in the covariates, the design matrix can be used to specify highly flexible structures such as penalized splines; this is discussed further in Section 2.5.)

A semiparametric version of this model will be used to characterize the response of longitudinal CD4 counts to initiation of highly-active antiretroviral therapy (HAART) in a natural history study of HIV-infected women (Data Analysis 4.4).
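A short sketch (ours, with hypothetical values) shows how this covariance function adapts to subject-specific visit times, and how the variance can depend on a subject-level covariate through $\log \phi_{1i} = z_i^T\alpha$:

```python
import numpy as np

def cov_matrix(t_i, phi1_i, phi2):
    """Sigma_i for temporally misaligned times, with
    cov(Y_ik, Y_il) = phi1_i * phi2**|t_ik - t_il|."""
    lags = np.abs(np.subtract.outer(t_i, t_i))
    return phi1_i * phi2 ** lags

# Variance depends on a subject-level covariate via log(phi_{1i}) = z_i' alpha:
alpha = np.array([1.5, 0.8])
z_i = np.array([1.0, 1.0])                    # e.g., intercept and group indicator
phi1_i = np.exp(z_i @ alpha)

t_i = np.array([0.0, 0.4, 1.1, 2.7])          # subject-specific visit times
print(np.round(cov_matrix(t_i, phi1_i, phi2=0.5), 3))
```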
Directly specified models for discrete longitudinal responses

Although the multivariate normal distribution is a natural choice for characterizing the joint distribution of continuous longitudinal responses, approaches to dealing with discrete observations such as binary, categorical, or count data are less obvious. Specifications of time-specific mean and variance can be built directly from GLMs, but the main difficulty lies in specifying a sensible and appropriate variance-covariance structure. This section describes two models for longitudinal binary data, both of which can be generalized to handle ordinal or multinomial responses.

A challenge in formulating models for the joint distribution of discrete responses is having a sensible parameterization of the correlation while maintaining an interpretable regression structure for the mean. This difficulty arises because the correlation is a function of the mean. One approach is to fully parameterize higher-order interactions and then constrain some of them to be zero [FL93, FLL94, FLZ96]. Another approach is to formulate models that deal explicitly with serial correlation. Our examples here include the marginalized transition model (MTM) [Hea02], which is a constrained version of a log-linear model, and the multivariate probit model, which uses a latent multivariate normal structure to induce correlation [CG98a]. The probit model is widely used in econometric modeling but less so in biostatistical applications. Model coefficients lack the simplicity of an odds ratio interpretation, but the assumption of an underlying latent normal distribution provides computational tractability and flexible modeling of serial covariance structures.

Example 2.5. Marginalized transition model for temporally aligned binary data.
Let $Y_i = (Y_{i1}, \ldots, Y_{iJ})^T$ denote the vector of responses, and let $X_i$ denote the design matrix. The marginal mean is $\mu_{ij} = E(Y_{ij} \mid x_{ij}) = \mathrm{pr}(Y_{ij} = 1 \mid x_{ij})$. To describe the model, some additional notation is needed. For any time-dependent variable $Z$, let $\bar{Z}_j = \{Z_1, \ldots, Z_j\}$ denote its history up to and including time $j$. (In many cases, we implicitly assume $x_{ij}$ includes relevant covariate history up to and including time $j$, obviating the need for overbar notation.)

The MTM is actually constructed as a transition model, in which the distribution $[Y_{ij} \mid \bar{Y}_{i(j-1)}, x_{ij}]$ follows a Bernoulli distribution, but is constrained to allow $g(\mu_{ij})$ to be linear in covariates. Hence the marginal mean retains its form as a regression, but the distributional assumptions are made on the serial correlation model. For illustration, we use the logistic regression formulation.

The MTM is based on a likelihood for the serial dependence model. The joint distribution is factored as
$$[Y_{i1}, \ldots, Y_{iJ} \mid X_i] = [Y_{i1} \mid x_{i1}]\,[Y_{i2} \mid Y_{i1}, x_{i2}] \cdots [Y_{iJ} \mid \bar{Y}_{i(J-1)}, x_{iJ}].$$
Each component of the joint distribution is assumed to follow a Bernoulli distribution,
$$[Y_{ij} \mid \bar{Y}_{i(j-1)}, x_{ij}] \sim \mathrm{Ber}(\phi_{ij}),$$
where $\phi_{ij} = E(Y_{ij} \mid \bar{Y}_{i(j-1)}, x_{ij})$. Under first-order dependence, denoted MTM(1), we have
$$[Y_{ij} \mid \bar{Y}_{i(j-1)}, x_{ij}] = [Y_{ij} \mid Y_{i(j-1)}, x_{ij}].$$
In what remains we focus on models that assume first-order serial dependence.
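The factorization delivers the likelihood contribution of a single subject as a product of Bernoulli terms. A minimal sketch (ours, with made-up probabilities) for a first-order binary transition model:

```python
import numpy as np

def loglik_first_order(y, p1, trans):
    """log f(y_1, ..., y_J) for a first-order binary transition model:
    f = f(y_1) * prod_j f(y_j | y_{j-1}); p1 = pr(Y_1 = 1), and
    trans[a] = pr(Y_j = 1 | Y_{j-1} = a) for a in {0, 1}."""
    ll = np.log(p1 if y[0] == 1 else 1 - p1)
    for prev, curr in zip(y[:-1], y[1:]):
        phi = trans[prev]                      # conditional mean phi_ij
        ll += np.log(phi if curr == 1 else 1 - phi)
    return ll

print(loglik_first_order([0, 0, 1, 1], p1=0.3, trans={0: 0.2, 1: 0.7}))
```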
The MTM(1) is specified in terms of two simultaneous equations. The first allows the logit of the marginal mean $\mu_{ij}$ to depend linearly on covariates $x_{ij}$; the second characterizes the dependence structure described by the conditional mean $\phi_{ij}$:
$$\mathrm{logit}(\mu_{ij}) = x_{ij}^T\beta, \qquad (2.5)$$
$$\mathrm{logit}(\phi_{ij}) = \Delta_{ij}(x_{ij}) + Y_{i,j-1}\gamma_j. \qquad (2.6)$$
If the serial correlation parameter $\gamma_j$ depends on individual-level covariates $z_{ij} \subseteq x_{ij}$, then we replace $\gamma_j$ by $\gamma_{ij}(\alpha)$, where $\log \gamma_{ij} = z_{ij}^T\alpha$.

For longitudinal binary data, the marginal and conditional means $\mu_{ij}$ and $\phi_{ij}$ are functionally dependent. To see this clearly, first write $\phi_{ij} = \phi(\Delta_{ij}, \gamma_j, y) = E(Y_{ij} \mid Y_{i,j-1} = y)$ to emphasize its dependence on $Y_{i,j-1}$. Then
$$\begin{aligned}
\mu_{ij}(\beta) = \mathrm{pr}(Y_{ij} = 1) &= \mathrm{pr}(Y_{ij} = 1 \mid Y_{i,j-1} = 0, \gamma, \Delta)\,\mathrm{pr}(Y_{i,j-1} = 0 \mid \beta) \\
&\quad + \mathrm{pr}(Y_{ij} = 1 \mid Y_{i,j-1} = 1, \gamma, \Delta)\,\mathrm{pr}(Y_{i,j-1} = 1 \mid \beta) \\
&= \phi_{ij}(\gamma, \Delta, 0)\{1 - \mu_{i,j-1}(\beta)\} + \phi_{ij}(\gamma, \Delta, 1)\,\mu_{i,j-1}(\beta).
\end{aligned}$$
These constraints imply that $\Delta_{ij} = \Delta(x_{ij})$ is a deterministic function of $\gamma_j$ and $\beta$ (given $x_{ij}$); see Heagerty [Hea02] for details.

In Data Analysis ?? we use this model to analyze data from the longitudinal smoking cessation study, and compare the results to those obtained with the logistic-normal random effects model described in Example 2.2.

In practice, the MTM works well for first-order Markov dependence, but higher-order dependence structures can make computation of the $\Delta_{ij}$ more difficult. Because the models are derived from log-linear specifications, in principle they can be directly extended to handle ordinal or multinomial responses. MTMs are well-suited to temporally aligned data, but can be expanded to handle temporally misaligned observations through properly-specified design matrices in both the mean and the serial correlation. Su and Hogan [SH05] develop a flexible approach using penalized splines.

When interest focuses only on the transition probabilities $\phi_{ij}$ and not the marginal mean, the conditional part (2.6) of the MTM can be used for inference. The intercept $\Delta_{ij}(x_{ij})$ is replaced by $x_{ij}^T\psi$, and the parameters of interest are either the $\gamma_j$ or $\alpha$.
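Operationally, fitting an MTM(1) requires solving this constraint for each $\Delta_{ij}$ given $\beta$ and $\gamma$. A minimal sketch (ours, with an assumed logistic time trend for the marginal means) does this by one-dimensional root-finding:

```python
import numpy as np
from scipy.optimize import brentq

def expit(u):
    return 1 / (1 + np.exp(-u))

def solve_delta(mu_j, mu_prev, gamma):
    """Find Delta_j satisfying the MTM(1) constraint:
    mu_j = expit(Delta)(1 - mu_prev) + expit(Delta + gamma) mu_prev."""
    g = lambda d: expit(d) * (1 - mu_prev) + expit(d + gamma) * mu_prev - mu_j
    return brentq(g, -30.0, 30.0)             # g is monotone in d, so a root exists

# Marginal means follow logit(mu_j) = beta0 + beta1 * t_j; gamma fixed over time.
beta0, beta1, gamma = -0.5, 0.2, 1.5
mu = expit(beta0 + beta1 * np.arange(4))
deltas = [solve_delta(mu[j], mu[j - 1], gamma) for j in range(1, 4)]
print(np.round(deltas, 3))

# Check the constraint at j = 1:
d = deltas[0]
print(np.isclose(expit(d) * (1 - mu[0]) + expit(d + gamma) * mu[0], mu[1]))
```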
This is in contrast to models based on the normal distribution, where the mean, variance, and covariance can be effectively separated, eliminating the need for constraints on the parameter space.

Another approach to handling longitudinal discrete data is the multivariate probit model, which exploits the separation of mean and variance in the normal distribution by assuming the existence of a latent multivariate normal structure that gives rise to the observed data. Probit models have been used in psychometrics and econometrics, where justification for and interpretation of the latent scales can frequently be made on subject-matter grounds (see, e.g., Lancaster [Lan04], Section 5.2). In epidemiology and public health applications, justification for the use of latent scales is less obvious. In any application, the data offer no justification for the assumption that the latent variable follows a normal distribution, which has invited criticism of these models. However, they can also be viewed simply as a method of constructing a multivariate distribution for binary longitudinal data, whose fit can be assessed empirically (Chapter 3). Probit models afford considerable opportunity to parameterize variance and covariance structures for discrete data, a property that is difficult to overlook. Moreover, the assumed latent structure makes inference via posterior sampling rather easy to implement (see Chapter 3). The probit link is functionally quite similar to the logit link, so the fit to data is typically very similar between the two models. The next example gives some details about specification; more details on the multivariate probit model, including methods for inference, are found in Chapters 3 and 6.

Example 2.6. Multivariate probit model.

The probit model based on an underlying multivariate normal distribution can be used to model the joint distribution of longitudinal binary data directly. The multivariate probit model assumes the existence of a vector of latent variables Z_i = (Z_{i1}, ..., Z_{iJ})^T that follow a multivariate normal distribution

[Z_i | X_i, \theta] ~ N(\mu_i, \Psi_i),

where \mu_i = X_i \beta, \Psi_i = \Psi_i(\phi) is a J x J correlation matrix with elements {\psi_{kl}(\phi)}, and \theta = (\beta, \phi). The diagonal elements \psi_{kk} are set to one for identifiability [CG98a]. Interpretation of regression parameters in probit models is most easily done on the latent scale: the coefficient \beta_p is the average difference, in standard deviation units of Z_{ij}, corresponding to a one-unit difference in x_{ijp}.
The binary observations Y_{ij} are manifestations of the underlying latent variables, with Y_{ij} = I{Z_{ij} > 0}. This implies that marginally [Y_{ij} | x_{ij}, \theta] ~ Ber(\mu_{ij}), with \mu_{ij} = \mu(x_{ij}, \beta) = \Phi(x_{ij}^T \beta). Within-subject dependence is modeled through \Psi(\phi): the underlying multivariate normal distribution of Z_i induces both the correlation and the use of the normal cdf to transform x_{ij}^T \beta to the probability scale. The multivariate distribution [Y_i | X_i, \theta], and in particular its serial correlation structure, is obtained by integrating [Z_i | X_i, \theta] over a subspace of R^J defined by the observed binary outcomes Y_i,

f(y_1, ..., y_J | X_i, \theta) = \int_{Q(y)} f(z | X_i, \theta) dz,

where f(. | X_i, \theta) is the multivariate normal density having mean and variance parameters in \theta, and Q(y) \subset R^J defines the limits of integration over z such that the jth integral is taken over (0, \infty) when y_j = 1 and over (-\infty, 0] when y_j = 0.
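The orthant integral above can be computed with standard multivariate normal routines. The sketch below (ours; the function name is illustrative) uses a sign-flipping device: with S = diag(1 - 2y), the event defined by Q(y) is equivalent to S Z <= 0, so the probability is a single multivariate normal cdf evaluation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def probit_likelihood(y, X, beta, Psi):
    """P(Y = y | X) under the multivariate probit model.

    Z ~ N(X beta, Psi); the outcome y corresponds to the orthant
    Z_j > 0 if y_j = 1 and Z_j <= 0 if y_j = 0.
    """
    mu = X @ beta
    s = 1.0 - 2.0 * np.asarray(y, dtype=float)  # +1 if y_j = 0, -1 if y_j = 1
    # W = S Z ~ N(S mu, S Psi S); the event becomes W <= 0 componentwise
    return multivariate_normal(mean=s * mu,
                               cov=np.outer(s, s) * Psi).cdf(np.zeros(len(y)))

# Example: J = 2 with latent correlation 0.5
X = np.array([[1.0, 0.0], [1.0, 1.0]])
Psi = np.array([[1.0, 0.5], [0.5, 1.0]])
print(probit_likelihood([1, 0], X, np.array([0.2, 0.3]), Psi))
```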
2.5 Nonlinear and semiparametric regression

Our focus thus far has been on models where the regression function is specified as a linear function of covariates. Although it is possible, by transforming covariates, to specify models that are implicitly nonlinear, in practice linear models can be limiting in some settings. Models where the regression function is explicitly nonlinear, either through specification of some known function or by allowing flexible, unspecified functions, offer a wide array of alternatives and significantly expand the capability of regression modeling. Our focus in this section is on models that allow nonlinearity in time trends, although in principle the models we discuss can be used to model nonlinear effects of any covariate.

In Section 2.5.1, we describe models where nonlinear time trends are specified directly; the canonical reference for this line of work is the book by Davidian and Giltinan [DG98] and a recent update [DG03]. These models are especially useful in settings like pharmacokinetics, viral dynamics, and growth curve modeling, where processes like drug uptake, viral replication, and growth rate can be described by known but complicated functions of time. Section 2.5.2 discusses semiparametric models, highlighting those where covariate effects involving time are captured with unspecified smooth functions. Here we divide the model types into two categories, generalized additive models (GAM) and varying coefficient models (VCM). In a GAM, the main effect of time is characterized by a smooth function f(t); certain random effects specifications permit f(t) to be estimated with a smoothing spline [ZLRS98, RWC03]. In a VCM, covariate effects are allowed to vary with time via \beta(t) x, where the regression coefficient \beta(t) is a smooth function of time [HRWY98, CRW01, Zha04]. The review here is confined to likelihood-based methods, but readers should be aware of the sizable literature on moment-based methods for fitting semiparametric regressions [LC01, RWC03, WCL05]. The emerging literature connecting functional data analysis with longitudinal repeated measures is also highly relevant; for an accessible review, see Guo [Guo04].

2.5.1 Nonlinear models

Models that capture nonlinear time trends can be either directly or indirectly specified. Direct specifications for [Y | X] based on a normal distribution take the form

[Y_i | X_i] ~ N(\mu_i, \Sigma_i),    (2.7)

with \mu_{ij} = \mu_{ij}(\beta) = f(x_{ij}, \beta) and \Sigma_i parameterized through a variance-covariance function. In practice, however, nonlinear models are frequently used to capture temporal processes at the individual level, suggesting that individual-level features of the nonlinear trajectory, rather than the averaged trajectory itself, are of primary interest. Consequently, nonlinear models with subject-level random effects are highly popular in fields such as pharmacokinetics, cellular and viral dynamics, and growth modeling. We use an example from the modeling of HIV dynamics, taken from Wu and Wu [WW02].

Example 2.7. Nonlinear random effects model of HIV dynamics.

Referring to (2.7), suppose X_i is simply a design matrix for time, such that x_{ij} = (1, t_{ij}). Then f(x_{ij}, \beta) is a nonlinear function appropriate to the process being modeled. As an example, the level of plasma HIV-1 RNA in response to antiviral therapy can be modeled with a one- or two-phase model that captures viral decay. Here we illustrate using a one-phase model, where

f(t_{ij}, \beta_i) = log_10{\beta_{0i} + \beta_{1i} exp(-\beta_{2i} t_{ij})}.    (2.8)

The key parameter is the viral decay rate \beta_{2i}. The random effects formulation permits \beta_i = (\beta_{0i}, \beta_{1i}, \beta_{2i})^T to vary by
subject; covariate effects on viral decay are captured by modeling each of the subject-specific coefficients in terms of covariates. The first-level model specification is

[Y_i | t_i, \beta_i] ~ N(\mu_i^\beta, \Sigma_i^\beta),

where the mean follows \mu_i^\beta = f(t_i, \beta_i), specified as in (2.8). At the second level, covariates can be used to explain individual-specific components of the nonlinear model,

[\beta_i | X_i] ~ N(\nu_i, \Omega),

where \nu_i = A(X_i) \alpha, A(X_i) is a transformation of the covariate matrix X_i, and \Omega is a q x q covariance matrix (here q = 3). One objective in modeling viral dynamics is to quantify the effect of immune system function, measured via CD4 cell counts, on first-phase viral decay in response to treatment. A simple version of that model sets

\beta_{0i} = \alpha_{00} + b_{0i}
\beta_{1i} = \alpha_{10} + b_{1i}
\beta_{2i} = \alpha_{20} + \alpha_{21} CD4_i + b_{2i},

with \Sigma_i^\beta = \sigma^2 I_i. Unlike linear models, indirectly specified nonlinear models with random effects do not admit a simple decomposition of the marginal variance-covariance structure.
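As a concrete, single-subject illustration of (2.8), the sketch below (ours; the data and starting values are made up) fits the one-phase decay curve by nonlinear least squares. It is the fixed-effects analogue of the model above, not the full hierarchical fit, which would additionally model the \beta_i across subjects.

```python
import numpy as np
from scipy.optimize import curve_fit

def viral_load(t, b0, b1, b2):
    """One-phase viral decay model (2.8) on the log10 scale."""
    return np.log10(b0 + b1 * np.exp(-b2 * t))

# Simulated observations for one subject (illustrative values)
rng = np.random.default_rng(0)
t = np.array([0.0, 0.5, 1.0, 2.0, 4.0, 8.0])              # weeks on therapy
y = viral_load(t, 50.0, 5.0e4, 2.0) + rng.normal(0, 0.1, t.size)

# Nonlinear least squares; starting values (p0) and positivity bounds
# matter because the model is nonlinear in its parameters.
est, cov = curve_fit(viral_load, t, y, p0=[100.0, 1.0e4, 1.0],
                     bounds=(1e-8, np.inf))
print("estimated decay rate beta_2:", est[2])
```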
2.5.2 Semiparametric regression models for longitudinal data

For longitudinal data, two highly flexible semiparametric models are the generalized additive model (GAM) and the varying coefficient model (VCM). In a GAM, the error distribution follows a parametric model (e.g., normal), but covariates act on the mean through an unspecified (nonparametric) function. The VCM is similar, except that the regression function takes a known form and its coefficients vary nonparametrically.

Generalized additive models based on regression splines

Consider the cross-sectional data model [Y_i | X_i] ~ N(\mu_i, \sigma^2), with \mu_i = m(x_i) for an unknown function m(.). A natural cubic smoothing spline estimator of m is the function f that minimizes the penalized sum of squares

\sum_{i=1}^n {Y_i - f(x_i)}^2 + \lambda \int {f''(x)}^2 dx,    (2.9)

where f'' denotes the second derivative of f. For a suitably chosen n x n matrix K, the penalty term can be rewritten as \lambda f^T K f, where f = {f(x_1), ..., f(x_n)}^T; this term penalizes overfitting, and in particular the parameter \lambda governs the smoothness of the fitted curve.

Regression splines for longitudinal data can be constructed similarly. When smoothing splines are used to model time trends, longitudinal data afford the advantage of replicated curves across individuals. This structure can be exploited to permit automated smoothing by formulating models in which \lambda is a variance component [ZLRS98, RWC03]; the formulation can be used for a variety of splines (e.g., cubic smoothing splines, B-splines) and permits fitting both GAMs and VCMs. The basic principle of formulating a semiparametric model as a linear regression model is illustrated here in the context of penalized splines (P-splines).

Example 2.8. A regression spline for continuous longitudinal data.

This example shows that the time trend f(t) can be estimated nonparametrically using a linear model formulation. The model is

[Y_i | \mu_i, \Sigma_i] ~ N_{J_i}(\mu_i, \Sigma_i),    (2.10)

with

\mu_{ij} = f(t_{ij})    (2.11)

and

\sigma_{jk}(\phi) = \tau^2 if j = k;  \sigma_{jk}(\phi) = \tau^2 exp(-\phi |t_{ij} - t_{ik}|) if j != k.

A P-spline model represents f(t) in terms of an R-dimensional basis B(t) = {B_1(t), ..., B_R(t)}^T, where R is less than or equal to the number of unique observation times. A common choice is based on qth-degree polynomial splines with preselected knot points s_1, ..., s_k along the time axis [BCR02],

B(t) = {1, t, t^2, ..., t^q, (t - s_1)_+^q, (t - s_2)_+^q, ..., (t - s_k)_+^q}^T,    (2.12)

where a_+ = a I(a > 0). This basis has dimension R = q + k + 1. The knots can be chosen in a variety of ways; Berry et al. [BCR02] recommend using percentiles of the unique observation times. In practice, a linear (q = 1) or quadratic (q = 2) basis is often sufficient.

The linear model formulation replaces f(t_{ij}) in (2.11) with B(t_{ij})^T \beta,
where \beta is R x 1. Inference is then based on a likelihood having a penalty term \lambda \beta^T D \beta, where D is an R x R matrix whose form is motivated by the choice of basis. If (2.12) is used, the coefficients of the terms involving knot points correspond to jumps in the qth derivative of f(t) at each knot point. To ensure smoothness, one can penalize the sum of squared jumps at the knot points via a simple form of D that places 1's along the diagonal elements corresponding to \beta_{q+2}, ..., \beta_R [RC00].

To illustrate in a simple case, we describe the model in two stages, borrowing from Crainiceanu et al. [CRW05]. Elaborating (2.10), we introduce a k x 1 vector of random effects u. It is important to note that these random effects vary not across individuals but across knot points, as in a crossed design. Let s = (s_1, ..., s_k) denote the set of knot points. At the first level,

[Y_i | X_i, u, \Sigma_i] ~ N(\mu_i^u, \Sigma_i^u).    (2.13)

Now let X_i = X_i(s) = (Z_i, H_i(s)), where

Z_i =
[ 1  t_{i1}     ...  t_{i1}^q     ]
[ 1  t_{i2}     ...  t_{i2}^q     ]
[ ...                             ]
[ 1  t_{i,J_i}  ...  t_{i,J_i}^q  ],

H_i(s) =
[ (t_{i1} - s_1)_+^q     ...  (t_{i1} - s_k)_+^q     ]
[ ...                                                ]
[ (t_{i,J_i} - s_1)_+^q  ...  (t_{i,J_i} - s_k)_+^q  ].

Partition the regression parameters as \beta = (\alpha_0, \alpha_1, ..., \alpha_q, u_1, ..., u_k)^T = (\alpha^T, u^T)^T. We can now specify the mean and variance in the first-level model (2.13),

\mu_i^u = Z_i \alpha + H_i u,
\Sigma_i^u = \sigma^2 I_i.

At the second level, we assign a distribution to the elements of u = (u_1, ..., u_k)^T,

[u | \tau] ~ N_k(0, \tau^2 I_k).

Under certain choices of prior distributions, the posterior mode of Z\alpha + Hu is a penalized smoothing spline estimate of f(t). In frequentist terms, it minimizes the penalized log-likelihood described above, with D chosen as the block-diagonal matrix

D = [ 0_{(q+1) x (q+1)}  0_{(q+1) x k} ]
    [ 0_{k x (q+1)}      I_{k x k}     ],

where \lambda = \sigma^2 / \tau^2. In Example 4.4, we use this model to analyze longitudinal CD4 data from the HER Study.
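A minimal sketch of the P-spline formulation follows (ours; names are illustrative). For simplicity it treats the smoothing parameter \lambda as fixed, rather than estimating it through the variance ratio \sigma^2/\tau^2, and it ignores the serial correlation in \Sigma_i.

```python
import numpy as np

def pspline_fit(t, y, q=2, n_knots=10, lam=1.0):
    """Penalized truncated-power-basis estimate of f(t) (Example 2.8 sketch).

    Design C = [Z, H]: Z holds the polynomial terms 1, t, ..., t^q, and H the
    truncated power functions at knots placed at percentiles of t.  Solves
    the ridge-type system (C'C + lam D) b = C'y with D = blockdiag(0, I_k),
    penalizing only the knot coefficients.
    """
    knots = np.percentile(np.unique(t), np.linspace(5, 95, n_knots))
    Z = np.vander(t, q + 1, increasing=True)
    H = np.clip(t[:, None] - knots[None, :], 0.0, None) ** q
    C = np.hstack([Z, H])
    D = np.diag([0.0] * (q + 1) + [1.0] * n_knots)
    b = np.linalg.solve(C.T @ C + lam * D, C.T @ y)
    return C @ b                       # fitted values f(t_i)

# Synthetic illustration
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 10, 200))
y = np.sin(t) + rng.normal(0, 0.3, t.size)
fhat = pspline_fit(t, y)
```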
This approach can also be used to estimate f(t) via natural cubic smoothing splines [ZLRS98], and it applies to discrete data in a GLM framework; see Lin and Zhang [LZ99] for a comprehensive account.

Varying coefficient models

Another type of semiparametric model for longitudinal data, similar to the GAM, is the varying coefficient model. The model allows covariate effects to change with time by allowing the coefficient \beta(t) to be an unspecified function of time, a particularly useful feature in settings where covariate effects are not time-constant but do not have an obvious functional form. Although VCMs have a different specification than GAMs, the method of estimation is similar. Recent work by Zhang [Zha04] shows that \beta(t) can be estimated by smoothing splines using a modified linear model similar to the one described in Example 2.8.

Example 2.9. Varying coefficient model for continuous responses.

Let x_i represent a baseline covariate, and let t denote time elapsed from baseline. The following VCM allows the effect of x to vary over time through an unspecified function for the regression coefficient. At the first level, we assume [Y_{ij} | x_i, t_{ij}] ~ N(\mu_{ij}, \sigma^2(t_{ij})), where the correlation function from Example 2.8 applies. One can specify a VCM that allows an overall linear time trend, but an effect of x that varies with time:

\mu_{ij} = \beta_0 + \beta_1 t_{ij} + \beta_2(t_{ij}) x_i.

Here, \beta_0 and \beta_1 are unknown parameters, and \beta_2(t) is a smooth function of t. Hence E(Y_{ij} | x_i, t = t*) = \beta_0 + \beta_1 t* + \beta_2(t*) x_i is the mean at some fixed time t = t*. The model can be extended to allow the intercept to depend on time via \beta_0(t), making the VCM a generalization of the GAMs described above, and to models for discrete data [Zha04].

2.6 Interpreting covariate effects

Proper interpretation of covariate effects in models for longitudinal data requires careful attention to model specification and assumptions about
covariate processes. The use of time-varying covariates in particular allows investigators to study both within- and between-individual covariate effects, but the incorporation of these covariates and the interpretation of their effects must be done with care. This section focuses on several aspects of model specification and interpretation: assumptions about covariates, the distinction between longitudinal and cross-sectional effects of a time-varying covariate, and the distinction between marginal and conditional effects in random effects models. These topics are covered in considerable detail elsewhere; readers are referred to Chapters 1 and 12 of Diggle et al. [DHLZ02a], Chapter 15 of Fitzmaurice et al. [FLW04], and Chapter 1 of Laird [Lai04].

2.6.1 Assumptions regarding time-varying covariates

Throughout the book, we assume covariates are exogenous and measured without error. Exogeneity is a standard assumption for nearly all regression models, and in its simplest form amounts to independence between errors and covariates. In longitudinal models, exogeneity must hold for both time-constant and time-varying covariates.

Time-varying covariates can be either deterministic or stochastic functions of time. Deterministic functions of time include calendar date or participant age; if these are exogenous at baseline, then their time-varying counterparts also will be. Stochastic covariate processes can be more problematic and must be treated with caution. A stochastic time-varying covariate will usually be exogenous if it is completely external to the measurement process (e.g., daily air pollution levels measured at the city level in a study where the response is the daily number of asthma events measured at the individual level). However, stochastic covariates measured at the same level as the response will tend to be endogenous, because they may be systematically predicted by prior responses. In cohort studies of HIV infection, suppose that for measurement occasion j, x_j is exposure to antiviral treatment and Y_j is CD4 cell count. It is highly probable that Y_j is predictive of x_{j+1}, because variations in CD4 count are clinically important to the initiation or adjustment of treatment. Consider another example involving depression and physical functioning, where Y_{ij} represents depression and x_{ij} represents physical functioning. It is easy to imagine that changes in physical functioning will be associated with subsequent changes in depression (i.e., x_j predicts Y_j), but the converse also may be true: changes in depression may lead to changes in physical functioning (i.e., Y_j, or some function of Y_1, ..., Y_j, predicts x_{j+1}).

Regression models are not usually suitable for characterizing the effects
of time-varying stochastic covariates, unless those covariates are exogenous to the measurement process. Effects of endogenous covariates are best summarized by models designed to capture causal effects; this is taken up in Chapter 11. In other circumstances, it is more appropriate to treat (Y_t, x_t) as a joint process and use a model to characterize its joint evolution and association structure.

Measurement error is another important concern, particularly with time-varying stochastic covariates and with nonlinear models, where measurement error can introduce appreciable bias. The literature on this topic is deep and wide-ranging; see Fuller [Ful87] and Carroll et al. [CRS95] for a full treatment. We make the simplifying (and simplistic!) assumption that all covariates are measured without error.

2.6.2 Longitudinal vs. cross-sectional effects

Longitudinal data afford researchers the opportunity to study both within- and between-subject effects for certain covariates. For example, in aging research there is interest in the average effect of age on variables like physical functioning and depression. Suppose that for measurement j, Y_{ij} represents level of depression (measured, for example, using the Hamilton scale) and x_{ij} is age at assessment j. Age is exogenous, so a regression model can be used to estimate both the cross-sectional and longitudinal effects of age. The cross-sectional effect is a between-subject effect, and represents the difference in mean depression score between cohorts of individuals that differ by one unit in age; in the simplest formulation of the model, this difference is assumed to be time-constant. The longitudinal effect is a within-subject measure of association, and represents the average change in depression score per unit change in age within an individual. Clearly the longitudinal and cross-sectional effects measure different associations, and usually they take different values.

The linear model with identity link provides a convenient illustration (but see Laird [Lai04], Section 1.4 for a general treatment); the ideas here apply more generally to other models where the covariate is exogenous. For the model

E(Y_{ij} | x_{ij}) = \beta_0 + \beta_1 x_{i1} + \beta_2 (x_{ij} - x_{i1}),

\beta_1 measures the cross-sectional (between-subject) association and \beta_2 captures the longitudinal (within-subject) association. The choice of centering on x_{i1} is somewhat arbitrary; in this case, the cross-sectional effect is

\beta_1 = E(Y_{i1} | x_{i1} = x + 1) - E(Y_{i1} | x_{i1} = x),

or the between-subject effect of x at baseline. The longitudinal effect captures the within-subject effect of the same covariate,

\beta_2 = E(Y_{ij} - Y_{ij'} | x_{ij} - x_{ij'} = 1),

or more generally, E(Y_{ij} - Y_{ij'} | x_{ij}, x_{ij'}) = \beta_2 (x_{ij} - x_{ij'}). The simulation sketch below illustrates the decomposition.
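A small simulation (ours; all values are illustrative) shows the decomposition recovering distinct between- and within-subject effects. Ordinary least squares is used for simplicity; it ignores the within-subject correlation but still targets the same regression coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
n, J = 500, 4
beta0, beta1, beta2 = 10.0, 0.5, -0.3   # intercept, cross-sectional, longitudinal

age0 = rng.uniform(60, 80, n)                   # baseline age (between-subject)
age = age0[:, None] + np.arange(J)[None, :]     # within-subject aging
y = beta0 + beta1 * age0[:, None] + beta2 * (age - age0[:, None]) \
    + rng.normal(0, 1, (n, J))

# Regress on baseline age and change from baseline, as in the model above
X = np.column_stack([np.ones(n * J),
                     np.repeat(age0, J),
                     (age - age0[:, None]).ravel()])
coef, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
print(coef)   # approximately [10, 0.5, -0.3]
```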
Besides its importance for the interpretation of covariate effects, this decomposition is central to the design of longitudinal studies [DHLZ02a].

2.6.3 Marginal versus conditional covariate effects

As indicated earlier in this chapter, regression coefficients have distinctly different interpretations in marginal and conditional models. The marginal mean is defined as the mean response, averaged over the population from which the sample is drawn. Regression coefficients in directly specified marginal models capture the effects of covariates between groups of individuals defined by differences in covariate values; these effects are sometimes referred to as population-averaged effects [ZL92]. By contrast, regression coefficients in indirectly specified random effects models capture the effect of a covariate at the level of aggregation of the random effect. For example, in model (2.4), random effects capture individual-level variation; the regression coefficients \beta therefore represent covariate effects at the individual level. When the level of aggregation captured by the random effects is an individual or subject, the conditional regression parameters are sometimes referred to as subject-specific effects.

An important distinction in the interpretation of regression parameters from conditional models is whether the covariate varies only between individuals, or whether it also varies within individuals [FLW04]. Two examples serve to illustrate the point. Consider first a longitudinal study of smoking cessation [MAK+99, MLH+05], where the binary response is cessation status Y_{ij} at measurement occasions j = 1, ..., J. Further suppose that each individual is enrolled in one of two treatment programs, denoted by X_i in {0, 1}. A random effects logistic regression can be used to estimate a conditional effect of X,

logit{E(Y_{ij} | X_i, b_i, \beta)} = b_i + \beta_0 + \beta_1 X_i,

where (say) b_i ~ N(0, \sigma^2). The random effect b_i captures all variation in the mean of Y_{ij} that is not explained by X, and in that sense two individuals with the same value of b_i can be thought of as perfectly matched with respect to unobserved individual-level covariate effects. Put another way, b_i captures information that makes each individual unique. Hence the coefficient \beta_1 can be interpreted as the effect of treatment on smoking cessation for two individuals with the same unobserved covariate profile (i.e., the same value of b_i), or for two individuals whose predictors of smoking cessation, other than treatment, are exactly the same. (Here, as in all random effects models, it is assumed that unobserved characteristics can be captured with a scalar value b_i that follows a normal distribution. Both are strong assumptions, and the matching interpretation hinges critically on correct specification of the model.)
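The distinction matters numerically. The following sketch (ours; parameter values are illustrative) integrates the subject-specific logistic model over b_i ~ N(0, \sigma^2) by Monte Carlo and compares the implied population-averaged log odds ratio with the conditional coefficient \beta_1; the marginal effect is attenuated.

```python
import numpy as np
from scipy.special import expit, logit

rng = np.random.default_rng(3)
beta0, beta1, sigma = -1.0, 1.0, 2.0     # subject-specific (conditional) model

# Marginal success probabilities: average the conditional mean over b_i
b = rng.normal(0, sigma, 1_000_000)
p_marg = [expit(beta0 + beta1 * x + b).mean() for x in (0, 1)]

beta1_marginal = logit(p_marg[1]) - logit(p_marg[0])
print(beta1, beta1_marginal)   # 1.0 (conditional) vs. about 0.6 (marginal)
```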
When X_i is a covariate that could potentially be manipulated (such as treatment or environmental exposure), and if the study design is appropriate, then \beta_1 can sometimes be interpreted as the causal effect of X on Y, because it is the conditional effect of X given a covariate measure that is technically unique to each individual and captures information from omitted individual-specific covariates. If X_i cannot be manipulated, for example as with gender or race at the individual level, then causal interpretations are not appropriate, because they are not well defined and information on within-subject variation is not available. Chapter 13 of Fitzmaurice et al. [FLW04] provides a clear discussion of the key issues.

2.7 Further reading

1. Covariate endogeneity
2. Copula models
3. Marginally specified multilevel models
4. A recent review of GEE for longitudinal categorical data is found in Sutradhar [Sut03].
5. Functional data analysis
CHAPTER 3

Methods of Bayesian Inference

The Bayesian paradigm provides a natural venue for making inferences from incomplete data. Bayesian inference is characterized by first specifying a model, then specifying prior distributions for the parameters of the model, and finally updating the prior information on the parameters, using the model and the data, to obtain the posterior distribution of the parameters. Analyses with missing data involve assumptions that cannot be verified; assumptions about the missing data can be made explicit through prior distributions. This chapter reviews some key ideas in Bayesian statistics for complete data, including specifying prior distributions, computation of the posterior distribution, and strategies for assessing model fit. Many of our examples are based on models described in Chapter 2. In the following, we suppress conditioning on the model covariates X.

3.1 Likelihood

Consider observing J_i-dimensional vectors of longitudinal responses on subjects i = 1, ..., n, denoted Y_i = (Y_{i1}, ..., Y_{i,J_i})^T; independence is assumed across subjects. For a parameter \theta, the likelihood L(\theta | y) is any function proportional in \theta to the joint density evaluated at the observed sample,

L(\theta | y) \propto p(y_1, ..., y_n | \theta) = \prod_{i=1}^n p(y_i | \theta),    (3.1)

where p(y_i | \theta) denotes the probability density (or mass) function for the J_i observations on subject i. The likelihood summarizes the information the data y carry about the parameter \theta.
We will see in subsequent chapters that in incomplete-data settings the likelihood does not always provide information about all the components of \theta. Likelihoods for some of the models introduced in Chapter 2 are given next.

Example 3.1. Likelihood for the multivariate normal model with temporally aligned observations (continuation of Example 2.3).

Recall that for the model with temporally aligned observations, J_i = J. The likelihood takes the form

L(\theta | y) = \prod_{i=1}^n (2\pi)^{-J/2} |\Sigma(\phi)|^{-1/2} exp{-(1/2) e_i(\beta)^T \Sigma(\phi)^{-1} e_i(\beta)},    (3.2)

where e_i(\beta) = y_i - x_i \beta and \theta = (\beta, \phi).

Example 3.2. Likelihood for the multivariate normal model with temporally misaligned observations (continuation of Example 2.4).

Suppose subject i has J_i observations at times {t_{i1}, ..., t_{i,J_i}}. In Example 3.1, each individual was assumed to have a common set of J observation times. Let C(., .; \phi) be a covariance function as defined in Example 2.4. The likelihood contribution for subject i has the same form as (3.2), but with the (k, l) element of \Sigma_i parameterized via C(t_{ik}, t_{il}; \phi). Here, we are assuming the covariance for observations at times t_k and t_l is the same across subjects.

Example 3.3. Likelihood for the normal random effects model (continuation of Example 2.1).

The marginal likelihood (the likelihood after integrating out the random effects b_i) is typically used for inference. It is given by

L(\theta | y) = \prod_{i=1}^n \int (2\pi)^{-J/2} |\Sigma|^{-1/2} exp{-(1/2) e_i(\beta, b_i)^T \Sigma^{-1} e_i(\beta, b_i)} p(b_i | \Omega) db_i
= \prod_{i=1}^n (2\pi)^{-J/2} |\Sigma + W_i \Omega W_i^T|^{-1/2} exp{-(1/2) e_i(\beta)^T (\Sigma + W_i \Omega W_i^T)^{-1} e_i(\beta)},

where e_i(\beta, b_i) = y_i - x_i \beta - W_i b_i, e_i(\beta) = y_i - x_i \beta, p(. | \Omega) is a normal density with mean 0 and covariance matrix \Omega, and \theta = (\beta, \Sigma, \Omega).
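The integrated form can be evaluated directly, as in this sketch (ours; names are illustrative) for a random intercept, where W_i is a column of ones:

```python
import numpy as np
from scipy.stats import multivariate_normal

def marginal_loglik(y, X, W, beta, Sigma, Omega):
    """Marginal log-likelihood contribution of one subject in the normal
    random effects model: Y_i ~ N(X beta, Sigma + W Omega W')."""
    V = Sigma + W @ Omega @ W.T
    return multivariate_normal(mean=X @ beta, cov=V).logpdf(y)

J = 4
X = np.column_stack([np.ones(J), np.arange(J)])   # intercept + time
W = np.ones((J, 1))                               # random intercept design
print(marginal_loglik(np.array([1.2, 1.9, 2.7, 3.1]), X, W,
                      beta=np.array([1.0, 0.7]),
                      Sigma=0.25 * np.eye(J), Omega=np.array([[0.5]])))
```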
Example 3.4. Likelihood for the marginalized transition model of order 1 (continuation of Example 2.5).

The likelihood takes the form

L(\theta | y) = \prod_i {\pi_{i1}^{y_{i1}} (1 - \pi_{i1})^{1 - y_{i1}} \prod_{j=2}^J \phi_{ij}^{y_{ij}} (1 - \phi_{ij})^{1 - y_{ij}}},    (3.3)

where \phi_{ij} = pr(Y_{ij} = 1 | x_{ij}, y_{i,j-1}; \beta, \alpha) for j > 1 and \pi_{i1} = pr(Y_{i1} = 1 | x_{i1}; \beta). The first factor corresponds to the contribution from each subject at the initial time; the second, to each subject's contribution over the remaining times. The forms of these probabilities are given in Example 2.5. The likelihood for non-marginalized transition models takes the same form, but the conditional probabilities are not constrained by the model for the marginal mean (Diggle, Heagerty, Liang, and Zeger, 2002).

Example 3.5. Likelihood for the multivariate probit model (continuation of Example 2.6).

The likelihood takes the form

L(\theta | y) = \prod_i \int_{Q(y_i)} (2\pi)^{-J/2} |\Psi|^{-1/2} exp{-(1/2) (z_i - x_i \beta)^T \Psi^{-1} (z_i - x_i \beta)} dz_i,    (3.4)

where Q(y_i) is the set restricting the values of z_i; specifically, z_{ij} < 0 if y_{ij} = 0 and z_{ij} > 0 if y_{ij} = 1. Here, \theta = (\beta, \Psi). The likelihood takes a complex form, but this form will prove useful when computation is discussed in Section 3.3.

Score function and information matrix

Two important quantities related to the likelihood are the score function (or vector) and the information matrix. Define l(\theta | y) = log L(\theta | y). The score function S(\theta) is defined as

S(\theta) = d l(\theta | y) / d\theta.    (3.5)

When the score function is set to 0, the resulting set of equations is called the score equations. The root of the score equations is the maximum likelihood estimator of \theta and is typically used as the point estimator in frequentist inference. A variety of optimization procedures are
available for this purpose, including Newton-Raphson, Fisher scoring, and the EM algorithm (Dempster, Laird, and Rubin, 1977).

A second quantity of interest is the information matrix I(\theta), defined as

I(\theta) = - d^2 l(\theta | y) / d\theta d\theta^T.    (3.6)

This matrix quantifies the curvature of the log-likelihood at \theta. In a frequentist setting, uncertainty (standard errors) is typically computed using the diagonal elements of the inverse observed information, I(\hat\theta)^{-1}, and justified by large-sample theory (Lehmann, 1983). Although Bayesian methods combine the likelihood with a prior distribution for the parameter, the posterior distribution of \theta can usually be approximated with analogues of the score and information based on the likelihood and the prior (Tierney and Kadane, 1986).

The posterior distribution

Bayesian inference involves updating the prior distribution p(\theta) given the information in the data as quantified by the likelihood L(\theta | y). The posterior distribution of \theta is defined as

p(\theta | y) = L(\theta | y) p(\theta) / \int L(\theta | y) p(\theta) d\theta.    (3.7)

The denominator is a normalizing constant, so as a function of \theta we have p(\theta | y) \propto L(\theta | y) p(\theta). If p(\theta) is proportional to a constant, then the posterior mode of \theta will be equivalent to the maximum likelihood estimator. However, the posterior mode is not invariant to reparameterization of \theta via a one-to-one vector-valued function \theta* = h(\theta), since the induced prior on \theta* involves a Jacobian term and is not simply p(h^{-1}(\theta*)).

In most cases, expressing the posterior distribution in closed form is complicated because the denominator, the normalizing constant for the posterior density, is frequently an intractable high-dimensional integral (more on this later). The most common approach to inference using the posterior is to obtain a sample from the posterior distribution using techniques that do not require explicit evaluation of the integral in the denominator. The approximations discussed above will prove useful in implementing some of these sampling-based approaches (see Section 3.3.1). In later chapters, we call the posterior distribution consistent if it approaches a point mass at the true value of the parameters as the sample size goes to infinity.
Posterior Summaries

The posterior is usually summarized with a few carefully chosen quantities. For point estimation, the posterior mean or median are common choices. Uncertainty is represented either by the standard deviation of the posterior (whose counterpart is the standard error in frequentist analysis) or by a credible interval based on percentiles of the posterior distribution. For example, a 95% credible interval for a scalar parameter is often constructed using the 2.5th and 97.5th percentiles of the posterior distribution. For credible intervals that do not require equal mass in each tail, see Tanner (1996).

The choice of summary measures for the posterior distribution can be derived formally under a decision-theoretic framework by minimizing a loss function L(\theta, a), which is a function of the parameter \theta and its estimator a. From a Bayesian decision-theoretic perspective, a is chosen to minimize the expectation of this loss with respect to the posterior distribution p(\theta | y), and is called a Bayes estimator. For example, under squared error loss L(\theta, a) = (\theta - a)^2, the Bayes estimator is the posterior mean. We refer the reader to Berger (1985) or Carlin and Louis (2000) for other loss functions and additional details.

Hypothesis testing can be conducted formally using Bayes factors (ref.) or more informally by examining relevant features of the posterior distribution. As an example of the latter, a test of a point null hypothesis can be conducted by examining whether the credible interval covers the null value. For a one-sided test, we can quantify the strength of evidence by computing, e.g., the posterior probability that the parameter is greater than the null value. We will illustrate these approaches in the data examples in Chapter 4. Hypothesis testing as it relates to comparing the fit of different models is discussed in Section 3.4.
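These summaries are trivial to compute from a posterior sample, as the sketch below shows (ours; the draws are stand-ins for MCMC output of the kind produced in Section 3.3).

```python
import numpy as np

def summarize(draws, null_value=0.0):
    """Posterior summaries for a scalar parameter from MCMC draws."""
    lo, hi = np.percentile(draws, [2.5, 97.5])
    return {"mean": draws.mean(),
            "sd": draws.std(ddof=1),
            "95% credible interval": (lo, hi),
            "P(theta > null)": (draws > null_value).mean()}

rng = np.random.default_rng(4)
print(summarize(rng.normal(0.8, 0.4, 10_000)))   # stand-in posterior draws
```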
3.2 Prior Distributions

Prior distributions quantify a priori knowledge about \theta. Prior information can be represented through a probability distribution, usually called an informative prior. In the absence of prior information, vague prior beliefs can be reflected using diffuse probability distributions or flat priors (often called default or noninformative priors).

Identifiability

Particularly in incomplete-data settings, some components of \theta are well identified by the data and others are not. A parameter \theta_1 \subset \theta is not identified by the data when p(\theta_1 | y) = p(\theta_1) for any prior choice for \theta_1 that is independent of the other parameters; for a weakly identified parameter and/or for dependent priors, this equivalence will hold approximately. For a very careful discussion of identifiability, see Dawid (1979). We give a simple example next.

Example 3.6. Nonidentifiability in a mixture of bivariate normals.

Consider the following normal mixture model,

(Y_{i1}, Y_{i2}) | r = 1, \mu^{(1)}, \Sigma ~ N(\mu^{(1)}, \Sigma)
(Y_{i1}, Y_{i2}) | r = 0, \mu^{(0)}, \Sigma ~ N(\mu^{(0)}, \Sigma)
R | \phi ~ Ber(\phi),    (3.8)

where in the first component of the mixture (r = 1) we observe (Y_{i1}, Y_{i2}), and in the second component (r = 0) we observe only Y_{i1}. Suppose we specify independent priors on \mu^{(r)}, \Sigma, and \phi. Then one can show

p(\mu_2^{(0)} | y) \propto L(\mu, \Sigma | y) p(\mu_2^{(0)})
\propto [(2\pi)^{-n_0/2} \sigma_{11}^{-n_0/2} exp{-(1/2) \sum_{i: r_i = 0} (y_{i1} - \mu_1^{(0)})^2 / \sigma_{11}}] p(\mu_2^{(0)})
\propto p(\mu_2^{(0)}),

where \sigma_{11} is the (1, 1) element of \Sigma and n_0 is the number of subjects with Y_{i2} missing. The final proportionality holds because the observed-data likelihood does not contain \mu_2^{(0)}, so the posterior is proportional to the prior for \mu_2^{(0)}; clearly the data provide no information about \mu_2^{(0)}. In Chapter 5, we will discuss (implicit) priors that are often used to identify \mu_2^{(0)}.

In missing data problems, the Bayesian approach allows assumptions about non-identified parameters to be formalized through prior distributions. By contrast, in frequentist inference, non-identified parameters are typically fixed at some value, or constrained via untestable assumptions. We will discuss this issue in more detail in Chapters ??-8.

We begin the review of priors by introducing conjugate and related priors, which are the most commonly used priors in Bayesian modeling. We also mention that the parameters of the prior distribution itself are often termed hyperparameters. After reviewing common priors, we give recommendations on choosing priors for parameters in longitudinal models.

Conjugate priors
Conjugate priors are constructed such that the prior and posterior distributions are from the same family of distributions (e.g., both normal). As a result, the posterior distribution can be expressed in closed form.

Example 3.7. Conjugate prior for a normal linear regression model.

Consider a normal linear regression model, Y_i | X_i, \beta, \sigma^2 ~ N(X_i^T \beta, \sigma^2), with \sigma^2 known. A conjugate prior on the regression coefficients \beta is

\beta | \beta_0, V_\beta ~ N(\beta_0, V_\beta),    (3.9)

where the hyperparameters \beta_0 and V_\beta are known. The posterior distribution for \beta is N(\tilde\beta, \tilde{V}), where

\tilde\beta = (\sum_i X_i X_i^T / \sigma^2 + V_\beta^{-1})^{-1} {(\sum_i X_i X_i^T / \sigma^2) \hat\beta + V_\beta^{-1} \beta_0},
\tilde{V} = (\sum_i X_i X_i^T / \sigma^2 + V_\beta^{-1})^{-1},

and \hat\beta is the ordinary least squares estimator of \beta.
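The closed-form update is a few lines of linear algebra, as in this sketch (ours; names are illustrative), which exploits the identity (\sum_i X_i X_i^T) \hat\beta = \sum_i X_i y_i.

```python
import numpy as np

def conjugate_posterior(X, y, sigma2, beta0, V0):
    """Posterior N(beta_tilde, V_tilde) for beta in y ~ N(X beta, sigma2 I),
    with prior beta ~ N(beta0, V0) and sigma2 known (Example 3.7)."""
    V0_inv = np.linalg.inv(V0)
    prec = X.T @ X / sigma2 + V0_inv          # posterior precision
    V_tilde = np.linalg.inv(prec)
    beta_tilde = V_tilde @ (X.T @ y / sigma2 + V0_inv @ beta0)
    return beta_tilde, V_tilde

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1, 50)
print(conjugate_posterior(X, y, 1.0, np.zeros(2), 100 * np.eye(2))[0])
```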
When conjugate priors cannot be found for the model of interest, priors that are conditionally conjugate are often specified to facilitate posterior sampling via MCMC (see Section 3.3). To define conditional conjugacy, we first need to define a full conditional distribution. The full conditional distribution for a parameter \theta_j \subset \theta is the distribution of \theta_j given ({\theta_k : k != j}, y). A prior for a parameter (or set of parameters) is conditionally conjugate when it has the same form as its full conditional distribution. We give some examples next.

Example 3.8. Conditionally conjugate priors for a normal linear regression model.

Consider again the normal linear regression model from Example 3.7, now with \sigma^2 unknown. The normal prior on \beta is a conjugate prior conditional on \sigma^2; i.e., p(\beta | y, \sigma^2) is normal with mean and variance given above. However, p(\beta | y) is not a normal distribution. If we place a Gamma(a, b) prior on 1/\sigma^2, the full conditional distribution of 1/\sigma^2, p(1/\sigma^2 | \beta, y), is also a gamma distribution, with parameters (n/2 + a) and (1/b + \sum_i (y_i - X_i^T \beta)^2 / 2)^{-1}. The marginal posterior p(1/\sigma^2 | y), however, is not a gamma distribution. These two full conditional distributions can be sampled from to obtain a sample from the joint posterior distribution of (\beta, \sigma^2). Details will follow in Section 3.3.

Example 3.9. Conditionally conjugate priors for a multivariate normal model for temporally aligned observations (continuation of Example 2.4).

In this model, the conditionally conjugate prior for \beta is a multivariate normal prior and, for \Sigma^{-1}, a Wishart prior (see the appendix for a definition). When \Sigma is assumed to have a particular structure (recall Example 2.4), conjugate priors of known form are often not available.

Example 3.10. Conditionally conjugate priors for the normal random effects model (continuation of Example 2.1).

The conditionally conjugate priors for this model are the same as in Example 3.9, supplemented by a Wishart prior on \Omega^{-1}. For a summary of (conditionally) conjugate priors, see [give a reference].

Noninformative priors

In many models, some components of \theta are not of primary interest (nuisance parameters) and/or may not have prior information available. Various approaches are available to express ignorance via an appropriately specified prior distribution. These default or noninformative priors are often constructed to satisfy some reasonable criterion such as invariance, admissibility, minimaxity, and/or favorable frequentist properties (Berger and Bernardo, 1992; Mukerjee and Ghosh, 1997). For example, the Jeffreys prior (Jeffreys, 1961) is p(\theta) \propto |I(\theta)|^{1/2} and is invariant to the parameterization of \theta.

Example 3.11. Jeffreys prior for a normal linear regression model.

Again consider the regression model in Example 3.7 with \sigma^2 known. The Jeffreys prior for the regression coefficients \beta is p(\beta) \propto 1. The posterior for \beta is thus proportional to the likelihood and is a normal distribution, N(\tilde\beta, \tilde{V}), where \tilde\beta = \hat\beta
and \tilde{V} = \sigma^2 (\sum_i X_i X_i^T)^{-1}, and \hat\beta is the ordinary least squares estimator of \beta.

Unfortunately, the Jeffreys prior tends to have suboptimal properties in higher dimensions (Berger and Bernardo, 1992). Since Jeffreys' work, there have been considerable developments in this area. The reference priors of Berger and Bernardo (1992) try to overcome the flaws of the Jeffreys prior in multiple dimensions. In one-dimensional problems with a continuous parameter, where the Jeffreys prior typically works well, the reference prior is the same as the Jeffreys prior (Theorem 8, p. 26 of Bernardo, 2005). In multi-dimensional settings, however, the reference prior is superior (in terms of frequentist properties). Unfortunately, it is sometimes unclear how to order the parameters, an ordering that is needed to construct these priors (different orderings often yield different priors), and the priors can be difficult to derive (Bernardo, 2005). A variety of other default priors have been proposed in recent years, including the unit information priors of Kass and Wasserman (1995), probability matching priors (Datta and Ghosh, 1995; Mukerjee and Ghosh, 1997), and intrinsic priors (Berger and Pericchi, 1996). For a good review of noninformative priors, see Kass and Wasserman (1996) and Bernardo (2005).

Improper priors

Many of these vague/noninformative priors are not proper density functions; that is, \int p(\theta) d\theta = \infty. For example, the reference prior for the independent normal linear regression model is p(\beta, \sigma^2) \propto 1/\sigma^2, which is not integrable. When using improper priors, there is no guarantee that the posterior distribution will be proper in the sense that \int L(\theta | y) p(\theta) d\theta < \infty. The investigator either needs to show that the posterior is proper or replace the improper priors with proper or 'just proper' priors. As an example of a just proper prior, consider again the prior on the regression coefficients in the normal linear regression model. The Jeffreys prior (and the reference prior) is p(\beta) \propto 1; this is typically replaced by p(\beta) = N(0, kI), where k >> 0 is fixed. This prior is proper, but as k -> \infty it approaches the (improper) Jeffreys prior. WinBUGS does not permit improper priors; one must use just proper alternatives.

Default priors for variances
Priors for variances (and covariances) have increased importance for incomplete longitudinal data. With complete data, variances and covariances mostly matter for efficiency; with incomplete data, they have wider impact and, when specified without care, can result in incorrect estimation of mean parameters (see Chapter 6). Common default choices for variance parameters in a normal model include p(\sigma^2) \propto 1/\sigma^2 (Jeffreys), p(\sigma^2) \propto I{\sigma^2 > 0} (flat), and p(\sigma^2) \propto 1/(c + \sigma^2)^2 (uniform shrinkage), where c is fixed (and corresponds to the prior median). The uniform shrinkage prior is often used when the variance \sigma^2 is at the second level of a two-level model (see, e.g., Example 2.1), and c can be chosen as a prior guess of the value of the first-level variance. It is sensible when there is a prior belief that the variance component might actually be zero, but little other prior information is available (Christensen and Morris, 1997; Daniels, 1999). Among these, the Jeffreys prior tends to pull the variance toward zero and the flat prior pulls the variance away from zero. For a more detailed explanation and exploration of all these priors, including their frequentist properties, and further discussion of the choice of the constant c for the uniform shrinkage prior, see Daniels (1999). The just proper prior most frequently used in the literature is a gamma prior on 1/\sigma^2 with parameters (.001, 1/.001). Gelman (2005) recommends against using this prior, and instead recommends bounded uniform or folded noncentral t (or normal) priors on \sigma. All the proper priors for variances described above can be used in WinBUGS. For a summary of some of the most common default priors, see Table 3.1.

Table 3.1 Common noninformative priors

Distribution             Parameter   Priors
N(X^T \beta, \sigma^2)   \beta       p(\beta) \propto 1;  p(\beta) \propto (1 + \beta^2)^{-1}
                         \sigma^2    p(\sigma^2) \propto 1;  p(\sigma^2) \propto 1/\sigma^2
N(X^T \beta, \Sigma)     \Sigma      p(\Sigma) \propto |\Sigma|^{-(dim(\Sigma)+1)/2};  p(\Sigma) \propto 1

Priors for a covariance/correlation matrix

For a covariance matrix in a multivariate normal likelihood (see Example 2.3), the two most common default choices are the Jeffreys prior,
p(\Sigma) \propto |\Sigma|^{-(dim(\Sigma)+1)/2}, and the just proper Wishart distribution on \Sigma^{-1} with degrees of freedom \nu equal to dim(\Sigma). The latter choice avoids concerns about the propriety of the posterior that arise when using improper priors (and can be used in WinBUGS), but it requires the choice of the scale matrix \Sigma_0; this choice is difficult, and as a default \Sigma_0 is often chosen to be a diagonal matrix with marginal variances roughly consistent with the variation in the data, or chosen to be the maximum likelihood estimator of \Sigma. Also, this prior can be made noninformative only to a limited extent: the minimum value for \nu is dim(\Sigma), which can be thought of as a prior sample size, so if dim(\Sigma) is large and the sample size is small, this prior may be more informative than desired (Daniels and Kass, 1999). Other default priors proposed in the literature include the reference prior of Yang and Berger (1994), which is improper and lacks the conditional conjugacy property, but has superior frequentist (risk) properties to the Jeffreys prior.

For two-level models with a normal distribution on the random effects (for example, the models in Examples 2.1 and 2.2), the common choices are the just proper Wishart prior and the flat prior, p(\Omega) \propto 1 (see, e.g., Everson and Morris, 2000). Other priors proposed recently include a modification of the reference prior of Yang and Berger (Berger, Strawderman, and Tang, 2005) and the approximate uniform shrinkage prior of Natarajan and Kass (2000). Advantages of the Jeffreys and flat priors (as default choices) are that both are conditionally conjugate and neither requires the choice of a scale matrix \Sigma_0.

Priors for the parameters in a structured covariance matrix, e.g., a first-order autoregressive covariance structure, have been examined in Monahan (1983) and Ghosh and Heo (2003); this strategy can be viewed as a very strong prior restricting \Sigma to lie within a certain class of covariance models (Daniels, 2005); see the Further reading section for more details.

Default priors for correlation matrices C, which are needed for multivariate probit models (Example 2.6), have been proposed by Barnard et al. (2000), who consider either a joint prior that induces marginal uniform priors on the individual correlations in the correlation matrix, or a uniform prior on the space of correlation matrices. Neither is (conditionally) conjugate, but both are proper. The latter is a uniform prior on the compact subspace of the \nu(\nu-1)/2-dimensional cube [-1, 1]^{\nu(\nu-1)/2} on which the correlation matrix is positive definite, where \nu is the dimension of the correlation matrix. This prior can be used in WinBUGS.

Recommendations for using default priors in longitudinal models
For regression coefficients \beta, flat improper priors are a good choice, though if the posterior cannot be confirmed to be proper, just proper normal priors should be used. In addition, if the WinBUGS software is being used, just proper normal priors are the only choice. For variances, we recommend (truncated) uniform priors or folded normal priors; for second-level variances, uniform shrinkage priors have good properties. For covariance matrices, either the flat prior or the reference priors of Berger and colleagues would be recommended; unfortunately, both are improper and cannot be chosen in WinBUGS. As a result, the just proper Wishart is the most common choice. If structure is to be assumed, priors that shrink toward the structure are recommended (see Further reading for more information on these priors; Daniels and Pourahmadi, 2002; Daniels and Kass, 2001). Most of the analyses in later chapters will be done in software that does not allow improper default priors; in such cases, we will use the corresponding just proper priors.

Informative priors

Informative priors can be used in situations where prior information exists, whether in the form of historical information, expert opinion, or pilot data. With complete data, using strong priors has the effect of combining information from the prior with information from the data and model to form the posterior. This can be illustrated very clearly in the normal linear regression model (Example 3.7): the posterior mean is a weighted average of the ordinary least squares estimator \hat\beta (formed from the data and the model) and the prior mean \beta_0, with weights that are functions of the data and the prior variance V_\beta. The strength of the prior is (partially) determined by the magnitude of V_\beta, and the informativeness of the prior can be altered by modifying V_\beta appropriately.

With incomplete data, the role of the prior takes on considerably added importance. When data are missing, information about some components of \theta will derive only from the prior. This places a premium on methods for obtaining and modeling prior information. Due to the difficulty in constructing such priors, a common, though possibly not optimal, approach is to specify a particular family of distributions (often conditionally conjugate) and then elicit information from the investigator (expert opinion) to fill in values for the hyperparameters.

A key component is determining an easy-to-comprehend scale on which to accurately extract information from an expert about the hyperparameters
(or functions of them) (Chaloner, 1996); this is hard. As an example, in linear regression models, Kadane et al. (1980), and later Laud and Ibrahim (1996), proposed eliciting priors on regression coefficients by eliciting information about future data (Y's) and then deriving the induced hyperparameters on the regression parameters of interest; they viewed data predictions as a more understandable metric for the experts than the regression coefficients themselves. The power priors of Ibrahim and Chen (2000) are a class of informative priors that provide a way to introduce prior information based on historical data: the prior is specified to be proportional to the likelihood of the historical data raised to a subjectively chosen power, which determines how much prior weight is placed on the previous (historical) data. We will delve into informative priors in more detail in Chapter 9 in the setting of sensitivity analysis. The next step in conducting a Bayesian analysis is to use sampling-based approaches to compute the posterior for longitudinal models.

Large-sample behavior of the posterior, informative priors, and incomplete data

As the sample size grows, the likelihood dominates the prior for identified, but not unidentified, parameters (Theorem 7.78 in Schervish, 1997). In incomplete data problems, as mentioned earlier, the data and model often provide no information on certain parameters. Informative priors allow for identification of these parameters. In these settings, the posterior is often very similar to the prior for these same parameters, and the data never overwhelm the prior.

3.3 Computation of the Posterior Distribution

The 1990 paper by Gelfand and Smith marked the start of the revolution in Bayesian computation within the statistical community, providing an approach to obtain exact inferences for complex Bayesian hierarchical models. The late 1990s saw the advent of software to do these computations, including WinBUGS (ref and other software). The seminal idea underlying all these approaches is to obtain a sample from the posterior distribution, without the need for explicit evaluation of the integral in the denominator of (3.7), by constructing a Markov chain that has the posterior distribution of interest as its stationary distribution. Marginal posteriors and the posteriors of functions of the
parameters are easily obtained by Monte Carlo integration using the sample from the Markov chain. These approaches are often referred to as Markov chain Monte Carlo (MCMC) algorithms.

Gibbs sampler

The Gibbs sampler (Geman and Geman, 1984) is a Markov chain whose kernel is the product of the sequentially updated full conditional distributions of the parameters. To be more specific, consider a q-dimensional parameter vector \theta = (\theta_1, \theta_2, ..., \theta_q)^T and let \theta_i^{(k)} denote the sample of the ith component of \theta at iteration k. The algorithm samples from the following distributions sequentially:

1. \theta_1^{(k)} ~ p(\theta_1 | \theta_2^{(k-1)}, \theta_3^{(k-1)}, ..., \theta_q^{(k-1)}, y)
2. \theta_2^{(k)} ~ p(\theta_2 | \theta_1^{(k)}, \theta_3^{(k-1)}, ..., \theta_q^{(k-1)}, y)
...
q. \theta_q^{(k)} ~ p(\theta_q | \theta_1^{(k)}, \theta_2^{(k)}, ..., \theta_{q-1}^{(k)}, y).

A common extension is block sampling (Roberts and Sahu, 1997), whereby some components of \theta are sampled together, e.g., from p(\theta_1, ..., \theta_j | \theta_{j+1}^{(k-1)}, ..., \theta_q^{(k-1)}, y), instead of componentwise.

Example 3.12. Gibbs sampling for the normal random effects model (continuation of Example 2.1).

Define b^T = (b_1^T, ..., b_n^T). We consider conditionally conjugate prior distributions for the parameters,

\beta ~ N(\beta_0, V_\beta)
1/\sigma^2 ~ Gamma(a, b)
\Omega^{-1} ~ Wishart(dim(\Omega), \Omega_0).

Here we sample all the regression coefficients \beta, the random effects b_i, and the components of \Omega in blocks. The corresponding full conditional distributions are
\beta | b, \sigma^2, \Omega, y, X ~ N(A^{-1} {(1/\sigma^2) \sum_i X_i^T (y_i - W_i b_i) + V_\beta^{-1} \beta_0}, A^{-1})
b_i | \beta, \sigma^2, \Omega, y_i, X_i ~ N(B_i^{-1} (1/\sigma^2) W_i^T (y_i - X_i \beta), B_i^{-1})
1/\sigma^2 | \beta, b, \Omega, y, X ~ Gamma(\sum_i J_i / 2 + a, {1/b + (1/2) \sum_i e_i(\beta, b_i)^T e_i(\beta, b_i)}^{-1})
\Omega^{-1} | \beta, b, \sigma^2, y, X ~ Wishart(n + dim(\Omega), (\Omega_0^{-1} + \sum_i b_i b_i^T)^{-1}),

where A = \sum_i X_i^T X_i / \sigma^2 + V_\beta^{-1}, B_i = W_i^T W_i / \sigma^2 + \Omega^{-1}, and e_i(\beta, b_i) = y_i - X_i \beta - W_i b_i. Each full conditional is available in closed form, demonstrating the computational advantage of using conditionally conjugate priors and the ease with which a sample from the posterior can be obtained using Gibbs sampling; a stripped-down implementation is sketched below.

Having all the full conditional distributions available in closed form is atypical. We go through a more representative example next.

Example 3.13. Gibbs sampling for the logistic random effects model (continuation of Example 2.2).

We specify the priors as in Example 3.12. Recall that logit(\mu_{ij}^b) = X_{ij} \beta + W_{ij} b_i. The full conditional distributions here are

p(\beta | b, \Omega, y, X) \propto [\prod_i \prod_j (\mu_{ij}^b)^{y_{ij}} (1 - \mu_{ij}^b)^{1 - y_{ij}}] p(\beta)
p(b_i | \beta, \Omega, y, X) \propto exp(-(1/2) b_i^T \Omega^{-1} b_i) \prod_j (\mu_{ij}^b)^{y_{ij}} (1 - \mu_{ij}^b)^{1 - y_{ij}}.

The full conditionals of \beta and b_i are not available in closed form; the full conditional distribution for \Omega^{-1}, however, is the same as in Example 3.12.

As stated earlier, the inability to obtain all the full conditional distributions in closed form is not an uncommon occurrence; in some cases it is due to the use of non-conjugate priors. For example, Examples 2.4 and 2.5 have at least one full conditional with an unknown form. Clearly, approaches are needed to sample from the full conditional distributions in such cases. There are many techniques available to do this; we review one such approach in the next section.
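The following sketch (ours; all names and hyperparameter values are illustrative) implements the Gibbs updates of Example 3.12 in the special case of a random intercept, so that \Omega is a scalar \omega^2 and the Wishart update reduces to a gamma update.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated random intercept data: y_ij = beta_0 + beta_1 t_j + b_i + e_ij
n, J = 100, 5
t = np.arange(J, dtype=float)
b_true = rng.normal(0, 1.0, n)
Y = 1.0 + 0.5 * t + b_true[:, None] + rng.normal(0, 0.5, (n, J))
X = np.column_stack([np.ones(J), t])          # design shared by all subjects

# Priors: beta ~ N(0, 100 I); 1/sigma2, 1/omega2 ~ Gamma(shape=.01, scale=100)
a, sc = 0.01, 100.0
beta, sigma2, omega2, b = np.zeros(2), 1.0, 1.0, np.zeros(n)
draws = []
for it in range(2000):
    # beta | b, sigma2: normal full conditional
    A = n * X.T @ X / sigma2 + np.eye(2) / 100.0
    m = X.T @ (Y - b[:, None]).sum(axis=0) / sigma2
    beta = rng.multivariate_normal(np.linalg.solve(A, m), np.linalg.inv(A))
    # b_i | beta, sigma2, omega2: normal full conditionals
    prec = J / sigma2 + 1.0 / omega2
    b = rng.normal(((Y - X @ beta).sum(axis=1) / sigma2) / prec,
                   np.sqrt(1.0 / prec))
    # 1/sigma2 | beta, b: gamma full conditional
    resid = Y - X @ beta - b[:, None]
    sigma2 = 1.0 / rng.gamma(a + n * J / 2,
                             1.0 / (1.0 / sc + 0.5 * (resid ** 2).sum()))
    # 1/omega2 | b: gamma full conditional
    omega2 = 1.0 / rng.gamma(a + n / 2, 1.0 / (1.0 / sc + 0.5 * (b ** 2).sum()))
    draws.append(np.r_[beta, sigma2, omega2])
print(np.mean(draws[500:], axis=0))           # posterior means after burn-in
```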
Sampling from non-standard full conditional distributions: the Metropolis-Hastings algorithm

The most commonly used approach to sample from non-standard full conditional distributions is the Metropolis-Hastings algorithm (Hastings, 1970). We will describe its implementation within a Gibbs sampling algorithm; this involves constructing another Markov chain (within the Gibbs sampler) that has as its stationary distribution the full conditional distribution of interest. For simplicity, denote the full conditional distribution of interest as p(\theta_j^{(k)} | {\theta_l^{(k)} : l < j}, {\theta_l^{(k-1)} : l > j}, y); in the following, for clarity, we will remove the conditioning and write the full conditional as p(\theta_j^{(k)}). At the kth iteration, \theta_j^{(k)} is sampled from some candidate distribution q(\theta_j^{(k)} | \theta_j^{(k-1)}), and the sampled value is accepted with probability

\alpha = min{1, [p(\theta_j^{(k)}) q(\theta_j^{(k-1)} | \theta_j^{(k)})] / [p(\theta_j^{(k-1)}) q(\theta_j^{(k)} | \theta_j^{(k-1)})]}.    (3.10)

This choice of acceptance probability ensures that the stationary distribution of the Markov chain is the full conditional distribution of interest; if the candidate is rejected, the chain remains at \theta_j^{(k-1)}.

An important practical issue is the choice of candidate density q(.). Setting q(\theta_j^{(k)} | \theta_j^{(k-1)}) = N(\theta_j^{(k-1)}, v), where v is a fixed variance, yields the random walk Metropolis-Hastings algorithm. This candidate is symmetric, q(\theta_j^{(k)} | \theta_j^{(k-1)}) = q(\theta_j^{(k-1)} | \theta_j^{(k)}), so \alpha reduces to the ratio of the full conditionals evaluated at the proposed and previous values; i.e.,

\alpha = min(1, p(\theta_j^{(k)}) / p(\theta_j^{(k-1)})).    (3.11)

The optimal acceptance rate for this choice is around 20-30% (Brooks and Gelman, 1998). Higher acceptance rates will result in \theta_j moving very slowly around the posterior distribution and a highly autocorrelated sample of \theta_j, which is not efficient; these issues will be discussed in detail in Section 3.3.3. A common initial choice for v is some value proportional to the squared standard error, usually with proportionality constant less than one; this initial value can be adjusted by trial and error to attain the optimal acceptance rate.

Another common choice for q is an approximation to the full conditional
Another common choice for q is an approximation to the full conditional distribution p(·) that does not depend on the previous value θ_j^{(k-1)}. For example, one might construct a normal approximation to the full conditional distribution based on the modified score vector and information matrix, as discussed in Section 3.1; this usually involves implementing an optimization algorithm (e.g., Newton-Raphson) at each iteration. In this case, (3.10) can be rewritten as a ratio of importance weights,

α = min{ 1, [ p(θ_j^{(k)}) / q(θ_j^{(k)}) ] / [ p(θ_j^{(k-1)}) / q(θ_j^{(k-1)}) ] },   (3.12)

where q(·) is the approximation to the full conditional distribution. For this approach, the optimal acceptance rate is as close to 100% as possible; poor choices of q result in low acceptance, often manifested as long runs in which no candidate value is accepted. Heavier-tailed candidates will often improve efficiency (Chib and Greenberg, 1998; Daniels and Gatsonis, 1999). Choosing the candidate distribution by approximating the full conditional has been implemented with considerable success for sampling the fixed effects and random effects, β and b_i respectively, in GLMMs (Example 2.2); for examples, see Daniels and Gatsonis (1999) and Chib and Greenberg (1998).

Recommendations

The random walk Metropolis-Hastings algorithm typically works best for sampling one parameter at a time and is computationally inexpensive (there is no need to carry out a maximization or compute derivatives). However, it does not usually generalize well to sampling blocks of parameters, because of the difficulty of specifying an appropriate covariance matrix for the candidate distribution; at the same time, sampling the parameter vector θ componentwise can itself result in poor performance of the MCMC algorithm (more details in Section 3.3.3). Using an approximation to the full conditional distribution as the candidate often works well when the per-iteration maximization (computing derivatives and taking Newton-Raphson steps) is not too expensive, and it is a good approach when trying to sample more than one parameter simultaneously. Heavy-tailed approximations, e.g., a t-distribution instead of a normal distribution, will often increase efficiency (Chib and Greenberg, 1998). Other approaches to sampling non-standard full conditionals are given in the Further Reading section.
Data augmentation

As discussed previously, it is often the case that some of the full conditional distributions of p(θ | y) are difficult to sample. An alternative to Metropolis-Hastings is to introduce latent data Z such that p(θ | z, y) and p(z | θ, y) are both easy to sample, and/or such that the mixing properties of the sampler improve (Tanner and Wong, 1987; van Dyk and Meng, 2001). Tanner and Wong coined the term data augmentation for this device. The approach proceeds in two steps:

1. Sample from p(z | θ, y).
2. Sample from p(θ | z, y).

Data augmentation is well suited to two related situations. First, complete-data problems that inherently have latent structure based on known distributions, e.g., probit or random effects models; we illustrate this in the examples below. Second, data augmentation is a natural way to deal with incomplete data. For incomplete-data problems, we typically specify the model for the complete data, and it is often easier to work with the complete-data posterior than with the posterior based only on the observed data. In this case, Z corresponds to the missing data: sampling from p(θ | z, y) is sampling from the complete-data posterior, and sampling from p(z | θ, y) is sampling from the posterior predictive distribution of the missing data. Chapter 5 provides details on data augmentation in this setting. In the following, we provide details on data augmentation for some of the longitudinal models introduced in Chapter 2, corresponding to the first situation, where data augmentation can make sampling from the posterior considerably easier.

Example. Data augmentation for the multivariate probit model (cont. of Example 2.6)

For simplicity, assume J = 1. The posterior for this model is proportional to

[ Π_i Φ(X_i β)^{y_i} Φ(−X_i β)^{1 − y_i} ] p(β),   (3.13)

where Φ is the standard normal cdf and p(β) is typically chosen to be a (multivariate) normal prior. The complicated likelihood factor of this expression (see Example 2.6), L(β | y) ∝ Π_i Φ(X_i β)^{y_i} Φ(−X_i β)^{1 − y_i}, is what is maximized in a frequentist analysis. Clearly, the posterior distribution of β is not available in closed form.
We can use data augmentation here to simplify sampling from the posterior. We exploit the existence of latent variables Z_i such that

L(β | y) = Π_i ∫_{z_i ∈ Q(y_i)} p(z_i | β) dz_i,

where Q(y_i) is the set that restricts z_i to be positive if y_i = 1 and negative if y_i = 0, and p(z_i | β) is the pdf of a normal distribution with mean X_i β and variance 1 (cf. Example 2.6). With the introduction of the latent z_i, sampling from the posterior distribution of β proceeds quite easily in two steps:

1. Sample from p(β | z, y): a normal distribution (assuming p(β) is a normal distribution).
2. Sample from p(z | β, y): truncated normal distributions.

A sketch of this sampler is given below.
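Here is a minimal sketch of the two-step probit data augmentation sampler, assuming a N(0, V₀) prior on β; it relies on SciPy's truncated normal generator, and all names are illustrative.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

def probit_da_gibbs(y, X, V0_inv, n_iter=2000):
    """Data-augmentation Gibbs sampler for the probit model (J = 1).

    y: (n,) 0/1 responses; X: (n, p) design matrix; V0_inv: precision of
    the N(0, V0) prior on beta. Returns posterior draws of beta.
    """
    n, p = X.shape
    beta = np.zeros(p)
    A_inv = np.linalg.inv(X.T @ X + V0_inv)   # posterior covariance of beta | z
    draws = np.empty((n_iter, p))
    for k in range(n_iter):
        # Step 1: z_i | beta, y_i ~ N(x_i beta, 1), truncated to (0, inf)
        # if y_i = 1 and to (-inf, 0) if y_i = 0.
        mu = X @ beta
        lo = np.where(y == 1, -mu, -np.inf)   # standardized lower bound
        hi = np.where(y == 1, np.inf, -mu)    # standardized upper bound
        z = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
        # Step 2: beta | z ~ N(A_inv X'z, A_inv).
        beta = rng.multivariate_normal(A_inv @ (X.T @ z), A_inv)
        draws[k] = beta
    return draws
```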
Multivariate t-distributions are sometimes specified instead of normal distributions to obtain more robust inferences; the heavier tails of the t-distribution make it more robust to outliers (Lange, Little, and Taylor, 1989). In addition, they can be used to build multivariate logistic models for longitudinal binary data (Example 6.14). Unfortunately, regardless of the prior specification, the full conditional distributions of the location and scale parameters of the t-distribution are not available in closed form. However, we can use data augmentation to facilitate sampling of the location and scale parameters.

Example. Data augmentation for the multivariate t-distribution (cont. of Example 2.4)

To implement data augmentation for the t-distribution, we use the following relationship between the multivariate t-distribution and the multivariate normal distribution: if Y_i | µ, Σ ∼ t_ν(µ, Σ), then its density can be written as

∫ τ_i^{J/2} (2π)^{-J/2} |Σ|^{-1/2} exp{ −(τ_i/2) (y_i − µ)^T Σ^{-1} (y_i − µ) } p(τ_i | ν) dτ_i,   (3.14)

where p(τ_i | ν) is a Gamma density with parameters (ν/2, 2/ν). Using this result, that t-distributions can be written as gamma mixtures of normals, data augmentation can greatly simplify sampling. This device was first used in the context of Gibbs sampling by Albert and Chib (1993). By working with the parameter space expanded by the latent variables τ_i, instead of integrating them out as in (3.14), we can sample from the posterior of (µ, Σ) using the following steps (assuming a multivariate normal prior on µ and a Wishart prior on Σ^{-1}):

1. Sample from p(τ_i | µ, Σ, y): independent gamma distributions.
2. Sample from p(µ, Σ | τ, y) using Gibbs sampling:
   (a) sample from p(µ | Σ, τ, y): a multivariate normal distribution;
   (b) sample from p(Σ^{-1} | µ, τ, y): a Wishart distribution.

As a final example, we show that the Gibbs sampler constructed for the normal random effects model in Example 3.12 implicitly used data augmentation.

Example. Data augmentation for the normal random effects model (cont. of Example 2.1)

As noted in Example 2.1, the random effects b_i can be integrated out in closed form. Thus, the posterior distribution could in principle be sampled using only the full conditional distributions of β, Ω, and σ² based on the integrated likelihood (3.3) and the priors. However, those full conditional distributions do not have known forms, and the full conditional for Ω in particular is very hard to sample (Daniels, 1998; Daniels and Kass, 1999). On the other hand, if the random effects are not integrated out but are sampled from their own full conditional distributions, the full conditional distribution of Ω^{-1} is a Wishart distribution (cf. Example 3.12) and that of 1/σ² is a Gamma distribution. Hence, from a computational perspective, we can view the random effects b_i as latent variables augmenting the marginal model

Y_i ∼ N(X_i β, W_i Ω W_i^T + σ² I).   (3.15)

We have now reviewed the tools needed to obtain a sample from the posterior. Next, we discuss how to use this sample for posterior inference.

3.3.3 Inference using the sample from the posterior

The sample generated by Gibbs sampling or other MCMC algorithms must be used carefully. At least two issues need to be considered to make accurate and correct inferences:

1. When do we consider the Markov chain to have reached the stationary distribution, i.e., the posterior distribution? (The period before reaching it is often referred to as the burn-in.)
2. After the burn-in, how many additional samples are needed in order to make accurate inferences? This can be tricky, since MCMC approaches generate a dependent sample from the posterior distribution of the parameters.

For the first issue, common practice is to discard the first K samples as burn-in, giving the sampler time to find the stationary distribution, i.e., the posterior. To make sure the chain has actually converged to the correct place, it is often recommended to run multiple chains with different starting values and check that they all converge to the same region of the parameter space (Gelman and Rubin, 1992); Gelman and Rubin proposed a statistic to quantify this based on the variability within and between the multiple chains. We refer the reader to Cowles and Carlin (1996) for more details on this and other approaches to assessing convergence. Figure 3.1 illustrates the concept of burn-in; there, the sampler appears to reach the stationary distribution quickly, after 7 or 8 iterations (i.e., when the multiple chains come together).

Figure 3.1 Plot of multiple chains illustrating the concept of burn-in.

Once the sampler has converged (i.e., found the stationary distribution), the second issue is how efficiently it moves around the posterior space, which determines how many samples are needed to obtain the required precision for inference. Figure 3.2 is an autocorrelation plot, which gives the lag-k correlations in the chain; the autocorrelation appears negligible by lag 12 or 13. If the chain moves slowly around the space (high autocorrelation), more samples need to be obtained; for example, 10,000 autocorrelated samples might be equivalent to only 500 independent samples (see Tierney, 1994, for details). Figures 3.3 and 3.4 show samplers that are mixing well and poorly, respectively; in Figure 3.4, each individual chain is clearly moving slowly around the parameter space. To speed up mixing, block Gibbs sampling or non-random-walk Metropolis-Hastings approaches (one of which was discussed in Section 3.3.1) are often used.

Figure 3.2 Autocorrelation plot; the lag-k correlation is plotted versus k.

Figure 3.3 Plot illustrating good mixing.

Figure 3.4 Plot illustrating bad mixing.
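A minimal sketch of the within- and between-chain comparison just described, in the form of the Gelman-Rubin potential scale reduction factor; the exact statistic in the cited papers may differ in detail.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor from m chains of length n.

    chains: (m, n) array of draws of a scalar parameter. Values near 1
    suggest the chains have mixed into the same region; values well
    above 1 indicate the burn-in should be extended.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)
```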
Computation of summaries of the posterior, e.g., means, medians, and percentiles, is straightforward: simply calculate the appropriate summary from the posterior sample after discarding the burn-in. However, to attach a Monte Carlo standard error to such summaries, care must be taken to account for the autocorrelation in the chain. One approach is thinning: keep only every kth iteration, where k is chosen large enough that the lag-k correlation is small (cf. Figure 3.2), so that one obtains an approximately independent sample. This approach is inefficient, however, as a fraction (k − 1)/k of the sample is discarded. Batching is a more efficient approach. It involves breaking the output of the chain into m successive batches, with batch length large enough that the correlation between batch means is negligible; the Monte Carlo variance of the posterior summary is then computed as the suitably normalized corrected sum of squares of the batch means (see Carlin and Louis, 2000). For other approaches, including some based on time series methodology, see Chapter 5 of Carlin and Louis (2000).
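The batching calculation just described can be sketched as follows; the batch count is an illustrative choice.

```python
import numpy as np

def batch_means_se(draws, n_batches=30):
    """Monte Carlo standard error of the posterior mean via batch means.

    draws: 1-D array of post-burn-in MCMC draws of a scalar. The chain is
    split into consecutive batches assumed long enough that batch means
    are approximately uncorrelated.
    """
    n = len(draws) - len(draws) % n_batches       # drop the remainder
    batches = draws[:n].reshape(n_batches, -1).mean(axis=1)
    return batches.std(ddof=1) / np.sqrt(n_batches)
```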
Marginal posteriors of (any function of) the parameters are obtained by summarizing the realizations of the parameter (or the function of the parameters computed at each iteration) from the posterior sample. We can increase the efficiency of estimating posterior summaries (i.e., obtain less variability using fewer iterations) by using a technique coined Rao-Blackwellization (Gelfand and Smith, 1990; Liu, Wong, and Kong, 1994), which is illustrated next.

Example. Rao-Blackwellization in the normal random effects model (cont. of Example 2.1)

Suppose we want to estimate the posterior mean of the random effects for subject i, E[b_i | y]. The obvious approach is to compute the sample mean of the draws of b_i from the posterior sample. However, this can be estimated more precisely by computing the conditional expectation E[b_i | β, Σ, y], which is available in closed form, at each iteration and averaging these values, i.e., computing E(b_i | y) using the form

∫ E(b_i | β, Σ, y) p(β, Σ | y) dβ dΣ.   (3.16)

This approach leads to increased efficiency, since there is much less variability in E[b_i | β, Σ, y] than in samples b_i ∼ p(b_i | β, Σ, y). By using Rao-Blackwellization, we can decrease the number of samples needed to attain a required precision.

A note on improper priors and Gibbs sampling

We end this section on posterior sampling with a cautionary remark on using improper priors in Gibbs sampling (Hobert and Casella, 1996), motivated by an example. Consider the normal random effects model in Example 2.1, and suppose we assume the common default (improper) prior on Ω, p(Ω) ∝ |Ω|^{-(p+1)/2}, where p = dim(Ω). The full conditional distribution of Ω^{-1} is then a proper Wishart distribution; however, the posterior distribution of Ω^{-1} is improper. This scenario, in which all the full conditionals are proper distributions but the posterior is improper, was first pointed out by Hobert and Casella (1996); even more problematic, the sample from the improper posterior may not indicate any problem at all. So, when improper priors are used, the propriety of the posterior distribution must be verified analytically; otherwise, improper priors should not be used. If WinBUGS is being used this is not a concern, as it does not allow improper priors; for investigators writing their own code, however, it can be an important issue.

Now that we have the tools to conduct posterior inference, the next step is to determine which of several competing models to base inference on, and whether such a model fits the data well. We address these questions next.

3.4 Model comparison and model fit

When fitting several parametric models to a dataset, two related questions arise. First, of the models under consideration, which model fits best? And second, do any of the models fit the data well? The former is a statement of relative fit, while the latter concerns absolute fit. Two common, relatively easy-to-compute criteria often used to answer the first question are the Deviance Information Criterion (DIC) (Spiegelhalter, Best, Carlin, and van der Linde, 2002) and posterior predictive loss (PPL) (Gelfand and Ghosh, 1998). A good criterion takes into account goodness of fit while not rewarding models for overfitting (a complexity penalty). Both criteria have this property and can be decomposed into a goodness-of-fit term and a penalty term for overfitting the data (too many parameters). Before giving details on both criteria, we first address some issues with implementing these checks for incomplete data (details in Chapters 6 and 7). In the case of incomplete data, these measures of fit can only tell the investigator how well the model fits the observed data.
For conducting sensitivity analyses of unverifiable assumptions about the missing data (Chapter 9), a key component is ensuring that the models maintain a good fit to the observed data. We discuss the use of these criteria with incomplete data in Chapters 6 and 7.

Deviance Information Criterion (DIC): This is a model-based criterion built on the deviance, a linear function of the log-likelihood: Dev(θ) = −2 log L(θ | y) + constant. Model complexity is quantified using the effective number of parameters, p_D, defined as

p_D = E[Dev(θ) | y] − Dev(E[θ | y]).

From this expression, it is clear that as the variability in the posterior of θ decreases, p_D → 0. So a model may give a good fit, as measured by Dev(E[θ | y]), but if there is a lot of variability in θ, E[Dev(θ) | y] will be large, resulting in a large penalty.

p_D can also be understood using the concept of the residual information in the data y conditional on θ, defined as −2 log p(y | θ) (Kullback and Leibler, 1951; Burnham and Anderson, 1998); recall that L(θ | y) ∝ p(y | θ). Define θ̂ to be an estimator of θ and θ_t to be the true parameter value. The difference between the residual information at the true parameter value and at the estimated parameter value,

−2 log p(y | θ_t) + 2 log p(y | θ̂),   (3.17)

can be interpreted as the degree of overfitting due to the influence of y on the estimator θ̂. In a Bayesian analysis θ is random, so we replace (3.17) with its posterior expectation; this posterior expectation is the effective number of parameters, p_D. The criterion itself is written as the sum of a goodness-of-fit term Dev(E[θ | y]) and a penalty (complexity) term 2 p_D,

DIC = Dev(E[θ | y]) + 2 p_D,   (3.18)

a form very similar to the Akaike Information Criterion (AIC) (Akaike, 1973). It can also be rewritten in the following equivalent form, explicitly as a function of the log-likelihood,

DIC = −4 E_θ[ log L(θ | y) | y ] + 2 log L(E[θ | y] | y),   (3.19)

which will be the more convenient form for its development in the setting of incomplete data in Chapter 6. The DIC is easy to compute, requiring only the two quantities E[Dev(θ) | y] and Dev(E[θ | y]) from the output of MCMC approaches.
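A minimal sketch of that computation from MCMC output; the function interface is illustrative.

```python
import numpy as np

def dic(loglik_draws, theta_draws, loglik_fn):
    """DIC from MCMC output, following (3.18)-(3.19).

    loglik_draws: log L(theta^(k) | y) at each posterior draw;
    theta_draws: (K, p) array of draws; loglik_fn: evaluates
    log L(theta | y) at one theta (here the posterior-mean plug-in).
    """
    dev_bar = -2.0 * np.mean(loglik_draws)                    # E[Dev | y]
    dev_at_mean = -2.0 * loglik_fn(theta_draws.mean(axis=0))  # Dev(E[theta|y])
    p_d = dev_bar - dev_at_mean              # effective number of parameters
    return dev_at_mean + 2.0 * p_d, p_d
```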
This has contributed to its widespread use. An advantage of the DIC over approaches like AIC, where the user specifies the number of parameters, is that the (effective) number of parameters is counted automatically. This is particularly helpful in multilevel models, where the number of parameters is sometimes ambiguous. As an example, consider the normal random effects model (Example 2.1) with the conditional likelihood given by

L(θ | y) = Π_{i=1}^n (2π)^{-J/2} |Σ|^{-1/2} exp{ −½ (y_i − X_i β − W_i b_i)^T Σ^{-1} (y_i − X_i β − W_i b_i) },   (3.20)

where θ = (β, b_1, ..., b_n, Σ); (3.20) is the likelihood conditional on the random effects b_i. Counting just the random effects (and assuming they are one-dimensional), there are n parameters. The effective number, however, can be considerably smaller, because the random effects distribution does some shrinkage: as the variance of the random effects distribution goes to zero, there are fewer and fewer effective parameters (if the variance is exactly zero, all the random effects are identically zero, so there are in fact no parameters), while as the variance increases, the effective number of parameters approaches n.

Despite its computational simplicity, the DIC does have drawbacks, with respect both to the specification of the likelihood and to the parameterization of θ. The best model as determined by the DIC can change depending on the choice of likelihood (see Trevisani and Gelfand, 2003); for example, in the normal random effects model (Example 2.1), the likelihood can take one of two forms, either (3.3) or (3.20). In addition, the criterion is not invariant to the parameterization of θ. Dev(θ) requires a plug-in estimator for θ based on the posterior; for ease of computation within sampling-based approaches, the posterior mean is typically chosen (though other choices, like the posterior mode, are equally valid). For a one-to-one function h(·) of θ, in general E[h(θ) | y] ≠ h(E[θ | y]). For the model in Example 2.3, θ could be defined as (β, Σ^{-1}) or as (β, Σ); using Σ versus Σ^{-1} will result in different values of the DIC (see Data Example 4.1 in Chapter 4).

A final issue is that for some models the likelihood is not available in closed form (cf. Example 2.6). To evaluate the likelihood for such models, we can often use some form of Monte Carlo integration. For the multivariate probit model, given (β, Σ), we need to compute

E_z[ Π_j I{z_ij > 0}^{y_ij} I{z_ij < 0}^{1 − y_ij} ],

where z_i is multivariate normal with mean X_i β and covariance Σ. We can sample from the distribution of z_i
and compute the expectation by averaging the term inside the expectation over the samples. It would be computationally prohibitive to do this for every sampled value of θ = (β, Σ), so another possibility is to take a likely value, say θ* = E[θ | y], sample z only at θ*, and then, to compute the likelihood at other values of θ, re-weight that sample using weights of the form

w_l = p(z^{(l)} | θ) / p(z^{(l)} | θ*),   (3.21)

where p(· | θ) is a multivariate normal density with parameters θ. We implement this in Chapter 6. For recommendations on the choice of likelihood and parameterization, we refer the reader to the original paper by Spiegelhalter et al. (2002); for variances and covariance matrices, they recommend using the inverses.

Posterior Predictive Loss: Posterior predictive loss (PPL) (Gelfand and Ghosh, 1998) is another criterion for determining which model fits best. This approach quantifies the fit of a model by determining how close features of the model-based posterior predictive distribution are to the equivalent features of the observed data, as measured by some user-chosen loss function. We provide some details next. Consider a loss function L(y_rep, a; y), where a is chosen to minimize the expectation of the loss with respect to the posterior predictive distribution, E[L(y_rep, a; y) | y], i.e., to minimize the posterior predictive loss. For particular choices of L(·), the minimization can be done in closed form, and a statistic of similar form to the DIC can be constructed (i.e., a goodness-of-fit term with a penalty term). Gelfand and Ghosh consider loss functions of the form

L_k(y_rep, a; y_obs) = L_1(y_rep, a) + k L_2(y_obs, a),  k ≥ 0.   (3.22)

If L_1 and L_2 are chosen as squared error loss, then it can be shown that

min_a E[L_k(y_rep, a; y) | y] = Σ_{l=1}^n σ_l² + { k/(k + 1) } Σ_{l=1}^n (µ_l − y_{l,obs})²,   (3.23)

where µ_l = E(Y_{l,rep} | y) = ∫∫ y_{l,rep} p(y_{l,rep} | θ) p(θ | y) dθ dy_{l,rep} is the posterior predictive mean and σ_l² = Var(Y_{l,rep} | y), defined similarly, is the posterior predictive variance.
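A minimal sketch of (3.23) computed from posterior predictive draws; treating k = ∞ as weight one is our convention, not the book's.

```python
import numpy as np

def ppl(y_obs, y_rep, k=np.inf):
    """Posterior predictive loss (3.23) under squared error loss.

    y_obs: (n,) observed responses; y_rep: (K, n) draws from the
    posterior predictive distribution; k: weight on goodness of fit.
    """
    mu = y_rep.mean(axis=0)                       # posterior predictive means
    penalty = y_rep.var(axis=0, ddof=1).sum()     # P(k): predictive variances
    weight = 1.0 if np.isinf(k) else k / (k + 1.0)
    gof = weight * np.sum((mu - y_obs) ** 2)      # G(k): goodness of fit
    return penalty + gof
```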
The first term in (3.23), P(k) = Σ_l σ_l², is a penalty term: overfitting the model will result in large predictive variances σ_l² and a large value of P(k). The second term, G(k) = { k/(k + 1) } Σ_{l=1}^n (µ_l − y_{l,obs})², is a goodness-of-fit term, which will decrease as more complex models are fit. This statistic is easy to compute using samples from the posterior predictive distribution. We note some additional features of PPL. For other smooth choices of L(·), the criterion can be approximated in a similar form, with a goodness-of-fit term and a complexity term. Like the DIC, this criterion contains an automatic penalty, P(k). The choice of k determines how much weight is placed on the goodness-of-fit term relative to the penalty term; for large k, there should not be much sensitivity to the choice of k, as the weight k/(k + 1) approaches one. Since this criterion is based on the posterior predictive distribution, it is invariant to the parameterization, unlike the DIC, which uses a non-invariant plug-in estimator for θ. The downsides of this approach are that it requires the choice of an appropriate loss function (which we do not specify in the course of most Bayesian analyses) and, possibly, analytical calculations to obtain the criterion.

The description and derivation of PPL above are for univariate data y_i. We now discuss how one might extend this approach to longitudinal (correlated) data. One way is to summarize the correlated data with a univariate measure, say T_i = h(Y_i), with dim(Y_i) = J, and then apply the univariate methods (Hogan and Wang, 2001). The univariate summary T_i might be specified as a weighted average of the longitudinal responses, T_i = Σ_j w_j Y_ij, for some fixed set of weights w = (w_1, ..., w_J). Using this summary, to emphasize fit based on the last observation time, we can set w_l = 0 for l = 1, ..., J − 1 and w_J = 1. To examine change from baseline, set w_1 = −1, w_J = 1, and w_l = 0 for l = 2, ..., J − 1. Given the choice of T_i, and using the Gelfand and Ghosh loss function (3.22) with squared error loss, the criterion becomes

Σ_{i=1}^n σ²_{i(T)} + { k/(k + 1) } Σ_{i=1}^n (µ_{i(T)} − T_{i,obs})²,

where µ_{i(T)} = E(T_{i,rep} | y) and σ²_{i(T)} = Var(T_{i,rep} | y), with T_{i,rep} = h(Y_{i,rep}).
Posterior Predictive Checks: To determine how well a model fits the data in an absolute sense, posterior predictive checks can be used. They are a simple but powerful approach for determining whether particular aspects of the data are captured adequately by the model. The idea is to simulate sets of replicated data from the model and see how well appropriate summaries of these compare with what was observed in the data. In particular, we simulate data Y_rep from the posterior predictive distribution,

p(y_rep | y) = ∫ p(y_rep | θ, y) p(θ | y) dθ,  where p(y_rep | θ, y) = p(y_rep | θ).

This approach uses the posterior predictive distribution to assess how well particular features of the replicated data from the model match the observed data. Posterior predictive p-values quantify the relationship between the statistics computed from the observed data and those computed from the replicated data. The posterior predictive p-value can be computed as

P_{y_rep | y}( T(y_obs) > T(y_rep) ),

where T(y_obs) is the statistic computed from the observed data and T(y_rep) is the statistic computed from the replicated data. Extreme p-values, close to either 0 or 1, suggest that the model may not be adequately capturing that feature of the data, in that the statistic based on the replicated data is consistently over- or under-estimating the value of the same statistic based on the observed data. The following example illustrates a choice of T(·) for a specific setting.

Example. Posterior predictive checks for the multivariate normal model with temporally aligned observations (cont. of Example 2.4)

Table 4.2 suggests that the covariance matrix might differ across treatments. However, given the small sample size per treatment (Table 1.2.4), we might instead fit a more parsimonious model that assumes the covariance matrix Σ is the same across treatments. We can construct posterior predictive checks to see how consistent the common-Σ model is with the observed treatment-specific correlation structure. To assess this, we specify T to be the set of pairwise correlations among the components of Y for each treatment. At each iteration of the MCMC algorithm, we simulate data y_rep from the posterior predictive distribution under the common-Σ model and compute the corresponding set of test statistics T_rep. We can then compare the distribution of T_rep to the pairwise correlations computed from the actual data, T_obs = { cor(Y_ij, Y_ij′) : j ≠ j′ }. We illustrate this in Data Example 4.1 in Chapter 4.
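A minimal sketch of the p-value computation; the correlation check in the comments mirrors the example above, with illustrative array shapes.

```python
import numpy as np

def posterior_predictive_pvalue(t_obs, t_rep):
    """Posterior predictive p-value for a scalar statistic T.

    t_obs: T computed from the observed data; t_rep: (K,) values of T
    computed from replicated datasets drawn within the MCMC run.
    Values near 0 or 1 flag a poorly captured feature of the data.
    """
    return np.mean(t_obs > t_rep)

# Example: checking one pairwise correlation, with y of shape (n, J) and
# y_rep of shape (K, n, J):
#   t_rep = np.array([np.corrcoef(yr[:, 0], yr[:, 1])[0, 1] for yr in y_rep])
#   p = posterior_predictive_pvalue(np.corrcoef(y[:, 0], y[:, 1])[0, 1], t_rep)
```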
To close this chapter, we discuss Bayesian formulations of the penalized splines (p-splines) that were introduced in Example ?? (Ruppert and Carroll, 2000; Eilers and Marx, 1996; Berry, Carroll, and Ruppert, 2002; Crainiceanu, Ruppert, and Carroll, 2004).

3.5 Semiparametric Bayes

Example. Formulation of Bayesian penalized splines (continuation of Example 2.8)

Recall the spline basis given in eq. (2.12),

B(t)^T = (1, t, ..., t^q, (t − s_1)_+^q, ..., (t − s_k)_+^q),

of dimension R = q + k + 1, where (z)_+ = max{0, z}, with a penalty of the form λ β^T D β, where D is a diagonal matrix whose first q + 1 diagonal elements are zero and whose remaining k elements are one (Berry, Carroll, and Ruppert, 2002). Penalties for these splines can be reformulated as a prior distribution on β. To do this, we first partition β into (α, u), where α is the (q + 1)-dimensional coefficient vector corresponding to the first q + 1 components of the basis and u is the k-dimensional coefficient vector corresponding to the last k components. The prior on α is flat on R^{q+1}, and u ∼ N(0, τ² I_k); the smoothing parameter is λ = σ²/τ². Crainiceanu, Ruppert, and Carroll (2004) use noninformative gamma priors on the variance components τ² and σ². Berry et al. (2002) recommend informative (conditionally conjugate) gamma priors for the smoothing parameter λ. We refer the reader to Berry et al. (2002) and Crainiceanu et al. (2004) for further discussion of priors, and to Crainiceanu et al. (2004) for recent developments in the longitudinal setting.
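A minimal sketch of the truncated power basis and penalty matrix of (2.12); the knot placement is an illustrative choice echoing the HERS analysis in Chapter 4.

```python
import numpy as np

def pspline_basis(t, knots, q=2):
    """Truncated power basis B(t): dimension q + 1 + len(knots)."""
    poly = np.vander(t, q + 1, increasing=True)                  # 1, t, ..., t^q
    trunc = np.clip(t[:, None] - knots[None, :], 0, None) ** q   # (t - s_j)_+^q
    return np.hstack([poly, trunc])

t = np.linspace(0.0, 1.0, 200)
knots = np.quantile(t, np.linspace(0.05, 0.95, 18))   # 18 knots at quantiles
B = pspline_basis(t, knots, q=2)
# Penalty matrix D: zeros for the q + 1 polynomial terms, ones for the knot
# terms, matching a flat prior on alpha and u ~ N(0, tau^2 I_k).
D = np.diag(np.r_[np.zeros(3), np.ones(len(knots))])
```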
3.6 Further reading

Priors: In several recent papers, Daniels and colleagues (Daniels and Kass, 1999, 2001; Daniels and Pourahmadi, 2002) have proposed approaches that shrink toward a parametric structure for the covariance matrix in longitudinal data; this offers both robustness to mis-specification of the structure and parsimony via the structure. The approach is well suited to covariance matrices in longitudinal data models, where parsimonious parametric models are often used to describe the covariance structure (see Example 2.3). Some results on the propriety of the posterior for various longitudinal data models when using improper priors on regression coefficients, variances, and/or covariance matrices can be found in Daniels (2006), Natarajan and McCulloch (1995), Natarajan (2001), Hobert and Casella (1996), and Berger, Strawderman, and Tang (2005). Sun and Sun (2005) developed reference priors specifically for the covariance matrix for certain patterns of incomplete data.

Prior elicitation and informative priors: For the formulation and elicitation of informative priors, an alternative to conjugate priors is some sort of nonparametric prior, constructed either by just specifying quantiles (Berger and O'Hagan, 1988) or more formally using a modern Bayesian nonparametric approach (Oakley and O'Hagan, 2002). If multiple experts are to be used for elicitation, a major issue is how best to combine their opinions; strategies can be found in Garthwaite, Kadane, and O'Hagan (2005) and Press (2003), with additional references therein. For elicitation and informative priors for covariance matrices, we refer the reader to Brown, Le, and Zidek (1994), Garthwaite and Al-Awadhi (2001), and Daniels and Pourahmadi (2002); for correlation matrices, some recent work can be found in Boscardin et al. (2007). Other recent work on elicitation can be found in Kadane and Wolfson (1998), O'Hagan (1998), and Chen et al. (2003); excellent reviews can be found in Garthwaite et al. (2005), Press (2003), and Chaloner (1996).

Computing: Slice sampling is another approach to sampling from unknown full conditional distributions (Damien, Wakefield, and Walker, 1999; Neal, 2003); it involves augmenting the parameter space with non-negative latent variables in a particular way and is a special case of data augmentation. WinBUGS uses this approach for parameters with bounded domains. Another approach that has been used to sample from unknown full conditionals is hybrid Monte Carlo (Gustafson, 1997), which uses information from the first derivative of the log full conditional and has been implemented successfully by Ilk and Daniels (2005) in the context of extensions of MTMs for multivariate longitudinal binary data. For other candidate distributions for the Metropolis-Hastings algorithm, we refer the reader to Gustafson et al. (2004), who propose and review approaches that try to avoid
the random walk behavior of the random-walk Metropolis-Hastings algorithm without necessarily doing an expensive numerical maximization at each iteration; this class includes hybrid Monte Carlo as a special case. When the logarithm of the full conditional distribution is concave, Gilks and Wild (1992) have proposed easy-to-implement adaptive rejection sampling algorithms. Parameter expansion algorithms to sample correlation matrices have recently been proposed (Liu and Wu, 1999; Liu, 2001); these provide a conditionally conjugate structure by sampling a covariance matrix from the appropriate distribution and then transforming it to a correlation matrix. Rao-Blackwellization for computing credible intervals is discussed in Eberly and Casella (2003). Additional extensions of data augmentation, termed marginal, conditional, and joint, that can further improve the efficiency and convergence of the MCMC algorithm can be found in van Dyk and Meng (2001).

Model comparison and model fit: For some other diagnostics for assessing model fit, see Hodges (1998) and Gelfand, Dey, and Chang (1992).

Semiparametric Bayes: For p-splines, an alternative, more stable basis based on low-rank thin-plate splines has recently been proposed by Crainiceanu et al. (2004). An alternative to penalized splines (p-splines) is regression splines. Regression splines also contain a set of knots, though typically many fewer than p-splines since there is no penalty; their use requires determining the number and location of the knots. Bayesian approaches use reversible jump MCMC methods (Green, 1995) to determine the number k and locations {s_1, ..., s_k} of the knots. See Denison, Mallick, and Smith (1998) and DiMatteo, Kass, and Genovese (2001) for methodology for a single longitudinal trajectory, and Botts and Daniels (2005) for methodology for multiple trajectories (the typical longitudinal setting). For prior distributions on distribution functions using Dirichlet process priors, see Ferguson (1982) and Escobar (1995); for Polya tree priors, see Lavine (1992, 1994). These can be used for the random effects distributions in Examples 2.1 and 2.2. For other approaches to semiparametric regression, see Chen and Ibrahim (2005) and Zhang (1999).
CHAPTER 4

Data Analysis

In this chapter, we conduct analyses of several of the datasets from Chapter 1 using some of the models described in Chapter 2. For each, we do a careful analysis based on the questions of interest for that dataset, but we also provide additional results to illustrate some aspects of inference from Chapter 3.

4.1 Analysis of GH study using complete data (continuation of Data Example 1.2)

Models

We use the multivariate normal model in Example 2.3 with J = 3 and t = (0, 6, 12). Examination of Table ?? suggests a separate mean for each of the four treatments (Tx) at each of the three observation times (baseline, 6 months, and 12 months). Letting Tx_i be the treatment indicator for subject i, X_ij is the 12 × 1 design vector given by

X_ij = ( I{j = 1, Tx_i = 1}, I{j = 2, Tx_i = 1}, ..., I{j = 3, Tx_i = 4} )^T.   (4.1)

For simplicity, in the following we set µ_jk to be the mean for treatment k at time t_j. Table 4.2 provides some evidence that the covariance matrix is different across treatments. As a result, we fit two models:

1. Σ(X_i, φ) = Σ(Tx_i, φ): the covariance matrix differs by treatment.
2. Σ(X_i, φ) = Σ(φ): the covariance matrix is common across treatments.

As discussed in Data Example 1.2, the main inferential objective for these data is the mean quadriceps strength at 12 months for each of the four treatments.
Priors

Priors are specified as in Example 3.9; in particular, the prior on β is multivariate normal with mean β₀ = 0 and covariance matrix V_β equal to a large multiple of the identity, and the prior on Σ^{-1} is Wishart with degrees of freedom ρ = J = 3 and scale matrix Ω₀ based on a prior guess of 900 for the variance at each time point.

Model Selection and Fit

To compare the fit of the common-Σ model with the model having a different Σ for each treatment, we used the DIC with θ = (β, Σ(X_i, φ)^{-1}), and PPL using the Gelfand and Ghosh loss function given in (3.23) with large k and h(Y_i) = Y_i3; h was chosen this way to emphasize the fit of the model at 12 months, which was the observation time of primary interest. These results appear in Table 4.1. The DIC results slightly favored the common-Σ model; notice also that when we set θ = (β, Σ(X_i, φ)), the results reversed, slightly favoring the different-Σ model and illustrating the sensitivity to the parameterization of θ. The PPL also favored the more parsimonious common-Σ model.

Table 4.1 DIC and PPL for the common- and different-Σ models. The values in parentheses in the DIC rows were obtained with θ = (β, Σ) instead of (β, Σ^{-1}).

model         DIC           Dev(θ̄)   p_D
common Σ      2187 (2186)             (18)
different Σ   2193 (2183)             (28)

model         PPL
common Σ
different Σ

To assess the fit of the common-Σ model, we used posterior predictive checks as described in Example 3.18. The posterior predictive p-values ranged from .14 to .84; these are not very extreme and suggest that the more parsimonious common-Σ model does not fit poorly and reasonably captures the correlation over time for the four treatments. We recommend the common-Σ model here. On the other hand, examination of Table 4.2 suggests that more specialized covariance models that allow individual components of Σ to differ across treatments may be useful here: the correlation between the responses at baseline and month 6 in treatment 1 appears lower than in the other treatments, as do the correlations between baseline and 12 months and between 6 months and 12 months in treatment 4. Also, the variances of responses
in treatment 1 at 6 and 12 months are much larger than the corresponding variances in the other three treatments. We will explore this further in later chapters.

Table 4.2 Estimated Σ by treatment. The off-diagonal elements below the diagonal are the pairwise correlations.

Treatment 1
Treatment 2
Treatment 3
Treatment 4

Results

For illustration, we present results from both models, but as stated earlier, we recommend the common-Σ model for inference. For the common-Σ model, posterior means and 95% credible intervals for the mean parameters are given in Table 4.3. Since most of the interest in this dataset focuses on the mean outcome at month 12, we examine its posterior next. Histograms of the posterior distributions of the pairwise differences between the four treatment means at month 12 are given in Figure 4.1. Table 4.5 contains posterior probabilities that the month-12 mean is higher for each pair of treatments, which quantifies the evidence in the histograms (Figure 4.1). For example, the probability that mean quadriceps strength at month 12 in treatment 1 is higher than in treatment 2 is given by

P(µ₃₁ > µ₃₂ | y) = ∫ I{ (µ₃₁ − µ₃₂) > 0 } p(µ | y) dµ.   (4.2)

This integral is intractable analytically but can be computed easily from the posterior sample, as sketched below. Subjects on treatment 1 had consistently higher means than those on the other three treatments, with posterior probabilities greater than .99. Results from fitting the model with a separate Σ for each of the four treatments are given in Table 4.4.
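A minimal sketch of the Monte Carlo evaluation of (4.2); array names are illustrative.

```python
import numpy as np

def prob_greater(mu_draws, j, k):
    """Monte Carlo estimate of P(mu_j > mu_k | y) as in (4.2).

    mu_draws: (K, T) array of posterior draws of the month-12 treatment
    means (columns indexed by treatment). The integral reduces to the
    proportion of posterior draws in which the indicator equals one.
    """
    return np.mean(mu_draws[:, j] > mu_draws[:, k])
```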
Table 4.3 Posterior means and 95% credible intervals of the mean parameters for each treatment, assuming a common Σ across treatments.

Time  treatment 1      treatment 2           treatment 3           treatment 4
0     (68.16, 87.88)   67.25 (58.81, 75.68)  64.84 (56.52, 73.15)  67.09 (58.36, 75.78)
6     (79.88, 100.60)  63.74 (54.84, 72.56)  81.28 (72.52, 89.94)  61.97 (52.83, 71.10)
12    (78.64, 98.10)   63.33 (55.06, 71.68)  72.49 (64.22, 80.68)  62.94 (54.35, 71.54)

Table 4.4 Posterior means and 95% credible intervals of the mean parameters for each treatment, assuming a different Σ for each treatment.

Time  treatment 1      treatment 2           treatment 3           treatment 4
0     (67.41, 88.66)   67.25 (58.90, 75.61)  64.84 (56.06, 73.64)  67.13 (57.93, 76.33)
6     (76.13, 104.30)  63.74 (55.88, 71.54)  81.26 (72.05, 90.52)  62.01 (54.05, 69.97)
12    (74.16, 102.50)  63.32 (55.82, 70.88)  72.47 (64.67, 80.28)  62.95 (55.31, 70.64)

The posterior means at each of the three observation times are very similar across the two models, but the credible intervals for some are quite different; for example, the credible interval for the month-12 mean for treatment 1 is considerably wider under the different-Σ model. Histograms of the posterior distributions of the pairwise differences between the four treatment means at month 12 under this model are given in Figure 4.2. The conclusions were similar to those from the common-Σ model.

Conclusions

The model provides strong evidence that mean quadriceps strength in treatment 1 was the highest. However, this analysis has ignored dropout, which can induce bias in the estimation of mean parameters. We will discuss these issues and revisit this example in Part II.

4.2 Analysis of schizophrenia using complete data (continuation of Data Example 1.1)

Models

We used the normal random effects model described in Example 2.1 to analyze data from the completers. Here J = 6 and t = (0, 1, 2, 3, 4, 6). Based on an examination of Table ??, a quadratic trend in the BPRS scores over time seemed reasonable, so we fit a separate quadratic trend for each treatment. We allowed the coefficients of these quadratic trends to vary over individuals, so W_i = X_i.
Table 4.5 Posterior probabilities that the pairwise differences in month-12 means, µ₃^(j) − µ₃^(k), are greater than zero under the common- and different-Σ models.

                              µ₃^(1)−µ₃^(2)  µ₃^(1)−µ₃^(3)  µ₃^(1)−µ₃^(4)  µ₃^(2)−µ₃^(3)  µ₃^(2)−µ₃^(4)  µ₃^(3)−µ₃^(4)
Complete data, common Σ
Complete data, different Σ

Of primary interest in this trial was the change from baseline to week 6 in the four treatments.

Priors

Conditionally conjugate priors were specified as in Example 3.10; in particular, the prior on β is multivariate normal with mean β₀ = 0 and covariance matrix V_β equal to a large multiple of the identity, the prior on Ω^{-1} is Wishart with ρ = 3 degrees of freedom and scale matrix Ω₀ based on a prior guess of 100 for the variance of each random effect, and the prior on 1/σ² is Gamma(a = .01, b = 1/.01).

Results

Table 4.6 contains posterior means and 95% credible intervals for the coefficients of the orthogonal quadratic polynomials for each treatment. Figure 4.3 plots the posterior means of the trajectories for each treatment over the six weeks of the trial. Based on the figure, completers on the high-dose arm appeared to do best and completers on the low dose worst, with the medium dose and the standard treatment very similar and in between the other two treatments. Tables 4.7 and 4.8 are of primary interest for inference, as they contain summary information from the posterior of the change from baseline to 6 weeks; in particular, they contain the change from baseline for all four treatments and the pairwise differences in the change from baseline among the treatments, respectively. The smallest improvement was seen in the low-dose arm, with a posterior mean of −11.9 and a 95% credible interval of (−18.5, −5.3). However, Table 4.8 indicates that none of the changes from baseline were significantly different between the four treatments, as the credible intervals for all the differences covered zero.

Conclusion

There were no significant differences in the change from baseline between the treatments. However, as in the previous example with the GH data, dropout was ignored; this example will also be revisited in Part II.
Figure 4.1 Posterior distributions of the pairwise differences in month-12 means under the common-Σ model.

4.3 Analysis of CTQ I using complete data (continuation of Data Example 1.3)

Models
Figure 4.2 Posterior distributions of the pairwise differences in month-12 means under the different-Σ model.

Recall that the outcomes in this study were binary weekly cessation statuses for each woman from weeks 1 to 12, and that the objective was to compare the time-averaged treatment effect from weeks 5 through 12. To accomplish this aim, we fitted both a logistic-normal random effects model (Example 2.2) and a marginalized transition model with first-order dependence (MTM(1); Example 2.5).

Logit-normal random effects model. For j = 1, ..., 12, let Y_ij denote the cessation status of individual i at time t_j. Treatment is denoted by X_i = 1 (exercise) and X_i = 0 (no exercise).
Table 4.6 Posterior means and 95% credible intervals for the regression parameters for each treatment (MAR data).

      Low              Medium                High                  Standard
β₀    (21.25, 32.21)   24.95 (20.90, 29.13)  23.23 (18.92, 27.58)  25.83 (21.72, 30.01)
β₁    (−2.54, 1.68)    −1.95 (−3.29, −0.61)  −1.57 (−2.99, −0.09)  −2.73 (−4.21, −1.21)
β₂    (−0.45, 1.47)    0.81 (0.10, 1.53)     1.01 (0.23, 1.79)     0.27 (−0.49, 1.04)

Table 4.7 Posterior means of the change from baseline to week 6 for all four treatments.

          Low                  Medium                High                  Standard
Complete  −11.9 (−18.5, −5.3)  −15.8 (−21.1, −10.5)  −16.3 (−22.0, −10.4)  −17.1 (−22.6, −11.7)

The random effects specification uses a single random effect b_i and assumes separate cessation rates before and after week 5, the designated quit week. Let Z_j = I(t_j ≥ 5). The treatment effect is the log odds ratio comparing time-averaged quit rates from week 5 to week 12 (i.e., over the period where Z_j = 1). The model is given in three stages, where the third level assigns prior distributions. At the first level, smoking statuses within an individual are independent Bernoulli outcomes conditional on b_i, with

[Y_ij | X_i, b_i, α, β] ∼ Ber(µ_ij^b),  logit(µ_ij^b) = (1 − Z_j) α + Z_j (b_i + β₀ + β₁ X_i).

Here β = (β₀, β₁)^T, and β₁ is the subject-specific log odds ratio capturing the treatment effect. At the second level, we assume [b_i | X_i, τ] ∼ N(0, τ²). The population-averaged treatment effect can be approximated by β₁* ≈ β₁ K(τ²), where K(τ²) = (c² τ² + 1)^{-1/2} with c = 16√3/(15π) (Zeger and Liang, 1992).
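The attenuation approximation above can be computed directly. Here is a minimal sketch using the constant c = 16√3/(15π) as reconstructed here; readers should verify the factor against the cited reference.

```python
import numpy as np

def population_averaged_logor(beta1, tau2):
    """Approximate population-averaged log odds ratio beta1* from the
    subject-specific effect beta1 and the random effect variance tau2,
    via the logistic-normal attenuation factor K(tau2)."""
    c = 16.0 * np.sqrt(3.0) / (15.0 * np.pi)
    return beta1 / np.sqrt(c * c * tau2 + 1.0)

# With beta1 = 1.9 and tau = 6 (tau2 = 36), this gives roughly 0.5,
# consistent with the summaries reported in Table 4.9.
print(population_averaged_logor(1.9, 36.0))
```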
Table 4.8 Posterior means and 95% credible intervals for the pairwise differences of the changes from baseline among the four treatments.

diff12            diff13            diff14            diff23            diff24           diff34
4.0 (−4.7, 12.4)  4.4 (−4.5, 13.2)  5.2 (−3.4, 13.7)  0.43 (−7.5, 8.3)  1.3 (−6.3, 9.0)  0.84 (−7.0, 8.7)

Marginalized transition model. We use a similar formulation to fit the MTM(1). Serial correlation is modeled only for the weeks following the quit date. The marginal mean follows

logit(µ_ij) = (1 − Z_j) α + Z_j (β₀* + β₁* X_i),

where the superscript * denotes parameters of the marginal rather than the conditional distribution. For j = 6, ..., 12, serial correlation follows

logit(φ_ij) = Δ_ij + γ_i Y_{i,j−1},

where Δ_ij is determined by the marginal means; correlation is allowed to differ by treatment group by setting γ_i = φ₀ + φ₁ X_i.

Priors

The regression parameters for the mean in both models, and the serial correlation parameters in the MTM(1), are assumed independent with just-proper normal priors N(0, 10,000). In the random effects model, the inverse of the random effects variance τ² is given a diffuse gamma prior with parameters (.001, 1/.001).

Results

For each model, inference is based on a posterior sample of size 1000, generated after a burn-in comprising 2000 samples. Table 4.9 summarizes the key parameters. Turning first to the MTM(1), the posterior means of β₀* and β₁* suggest that time-averaged cessation rates were approximately exp(−0.8)/{1 + exp(−0.8)} = 0.31 and exp(−0.8 + 0.4)/{1 + exp(−0.8 + 0.4)} = 0.40 in the control and treatment conditions, respectively. The population-averaged treatment log odds ratio is β₁* = 0.4 (odds ratio roughly 1.5), and the posterior credible interval for β₁*, (0.0, 0.9), suggests the treatment is effective. Within-subject correlation in the control group is very high, as indicated by the posterior distribution of φ₀; the posterior of φ₁ does not suggest any difference in correlation between treatment groups. In the random effects model, the between-subject variance τ² is very large, and this leads to a substantial difference between the subject-specific effect β₁ (posterior mean 1.9) and its model-derived population-averaged effect β₁* (posterior mean 0.5).
Figure 4.3 Posterior means of the trajectories for each of the four treatments.

The mean of the population-averaged effect is similar between the two models, but its posterior variation is much smaller in the MTM(1).

4.4 Analysis of HERS CD4 data (continuation of Data Example 1.4)

Models
Table 4.9 Comparison of key model parameters from the logistic-normal random effects (RE) model and the MTM(1), fit to longitudinal smoking cessation data from CTQ I.

Parameter       Posterior mean   95% credible interval
Model 1 (RE)
β₁              1.9              (−0.1, 4.0)
τ               6.0              (4.7, 7.4)
β₁*             0.5              (−0.3, 1.1)
Model 2 (MTM)
β₀*             −0.8             (−1.1, −0.4)
β₁*             0.4              (0.0, 0.9)
φ₀                               (4.0, 5.2)
φ₁                               (−0.1, 1.5)

In the HERS data, one question of interest was the shape of the CD4 trajectories and, in particular, the impact of highly active antiretroviral therapy (HAART) on these trajectories. Because HAART use was incident in HERS, these data afford an opportunity to capture the population-level effects of HAART on CD4 cell count. We model the trajectories using p-splines (see Examples 2.8 and 3.19) with q = 1 (linear) and q = 2 (quadratic), within the multivariate normal model with an exponential covariance function (Example 2.8). The p-splines used a truncated power basis (see Examples 2.8 and 3.19) with 18 knots placed at equally spaced sample quantiles of the observed times.

Priors

A diffuse gamma prior with parameters (.001, 1/.001) was assumed for the inverse of the residual variance σ², and a uniform(0, 10) prior for the correlation decay parameter φ (minus the log of the lag-one correlation) of the exponential covariance function. The first q + 1 coefficients α were given just-proper normal priors with mean 0 and variance 10,000, and the coefficients u corresponding to the knots were given normal priors with mean zero and variance τ², where 1/τ is in turn given a uniform(0, 10) prior.
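A minimal sketch of the residual covariance implied by the exponential covariance function, under the parameterization reconstructed here (correlation exp(−φu) at lag u, so φ = 0.65 gives a one-week correlation of about .52, as in Table 4.11):

```python
import numpy as np

def exponential_cov(times, sigma2, phi):
    """Residual covariance: Cov(Y(s), Y(t)) = sigma2 * exp(-phi * |s - t|).

    times: (J,) observation times in weeks; sigma2: residual variance;
    phi: correlation decay rate (larger phi, weaker correlation).
    """
    lags = np.abs(times[:, None] - times[None, :])
    return sigma2 * np.exp(-phi * lags)

# e.g., roughly matching the posterior means reported for the q = 2 model:
# Sigma = exponential_cov(np.array([0.0, 1.0, 2.0, 4.0]), sigma2=3.9, phi=0.65)
```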
Model Selection

In addition to the model with an exponential covariance function for the residuals, we also fit a model that assumes independent residuals for comparison. To assess fit among the models, we computed the DIC with θ = (β, φ, 1/σ²); see Table 4.10. The best-fitting model was the quadratic p-spline with the exponential covariance function, and the greatly improved fit of both the linear and quadratic p-splines with residual autocorrelation over independence of the residuals is clear from the large differences in the DIC.

Table 4.10 DIC under q = 1 and q = 2, under independence and under the exponential covariance function.

model          DIC   Dev(θ̄)   p_D
indep, q = 1
indep, q = 2
expo, q = 1
expo, q = 2

Results

Posterior means and 95% credible intervals for the covariance parameters in the model with q = 2 appear in Table 4.11. The correlation parameter φ was significantly different from zero, with the posterior mean corresponding to a lag-one (one-week) correlation of .52. Posterior means of the fitted curves for q = 1 and q = 2 appear in Figures 4.4 and 4.5, respectively. The posterior means of the spline curves were relatively similar for q = 1 and q = 2 and captured more local variation than the independence models. Time zero on the plots represents self-reported initiation of HAART. An increase in CD4 can be seen near self-reported initiation (the effect of HAART), followed by a decrease later on, possibly attributable to the development of resistant viral strains. It also seems clear that the self-reports are subject to a delay relative to the actual initiation of HAART, because the increase begins prior to time zero.

Table 4.11 Posterior means and 95% credible intervals for the exponential covariance function parameters and smoothing variance under q = 2.

parameter   posterior mean (95% credible interval)
φ           .65 (.60, .70)
σ²          (3.8, 4.03)
λ           1.9 (.81, 3.6)
Figure 4.4 Plot of observed data with fitted p-spline curve with q = 1, under independence and under the exponential covariance function.
Figure 4.5 Plot of observed data with fitted p-spline curve with q = 2, under independence and under the exponential covariance function.
PART II

Missing Data
CHAPTER 5

Missing data mechanisms and longitudinal data

5.1 Introduction

This chapter lays out the definitions and assumptions that are commonly used to make inference about a full-data distribution from incompletely observed data. Some specific examples help to motivate the discussion.

Example 5.1. Dropout in a longitudinal clinical trial.

In the growth hormone study described in Chapter 1, individuals were scheduled for measurement at baseline, month 6, and month 12. The target of inference is the difference in mean quadriceps strength between the two treatment groups at month 12, among everyone who began follow-up. This is a parameter indexing the distribution of the full data, i.e., the data that would have been observed had everyone completed the study. However, several individuals dropped out of the study before completing follow-up.

Example 5.2. Dropout and mortality in a cohort study.

The HER Study (Chapter 1) followed 871 HIV-infected women for up to six years, recording clinical and behavioral outcomes every six months. All women were scheduled for 12 follow-up measurements. About 10 percent died from AIDS and other causes, and about another 20 percent withdrew or were lost to follow-up. For a specific outcome of interest, such as CD4 count, the full data could be defined in several ways. One definition is all CD4 counts that were scheduled to be observed (12 for each woman). Because of the difficulty of conceptualizing CD4 count for someone who has died, an alternative definition is those CD4 counts taken while still alive.

A key theme of this chapter is the distinction, for purposes of model specification and inference, between full data and observed data. Inferences about full data that are based on incomplete observations must rely on assumptions about the distribution of the missing responses. We will demonstrate here that these assumptions are always encoded in a full-data model by some combination of structural modeling assumptions and constraints on the parameter space. This is true even, and in fact especially, for commonly used assumptions such as MAR and ignorability.

Section 5.2 provides definitions of full data and characterizes some common missing data processes, such as dropout and monotone missingness. Section 5.3 gives a general conceptualization of the full-data model, and Section 5.4 describes missing data mechanisms such as MAR in the context of a full-data model. In Section 5.5 we show how the MAR assumption applies to dropout in longitudinal studies. The remainder of the chapter is devoted to model specification and interpretation under various assumptions about the missing data mechanism. A number of issues are highlighted for further reading.
5.2 Full versus observed data

Overview

When drawing inference from incomplete data, it is necessary to expand the context of the modeling problem to characterize some set of full but incompletely observed data. In this section we differentiate full data, observed data, and full-data response. The full data refers to all elements of the data that are observed or are intended to be observed, usually including some response or dependent variable, covariates of direct interest, auxiliary variables, and missing data indicators. The observed data comprise the observed subset of the full data. Finally, the full-data response refers to the dependent variable of primary interest; for example, while missing data indicators are part of the full data, their distribution and relation to covariates typically are not of primary interest to the modeler.

As an example, consider a clinical trial of a new antipsychotic agent, where patients are randomized to receive either standard treatment or the new therapy. The schedule calls for a symptom scale to be recorded weekly for six weeks, but many patients drop out of the study before their follow-up is complete. Here, the full data consist of
(a) symptom scale outcomes for all weeks, regardless of whether they were actually recorded;
(b) a binary random variable for each week, indicating whether the symptom scale was recorded (sometimes called missing data indicators);
(c) the covariate of interest, namely treatment group; and
(d) other baseline covariates that may have been recorded.
The full-data response consists only of items (a) and (c), or more simply, the data that were intended to be collected and analyzed. Finally, the observed data comprise the observed elements of the symptom score, the missing data indicators, and the covariates.

This section lays out notation and definitions for the data only; in Section 5.3, we define various models that can be used to characterize aspects of the full-data distribution. Key references include Copas and Eguchi (2005), Heitjan and Rubin, Scharfstein et al., and Diggle et al.
Data structures

Consider first the case of a bivariate response, where the first response is always observed but the second may be missing. The full data for individual i consist of a response vector Y_i = (Y_i1, Y_i2)^T, a 2 × p matrix X_i of model covariates, a 2 × q matrix V_i of auxiliary covariates, and the missing data indicator R_i. Throughout the book, we assume the process that causes response data to be missing is stochastic (as opposed to being part of a study design); hence R_i (and its generalizations below) is always treated as a random variable. The observed data for individual i are O_i = (Y_{i,obs}, X_i, V_i, R_i), where

Y_{i,obs} = (Y_i1, Y_i2)^T if R_i = 1, and Y_{i,obs} = Y_i1 if R_i = 0.

For more general patterns of longitudinal data, similar structures apply. Consider first the case of temporally aligned observations. Referring to Section ??, the full-data response vector is Y_i = (Y_i1, ..., Y_iJ)^T. The vector R_i = (R_i1, ..., R_iJ)^T indicates which components are observed, with R_ij = 1 if Y_ij is observed and R_ij = 0 otherwise. Let J_i = Σ_j R_ij denote the number of full-data components of Y_i that are observed. A useful way to represent the observed and missing components of the full data is the (possibly temporally unordered) partition (Y_{i,obs}^T, Y_{i,mis}^T)^T. The subvector Y_{i,obs} is J_i × 1, with elements {Y_ij : R_ij = 1}; similarly, Y_{i,mis} is (J − J_i) × 1, with elements {Y_ij : R_ij = 0}.

For temporally misaligned data the situation is somewhat different, and the definition of full data is not always obvious because the timing of measurements is itself a random variable. For example, the full response data could be all values of a stochastic process {Y_i(t)} over a fixed range of t, with the observed data being the subset {Y_i(t_i1), ..., Y_i(t_iJ_i)}. Here the
observation times t_{i1}, ..., t_{iJ_i} might arise from a marked counting process N_i(t), with observations of Y_i(t) taken at points where dN_i(t) = 1 (i.e., where the counting process jumps).

While our focus will primarily be on full data with fixed (as opposed to random) observation times, many of the topics we discuss can be applied to settings where observation times are random. For instance, a key consideration in modeling longitudinal data is whether the missing data process defined by R is independent of the responses Y; for continuous-time processes, the consideration is whether N(t) is in some sense independent of Y(t). We will return to this in later chapters; for a full treatment, see Lin and Ying (1999).

Dropout and other processes leading to missing responses

Any number of events can lead to missing data. Commonly encountered examples include

(a) missed visits, either at random or for reasons related to response, such as when a patient in a study of depression fails to show up when he is experiencing symptoms;
(b) withdrawal from a study, decided either by the participant or by the investigator conducting the study; common examples in pharmacologic trials are withdrawal due to side effects, toxicity, or lack of efficacy;
(c) loss to follow-up, distinguished from withdrawal because the reasons are not reported;
(d) death or disabling event, possibly related to the outcome but sometimes not; for example, in a longitudinal study of HIV, accidental death is not outcome related;
(e) missingness by design, as in a longitudinal survey where only a subset of individuals is selected for follow-up.

This list certainly is not exhaustive, but it covers many reasons for missing data in longitudinal studies of various types. Adding to the complexity of handling missing data is that the types of missingness listed above may have different causes, and in most cases should be treated differently. Withdrawal for lack of efficacy is a different process than withdrawal for toxicity; outcome-related mortality must be treated differently from death by other causes.

Our focus throughout the book is primarily on dropout, and for simplicity we begin with the assumption that dropout and its relation to the
response process can be captured using a single random variable. Although it seems overly simplistic, this approach can sometimes be used when dropout has multiple causes (Hogan and Laird, 1997). When there are distinct types of dropout that are related to outcome in different ways, it is straightforward to introduce a multinomial version of the missing data indicator, and many of the same ideas discussed here apply directly (Scharfstein et al.; Scharfstein and Rotnitzky).

Before moving forward to describe models for incomplete data, it is necessary to define two important terms: dropout and monotone missing data pattern. For many of the models we discuss in this and subsequent chapters, monotone missingness is a key requirement. For temporally aligned longitudinal data, missingness is characterized in terms of the missingness indicators R = (R_1, ..., R_J)^T. Using these, we can define both terms.

Definition 5.1. Dropout process. For full-data responses Y_1, ..., Y_J scheduled to be recorded at times t_1, ..., t_J, let R_1, ..., R_J denote the missing data indicators, with R_j = 1 if Y_j is observed and R_j = 0 if missing. A missing data process is a dropout process if, for some j < J,
\[
R_j = 0 \;\Longrightarrow\; R_{j+k} = 0 \quad \text{for all } 1 \le k \le J - j.
\]
The dropout time is t_j.

Missingness that does not lead to dropout usually is called intermittent missingness because, rather than truncating the longitudinal process, it creates gaps. Dropout that occurs in the absence of intermittent missingness leads to a monotone pattern for the responses (see Figure xx).

Definition 5.2. Monotone missing data pattern (monotone dropout). A missing data pattern is monotone if, for each individual, there exists a measurement occasion j such that R_1 = ··· = R_{j−1} = 1 and R_j = R_{j+1} = ··· = R_J = 0; that is, all responses are observed through time j − 1, and no responses are observed thereafter.
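The two definitions are easy to operationalize. Below is a minimal Python sketch (the function names are ours and purely illustrative) that checks an indicator vector against Definitions 5.1 and 5.2 and computes the dropout time via the representation U = 1 + Σ_j R_j used in Section 5.5.

```python
import numpy as np

def is_monotone(r):
    """Definition 5.2: no response is observed after the first missing one
    (the sequence R_1, ..., R_J never steps from 0 back to 1)."""
    return not np.any(np.diff(np.asarray(r)) > 0)

def dropout_time(r):
    """For a monotone pattern, the dropout time is U = 1 + sum_j R_j."""
    assert is_monotone(r)
    return 1 + int(np.sum(r))

print(is_monotone([1, 1, 0, 0]), dropout_time([1, 1, 0, 0]))  # True 3
print(is_monotone([1, 0, 1, 0]))                              # intermittent: False
```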
5.3 Full-data models and missing data mechanism

We are now prepared to describe models that can be used for drawing inference about a full-data parameter, say θ, from incompletely observed data. To meet this objective, we begin with a model for the joint distribution of the full data, p(y, r | x, ω), indexed by a parameter ω. The parameter of interest is the one indexing the full-data response distribution p(y | x, ω); the target of inference θ is a subset or function of ω. By specifying p(y, r | x, ω), the analyst either implicitly or explicitly specifies both the full-data response model p(y | x, ω) and a missing data mechanism p(r | y, x, ω). This section describes how the three are related.

Targets of inference

In almost all practical settings, the analyst is interested in making inferences about a parameter θ that indexes the full-data response model
\[
p(y \mid x, \theta) = p(y_1, y_2, \ldots, y_J \mid x, \theta).
\]
Examples include the full-data mean µ(θ) = E(Y_i | θ), or regression coefficients β = β(θ) in a linear model for E(Y_i | X_i, β). When data are completely observed, p(y | x, θ) can be specified directly. If responses are not fully observed, θ is a function of the parameter ω indexing a larger model p(y, r | x, ω) of the full data (Y, R, X). The form of the model for the responses will depend on specifications of, and assumptions about, the full-data model.

Definition 5.3. Full-data model. Let Y denote the full-data response vector for an individual, let R denote the associated vector of missingness indicators, and let X represent covariates of interest. The full-data model describes the joint distribution of Y and R, conditionally on covariates X; it is indexed by a finite-dimensional parameter ω and is written p(y, r | x, ω).

The full-data response model is determined by the full-data model.

Definition 5.4. Full-data response model. The full-data response model characterizes the distribution of Y conditionally on covariates X. It is indexed by a parameter θ = θ(ω) that is a subset or function of the full-data parameter ω,
\[
p(y \mid x, \theta(\omega)) = \int p(y, r \mid x, \omega)\, dr.
\]
(The integral representation here is meant to be very general; although R is discrete in our discussions to this point, we use a slight abuse of notation, writing p(r) dr rather than dP(r), to maintain consistency of notation and clarity of ideas.)

Inference about θ will depend crucially on choices made for the specification of the full-data model p(y, r | x, ω); however, observed data offer no information to validate these choices.
Assumptions made by the data analyst will generally exert considerable influence over final inferences, even for very large samples of observed data. This point can sometimes be lost when using widely adopted models that are based on assumptions like missing at random (MAR) and are widely available in standard software packages. Many commonly used approaches, such as posterior inference under ignorability, can be viewed as posterior inference about a full-data model under some very specific assumptions about the joint distribution of Y and R and/or constraints on the parameter ω.

Missing data mechanisms and the full-data model

In general terms, the missing data mechanism is the stochastic mechanism that leads to missingness among elements of Y. Formally, we will use the term missing data mechanism to refer to the conditional distribution of the missing data indicators R given the full-data response Y and covariates X, p(r | y, x, ω). Generally speaking, specification of p(y, r | x, ω) will imply a missing data mechanism; in practice, the modeler often begins with a working assumption about the missing data mechanism and uses it to specify the full-data model. In what follows, we define what is meant by missing data mechanism and then describe a hierarchy of assumptions about this mechanism.

Definition 5.5. Missing data mechanism. The missing data mechanism is the model for the joint distribution of the missing data indicators R as a function of Y and X; it is indexed by a finite-dimensional parameter ψ = ψ(ω) and written as p(r | y, x, ψ(ω)).

Any full-data model can therefore be factored as the product of a full-data response model and the associated missing data mechanism,
\[
p(y, r \mid x, \omega) = p(y \mid x, \theta(\omega))\, p(r \mid y, x, \psi(\omega)). \tag{5.1}
\]
Importantly, being able to characterize the missing data mechanism does not depend on whether the full-data model has been specified according to (5.1), which is commonly referred to as a selection model factorization. The implied missing data mechanism can, in principle, be derived from any specification of the full-data model, but it may not always take a closed form.
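The factorization (5.1) also gives a constructive recipe for generating full data: draw Y from a full-data response model, then draw R from an assumed mechanism. The following sketch does this for a bivariate response (all parameter values are illustrative, not taken from any of the book's datasets); note how the observed mean of Y_2 is distorted when R depends on Y.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Full-data response model p(y | theta): bivariate normal
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])
y = rng.multivariate_normal(mu, Sigma, size=n)

# Missing data mechanism p(r | y, psi): logistic; psi2 != 0 makes it MNAR
psi0, psi1, psi2 = 0.5, 0.8, -0.4
p_obs = 1.0 / (1.0 + np.exp(-(psi0 + psi1 * y[:, 0] + psi2 * y[:, 1])))
r = rng.binomial(1, p_obs)

# Observed data: Y1 always seen, Y2 seen only when R = 1
print("P(R = 1)      :", r.mean())
print("E(Y2), full   :", y[:, 1].mean())
print("E(Y2 | R = 1) :", y[r == 1, 1].mean())   # selection bias
```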
5.4 Common assumptions about the missing data mechanism

Restrictions on the missing data mechanism can be classified as missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR); these progressively weaker assumptions delineate the dependence of the missing data indicators R on the observed and missing parts of the full-data response vector Y. The taxonomy and its terminology were developed by Rubin (1976) and generalized in a series of papers on coarsening by Heitjan and colleagues (198x). Related work can be found in Keiding et al. (1994) and Gill and Robins (1997).

To define missing data assumptions, it is useful to rewrite the missing data mechanism as
\[
p(r \mid y, x, \psi) = p(r \mid y_{\mathrm{obs}}, y_{\mathrm{mis}}, x, \psi)
\]
to emphasize the dependence of R on the observed and missing components of Y. Although we write ψ, it is assumed that ψ = ψ(ω) unless stated otherwise.

5.4.1 Missing completely at random (MCAR)

Definition 5.6. Missing completely at random. Missing responses are missing completely at random (MCAR) if, for all x and ψ,
\[
p(r \mid y, x, \psi) = p(r \mid x, \psi);
\]
i.e., if p(r | y, x, ψ) is a constant function of y for all x and ψ.

One implication of MCAR is that missingness can be fully explained by covariates X that are included in the full-data model. Another is that the full-data distribution can be factored as
\[
p(y, r \mid x, \omega) = p(y \mid x, \omega)\, p(r \mid x, \omega),
\]
meaning that the observed and missing response data have the same distribution, conditionally on X. If in addition the full-data parameter ω can be separated as (θ, ψ), then inference about the full-data response distribution can be based solely on those with complete response data. Even though this is a valid approach, it may not make optimal use of the available data, and posteriors for θ may have unnecessarily high variance.

Example 5.3. Missing completely at random without covariates. A longitudinal cohort study of school performance will record test scores each year for two years of secondary school; these are Y_1 and Y_2. In the first
year, 1000 students are sampled and their test scores are recorded. In the second year, budget constraints force the investigators to reduce the sample size to 800. Data on Y_2 are collected for a random subsample of the original 1000 students. Because the subsample is randomly drawn, missing responses on Y_2 are missing completely at random.

Example 5.4. Missing completely at random with covariates (continuation of Example 5.3). In the same study, suppose the investigators are interested in comparing test scores between boys and girls, and let X denote gender. The distribution of interest is [Y | X]. Suppose further that the random subsample at time 2 oversamples girls, so that girls are more likely to have Y_2 observed. Because gender is a model covariate, the missing data on Y_2 are missing completely at random.

The MCAR assumption usually is not realistic for longitudinal studies because unplanned missingness is so common. Consider the smoking cessation trial in Example ??, where Y = (Y_1, ..., Y_{12}) are the smoking indicators measured weekly over the course of the study, X is the binary indicator of treatment, and R = (R_1, ..., R_{12}) are the binary missingness indicators. The full-data model of interest is p(y | x, θ), which characterizes the joint distribution of smoking outcomes conditionally on treatment group X. The MCAR assumption states that p(r | y, x, ψ) = p(r | x, ψ); in this context, it means that the probability of nonresponse can depend on treatment group, but within treatment group, nonresponse is completely independent of smoking outcomes. Under the plausible (and testable) scenario that, within treatment group, participants observed to be heavier smokers are more likely to drop out, the MCAR assumption would not hold.
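Unlike MAR and MNAR, the MCAR restriction has testable implications for the observables: given X, the distribution of the observed portion of Y must be the same for dropouts and completers. A minimal sketch of such a check, comparing the time-1 response between dropouts and completers within treatment arm (names ours; a regression of R on observed outcomes would serve the same purpose):

```python
import numpy as np

def mcar_check(y1, r, x):
    """Standardized difference in Y1 between completers (r = 1) and
    dropouts (r = 0) within each level of x; under MCAR given X these
    means agree, so a large |z| is evidence against MCAR."""
    for arm in np.unique(x):
        a = y1[(x == arm) & (r == 1)]
        b = y1[(x == arm) & (r == 0)]
        diff = a.mean() - b.mean()
        se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
        print(f"arm {arm}: diff = {diff:.3f}, z = {diff / se:.2f}")
```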
5.4.2 Missing at random (MAR)

A more realistic condition for many longitudinal studies is missing at random (MAR), which essentially requires that the probability of nonresponse be a function of observed responses Y_obs and model covariates X.

Definition 5.7. Missing at random. Missing responses are missing at random (MAR) if, for all y_obs, x and ψ,
\[
p(r \mid y_{\mathrm{obs}}, y_{\mathrm{mis}}, x, \psi) = p(r \mid y_{\mathrm{obs}}, x, \psi);
\]
i.e., if p(r | y_obs, y_mis, x, ψ) is a constant function of y_mis for all y_obs, x and ψ.

The MAR assumption has two important implications for model-based inference. First, MAR implies that missingness can be explained by observed responses Y_obs and model covariates X. If, in addition, the parameters θ and ψ are distinct, then the observed-data likelihood is a function of θ only. If we impose the additional condition that θ and ψ are a priori independent, then the observed-data posterior is a function of θ only, and the missing data mechanism does not have to be specified in order to obtain posterior inference about θ. This combined set of assumptions constitutes the ignorability condition, discussed in detail in Section 5.7.

Example 5.5. A missing at random mechanism (continuation of Example 5.3). If those with lower values of the first test score Y_1 are less likely to sit for the second test, then missingness in Y_2 depends on Y_1. If missingness further depends on X, but does not depend on any other variable, then the missing data mechanism is MAR.

The implications of the MAR condition for dropout mechanisms require some more development and are discussed in more detail in Section 5.5.

5.4.3 Missing not at random (MNAR)

Although the MAR assumption is fairly general in that it allows the missing data mechanism (and sometimes the missing data itself) to be explained by observables, there are cases where MAR may fail to hold. MAR will not be valid, for example, when the probability of missingness depends on the value of the missing response or on other unobservables, even after conditioning on observed data. We refer to mechanisms of this type as missing not at random (MNAR).

Definition 5.8. Missing not at random. Missing responses are missing not at random (MNAR) if, for some y_mis ≠ y′_mis,
\[
p(r \mid y_{\mathrm{obs}}, y_{\mathrm{mis}}, x, \psi) \ne p(r \mid y_{\mathrm{obs}}, y'_{\mathrm{mis}}, x, \psi);
\]
i.e., if R depends on some part of Y_mis, even after conditioning on Y_obs and X.

Example 5.6. Missing not at random mechanism: dropout in a smoking cessation study. Returning to the smoking cessation trial, dropout is an inevitable complication. Consider a simplified version where the full data consist of cessation outcomes (Y_1, Y_2), treatment indicator X, and missing data indicator R. Under MAR,
\[
P(R = 1 \mid X, Y_1, Y_2 = 1) = P(R = 1 \mid X, Y_1, Y_2 = 0),
\]
which implies that, conditional on treatment group and time-1 smoking status Y_1, the dropout probability is the same for both smokers and
nonsmokers at time 2. In smoking cessation studies, this assumption will not hold if the dropout probability differs by Y_2 (or, equivalently, if smoking status Y_2 differs between dropouts and non-dropouts having the same (X, Y_1) profile). Lichtenstein et al. [] cite empirical evidence that in smoking cessation studies, participants followed up after dropout are more likely than not to have experienced relapse, raising the possibility that MAR is not plausible here.

5.4.4 Auxiliary variables

Recall that the full-data response model is p(y | x, θ(ω)), and primary interest is in parameters governing this distribution. The parameters of interest are a function of those indexing the full-data model p(y, r | x, ω). We have implicitly assumed up to this point that (Y, R, X) constitutes the only data that potentially are available. In many cases, however, investigators may have access to a set of auxiliary variables, denoted by Z, that may be correlated with one or more of (Y, R, X). Examples of auxiliary variables include covariates that are observed but not of direct interest, or other longitudinal responses that are measured concurrently with Y. In a clinical trial, the model covariates X would include only treatment indicators (and possibly stratification variables); however, other covariates might be available. In an HIV cohort study where longitudinal CD4 counts are the response of interest, the investigator may also have access to viral load measurements taken at the same time points.

If the primary responses are completely observed, the modeler generally has little or nothing to gain by using these auxiliary variables to estimate covariate effects or other parameters; when data are incomplete, however, auxiliary variables can be helpful to the extent that they explain variation in the joint distribution of Y and R. Missingness mechanisms can be defined conditionally on auxiliary variables as well.

Definition 5.9. Auxiliary variable MAR (A-MAR). Missing responses are A-MAR if MAR holds conditionally on auxiliary variables Z; i.e., if, for all y_obs, x, z and ψ*,
\[
p(r \mid y_{\mathrm{obs}}, y_{\mathrm{mis}}, x, z, \psi^*) = p(r \mid y_{\mathrm{obs}}, x, z, \psi^*).
\]

We use ψ* to emphasize that the parameters indexing this missing data mechanism are different from the ones indexing p(r | y_obs, y_mis, x, ψ). Clearly, A-MAR does not imply MAR. Consequently, when primary interest is in the full-data response model p(y | x, θ), which does not
involve Z, it is necessary to specify a full-data model conditional on Z and integrate it out to obtain the unconditional model. Specifically, the full-data model is
\[
p(y, r \mid x, \omega) = \int p(y, r, z \mid x, \omega)\, dz. \tag{5.2}
\]

Returning to the longitudinal HIV example, suppose Y is longitudinal CD4 count and Z is longitudinal viral load, and that there is appreciable missingness in CD4. The MAR condition requires the analyst to assume that, conditional on observed CD4 history, missingness is unrelated to the CD4 count that would have been measured; this may be unrealistic. Further suppose that the investigator can confidently specify a joint model for CD4 count and viral load, based on historical data, and that viral load is thought to explain sufficient variability in CD4 count that one could use it to predict missing responses. In other words, the analyst is willing to assume A-MAR, or MAR conditional on viral load. This approach requires specification of the joint distribution of (Y, R, Z) under the integral sign in (5.2). One possibility is to write p(y, z | x) p(r | y, z, x), where the first factor is the joint model for CD4 and viral load, and the second is the missing data mechanism.

As we will see in Chapter ??, incorporating auxiliary covariates can sometimes be simpler than it would appear. Under A-MAR (and assuming ignorability), the missing data mechanism can be left unspecified. If the joint model p(y, z | x) is multivariate normal, then the marginal p(y | x) is easy to obtain, and specialized methods for integrating over z are not needed.
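The closing observation is worth making concrete: when (Y, Z) is modeled as jointly multivariate normal, integrating the auxiliary process out of (5.2) amounts to dropping the Z coordinates from the mean and covariance. A sketch (values illustrative):

```python
import numpy as np

def mvn_marginal(mu, Sigma, keep):
    """Marginal distribution of the coordinates in `keep` for a joint
    multivariate normal; no numerical integration over z is needed."""
    keep = np.asarray(keep)
    return mu[keep], Sigma[np.ix_(keep, keep)]

# Joint model for (Y1, Y2, Z); the marginal of (Y1, Y2) drops the Z entries
mu = np.array([0.0, 1.0, 5.0])
Sigma = np.array([[1.0, 0.5, 0.3],
                  [0.5, 1.0, 0.4],
                  [0.3, 0.4, 2.0]])
print(mvn_marginal(mu, Sigma, keep=[0, 1]))
```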
5.5 Missing at random applied to dropout processes

The definitions of MCAR, MAR and MNAR given in the previous section were fully general. In this section we describe the implications of the MAR assumption for handling monotone missing data patterns caused by dropout. A number of papers have addressed this topic, albeit from different perspectives and sometimes without clear consensus on terminology. Key references include Little (1995), Little and Rubin (2002), and Diggle and Kenward (1994). References related to pattern mixture models include Molenberghs et al. (1998, Statistica Neerlandica) and Kenward and Molenberghs (Biometrika). See also Fitzmaurice (Soc reviews), Hogan et al. (2004, Statistics in Medicine), and Gill and Robins (lecture notes; Annals of Statistics).

Although the MAR condition is rather simple to state and interpret for cross-sectional data, it is somewhat less transparent when applied to longitudinal data, particularly if missingness patterns are non-monotone. When missingness follows a monotone pattern, as when dropout is the sole cause of missing data, it turns out that MAR has an intuitive representation in terms of the hazard of dropout, with obvious analogies to the cross-sectional setting.

Our discussion assumes a discrete set of measurement times within individual and that dropout happens at one of the scheduled times. Without loss of generality, let t_{ij} = j for j = 1, ..., J. Recall that the missingness indicator at time t_{ij} is R_{ij}, with R_{ij} = I(Y_{ij} observed). Hence dropout time can be defined as U_i = 1 + Σ_j R_{ij}. Further define Ȳ_{ij} = (Y_{i1}, ..., Y_{ij}) to be the subset of the full-data vector Y_i that would be observed up to and including measurement time j.

Under monotone dropout, the MAR condition states that
\[
P(U = j \mid \bar{Y}_J) = P(U = j \mid \bar{Y}_{j-1}),
\]
or that the marginal probability of dropping out at t_j cannot depend on measurements at or beyond t_j (Kenward and Molenberghs, 200x). In practical settings, this version of MAR can be somewhat difficult to interpret because it makes reference to the marginal probability of dropping out at j, P(U = j), rather than the more intuitive hazard function, P(U = j | U ≥ j). The marginal dropout rate is the probability of dropping out at j among all those who begin follow-up, while the hazard rate is the probability of dropping out at j among those still in follow-up at j.

At time j and for any given response history Ȳ_k = (Y_1, ..., Y_k), let
\[
h_j(\bar{Y}_k) = P(U = j \mid U \ge j, \bar{Y}_k)
\]
denote the hazard of dropout at j conditional on Ȳ_k. It turns out that monotone missingness caused by dropout is MAR if and only if the hazard of dropout at j depends only on the observed response history up to j − 1, as shown in the following.

Proposition 1. Under monotone dropout, missingness is MAR if and only if, for all j, the hazard of dropout at t_j is independent of Y_j, ..., Y_J conditionally on Y_1, ..., Y_{j−1}.

Proof: Recall that under MAR, P(U = j | Ȳ_J) = P(U = j | Ȳ_{j−1}). By definition, the hazard can be written in terms of the marginal dropout probabilities,
\[
h_j(\bar{Y}_k) = \frac{P(U = j \mid \bar{Y}_k)}{1 - \sum_{l=1}^{j-1} P(U = l \mid \bar{Y}_k)},
\]
for any j, k such that 1 ≤ j ≤ J and 1 ≤ k ≤ J. Hence, under MAR, we
have
\[
h_j(\bar{Y}_J) = \frac{P(U = j \mid \bar{Y}_J)}{1 - \sum_{l=1}^{j-1} P(U = l \mid \bar{Y}_J)}
\;\overset{\mathrm{MAR}}{=}\; \frac{P(U = j \mid \bar{Y}_{j-1})}{1 - \sum_{l=1}^{j-1} P(U = l \mid \bar{Y}_{l-1})}
= \text{function of } Y_1, \ldots, Y_{j-1}.
\]

Conversely, suppose the hazard of dropout at j given Ȳ_J depends only on Ȳ_{j−1}; that is,
\[
P(R_j = 0 \mid R_1 = \cdots = R_{j-1} = 1, \bar{Y}_J) = P(R_j = 0 \mid R_1 = \cdots = R_{j-1} = 1, \bar{Y}_{j-1}).
\]
To show this implies MAR, we need to demonstrate that P(U = j | Ȳ_J) is constant with respect to Y_j, ..., Y_J, or equivalently, is a function of Y_1, ..., Y_{j−1} only. Writing it in terms of the hazard, we have
\[
\begin{aligned}
P(U = j \mid \bar{Y}_J) &= P(R_1 = \cdots = R_{j-1} = 1, R_j = 0 \mid \bar{Y}_J) \\
&= P(R_j = 0 \mid R_1 = \cdots = R_{j-1} = 1, \bar{Y}_J) \\
&\quad \times P(R_{j-1} = 1 \mid R_{j-2} = \cdots = R_1 = 1, \bar{Y}_J) \cdots P(R_1 = 1 \mid \bar{Y}_J) \\
&= h_j(\bar{Y}_J) \prod_{k=1}^{j-1} \{1 - h_k(\bar{Y}_J)\}.
\end{aligned} \tag{5.3}
\]
But if h_j(Ȳ_J) = h_j(Ȳ_{j−1}), then (5.3) reduces to
\[
P(U = j \mid \bar{Y}_J) = h_j(\bar{Y}_{j-1}) \prod_{k=1}^{j-1} \{1 - h_k(\bar{Y}_{k-1})\}
= \text{function of } Y_1, \ldots, Y_{j-1} \text{ only}.
\]
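The identity (5.3) is easy to compute with. The sketch below converts a sequence of discrete hazards for one subject into the marginal dropout probabilities P(U = j) (hazard values illustrative):

```python
import numpy as np

def dropout_probs(h):
    """Marginal dropout probabilities from discrete hazards via (5.3):
    P(U = j) = h_j * prod_{k < j} (1 - h_k)."""
    h = np.asarray(h)
    surv = np.concatenate([[1.0], np.cumprod(1.0 - h)[:-1]])  # prod_{k<j}(1-h_k)
    return h * surv

h = np.array([0.10, 0.20, 0.30, 0.25])
p = dropout_probs(h)
print(p, " P(completes follow-up) =", 1.0 - p.sum())
```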
This result is useful for making the MAR assumption more transparent; dropout is really an event process, and it is more intuitive to think about imposing conditions on the hazard of dropout rather than on its marginal probability. It also is helpful in writing down the observed-data likelihood under MAR, as we will see in Section 5.7.

In the foregoing, we have not discussed the role of covariates, and in particular time-varying covariates. These results can be expected to hold for covariates meeting the following conditions:

(i) time-invariant covariates observed at baseline;
(ii) time-varying covariates that are a deterministic function of baseline value (e.g., age);
(iii) stochastic time-varying covariates that are external to the response process and can be expected to be observed throughout the measurement period (e.g., when air pollution level measured from a monitoring station is a covariate and the health status of an individual is the response process; the individual may drop out of the study, but air pollution will continue to be measured).

More detailed discussion of covariates follows in the chapter on causal inference (Chapter 11) and possibly in Chapter 9.

5.6 Posterior inference from incomplete longitudinal data

Thus far, we have described processes that cause data to be missing, distinguished full from observed data, described the Rubin taxonomy for missing data mechanisms, and showed how it applies to monotone missingness induced by dropout. In particular, we showed that in monotone patterns, the MAR condition is equivalent to assuming the hazard of dropout at some time t_j depends only on observed response data up to and including time t_{j−1}. We are now ready to describe formal methods for drawing posterior inference about full-data parameters from observed (incomplete) data.

Posterior inference from incomplete data requires the analyst to specify a full-data model and priors for the parameters indexing that model. The full-data model, including its assumptions and constraints, will encode both the full-data response model and the missing data mechanism. The posterior distribution p(ω | y_obs, r) is derived directly from the assumed full-data model and its priors. When interest is in a parameter θ = θ(ω) (e.g., the parameter indexing p(y | θ)), posteriors of the appropriate functionals or subsets of ω are used. We do not confine ourselves in this discussion to any particular factorization or specification of the full-data model; the setup described here applies to any full-data model (e.g., selection model, mixture model). Posterior inference under particular full-data model specifications is described and discussed in Sections 5.7 and 5.8.

By definition, the observed-data posterior is
\[
p(\omega \mid y_{\mathrm{obs}}, r) = \frac{p(y_{\mathrm{obs}}, r \mid \omega)\, p(\omega)}{\int p(y_{\mathrm{obs}}, r \mid \omega)\, p(\omega)\, d\omega}
= \frac{p(y_{\mathrm{obs}}, r \mid \omega)\, p(\omega)}{p(y_{\mathrm{obs}}, r)} \tag{5.4}
\]
(for clarity we drop covariate dependence, but it is implied throughout).
The observed-data model p(y_obs, r | ω) is obtained by averaging the full-data model over all possible realizations of the missing data,
\[
p(y_{\mathrm{obs}}, r \mid \omega) = \int p(y_{\mathrm{obs}}, y_{\mathrm{mis}}, r \mid \omega)\, dy_{\mathrm{mis}}.
\]
The observed-data likelihood function is any function of the observed data that is proportional in ω to p(y_obs, r | ω); i.e.,
\[
L_{\mathrm{obs}}(\omega \mid y_{\mathrm{obs}}, r) \propto p(y_{\mathrm{obs}}, r \mid \omega).
\]
Because the marginal distribution p(y_obs, r) in the denominator of (5.4) is constant with respect to ω, the observed-data posterior is proportional to the product of the full-data prior p(ω) and the observed-data likelihood,
\[
p(\omega \mid y_{\mathrm{obs}}, r) \propto p(\omega)\, L_{\mathrm{obs}}(\omega \mid y_{\mathrm{obs}}, r). \tag{5.5}
\]
In the next sections we describe specific strategies for observed-data posterior inference about full-data parameters. In general the missing data distribution is not identifiable, and assumptions are needed for both the full-data model and the priors. When missingness is MAR, the ignorability assumption provides considerable simplification of the observed-data posterior; this is reviewed in Section 5.7. When missingness is not MAR, the key assumptions usually relate to the association structure between Y and R, and specifically to the association between Y_mis and R conditional on Y_obs (and X).

5.7 Posterior inference under the ignorability assumption

When missingness is MAR, the ignorability condition, first described by Rubin (1976), can be used to facilitate posterior inference without having to specify the missing data mechanism. The ignorability condition has three key ingredients.

Definition 5.10. Ignorable missing data mechanism. A missing data mechanism is said to be ignorable for the purposes of posterior inference if

1. the missing data mechanism is MAR;
2. the full-data parameter ω can be decomposed as ω = (θ, ψ), where θ indexes the full-data response model p(y | θ) and ψ indexes the missing data mechanism p(r | y, ψ); and
3. the parameters θ and ψ are a priori independent; i.e., p(θ, ψ) = p(θ) p(ψ).
An important consequence of ignorability is that inference about θ, typically of direct interest, can be based on a likelihood function that is proportional in θ to the observed-data response model,
\[
p(y_{\mathrm{obs}} \mid \theta) = \int p(y_{\mathrm{obs}}, y_{\mathrm{mis}} \mid \theta)\, dy_{\mathrm{mis}}.
\]
This can be shown as follows. Recall that the observed-data posterior for ω is given by
\[
p(\omega \mid y_{\mathrm{obs}}, r) \propto p(\omega)\, L_{\mathrm{obs}}(\omega \mid y_{\mathrm{obs}}, r). \tag{5.6}
\]
If we factor the full-data model as p(y, r | ω) = p(y | θ) p(r | y, ψ), then condition (1), MAR, implies that p(r | y, ψ) = p(r | y_obs, ψ). By condition (2) of the ignorability assumption, we can write L_obs(ω | y_obs, r) = L_obs(θ, ψ | y_obs, r). Conditions (1) and (3), MAR and independence of the priors on θ and ψ, lead to the further simplification
\[
\begin{aligned}
L_{\mathrm{obs}}(\theta, \psi \mid y_{\mathrm{obs}}, r) &\propto \int p(r \mid y_{\mathrm{obs}}, y_{\mathrm{mis}}, \psi)\, p(y_{\mathrm{obs}}, y_{\mathrm{mis}} \mid \theta)\, dy_{\mathrm{mis}} \\
&\overset{\mathrm{MAR}}{=} p(r \mid y_{\mathrm{obs}}, \psi) \int p(y_{\mathrm{obs}}, y_{\mathrm{mis}} \mid \theta)\, dy_{\mathrm{mis}} \\
&= p(r \mid y_{\mathrm{obs}}, \psi)\, p(y_{\mathrm{obs}} \mid \theta) \\
&= L_1(\psi \mid r, y_{\mathrm{obs}})\, L_2(\theta \mid y_{\mathrm{obs}}).
\end{aligned} \tag{5.7}
\]
Conditions (2) and (3) from the definition of ignorability imply that p(ω) = p(θ) p(ψ). Substituting the simplified likelihood (5.7) into the observed-data posterior (5.6) and using the factored prior, we have
\[
p(\omega \mid y_{\mathrm{obs}}, r) \propto p(\omega)\, L_{\mathrm{obs}}(\omega \mid y_{\mathrm{obs}}, r)
= \{p(\psi) L_1(\psi \mid r, y_{\mathrm{obs}})\}\, \{p(\theta) L_2(\theta \mid y_{\mathrm{obs}})\}
\]
(see also Little and Rubin, 2002, pages xxx). Because the posterior factors over ψ and θ, the ignorability condition implies
\[
p(\theta \mid y_{\mathrm{obs}}) \propto p(\theta)\, L_2(\theta \mid y_{\mathrm{obs}}), \tag{5.8}
\]
which involves neither R nor the part of the likelihood corresponding to p(r | y, ψ); i.e., the missing data mechanism is ignored for the purposes of drawing posterior inference about θ.

We illustrate with a derivation for the case where J = 2. For those with missing Y_2, the observed data are (Y_1, R = 0, X), and the contribution to the likelihood is proportional to p(y_1, r | θ, ψ), evaluated at y_1 = y_{i1} and r = 0. The expression for this term must be derived by integrating over the distribution of missing responses, in this case y_2, for the
specified model,
\[
p(y_1, r \mid \theta, \psi) = \int p(y_1, y_2 \mid x, \theta)\, p(r \mid y_1, y_2, x, \psi)\, dy_2. \tag{5.9}
\]
If missingness in Y_2 is MAR, then p(r | y_1, y_2, x, ψ) = p(r | y_1, x, ψ), and (5.9) reduces to
\[
\int p(y_1, y_2 \mid x, \theta)\, p(r \mid y_1, x, \psi)\, dy_2.
\]
Because y_2 appears only in the first term under the integral, (5.9) further reduces to p(y_1 | θ) p(r | y_1, ψ). Hence the observed-data likelihood factors over ψ and θ as
\[
\begin{aligned}
L(\theta, \psi \mid y_{i,\mathrm{obs}}, r_i) &\propto \{p(y_{i1}, y_{i2} \mid x_i, \theta)\}^{r_i} \{p(y_{i1} \mid x_i, \theta)\}^{(1 - r_i)}\, p(r_i \mid y_{i1}, x_i, \psi) \\
&= L_1(\theta \mid y_{i,\mathrm{obs}}, x_i)\, L_2(\psi \mid r_i, y_{i1}, x_i).
\end{aligned} \tag{5.10}
\]

A second implication of ignorability is that missing responses Y_mis can be imputed, or extrapolated, from observed data Y_obs using only the full-data response model (ignoring the missing data mechanism); this can be seen as follows (again suppressing covariates for clarity):
\[
\begin{aligned}
p(y_{\mathrm{mis}} \mid y_{\mathrm{obs}}, r, \omega) &= p(y_{\mathrm{mis}} \mid y_{\mathrm{obs}}, r, \theta, \psi) \qquad (5.11) \\
&= \frac{p(y_{\mathrm{mis}}, y_{\mathrm{obs}}, r \mid \theta, \psi)}{p(y_{\mathrm{obs}}, r \mid \theta, \psi)} \\
&= \frac{p(r \mid y_{\mathrm{obs}}, y_{\mathrm{mis}}, \psi)\, p(y_{\mathrm{obs}}, y_{\mathrm{mis}} \mid \theta)}{p(r \mid y_{\mathrm{obs}}, \psi)\, p(y_{\mathrm{obs}} \mid \theta)} \\
&= \frac{p(y_{\mathrm{obs}}, y_{\mathrm{mis}} \mid \theta)}{p(y_{\mathrm{obs}} \mid \theta)} \qquad (5.12) \\
&= p(y_{\mathrm{mis}} \mid y_{\mathrm{obs}}, \theta),
\end{aligned}
\]
where (5.11) follows from condition (2) of the ignorability assumption, and (5.12) is a direct consequence of MAR (condition 1). Applied to the simple case of bivariate response data where Y_1 is always observed but Y_2 may be missing, we see that
\[
p(y_2 \mid y_1, r = 0, \theta) = p(y_2 \mid y_1, r = 1, \theta);
\]
hence missing values of Y_2 can be imputed from a model of p(y_2 | y_1, r = 1, θ) (e.g., from a regression of Y_2 on Y_1 among those with complete data). This consequence of the ignorability assumption figures prominently when using data augmentation techniques for posterior inference under MAR (Chapter ??).
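The last display licenses a simple imputation scheme for the bivariate case: fit the regression of Y_2 on Y_1 among completers and draw the missing Y_2 from it. A sketch follows; for brevity it conditions on least-squares point estimates, whereas a fully Bayesian version would also draw the regression parameters from their posterior (as in Chapter 6).

```python
import numpy as np

def impute_y2(y1, y2, r, rng):
    """Draw Y2 for subjects with r = 0 from the completer regression of
    Y2 on Y1, using p(y2 | y1, r = 0) = p(y2 | y1, r = 1) under MAR."""
    W = np.column_stack([np.ones((r == 1).sum()), y1[r == 1]])
    beta, *_ = np.linalg.lstsq(W, y2[r == 1], rcond=None)
    sd = (y2[r == 1] - W @ beta).std(ddof=2)   # residual SD
    out = y2.copy()
    out[r == 0] = beta[0] + beta[1] * y1[r == 0] \
        + rng.normal(0.0, sd, (r == 0).sum())
    return out
```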
Calling the missing data mechanism ignorable can be deceptive; if we view the process of inference as starting from a full-data model and moving forward with a series of assumptions, then plainly we have not ignored the missing data mechanism at all; indeed, we have paid very close attention to it by placing the MAR assumption on p(r | y, ψ), by partitioning the parameter space of ω, and by assuming a priori independence between θ and ψ. It is possible to make convincing qualitative justifications of these assumptions, but they cannot be empirically verified (Chapter 8), and therefore should not be viewed as benign. The term ignorable means that once these assumptions and parameter constraints have been imposed, and assuming the full-data model has been correctly specified, the missing data mechanism can be safely ignored when making inference about the full-data response model parameter θ = θ(ω) from (Y_obs, X).

We are now ready to illustrate applications of ignorability with some specific distributions. Here we focus on construction of the observed-data likelihood, identification of full-data parameters, and the predictive distribution of Y_mis given Y_obs.

Example 5.7. Bivariate normal with missing Y_2. Consider a study, such as a pretest-posttest design, where measurements Y_1 are taken at baseline and Y_2 at follow-up. All individuals have their outcome recorded at baseline, but some are missing at time 2. Further assume that covariates of interest are time-invariant (e.g., baseline characteristics, treatment group, etc.). The full data for each individual are (Y_1, Y_2, R, X), where R = I(Y_2 observed). The full-data model is denoted by p(y_1, y_2, r | x, ω). If we wish to model (Y_1, Y_2 | X) using a bivariate normal distribution under ignorability, we must make the following assumptions:

1. the full-data model can be factored as
p(y_1, y_2, r | x, ω) = p(y_1, y_2 | x, θ) p(r | y_1, y_2, x, ψ),
where ω = (θ, ψ);
2. the full-data response model p(y_1, y_2 | x, θ) is N(µ, Σ), where Σ = {σ_{jk}} and θ = (µ_1, µ_2, σ_{11}, σ_{12}, σ_{22});
3. the missing data mechanism is MAR, which implies p(r | y_1, y_2, x, ψ) = p(r | y_1, x, ψ); and
4. the prior distribution factors as p(θ, ψ) = p(θ) p(ψ).

In our discussions of the normal distribution, we use the following notational conventions: θ_j denotes the parameters of the marginal distribution p(y_j | x, θ_j); θ_{j|k} represents the same for the conditional distribution p(y_j | y_k, θ_{j|k}); and θ_{jk} indexes the joint distribution p(y_j, y_k | θ_{jk}). For the bivariate normal, θ_{12} = θ. From our bivariate example (5.10), the
observed-data likelihood contribution for individual i is
\[
\begin{aligned}
L_{\mathrm{obs},i}(\theta \mid y_{\mathrm{obs},i}) &= p(y_{i1}, y_{i2} \mid \theta_{12})^{r_i} \left\{ \int p(y_{i1}, y_2 \mid \theta)\, dy_2 \right\}^{(1 - r_i)} \\
&= p(y_{i1}, y_{i2} \mid \theta_{12})^{r_i}\, p(y_{i1} \mid \theta_1)^{(1 - r_i)}.
\end{aligned}
\]
An equivalent representation is
\[
L_{\mathrm{obs},i}(\theta \mid y_{\mathrm{obs},i}) = p(y_{i2} \mid y_{i1}, \theta_{2|1})^{r_i}\, p(y_{i1} \mid \theta_1),
\]
which emphasizes that although the parameters (µ_2, σ_{12}, σ_{22}) govern the full-data distribution, including those with missing Y_2, their posterior will generally be informed only by those with observed Y_2.

The previous example suggests a general form for the observed-data likelihood under ignorable dropout,
\[
L_{\mathrm{obs},i}(\theta \mid y_{i,\mathrm{obs}}) = p(y_{i1} \mid \theta_1) \prod_{j=2}^{J} p(y_{ij} \mid y_{i,j-1}, \ldots, y_{i1}, \theta_{j \mid \{1,\ldots,j-1\}})^{r_{ij}}. \tag{5.13}
\]

Example 5.8. First-order marginalized transition model under ignorability. The full-data MTM(1) model follows
\[
\begin{aligned}
p(y_1 \mid x, \theta_1) &\sim \mathrm{Ber}(\pi_1) \\
p(y_j \mid y_{j-1}, x, \theta_{j|j-1}) &\sim \mathrm{Ber}(\phi_j),
\end{aligned}
\]
where
\[
\begin{aligned}
g(\pi_1) &= x_1 \beta \\
h(\phi_j) &= \Delta_j(x) + y_{j-1} \gamma.
\end{aligned}
\]
Hence θ_1 = β and θ_{j|j−1} = (Δ_j, γ). Given x and γ, the parameter Δ_j(x) is determined by the marginal regression constraint
\[
\pi_j = E(Y_j \mid X = x) = g^{-1}(x_j \beta).
\]
The observed-data likelihood follows from (5.13). Let π_{i1}(β) = g^{-1}(x_{i1} β) and let φ_{ij}(Δ_j, γ) = h^{-1}(Δ_j(x_i) + y_{i,j-1} γ); then
\[
\begin{aligned}
p(y_{i1} \mid \theta_1) &= \{\pi_{i1}(\beta)\}^{y_{i1}} \{1 - \pi_{i1}(\beta)\}^{1 - y_{i1}}, \\
p(y_{ij} \mid y_{i,j-1}, \ldots, y_{i1}, \theta_{j \mid \{1,\ldots,j-1\}}) = p(y_{ij} \mid y_{i,j-1}, \theta_{j|j-1}) &= \{\phi_{ij}(\Delta_j, \gamma)\}^{y_{ij}} \{1 - \phi_{ij}(\Delta_j, \gamma)\}^{1 - y_{ij}}.
\end{aligned}
\]
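Computing Δ_j(x) in Example 5.8 is a one-dimensional root-finding problem: taking h = g = logit and using the first-order Markov property, the marginal constraint can be written π_j = φ_j(0)(1 − π_{j−1}) + φ_j(1) π_{j−1}. A sketch (function names ours):

```python
from scipy.optimize import brentq
from scipy.special import expit

def solve_delta(pi_prev, pi_j, gamma):
    """Solve for Delta_j(x) in the MTM(1) with logit links:
       pi_j = expit(Delta)(1 - pi_prev) + expit(Delta + gamma) * pi_prev."""
    f = lambda d: expit(d) * (1 - pi_prev) + expit(d + gamma) * pi_prev - pi_j
    return brentq(f, -30.0, 30.0)

delta = solve_delta(pi_prev=0.4, pi_j=0.5, gamma=1.2)
print(delta, expit(delta) * 0.6 + expit(delta + 1.2) * 0.4)  # recovers pi_j = 0.5
```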
Although MAR is needed for ignorability, not all MAR mechanisms lead to a model in which the missing data mechanism is ignorable. This is particularly true for mixture specifications, as the following example shows.

Example 5.9. Non-ignorability under MAR: mixture model. Again consider the normal model. Suppose
\[
\begin{aligned}
p(y_1, y_2 \mid r = 1, \alpha^{(1)}) &\sim N(\mu^{(1)}, \Sigma) \\
p(y_1, y_2 \mid r = 0, \alpha^{(0)}) &\sim N(\mu^{(0)}, \Sigma) \\
p(r \mid \phi) &\sim \mathrm{Ber}(\phi);
\end{aligned}
\]
that is, the full-data model is a mixture of two bivariate normals with different means but the same variance. The proportion of those with observed Y_2 is P(R = 1) = φ. Then the implied selection model is
\[
\log \left\{ \frac{P(R = 1 \mid y, \omega)}{P(R = 0 \mid y, \omega)} \right\} = \psi_0 + \psi_1 y_1 + \psi_2 y_2,
\]
where
\[
\begin{aligned}
\psi_0 &= \log \frac{\phi}{1 - \phi} + \tfrac{1}{2} \left[ \{\mu^{(0)}\}^T \Sigma^{-1} \mu^{(0)} - \{\mu^{(1)}\}^T \Sigma^{-1} \mu^{(1)} \right] \\
\psi_1 &= \frac{\Delta_1 \sigma_{22} - \Delta_2 \sigma_{12}}{\sigma_{11} \sigma_{22} - \sigma_{12}^2} \\
\psi_2 &= \frac{\Delta_2 \sigma_{11} - \Delta_1 \sigma_{12}}{\sigma_{11} \sigma_{22} - \sigma_{12}^2},
\end{aligned} \tag{5.14}
\]
with Δ_j = µ_j^{(1)} − µ_j^{(0)} denoting the between-pattern difference in means at measurement time j.

It can be shown that when the pattern-specific variances are equal, MAR corresponds to E(Y_2 | Y_1, R = 0, ω) = E(Y_2 | Y_1, R = 1, ω). Equating these conditional means gives
\[
\mu_2^{(0)} - \frac{\sigma_{12}}{\sigma_{11}} \mu_1^{(0)} + \frac{\sigma_{12}}{\sigma_{11}} Y_1
= \mu_2^{(1)} - \frac{\sigma_{12}}{\sigma_{11}} \mu_1^{(1)} + \frac{\sigma_{12}}{\sigma_{11}} Y_1,
\]
which reduces to Δ_2 = (σ_{12}/σ_{11}) Δ_1 and therefore implies ψ_2 = 0. However, it does not imply ψ_1 = 0; from (5.14) we see that ψ_1 = ψ_1(µ^{(0)}, µ^{(1)}, Σ), meaning the parameters of the missing data mechanism p(r | y) are not separable from those of the full-data response model p(y). Hence the mechanism is MAR but not ignorable.
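The closed-form coefficients (5.14) are simple to verify numerically, since (ψ_1, ψ_2) are just the components of Σ^{-1}Δ. A quick check against Bayes' rule at an arbitrary point (parameter values illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

mu1, mu0 = np.array([1.0, 1.5]), np.array([0.0, 0.5])
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])
phi = 0.7

# (5.14): slope coefficients psi = Sigma^{-1} Delta, plus the intercept
Si = np.linalg.inv(Sigma)
psi = Si @ (mu1 - mu0)
psi0 = np.log(phi / (1 - phi)) + 0.5 * (mu0 @ Si @ mu0 - mu1 @ Si @ mu1)

# Direct computation of the log odds of R = 1 at an arbitrary y
y = np.array([0.3, -1.1])
direct = np.log(phi * mvn.pdf(y, mu1, Sigma)) \
    - np.log((1 - phi) * mvn.pdf(y, mu0, Sigma))
print(direct, psi0 + psi @ y)   # the two agree
```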
5.8 Posterior inference with non-ignorable missing data mechanisms

The steps involved in drawing inference from incomplete data are specification of a full-data model, specification of the priors, and then sampling from the posterior distribution of the full-data parameters, given the observed data Y_obs, R and X. In Section 5.7, we showed how to construct the posterior distribution of the full-data response parameter θ when the missing data mechanism is ignorable. As we saw in the previous section, identification of a full-data response model, particularly the part involving Y_mis, requires making unidentifiable assumptions about the full-data model p(y, r | x, ω). In the previous section, we relied on the ignorability assumption to identify the full-data response model p(y | x).

When ignorability is believed not to be a suitable assumption, one can use a more general class of models that allow the missing data indicators to depend on the missing responses themselves. Essentially, these models allow one to parameterize the conditional dependence between R and Y_mis, given Y_obs and X. Clearly this association structure cannot be identified from observed data, for the simple reason that Y_mis cannot be observed. Inference therefore depends on some combination of (a) unverifiable parametric assumptions and (b) informative prior distributions. This can be viewed either as a strength or a weakness. The goal of this section is to illustrate, using some simple examples, the construction, identification, and practical utility of some commonly used nonignorable models.

Selection models

The SM approach assumes
\[
p(y, r \mid x, \omega) = p(y \mid x, \omega)\, p(r \mid y, x, \omega),
\]
so that the full-data response model p(y | x, ω) and the missing data mechanism p(r | y, x, ω) must be specified by the analyst. We have written the model so that both factors are indexed by a common parameter ω; it is common, but not necessary, to assume that ω can be decomposed as (θ, ψ), with separate parameters for each factor.

Selection models can be attractive for several reasons. First, the analyst is usually interested in the full-data response model p(y | x), which in the SM is specified directly. Second, the SM factorization appeals directly to the missing data taxonomy described in Section 5.4, so the assumptions about the missing data mechanism are fairly easy to characterize. Third, when the missing data are caused by dropout, the missing data mechanism can be formulated as a hazard function, where the hazard of dropout at some time t can depend on parts of the full-data vector Y. Under monotone
dropout, assumptions about the hazard function translate directly into the MCAR-MAR-MNAR taxonomy. Two potential downsides of selection models are their sensitivity to model specification and the sometimes opaque nature of their identifiability conditions. Several features of selection models can be illustrated using a simple example involving the bivariate normal distribution.

Example 5.10. Selection model for bivariate normal data. Consider a situation where Y_1 is always observed but Y_2 may be missing, and define R = I(Y_2 observed). For simplicity we assume no covariates. A selection model factors the full-data distribution as
\[
p(y_1, y_2, r \mid \omega) = p(y_1, y_2 \mid \theta)\, p(r \mid y_1, y_2, \psi);
\]
we have implicitly assumed here that ω = (θ, ψ); this is common but not necessary. Suppose we specify p(y_1, y_2 | θ) as a bivariate normal density N(µ, Σ). A typical choice for p(r | y_1, y_2, ψ) is a Bernoulli distribution with some regression structure for the mean; e.g.,
\[
[R \mid Y_1, Y_2] \sim \mathrm{Ber}(\pi(\psi)),
\]
where
\[
g(\pi) = \psi_0 + \psi_1 y_1 + \psi_2 y_2.
\]
Setting g(·) equal to the inverse normal cdf Φ^{-1}(·) yields the Heckman selection model (1976). Diggle and Kenward use the logit function on a discrete hazard in applying this model to longitudinal data with dropout that depends on the current, but possibly missing, Y.

An appeal of models like the one in Example 5.10 is that they reduce to well-known missing data mechanisms in special cases. For example, if ψ_2 = 0 we have MAR. If, in addition, p(µ, Σ, ψ) = p(µ, Σ) p(ψ), then we have ignorability.

For nonignorable models, the observed-data likelihood does not usually take a closed form. Setting
\[
\pi(y_1, y_2, \psi) = P(R = 1 \mid y_1, y_2, \psi) = g^{-1}(\psi_0 + \psi_1 y_1 + \psi_2 y_2),
\]
the observed-data likelihood contribution can be written as
\[
\begin{aligned}
L_{\mathrm{obs},i}(\mu, \Sigma, \psi \mid y_{i,\mathrm{obs}}, r_i) \propto\;& \{p(y_{i1}, y_{i2} \mid \mu, \Sigma)\, \pi(y_{i1}, y_{i2}, \psi)\}^{r_i} \\
&\times \left[ \int p(y_{i1}, y_2 \mid \mu, \Sigma)\, \{1 - \pi(y_{i1}, y_2, \psi)\}\, dy_2 \right]^{(1 - r_i)}.
\end{aligned}
\]
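For a logit g(·), the integral in the (1 − r_i) factor has no closed form, but it is one-dimensional and easy to evaluate numerically. A sketch of the likelihood contribution for a subject with missing Y_2 (values illustrative):

```python
import numpy as np
from scipy import integrate, stats
from scipy.special import expit

def L_obs_dropout(y1, mu, Sigma, psi):
    """Observed-data likelihood contribution when r_i = 0 in Example 5.10:
    integral over y2 of p(y1, y2 | mu, Sigma) * {1 - pi(y1, y2, psi)}."""
    f = lambda y2: stats.multivariate_normal.pdf([y1, y2], mu, Sigma) \
        * (1.0 - expit(psi[0] + psi[1] * y1 + psi[2] * y2))
    val, _ = integrate.quad(f, -np.inf, np.inf)
    return val

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
print(L_obs_dropout(0.2, mu, Sigma, psi=(0.5, 0.3, -0.7)))
```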
For common choices of g(·), such as probit and logit, the observed-data likelihood generally will be a function of the parameter ψ_2 that governs the conditional association between R and Y_mis given Y_obs. Consequently, identification of this parameter is purely a function of the parametric model specification for the full-data response. Moreover, although the parameter ψ_2 does not index the full-data response model p(y_1, y_2 | µ, Σ), it does index the joint distribution of the observables Y_obs and R. This can be seen by writing
\[
p(y_{\mathrm{obs}}, r) = \int p(y_{\mathrm{obs}}, y_{\mathrm{mis}} \mid \mu, \Sigma)\, p(r \mid y_{\mathrm{obs}}, y_{\mathrm{mis}}, \psi)\, dy_{\mathrm{mis}}
= \text{function of } (\mu, \Sigma, \psi).
\]
This can become problematic for sensitivity analyses that rely on refitting the model at various fixed values of the nonignorability parameter ψ_2, because different choices of ψ_2 will yield different models for the observable data. Ideally, parameters used for a sensitivity analysis should only change the distribution of missing responses and not affect fit to the observables (cf. Scharfstein et al., 1999). This will be discussed in more detail in Chapter 8.

To summarize, selection models represent a natural generalization from ignorability to nonignorability, obtained by expanding p(r | y_obs, y_mis) to depend on y_mis. The structure of selection models appeals directly to the MAR-MNAR taxonomy developed by Rubin, and in some cases the models can be fit using standard software. On the other hand, identification of the missing data distribution is accomplished primarily through parametric modeling assumptions. For the simple example where the full-data response model is bivariate normal and the selection model is linear in y, the parameter ψ_2 indexing the association between R and the partially observed Y_2 is fully identifiable from observed data because it appears in the observed-data likelihood. This feature of the model places considerable importance on assumptions that cannot be verified, and has the potential to make sensitivity analysis problematic.

Mixture models

Mixture models represent another way to factor the full-data model that allows R to depend on Y_mis. In contrast to selection models, the MM approach uses the factorization
\[
p(y, r \mid x, \omega) = p(y \mid r, x, \omega)\, p(r \mid x, \omega). \tag{5.15}
\]
As with the selection model, it is sometimes useful to partition the parameter space as ω = (α, φ), where α and φ respectively index the two factors in (5.15). (Recall that the separable-parameters condition required for ignorability implicitly relies on a selection model factorization
of the full-data model, so the partition ω = (α, φ) is not equivalent.) The full-data response model is a mixture,
\[
p(y \mid x, \alpha, \phi) = \int p(y \mid r, x, \alpha)\, p(r \mid x, \phi)\, dr.
\]
The missing data mechanism can be derived using Bayes' rule,
\[
p(r \mid y, x, \alpha, \phi) = \frac{p(y \mid r, x, \alpha)\, p(r \mid x, \phi)}{\int p(y \mid r, x, \alpha)\, p(r \mid x, \phi)\, dr}.
\]
Despite its seeming intractability, in some circumstances the missing data mechanism implied by a MM actually takes a closed form (recall Example 5.9).

From the point of view of modeling, mixture models treat dropout or missingness as a source of variation in the full-data distribution. Specifying a different distribution for each dropout time or missing data pattern may seem cumbersome, but it frequently has the advantage of making explicit the parameters that cannot be identified by observed data. This can be seen as follows (leaving aside covariates for the time being). Recall the MM specification is
\[
p(y_{\mathrm{obs}}, y_{\mathrm{mis}}, r \mid \omega) = p(y_{\mathrm{mis}}, y_{\mathrm{obs}} \mid r, \alpha)\, p(r \mid \phi).
\]
Further decompose the first term on the right-hand side as
\[
p(y_{\mathrm{obs}}, y_{\mathrm{mis}} \mid r, \alpha) = p(y_{\mathrm{mis}} \mid y_{\mathrm{obs}}, r, \lambda_1)\, p(y_{\mathrm{obs}} \mid r, \lambda_2), \tag{5.16}
\]
where α = λ_1 ∪ λ_2 (i.e., the parameters λ_1 and λ_2 may overlap). The first component on the right-hand side is the conditional distribution of missing responses given observed data (i.e., the model extrapolation); the second component gives the distribution of the observables. A parameter τ that satisfies
\[
\tau \in \lambda_1 \quad \text{and} \quad \tau \notin \lambda_2 \tag{5.17}
\]
will not appear in the observed-data likelihood; it therefore indexes the conditional distribution of missing responses but not that of the observed ones, and will be informed solely by prior distributions. Parameters satisfying (5.17) are frequently easy to find in MM factorizations, and more difficult to find in SM factorizations.

A potential downside of mixture models relates to their implementation; it is often not obvious whether and how to constrain the parameters, and it is sometimes difficult to characterize the missing data mechanisms using the Rubin taxonomy. We have already seen (Example 5.9) that a MM factorization can satisfy MAR but not ignorability. To illustrate the use of mixture models for non-MAR missingness, we return to Example ??.
Example 5.11. Mixture of bivariate normals with missingness in Y_2. We generalize the specification of Example 5.9 to allow separate variance matrices by pattern of missingness; specifically, let
\[
\begin{aligned}
p(y_1, y_2 \mid r = 1, \alpha^{(1)}) &\sim N(\mu^{(1)}, \Sigma^{(1)}) \\
p(y_1, y_2 \mid r = 0, \alpha^{(0)}) &\sim N(\mu^{(0)}, \Sigma^{(0)}) \\
p(r \mid \phi) &\sim \mathrm{Ber}(\phi);
\end{aligned}
\]
in this setup, for pattern r, the parameters are {µ_1^{(r)}, µ_2^{(r)}, σ_{11}^{(r)}, σ_{22}^{(r)}, σ_{12}^{(r)}}. For pattern r, define β_{0(2·1)}^{(r)}, β_{1(2·1)}^{(r)}, and σ_{2·1}^{(r)} to be, respectively, the intercept, slope, and residual variance for the regression of Y_2 on Y_1. This allows the model parameters to be written as
\[
\alpha^{(r)} = \left\{ \mu_1^{(r)}, \sigma_{11}^{(r)}, \beta_{0(2\cdot1)}^{(r)}, \beta_{1(2\cdot1)}^{(r)}, \sigma_{2\cdot1}^{(r)} \right\}.
\]
The non-identified parameters of the full-data model are those from the conditional distribution of Y_2 given Y_1 among those with R = 0; we collect them into
\[
\tau = \left( \beta_{0(2\cdot1)}^{(0)}, \beta_{1(2\cdot1)}^{(0)}, \sigma_{2\cdot1}^{(0)} \right)^T.
\]
To simplify notation, let p_r(·) denote p(· | R = r). Using this representation, we can write the full-data model as
\[
p(y_1, y_2, r \mid \omega) = \left\{ (1 - \phi)\, p_0(y_1, y_2 \mid \alpha^{(0)}) \right\}^{(1-r)} \left\{ \phi\, p_1(y_1, y_2 \mid \alpha^{(1)}) \right\}^{r}.
\]
We now factor p_0(y_1, y_2 | α^{(0)}) to emphasize the distinction between identifiable and nonidentifiable parameters,
\[
p_0(y_1, y_2 \mid \alpha^{(0)}) = p_0(y_2 \mid y_1, \tau)\, p_0(y_1 \mid \mu_1^{(0)}, \sigma_{11}^{(0)}).
\]
Referring now to (5.16), we have
\[
p(y_1, y_2 \mid r, \alpha) = \{p_0(y_2 \mid y_1, \tau)\}^{1-r} \left\{ p_0(y_1 \mid \mu_1^{(0)}, \sigma_{11}^{(0)}) \right\}^{1-r} \{p_1(y_1, y_2 \mid \alpha^{(1)})\}^{r},
\]
so that
\[
p(y_{\mathrm{mis}} \mid y_{\mathrm{obs}}, r, \lambda_1) = \{p_0(y_2 \mid y_1, \tau)\}^{1-r},
\]
with λ_1 = τ, and
\[
p(y_{\mathrm{obs}} \mid r, \lambda_2) = \left\{ p_0(y_1 \mid \mu_1^{(0)}, \sigma_{11}^{(0)}) \right\}^{1-r} \{p_1(y_1, y_2 \mid \alpha^{(1)})\}^{r},
\]
with λ_2 = (µ_1^{(0)}, σ_{11}^{(0)}, α^{(1)}). The parameter τ therefore satisfies (5.17); it is easy to show that the observed-data likelihood contribution for
individual i is
\[
L_{\mathrm{obs},i}(\alpha^{(1)}, \mu_1^{(0)}, \sigma_{11}^{(0)} \mid y_{i,\mathrm{obs}}, r_i)
= \left\{ (1 - \phi)\, p_0(y_{i1} \mid \mu_1^{(0)}, \sigma_{11}^{(0)}) \right\}^{1 - r_i} \left\{ \phi\, p_1(y_{i1}, y_{i2} \mid \alpha^{(1)}) \right\}^{r_i}.
\]
Compared to the selection model for bivariate response in Example 5.10, the observed-data likelihood is completely free of the parameters τ indexing the distribution of missing responses. For the purposes of posterior inference, information about τ will derive wholly from prior distributions.

To summarize, mixture models have the advantage that, in many cases, one can find full-data parameters indexing the distribution of missing responses that are not identified by observed data. In one sense, inferences are more honest; the use of informative priors for parameters like τ, and the associated sensitivity analyses, are the subject of Chapter 9.
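Operationally, (5.17) means the analyst supplies τ from outside the observed data. The sketch below draws the missing Y_2 in Example 5.11 given a prior for τ centered at the identified completer-pattern quantities; the offset and spread of that prior are purely illustrative assumptions, of exactly the kind Chapter 9 elicits formally.

```python
import numpy as np

def extrapolate_y2(y1_mis, b0_c, b1_c, sd_c, rng):
    """One draw of Y2 for the r = 0 pattern in the pattern mixture model.

    tau = (b0, b1, sd) for p0(y2 | y1) is not identified by the observed
    data; here the prior centers it at the completer-pattern values,
    shifting the intercept to encode a belief that dropouts tend lower."""
    b0 = rng.normal(b0_c - 0.5, 0.25)   # informative prior (illustrative)
    b1, sd = b1_c, sd_c                 # priors degenerate at completer values
    return b0 + b1 * y1_mis + rng.normal(0.0, sd, size=np.size(y1_mis))
```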
Shared parameter models

A third approach is to use an explicitly multilevel formulation, where random effects η are modeled jointly with Y and R. In some cases the random effects can be used to explain the association or to capture multiple sources of variation. Key papers on this topic include Wu and Carroll (1988) and Ten Have et al. (19xx). DeGruttola and Tu (1994), Wulfsohn and Tsiatis (1997), and Faucett and Thomas (199x) use similar formulations, but with the objective of modeling a survival process as a function of stochastic longitudinal covariates.

The general form of the full-data model under a shared parameter approach is
\[
p(y, r \mid x, \omega) = \int p(y, r, \eta \mid x, \omega)\, d\eta. \tag{5.18}
\]
Specific shared parameter models are formulated by making assumptions about the joint distribution under the integral sign. An early example is the random coefficients selection model of Wu and Carroll (1988). Notice in (5.18) that the full-data parameter ω also includes parameters indexing the distribution of the random effects.

Advantages of the SPM include simplified specification of the response and missingness components. When Y is measured with error, SPMs provide a useful way for missingness to depend on the error-free version of Y, which is represented by random effects (this is a type of random-coefficient-dependent missingness; see Little, 1995). Another potential advantage is that, through the use of random effects, SPMs can be used to handle high-dimensional or multilevel response data. A disadvantage is that, except in simple settings, the underlying missing data mechanism can be difficult to understand and may not even have a closed form: the representation of p(r | y, ω) requires integration over the random effects.

Example 5.12. Random coefficients selection model. Wu and Carroll (1988) are concerned with estimating the population slope of longitudinal lung function measurements using data from a study where the dropout proportion was appreciable. They assume the lung function measures follow a linear random effects model
\[
[Y_i \mid X_i, \eta_i] \sim N(X_i \beta + W_i \eta_i,\; \Sigma_i(\phi)),
\]
where W_i ⊂ X_i are the random effects covariates, with rows w_{ij} = (1, t_{ij}) (hence each individual has a random intercept and slope). The random effects η_i = (η_{1i}, η_{2i})^T are assumed to follow a bivariate normal distribution
\[
[\eta_i \mid X_i] \sim N(\alpha, \Omega).
\]
For the dropout process, they assume
\[
[R_{ij} \mid R_{i,j-1} = 1, \eta_i] \sim \mathrm{Ber}(\pi_{ij}),
\]
where the hazard of dropout depends on the random effects governing Y_i via
\[
g(\pi_{ij}) = \psi_0 + \psi_1 \eta_{1i} + \psi_2 \eta_{2i}.
\]
The model is seen as a special case of (5.18), where the joint distribution under the integral is factored as
\[
\begin{aligned}
p(y, r, \eta \mid x, \omega) &= p(r \mid y, \eta, x, \psi)\, p(y \mid \eta, x, \beta, \phi)\, p(\eta \mid x, \alpha, \Omega) \\
&= p(r \mid \eta, \psi)\, p(y \mid \eta, x, \beta, \phi)\, p(\eta \mid \alpha, \Omega),
\end{aligned}
\]
with ω = (β, φ, α, Ω, ψ). The model makes the key assumption that dropout R is independent of both Y_obs and Y_mis, conditionally on the random effects. However, integrating over the random effects induces dependence, and hence the model can be used for MNAR mechanisms.
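A short simulation makes the mechanics of Example 5.12 concrete: dropout depends on the subject-specific intercept and slope, so completers are a biased sample of the random slopes even though R never involves Y directly. All parameter values below are illustrative.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(2)
n, J = 2000, 5
t = np.arange(J)

# Random intercepts and slopes: eta_i ~ N(alpha, Omega)
alpha = np.array([0.0, -0.1])
Omega = np.array([[1.0, 0.1],
                  [0.1, 0.04]])
eta = rng.multivariate_normal(alpha, Omega, size=n)

# Response: Y_ij = eta_1i + eta_2i * t_j + measurement error
y = eta[:, [0]] + eta[:, [1]] * t + rng.normal(0.0, 0.5, (n, J))

# Dropout hazard depends only on the random effects (shared parameters)
psi0, psi1, psi2 = -2.0, 0.3, -3.0
haz = expit(psi0 + psi1 * eta[:, 0] + psi2 * eta[:, 1])

r = np.ones((n, J), dtype=int)
for j in range(1, J):
    drop = (r[:, j - 1] == 1) & (rng.random(n) < haz)
    r[:, j:][drop] = 0          # monotone dropout: once out, always out

# Steep decliners drop out sooner, so completers' slopes are biased upward
complete = r[:, -1] == 1
print("mean slope, all:", eta[:, 1].mean().round(3),
      " completers:", eta[complete, 1].mean().round(3))
```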
The conditional linear model (Wu and Bailey, 1988; Hogan and Laird, 1997) also can be viewed as a version of an SPM, though it also has a representation as a standard pattern mixture model. The conditional linear model specializes (5.18) to
\[
p(y, r, \eta \mid x) = p(y \mid r, \eta, x)\, p(\eta \mid r, x)\, p(r \mid x).
\]
As an example, Hogan and Laird (1997) assume that the mixture components p(y | r, η, x), r ∈ R, in the first factor follow random effects models where the regression coefficients β depend on r, and the random effects have a distribution that does not depend on r; i.e., p(η | r, x) = p(η | x). Hogan, Lin and Herman (2004) generalize the model to allow continuous dropout time. For the case where p(y | r, η, x) and p(η | r, x) follow normal distributions, integrating over the random effects yields a standard pattern mixture model, where the full-data distribution is a mixture of normal distributions (as in Example 5.11).

Finally, the SPM framework is useful for understanding the assumptions used in fitting standard random effects models to incomplete data. These models (e.g., the logit-normal random effects model for longitudinal binary data) can be seen as specializations of (5.18), where the full-data model is factored as
\[
p(y, r, \eta \mid x) = p(r \mid y, \eta, x)\, p(y \mid \eta, x)\, p(\eta \mid x). \tag{5.19}
\]
Fitting a standard random effects model to incomplete data generally involves specifying p(y | η, x) and p(η | x), ignoring the first term in (5.19). This approach will provide valid inference for the full-data parameters if ignorability holds. MAR requires independence between R and (Y_mis, η), conditionally on (Y_obs, X); i.e., p(r | y, η) = p(r | y_obs). If, in addition, the parameters indexing p(r | y, η) are separate from, and a priori independent of, those indexing p(y, η | x), we have ignorability.

5.9 Summary

5.10 Further reading

1. Nonmonotone patterns of missingness
2. Multiple-cause dropout
3. Combinations of dropout and nonmonotone missingness
4. Death as a reason for dropout
5. Use of auxiliary covariates
6. Dropout for continuous-time or stochastic processes. May have to define a continuous-time process that marks observation times, and hence this leads to "observed at random". See Lipsitz et al.; Diggle (discussion of Copas). For mixture formulations, Hogan, Lin and Herman.
CHAPTER 6

Inference about full data under ignorability

In this chapter, we discuss inference about the full data of interest under the assumption of ignorability. Recall that the full-data model p(y, r | x, ω) can be factored as
\[
p(y \mid x, \theta(\omega))\, p(r \mid y, x, \psi(\omega)).
\]
Under an assumption of missing at random (MAR) (Definition 5.7), the missing data mechanism takes the form
\[
p(r \mid y, x, \psi(\omega)) = p(r \mid y_{\mathrm{obs}}, x, \psi(\omega)). \tag{6.1}
\]
Ignorability (recall Definition 5.10) arises under the additional restriction of a priori independence between the parameters of the full-data response model, θ, and the parameters of the missing data mechanism in (6.1), ψ. In the following we suppress the dependence on x without loss of clarity.

Under these assumptions, proper inference depends only on correctly specifying the full-data response model p(y | θ); the corresponding posterior distribution of interest is then p(θ | y). However, since we observe only y_obs, not y_mis, inference is based on the observed-data posterior,
\[
p(\theta \mid y_{\mathrm{obs}}) = \int p(\theta \mid y)\, p(y_{\mathrm{mis}} \mid y_{\mathrm{obs}})\, dy_{\mathrm{mis}}. \tag{6.2}
\]
Equation (6.2) is the expectation of the full-data response model posterior p(θ | y) with respect to the posterior predictive distribution of y_mis, p(y_mis | y_obs). The observed-data posterior can be sampled by supplementing the Gibbs sampler with a data augmentation step that samples from the posterior predictive distribution p(y_mis | y_obs, θ). By doing this, sampling of the full-data response model parameters proceeds essentially as if there were no missing data. This will often provide considerable simplification of computations. We provide more details on this in Section 6.2.
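To fix ideas, here is a minimal data augmentation Gibbs sampler for the bivariate normal with Y_2 missing at random. The prior choices (flat on µ, inverse-Wishart on Σ) are illustrative; steps (i) and (ii) correspond to sampling from p(y_mis | y_obs, θ) and then updating θ as if the data were complete.

```python
import numpy as np
from scipy.stats import invwishart

def gibbs_da(y1, y2, r, n_iter=2000, seed=3):
    """Data augmentation for (Y1, Y2) ~ N(mu, Sigma) with Y2 MAR-missing."""
    rng = np.random.default_rng(seed)
    n = y1.size
    y = np.column_stack([y1, np.where(r == 1, y2, y1)])  # crude starting fill-in
    mu, Sigma = y.mean(0), np.cov(y.T)
    draws = []
    for _ in range(n_iter):
        # (i) impute: y2 | y1, theta ~ N(mu2 + b (y1 - mu1), s22 - b * s12)
        b = Sigma[0, 1] / Sigma[0, 0]
        cm = mu[1] + b * (y[r == 0, 0] - mu[0])
        cv = Sigma[1, 1] - b * Sigma[0, 1]
        y[r == 0, 1] = cm + rng.normal(0.0, np.sqrt(cv), (r == 0).sum())
        # (ii) complete-data updates: Sigma | mu, y then mu | Sigma, y
        S = (y - mu).T @ (y - mu)
        Sigma = invwishart.rvs(df=3 + n, scale=np.eye(2) + S, random_state=rng)
        mu = rng.multivariate_normal(y.mean(0), Sigma / n)
        draws.append(mu.copy())
    return np.asarray(draws)
```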
This chapter will focus on the key components of the analysis of the parameters of the full-data response model under an assumption of ignorability. First, issues of model specification will be discussed, including the importance of correctly specifying the dependence structure; it is well known that in full-data settings, specification of the dependence structure is often not required to consistently estimate the mean parameters. The convenient computational device of data augmentation will be described in the setting of missing data, with its implementation derived for specific models. Given the importance of specifying the dependence structure, we will discuss approaches to modeling the dependence via parsimonious structure and/or by allowing the dependence to differ by model covariates. For settings with several longitudinal processes of interest, or one process of interest together with stochastic time-varying auxiliary covariates (recall Section 5.4.4), several joint models will be described. The chapter closes with some suggested approaches for assessing model fit and model checking, and with a continuation of some examples first analyzed in Chapter 4. Throughout the chapter, we may occasionally refer to the ignorability condition as MAR without any loss of clarity.

6.1 General issues in model specification

Inference under ignorability relies on correct specification of the full-data response model p(y | θ) (i.e., the full-data response likelihood L(θ | y)). This can be thought of in terms of several components: mean, dependence, and distribution. By distribution, we are referring to the distribution of the random effects (see, e.g., Examples 2.1, 2.2, and 2.7) or of the residuals (see, e.g., Examples 2.3, 2.4, and 2.6); for more details, see the Further reading section. The following development will focus on (mis-)specification of the dependence.

The importance of correct specification of the mean for estimating the mean parameters is well known (Liang and Zeger, 1986). In full-data settings the dependence is often viewed as a nuisance, and correctly modeling it leads only to gains in efficiency. However, when dealing with incomplete data, correctly specifying the dependence is of utmost importance. To illustrate the consequences of mis-specification of the dependence under ignorability, we go through a simple bivariate normal example (refer to Example 2.3 for notation) with Y_{i2} potentially missing. Recall that we assume Y_i ∼ N(µ, Σ), with the (j, k) element of Σ denoted by σ_{jk}.
To fill in Y_i2 within the sampling algorithm, we sample from the distribution p(y_i2 | y_i1, θ), where θ = (µ, Σ); details on this sampling are given in Section 6.2. The mean of this distribution, E[Y_i2 | y_i1, θ] = µ_2 + φ_21 (y_i1 − µ_1), is a function of the dependence parameter φ_21 = σ_21/σ_11, and this impacts the posterior distribution of µ_2; see Example 6.1. With complete data, the data augmentation step is unnecessary and the posterior distribution of µ_2 is not a function of φ_21.

The simplest mis-specification would be to assume φ_21 = 0 when φ_21 ≠ 0. A less obvious, but important, mis-specification arises when φ_21 = h(x), i.e., φ_21 is a function of model covariates, but it is assumed constant across the model covariates.

In most longitudinal settings, T > 2, which introduces more dependence parameters. For example, in a multivariate normal model, when Σ is large, a common strategy is to assume a simpler, parsimonious form for Σ; we might (incorrectly) assume an AR-1 form. Or we might introduce structure via random effects and assume the wrong model for the random effects covariance matrix (Daniels and Zhao, 2003). Several of these cases are explored in detail below. We start by examining the consistency of the posterior distribution of the mean when the dependence is mis-specified.

Example 6.1. (cont. of Example 2.3) Posterior distribution of the mean for a bivariate normal under monotone missingness that is ignorable

Consider a bivariate response from a randomized controlled trial with binary treatment x. Assume the response follows a bivariate normal distribution

Y_i | µ, Σ, x ~ N(µ(x), Σ(x))   (6.3)

in which Y_i2 is potentially missing and the covariance matrix Σ(x) is a function of treatment; denote the (j,k)th element of Σ(x) by σ_jk(x). Let σ*_22 = σ_22(x) − σ_12(x)ᵀ σ_11⁻¹ σ_12(x) and, without loss of generality, assume σ_11(x) = σ_11. Assume that for the first n_1 subjects we observe (y_i1, y_i2), while for the last n − n_1 subjects we observe only y_i1; also, without loss of generality, assume equal prevalence of the binary covariate x in the cohort. Finally, define µ_j^c(x) = E[Y_ij | x, R_i = 1] and µ_j^m(x) = E[Y_ij | x, R_i = 0], j = 1, 2.

We will examine the posterior distribution of µ_2(x) and show that, under mis-specification of the dependence via φ_21(x), the posterior distribution of µ_2(x) is not consistent. By mis-specification here, we focus on incorrectly assuming that φ_21(x) = φ_21.

Recall that φ_21(x) = σ_12(x)ᵀ σ_11⁻¹ is the slope from regressing y_2 on y_1. Define β̂ = (β̂_0, β̂_1)ᵀ = (WᵀW)⁻¹ Wᵀ y_2, where W is the n_1 × 2 design matrix with ith row (1, y_i1)   (6.4)

and y_2 = (y_12, ..., y_{n_1,2})ᵀ; under the correctly specified model, this regression is fit separately within each treatment arm.

Consider the conditional posterior mean of µ_2(x) given (y_obs, µ_1^c(x)); we do not need to condition on µ_1^c(x) here, but it makes the development easier to follow. We examine the behavior of this posterior mean as n → ∞; in particular, we examine whether the posterior distribution of µ_2(x) is consistent. It can be shown that E[µ_2(x) | y_obs, µ_1^c] = β̂_0 + β̂_1 ȳ_1(x), where ȳ_1(x) is the sample mean of y_1 over all subjects with treatment x. Under ignorability,

E[µ_2(x) | y_obs, µ_1^c] → µ_2^c(x) − φ_21(x) µ_1^c(x) + φ_21(x) µ_1(x),   (6.5)

and it can be shown that

µ_2^c(x) − φ_21(x) µ_1^c(x) = µ_2^m(x) − φ_21(x) µ_1^m(x)   (6.6)
  = µ_2(x) − φ_21(x) µ_1(x).   (6.7)

Thus, (6.5) is equal to µ_2(x). However, under mis-specification, φ_21(x) = φ_21, and assuming P(X = 1) = 1/2, the expression in (6.5) is equal to

µ_2^c(x) − φ_21 µ_1^c(x) + φ_21 µ_1(x) = µ_2(x) + (µ_1(x) − µ_1^c(x))[αx − α(1 − x)],   (6.8)

where φ_21(x) = φ_21 + αx − α(1 − x). So, under mis-specification of the dependence structure, the posterior distribution of µ_2(x) will be inconsistent, with bias (µ_1(x) − µ_1^c(x))[αx − α(1 − x)]. Clearly, in addition to being a function of the dependence parameter φ_21(x), the bias is a function of the difference between the completer and dropout means at time 1. If the means at time 1 are identical, there is no bias from mis-specifying the dependence (φ_21), since identical means at time 1 imply that the means at time 2 will also be identical; this is obvious in a multivariate normal from the fact that if the conditional distribution of y_2 given y_1 is the same for completers and dropouts and the marginal means of y_1 are the same, then the marginal means of y_2 will be the same.
We also point out that under MCAR or with no missing data, E[µ_2(x) | y_obs, µ_1^c] = ȳ_2(x) → µ_2(x), where ȳ_2(x) = (1/n_x) ∑_{i: x_i = x} y_i2.

This example illustrates that even in the simplest case, a bivariate normal with monotone missingness that is ignorable, mis-specification of the dependence, here parameterized as φ_21, results in an inconsistent posterior for the mean parameters. These results readily generalize to higher-dimensional multivariate normal models and to the other models discussed in Chapter 2.

6.1.1 Orthogonal parameters

In the previous section, we examined the consequences of mis-specifying the dependence under ignorability. Another way to understand the increased importance of correctly modelling the dependence is to compare the form of the information matrix for the mean and dependence parameters under complete versus incomplete data (under ignorability). Before going into the details, we first introduce the concept of orthogonal parameters. For illustration in the following, let β represent the mean parameters and α the dependence parameters.

Definition 6.1. Orthogonal parameters. Parameters are said to be orthogonal if the corresponding components of the (expected) information matrix are 0 (Cox and Reid, 1987). So, for example, if for any component β_j of β and any component α_k of α,

I_{β_j, α_k}(β_0, α_0) = −E[ d²L(β, α) / (dβ_j dα_k) ] = 0,   (6.9)

where β_0 and α_0 are the true values of β and α, respectively, then β and α are orthogonal.

Block diagonality of the information matrix implies that the MLEs of orthogonal parameters are asymptotically independent, and that the asymptotic standard errors (based on the information matrix) are the same whether the other parameters are estimated or assumed known (Cox and Reid, 1987).

In the Bayesian paradigm, in addition to the likelihood based on the full-data response model, we also have the prior. Typically, the prior will be O_p(1), whereas the log likelihood is O_p(n). We now state a definition analogous to Definition 6.1 for the Bayesian setting.
Definition 6.2. Orthogonal parameters in Bayesian inference. Under an assumption of independent priors on β and α, p(β, α) = p(β) p(α), and/or a prior p(β, α) that is O_p(1), we define parameters to be orthogonal if

I*_{β_j, α_k}(β_0, α_0) = −E[ d²{L(β, α) + log p(β, α)} / (dβ_j dα_k) ] = 0.

Under a priori independence, this term is exactly zero; with an O_p(1) prior but a priori dependence, the result holds only asymptotically. Definition 6.2 implies that, asymptotically, the joint posterior distribution of (β, α) is equal to the product of the respective marginal posteriors. For non- or weakly identified parameters, a priori independence is required.

Orthogonality of the mean and dependence parameters is a necessary condition for the posterior distribution of the mean parameters to be consistent when the dependence is mis-specified. Sufficiency, however, requires the stronger condition given in the following proposition.

Proposition 2. Sufficient condition for consistent estimation of the mean parameters. The following condition, stronger than the orthogonality conditions given in Definitions 6.1 and 6.2, is

I_{β,α}(β_0, α*) = −E[ d²L(β, α) / (dβ dα) ] = 0,   (6.10)

where β_0 is the true value of the mean parameters and α* is any value of the dependence parameters (Firth, 1987). The analogous condition for Bayesian inference replaces L(β, α) with L(β, α) + log p(β, α).

In the frequentist setting, the condition implies that the score function for the mean parameters, dL(β, α)/dβ, is an unbiased estimating equation for β for all values of α, not just the true value α_0. In the Bayesian setting, it corresponds to the posterior distribution of β being consistent even if the dependence model, parameterized by α, is mis-specified.

Example 6.2. (cont. of Example 2.3) Information matrix based on the observed-data log likelihood under ignorability, for a multivariate normal model with temporally aligned observations

Let θ = (β, Σ) and assume that p(β, Σ) = p(β) p(Σ) (a common assumption). For complete data, it is easy to show that the off-diagonal block of the information matrix, I_{β,Σ}, is equal to zero for all values of Σ, thus satisfying condition (6.10). So, when doing Bayesian inference, the posterior for β will be consistent even under mis-specification of Σ. Under MAR, however, the corresponding submatrix of the information matrix, now based on the observed-data log likelihood (or observed-data posterior),

I^obs_{β,Σ}(·, ·) = −E[ d²L_obs(β, Σ) / (dβ dΣᵀ) ],   (6.11)

is no longer equal to zero, even at the true value of Σ (Little and Rubin, 2002), so the weaker parameter orthogonality condition, Definition 6.2, does not even hold. As a result, for the posterior distribution of the mean parameters to be consistent, the dependence structure must be correctly modelled.

This lack of orthogonality can be seen in the setting of a bivariate normal by making a simple analogy to univariate simple linear regression; the analogy also provides additional intuition into how inferences change under missingness. Suppose E[Y_i] = µ = (µ_1, µ_2), with R_i2 = 1 for i = 1, ..., n_1 and R_i2 = 0 for i = n_1 + 1, ..., n. As in Chapter 4, we factor the joint distribution as p(y_1, y_2) = p(y_1) p(y_2 | y_1). The conditional distribution of Y_2 given y_1 (ignoring priors for the time being), as a function of µ_2 and φ_21, is proportional to

exp( −∑_{i=1}^{n} (y_i2 − µ_2 − φ_21(y_i1 − µ_1))² / (2σ*_22) ),   (6.12)

where σ*_22 = σ_22 − σ_21 σ_11⁻¹ σ_12. Note that φ_21 and µ_2 do not appear in p(y_1). The orthogonality of µ_2 and φ_21 is apparent upon recognizing (6.12) as having the same form as the likelihood for a simple linear regression with centered covariate y_i1 − µ_1, intercept µ_2, and slope φ_21. It can also easily be shown from this form that the element of the (expected) information matrix corresponding to µ_2 and φ_21 is zero for all values of φ_21.

Under MAR, however, the analogue of (6.12) is

exp( −∑_{i=1}^{n_1} (y_i2 − µ_2^c − φ_21(y_i1 − µ_1^c))² / (2σ*_22) ).   (6.13)

The sum is now only over the terms corresponding to R_i2 = 1; µ_j^c is defined as in Example 6.1. Again using the analogy to simple linear regression, µ_2^c and φ_21 are orthogonal, but µ_2 and φ_21 are not, since

µ_2 = µ_2^c π + µ_2^m (1 − π) = µ_2^c − φ_21 (1 − π)(µ_1^c − µ_1^m),

where π = P(R_i2 = 1). Thus, µ_2 is a function of φ_21 under MAR; note that with no missing data, π = 1, and under MCAR, µ_1^c − µ_1^m = 0, so in either case the second term involving φ_21 disappears. So, since E[y_i1 − µ_1^c] is not zero under MAR, the element of the (expected) information matrix corresponding to µ_2 and φ_21 is no longer zero for all values of φ_21, and the mean and dependence parameters are no longer orthogonal.

This example shows that for multivariate normal models with incomplete (MAR) data, whether temporally aligned or misaligned (Examples 2.3 and 2.6), Σ needs to be correctly specified. Thus, if we model the covariance matrix parsimoniously, we must be sure to use the correct structure, and a necessary step in model specification is to determine whether Σ depends on covariates. Such modelling decisions are verifiable from the data under the ignorability assumption.

Similar results hold for directly specified models for binary data. It can be shown that the information matrix for the mean parameters β and the dependence parameters α in an MTM(1), Example 2.5, satisfies condition (6.10) under no missing data or MCAR (Heagerty, 2002; Barndorff-Nielsen and Cox, 1994), but not under MAR. There are also situations where, even with full data, mis-specification of the dependence can lead to biased estimates of the mean parameters. For example, in an MTM(p) with p ≥ 2, β and α are not orthogonal (Heagerty, 2002); in this case, the proper order of the MTM, and whether the associated dependence parameters α depend on covariates, must be determined even with full data. Similar specification issues with complete data are also seen in indirectly specified models for binary data.

In this section, we have shown why the dependence structure cannot be treated as a nuisance when data are incomplete, even if interest is only in the mean parameters; a short numerical illustration follows. In the next section, we review a modification, for incomplete data, of the posterior sampling algorithms originally introduced in the context of complete data.
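To make the inconsistency in Example 6.1 concrete, the following R sketch (our own illustration; all data-generating values are hypothetical) simulates a two-arm trial in which φ_21(x) differs by arm and dropout of y_2 is MAR, depending on y_1 only. It compares the large-sample estimate of µ_2(x) from an arm-specific regression of y_2 on y_1 (correct dependence) with one from a common-slope regression (mis-specified dependence).

```r
## Numerical check of Example 6.1 (all values hypothetical).
set.seed(1)
n   <- 1e5
x   <- rbinom(n, 1, 0.5)               # binary treatment
mu1 <- ifelse(x == 1, 1, 0)            # true time-1 means
mu2 <- ifelse(x == 1, 2, 1)            # true time-2 means (the targets)
phi <- ifelse(x == 1, 0.8, 0.2)        # arm-specific slope phi_21(x)
y1  <- mu1 + rnorm(n)
y2  <- mu2 + phi * (y1 - mu1) + rnorm(n, sd = 0.5)
r   <- rbinom(n, 1, plogis(-1 + 1.5 * y1))   # MAR: depends on y1 only

for (arm in 0:1) {
  m1 <- mean(y1[x == arm])             # consistent for mu_1(arm)
  ## correct dependence: separate regression of y2 on y1 per arm
  fit_ok  <- lm(y2 ~ y1, subset = (x == arm & r == 1))
  est_ok  <- predict(fit_ok, data.frame(y1 = m1))
  ## mis-specified: common slope across arms (arm-specific intercept)
  fit_bad <- lm(y2 ~ y1 + x, subset = (r == 1))
  est_bad <- predict(fit_bad, data.frame(y1 = m1, x = arm))
  cat(sprintf("arm %d: truth %.2f, correct %.3f, common slope %.3f\n",
              arm, unique(mu2[x == arm]), est_ok, est_bad))
}
```

The arm-specific fit recovers µ_2(x), while the common-slope fit is biased in each arm, in line with (6.8); the bias vanishes if the completer and overall means of y_1 coincide.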
6.2 Posterior sampling using data augmentation

Data augmentation is an important tool for full-data inference in the presence of missing data; it is related to the EM algorithm (Dempster, Laird, and Rubin, 1977) and its variations (van Dyk and Meng, 2001), which are used for similar problems in frequentist inference. As illustrated in Chapter 5, the general strategy is to specify a model and priors for the full data and then to base posterior inference on the induced observed-data posterior, p(θ | y_obs). However, the full-data posterior p(θ | y) is often easier to work with than the observed-data posterior p(θ | y_obs); for example, the full conditional distributions used in Gibbs sampling typically have simpler forms under the full-data posterior than under the observed-data posterior. This motivates augmenting the observed-data posterior with the missing data y_mis. To do this, at each iteration k of the sampling algorithm, we draw (y_mis^(k), θ^(k)) using

1. y_mis^(k) ~ p(y_mis | y_obs, θ^(k−1));
2. θ^(k) ~ p(θ | y_obs, y_mis^(k)).

Thus, we can sample θ using the tools derived in Chapter 3 for the full-data posterior, as if we had complete data. Implicitly, via Monte Carlo integration within the MCMC algorithm, we obtain a sample from the observed-data posterior p(θ | y_obs) given in (6.2). Since data augmentation depends on sampling from p(y_mis | y_obs, θ), we will see in each of the following examples that the augmentation depends heavily on the within-subject dependence structure. We start with examples of data augmentation, under ignorable dropout, for several models from Chapter 2.

Example 6.3. (cont. of Example 2.3) Data augmentation under ignorability for a multivariate normal model with temporally aligned observations

Define Y_obs,i = (Y_i1, ..., Y_iJ_i)ᵀ and Y_mis,i = (Y_i,J_i+1, ..., Y_iJ)ᵀ, where J_i denotes the last occasion observed for subject i. Within the sampling algorithm, the distribution p(y_mis,i | y_obs,i, θ) takes the form

Y_mis,i | y_obs,i, θ ~ N(µ*, Σ*),   (6.14)

where

µ* = X_im β + B(Σ)(y_obs,i − X_io β),
Σ* = Σ_mm − Σ_omᵀ Σ_oo⁻¹ Σ_om,
B(Σ) = Σ_omᵀ Σ_oo⁻¹,

X_io and X_im are the rows of X_i corresponding to the observed and missing occasions, and (Σ_oo, Σ_om, Σ_mm) are the corresponding blocks of Σ. The dependence of y_mis,i on y_obs,i is governed by B(Σ), the matrix of regression coefficients from regressing y_mis,i on y_obs,i.
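A minimal R sketch of the augmentation draw (6.14) follows. It is our own illustration, not code from the text: the function name and the example values are hypothetical, and the mean is taken as zero for brevity (replace `mu` with X_i β in a regression setting).

```r
## Draw y_mis | y_obs, theta for one subject under the MVN model (6.14).
## y: length-J response with NA for missing entries; mu: mean vector;
## Sigma: J x J covariance matrix.
draw_ymis <- function(y, mu, Sigma) {
  m <- is.na(y); o <- !m
  B        <- Sigma[m, o, drop = FALSE] %*% solve(Sigma[o, o, drop = FALSE])
  mu_star  <- mu[m] + B %*% (y[o] - mu[o])
  Sig_star <- Sigma[m, m, drop = FALSE] - B %*% Sigma[o, m, drop = FALSE]
  ## one multivariate normal draw via the Cholesky factor of Sigma*
  drop(mu_star + t(chol(Sig_star)) %*% rnorm(sum(m)))
}

## Hypothetical example: J = 3, dropout after the second occasion
Sigma <- matrix(0.6, 3, 3); diag(Sigma) <- 1
y     <- c(0.2, -0.5, NA)
draw_ymis(y, mu = rep(0, 3), Sigma = Sigma)
```

Embedding this draw as step 1 of the Gibbs sampler, and updating θ from the completed data in step 2, yields the data augmentation algorithm above.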
Example 6.4. (cont. of Example 2.5) Data augmentation under ignorability for an MTM(1)

We define Y_mis,i and Y_obs,i as in Example 6.3. Under first-order dependence, the missing responses are imputed sequentially: p(y_i,j+1 | y_obs,i) is equivalent to p(y_i,j+1 | y_ij), which takes the form

Y_i,j+1 | y_ij, θ ~ Ber(φ_i,j+1),   (6.15)

where logit(φ_i,j+1) = ∆_ij(X_i; β) + γ_j y_ij.

Example 6.5. (cont. of Example 2.2) Data augmentation under ignorability for random effects logistic regression

Again, we define Y_mis,i and Y_obs,i as in Example 6.3. The distribution p(y_mis,i | y_obs,i, b_i, θ) is a product of independent Bernoullis with probabilities

P(Y_ij = 1 | y_obs,i, b_i, θ) = {1 + exp(−(x_ijᵀ β + w_ijᵀ b_i))}⁻¹,  j ≥ J_i + 1.   (6.16)

The dependence of these imputed values on the covariance matrix of the random effects, Ω, is evidenced by the presence of the random effect b_i in the probability above. By integrating out the random effects b_i (as in Example 2.2), which cannot be done analytically, the role of the random effects covariance matrix Ω in the distribution [Y_mis,i | Y_obs,i, θ] becomes more transparent:

p(y_mis,i | y_obs,i, θ) = ∫ p(y_mis,i | y_obs,i, b_i, θ) p(b_i | θ, y_obs,i) db_i   (6.17)
  = ∫ p(y_mis,i | b_i, θ) p(b_i | θ, y_obs,i) db_i.   (6.18)

However, (6.18) is not available in closed form. Recalling the approximation to the population-averaged distribution in Example 2.2 for a model with only a random intercept (w_ij ≡ 1), (6.18) can be approximated by

P(y_mis,i,j = 1 | y_obs,i, θ) ≈ exp(x_ijᵀ β*) / (1 + exp(x_ijᵀ β*)),   (6.19)

where β* ≈ β k(Ω), with k(Ω) the attenuation factor given in Example 2.2.
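Given current draws of (β, b_i), the augmentation step in (6.16) is just a set of independent Bernoulli draws. The following R sketch (our own, with a hypothetical design matrix and parameter values) illustrates this for a random-intercept model.

```r
## Impute missing binary responses for one subject in a random-intercept
## logistic model, given current draws of beta and b_i (eq. 6.16).
impute_binary <- function(y, X, beta, b_i) {
  m    <- is.na(y)
  eta  <- X[m, , drop = FALSE] %*% beta + b_i   # w_ij = 1 (random intercept)
  y[m] <- rbinom(sum(m), 1, plogis(eta))
  y
}

## Hypothetical example: J = 4 occasions, last two missing
X <- cbind(1, 0:3)                    # intercept and time
y <- c(1, 0, NA, NA)
impute_binary(y, X, beta = c(-0.5, 0.3), b_i = 0.8)
```

The draw of b_i itself comes from its full conditional given (y_obs,i, θ), so the random effects covariance matrix Ω enters the imputation through b_i, exactly as noted after (6.16).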
Example 6.6. (cf. Example 2.4) Data augmentation under ignorability with irregularly spaced normal data

For irregularly spaced observation times, missing observations are imputed by first constructing Σ_i from the covariance function C(·, ·; φ) (recall Example 3.2) and then using the methods described in Example 6.3.

In all of these examples, the predictive distributions p(y_mis | y_obs, θ) are functions of the dependence parameters (Σ, γ_j, Ω), making clear the importance of the dependence structure in filling in the missing data. With complete data, the data augmentation step is not needed and, for inference about the mean, the dependence structure becomes less important.

6.2.1 Data augmentation and multiple imputation

Bayesianly proper multiple imputation (Schafer, 1997) samples values from

p(y_mis | y_obs) = ∫ p(y_mis | y_obs, θ) p(θ | y_obs) dθ   (6.20)

to obtain m full datasets, each of which is analyzed using the full-data response log likelihood L(θ | y) or the full-data response model posterior p(θ | y). Inferences are then appropriately adjusted for the uncertainty in the missing values (Schafer, 1997). This is an approximation to the fully Bayesian data augmentation procedure in (6.2), obtained by sampling just a few values from p(y_mis | y_obs) (as opposed to full Monte Carlo integration).

Bayesianly improper multiple imputation would sample m values from some distribution p*(y_mis | y_obs) ≠ p(y_mis | y_obs). This might be implemented when the imputation model is specified and fit separately from the full-data response model (Rubin, 1987), or in the case of auxiliary covariates, on which we expand below.

Consider the assumption of A-MAR (recall the definition in Chapter 5). This implies that, after conditioning on the observed responses y_obs and model covariates x, y_mis still depends on r,

p(y_mis | y_obs, r, x) ≠ p(y_mis | y_obs, x),   (6.21)

but that, upon also conditioning on auxiliary covariates z, y_mis no longer depends on r,

p(y_mis | y_obs, r, x, z) = p(y_mis | y_obs, x, z).   (6.22)

So in this case p*(y_mis | y_obs) = p*(y_mis | y_obs, x, z), i.e., a regression model for y_mis given all the observed data, including the auxiliary covariates z that were not included as model covariates in the full-data response model. In Chapter 5, we discussed how to deal with auxiliary covariates in fully Bayesian inference (cf. eq. (5.2)). In Section 6.5, we provide more detail in the context of time-varying auxiliary covariates.

6.3 Covariance structures for univariate longitudinal processes

In Examples 6.1 and 6.2, we showed the importance of covariance specification with incomplete data. We now explore ways to model the dependence. For multivariate normal models (e.g., Example 2.3) in which J is large or n is small (relative to each other), it is common to assume a parsimonious structure for Σ to avoid having to estimate a large number of parameters, J(J+1)/2 (or potentially even more if Σ depends on covariates). Most covariance structures aim to parsimoniously model the serial correlation of longitudinal data. We discuss two related classes of models for doing this, structured antedependence models (Nunez-Anton and Zimmerman, 2000) and structured GARP/IV models (Pourahmadi and Daniels, 2002); both rely on the modified Cholesky parameterization of Σ. We also review reducing the number of parameters in Σ by inducing structure via random effects (serial and non-serial correlation), and some common covariance functions for temporally misaligned data.

A natural parameterization on which to introduce structure in Σ is the modified Cholesky decomposition. Recall that this parameterization yields a set of variance parameters {σ_t²: t = 1, ..., J}, called innovation variances (IV), and a set of dependence parameters {φ_tj: j = 1, ..., t−1; t = 2, ..., J}, called generalized autoregressive parameters (GARP). This is a particularly natural parameterization for missingness due to dropout, as the parameters correspond to the conditional distributions p(y_t | y_1, ..., y_{t−1}), t = 1, ..., J; these parameters are also well suited to characterizing the identifying restrictions that will be discussed in Chapter 8. We next discuss two classes of parsimonious models based on these parameters.

Structured antedependence (SAD) models for Σ

Nunez-Anton and Zimmerman (2000) proposed the following models for the modified Cholesky parameters:

φ_j,k = φ_k^{f(t_j; λ_k) − f(t_{j−k}; λ_k)},  k = 1, ..., j−1; j = 2, ..., J,
σ_j² = σ² g(t_j; ξ),  j = 1, ..., J,

where

f(t; λ) = (t^λ − 1)/λ if λ ≠ 0, and f(t; λ) = log(t) if λ = 0,

and g(t; ξ) is typically a low-order polynomial, e.g., g(t; ξ) = 1 + ξt. The function f(t; λ_k) determines the relationship of the lag-k autoregressive coefficients with time. It is a Box-Cox transformation (Box and Cox, 1964) of the time scale, chosen so that the effect of time is linear (in the exponent) on this scale. When λ_k = 1, f(t_j; 1) − f(t_{j−k}; 1) = t_j − t_{j−k} = k, which is not a function of the time t_j, only of the difference between the times t_j and t_{j−k}, i.e., the lag; thus, the lag-k autoregressive coefficients at each time are equal. The model {φ_1 ≠ 0 and φ_k = 0: k > 1} is a first-order autoregressive structure referred to as SAD(1); {φ_k = 0: k > j} corresponds to an SAD(j). Values of λ_k greater (less) than one correspond to lag-k regression coefficients that decrease (increase) monotonically as a function of time.

An alternative formulation replaces the GARP φ_jk with the pairwise marginal correlations ρ_jk, and the innovation variances with the marginal variances σ_jj (Nunez-Anton and Zimmerman, 2000). For either formulation, the full conditionals for λ_k and ξ are not available in closed form; in addition, for the latter formulation using pairwise correlations, (complex) constraints are necessary to maintain positive definiteness.

Exploratory model selection can be conducted by examining the lag-j autoregressive coefficients {φ_tk: t − k = j} and the innovation variances {σ_t²} as functions of t (time). Posterior means of the GARP parameters from fitting the model in Example 2.3 to the data from the schizophrenia clinical trial (Data Example 1.1) appear in Table 6.1. The posterior means for the innovation variances were (124, 112, 76, 68, 69, 56).
Table 6.1 GARP parameters from fitting a multivariate normal model to the schizophrenia data

Here, the innovation variances are decreasing over time, so a monotone function might be chosen for g(t_j; ξ). The GARP parameters within lag (examine each of the sub-diagonals of Table 6.1) do not appear to be monotonic in time. However, the GARP for lags 3, 4, and 5 are quite small, which suggests possibly fitting an SAD(2) model. There is clearly considerable potential here to parsimoniously model this covariance matrix and thus reduce the number of parameters.

Structured GARP/IV models for Σ

Structured GARP/IV models (Pourahmadi and Daniels, 2002), like the SAD models, use the GARP (φ_tj) and IV (σ_t²) parameters from the modified Cholesky decomposition. However, they build structure through (simpler) linear and log-linear models, respectively:

φ_j,k = a_j,kᵀ γ,  k = 1, ..., j−1; j = 2, ..., J,   (6.23)
log(σ_j²) = b_jᵀ λ,  j = 1, ..., J,   (6.24)

where a_j,k is a design vector representing a combination of a smooth function in lag (j − k) and/or a smooth function in j for a fixed lag (j − k), and b_j is a design vector representing a smooth function in j. These smooth functions are typically chosen to be low-order polynomials (or splines). Special cases of these models include setting φ_jk = φ_{j−k} for all j and k, i.e., stationary GARP; a first-order structure would additionally set φ_{j−k} = 0 for j − k > 1. GARP/IV models look for structure as a function of time at a fixed lag, as do the SAD models, but unlike the SAD models they also look for structure across lags. Exploratory model selection can be conducted, as for the SAD models, by examining an unstructured estimate of Σ (Table 6.1) and/or by examining regressograms (Pourahmadi, 1999), of which we provide some examples in the following.

To illustrate these models, we again examine the covariance structure from the schizophrenia clinical trial (Data Example 1.1) shown in Table 6.1. Figure 6.1 shows the lag-1 GARP as a function of week; there is little structure to exploit here. Figure 6.2 shows the GARP as a function of lag. The GARP for lags greater than 2 appear small, so one possibility would be to set such parameters (structurally) to zero. An alternative approach, suggested in Pourahmadi (1999), would be to model the GARP using a polynomial in lag; based on the figure, a quadratic might be adequate for these data.

Figure 6.1 Posterior means of lag 1 GARP with 95% credible intervals as a function of weeks in the schizophrenia clinical trial.
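The map from GARP/IV parameters back to Σ is easy to compute, which is what makes structured models of the form (6.23)-(6.24) convenient: any values of the (unconstrained) GARP and log innovation variances yield a positive definite Σ. The following R sketch is our own construction, with hypothetical values illustrating an SAD(1)-type structure.

```r
## Build Sigma from GARP (phi[t, j], j < t) and innovation variances sig2.
## Modified Cholesky: T y = eps, with T unit lower-triangular holding -phi
## below the diagonal and Var(eps) = diag(sig2), so
## Sigma = T^{-1} diag(sig2) T^{-T}.
garp_iv_to_Sigma <- function(phi, sig2) {
  J  <- length(sig2)
  Tm <- diag(J)
  Tm[lower.tri(Tm)] <- -phi[lower.tri(phi)]
  Ti <- solve(Tm)
  Ti %*% diag(sig2) %*% t(Ti)
}

## Hypothetical lag-1 structure (all higher-lag GARP set to zero), J = 4
J   <- 4
phi <- matrix(0, J, J)
phi[cbind(2:J, 1:(J - 1))] <- 0.7          # phi_{t,t-1} = 0.7
Sigma <- garp_iv_to_Sigma(phi, sig2 = c(1, 0.8, 0.8, 0.7))
all(eigen(Sigma)$values > 0)               # TRUE: PD by construction
```

In a structured GARP/IV fit, the φ and log σ² entries would be replaced by the linear predictors in (6.23)-(6.24) evaluated at the current draw of (γ, λ).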
Figure 6.2 Posterior means of GARP as a function of lag in the schizophrenia clinical trial.

An attractive feature of the structured GARP/IV models is that, given the linear models on φ_jk in (6.23), computations are much simpler than for the SAD models: the GARP regression parameters γ have normal full conditional distributions when the prior on γ is normal (see Pourahmadi and Daniels, 2002).

The SAD and structured GARP/IV models are very similar. Both use the parameters of the modified Cholesky decomposition of Σ to develop parsimonious models for Σ. However, we recommend the latter for computational considerations and for their increased generality in looking for structure in the GARP both within and across lags.

Random effects are another way to parsimoniously model the dependence structure. Whereas the GARP/IV models parameterize the covariance matrix directly, random effects models induce structure on the covariance matrix indirectly. For example, the normal random effects model (Example 2.1) implicitly induces a parsimonious covariance structure for Σ based on both serial and non-serial correlation components. This is illustrated next.

Structured covariance matrices induced by random effects

For the normal random effects model, Example 2.1, the marginal covariance structure for Y_i, after integrating out the random effects, is Σ = σ²I + W_i Ω W_iᵀ. This covariance model has dimension q(q+1)/2 + 1, whereas the unstructured Σ has dimension J(J+1)/2, where typically q << J. This illustrates an alternative approach to inducing structure in the marginal covariance matrix that is not based only on serial correlation ideas. Introduction of random effects can also be used to introduce structured dependence in GLMMs (cf. Example 2.2). For a comparison of random effects models and structured GARP/IV models in an application, see Pourahmadi and Daniels (2002). It is also feasible to combine structured GARP/IV models with random effects by decomposing Σ as Σ = Σ* + W_i Ω W_iᵀ, where Σ* is modelled using a parsimonious structured GARP/IV model (Pourahmadi and Daniels, 2002).

As discussed in Example 2.4, a structured covariance is necessary to estimate the covariance function for misaligned temporal data. We review some examples next.

Structured covariance functions for misaligned data

Recall that for temporally misaligned longitudinal data, the covariance structure is typically summarized via a covariance function, Cov(Y_ij, Y_ik) = C(t_ij, t_ik; φ). This function needs to be structured to be estimable, as it is often the case that no (or few) replications are available for observations at certain times or pairs of times. As a result, the process is often assumed to be weakly stationary (Diggle, 1988). This implies the covariance function is only a function of the difference between times, i.e., C(t_ij, t_ik; φ) = g(|t_ij − t_ik|; φ). For example, the exponential covariance function (cf. Example 2.4) is given by

C(t_ij, t_ik; σ², ρ) = σ² exp{−ρ |t_ij − t_ik|^c}.
The corresponding correlation function is exp{−ρ |t_ij − t_ik|^c}; clearly, the correlation decays exponentially with (the cth power of) the distance between times (for ρ > 0).

Taylor, Cumberland, and Sy (1994) proposed a covariance function based on an integrated Ornstein-Uhlenbeck process, which takes the form

C(t_ij, t_ik) = (σ²/(2α³)) [2α min{t_ij, t_ik} + exp(−α t_ij) + exp(−α t_ik) − 1 − exp(−α |t_ij − t_ik|)].

Clearly, this is a structured nonstationary covariance function, as the covariance depends both on the difference between the observation times, t_ij − t_ik, and on the times themselves, {t_ij, t_ik}. These covariance functions have been used for longitudinal CD4 counts (Taylor, Cumberland, and Sy, 1994) and are well suited to data exhibiting derivative tracking (individuals' longitudinal measurements tending to maintain the same trajectory over time). For data arising from a Gaussian process (recall Examples 2.4 and 3.2), the distribution of any set of responses at a finite number of observation times follows a multivariate normal distribution, and the mean and covariance functions completely specify the distribution. For other commonly used structured covariance functions, see Fuller (1996) and Cressie (1993). For an illustration, refer back to Data Example 4.4 in Chapter 4. A structured covariance function can also be induced from a random effects model, as discussed earlier.

Next we discuss methods to incorporate covariates into the covariance structure.

6.4 Covariate-dependent covariance structures

Properly modelling the covariance structure for extrapolating/imputing missing data also involves determining whether the structure depends on covariates through the observed data (under a specific missing data assumption). In fact, work by Heagerty and Kurland (2001) and Kurland and Heagerty (2004) has shown that, even with complete data, the mean regression coefficients in generalized linear mixed models (Example 2.2) can be biased if the random effects variance depends on a between-subject covariate and this dependence is not modelled. They also observed that this bias is reduced in marginalized models (Heagerty and Zeger, 2000); for example, the bias would be larger in a logistic random effects model (Example 2.2) than in Heagerty's corresponding marginalized random effects model (Heagerty, 1999). In light of this, and of its importance with incomplete data, it is important to develop natural and easy ways to introduce covariates into the covariance structure (re: Example 2.3), whether it is represented as a covariance/correlation matrix, a covariance function, or Markov dependence parameters.

6.4.1 Covariance/correlation matrices

A complication in allowing components of a covariance matrix to depend on covariates is ensuring that the resulting covariance matrices, now functions of covariates, are positive definite. Several parameterizations of the covariance matrix have been proposed that provide a new set of unconstrained parameters, giving a natural scale on which to introduce covariates (cf. the link function in GLMs; McCullagh and Nelder, 1989). We describe two such parameterizations of Σ next.

Example 6.7. (cont. of Example 2.4) Modelling Σ using the log-matrix parameterization, for a multivariate normal model with temporally aligned observations

The log-matrix parameterization is based on the spectral decomposition of Σ,

Σ = P Λ Pᵀ,   (6.25)

where P is an orthogonal matrix whose columns are the eigenvectors of Σ, and Λ is the diagonal matrix of ordered eigenvalues λ_1 > λ_2 > ... > λ_dim(Σ) (Leonard and Hsu, 1992). Σ is positive definite if and only if λ_dim(Σ) > 0. To remove this constraint, we take the logarithm of the eigenvalues and form the matrix Σ* = P Λ* Pᵀ, where Λ* is a diagonal matrix with the logarithms of the eigenvalues on the diagonal. Define the new set of parameters σ* = vech(Σ*). The parameters σ* are unconstrained and provide a natural scale on which to introduce covariates (Chiu, Leonard, and Tsui, 1996),

σ*_ij = a_ijᵀ γ,  j = 1, ..., J(J+1)/2,

where a_ij is the design vector of covariates for component j of σ*_i for subject i. Unfortunately, the full conditional distribution of γ is not available in closed form for these models; some suggestions for posterior sampling are given in Chiu et al. (1996).

As an illustration of this approach, we examine the log-matrix parameterization of the covariance matrices in the schizophrenia clinical trial (Data Example 1.1), assuming a common covariance matrix and assuming a separate covariance matrix for each of the four treatments, as given in Tables 6.2 and 6.3.

Table 6.2 Log matrix transformation of common covariance matrix

Table 6.3 Log matrix transformation of covariance matrix for each treatment (Low, High, Medium, Standard)

The last row of the log matrix varies considerably across the four treatment groups; the diagonal elements vary much less. So, the design vectors a_ij could be chosen to take this into account. A disadvantage of this parameterization is that the elements of the log matrix are not easy to interpret and do not lend themselves naturally to structured models for longitudinal data. This parameterization can also be used to introduce covariates into a random effects covariance matrix (see Examples 2.1 and 2.2).

Example 6.8. (cont. of Example 2.4) Modelling Σ using the parameterization induced by the modified Cholesky decomposition, for a multivariate normal model with temporally aligned observations

The modified Cholesky decomposition (Pourahmadi, 1999), used in Section 6.3 to introduce structure, also provides an unconstrained parameterization on which to introduce covariates. An advantage of this parameterization (as discussed earlier) is that the full conditional distribution of γ is normal (assuming p(γ) is normal). Recall that the parameters φ_i,k,j, k = 2, ..., J, j = 1, ..., k−1, are unconstrained, as are log(σ_it²), t = 1, ..., J. So, covariates can be introduced as

φ_i,k,j = a_ikjᵀ γ,  k = 2, ..., J; j = 1, ..., k−1,
log(σ_ij²) = a_ijᵀ λ,  j = 1, ..., J,

where a_ikj and a_ij are design vectors for the GARP and log innovation variances, respectively. These design vectors can be used to introduce covariates (and to introduce structure).

We illustrate these models again using the data from the schizophrenia clinical trial. The main covariate of interest in these data was treatment, so we examine the GARP and IV parameters by treatment. Figure 6.3 shows the innovation variances for each treatment. Within each treatment, the innovation variances do not show much structure as a function of time. However, there appear to be some large differences across treatments. For example, the innovation variance for the high dose at week 4 is considerably higher than for the other three treatments, and the credible intervals do not overlap. At week 6, the high-dose innovation variance is still the largest, though there is more overlap among the credible intervals. This plot suggests that some of these variances should be modelled as a function of treatment. Figure 6.4 shows the lag-1 GARP for each treatment, again plotted as a function of time. Here, the lag-1 GARP at week 6 for the medium dose is much smaller than for the other three treatments (with its credible interval not overlapping that of the standard dose). This exploratory analysis suggests that the covariance matrix for the schizophrenia data does depend on treatment, and the GARP/IV parameterization provides a parsimonious way to model this, as the individual parameters can depend (or not) on (a subset of) treatment groups while the resulting covariance matrices are guaranteed to be positive definite.

Figure 6.3 Posterior means of innovation variances (as a function of week) with 95% credible intervals for each treatment in the schizophrenia clinical trial. (a) low dose, (b) medium dose, (c) high dose, (d) standard dose.

For a detailed analysis using these models, see Pourahmadi and Daniels (2002), who analyzed longitudinal depression data and found that the innovation variances and GARP depended on the patient's treatment and initial severity of depression.
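A sketch of how Example 6.8 plays out computationally: because the GARP and log innovation variances are unconstrained, each treatment arm can get its own values and the implied Σ(x) is positive definite automatically. The R code below is our own illustration with hypothetical values; in a real fit, the arm-specific parameters would come from the design vectors a_ikj and a_ij above.

```r
## Treatment-specific GARP/IV (Example 6.8): unconstrained parameters per
## arm, positive definite Sigma(x) by construction.  Values hypothetical.
Sigma_from_garp_iv <- function(gamma_lag1, log_iv) {
  J  <- length(log_iv)
  Tm <- diag(J)
  Tm[cbind(2:J, 1:(J - 1))] <- -gamma_lag1   # lag-1 GARP only, for brevity
  Ti <- solve(Tm)
  Ti %*% diag(exp(log_iv)) %*% t(Ti)         # Sigma = T^{-1} D T^{-T}
}

## e.g., a medium-dose arm with a small week-6 lag-1 GARP, versus the
## structure shared by the other arms
Sig_medium <- Sigma_from_garp_iv(c(0.9, 0.8, 0.2), log(c(120, 110, 75, 70)))
Sig_others <- Sigma_from_garp_iv(c(0.9, 0.8, 0.8), log(c(120, 110, 75, 70)))
```

No positive-definiteness check or rejection step is needed, in contrast to parameterizations such as the covariate-dependent correlations of Example 6.9 below.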
Figure 6.4 Posterior means of lag 1 GARP (as a function of week) with 95% credible intervals for each treatment in the schizophrenia clinical trial. (a) low dose, (b) medium dose, (c) high dose, (d) standard dose.

The modified Cholesky parameterization has also been used to introduce covariates into the random effects covariance matrix in the model given in Example 2.1 (Daniels and Zhao, 2003), and could also be used for the random effects covariance matrix in Example 2.2. However, there should be some implicit (or explicit) ordering of the random effects for this parameterization to be fully justified, as the parameterization is not invariant to the ordering of the components of b_i; recall that it is induced by an (ordered) factorization of the joint distribution. If the components of b_i were the coefficients of orthogonal polynomials or regression splines, there is an obvious ordering. For a detailed example of introducing covariates into the random effects covariance matrix, we refer the reader to Daniels and Zhao (2003).

There are other, less computationally friendly, ways to introduce covariates into a covariance/correlation matrix. In the setting of Example 2.3, there has been work on modelling the logarithms of the marginal variances while keeping the correlation matrix constant across covariates (Manly and Rayner, 1987; Barnard, McCulloch, and Meng, 2000; Pourahmadi, Daniels, and Park, 2007). However, this can create computational difficulties, as the logarithms of the marginal variances are not unconstrained if Σ is to remain positive definite.

Example 6.9. (cont. of Example 2.6) Modelling the correlation matrix as a function of covariates in the multivariate probit model

In the setting of multivariate probit models, we can model the individual correlations as functions of covariates. The typical approach is to transform each correlation to the real line using Fisher's z-transform (Czado, 2000),

z(ρ_i,jk) = (1/2) log{(1 + ρ_i,jk)/(1 − ρ_i,jk)} = a_i,jkᵀ γ.   (6.26)

This transformation of the individual correlations does not guarantee that the resulting correlation matrices are positive definite. Thus, the vector of regression coefficients for the correlations, γ, will be constrained; within a sampling algorithm, sampled values of γ that correspond to a non-positive-definite correlation matrix have prior probability zero and are rejected within a Metropolis-Hastings algorithm.

For temporally misaligned data, similar approaches can be applied to covariance/correlation functions.

Example 6.10. (cont. of Example 2.4) Modelling the covariance function as a function of covariates

Consider the exponential covariance function introduced in Example 2.4, C(t_ij, t_ik; ρ, σ²) = σ² exp(−ρ |t_ij − t_ik|) (with c = 1). A natural way to introduce covariates here would be through log(σ²) and log(ρ) (assuming ρ > 0), as these transformations provide a set of unconstrained parameters that maintain positive definiteness of C(·, ·; ρ, σ²).
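A minimal R sketch of Example 6.10 (our own, with hypothetical covariate effects and times): covariates enter through log links on σ² and ρ, and the implied covariance matrix at any set of misaligned times is positive definite for any coefficient values.

```r
## Exponential covariance with covariates through log(sigma2) and log(rho).
## x: subject-level covariate vector; eta, kappa: hypothetical coefficients.
exp_cov <- function(times, x, eta, kappa) {
  sigma2 <- exp(sum(x * eta))           # log(sigma2) = x' eta
  rho    <- exp(sum(x * kappa))         # log(rho)    = x' kappa  (rho > 0)
  D <- abs(outer(times, times, "-"))    # |t_ij - t_ik|
  sigma2 * exp(-rho * D)
}

## misaligned times for one subject; x = (1, treatment)
Si <- exp_cov(times = c(0, 0.4, 1.3, 2.1), x = c(1, 1),
              eta = c(0.5, 0.3), kappa = c(-0.2, 0.4))
```

Unlike the Fisher-z approach of Example 6.9, no rejection step is needed here, which is one reason log links on (σ², ρ) are attractive for covariance functions.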
6.4.2 Markov transition models for binary data

For longitudinal binary data models constructed from underlying normal latent variables or random effects (Example 2.2), covariates can be introduced into the corresponding covariance/correlation matrix as discussed in the previous section. For MTMs (Example 2.5) and transition models, covariates can be introduced through the (unconstrained) Markov dependence parameters. In the following two examples, we discuss this in the settings of temporally aligned and misaligned data, respectively.

Example 6.11. (cont. of Example 2.5) Modelling the dependence in MTMs

We start with an MTM(1) with dependence parameters γ_ij, j = 2, ..., J. The γ_ij are (unconstrained) log odds ratios that can be modelled as functions of covariates, γ_ij = a_ijᵀ α (cf. Example 2.5), with no computational burden beyond that of models without covariates. For higher-order MTMs, the dependence parameters corresponding to lags larger than one are also unconstrained and can be modelled similarly. These models are fit to Data Example 7.3 in Chapter 7.

Example 6.12. (cont. of Example 2.5) Modelling the dependence in MTMs with misaligned times

For misaligned measurement times, Su and Hogan (2005) developed continuous-time MTMs for longitudinal binary data. They model γ_ij, the transition parameter between Y_i,j−1 and Y_ij (measured at t_i,j−1 and t_ij, respectively), as

γ_ij = a_ijᵀ α + h(t_ij − t_i,j−1),

where h(u) is a smooth function of the lag (which they model using penalized splines). See Su and Hogan (2005) for an illustration of this model on the HERS data, Example 1.4, and for simulations that illustrate the importance of correctly specifying the dependence under ignorability in such models.

6.5 Multivariate processes

Multiple longitudinal processes are typically modelled separately, despite potential gains in efficiency from modelling them jointly, unless there is specific interest in the relationship between the processes. In the case of incomplete data, there are additional reasons to build and base inference on joint models. Some of these reasons echo those for carefully modelling the dependence in univariate longitudinal data (Section 6.1), e.g., efficiency. More importantly, joint modelling allows the use of all available data, on the same and on other processes, to impute the missing values. This is especially advantageous when the processes are not all observed at the same observation times and observed responses from other processes are temporally closer than observed responses from the same process. By modelling the processes separately, we do not use the information from the other processes to fill in the missing values.

There is a further reason to consider joint models. Recall the definition of auxiliary MAR, A-MAR, from Chapter 5. In some cases, the auxiliary covariate Z may be a time-varying covariate (process). In this case, joint models can be constructed for the auxiliary process Z(t) and the process of interest, say Y(t). For example, consider the example in Chapter 5 with CD4 (process of interest) and viral load (auxiliary process), or the smoking cessation/weight change data (Data Example 1.3), where the primary process was smoking cessation and the auxiliary process was weight change. In most of these cases, when the individual drops out, we stop observing both the primary process and the auxiliary process. Let z_obs denote the observed values of the auxiliary covariate process. Then, under A-MAR,

p(r | y, x) ≠ p(r | y_obs, x),   (6.27)

but

p(r | y, x, z_obs) = p(r | y_obs, x, z_obs).   (6.28)

In the setting of the smoking cessation clinical trials, the above assumption corresponds to the probability of missingness at time t being independent of the unobserved smoking status at time t, conditional on the previous weeks' smoking responses and the previous weeks' weight change responses (as opposed to conditional on just the previous weeks' smoking responses). Such an assumption can be viewed as a less restrictive version of MAR, with the extra conditioning on the observed values of the other process. In the case of auxiliary time-varying covariates, to impute y_mis we use p(y_mis | y_obs, z_obs), not p(y_mis | y_obs); this has been discussed in the context of generalized estimating equations by Ibrahim and colleagues.

In the following, we denote the two processes by Y and Z. With a joint model, filling in missing responses can be done for either process, whether interest is in p(y | z), p(z | y), p(y, z), p(y), or p(z). The distributions p(y | z), p(z | y), and p(y, z) are of interest in examining the relationship between smoking cessation and weight gain (Data Example 1.3); p(z | y) and p(y | z) are often of interest in evaluating surrogate markers (Liang, Wu, and Carroll, 2003; Wang and Taylor, 2002). For the setting where one of the processes is an auxiliary time-varying covariate, p(y) or p(z) would be of interest; the smoking cessation clinical trial (Data Example 1.3) is also an example of this, with the weight change process viewed as auxiliary.

Here, we review and introduce some models for multivariate processes of the same type, as in Data Example 1.4 (Roy and Lin, 2000; Ilk and Daniels, 2005; O'Brien and Dunson, 2004), and of different (mixed) types, as in Data Example 1.3 (Gueorguieva and Agresti, 2001; Dunson, 2003; Liu and Daniels, 2007). In particular, we review and introduce some joint models that are well suited to some of the data examples introduced in Chapter 1; we fit some of these models at the end of this chapter. In addition, the models presented here lend themselves to the setting of auxiliary time-varying covariates, in that the full-data response model of interest can be obtained in closed form (cf. eq. (5.2)). Joint modelling of a longitudinal process and a time-to-event process will not be discussed here; references can be found in the Further reading section at the end of this chapter.

In the following examples, we denote by Y_ijk the kth response for subject i at time t_ij.

Example 6.13. Joint normal models for multiple continuous responses

As a starting point, we assume all the processes are (potentially) measured at the same set of observation times (t_ij = t_j) and later discuss extensions to misaligned times. For continuous data, the multivariate normal distribution is an obvious choice for the joint distribution of the longitudinal processes. Consider K processes and let Y_ik = (Y_i1k, ..., Y_iJk)ᵀ,
k = 1, ..., K. For ease of illustration, in the following we set K = 2; the extension to K > 2 is obvious. We model the joint distribution of the two processes, (Y_i1, Y_i2), as

(Y_i1ᵀ, Y_i2ᵀ)ᵀ ~ N(X_i β, Σ),

where

Σ = [ Σ_11  Σ_12
      Σ_12ᵀ Σ_22 ]

and X_i = (X_i1ᵀ, X_i2ᵀ)ᵀ stacks the design matrices for the two processes. The dimension of Σ can be quite large, KJ × KJ, so structure is often imposed on Σ, either via the introduction of latent variables catered to the specific data (e.g., Roy and Lin, 2000), or by inducing structure directly on Σ. We address a specific form of the latter in the context of the bivariate longitudinal data from HERS (Data Example 1.4) on CD4 count and body mass index, where the covariation between these processes was of interest and there were missing responses. Consider the following (serial correlation) structure for Σ (Carey and Rosner, 2001):

Cov(Y_ijk, Y_ij′k) = σ_k² γ_k^{|t_j − t_j′|^{θ_k}},   (6.29)
Cov(Y_ijk, Y_ij′k′) = σ_k σ_k′ γ_12^{(|t_j − t_j′| + 1)^{θ_kk′}/2},  k ≠ k′.   (6.30)

Eq. (6.29) corresponds to the covariance for the same response (k) on the same subject at different times (the longitudinal correlation structure within a subject), and (6.30) corresponds to the covariance across responses at potentially different times. For each, it is assumed that the correlation decreases for responses farther apart in time by restricting γ_k and γ_12 to lie in [0, 1). The parameter γ_12 determines the relationship between CD4 and body mass index in the HERS data; if γ_12 = 0, they are independent processes.

Missing data are imputed in the data augmentation step by constructing the appropriate conditional distributions from this multivariate normal model with a structured covariance matrix.
Clearly, the form of Σ implies that observed responses closer in time to the missing responses will carry more weight in the imputation, and the relative magnitudes of γ_k and γ_12 will determine how much information is used from the other process for imputation.

Multivariate normal models as described above are also natural for handling time-varying continuous (auxiliary) covariates, as the conditional distribution of one process given the other can be written in closed form (as a linear regression), as can the marginal distribution of the process of interest, using standard multivariate normal results. The covariance structures described in Example 6.13 readily extend to misaligned measurement times by inducing a structured covariance function within a Gaussian process.

Marginal, or directly specified, models for multivariate longitudinal binary responses can be specified similarly to Example 6.13 by specifying a multivariate distribution on latent variables Z_i connected to the data by Y_ijk = I{Z_ijk > 0}. However, Σ will now be a correlation matrix, for identifiability. We describe such models in the context of the IYFP data (Ilk and Daniels, 2007).

Example 6.14. Joint marginal models for multiple binary responses

In the IYFP data described in Ilk and Daniels (2007), the marginal (population-level) effects of the covariates were of interest. The trivariate dichotomous responses were correlated both with each other at each time and over time. The correlation matrix corresponding to the covariance matrix in Example 6.13 would potentially be well suited to this problem and is easily generalized to a trivariate response. In the setting of binary data, Z_i is assumed to follow a multivariate normal distribution (i.e., a probit model; recall Example 2.6). However, recent work has also considered allowing Z_i to follow a multivariate t-distribution (O'Brien and Dunson, 2004). One reason for using a t-distribution is that, when the scale matrix and degrees of freedom are appropriately specified, the multivariate t-distribution approximates a multivariate logistic model; that is, Z_ijk marginally follows a logistic distribution, and thus P(Y_ijk = 1) follows a logistic regression model. Computations are as easy as for the probit model, given that the multivariate t-distribution can be re-expressed as a gamma mixture of multivariate normals (see Example 3.15). Structure could also be induced on Σ by introducing random effects and then integrating over them to obtain the marginal covariance structure (recall Section 6.3; see Gueorguieva and Agresti, 2001), or by using the structure induced by a Markov model on the Z's (Czado, 2000; Muller and Czado, 2005; Daniels and Normand, 2006). Similarly to Example 6.13, this class of models is well suited to handling time-varying auxiliary covariates, in this case binary, as the conditional and marginal distributions of the process of interest take simple forms due to the underlying normal latent structure.

Joint models can also be used in the setting of mixed longitudinal processes. Here, we discuss a model for the bivariate setting with one continuous and one binary longitudinal process, motivated by the smoking cessation and weight change data from Data Example 1.3.

Example 6.15. Joint models for continuous and binary responses using an underlying normal latent structure

The smoking cessation/weight change data in Data Example 1.3 raised two questions of interest. First, did the exercise treatment increase the quit rate? For this question, a joint model might be used if the auxiliary process, here weight gain, were thought necessary for an MAR assumption on the missing smoking cessation responses (recall the A-MAR definition in Chapter 5). The second question involved the relationship between smoking cessation and weight gain, for which a joint model is necessary. Liu and Daniels (2007) built variations on the model below to address exactly this question.

A straightforward way to induce correlation between continuous and binary longitudinal processes is to assume the latent variables underlying the binary responses are normally distributed and then to specify an expanded multivariate normal covariance matrix, with the restriction that the marginal covariance matrix corresponding to the binary responses is a correlation matrix. This model has appeared in various forms in the literature (Catalano and Ryan, 1992; Gueorguieva and Agresti, 2001; Liu and Daniels, 2007). As in Example 2.6, we assume Z_i is the vector of latent variables underlying the vector of longitudinal binary responses S_i; we denote the longitudinal vector of continuous responses by Y_i. We assume Y_i and Z_i are jointly multivariate normal,

(Y_iᵀ, Z_iᵀ)ᵀ ~ N(X_i β, Σ),

where X_i = (X_iyᵀ, X_izᵀ)ᵀ and

Σ = [ Σ_yy  Σ_yz
      Σ_yzᵀ Σ_zz ],

such that S_ij = I{Z_ij > 0} and Σ_zz is a correlation matrix. Notice that the marginal distribution of Y_i is multivariate normal and the marginal distribution of S_i follows a multivariate probit model. The submatrix Σ_yz characterizes the dependence between the continuous and binary longitudinal processes; if Σ_yz = 0, the two longitudinal processes are independent.

Gueorguieva and Agresti (2001) assumed a specific structure for Σ by introducing random effects and assuming conditional independence over time given the random effects. Liu and Daniels (2007), whose inferential objective was the relationship between the smoking cessation and weight gain processes, re-parameterized the three components of Σ, (Σ_yy, Σ_yz, Σ_zz), as (Σ_zz, B_yz, Σ*_yy), where B_yz = Σ_yz Σ_zz⁻¹ and Σ*_yy = Σ_yy − B_yz Σ_yzᵀ. These can be recognized as the regression coefficients and residual covariance matrix from the conditional distribution p(y | z); in fact,

Y_i | Z_i ~ N(X_iy β + B_yz (Z_i − X_iz β), Σ*_yy).

Liu and Daniels then applied variable selection techniques (George and McCulloch, 1997; Wong, Carter, and Kohn, 2003) to the components of the B_yz matrix to parsimoniously model the association structure; note that B_yz = 0 if and only if Σ_yz = 0. Some results for this data example can be found at the end of this chapter.

To close this discussion of models for multivariate longitudinal data, we review a few key ideas. To fill in missing data for all the models reviewed in Section 6.5, we use the conditional distribution of the missing components of one process given the observed components of that process and of the other processes. As discussed at the beginning of this section, this accommodates a weaker version of MAR, A-MAR (by conditioning on time-varying auxiliary covariates), which may be needed in some analyses. The choice of a specific joint model may be partially guided by the ultimate question (inference) of interest. For example, with auxiliary covariates, a joint model that provides a closed and/or simple form for the conditional distribution of one process given the other may be preferred (Examples 6.13, 6.14, and 6.15). When there are multiple longitudinal processes of interest, it may be more natural and/or parsimonious to consider models with latent variables/random effects (see Further reading).
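The Liu and Daniels reparameterization is the usual conditional-normal computation. The short R sketch below (our own, with hypothetical dimensions and values) shows how B_yz and Σ*_yy are obtained from the joint covariance matrix; the resulting conditional distribution is what drives both the interpretation of the association and the imputation of one process given the other.

```r
## Conditional distribution p(y | z) implied by the joint normal model of
## Example 6.15: B_yz = Sigma_yz Sigma_zz^{-1},
##               Sigma_yy* = Sigma_yy - B_yz Sigma_yz'.
cond_y_given_z <- function(Sigma, iy, iz) {
  Byz   <- Sigma[iy, iz] %*% solve(Sigma[iz, iz])
  Sstar <- Sigma[iy, iy] - Byz %*% t(Sigma[iy, iz])
  list(B = Byz, Sigma_star = Sstar)
}

## toy joint covariance: 2 continuous (y) and 2 latent (z) coordinates
S <- rbind(c(1.0, 0.5, 0.3, 0.1),
           c(0.5, 1.0, 0.1, 0.3),
           c(0.3, 0.1, 1.0, 0.4),
           c(0.1, 0.3, 0.4, 1.0))
cc <- cond_y_given_z(S, iy = 1:2, iz = 3:4)
## conditional mean for one subject: X_y beta + cc$B %*% (z - X_z beta)
```

Setting rows or columns of B to zero, as in the Liu and Daniels variable selection, zeroes out the corresponding entries of Σ_yz without disturbing positive definiteness.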
160 154 INFERENCE UNDER MAR 6.6 Model comparison and fit So far, we have reviewed and proposed a wide variety of models for (multivariate) longitudinal (incomplete) data. We now discuss some modifications of the techniques for model comparison and for assessing model fit, first introduced in Chapter 3, that can be adapted to incomplete data settings (under ignorability). Before we discuss these modifications, we remind the reader that we can only assess the fit of the full data model to the observed data; the fit to the missing data cannot be ascertained. So, models which provide the same fit to the observed data should be indistinguishable when using a sensible model selection criterion even though they may make very different assumptions about the missing data. Since the focus of this chapter has been on ignorable missingness, we assess the fit of the model using only the full data response model, p(y θ), not the full data model, p(r, y ω) since it was unnecessary to specify p(r y obs, ψ). For non-ignorable mechanisms, we use p(r, y ω); this will be discussed in detail in Chapter 8. DIC The DIC is typically constructed using the likelihood based on the full data response model. The complication with using the DIC under ignorability is the fact that we do not observe the complete data. In the following, we will propose three constructions of the DIC in the presence of ignorable missing data, labelled as DIC I-III. The literature is sparse on recommendations for constructing the DIC in this setting. The following versions will be based on those used in several methodological papers (Pourahmadi and Daniels, 2002; Ilk and Daniels, 2007) and those proposed in Celeux et al. (2007). The recommendations in Celeux et al., however, may not directly generalize to our setting as they deal with missing data that is latent data in the context of mixture and random effects models. Such data is fundamentally different than missing response data. First, we can never observe the latent data, but ideally, we would have observed the missing response data. Second, there is an additional binary vector of missingness indicators that are observed that have no analogue in random effects and mixture models and that in non-ignorable models, need to be explicitly modelled. DIC I A first, and perhaps most obvious construction based on the development in Chapter 5 would be to integrate out the missing data (6.2) and use the resulting observed data likelihood for the DIC, L(θ y obs ). This approach has been applied in the literature (Pourahmadi and Daniels,
DIC II

An alternative approach, proposed by Celeux et al. (2007) in the context of mixture and random effects models, is to construct the DIC based on the (unobserved) full-data response model. For clarity, in the following development we express the DIC explicitly as a function of y, DIC(y) = DIC({y_mis, y_obs}). Since the missing data are not observed, we take the expectation of DIC(y_mis, y_obs) with respect to the posterior predictive distribution p(y_mis | y_obs), i.e., E_{y_mis | y_obs}[DIC(y)]. Note that we took a similar expectation with respect to p(y_mis | y_obs) to derive the observed data posterior from the full data posterior in (6.2). We provide some details on computing this version of the DIC next.

Recall from Section 3.4 that DIC(y) can be re-written as

    DIC(y_mis, y_obs) = -4 E_θ[log L(θ | y_obs, y_mis)] + 2 log L(E[θ | y_obs, y_mis] | y_obs, y_mis).

Taking the expectation over p(y_mis | y_obs), we obtain

    E_{y_mis | y_obs}[DIC(y)] = -4 E_{θ, y_mis}[log L(θ | y_obs, y_mis)]
                                + 2 E_{y_mis}[log L(E[θ | y_obs, y_mis] | y_obs, y_mis)].   (6.31)

Similar to using the observed data log likelihood for the DIC, this approach also assesses only the fit of the model to the observed data. However, the criterion has been formed by taking expectations of the log likelihood with respect to the posterior predictive distribution, as opposed to the likelihood with respect to the conditional (on θ) predictive distribution (more later) as in the first approach.

To compute (6.31), we can use the samples from the posterior predictive distribution of y_mis. The first term in (6.31) is computed by averaging the term within the expectation, log L(θ | y_obs, y_mis), over the samples of (θ, y_mis) drawn within the data-augmented MCMC algorithm. For the second term, given the current value of y_mis at each iteration, we need to compute E[θ | y_obs, y_mis] and then average the entire term, log L(E[θ | y_obs, y_mis] | y_obs, y_mis), over the samples y_mis from the MCMC algorithm. A computational problem with this approach is that E[θ | y_obs, y_mis] will typically not be available in closed form. A straightforward, but computationally impractical, approach would be to run a Gibbs sampler for every
value of y_mis sampled and compute the above expectation from these samples. We recommend two alternative approaches that are more computationally feasible.

First, recall that running the MCMC algorithm yields a sample of (y_mis, θ) from p(y_mis, θ | y_obs). We can re-weight this sample to estimate E[θ | y_obs, y_mis] for any value of y_mis, say y_mis = y_mis^(l'). We set

    Ê[θ | y_obs, y_mis^(l')] = Σ_l w_l θ^(l) / Σ_l w_l,   (6.32)

where the sum is over the L iterations of the MCMC run, θ^(l) is the value of θ at iteration l, and the weights w_l are given by

    w_l = { p(θ^(l), y_mis^(l'), y_obs) / p(y_mis^(l') | y_obs) } / p(θ^(l), y_mis^(l), y_obs),   (6.33)

with p(θ^(l), y_mis^(l), y_obs) = p(y^(l) | θ^(l)) p(θ^(l)) and y^(l) = (y_obs, y_mis^(l)). The factor p(y_mis^(l') | y_obs) can be estimated by (1/L) Σ_{j=1}^L p(y_mis^(l') | y_obs, θ^(j)), where the θ^(j) are the samples from p(θ | y_obs) from the original MCMC run. This approach does not require re-running the sampling algorithm, and we would expect these weights to be fairly stable.

A second computational approach is to re-run a single Gibbs sampler for a fixed value of y_mis, say y_mis* = E(y_mis | y_obs), and then use weights of the form

    w_l = p(θ^(l), y_mis^(j), y_obs) / p(θ^(l), y_mis*, y_obs),   (6.34)

where θ^(l) is the sample from the Gibbs sampler run at the fixed value y_mis* and y_mis^(j) is the value of y_mis on which the expectation is conditioned. Parts of this version of the DIC can be computed in WinBUGS, while other parts need to be computed in other software (such as R, from which WinBUGS can be called). We give more details in Chapter 8 and on the book webpage.
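A minimal R sketch of the re-weighting estimator (6.32)-(6.33) follows. Here theta.draws (L x p) and ymis.draws (L x q) hold the MCMC draws of θ and y_mis, and log.joint(theta, ymis) evaluates log p(θ, y_mis, y_obs) = log p(y | θ) + log p(θ); all names are hypothetical. Note that the normalizing factor p(y_mis^(l') | y_obs) cancels in the self-normalized ratio (6.32), so it need not be estimated when only the weighted average is required.

    reweight.mean <- function(theta.draws, ymis.draws, ymis.star, log.joint) {
      L  <- nrow(theta.draws)
      lw <- sapply(seq_len(L), function(l)
        log.joint(theta.draws[l, ], ymis.star) -       # numerator of (6.33), up to a constant
        log.joint(theta.draws[l, ], ymis.draws[l, ]))  # denominator of (6.33)
      w <- exp(lw - max(lw))                           # stabilize before normalizing
      colSums(theta.draws * w) / sum(w)                # self-normalized version of (6.32)
    }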
DIC III

A third construction of the DIC does not integrate out y_mis in any form, but instead treats y_mis as parameters, like θ (see, e.g., Celeux et al., 2007). So now, instead of grouping y_mis with y_obs, we group it with the parameters θ and define an augmented parameter vector θ* = (y_mis, θ). This version of the DIC for missing data is defined as

    DIC = 2 E[Dev(θ*) | y_obs] - Dev(E[θ* | y_obs]),

where Dev(θ*) = -2 log L(θ | y_mis, y_obs). So the effective number of parameters, p_D, includes the missing responses y_mis. Conceptually, this approach may be less appealing in that we are essentially plugging in values of y_mis, as opposed to averaging over their distribution given y_obs. Another concern with this form of the criterion is that the missing responses are counted as part of the complexity penalty, which does not seem like the most sensible way to handle missing responses. This approach needs further exploration to examine whether different models count the effective number of parameters contributed by the missing responses differently. If so, this criterion, which is meant to assess the fit of the model to the observed data, may be affected by how the missing data itself is modelled (with respect to choosing the best model).

In summary, all three DIC constructions need further exploration and comparison to determine their behavior in the setting of ignorable missingness. DIC I and II appear to be preferable in terms of removing the missing data by averaging over its posterior predictive distribution or its predictive distribution conditional on θ, respectively. Using the observed data log likelihood in DIC I is equivalent to what is done in the frequentist model selection literature with the AIC. DIC II, on the other hand, constructs the DIC based on the full-data response likelihood and then averages it over the posterior predictive distribution of the missing data. Given that the full-data response model (and its parameters) is typically of primary interest, this lends some credence to the latter approach. The plug-in approach of DIC III, which treats the parameters and missing data similarly, seems less preferable.
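For contrast with the sketch of DIC I given earlier, here is a minimal sketch of DIC III, in which the missing responses are treated as parameters. The function loglik.comp(theta, ymis), returning the complete-data log likelihood log L(θ | y_obs, y_mis), is a hypothetical user-supplied ingredient, as are the object names.

    dic.III <- function(theta.draws, ymis.draws, loglik.comp) {
      L  <- nrow(theta.draws)
      ll <- sapply(seq_len(L), function(l)
        loglik.comp(theta.draws[l, ], ymis.draws[l, ]))
      dev.bar  <- -2 * mean(ll)   # E[Dev(theta*) | y.obs]
      dev.plug <- -2 * loglik.comp(colMeans(theta.draws), colMeans(ymis.draws))
      c(DIC = 2 * dev.bar - dev.plug,
        pD  = dev.bar - dev.plug)  # the penalty now includes the missing responses
    }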
Posterior predictive loss (PPL)

In Chapter 3, an approach was suggested for computing the PPL for model comparison in longitudinal data by creating a univariate summary of the longitudinal vector of responses, Y_i. In this section, we discuss how to extend this approach to handle incomplete data. The main issue, once again, is the choice of the univariate summary T_i, which now must be a function of only the observed data, T_i = h(r_i^T Y_i) = h(y_i,obs). The weighted average from Chapter 3, T_i = Σ_{j=1}^J w_j Y_ij, might be combined with the missing data indicators R_ij to form the summary T_i = Σ_{j=1}^J w_j Y_ij R_ij. Another possible summary is the last observed response, T_i = Y_{i,j*} with j* = max{j : R_ij = 1}. Using the Gelfand and Ghosh loss function (see Chapter 3) and squared error loss, the criterion takes the form

    Σ_{i=1}^n σ²_{i(T)} + {k/(k+1)} Σ_{i=1}^n (µ_{i(T)} - T_i,obs)²,

where µ_{i(T)} = E(T_i,rep | y_obs) and σ²_{i(T)} = Var(T_i,rep | y_obs), with T_i,rep = h(r_i^T Y_i,rep).

Despite the drawbacks of the DIC (discussed in Chapter 3), it does not require choosing the univariate summary T_i required in this version of the PPL for incomplete longitudinal data. An alternative approach is to use a multivariate loss function, such as the deviance (Gelfand and Ghosh, 1998), which does not require the choice of the scalar function T_i. For such loss functions, the criterion can usually still be approximated with a goodness-of-fit term and a penalty term. An important feature of this criterion as described under MAR above is that it implicitly conditions on the observed vector of missing data indicators, R. The reason for this is that under ignorability, the conditional distribution p(r | y_obs, x) is not specified.
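A minimal R sketch of this criterion with the weighted-average summary T_i = Σ_j w_j Y_ij R_ij is given below. Yrep.list (a list of replicated response matrices drawn from the posterior predictive distribution), Y (the observed responses, NA where missing), and R (the 0/1 response indicators) are hypothetical names; the same observed pattern R is applied to each replicate so that T_i,rep = h(r_i^T Y_i,rep).

    ppl.criterion <- function(Yrep.list, Y, R, w, k = 1e6) {
      # summary on the observed data and, with the same R, on each replicate
      Tobs <- rowSums(sweep(Y * R, 2, w, `*`), na.rm = TRUE)
      Trep <- sapply(Yrep.list, function(Yr) rowSums(sweep(Yr * R, 2, w, `*`)))
      mu <- rowMeans(Trep)        # E(T_i,rep | y.obs)
      s2 <- apply(Trep, 1, var)   # Var(T_i,rep | y.obs)
      sum(s2) + (k / (k + 1)) * sum((mu - Tobs)^2)
    }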
Posterior predictive checks

Assessing model fit using posterior predictive checks can be based on statistics computed from replications of only the observed data or from replications of the completed dataset. A recent paper by Gelman et al. (2005) advocates computing statistics, making plots, and computing p-values based on the completed (full) data. However, the fit of the model is still only assessed against the observed data; the missing-data portion of the completed datasets sampled by data augmentation is conditional on the model. Basing the checks on the completed datasets offers several advantages (Gelman et al., 2005). First, as discussed throughout this chapter, interest is most often in the full-data response model, p(y | θ), so diagnostics on this model (as opposed to the observed data model) will typically be of primary interest. As such, the choices of T(·) discussed in the complete-data setting in Chapter 3 are appropriate here. Second, model building under an assumption of ignorability does not require specification of the missing data mechanism, p(r | y_obs, x, ψ). By using diagnostics based on completed datasets, the replicated datasets of the full data vector y do not need to differentiate the missing and observed components, as specified by the response indicator vector r; thus, there is no need to specify p(r | y_obs, x, ψ). For these reasons, we conduct posterior predictive checks on the completed datasets.

To assess the fit of certain features of the model to the observed data, appropriate test statistics need to be constructed. These statistics are computed using the observed data, supplemented with a realization from the posterior predictive distribution of the missing data sampled in the data augmentation step (at, say, iteration l), (y_obs, y_mis^l), together with a full set of replicated data, y_rep^l. This approach has been implemented recently in several papers, including Ilk and Daniels (2007). So, for some statistic of interest, T(·), at each iteration we compute T(y_obs, y_mis^l) and T(y_rep^l) and compare the two. The corresponding posterior predictive p-value, p, is defined as

    p = ∫∫∫ I{T(y_obs, y_mis) > T(y_rep)} p(y_mis | y_obs, θ) p(y_rep | θ, y_obs) p(θ | y_obs) dθ dy_rep dy_mis,   (6.35)

where p(y_rep | θ, y_obs) = p(y_rep | θ). See Data Examples 7.1 and 7.4 in the next chapter for two implementations of this approach.

For non-ignorable missing data mechanisms, p(r | y, ω) is specified. We will discuss posterior predictive checks for these cases in Chapter 8.
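A Monte Carlo approximation to (6.35) is immediate from the data-augmented MCMC output: at each iteration l, compare the statistic on the completed dataset (y_obs, y_mis^l) with the statistic on the replicated dataset y_rep^l. In the sketch below, Ycomp.list and Yrep.list (hypothetical names) hold these matrices, and Tstat is the chosen statistic, for example a correlation between two measurement times.

    ppp.value <- function(Tstat, Ycomp.list, Yrep.list) {
      mean(mapply(function(Yc, Yr) Tstat(Yc) > Tstat(Yr), Ycomp.list, Yrep.list))
    }

    # example statistic: correlation between the first two measurement times
    T.lag1 <- function(Y) cor(Y[, 1], Y[, 2])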
6.7 Further reading

Model specification: dependence

Another line of research on parsimoniously modelling covariance matrices lets the data itself find a parsimonious structure. Wong et al. (2003) model dependence parsimoniously (within models of Example 2.3) by doing Bayesian model averaging using priors that can zero out partial correlations. Liu and Daniels (2007) use a similar approach in the context of the model in Example ??. Liechty et al. (2004) develop Bayesian models to parsimoniously estimate a correlation matrix through the construction of appropriate priors. Diggle and Verbyla (1998) discuss nonparametric estimation of a stationary covariance function, C(h), but their approach does not guarantee that the resulting covariance function is positive definite for all h ≥ 0. Bayesian approaches in the spatial setting that could be adapted to the longitudinal setting can be found in Ecker and Gelfand (1997).

Model specification: distribution

Kleinman and Ibrahim (1998a) and Kleinman and Ibrahim (1998b) propose Bayesian nonparametric approaches to model the distribution of the random effects for normal random effects models (Example 2.1) and for generalized linear mixed models (Example 2.2) using Dirichlet process priors.

Joint Models

For the covariance structure for multivariate continuous longitudinal data with misaligned times, additional structures can be found in Sy et al. (1997), Ferrer and McArdle (2003), Sammel et al. (1999), and Henderson et al. (2000).

Other directly specified models for binary data that do not use an underlying multivariate normal or t latent structure as in Example 6.14, but provide marginal logistic regressions for each component of Y, Y_ijk, have been proposed. Fitzmaurice and Laird (1993) propose a very general model for multivariate (longitudinal) data based on the canonical log-linear parameterization. However, given the parameterization of this model, it is difficult to parsimoniously exploit the longitudinal correlation using serial correlation structures (cf. Example 6.11). Ilk and Daniels (2005) constructed a model specifically for multivariate longitudinal data, building on earlier work by Heagerty (1999, 2002), mixing both transition (Markov) and random effects structures; computations are complex, though an efficient MCMC algorithm is proposed and implemented.

Indirectly specified conditional models (via random effects or latent classes) are another approach for modelling (multivariate) longitudinal binary data. Some relevant work for multivariate binary data includes Bandeen-Roche et al. (1997), Ribaudo and Thompson (2002), Dunson (2003), and Miglioretti (2003).

Many other approaches for general mixed longitudinal data have been proposed in the literature. Models for a set of mixed longitudinal outcomes (more than just one continuous and one binary longitudinal process) have been developed without using an underlying normal latent structure, at the cost of more complex computations. Some of these models use latent variables or latent classes to connect the processes; see, e.g., Sammel et al. (1997), Dunson (2003), and Miglioretti (2003). Lambert et al. (2002), instead of using latent variables, developed models for mixed outcomes using copulas (Nelsen, 1999). Other authors have used the general location model (GLOM) for mixtures of categorical and continuous outcomes (Fitzmaurice and Laird, 1997; Liu
and Rubin, 1998), but not specifically in the setting of multivariate longitudinal data.

We did not discuss joint modelling of longitudinal and time-to-event data in the main text. However, there is an extensive literature on this topic, both in the context of surrogate markers and for missing data where dropout is modelled as a time-to-event process. See DeGruttola and Tu (1994), Faucett and Thomas (1996), Wulfsohn and Tsiatis (1997), Wang and Taylor (2001), and Xu and Zeger (2001), among others. For an early review of these methods, see Hogan and Laird (1997) and, more recently, Tsiatis and Davidian (2004).
CHAPTER 7

Data examples: Ignorability

In this chapter, we re-analyze several examples from Chapter 4, now using all the data, not just the completers, under an ignorability assumption for the missing data. We use the same full-data models and priors as in Chapter 4 and refer the reader to Chapter 4 for the objectives of each analysis. We also introduce another analysis, using the CTQ II smoking cessation data (Data Example 1.3), to illustrate a joint model.

7.1 Re-analysis of GH study under MAR (cont. of Example 4.1)

We redid the analysis previously done in Example 4.1, now using all of the data under an assumption of MAR.

Results and Comparison with completers only analysis

Posterior means and 95% credible intervals for the mean parameters are given in Tables 7.1 and 7.2. Comparing these results to the complete-case results in Tables 4.3 and 4.4, we notice narrower credible intervals under ignorability; this is expected, as we are now using all the data. By comparing the posterior means of the means in treatment 1, we can see the potentially strong bias from the completers-only analysis (e.g., the posterior mean of the month 12 mean was 88 (79, 98) in the completers-only analysis vs. 81 (72, 90) in the MAR analysis). This is related to the fact that, at baseline, there were large differences between the mean of the completers and the mean of all the subjects; compare Tables 4.3 and 7.1.

Posterior probabilities that each of the pairwise differences is greater than zero are given in Table 7.3. The posterior probabilities were fairly similar in the common Σ model versus the different Σ model. However, the posterior probability for treatment 1 versus treatment 3 changed substantially, from .82 to .92. In addition, the posterior probabilities under ignorability showed differences from the completers-only analysis (cf. Table 4.5). For example, the posterior probabilities for the difference
between treatments 1 and 3 changed from .99 and .97 in the completers-only analysis to .92 and .82 in the MAR analysis.

Table 7.1 Posterior means and 95% credible intervals of the mean parameters for each treatment, assuming a common Σ across treatments.

    Month   treatment 1       treatment 2            treatment 3            treatment 4
    0       (61.59, 77.31)    68.46 (60.91, 76.05)   65.85 (58.21, 73.54)   65.21 (57.63, 72.79)
    6       (73.08, 91.25)    66.11 (57.79, 74.44)   81.31 (72.79, 89.79)   62.17 (53.76, 70.62)
    12      (72.56, 89.98)    65.13 (57.30, 73.04)   72.73 (64.74, 80.76)   62.68 (54.65, 70.77)

Table 7.2 Posterior means and 95% credible intervals of the mean parameters for each treatment, assuming a different Σ for each treatment.

    Month   treatment 1       treatment 2            treatment 3            treatment 4
    0       (61.03, 77.71)    68.44 (61.31, 75.59)   65.89 (57.51, 74.28)   65.22 (57.59, 72.84)
    6       (69.57, 93.78)    66.10 (58.27, 73.87)   81.35 (72.57, 90.08)   62.15 (54.75, 69.55)
    12      (66.34, 91.60)    64.95 (57.73, 72.28)   72.77 (65.20, 80.32)   62.88 (55.61, 70.15)

Model Fit

We used DIC I to compare the fit of the common and different Σ models. The results, which appear in Table 7.4, supported the common Σ model. To assess the absolute fit of the common Σ model, we used posterior predictive checks based on the statistics described in Example ??. The p-values for the three correlations for treatment 3 ranged from .94 to .97. In addition, the correlations between baseline and month 6 and between baseline and month 12 for treatment 2 were overestimated, with p-values of .94
and .92. The magnitudes of these p-values suggest that the common Σ model does not adequately model the correlation structure.

Table 7.3 Posterior probabilities that each of the pairwise differences at month 12 is greater than zero (columns: µ_3^(1) - µ_3^(2), µ_3^(1) - µ_3^(3), µ_3^(1) - µ_3^(4), µ_3^(2) - µ_3^(3), µ_3^(2) - µ_3^(4), µ_3^(3) - µ_3^(4); rows: common Σ, Σ by treatment, independence).

Table 7.4 DIC for the common Σ model and the model with Σ differing for each of the four treatments. The values in parentheses are the values obtained when setting θ = (β, Σ).

    model          DIC           Dbar          p_D
    common Σ       2783 (2668)   2761 (2650)   22 (18)
    different Σ    2787 (2671)   2745 (2643)   41 (28)

Comparison with diagonal covariance matrices

To further illustrate the importance of modelling dependence under MAR, we fit several independence models. We present the results of the model that sets Σ(Tx_i) = σ²(Tx_i) I_3. The DIC for this model was 2956 and the effective number of parameters was 16. Based on the DIC, this model fit considerably worse than the dependence models (cf. Table 7.4).

Table 7.5 gives posterior means and 95% credible intervals for the means at each time for each treatment. The largest difference from the dependence models (cf. Tables 7.1 and 7.2) can be seen in the month 12 mean for treatment 1. Under the independence model, this mean is 7-9 units higher than under the dependence models. This large difference is due to the independence model not using the observed values at month 0 and month 6 to fill in the month 12 values for the dropouts for inference. This difference also led to larger (but incorrect) posterior probabilities for the differences between the month 12 means for treatment 1 and the other three treatments.

Table 7.5 Posterior means and 95% credible intervals of the mean parameters for each treatment under the independence model.

    Month   treatment 1        treatment 2            treatment 3            treatment 4
    0       (59.96, 78.75)     68.46 (61.44, 75.43)   65.89 (58.20, 73.61)   65.20 (58.40, 71.98)
    6       (75.48, 98.19)     66.22 (58.62, 73.74)   81.58 (73.08, 90.10)   62.04 (54.47, 69.58)
    12      (76.03, 100.60)    63.32 (55.14, 71.50)   72.45 (63.67, 81.25)   62.96 (54.75, 71.18)

Conclusions

The more parsimonious model, which assumed a common Σ across treatments, was preferred in this analysis (as measured by the DIC and PPL). However, posterior predictive checks on the common Σ model suggested
some problems. We only compared models that assumed a common Σ versus a distinct Σ for each of the four treatments. A covariance model somewhere between these extremes may provide a better fit. The covariance models in Section 6.? are well suited to address this, but are inefficient to fit in WinBUGS at this time.

Table 7.6 presents the treatment-specific GARP and IV parameters. The innovation variances for months 6 and 12 were almost twice as large in treatment 1 as in the other treatments. In addition, several of the GARPs in treatment 4 were considerably smaller than their counterparts in the other treatments. We refer the interested reader to Pourahmadi, Daniels, and Park (2007) and Daniels (2006) for detailed exploration of such models for this data.

Table 7.6 Posterior means of the covariance matrices by treatment; entries on the main diagonal are the innovation variances (IV) and entries below the main diagonal are the generalized autoregressive parameters (GARP).
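For readers less familiar with this parameterization, the GARPs and innovation variances in Table 7.6 determine Σ through the modified Cholesky decomposition: writing Y_t = Σ_{j<t} φ_tj Y_j + e_t with Var(e_t) equal to the innovation variances, we have Σ = (I - Φ)^{-1} D (I - Φ)^{-T}. A small R sketch follows; the numerical values are purely illustrative and are not the posterior means from Table 7.6.

    Sigma.from.garp <- function(Phi, d) {
      # Phi: strictly lower-triangular matrix of GARPs; d: innovation variances
      Tinv <- solve(diag(length(d)) - Phi)
      Tinv %*% diag(d) %*% t(Tinv)
    }

    Phi <- matrix(0, 3, 3)
    Phi[2, 1]   <- 0.8           # GARP: month 6 on baseline (illustrative)
    Phi[3, 1:2] <- c(0.1, 0.7)   # GARPs: month 12 on baseline and month 6 (illustrative)
    d <- c(90, 60, 55)           # innovation variances (illustrative)
    Sigma.from.garp(Phi, d)      # the implied covariance matrix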
7.2 Analysis of schizophrenia clinical trial under MAR (cont. of Example 4.2)

Results and Comparison with completers only analysis

Figure 7.1 plots the posterior means of the trajectories for each treatment over the 6 weeks of the trial. The BPRS for the medium and high dose individuals started to go back up by week 6, unlike in the completers-only analysis. This is related to the fact that those dropping out were more likely to have been doing poorly on their respective treatments.

Figure 7.1 Posterior means of the trajectories for each of the four treatments under the random effects model.

Tables 7.7 and 7.8 are of primary interest for inference, as they contain summary information from the posterior on the change from baseline to 6 weeks; in particular, they contain the change from baseline for all four treatments and the differences in the change from baseline among the treatments, respectively. All the changes from baseline were significantly different from zero (credible intervals excluded zero) except for the low dose treatment; this change was significant in the completers-only analysis. The low dose arm also exhibited the smallest improvement, with a posterior mean of -2.9 and a 95% credible interval of (-15.2, 10.1). In addition, note that the change from baseline using only the completers was larger (in absolute value) than in the corresponding analysis here under MAR. Table 7.8 indicates that none of the changes from baseline were significantly different between the four treatments, as the credible intervals for all the differences covered zero. In addition, the posterior means were very different from the completers-only analysis, and the credible intervals for the differences were wider.

Table 7.7 Posterior means of change from baseline to week 6 for all four treatments under the random effects model (R), unstructured model (U), and independence model (I).

             Low                     Medium             High               Standard
    MAR (R)  -5.52 (-12.20, 1.26)    (-19.49, -8.20)    (-17.84, -6.02)    (-18.65, -7.45)
    MAR (U)  -3.98 (-9.23, 1.47)     (-18.54, -9.23)    (-16.06, -6.38)    (-16.84, -7.72)
    MAR (I)  (-17.68, -6.34)         (-20.22, -10.40)   (-23.60, -13.31)   (-21.17, -11.26)

Alternative covariance structures

For comparison with the covariance structure induced by the random effects model, we also fit a directly specified normal model with an unstructured covariance matrix differing by treatment, Σ(Tx); an unstructured matrix constant across treatments, Σ; and an independence covariance matrix, σ²I. Table 7.9 gives the DIC values for the random effects model and the other three models. The unstructured covariance models fit much better than the random effects model, and the unstructured covariance model that differed across treatments fit better than the one that was constant. All the correlated models fit much better than the independence model. There were small differences between the treatment-specific changes from baseline under the random effects and unstructured models (see Table 7.7). However, the independence model showed large differences from the dependence models, with considerably more extreme (more negative) changes. For the pairwise differences among the treatments (Table 7.8), the independence model again showed large differences, both in posterior means and in credible intervals. In terms of significant differences, the unstructured model gave significant differences (credible intervals
excluding zero) for the comparisons between treatment 1 and the other three treatments, while the random effects model did not.

Table 7.8 Pairwise differences of the changes among the four treatments under the random effects model (R), unstructured model (U), and independence model (I). Treatments are ordered low, medium, high, standard.

    Comparison          MAR (R)                MAR (U)               MAR (I)
    Low vs Medium       8.35 (-0.24, 17.00)    9.92 (3.07, 16.88)    3.30 (-4.20, 10.79)
    Low vs High         6.49 (-2.25, 15.20)    7.27 (0.34, 14.27)    6.46 (-1.22, 14.14)
    Low vs Standard     7.60 (-1.01, 16.29)    8.37 (1.57, 15.25)    4.22 (-3.33, 11.79)
    Medium vs High      -1.86 (-10.04, 6.18)   -2.65 (-9.24, 3.94)   3.17 (-3.97, 10.27)
    Medium vs Standard  -0.74 (-8.68, 7.18)    -1.55 (-8.06, 4.84)   0.92 (-6.03, 7.89)
    High vs Standard    1.12 (-6.90, 9.…)      1.10 (-5.42, 7.…)     0.84 (-7.01, 8.…)

Table 7.9 DIC for the schizophrenia data (columns: Dbar, p_D, DIC) for the random effects model, the unstructured Σ by treatment model, the common unstructured Σ model, and the independence model.

Conclusion

Under the random effects model, there were no significant differences in the change from baseline between the treatments, similar to the completers-only analysis; under the better-fitting unstructured model, however, the comparisons between the low dose arm and the other three treatments were significant. The point estimates and corresponding estimates of uncertainty were very different between the MAR and completers-only analyses, as expected.

7.3 Analysis of CTQ I using all the data under MAR (cont. of Example 4.3)

Models

We focus on the MTMs for re-analysis under MAR. Serial correlation is modelled only for the weeks following the quit date. We now consider two different formulations for the marginal mean. The first (Model 1) assumes a separate unstructured mean for each treatment (as in Example 4.3),

    logit(µ*_ij) = (1 - Z_j)α*_j + Z_j(β*_0j + β*_1j X_i),
and the second (Model 2) assumes an unstructured mean with a constant treatment effect,

    logit(µ*_ij) = (1 - Z_j)α*_j + Z_j(β*_0j + β*_1 X_i),

where the superscript * denotes parameters from the marginal rather than the conditional distribution. For j = 6, ..., 12, serial correlation follows

    logit(φ_ij) = ∆_ij + γ_i Y_i,j-1;

correlation is allowed to differ by treatment group by setting γ_i = φ_0 + φ_1 X_i. Priors are specified as in Data Example 4.3.
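The ∆_ij above are not free parameters; in a marginalized transition model they are determined by the marginal means through the constraint µ*_ij = E{P(Y_ij = 1 | Y_i,j-1)}. A minimal R sketch of the resulting one-dimensional root-finding step, under the assumption that the marginal means and γ are given (all names hypothetical):

    delta.mtm <- function(mu.j, mu.jm1, gamma) {
      # solve mu.j = expit(d + gamma) * mu.jm1 + expit(d) * (1 - mu.jm1) for d
      f <- function(d) plogis(d + gamma) * mu.jm1 + plogis(d) * (1 - mu.jm1) - mu.j
      uniroot(f, c(-30, 30))$root
    }

    delta.mtm(mu.j = 0.45, mu.jm1 = 0.50, gamma = 4.6)  # illustrative values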
Results and Comparison with completers only analysis

We first compare the models fit in Chapter 4 to the completers only with the models fit here using all the data under MAR. Table 7.10 recalls the completers-only fit from Chapter 4, and Table 7.11 contains the results under MAR. The posterior means and credible intervals for the dependence parameters are similar to the completers-only analysis (cf. Table 4.9). However, the posterior means of the marginal mean parameters show some differences, with the treatment effect on the quit rate (β*_1) now smaller, but still not significant.

Table 7.10 Parameters from fitting the MTM(1) model in Chapter 4 to the longitudinal smoking cessation data from CTQ I.

    Parameter   Posterior mean   95% credible interval
    β*_0                         (-1.2, -.51)
    β*_1        .34              (-.09, .77)
    φ_0                          (4.2, 5.2)
    φ_1                          (-0.10, 1.4)

Table 7.11 Comparison of dependence parameters from Models 1 and 2 fit to the longitudinal smoking cessation data from CTQ I.

    Model     Parameter   Posterior mean   95% credible interval
    Model 1   φ_0                          (4.0, 5.1)
              φ_1                          (0.04, 1.8)
    Model 2   φ_0                          (4.1, 5.2)
              φ_1                          (-0.04, 1.8)

Alternative dependence structures

As a comparison, we also fit models with the same marginal mean structure but under independence, φ_0 = φ_1 = 0. Table 7.12 contains summaries of the posterior distributions of the marginal mean parameters. The treatment effect is now larger than when assuming dependence, and the credible interval now excludes zero, (.27, .68). Clearly, ignoring the
dependence here results in an incorrect determination of a statistically significant treatment effect. Plots of the marginal means at each time under Models 1 and 2 for the MTM(1) and independence models appear in Figures ?? and ??.

Table 7.12 Parameters from fitting the marginal mean model under independence, MTM(0), to the longitudinal smoking cessation data from CTQ I.

    Parameter   Posterior mean   95% credible interval
    β*_0                         (-.95, -.66)
    β*_1        .47              (.27, .68)

Conclusions

Under MAR, the treatment effect on the quit rate was smaller than in the completers-only analysis and remained nonsignificant. Ignoring the serial dependence, by contrast, produced a spuriously significant treatment effect, underscoring the importance of modelling the dependence when drawing inference under MAR.

7.4 Analysis of weekly smoking and weight change from CTQ II using all the data under MAR

Recall from Chapter 1 that the objectives of this analysis are to assess the effect of the exercise intervention on smoking cessation and to characterize the association between the cessation and weight change processes.

Model

We use the joint model for binary and continuous longitudinal data introduced in Example ??. Here S_i = (S_i1, ..., S_iT) is the longitudinal vector of binary quit status and Y_i = (Y_i1, ..., Y_iT) is the longitudinal vector of weekly weight gain, with T = 6. X_y is a quadratic polynomial in time with an interaction with treatment, and X_z is composed of a separate mean for each time and an interaction with treatment. We fit models allowing dependence between the two processes, B_yz ≠ 0 (or, equivalently, Σ_yz ≠ 0); independence between the processes, B_yz = 0; and independence within and between processes (diagonal Σ).

Priors

The vector of regression coefficients, β, was given an improper uniform prior. Parameterizing Σ as (Σ_zz, B_yz, Σ*_yy), we placed a (proper) uniform prior on the correlation matrix Σ_zz, an (improper) uniform prior on the (autoregressive) coefficient matrix B_yz, and a flat prior on Σ*_yy. See Liu and Daniels (2007) for verification that this specification gives a proper posterior.
Posterior sampling

The key to posterior sampling in this model is moving between two forms of the likelihood when deriving the full conditional distributions. The standard form of the multivariate likelihood is given in Example ??. Given this form and the priors specified in the previous section, the full conditional distribution for β can easily be derived as a multivariate normal distribution. To derive the full conditional distributions of B_yz and Σ*_yy, which will be multivariate normal and inverse Wishart, respectively, the multivariate normal likelihood can be factored into p(y | z) p(z). For full details on the forms and derivations of the full conditional distributions for these models, see Liu and Daniels (2007). This model could be fit in WinBUGS; however, the sampling algorithm would be very inefficient, as WinBUGS would not be able to move between the two forms of the likelihood in constructing the full conditionals as given above.
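To make the two forms of the likelihood concrete, the factored form uses the conditional moments of Y given the latent Z, and the data augmentation step for the binary process draws each latent Z_ij from a (truncated) normal full conditional with the sign constraint determined by S_ij. A minimal R sketch, with argument names hypothetical:

    # conditional moments of Y | Z under the joint multivariate normal model
    cond.y.given.z <- function(mu.y, mu.z, Syz, Szz, Syy, z) {
      B <- Syz %*% solve(Szz)                  # B_yz = Sigma_yz Sigma_zz^{-1}
      list(mean = as.vector(mu.y + B %*% (z - mu.z)),
           var  = Syy - B %*% t(Syz))          # Sigma*_yy
    }

    # draw from N(mean, sd^2) truncated to (lower, upper) by inversion;
    # S_ij = 1 constrains Z_ij > 0, and S_ij = 0 constrains Z_ij <= 0
    rtnorm1 <- function(mean, sd, lower, upper) {
      u <- runif(1, pnorm(lower, mean, sd), pnorm(upper, mean, sd))
      qnorm(u, mean, sd)
    }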
Model Fit

To compute DIC I, we use Monte Carlo integration and re-weighting, as discussed in Chapter 3 for the multivariate probit model. Table 7.13 gives the DIC values for the three models fit. Clearly, the joint model, which allows dependence between the smoking and weight gain processes, provides the best fit.

Table 7.13 DIC for the CTQ II analysis.

    model    DIC
    Joint    2746
    B = 0
    Indep    3776

We also conducted posterior predictive checks using the techniques outlined in Section 6.6. In particular, at each iteration of the sampling algorithm, we computed test statistics (for each treatment and week) as the differences in quit rates and weight gains between the replicated full data and the completed dataset (the observed data together with the missing data sampled during the data augmentation step). These statistics were chosen to examine consistent over- or under-estimation of the mean of each process at each measurement time. The posterior predictive p-values were defined as the probabilities that these differences were greater than zero. All the p-values were between .2 and .6, providing no evidence of lack of fit.

Inference on questions of interest

Table 7.15 shows the posterior means of the quit rates under the full joint model, the model that assumes B = 0 (independence between weight change and smoking), and complete independence, Σ = I, together with the observed quit rates.

Table 7.15 Posterior means of weekly quit rates for the No Exercise (NE) and Exercise (E) treatment arms, with 95% credible intervals, for each of the three models considered, and the observed quit rates.

    Arm   Time   Joint        B = 0            Independence        Observed
    NE    1      (.34, .45)   .39 (.34, .45)   0.38 (0.31, 0.46)
    NE    2      (.44, .55)   .50 (.44, .55)   0.50 (0.43, 0.56)
    NE    3      (.49, .66)   .55 (.49, .60)   0.56 (0.48, 0.64)
    NE    4      (.47, .59)   .53 (.47, .59)   0.56 (0.48, 0.64)
    NE    5      (.47, .59)   .53 (.47, .59)   0.53 (0.45, 0.62)
    NE    6      (.40, .52)   .46 (.40, .52)   0.46 (0.40, 0.52)
    E     1      (.39, .51)   .45 (.39, .51)   0.47 (0.40, 0.54)
    E     2      (.32, .43)   .37 (.32, .43)   0.38 (0.30, 0.47)
    E     3      (.38, .49)   .44 (.38, .49)   0.45 (0.39, 0.51)
    E     4      (.37, .48)   .42 (.37, .48)   0.47 (0.40, 0.55)
    E     5      (.40, .52)   .46 (.40, .52)   0.50 (0.43, 0.57)
    E     6      (.32, .43)   .38 (.32, .43)   0.43 (0.36, 0.51)   .42

The reason for the small difference between the model that assumes B = 0 and the full joint model is that the dependence here is
much stronger within each process than between the processes; for example, the correlations within the quit process over time are all greater than .5, while the correlations between the quit and weight processes are all less than .2. See Liu and Daniels (2007) for a more detailed exploration of the dependence structure and a discussion of the inference on the relationship between smoking and weight in the two treatment arms.

Table 7.16 contains the treatment difference in quit rates at the final week, with 95% credible intervals. Clearly, the independence model underestimates the treatment difference (as does the inference based on the (unmodelled) observed data). This is due to the fact that, under ignorability, the values filled in for the missing responses using data augmentation were more often smokers than quitters.

Table 7.16 Posterior means of the difference in quit rates, with 95% credible intervals, at week 12 for each of the three models considered.

    Model      Posterior mean   95% credible interval
    joint      .08              (.08, .08)
    B = 0      .08              (.08, .08)
    indep      .03              (.02, .04)
    observed   .04

Conclusions

Surprisingly, the quit rate was significantly lower on the exercise treatment at the end of the trial. The inclusion of the auxiliary longitudinal weight gain process had little effect on the conclusions in this example.
CHAPTER 8

Models under nonignorable dropout

8.1 Introduction

We will discuss three classes of models, defined in terms of how they specify the joint distribution p(r, y | ω). Before we do so, we factor this joint distribution into two pieces, the second of which can be identified from the observed data without any modelling assumptions but the first of which cannot:

    p(r, y | ω) = p(y_mis | y_obs, r, ω_1) p(y_obs, r | ω_2).   (8.1)

The first piece might be called the extrapolation distribution; it is not identified without modelling assumptions. Many of the approaches discussed in this chapter will implicitly identify this extrapolation distribution, either directly or indirectly, and lead to non-distinct parameters ω_1 and ω_2. If a model lends itself to the factorization given above with distinct ω_1 and ω_2, this provides a convenient platform for sensitivity analysis, the topic of Chapter 9. The parameters ω_1 are then not identified by the observed data without modelling assumptions. We provide a formal definition next.

Definition 8.1. Nonparametrically non-identifiable (NNI) parameters. A parameter is nonparametrically non-identifiable if it is a non-constant function of ω_1, the parameters of the extrapolation distribution.

We will discuss such NNI parameters and give examples in specific model settings throughout this chapter.

8.2 Selection models

Background and history

A natural way to deal with incomplete data is to first specify a full-data response model and then append to it a model for the selection
mechanism, i.e., a model for how observations are selected into (or out of) the sample. This corresponds to factoring the full-data model as

    p(r, y | ω) = p(r | y, ψ(ω)) p(y | θ(ω)),   (8.2)

and then specifying each component separately. This is the same factorization used to classify missing data mechanisms, as discussed in Chapter 5, and it corresponds most closely to the complete-data setting, where the full-data response model is specified directly and then, for incomplete data, the missing data (selection) mechanism is appended to form the full-data model. The direct specification of the full-data response model facilitates interpretation of covariate effects. In addition, the missing data mechanism relates directly to the assumptions of MCAR, MAR, and MNAR. Other approaches to specifying the full-data model do not share these convenient features (see Sections 8.3 and 8.4).

The typical strategy for modelling the two components, until recently, was to specify a parametric model for each and to assume that ψ and θ are separable. In the following, we assume ψ and θ are separable and a priori independent unless stated otherwise. Note that this partition does not correspond to the (ω_1, ω_2) partition described in Section 8.1.

Two of the more influential papers on parametric selection models are Heckman (1979) and Diggle and Kenward (1994). In both papers, the parameters of the full-data response model and the missing data mechanism were distinct. A third influential paper (Wu and Carroll, 1988) used a selection model approach in the context of (shared) random effects appearing in both the full-data response model and the missing data mechanism. Unconditionally on the random effects, the resulting full-data model has non-distinct parameters. We defer discussion of this model until Section 8.4 and refer to it as a shared parameter model.

Heckman (1979) specified the joint distribution of a bivariate y and r using a multivariate normal distribution (implicitly, a probit model for the binary indicators of missingness that is linear in y). Diggle and Kenward (1994) extended this work specifically to the case of dropout in longitudinal studies and specified a logistic model for the missing data mechanism. This paper was the first in the statistics literature to propose a full likelihood-based approach to handle nonignorable dropout. Diggle and Kenward's article was followed by many subsequent articles that adapted and extended their original model in a likelihood framework (see, e.g., Fitzmaurice, Molenberghs, and Lipsitz, 1995; Baker, 1995; Molenberghs, Kenward, and Lesaffre, 1997; Liu, Waternaux, and Petkova, 1999; Albert, 2000; Kurland and Heagerty, 2004). This work also led to several important papers on semiparametric selection models, including Scharfstein et al. (1999).
We will start our exploration of selection models by introducing Heckman's model for a continuous bivariate response.

Example 8.1. Selection models for a bivariate normal response

Consider the growth hormone clinical trial introduced in Data Example 1.2. Let y be the bivariate outcome, with y_1 corresponding to the measure at baseline and y_2 the (potentially missing) outcome at month 12. Let R be an indicator of whether the month 12 outcome y_2 is observed (R = 0 corresponds to y_2 being missing). In the following, without loss of generality, we focus on one of the treatments and assume an unstructured mean over time with no additional covariates. Heckman proposed to jointly model the outcome and missingness using a trivariate normal distribution as follows,

    ( Y_1 )       ( ( µ_1 )   ( σ_11  σ_12  σ_13 ) )
    ( Y_2 )  ~  N ( ( µ_2 ) , ( σ_21  σ_22  σ_23 ) ),   (8.3)
    ( V   )       ( ( µ_v )   ( σ_31  σ_32  σ_33 ) )

where R = I{V > 0}. This model implies that the selection probability follows a probit regression that is linear in Y_1 and Y_2,

    P(R = 1 | y) = P(V > 0 | y) = Φ(ψ_0 + ψ_1 y_1 + ψ_2 y_2).   (8.4)

Referring to (8.3), ψ_1 = σ_31 σ^{11} + σ_32 σ^{21}, ψ_2 = σ_31 σ^{12} + σ_32 σ^{22}, and ψ_0 = µ_v - ψ_1 µ_1 - ψ_2 µ_2, where σ^{jk} is the jk-th element of Σ_2^{-1} and Σ_2 = Var(Y) is the covariance matrix of (Y_1, Y_2). Note that the regression parameters in the missing data mechanism are functions of the mean and covariance parameters of the full-data response model, p(y | θ), which is bivariate normal with mean µ = (µ_1, µ_2)^T and covariance matrix Σ_2. Also, the missing data mechanism is linear in y; if the mdm were thought to be nonlinear in y, the trivariate normal model in (8.3) would not be valid.

For identifiability, the model in (8.3) is typically parameterized such that the conditional variance

    Var(V | y_1, y_2) = σ*_33 = σ_33 - σ_3^T Σ_2^{-1} σ_3 = 1,

where σ_3 = (σ_31, σ_32)^T. This is done to facilitate interpretation of the missing data mechanism coefficients ψ in (8.4).

Individual contributions to the observed data likelihood take the form

    [ ∫_{-∞}^{0} p(y_1, v | θ, ψ) dv ]^{1-r} [ ∫_{0}^{∞} p(y_1, y_2, v | θ, ψ) dv ]^{r},   (8.5)

where the first integrand is the bivariate normal distribution of (Y_1, V) derived from (8.3) and the second integrand is the trivariate normal distribution given in (8.3).
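Because the selection link is probit and p(y_2 | y_1) is normal, both integrals in (8.5) are available in closed form; in particular, the dropout contribution uses the identity E{Φ(a + bW)} = Φ((a + bm)/√(1 + b²s²)) for W ~ N(m, s²). A minimal R sketch of the observed-data log likelihood, under the constraint Var(V | y_1, y_2) = 1 as above (function and argument names hypothetical):

    heckman.obs.loglik <- function(mu, S, psi, y1, y2, r) {
      # mu = (mu1, mu2); S = Sigma_2; psi = (psi0, psi1, psi2); y2 is NA when r = 0
      m  <- mu[2] + S[1, 2] / S[1, 1] * (y1 - mu[1])   # E(Y2 | y1)
      s2 <- S[2, 2] - S[1, 2]^2 / S[1, 1]              # Var(Y2 | y1)
      l1 <- dnorm(y1, mu[1], sqrt(S[1, 1]), log = TRUE)
      # completers: p(y1, y2) * Phi(psi0 + psi1 y1 + psi2 y2)
      ll.r1 <- l1 + dnorm(y2, m, sqrt(s2), log = TRUE) +
        pnorm(psi[1] + psi[2] * y1 + psi[3] * y2, log.p = TRUE)
      # dropouts: p(y1) * P(V <= 0 | y1), with y2 integrated out analytically
      ll.r0 <- l1 + pnorm(-(psi[1] + psi[2] * y1 + psi[3] * m) /
                            sqrt(1 + psi[3]^2 * s2), log.p = TRUE)
      sum(ifelse(r == 1, ll.r1, ll.r0))
    }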
The model of Diggle and Kenward (for a bivariate response) is similar to the Heckman model, but with the probit model for the missing data mechanism in (8.4) replaced by a logistic model,

    P(R = 1 | y) = logit^{-1}(ψ'_0 + ψ'_1 y_1 + ψ'_2 y_2).   (8.6)

This formulation obviously no longer corresponds to a joint trivariate normal distribution for the full data. The form of the observed data likelihood and more details on the Diggle and Kenward model are given for the general case of J responses in Example 8.2.

The Heckman selection model presented above can be extended in several ways. First, there are obvious extensions to settings with J > 2 responses, obtained by increasing the dimension of the model. Second, binary longitudinal responses can be accommodated by making the multivariate normal in (8.3) a multivariate probit model. Third, the constant means µ can be modified to include covariates.

All the parameters in these parametric selection models are identified, including the missing data mechanism coefficient ψ_2. This may seem a bit surprising, since y_2 is not observed when R = 0. In fact, ψ_2 is identified only through the specification of a parametric model for the full-data response. We call such parameters nonparametrically non-identifiable (NNI), as defined in Section 8.1. It can be shown that ψ_2 is a function of the parameters of the extrapolation distribution, ω_1. We will go into more detail on NNI parameters in the mdm in selection models later and provide further details on how these NNI parameters are explicitly identified.

8.2.2 Parametric selection models for longitudinal data

As discussed briefly in the previous section, specification of a selection model can be thought of as choosing two separate models: (1) a model for the missing data mechanism (mdm), p(r | y, ψ), and (2) a model for the full-data response, p(y | θ). The general issues in choosing the latter were discussed in detail in Chapter 6 and still apply here. Unlike in Chapter 6, however, where dropout was assumed ignorable, a model for the missing data mechanism, p(r | y, ψ), must now be specified.

As discussed in Chapter 5, it is convenient to specify the mdm for longitudinal data with dropout using the hazard of dropout. In the following, we
suppress covariates without loss of generality. Define F_it to be the history of responses up to, but not including, time t: F_it = {y_i1, ..., y_i,t-1}. For the case of J fixed measurement times, the hazard is defined as

    h(t | F_{i,J+1}, ψ) = P(R_it = 1 | R_i1 = · · · = R_{i,t-1} = 0, F_{i,J+1}, ψ).   (8.7)

A transformation (logit or probit) of h(t | F_{i,J+1}, ψ) is often specified as additive in the components of F_{i,J+1}. That is, h(t | F_{i,J+1}, ψ) = g^{-1}(ψ_t^T y_i), where g is the (probit or logit) link function and ψ_t = (ψ_{t1}, ..., ψ_{tJ})^T. Under MNAR, the components ψ_{t,t}, ψ_{t,t+1}, ..., ψ_{t,J} (for all t) will be NNI.

A simpler class of models for the missing data mechanism, referred to as missing non-future dependence (Kenward et al., 2003), is characterized by

    h(t | F_{i,J+1}; ψ) = h(t | F_{i,t+1}; ψ).   (8.8)

Thus, the hazard of dropout at time t depends on the past, F_it, and on the response at time t, y_it, but not on future responses. The estimating-equations sensitivity analysis literature of Robins and colleagues uses mdms of this form in practice. The form of h(t | F_{i,t+1}; ψ) is often further simplified to

    h(t | F_{i,t+1}, ψ) = g^{-1}(ψ_0 + ψ_1 y_{i,t-1} + ψ_2 y_it),   (8.9)

in which the hazard of dropout at time t depends only on the current response, y_it, and the most recent past response, y_{i,t-1}. For this simplified specification there is only one NNI parameter, ψ_2, which considerably simplifies sensitivity analyses (see Chapter 9). In addition, in this simplified form, a test of MNAR vs. MAR reduces to a test of ψ_2 = 0. However, this test is valid only under the assumption that the full-data model is correctly specified.
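To fix ideas, the following R sketch evaluates the hazard (8.9) under a logit link and the implied probability of each dropout time given a full response vector y (names hypothetical). These quantities are the building blocks of the observed data likelihood in the examples that follow.

    dropout.hazard <- function(y, psi) {
      J <- length(y)
      h <- rep(0, J)   # no dropout possible at baseline
      for (t in 2:J) h[t] <- plogis(psi[1] + psi[2] * y[t - 1] + psi[3] * y[t])
      h
    }

    dropout.probs <- function(y, psi) {
      h <- dropout.hazard(y, psi)
      J <- length(y); p <- numeric(J + 1); surv <- 1
      for (t in 2:J) { p[t] <- surv * h[t]; surv <- surv * (1 - h[t]) }
      p[J + 1] <- surv   # probability of completing the study
      p                  # p[t]: probability of dropping out at time t; sums to one
    }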
In the following two examples, we present additional details on parametric selection models for continuous and binary longitudinal responses, respectively.

Example 8.2. A parametric selection model for continuous responses

Diggle and Kenward (1994) proposed a parametric selection model for a J-dimensional vector of longitudinal responses, y, with monotone dropout. We introduce this model in the context of the schizophrenia clinical trial (Data Example 1.1). In this trial, schizophrenia measurements were intended to be collected for six weeks, but there were dropouts at each week after baseline. The full-data response model (for the six weekly schizophrenia measurements) might be specified as

    [Y | β, Σ] ~ N(Xβ, Σ),   (8.10)

where θ = (β, Σ). The missing data mechanism, accounting for the weekly dropout, can be specified using the simple form of the hazard with a logistic link,

    logit{h(t | F_{i,t+1}, ψ)} = ψ_0 + ψ_1 y_{i,t-1} + ψ_2 y_it,   (8.11)

as above. Thus, dropout at time t is allowed to depend on the immediately previous response, y_{i,t-1}, and the (potentially unobserved) current response, y_it. This formulation is within the class of missing non-future dependence discussed earlier.

The observed data likelihood for this model takes a complex form. In particular, the contribution to the observed data likelihood for an individual who drops out at time t_k is

    [ Π_{j<k} {1 - h(t_j | F_{i,j+1}, ψ)} p(y_ij | F_ij, θ) ] ∫ h(t_k | F_{i,k+1}, ψ) p(y_ik | F_ik, θ) dy_ik.   (8.12)

The complexity of this likelihood is due to the integral with respect to the missing response y_ik. Bayesian MCMC approaches avoid direct evaluation of this integral through data augmentation; details are provided later in the chapter.

The inability of parametric selection models to separate identified and unidentified parameters is evident from (8.12), which shows that the observed data likelihood is a function of all the model parameters. In the notation of Section 8.1, the observed data likelihood is a function of the parameters of the extrapolation distribution, ω_1.
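As an illustration of the integral in (8.12), the sketch below computes the observed-data likelihood contribution for a subject who drops out at occasion k, under the simplifying (and purely illustrative) assumption that the normal model has first-order antedependence, so that p(y_t | F_it) is N(α + ρ y_{t-1}, τ²); the one-dimensional integral is then evaluated by quadrature. All names are hypothetical.

    dk.contribution <- function(yobs, k, psi, mu1, s1, alpha, rho, tau) {
      # yobs: observed responses y_1, ..., y_{k-1}; dropout occurs at occasion k >= 2
      ll <- dnorm(yobs[1], mu1, s1, log = TRUE)
      if (k > 2) for (t in 2:(k - 1)) {
        h  <- plogis(psi[1] + psi[2] * yobs[t - 1] + psi[3] * yobs[t])
        ll <- ll + log(1 - h) +
          dnorm(yobs[t], alpha + rho * yobs[t - 1], tau, log = TRUE)
      }
      m <- alpha + rho * yobs[k - 1]   # conditional mean of the missing y_k
      int <- integrate(function(yk)
        plogis(psi[1] + psi[2] * yobs[k - 1] + psi[3] * yk) * dnorm(yk, m, tau),
        lower = -Inf, upper = Inf)$value
      ll + log(int)
    }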
Example 8.3. A parametric selection model for binary responses

There is a considerable literature on parametric selection models for longitudinal binary responses (Baker, 1995; Fitzmaurice, Molenberghs, and Lipsitz, 1995; Albert, 2000), which differ mostly in how they specify the full-data response model. For example, Kurland and Heagerty (2004) proposed a parametric selection model for binary responses that assumes the same form for the missing data mechanism as in Example 8.2 and then replaces the multivariate normal likelihood for the full-data response model with an MTM(1) (see Example 2.5). The observed data log likelihood takes the same form as in (8.12), with the terms involving the full-data response model now derived from an MTM and the integrals replaced by sums. We will demonstrate these models in Chapter 10 using data from the OASIS study (Data Example 1.3), which examined the effect of a motivational enhancement intervention plus transdermal nicotine replacement on smoking cessation among alcoholic smokers. Individuals were assessed at one, three, six, and 12 months, and dropout was considerable.

A computationally simpler specification than the MTM with a logit missing data mechanism assumes a multivariate probit model jointly for both the hazard and the longitudinal responses.

In the parametric selection models described in Examples 8.2 and 8.3, all the NNI parameters in the mdm are identified. In the next section, we go through a few simple examples to see how this identification occurs.

8.2.3 NNI mdm parameter identification

In parametric selection models, NNI parameters in the missing data mechanism are identified by the parametric full-data response model together with the assumed form of the missing data mechanism. We elaborate here on the identification of the missing data mechanism parameters, ψ, and give some simple examples.

Consider a cross-sectional setting where y may or may not be missing. Suppose the histogram of the observed y's looks like that given in Figure 8.1, and suppose we specify the following model for the missing data mechanism,

    logit(P(R = 1 | y)) = ψ_0 + ψ_1 y,   (8.13)

where R = 1 corresponds to observing y. Suppose also that we specify the full-data response model to be a normal distribution, N(µ, σ²). For the histogram of the observed y's in Figure 8.1 to be consistent with normally distributed y's, we need to fill in the right tail. This implies that ψ_1 < 0, with the posterior distribution of ψ_1 refined further by how closely the implied full-data response distribution must approximate normality. However, if we assumed a parametric model for the full-data response model that was consistent
with the histogram of the observed y's (e.g., a normal distribution truncated on the right), the data would suggest that ψ_1 = 0. Thus, depending on the specification of the full-data response model, we would estimate either a MAR or an MNAR missing data mechanism. On the other hand, if the observed data response model were specified nonparametrically (e.g., Scharfstein et al., 1999), the data would provide no information about ψ_1.

Figure 8.1 Histogram of observed responses.
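A small numerical sketch makes this point explicit: under the selection factorization, the observed-data density is p(y | R = 1) ∝ P(R = 1 | y) p(y), so a normal full-data model combined with (8.13) implies a right-tail-thinned observed distribution whenever ψ_1 < 0. The ψ values below are purely illustrative.

    obs.density <- function(y, mu, sigma, psi0, psi1) {
      num <- plogis(psi0 + psi1 * y) * dnorm(y, mu, sigma)
      den <- integrate(function(u) plogis(psi0 + psi1 * u) * dnorm(u, mu, sigma),
                       -Inf, Inf)$value
      num / den
    }

    grid   <- seq(-4, 4, by = 0.05)
    d.mar  <- obs.density(grid, 0, 1, 0.5,  0)   # psi1 = 0: observed shape matches the full-data shape
    d.mnar <- obs.density(grid, 0, 1, 0.5, -1)   # psi1 < 0: right tail under-represented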
This univariate example readily extends to the longitudinal setting. For example, consider a bivariate normal response with missingness in Y_2. The corresponding missing data mechanism might be specified as

    logit(P(R = 1 | y)) = ψ_0 + ψ_1 y_1 + ψ_2 y_2,   (8.14)

where R = 1 again corresponds to observing y_2. If we assume a bivariate normal distribution for the full-data response model, then the same argument used above applies to the conditional distribution p(y_2 | y_1, θ) and the identification of ψ_2.

Sensitivity of NNI parameters to assumptions on the full-data response model was studied in a simple, but very informative, data example by Kenward (1998). He considered the bivariate setting with the second observation, y_2, potentially missing. He assumed the mdm was linear in y on the logit scale and assessed the sensitivity of the estimation of ψ to the distribution p(y_2 | y_1). When this distribution was assumed to be normal, the estimates and standard errors of ψ supported an assumption of MNAR. However, when it was assumed to follow a t-distribution with only a few degrees of freedom, the estimates and standard errors of ψ supported an assumption of MAR. Thus, merely changing the tails of the distribution of the full-data response changed inference about the missing data mechanism significantly. This and the preceding hypothetical examples illustrate well the caution that must be exercised with parametric selection models, as widely differing conclusions can be drawn based on unverifiable modelling assumptions (on the full-data response).

Several authors have suggested fixing parameters like ψ_2 in (8.14) at reasonable values (via expert opinion) and then maximizing the observed data log likelihood over the remaining, non-NNI parameters (Little and Rubin, discussion of Scharfstein et al., 1999; Kurland and Heagerty, 2004). This is similar in spirit to the sensitivity analysis approaches of Robins and colleagues (e.g., Rotnitzky, Robins, and Scharfstein, 1998; Scharfstein, Rotnitzky, and Robins, 1999), but in those cases the observed data response model was specified nonparametrically, so fixing the NNI parameters does not affect the fit to the observed responses. With the NNI parameter fixed at a specified value and a nonparametric estimate of the observed data response distribution, the full-data response model is estimated under the assumption that the support of the distribution for the dropouts is the same as for the non-dropouts. Thus, implicitly, the missing data are imputed only within the range of the observed data; in general, parametric models allow extrapolation of missing data outside the range of the observed data.

The problem with doing a sensitivity analysis within a fully parametric model is that, when the NNI parameters are fixed at values of interest, the full-data
response model may not fit the observed data well. In addition, within a given parametric specification there is a best fit to the observed data (at the MLE of ψ), so it is not clear why inference should be based on a model that does not provide the best fit. This approach is akin to specifying a prior for the NNI parameters that is not consistent with the observed data and the parametric model specification (Scharfstein et al., 2003). In any case, if this approach is to be undertaken, the fit of the model to the observed data should be compared with the fit obtained when the NNI parameters are estimated from the data (and the modelling assumptions); the latter indicates the best possible fit of that model to the observed data (though it does not imply that the model provides a good fit in an absolute sense; we discuss this further in Section ??). For example, we can compare the DIC or PPL (more details on these in Section ??) and check that the fit does not become considerably worse when these parameters are fixed at values thought to be reasonable. If the model fits are not very similar, this suggests that the values at which the NNI parameters are being fixed are not compatible with the full-data response model, and hence that the model is inappropriate for inference. More details on model fitting and checking under nonignorable missingness can be found in Section 8.5. In the next section, we also discuss a more reasonable approach for sensitivity analysis, using Bayesian semiparametric selection models, which allows sensitivity analysis to be done without sacrificing the fit of the model to the observed data. These are more in the spirit of the sensitivity analysis approaches of Robins and colleagues.

The preceding discussion has focused entirely on the full-data response model identifying NNI parameters in the mdm. However, we have considered only the simple (and common) situation where the response y enters the mdm linearly. This can be a very strong assumption. Recall the more general specification of the mdm, h(t | F_{i,J+1}, ψ). Typically, as already mentioned, this hazard is specified as g^{-1}(ψ_t^T y) or using linear contrasts among the components of y (Diggle and Kenward, 1994). Such a specification facilitates elicitation of priors, as the components of ψ corresponding to potentially missing components of y then have a simple interpretation (e.g., if g is the logit link, they are log odds ratios). However, this linearity in y is not always reasonable. For example, in the schizophrenia clinical trial (Data Example 1.1), subjects may be more likely to drop out when they are doing much better or much worse; this would suggest the need for a quadratic in y in the missing data mechanism. Such a quadratic relationship in the missing data mechanism can also occur when the full-data response model is a mixture of normals (see Example 5.9 in Chapter 5 and Section 8.3.3). If we refer again to the observed data response distribution in Figure 8.1
and a full data response model that is a normal distribution, a missing data mechanism of the form

h(t | F_{i,t+1}) = g^{-1}(ψ0 + ψ1 y_t + ψ2 y_t^2), (8.15)

with a quadratic in y_t, would result in an estimated MNAR specification with ψ1 negative and ψ2 zero, since for the full data response model to be normally distributed, the right tail needs to be filled in but the left tail is consistent with a normal distribution. Suppose instead that the observed response distribution looked like that given in Figure 8.2. If a quadratic model was specified for the mdm (8.15) in this case, ψ2 would be estimated to be positive and the mdm would be estimated to be MNAR. On the other hand, with the same observed data response distribution but a linear model for the mdm (ψ2 = 0), ψ1 would be estimated to be zero and the mdm would be estimated to be MAR. This suggests a need for flexible models for the mdm.

Work on more flexible estimation of the form of the missing data mechanism has been undertaken by several authors. Lee and Berger (2001) proposed Bayesian nonparametric estimation of the missing data mechanism, in the simple setting of a univariate y and no covariates, using Dirichlet process priors. Scharfstein and Irizarry (2003) entered baseline covariates x into the mdm nonparametrically (using generalized additive models; cf. the example in Chapter 2). However, full extension of these ideas to the longitudinal setting with covariates is non-trivial.

In practice, the typical sensitivity analyses involve specifying the mdm parametrically and the full data (or observed data) response model using a nonparametric or a flexible semiparametric model. By flexible, we mean that the prior and posterior for the NNI parameters in the mdm are similar, p(ψ2 | y_obs, r) ≈ p(ψ2) (e.g., in (8.9)); a fully nonparametric specification for the full (observed) data response model results in equivalence between the prior and posterior for ψ2. Fully parametric selection models are typically not flexible in that p(ψ2 | y_obs, r) and p(ψ2) may be very dissimilar. We discuss semiparametric selection models next.

Semiparametric selection models

Semiparametric selection models typically are composed of a parametric model for the mdm and a (semi-)nonparametric model for the observed data response distribution (or the full data response distribution). For a continuous response y, this was done in the univariate case by Scharfstein et al. (1999) (and extended to a fully Bayesian model by Scharfstein et al., 2003), but the multivariate nature of the longitudinal case makes direct extensions difficult due to the curse of dimensionality (see, e.g., Robins and Ritov, 1997).
Figure 8.2 Histogram of observed responses.

However, in the case of binary (categorical) longitudinal data, such extensions are more feasible. In addition, the assumption that the support of the distribution of the dropouts is the same as that of the non-dropouts is much more plausible. Such models would provide a more natural framework for conducting sensitivity analysis and/or
using informative priors for NNI parameters than their fully parametric counterparts.

Binary longitudinal responses

We begin with the simplest longitudinal setting of a bivariate binary response with missingness only in the second component (Y2) and no covariates. This case provides a useful platform for understanding fully nonparametric model specification for selection models with longitudinal binary data, and will provide a starting point for a semiparametric specification of the full data response model.

Let Y = (Y1, Y2)^T denote the full-data response, with R = 1 if Y2 is observed and R = 0 if it is missing. The entire full-data distribution p(y, r | ω) can be enumerated as in Table 8.1.

Table 8.1 Multinomial parameterization of the full-data distribution for bivariate binary data with possibly missing Y2.

R   Y1   Y2   p(y1, y2, r | ω)
0   0    0    ω^(0)_00
0   0    1    ω^(0)_01
0   1    0    ω^(0)_10
0   1    1    ω^(0)_11
1   0    0    ω^(1)_00
1   0    1    ω^(1)_01
1   1    0    ω^(1)_10
1   1    1    ω^(1)_11

The full-data distribution can be viewed as a multinomial distribution with 7 distinct parameters (noting that Σ_{r,y1,y2} ω^(r)_{y1 y2} = 1). The parameters corresponding to realizations with R = 1, namely ω^(1)_00, ..., ω^(1)_11, all are identifiable from the observed data. Among those with R = 0, we can identify only P(R = 0, Y1 = 0) = ω^(0)_00 + ω^(0)_01 and P(R = 0, Y1 = 1) = ω^(0)_10 + ω^(0)_11.

In this setting, the selection model is just a reparameterization of ω from the multinomial representation of the joint distribution of (R, Y1, Y2) to

p(y1, y2, r) = p(r | y1, y2) p(y2 | y1) p(y1),
where

p(y1) ~ Ber(θ1)
p(y2 | y1) ~ Ber(θ_{2|1}(y1))
p(r | y1, y2) ~ Ber(π(y1, y2)),

with

logit(θ1) = α
logit(θ_{2|1}(y1)) = β0 + β1 y1
logit(π(y1, y2)) = ψ0 + ψ1 y1 + ψ2 y2 + ψ3 y1 y2.

Hence,

θ1 = Σ_{r=0}^{1} (ω^(r)_10 + ω^(r)_11)

θ_{2|1}(y1) = Σ_{r=0}^{1} ω^(r)_{y1 1} / Σ_{r=0}^{1} (ω^(r)_{y1 0} + ω^(r)_{y1 1})

π(y1, y2) = ω^(1)_{y1 y2} / (ω^(0)_{y1 y2} + ω^(1)_{y1 y2}).

This model has 7 unique parameters, so it is saturated in time and missingness pattern. Because the model is saturated, it is nonparametric. Clearly, α (and hence θ1) is directly identified from the observed data. To identify the remaining parameters, there are several ways to constrain the distribution; this also leads to different choices for the two sensitivity parameters. A natural choice might correspond to parameters that are zero under MAR and/or that are easy to elicit. Here, ψ2 and ψ3 are zero under MAR and correspond to log odds ratios of missingness for y2 given y1; thus, we will treat these as the sensitivity parameters. We also point out that fixing the sensitivity parameters does not affect the fit of the model to the observed data. The parameters in the missing data mechanism have the following correspondence with the original multinomial probabilities ω:

ψ0 = logit{ω^(1)_00 / (ω^(1)_00 + ω^(0)_00)}
ψ0 + ψ1 = logit{ω^(1)_10 / (ω^(1)_10 + ω^(0)_10)}
ψ0 + ψ2 = logit{ω^(1)_01 / (ω^(1)_01 + ω^(0)_01)}
ψ0 + ψ1 + ψ2 + ψ3 = logit{ω^(1)_11 / (ω^(1)_11 + ω^(0)_11)}. (8.16)

Fixing ψ2 and ψ3 results in all the model parameters being identified. To see this, notice that by subtracting the second from the fourth expression in (8.16), we obtain ψ2 + ψ3, and hence a closed form expression for the ratio ω^(0)_10 / ω^(0)_11. By combining this with the identified sum ω^(0)_10 + ω^(0)_11, we can obtain closed form expressions for ω^(0)_10 and ω^(0)_11 individually. Similarly, by doing analogous algebra starting with the first and third equations in (8.16), we can obtain identification of ω^(0)_00 and ω^(0)_01.
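To make this algebra concrete, the following R sketch solves for the unidentified cells given fixed values of ψ2 and ψ3 and the identified quantities; all object names and numerical values are illustrative, not from the text.

## Solve for the unidentified cells omega0[y1, y2] = P(R = 0, Y1 = y1, Y2 = y2)
## given the identified cells omega1[y1, y2] = P(R = 1, Y1 = y1, Y2 = y2), the
## identified margins m0 = P(R = 0, Y1 = 0) and m1 = P(R = 0, Y1 = 1), and
## fixed sensitivity parameters psi2 and psi3.
solve_omega0 <- function(omega1, m0, m1, psi2, psi3) {
  ## First and third equations of (8.16) imply
  ## omega0[0,0] / omega0[0,1] = exp(psi2) * omega1[0,0] / omega1[0,1]
  c0  <- exp(psi2) * omega1["0", "0"] / omega1["0", "1"]
  w01 <- m0 / (1 + c0)                     # omega0[0,1]
  w00 <- m0 - w01                          # omega0[0,0]
  ## Second and fourth equations of (8.16) imply
  ## omega0[1,0] / omega0[1,1] = exp(psi2 + psi3) * omega1[1,0] / omega1[1,1]
  c1  <- exp(psi2 + psi3) * omega1["1", "0"] / omega1["1", "1"]
  w11 <- m1 / (1 + c1)                     # omega0[1,1]
  w10 <- m1 - w11                          # omega0[1,0]
  matrix(c(w00, w01, w10, w11), nrow = 2, byrow = TRUE,
         dimnames = list(y1 = c("0", "1"), y2 = c("0", "1")))
}

## Hypothetical identified quantities (probabilities sum to 1 with m0 + m1):
omega1 <- matrix(c(0.20, 0.15, 0.10, 0.25), nrow = 2, byrow = TRUE,
                 dimnames = list(y1 = c("0", "1"), y2 = c("0", "1")))
solve_omega0(omega1, m0 = 0.12, m1 = 0.18, psi2 = 0.5, psi3 = 0)

Setting psi2 = psi3 = 0 reproduces the MAR solution; varying them traces out the sensitivity analysis described above.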
Unfortunately, when J >> 2, such a setup is not practical, as there are J(2^J − 1) + (2^{J−1} − 1) parameters, of which only Σ_{j=1}^{J} (2^j − 1) + (2^{J−1} − 1) are identified; this leaves Σ_{j=1}^{J} (2^J − 2^j) NNI parameters. So, for example, for J = 4, there are 34 NNI parameters (and this does not even include covariates)! For a situation with a large number of non-identified parameters, an alternative to a nonparametric model for the full data responses is to shrink such a model towards a simpler model (Daniels and Scharfstein, 2007). The mdm is specified in a simple form, which allows for relatively simple prior elicitation for NNI parameters. These models for the full data response provide additional flexibility over fully parametric ones like MTMs, and provide consistent estimation of the parameters in large samples under correct specification of the mdm. We will explore such models on the OASIS data in Chapter 10.

We will close out our discussion of selection models by providing guidance on features of posterior sampling specific to them.

Posterior sampling strategies

We provide some general guidance on, and strategies for, posterior sampling for selection models. First, we point out that by using data augmentation to fill in y_mis, posterior sampling of ψ and θ proceeds using complete data techniques for the missing data mechanism and the full data response model, respectively, i.e., sampling from p(ψ | y, r) and p(θ | y) separately. As a result, intractable integrals, such as those found in (8.12), do not need to be evaluated directly. So the main new computing issue to discuss here is sampling the missing data y_mis from p(y_mis | y_obs, r, θ, ψ).

If y_mis is binary, then p(y_mis | y_obs, r, θ, ψ) will follow a Bernoulli distribution with an appropriately constructed success probability. For example, a missing component y_ij will be sampled from a Bernoulli distribution with success probability
p(r | y_{i,−j}, y_ij = 1, ψ) p(y_{i,−j}, y_ij = 1 | θ) / Σ_{k=0}^{1} p(r | y_{i,−j}, y_ij = k, ψ) p(y_{i,−j}, y_ij = k | θ), (8.17)

where y_{i,−j} denotes the vector of responses with y_ij removed. Considerable simplification of this form is typical in particular models.

When y_mis is continuous, p(y_mis | y_obs, r, θ, ψ) can be sampled using a very simple rejection sampling algorithm. First, sample a candidate y_mis from p(y_mis | y_obs, θ), i.e., the conditional distribution of y_mis based on only the full data response model. This conditional distribution typically takes a simple form (see Chapter 6). Next, accept this candidate with probability p(r | y_mis, y_obs, ψ). If the candidate is not accepted, sample another candidate and repeat until one is accepted. This approach has been used successfully in Scharfstein, Daniels, and Robins (2003). The efficiency of the sampler depends on the magnitude of p(r | y_mis, y_obs, ψ); if this probability is close to zero given the candidate y_mis, the efficiency will be low. If this approach is not computationally efficient, a Laplace approximation to p(y_mis | y_obs, r, θ, ψ) can be used as the candidate distribution in a standard Metropolis-Hastings step (recall Chapter 3, p. ??). For certain models with continuous responses, p(y_mis | y_obs, r, θ, ψ), or a suitably data-augmented version of this distribution, can be sampled from directly (e.g., the Heckman selection model described in Example 8.1).
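A minimal R sketch of the rejection sampler just described, assuming a bivariate setting in which the missing component has a normal full-data conditional p(y_mis | y_obs, θ) and the mdm is logistic as in (8.14); the function name, arguments, and numerical values are hypothetical.

## Draw y_mis from p(y_mis | y_obs, r, theta, psi) by rejection:
## propose from p(y_mis | y_obs, theta), accept with prob p(r | y, psi).
draw_ymis <- function(r, y_obs, mu_mis, sd_mis, psi) {
  repeat {
    y_cand <- rnorm(1, mu_mis, sd_mis)       # candidate from the full data response model
    p_r1 <- plogis(psi[1] + psi[2] * y_obs + psi[3] * y_cand)
    p_acc <- if (r == 1) p_r1 else 1 - p_r1  # p(r | y_mis, y_obs, psi)
    if (runif(1) < p_acc) return(y_cand)
  }
}

## e.g., impute y2 for a dropout (r = 0) with observed y1 = 1.3:
draw_ymis(r = 0, y_obs = 1.3, mu_mis = 0.8, sd_mis = 1.1, psi = c(-0.5, 0.2, 0.7))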
Summary of Pros and Cons of SMs

The main advantages of SMs are the direct specifications of both the full data response model and the missing data mechanism. As a result, the primary parameters of interest (in most cases), the full data response model parameters, are directly specified as in the case of complete data (Chapter 2). In addition, how the missingness is related to the response y and covariates x is transparent. The main disadvantage of parametric selection models is the inability to partition ω into identified and unidentified components, as discussed in Section 7.1. In particular, parametric selection models completely identify the extrapolation distribution, p(y_mis | y_obs, r, ω2). The ability to partition ω into ω1 and ω2 will be an attractive feature of the next class of models, mixture models.

8.3 Mixture models

Background and general set up

The development of mixture models for handling missing data can be traced at least to Rubin (1976), who described methods for using informative priors in surveys to capture and use subjective information about nonresponse. Since then, a number of key papers have expanded mixture models for handling informative dropout in longitudinal data. Little (1993, 1994) developed general theory for mixtures of multivariate normal distributions in discrete time settings. For longer follow-up times, mixtures of random effects models proved useful (Wu and Bailey, 1989; Mori et al., 1992; Hogan and Laird, 1997). Fitzmaurice et al. (2000) developed moment-based approaches based on mixtures of generalized linear models. Roy (2002) addressed the issue of having a large number of dropout categories by using mixtures over latent classes. Hogan et al. (2004) developed approaches for continuous dropout times based on mixtures of varying coefficient models. Reviews of model-based approaches can be found in Little (1995), Hogan and Laird (1997), Kenward, Molenberghs, and Thijs (2003), and Hogan, Roy, and Korkontzelou (2004). Another key thread of research concerns model identification and sensitivity analysis. Here, Molenberghs and colleagues have developed an important body of work for the case of discrete-time dropout (cf. Molenberghs et al., 1998; Kenward et al., 2003), much of which will be described in this section.

The mixture model approach factors the full-data model as

p(y, r | x, ω) = p(y | r, x, ω) p(r | x, ω). (8.18)

The full data response distribution is obtained by averaging over r (see Section 5.8.2). In this section, we describe some specific model formulations that give a broad representation of the settings where the models can be used, and describe formal methods for identifying them. In general, when R is discrete, the component distributions comprise the set {p(y | r) : r ∈ R}, where R is the support of R (the set of all possible missing data patterns). When missingness is caused by dropout, the component distributions p(y | u) may also be a discrete set of distributions, or may be specified in terms of a continuous u.

Model identification

Mixture models for either discrete or continuous dropout time are usually under-identified and require specific constraints to be imposed for
fitting to data. There are various approaches to identifying the models; in this section, we focus on strategies that divide the full-data distribution into identified and non-identified components (see the extrapolation factorization (8.1)). For the case of discrete-time dropout, the MAR assumption is used as a basis for identifying the full-data distribution and for building a larger class of models that accommodate MNAR. For continuous-time dropout (or discrete-time dropout with a large number of support points), models can be identified by making assumptions about the mean of the full-data response as a function of time; e.g., assuming E{Y(t) | U} is linear in t, with intercept and slope depending on U. Usually one assumes that, given dropout at U = u, the time trend for t < u and t ≥ u is either equivalent or related through some known function; for example,

E{Y(t) | U = u, t > u} = q(·, u, t) E{Y(t) | U = u, t ≤ u}

for some known q(·, u, t). In this example, q(·, u, t) cannot be identified, but using a prior on it identifies the full-data model.

Identification via MAR constraints

Although the usual goal of fitting a mixture model is to represent an MNAR mechanism, for the purposes of discussing model identification it helps to begin by showing how to impose the MAR condition. In Section 5.5, we saw that for the case of monotone dropout, MAR can be represented in terms of the hazard of dropout. For pattern mixture models, Molenberghs et al. (1998) show that MAR is equivalent to the available case missing value (ACMV) constraint (Little, 1994). Specifically, MAR has the following representation.

Theorem 8.1. MAR for discrete-time pattern mixture models under monotone dropout. Let Y1, ..., YJ denote the full-data responses, with U ∈ {1, 2, ..., J} representing dropout time. MAR holds if and only if, for j ≥ 2 and for all k < j,

p(y_j | y_{j−1}, ..., y_1, U = k) = p(y_j | y_{j−1}, ..., y_1, U ≥ j). (8.19)

The proof can be found in Molenberghs et al. (1998). This theorem states that the conditional distribution of Y_j given past responses, for those who drop out at some k < j, is equivalent to the same conditional distribution for those who continue to time j or beyond. The subset that continue are an aggregation of those in dropout patterns j, j + 1, ..., J. Under monotone dropout, the RHS of (8.19) is identifiable from observed data, while the LHS is not.
To impose the MAR condition, one approach is to assume a parametric model for the observables, and then use the MAR assumption to constrain the distribution of the missing data. Although this chapter is primarily concerned with MNAR models, it is almost always advisable to embed an MAR model within a broader class of MNAR specifications, so as to make clear the assumptions or parameter constraints that differentiate MNAR from MAR. Theorem 8.1 motivates this approach in the pattern mixture context, and it is further discussed in Chapter 9. The examples in this chapter will illustrate the application of the theorem for specific mixture models.

Other types of identification strategies

Interior family constraints. The set of interior family constraints (Molenberghs et al., 1998) can be used to represent commonly-used mixture model identification strategies such as nearest neighbor, complete case, and available case constraints. Consider the case of discrete-time dropout U and measurement occasions j = 1, ..., J. With complete case missing value (CCMV) constraints, the distributions {p(y_j, y_{j+1}, ..., y_J | y_1, ..., y_{j−1}, u = j) : j < J} for those who drop out are equated to those of the completers, p(y | u = J). An example of nearest neighbor or adjacent pattern constraints is to equate the distribution p(y_j | y_1, ..., y_{j−1}, u = j − 1), where Y_j is missing due to dropout, to the distribution p(y_j | y_1, ..., y_{j−1}, u = j), which is the nearest pattern with an identified distribution for the missing Y_j. Rather than rely on nearest neighbor or complete-case patterns, the available case missing value (ACMV) constraints use all available patterns for which p(y_j | y_1, ..., y_{j−1}, u) is identified by the data. Molenberghs et al. (1998) show that for discrete-time monotone dropout, MAR as stated in Theorem 8.1 is equivalent to ACMV. A detailed illustration of interior family constraints is given in Example 8.4.

Non-future dependence missing value restrictions. Non-future dependent missing value restrictions (Kenward et al., 2003) are a convenient set of restrictions related to the non-future dependence missing data mechanisms discussed in Section 8.2. They allow the probability of nonresponse to depend on the current (but possibly unobserved) value of the response variable Y. Specifically, the constraint is given by

p(y_j | y_{j−1}, ..., y_1, u = k) = p(y_j | y_{j−1}, ..., y_1, u ≥ j − 1) (8.20)
for j ≥ 2 and for all k < j − 1. These restrictions look very similar to the MAR restriction in Theorem 8.1, and the MAR restriction is in fact a special case. This restriction identifies all the unidentified conditional distributions in each pattern except for the one corresponding to the current value of the response, p(y_{j+1} | y_j, ..., y_1, u = j). A primary motivation for using this set of constraints is that in a sensitivity analysis, only one univariate distribution in each pattern is left unidentified; the MAR restriction provides a starting point for conducting sensitivity analysis within this class of restrictions. These restrictions are equivalent to the non-future dependence missing data mechanism introduced in the context of selection models, given by (8.9).

Extrapolation. Unless the number of distinct dropout times is discrete and small (say 3 or 4), the number of constraints and/or sensitivity parameters potentially becomes unmanageable for saturated or nonparametric models. Here structural constraints may be needed. One approach for longitudinal data is to assume that the mean response as a function of time follows a known function; this is the approach taken by several authors, such as Wu and Bailey (1989), Hogan and Laird (1997), and Fitzmaurice et al. (2001). As an example, suppose the number of measurement occasions is large but common across subjects. Assume

(Y1, ..., YJ | U = k) ~ N(μ^(k), Σ^(k)).

One possible set of constraints is to assume the mean follows μ^(k)_j = f(t_j; β^(k)), where f(t; β) has a known form such as β^(k)_0 + t_j β^(k)_1, and the variance has some simplified form such as Σ^(k) = Σ. Because the variance matrix is assumed common across patterns, it will be identified. If the mean is linear over time within pattern, it will be fully identified for any pattern where at least one individual has two or more observed time points.

Of course these models are restrictive, and there is no information in the data to verify the structural constraints. As with the discrete time models, we can separate the distributions p(y_mis | y_obs, r) and p(y_obs, r). For example, instead of assuming that the mean response within pattern follows a single line before and after dropout, we might assume

f(t, β^(k)) = β^(k)_0 + β^(k)_1 t + Δ^(k) (t − u)_+,

where a_+ = a I(a > 0) is the positive part of a. This larger model assumes the slope is β^(k)_1 for t ≤ u (prior to dropout), and β^(k)_1 + Δ^(k) for t > u (following dropout). Unlike the models described for the discrete case, setting Δ^(k) = 0 does not imply MAR, but it does provide a separation between parameters that are identified and not identified. Similar
parameterizations can be introduced for variance parameters, which is taken up in Chapter 9.

Mixture models with discrete-time dropout

Discrete-time dropout: continuous responses

To understand the PMM for continuous data in discrete time, and its connection to selection models, it is useful to review the model for the bivariate case (Little, 1994). We rely on specifications using mixtures of normal distributions, which tend to be mathematically tractable. For continuous responses, there is no reason for analysis to be confined to specifications involving the normal distribution, although this area is largely open to further investigation.

Consider a full data model for the joint distribution of (Y1, Y2, R), such that (Y1, Y2) is the bivariate full-data response and R is a binary indicator of missingness for Y2. Recall from Chapter 5 that the full-data PMM for bivariate normal data follows

(Y1, Y2 | R = r) ~ N(μ^(r), Σ^(r))
R ~ Ber(φ), (8.21)

where, for pattern r ∈ {0, 1}, the parameters are {μ^(r)_1, μ^(r)_2, σ^(r)_11, σ^(r)_22, σ^(r)_12}. Recall also that for pattern r, we can define

ξ^(r)_{2|1} = {β^(r)_{0(2|1)}, β^(r)_{1(2|1)}, σ^(r)_{2|1}}

to contain, respectively, the intercept, slope, and residual variance for the regression of Y2 on Y1. This allows the parameters of the two component distributions in (8.21) to be written as

α^(r) = {μ^(r)_1, σ^(r)_11, ξ^(r)_{2|1}}.

Clearly the parameters in ξ^(0)_{2|1} cannot be identified from the observables because they do not appear in the observed-data likelihood. Therefore either modeling constraints or prior distributions are needed. Little (1994) discusses a number of possible approaches for the bivariate normal case; Little and Wang (1996) and Kenward et al. (2003) describe more general settings; these are discussed further below.

Connection to selection models

The pattern mixture model (8.21) is related to a discriminant analysis model based on mixtures of normal distributions (see Anderson, 1984,
Ch. 6). Referring to (8.21), let y = (y1, y2)^T, and let π(y) = P(R = 1 | y); then

π(y) / {1 − π(y)} = [φ |Σ^(1)|^{−1/2} exp{−(1/2)(y − μ^(1))^T (Σ^(1))^{−1} (y − μ^(1))}] / [(1 − φ) |Σ^(0)|^{−1/2} exp{−(1/2)(y − μ^(0))^T (Σ^(0))^{−1} (y − μ^(0))}].

Taking logs, we have

logit{π(y)} = log{φ/(1 − φ)} − (1/2) log(|Σ^(1)|/|Σ^(0)|) − (1/2)(y − μ^(1))^T (Σ^(1))^{−1} (y − μ^(1)) + (1/2)(y − μ^(0))^T (Σ^(0))^{−1} (y − μ^(0)).

In general, the log odds of observing y2, logit{π(y)}, is a function of each y_j and y_j^2 and of y1 y2. When Σ^(0) = Σ^(1) = Σ, the log odds is linear in y, and the implied selection model takes the form

logit{π(y)} = ψ0 + ψ1 y1 + ψ2 y2, (8.22)

where

ψ0 = log{φ/(1 − φ)} − (1/2)(μ^(1)T Σ^{−1} μ^(1) − μ^(0)T Σ^{−1} μ^(0))
ψ1 = −(μ^(0)_1 − μ^(1)_1)/σ11 + (β_{1(2|1)}/σ_{2|1}) {(μ^(0)_2 − μ^(1)_2) − β_{1(2|1)} (μ^(0)_1 − μ^(1)_1)}
ψ2 = −(1/σ_{2|1}) {(μ^(0)_2 − μ^(1)_2) − β_{1(2|1)} (μ^(0)_1 − μ^(1)_1)}. (8.23)

MAR is satisfied when p(y2 | y1, r = 1) = p(y2 | y1, r = 0). Equality of the conditional distributions implies equality of conditional means. Multivariate normality within pattern gives

E(Y2 | Y1, R = r) = μ^(r)_2 + (σ21/σ11)(Y1 − μ^(r)_1)
= {μ^(r)_2 − (σ21/σ11) μ^(r)_1} + (σ21/σ11) Y1. (8.24)

Under the constraint Σ^(1) = Σ^(0), variance components are equal across patterns, and therefore so is β_{1(2|1)} (= σ21/σ11). Hence equality of conditional means is satisfied when the intercept terms in (8.24) are equal for r = 0, 1; i.e., when

μ^(0)_2 − β_{1(2|1)} μ^(0)_1 = μ^(1)_2 − β_{1(2|1)} μ^(1)_1,

or perhaps more intuitively, when

μ^(0)_2 = μ^(1)_2 + β_{1(2|1)} (μ^(0)_1 − μ^(1)_1).

Referring now to (8.23), it is straightforward to see that the MAR condition for the mixture of normals model, which implies equality of conditional means for r = 0, 1, also implies ψ2 = 0. Embedding this model in a larger MNAR model is fairly trivial because only one parameter, μ^(0)_2, is unidentified.
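This duality is easy to compute directly. The R sketch below takes pattern-specific means, a common covariance matrix Σ, and φ = P(R = 1), returns the implied selection model coefficients in (8.22) and (8.23), and checks that imposing the MAR constraint on μ^(0)_2 yields ψ2 = 0; all numerical values are illustrative.

implied_psi <- function(mu1, mu0, Sigma, phi) {
  ## mu1, mu0: mean vectors for patterns R = 1 and R = 0; phi = P(R = 1)
  Sinv <- solve(Sigma)
  psi0 <- log(phi / (1 - phi)) -
    0.5 * (drop(t(mu1) %*% Sinv %*% mu1) - drop(t(mu0) %*% Sinv %*% mu0))
  psi12 <- drop(Sinv %*% (mu1 - mu0))      # coefficients (psi1, psi2) on (y1, y2)
  c(psi0 = psi0, psi1 = psi12[1], psi2 = psi12[2])
}

## MAR constraint: mu0[2] = mu1[2] + beta * (mu0[1] - mu1[1]), beta = s12/s11
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
beta  <- Sigma[1, 2] / Sigma[1, 1]
mu1   <- c(0, 0)
mu0   <- c(1, mu1[2] + beta * (1 - mu1[1]))
implied_psi(mu1, mu0, Sigma, phi = 0.7)    # psi2 = 0 under MAR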
The mixture of normals model for J = 2 is instructive for understanding the connections between mixture models and their implied selection models; however, as we will see in the examples that follow, mixture models constructed from multivariate normal component distributions cannot generally be constrained in a way that is consistent with an MAR assumption. The usual approach is to assume a normal distribution for the observed data within pattern, and then to impose MAR constraints that will identify p(y_mis | y_obs, r). As it turns out, p(y_mis | y_obs, r) is typically not normal within pattern under MAR. In Example 8.4, we assume the joint marginals within pattern are normal; in Example 8.5, we assume the joint conditionals are normal. It turns out that the second case is much easier to use as a basis for MNAR.

Example 8.4. Pattern mixture model based on marginal normality of observed-data distributions within pattern. Here we use the normal distribution for the observed data, and use the MAR assumption to identify the distribution of the missing data. The case J = 3 is used to illustrate. The full-data model is

p(y1, y2, y3, u | ω) = p(y1, y2, y3 | u, θ) p(u | φ),

where, for k = 1, 2, 3, p(u | φ) follows a multinomial distribution with φ_k = P(U = k). To simplify notation, we denote each of the full-data component distributions as p_k(y1, y2, y3) = p(y1, y2, y3 | U = k). The observed data distributions within each pattern are specified using normal distributions,

(Y1 | U = 1) ~ N(μ^(1), σ^(1))
(Y1, Y2 | U = 2) ~ N(μ^(2), Σ^(2))
(Y1, Y2, Y3 | U = 3) ~ N(μ^(3), Σ^(3)).

Now consider identifying p(y_mis | y_obs, u) under MAR. To illustrate, we start with U = 1. We need to identify p_1(y2, y3 | y1). By definition,

p_1(y2, y3 | y1) = p_1(y2 | y1) p_1(y3 | y2, y1).
The MAR constraint (8.19) implies

p_1(y2 | y1) = p_{≥2}(y2 | y1)
p_1(y3 | y2, y1) = p_3(y3 | y2, y1).

The first term, p_{≥2}(y2 | y1), is a mixture of the normal distributions from patterns U = 2 and U = 3,

p_{≥2}(y2 | y1) = {φ2 p_2(y2 | y1) p_2(y1) + φ3 p_3(y2 | y1) p_3(y1)} / {φ2 p_2(y1) + φ3 p_3(y1)}.

Hence the distribution of (Y2 | U = 1) is a discrete mixture of normals, even though (Y2 | U = 2) and (Y2 | U = 3) are normal (by specification).

The identification strategy for the model in Example 8.4 uses the interior family constraints. Focusing again on those with U = 1, the distribution p_1(y2 | y1) can be identified by the mixture

p_1(y2 | y1) = Δ p_2(y2 | y1) + (1 − Δ) p_3(y2 | y1).

Setting Δ = 1 is a nearest neighbor identification scheme, using those with U = 2 to identify p_1(y2 | y1); setting Δ = 0 identifies p_1(y2 | y1) using the complete case missing value (CCMV) approach. MAR corresponds to setting

Δ = φ2 p_2(y1) / {φ2 p_2(y1) + φ3 p_3(y1)}.

The weight Δ is not identifiable without making assumptions. It can be chosen by the modeler, or it can be identified by imposing MAR. One can envision sensitivity analyses based on varying Δ over [0, 1], or placing an informative prior on Δ. Although most values of Δ technically represent MNAR models, the full-data model space under this parameterization is indexed by a single parameter, and extrapolations are confined to combinations of the observed-data distributions.
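For illustration, the R sketch below computes the MAR value of Δ at a given y1, assuming normal pattern-specific marginals for Y1, and imputes Y2 for a subject with U = 1 from the Δ-mixture; all function and parameter names are hypothetical.

## Delta corresponding to MAR at a given y1 (Delta = 0 gives CCMV,
## Delta = 1 gives the nearest neighbor scheme)
delta_mar <- function(y1, phi2, phi3, m2, sd2, m3, sd3) {
  w2 <- phi2 * dnorm(y1, m2, sd2)          # phi_2 p_2(y1)
  w3 <- phi3 * dnorm(y1, m3, sd3)          # phi_3 p_3(y1)
  w2 / (w2 + w3)
}

## Impute Y2 for a subject with U = 1 from
## p_1(y2 | y1) = Delta p_2(y2 | y1) + (1 - Delta) p_3(y2 | y1)
impute_y2 <- function(y1, Delta, cond2, cond3) {
  ## cond2, cond3: lists with intercept a, slope b, residual sd s of the
  ## within-pattern regression of Y2 on Y1
  comp <- if (runif(1) < Delta) cond2 else cond3
  rnorm(1, comp$a + comp$b * y1, comp$s)
}

D <- delta_mar(y1 = 0.4, phi2 = 0.3, phi3 = 0.5, m2 = 0, sd2 = 1, m3 = 0.5, sd3 = 1)
impute_y2(y1 = 0.4, Delta = D,
          cond2 = list(a = 0.1, b = 0.6, s = 0.9),
          cond3 = list(a = 0.3, b = 0.5, s = 0.8))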
In the next example, we show how to parameterize the PMM to allow global departures from MAR. At each dropout time k, we assume the conditional distribution of those remaining in the study, p(y_k | y_{k−1}, ..., y_1, u ≥ k), follows a normal distribution.

Example 8.5. Pattern mixture model based on conditional normal distributions for observed data. Consider again the case J = 3 with no covariates. From (8.19), the identifiable distributions are p_1(y1), p_2(y1, y2), and p_3(y1, y2, y3). We specify these directly, but make the further assumption that certain unidentified distributions are normal. The distribution of (Y_obs | U) is given by

(Y1 | U = k) ~ N(μ^(k), σ^(k)), k = 1, 2, 3
(Y2 | Y1, U ≥ 2) ~ N(α^(≥2)_0 + α^(≥2)_1 Y1, τ^(≥2)_2)
(Y3 | Y1, Y2, U = 3) ~ N(β^(3)_0 + β^(3)_1 Y1 + β^(3)_2 Y2, τ^(3)_3).

To round out the full-data response distribution, we specify (Y_mis | U) as follows,

(Y2 | Y1, U = 1) ~ N(α^(1)_0 + α^(1)_1 Y1, τ^(1)_2)
(Y3 | Y1, Y2, U = k) ~ N(β^(k)_0 + β^(k)_1 Y1 + β^(k)_2 Y2, τ^(k)_3), k = 1, 2.

The general MAR constraints require

p_1(y2 | y1) = p_{≥2}(y2 | y1)
p_1(y3 | y1, y2) = p_3(y3 | y1, y2)
p_2(y3 | y1, y2) = p_3(y3 | y1, y2).

The first of these can be satisfied by equating the means and variances of p_1(y2 | y1) and p_{≥2}(y2 | y1),

α^(1)_0 + α^(1)_1 y1 = α^(≥2)_0 + α^(≥2)_1 y1 (for all y1 ∈ R)
τ^(1)_2 = τ^(≥2)_2.

The first equality holds only when α^(1)_0 = α^(≥2)_0 and α^(1)_1 = α^(≥2)_1. Similar constraints can be imposed on the parameters indexing p_1(y3 | y1, y2) and p_2(y3 | y1, y2) to fully satisfy MAR.

Writing the model in this way makes it fairly simple to embed the MAR specification in a large class of MNAR models indexed by sensitivity parameters Δ and ξ. For example, write

α^(1)_0 = α^(≥2)_0 + Δ0
α^(1)_1 = α^(≥2)_1 + Δ1
τ^(1)_2 = ξ τ^(≥2)_2.

Dropout is MAR when Δ0 = Δ1 = 0 and ξ = 1, and otherwise is MNAR. None of the sensitivity parameters appears in the observed data likelihood. Additional sensitivity parameters can be included for each model constraint, but of course in practice it is necessary to limit their dimensionality. A key consideration for constructing informative priors or calibrating sensitivity analyses is understanding the physical meaning of the sensitivity parameters in context. This is discussed further in Chapter 9. In addition, in Section 10.1, we address each of these issues (model specification, dimensionality of sensitivity parameters, formulation and interpretation of priors, calibration of sensitivity analyses) in a detailed analysis of data from the Growth Hormone Study.
Discrete-time dropout: binary responses

Pattern mixture models can be used for binary data as well. Some key references here include Baker and Laird (1988), Ekholm and Skinner (1998), Birmingham and Fitzmaurice (2002), and Little and Rubin (2002). In this section, we describe pattern mixture models for the bivariate case J = 2 and for higher-dimensional settings where J > 2. For each model, we show how to impose MAR constraints using Theorem 8.1, and introduce parameters that index a larger class of MNAR models. Connections to selection models are made for comparison. Many of the key ideas that apply to continuous-data models also apply here.

Example 8.6. Pattern mixture model for bivariate binary data. The bivariate case with missingness in Y2 provides a useful platform for understanding model specification with binary responses. As with the continuous case, let Y = (Y1, Y2)^T denote the full-data response, and let U denote the number of observed measurements (i.e., U = 2 if Y2 is observed and U = 1 if it is missing). The entire full-data distribution p(y, u | ω) can be enumerated as in Table 8.1, and viewed as a multinomial distribution with 7 distinct parameters. The pattern mixture model specifies the joint distribution in terms of

p(y1, y2 | u, α) p(u | φ),

where, referring again to Table 8.1, (α, φ) = g(ω) is just a reparameterization of ω. A simple PM specification is

(Y2 | Y1, U = k) ~ Ber(μ^(k)_{2|1}) (k = 1, 2)
(Y1 | U = k) ~ Ber(μ^(k)_1)
U ~ Ber(φ). (8.25)

Regression models can be used to structure the parameters for the responses,

logit(μ^(k)_{2|1}) = α^(k)_0 + α^(k)_1 y1
logit(μ^(k)_1) = β^(k) (k = 1, 2).

The 7 unique parameters in this PMM comprise 3 regression parameters each for k = 1, 2, and the Bernoulli probability φ = P(U = 1). Without further assumptions, α^(1)_0 and α^(1)_1 cannot be identified. In this simple model, MAR holds when p_1(y2 | y1) = p_2(y2 | y1), and can be imposed by setting (α^(1)_0, α^(1)_1) = (α^(2)_0, α^(2)_1), yielding μ^(1)_{2|1} = μ^(2)_{2|1}.
For handling MNAR mechanisms, the regression formulation is favored over the multinomial parameterization because it facilitates simple representations of departures from MAR in terms of differences between the α parameters in the component distributions p_k(y2 | y1); specifically, if we let

α^(1)_0 = α^(2)_0 + Δ0
α^(1)_1 = α^(2)_1 + (Δ1 − Δ0), (8.26)

then (Δ0, Δ1) = (0, 0) represents MAR, and departures from MAR can be represented by nonzero values of the Δs. The Δ parameters form the basis for sensitivity analysis or informative prior distributions that capture MNAR. For y = 0, 1, the parameter Δ_y is a log odds ratio comparing the odds that Y2 = 1 between dropouts and completers, conditionally on Y1 = y,

Δ_y = log[ {E(Y2 | U = 1, Y1 = y) / {1 − E(Y2 | U = 1, Y1 = y)}} / {E(Y2 | U = 2, Y1 = y) / {1 − E(Y2 | U = 2, Y1 = y)}} ].
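The role of (Δ0, Δ1) can be made concrete with a short R sketch that computes the full-data mean E(Y2) from the identified quantities and chosen sensitivity parameters; the names and numerical values below are hypothetical.

## alpha2 = (alpha0, alpha1) for the completers; mu1 = pattern-specific
## P(Y1 = 1); phi = P(U = 1). Delta0 = Delta1 = 0 recovers MAR.
fulldata_EY2 <- function(alpha2, mu1, phi, Delta0, Delta1) {
  alpha1 <- c(alpha2[1] + Delta0, alpha2[2] + (Delta1 - Delta0))  # per (8.26)
  EY2_pat <- function(a, p1)                 # E(Y2 | U), averaged over Y1
    (1 - p1) * plogis(a[1]) + p1 * plogis(a[1] + a[2])
  phi * EY2_pat(alpha1, mu1[1]) + (1 - phi) * EY2_pat(alpha2, mu1[2])
}

fulldata_EY2(alpha2 = c(-0.4, 1.1), mu1 = c(0.3, 0.5), phi = 0.25,
             Delta0 = 0, Delta1 = 0)        # MAR
fulldata_EY2(alpha2 = c(-0.4, 1.1), mu1 = c(0.3, 0.5), phi = 0.25,
             Delta0 = -0.5, Delta1 = -0.5)  # dropouts less likely to have Y2 = 1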
Another advantage to using the regression formulation is that its representation of MNAR extends readily to settings where J > 2. To illustrate, consider a binary response vector Y = (Y1, ..., YJ)^T. Define dropout time as the number of observed measurements, U = Σ_j R_j. The full data model can be specified in terms of the dropout-specific component distributions p_k(y1, ..., yJ) and the dropout model P(U = k). Assume the component distributions follow a transition model such that, for k = 1, ..., J,

(Y1 | U = k) ~ Ber(μ^(k)_1)
(Yj | Y1, ..., Y_{j−1}, U = k) ~ Ber(φ^(k)_j), j = 2, ..., J.

A general regression formulation for φ^(k)_j would clearly have too many parameters if it were to include main effects, two-way, and higher-order interactions involving Y1, ..., Y_{j−1}. For purposes of illustration, one can consider a simplified model where serial dependence has order one. Using a logistic regression,

logit(φ^(k)_j) = γ^(k)_j + θ^(k)_j Y_{j−1}. (8.27)

Even using this simplified structure, the parameters (γ^(k)_j, θ^(k)_j) cannot be identified for j > k. In the next example, we illustrate how to specify the PMM for J = 3 using first-order Markov models as the component distributions. In Section 10.3, we use a similar model to analyze data from the OASIS trial. In that example, we provide details on model specification under both MAR and MNAR, interpretation of sensitivity parameters, and guidance about specification of informative priors that capture qualitative assumptions about departures from MAR.

Example 8.7. Pattern mixture model for repeated binary data (J = 3). Here we consider a model for the joint distribution p(y1, y2, y3, u). For k = 1, 2, 3, we assume

p_k(y1, y2, y3) = p_k(y1) p_k(y2 | y1) p_k(y3 | y2, y1),

and further assume first-order serial dependence, so that p_k(y3 | y2, y1) = p_k(y3 | y2). Within pattern k, we have

(Y1 | U = k) ~ Ber(μ^(k))
(Y2 | Y1, U = k) ~ Ber(φ^(k)_2)
(Y3 | Y1, Y2, U = k) ~ Ber(φ^(k)_3),

with

logit(μ^(k)) = β^(k)
logit(φ^(k)_j) = γ^(k)_j + θ^(k)_j Y_{j−1} (j = 2, 3). (8.28)

The MAR constraint (8.19) is satisfied when

p_1(y2 | y1) = p_{≥2}(y2 | y1)
p_1(y3 | y2) = p_2(y3 | y2) = p_3(y3 | y2).

The first part of the MAR constraint can be imposed by assuming (Y2 | Y1, U ≥ 2) ~ Ber(φ^(≥2)_2), with

logit(φ^(≥2)_j) = γ^(≥2)_j + θ^(≥2)_j Y_{j−1} (j = 2, 3), (8.29)

and then setting

γ^(1)_2 = γ^(≥2)_2
θ^(1)_2 = θ^(≥2)_2. (8.30)

In the absence of covariates, models (8.28) and (8.29) are compatible because, with binary data, the mixture distribution p_{≥2}(y2 | y1) obtained by averaging the component-specific logistic models for p_2(y2 | y1) and p_3(y2 | y1) is equivalent to the single logistic model for the data aggregated over patterns 2 and 3. The second part of the MAR constraint, which requires p_1(y3 | y2) = p_2(y3 | y2) = p_3(y3 | y2), can be satisfied by setting

γ^(k)_3 = γ^(3)_3
θ^(k)_3 = θ^(3)_3
for k = 1, 2. As with the continuous data model in Example 8.5, the MAR model can be easily embedded in a larger class of models that allow MNAR by introducing sensitivity parameters. For example, referring to (8.30), we can write

γ^(1)_2 = γ^(≥2)_2 + Δ^(1)_0
θ^(1)_2 = θ^(≥2)_2 + (Δ^(1)_1 − Δ^(1)_0). (8.31)

The parameters Δ^(1)_0 and Δ^(1)_1 correspond to log odds ratios that compare the odds of Y2 = 1 between those who drop out at time 1 (U = 1) and those who continue past time 1 (U ≥ 2). Specifically, notice that

γ^(1)_2 = logit{P(Y2 = 1 | Y1 = 0, U = 1)}
γ^(≥2)_2 = logit{P(Y2 = 1 | Y1 = 0, U ≥ 2)},

so that

Δ^(1)_0 = log{ odds(Y2 = 1 | Y1 = 0, U = 1) / odds(Y2 = 1 | Y1 = 0, U ≥ 2) }.

Similarly, under the parameterization in (8.31), Δ^(1)_1 is the same log odds ratio, but conditioned on Y1 = 1,

Δ^(1)_1 = log{ odds(Y2 = 1 | Y1 = 1, U = 1) / odds(Y2 = 1 | Y1 = 1, U ≥ 2) }.

More sensitivity parameters can be introduced along these lines, and the model configured such that MAR is implied by setting the sensitivity parameters to zero. In Chapter 9, we delineate principles for finding sensitivity parameters and conducting sensitivity analyses. Further consideration of this model, and a detailed illustration using data from the OASIS Study, is provided in Section 10.3.
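To fix ideas, the R sketch below simulates monotone data of this form, estimates the identified pieces by maximum likelihood, and sets the pattern-1 conditionals from chosen sensitivity parameters via (8.31); the data-generating values and object names are hypothetical.

set.seed(1)
n  <- 300
u  <- sample(1:3, n, replace = TRUE, prob = c(0.2, 0.3, 0.5))
y1 <- rbinom(n, 1, 0.5)
y2 <- y3 <- rep(NA, n)
y2[u >= 2] <- rbinom(sum(u >= 2), 1, plogis(-0.3 + 0.8 * y1[u >= 2]))
y3[u == 3] <- rbinom(sum(u == 3), 1, plogis(-0.2 + 1.0 * y2[u == 3]))
dat <- data.frame(y1, y2, y3, u)

## p(y2 | y1, U >= 2): one logistic fit to the aggregation of patterns 2 and 3
fit_2ge <- glm(y2 ~ y1, family = binomial, data = subset(dat, u >= 2))
## p(y3 | y2, U = 3): identified from the completers only
fit_3 <- glm(y3 ~ y2, family = binomial, data = subset(dat, u == 3))

## Pattern-1 conditional for Y2, via (8.31) with chosen Delta's:
Delta0 <- 0; Delta1 <- 0                   # MAR; nonzero values give MNAR
gamma2_1 <- coef(fit_2ge)[1] + Delta0
theta2_1 <- coef(fit_2ge)[2] + (Delta1 - Delta0)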
Mixture models for continuous-time dropout

Our illustrations of mixture models to this point have focused on settings where dropout occurs in discrete time and at a small number of measurement times. It is possible to apply these models in continuous-time settings as well. Here, we change notation for the missingness process from R to U, where U > 0 is the time to dropout. The full-data model then follows

p(y, u | ω) = p(y | u, ω) p(u | ω),

and the full-data response model is obtained by averaging over the dropout distribution,

p(y | ω) = ∫ p(y | u, ω) p(u | ω) du.

For continuous data, a general formulation of this joint distribution, developed by Hogan et al. (2004), uses a varying coefficient model for the conditional distribution p(y | u, x, ω). Several other approaches, such as the standard pattern mixture model for discrete-time dropout (Little, 1993, 1994; Hogan and Laird, 1997) and the conditional linear model (Wu and Bailey, 1989), can be viewed as special cases of the varying coefficient model approach. Moreover, although the VCM approach is developed for continuous responses and is based on a mixture of normal distributions (used below to illustrate), in principle the approach can be used for other types of distributions.

In the mixture of normals case, we assume that the full-data response vector Y = (Y1, ..., YJ) arises from a process {Y(t) : t ≥ 0}. Conditionally on dropout at time U = u, we assume Y(t) has a mean function denoted by μ(t | u) and covariance function V(s, t | u); e.g., V(s, t | u) = σ^2(t | u) ρ(s, t | u) for s ≤ t. We further assume that the process is conditionally normal (i.e., a Gaussian process),

[Y(t) | U = u] ~ N(μ(t | u), V(s, t | u)).

To obtain E{Y(t)}, we integrate the conditional mean function over the distribution of dropout time,

E{Y(t)} = ∫ μ(t | u) p(u) du.

A question arises as to the functional form of μ(t | u); in the VCM approach, it is assumed that μ(t | u) is a known function of t for any given u, but as a function of u, it can be any smooth function. To illustrate, we use a simple formulation where μ(t | u) is linear in t, and the goal is to estimate the overall mean of {Y(t)}.

Example 8.8. Varying coefficient model for estimating intercept and slope under dropout. Consider a full-data response that is generated by potentially observing the process {Y_i(t)} at time points t_{i1}, ..., t_{iJ_i}. Hence Y_i = (Y_{i1}, ..., Y_{iJ_i})^T = (Y_i(t_{i1}), ..., Y_i(t_{iJ_i}))^T. Dropout for individual i occurs at U_i. Let

x_i(t) = {x_{i1}(t), ..., x_{ip}(t)}
represent a p-dimensional covariate process, and let

X_i = {x_i(t_{i1})^T, ..., x_i(t_{iJ_i})^T}^T = (x_{i1}^T, ..., x_{iJ_i}^T)^T

denote a J_i × p covariate matrix. A varying coefficient model for those who drop out at u is

[Y_i | X_i, U_i = u] ~ N(X_i β(u), Σ_i(u)),

where β(u) is a p × 1 parameter vector comprised of functions of u; i.e., β(u) = (β_1(u), ..., β_p(u))^T. The functions can be specified using penalized splines (cf. Ruppert, 2002; Crainiceanu et al., 2005); details of likelihood construction for a specific case are given in Chapter 10. For this example we assume Σ_i(u) = Σ_i{φ(u)} = Σ_i(φ); that is, we assume the conditional variance is parameterized by the vector φ(u), but that φ(u) = φ does not depend on dropout time. This assumption may be relaxed by modeling covariance parameters as smooth functions of u; see Daniels and Zhao (2003) for an example, and see Chapter 6.

This model implies that, conditional on dropout time,

E{Y_i(t) | x_i(t), U = u} = μ_i(t | u) = x_i(t) β(u),

and therefore

E(Y_ij | X_i, U_i = u) = x_i(t_ij) β(u) = x_ij β(u).

When x_ij = (1, t_ij), E(Y_ij | t_ij, U_i = u) follows a straight line as a function of t, but the intercept and slope parameters are functions of the dropout time u. For this model, the mean of Y_i(t) itself is a straight line, because

E(Y_ij) = ∫ {β_1(u) + β_2(u) t_ij} p(u) du (8.32)
= β̄_1 + β̄_2 t_ij,

where, for p = 1, 2, β̄_p = ∫ β_p(u) p(u) du. An important feature of this model is that the distribution of dropout times, p(u), can be left unspecified. For inference, a Dirichlet process prior can be used to ensure flexibility. Parametric distributions for p(u) also can be used where appropriate, but in general leaving the distribution unspecified does not introduce significant additional complications. A detailed analysis of an HIV clinical trial, including details on model and prior specification, is used to illustrate in Section 10.2.
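A brief R sketch of the computation in (8.32): dropout-specific intercept and slope functions are averaged over an (here, empirical) dropout distribution to produce the full-data mean line. The smooth functions below are hypothetical stand-ins for fitted penalized-spline coefficient functions β_1(u) and β_2(u).

beta1_fun <- function(u) 2.0 - 0.3 * exp(-u / 2)   # hypothetical beta_1(u)
beta2_fun <- function(u) -0.10 + 0.05 * log(u)     # hypothetical beta_2(u)

u_obs <- c(1.2, 3.5, 4.1, 6.0, 7.7)   # observed dropout times: empirical p(u)
beta1_bar <- mean(beta1_fun(u_obs))   # integral of beta_1(u) over p(u)
beta2_bar <- mean(beta2_fun(u_obs))

t_grid <- seq(0, 8, by = 0.5)
EY <- beta1_bar + beta2_bar * t_grid  # full-data mean E{Y(t)} along t_grid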
For normally distributed data, the VCM approach reduces to the standard multivariate normal repeated measures regression, and consequently the missing data mechanism reduces to MAR, when β(u) = β and Σ(u) = Σ. When the regression coefficients are a nonconstant function of u, the working assumption for the VCM is that, conditional on U = u, the regression coefficients remain the same both before and after u. This assumption is particularly important for time-varying covariates, the most common being time itself. In the simple model described in Example 8.8, it is assumed that among those dropping out at U = u, the slope β_2(u) applies both prior to and after dropout. Hence the full-data mean of Y at a fixed time t = t* will be based on extrapolations for those who have dropped out prior to t*. Need to insert picture for illustration.

It is possible to relax the assumption of a constant slope (or covariate effect) before and after dropout by adding one or more sensitivity parameters; this is addressed in Su and Hogan (2007) and discussed further in Section ??.

Both the conditional linear model (Wu and Bailey, 1989; Hogan and Laird, 1997) and the standard pattern mixture model (Little, 1993) are special cases of the model in Example 8.8. In the conditional linear model, which applies to situations where dropout is continuous, β(u) is assumed to have a known functional form (such as a polynomial in u), and the variance parameters are assumed to be constant across dropout times; i.e., Σ(u) = Σ. The standard pattern mixture model, where dropout times are discrete, is simply a mixture of normals where the components of β(u) are step functions.

In principle, the VCM can be applied to settings where dropout time is continuous and responses are discrete. In that case, calculation of full-data functionals (such as the mean) requires integrating a nonlinear link function over the dropout distribution. This is in contrast to (8.32), where it is only necessary to integrate the functions β(u) over p(u).

Mixture models or selection models?

With binary (or categorical) data, when the measurement occasions are discrete and dropout is the sole cause of missingness, the duality between the models holds for any J, but of course the dimensionality increases exponentially: for J measurement occasions with J possible dropout times, the number of unique parameters is J 2^J − 1, and any realistic analysis must rely on simplifying assumptions to reduce the dimension of the parameter space. This raises an obvious question of which parameterization to use.
With a selection model, the simplifications must be made in terms of the full-data response distribution p(y1, ..., yJ) and the selection mechanism p(u | y1, ..., yJ). Possible strategies include limiting the association structure in the full data response distribution to two-way interactions, or limiting the selection mechanism such that dropout at t depends only on a small part of the observed history (say Y_t and Y_{t−1}). For purposes of sensitivity analysis, however, this can become problematic because unless the full-data response model is nonparametric, the full-data model parameter ω can be identified from the observed data, and the choice of an appropriate sensitivity parameter is not obvious. An alternative is a semiparametric formulation for the full-data response model (Daniels et al., 2007); see Section 9.4 for an example.

By contrast, model simplifications in mixture models may be more feasible. An example is the assumption of first-order dependence within pattern, shown in (8.27). There is a clear delineation between identified and non-identified parameters, and simplifying assumptions, as they apply to the observed data, can be empirically critiqued. This is explored further in Chapter 9. Our analysis of data from the OASIS study uses both a pattern mixture approach and a semiparametric selection model, and brings these distinctions into sharper focus.

Identifying and inferring covariate effects in mixture models

The full-data model for the PMM is a mixture over component distributions corresponding to missing data pattern or dropout time. Covariate effects must therefore be interpreted in terms of the mixture distribution. For the PMM with identity link, covariate effects for the full data can sometimes have a simple representation as a weighted average over pattern-specific covariate effects. We focus on covariates X that are exogenous, either time-invariant (e.g., baseline characteristics such as gender) or fixed functions of time (e.g., age), and whose effects are time-invariant (see Roy and Lin for settings with stochastic time-varying covariates subject to missingness from dropout). A key consideration in deriving and identifying the covariate effects is whether the probability of missingness or dropout depends on X.

PMM with identity link function within pattern

Example 8.9. Bivariate outcome where dropout does not depend on co-
variates. Consider the PMM for bivariate data where interest is in estimating the effect of baseline covariates X. Recall that the full-data model is

p(y1, y2, r | x, ω) = p(y1, y2 | r, x, ω) p(r | x, ω).

For simplicity, assume p(r | x, ω) = p(r | ω), which implies that the effect of X on Y can be fully characterized by its within-pattern effects. Then the PM specification is

(Y1, Y2 | X, R) ~ N(Xβ^(r), Σ^(r))
R ~ Ber(φ).

Hence

E(Y | X) = E_R{E(Y | X, R)} = φXβ^(1) + (1 − φ)Xβ^(0) = X{φβ^(1) + (1 − φ)β^(0)},

and the covariate effect is a weighted average of pattern-specific coefficients. These can be identified because the covariate has a constant effect over time.

More generally, dropout may depend on covariates, in which case

p(y | x) = ∫ p(y | r, x) p(r | x) dr.

Usually we are interested in μ(x) = E(Y | X = x), computed via

μ(x) = ∫ μ^(r)(x) p(r | x) dr.

The next example considers this more general setting using a discrete mixture of normals.

Example 8.10. Discrete mixture of normals where dropout depends on covariates. Denote the full-data response by Y = (Y1, ..., YJ)^T, with missing data indicators R = (R1, ..., RJ)^T. To make things simpler, we assume dropout is the only cause of missingness, where U = Σ_j R_j and U ∈ {1, ..., J}. (More generally, it is possible to define a multinomial random variable S to represent all 2^J unique patterns of missingness.) The associated pattern mixture model is

(Y | U = u, X = x) ~ N(xβ^(u), Σ^(u)(x))
(U | X = x) ~ Mult(φ_1(x), ..., φ_J(x)),

where φ_u(x) = P(U = u | x). In practice this could be represented using a saturated model if X is discrete and low-dimensional (e.g., treatment
group in a randomized trial); otherwise it would be specified using a model such as relative risk regression. The mean response as a function of covariates is

μ(x) = E(Y | X = x) = Σ_u φ_u(x) xβ^(u).

The covariate effect is seen to depend on x,

∂μ(x)/∂x = Σ_u {x ∂φ_u(x)/∂x + φ_u(x)} β^(u). (8.33)

Even though we used a mixture of normals for this example, (8.33) applies for any model with an identity link within pattern. For capturing the effect of fixed differences x − x′, we have

μ(x) − μ(x′) = Σ_u {φ_u(x)x − φ_u(x′)x′} β^(u). (8.34)

Clearly, when p(u | x) = p(u), we have ∂φ_u(x)/∂x = 0 and φ_u(x) − φ_u(x′) = 0 for the discrete case. Hence (8.33) simplifies to Σ_u φ_u β^(u) and (8.34) simplifies to Σ_u φ_u (x − x′) β^(u).

The foregoing example implicitly treats the covariate effect as applying to the entire vector x, but in practice we usually are interested in a single covariate, conditioning on the others. Without loss of generality, partition the covariates as x = (x1, x2), and suppose we are interested only in the effect of x1. Referring to (8.34) for illustration, we see that the effect of x1 on the full-data mean may in fact depend on x2 because φ_u(x1, x2) ≠ φ_u(x1, x2′). Hence the effect of x1 should be summarized at some meaningful values of x2.
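The weighted-average structure in (8.34) is straightforward to compute; the R sketch below evaluates μ(x) and the contrast μ(x) − μ(x′) for a two-pattern example in which dropout depends on the covariate. All names and values are illustrative.

## mu(x) = sum_u phi_u(x) x beta^(u) for a discrete mixture
mu_x <- function(x, beta_list, phi_fun) {
  phis <- phi_fun(x)                         # P(U = u | x), summing to 1
  sum(vapply(seq_along(beta_list),
             function(u) phis[u] * drop(x %*% beta_list[[u]]),
             numeric(1)))
}

beta_list <- list(c(1.0, 0.5), c(1.5, 0.2))  # patterns u = 1, 2
phi_fun <- function(x) {                     # dropout probability depends on x
  p1 <- plogis(-0.5 + 0.4 * x[2])
  c(p1, 1 - p1)
}
x  <- c(1, 2)                                # intercept and covariate value
xp <- c(1, 1)
mu_x(x, beta_list, phi_fun) - mu_x(xp, beta_list, phi_fun)   # contrast (8.34)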
PMM with nonlinear link functions within pattern

For PMMs that are specified as mixtures of regression models, evaluating covariate effects is somewhat more complicated if nonlinear link functions are used (e.g., logistic or log). Let μ^(u)_j(x) = E(Y_j | X_j = x, U = u) denote the within-pattern mean, and assume g{μ^(u)_j(x)} = xβ^(u), where g : R → R is a smooth monotone function. In general, the effect of X on the full-data mean of Y_j is

∂μ_j(x)/∂x = Σ_u ∂/∂x {g^{−1}(xβ^(u)) φ_u(x)}
= Σ_u [{∂/∂x g^{−1}(xβ^(u))} φ_u(x) + g^{−1}(xβ^(u)) ∂φ_u(x)/∂x].

If dropout does not depend on covariates, then φ_u(x) = φ_u, and

∂μ_j(x)/∂x = Σ_u {∂/∂x g^{−1}(xβ^(u))} φ_u. (8.35)

Example 8.11. Computing covariate effects for a mixture of loglinear models. If a log link is used within pattern, we have

g^{−1}(xβ^(u)) = exp(xβ^(u)),

and

∂/∂x g^{−1}(xβ^(u)) = β^(u) exp(xβ^(u)).

Now assume p(u | x) = p(u); for the mixture of loglinear models, (8.35) becomes

∂μ_j(x)/∂x = Σ_u β^(u) exp(xβ^(u)) φ_u.

Hence even if dropout is independent of x, the covariate effect still depends on x when the link function within pattern is nonlinear. In this example, we see that when p(u | x) = p(u), the effect of x on the full-data mean of Y_j is a weighted average of the regression coefficients β^(u), with weights that depend on x. Inspection of (8.35) shows this will be true in general for nonlinear g.

8.4 Shared parameter models (SPM)

General structure

The general form of the shared parameter model was given in Chapter 5. In the most general case, an SPM takes the form

p(y, r | ω) = ∫ p(y, r | η, ω) p(η | ω) dη;

notice that the chief characteristic of these models is that a single set of shared parameters, usually random effects, applies to the joint
distribution of y and r. In many cases it is assumed that Y and R are independent conditionally on η, though this is not a requirement (see Ten Have et al.). Theoretically, all SPMs can be represented either as mixture models or selection models, but the functional form of the component distributions is typically not tractable.

Example 8.12. Shared parameter model for a normally distributed full-data response with dropout. Henderson et al. (2000) describe a general structure for the SPM whereby the full-data response is characterized by a Gaussian process having between- and within-subject variation. Missingness is induced by dropout, which depends, through a second model, on individual-specific random effects characterizing between-subject variation of the responses. The Henderson et al. specification assumes a continuous-time process Y(t) that can be right-censored by dropout at time U; to make the notation agree with our conventions, let Y_j = Y(t_j). Then their model is written as

Y_j = X_j β + Z_j + e_j
λ_U(t_j) = λ_0(t_j) exp(Z_j γ), (8.36)

where Z_j is a realization at t_j of a latent Gaussian process Z(t) that characterizes between-subject variation, and e_j is a realization at t_j of a stationary Gaussian process e(t) that characterizes within-subject residual variation. Hence the first part of the model represents the distribution p(y | x, z), and the second part of the model represents p(u | z) via its hazard function λ_U(t). Contextually, this model can be motivated by the need to deal with responses measured with error, whereby X_j β + Z_j is the error-free version of Y_j, with measurement error captured by e_j. The shared parameters index the distribution of Z = (Z_1, ..., Z_J)^T.

Consider a simple case where X_j = (1, t_j), e_j ~ N(0, σ^2), and Z_j = X_j η, where η = (η_0, η_1)^T ~ N(0, Ω) are subject-specific random effects for intercept and slope. Here the underlying error-free process follows a straight line over time. Then the model (8.36) is written as follows, using subject indices i to emphasize sources of variation:

Y_ij = X_ij β + X_ij η_i + e_ij
λ_U(t_ij) = λ_0(t_ij) exp(X_ij η_i γ).

The full-data likelihood for this model is proportional to

p(y, u | x, β, Ω, σ, γ) = ∫ p(y | x, η, β, σ) p(u | x, η, γ) p(η | Ω) dη.

This is a shared parameter model with Y ⊥ U | (η, X).
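The structure of this model is perhaps most easily seen by simulating from it. The R sketch below generates random intercepts and slopes η_i, dropout times whose hazard depends on η_i (here, a hypothetical constant baseline hazard, giving exponential dropout times), and responses censored after dropout, so that Y ⊥ U | η; all parameter values are illustrative.

set.seed(42)
n <- 100; tgrid <- 0:4
beta  <- c(10, -0.5); sigma <- 1
Omega <- matrix(c(4, 0.2, 0.2, 0.25), 2, 2)
gamma <- c(0.1, 0.8)                       # hazard loadings on (eta0, eta1)

eta <- matrix(rnorm(2 * n), n, 2) %*% chol(Omega)   # eta_i ~ N(0, Omega)
lambda <- 0.05 * exp(drop(eta %*% gamma))           # subject-specific hazard
U <- rexp(n, rate = lambda)                         # dropout times

Y <- sapply(tgrid, function(t)
  (beta[1] + eta[, 1]) + (beta[2] + eta[, 2]) * t + rnorm(n, 0, sigma))
Y[outer(U, tgrid, "<")] <- NA              # responses after dropout are missing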
This Henderson et al. model is highly general because it uses a stochastic process formulation. Some earlier formulations of shared parameter models can be seen as special or similar cases. Wu and Carroll (1987) proposed essentially the same model as (8.36) to handle selection bias due to dropout; their version uses a probit model for the hazard but easily accommodates other link functions. Extensions of this framework for handling dropout can be found in Follmann and Wu (1995) and Ten Have et al. (list those papers). Schluchter (1992) and Hogan and Daniels (2000) each use a version of (8.36) where p(u | x, η) follows a normal or lognormal distribution. Although it is not necessary to specify a parametric model for the dropout or event time distribution, doing so can make posterior sampling much easier, and parametric models should be used when appropriate. See Hogan and Daniels (2000) for model-checking strategies for the dropout distribution.

The shared parameter structure is used quite often to formulate the joint distribution of repeated measures and event times. Repeated measures with dropout is one process that gives rise to a joint distribution, but in other cases the main interest is the event time distribution, and the repeated measures are treated as stochastic covariates whose distribution needs to be modeled. Wulfsohn and Tsiatis (1997) use the shared parameter model structure to model the hazard of AIDS-related death as a function of error-free longitudinal CD4 count; hence their main interest is in the coefficients γ of the hazard model. DeGruttola and Tu (1994) have a similar objective but use a parametric model for the event times, similar to Schluchter (1992) and Hogan and Daniels (2000). Other examples can be found in Faucett and Thomas (1996). Reviews of this literature can be found in Hogan and Laird (1997) and, more recently, Davidian and Tsiatis (2004).

Pros and cons of shared parameter models

Shared parameter models are very effective for decomposing the variance of multivariate processes, and work in this area has been instrumental in facilitating joint modeling of repeated measures and event times. Most commonly, shared parameter models are used either to (a) make adjustments for selection bias when the main objective is drawing inference about the full-data distribution of repeated measures, or (b) model the hazard of an event time as a function of stochastic time-varying covariates.

With respect to handling dropout, these models can be effective for complex data structures (e.g., multivariate longitudinal responses, situations
where observations are taken very frequently across time), or when the main outcome of interest can be conceptualized as a latent variable (for example, severity of disease as measured by several indicators). In the latter case, the mechanism relating the full-data distribution of interest (the latent variable) to missingness is explicit.

For settings where the models are used to handle dropout, the functional form of p(y | r) or p(r | y) can be obscured by the latent variable structure, which has to be integrated out. Consequently, the missing data mechanism is not always transparent; equivalently, it will be difficult in many models to write down and critique the induced distribution of the missing responses, p(y_mis | y_obs, r). The latent variable structure also can make it difficult to separate p(y_mis | y_obs, r) from p(y_obs, r), and it is therefore hard to embed these models in a larger class of models for assessing sensitivity to assumptions. Finally, shared parameter models frequently rely for identification on distributional assumptions about η, which governs both the observed and missing data. These assumptions are frequently made for convenience and are not motivated by context.

To summarize, shared parameter models are very useful for characterizing joint distributions of repeated measures and event times; their application to the problem of making full-data inference from incomplete longitudinal data should be made with caution and with an eye toward justifying the required assumptions. Sensitivity analysis is an open area of research for these models.

8.5 Model comparisons and assessing model fit

For model comparisons, the major difference from the ignorable models described in Chapter 6 is that we now explicitly specify the missing data mechanism, p(r | y, ψ(ω)). So comparisons between models and model checks can now be based on the full data model, p(r, y | ω), not just the full data response model, p(y | θ). In situations where the unidentified parameters cannot be partitioned from the identified parameters (recall Section 7.1), as in parametric selection models and shared parameter models, careful attention must be paid to the fit of the parametric full data model to the observed data, because the observed data distributions are not directly modelled (as they are in PMMs; Section 7.3). For likelihood-based criteria, like the DIC, this corresponds to the fit of the observed data likelihood, L(ω | y_obs, r). Bad fit indicates that the full data modelling assumptions are not consistent with the observed data.

None of the following metrics and checks for comparing/assessing the fit
None of the following metrics and checks for comparing models and assessing fit provides information about the plausibility of the implicit or explicit assumptions about the missing data. They only provide information about how the modelling assumptions, priors, and/or specific missing data assumptions, as an ensemble, fit the observed data (r, y_obs). Thus, the following criteria and checks cannot be used to assess specific missing data assumptions except within the context of a specific fully parametric model for the full data; and, as we saw before, that model cannot be verified either.

8.5.1 DIC

The DIC is based on the likelihood. For non-ignorable dropout, the full data likelihood is proportional to p(r, y | ω); under ignorable dropout, it was proportional to p(y | θ), since we did not need to specify the exact form of the missing data mechanism. In Chapter 6, we discussed two forms of the DIC, DIC_I and DIC_II. DIC_I is based on the observed data likelihood, which for non-ignorable models is proportional to

    L(ω | y_obs, r) ∝ ∫ p(r, y | ω) dy_mis.    (8.37)

This is typically not available in closed form for SMs and SPMs. When it is available in closed form, DIC_I can often be computed entirely in WinBUGS. However, for continuous data with a multivariate normal full data response model, WinBUGS cannot compute the observed data likelihood automatically; in such cases, the first term of the DIC can be computed by writing the observed data likelihood directly as a function within WinBUGS. The second term, the likelihood evaluated at the posterior mean, needs to be computed from the WinBUGS output (e.g., in R, which can call WinBUGS). We provide examples of this on the book webpage.

DIC_II is constructed as the posterior predictive expectation of the DIC based on the full data likelihood,

    E_{y_mis | y_obs, r}[DIC(y, r)] = -4 E_{ω, y_mis}[log L(ω | y_obs, y_mis, r)]
                                      + 2 E_{y_mis}[log L(E[ω | y_obs, y_mis, r] | y_obs, y_mis, r)].    (8.38)

Computing this requires the evaluation of E[ω | y_obs, y_mis, r] (in the second term of (8.38)) at each value of y_mis sampled during the data augmentation step of the sampling algorithm. The first term in (8.38) is the expectation of the full data log likelihood with respect to p(y_mis, ω | y_obs, r), which can be computed trivially (except for SPMs) within the sampling algorithm for all the models considered (and can be computed in WinBUGS by writing a full data likelihood function).
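Whichever form is used, once the required log likelihood quantities have been extracted from the sampler, assembling the criterion itself is trivial. The following is a minimal sketch in R (it is not code from the book webpage); the argument names are illustrative, and the user must supply the two log likelihood evaluations from user-written functions.

    ## Minimal sketch: assemble a DIC value from MCMC output, using
    ## DIC = -4 E[log L] + 2 log L(posterior mean). `loglik_draws` holds
    ## log L(omega^(l) | .) for each retained draw; `loglik_at_postmean`
    ## is the same log likelihood evaluated at E[omega | .].
    dic_from_draws <- function(loglik_draws, loglik_at_postmean) {
      -4 * mean(loglik_draws) + 2 * loglik_at_postmean
    }

The same arithmetic applies to DIC_I (with the observed data log likelihood) and to the inner DIC of (8.38) (with the full data log likelihood at each sampled y_mis).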
The second term, given E[ω | y_obs, y_mis, r], can then be computed by writing a simple function (e.g., in R) to average over the samples of y_mis. Both of these criteria can be used to compare the fit of full data models to the observed data, (y_obs, r).

We discuss implementation and computation of DIC_I and DIC_II for selection models, pattern mixture models, and shared parameter models next. The potential computational problems are the computation of the observed data likelihood for DIC_I and of the expectation E[ω | y_obs, y_mis, r] for each sampled value of y_mis for DIC_II. Dealing with these problems requires simple functions to be written, for example in R, to process the WinBUGS output (see the book webpage).

Computation for selection models

Using a selection model factorization, the observed data log likelihood required for DIC_I will typically not be available in closed form. Exceptions include multivariate normal specifications for the full data response model combined with a probit specification for the hazard of the mdm that is linear in y (recall Example 8.1). In general, however, the observed data likelihood can be computed using a re-weighting approach, described next.

Recall the form of the observed data likelihood given in (8.37) and assume distinct (and a priori independent) parameters. It can be shown that the observed data likelihood is equal to

    ∫ p(r | y; ψ) p(y | θ) dy_mis = p(y_obs | θ) E_{y_mis | y_obs, θ}[p(r | y_mis, y_obs, ψ)].    (8.39)

The latter expectation can be computed by sampling from p(y_mis | y_obs, θ), which is typically available in closed form. However, a separate sample would need to be drawn for every value of θ sampled in the MCMC algorithm. To avoid this, we recommend obtaining the sample once, at a single value θ* = E[θ | y_obs, r], and then re-weighting using the weights

    w_l = p(y_mis^(l) | y_obs, θ) / p(y_mis^(l) | y_obs, θ*).    (8.40)

For DIC_II, we can run one additional Gibbs sampler with y_mis fixed at a likely value, say y_mis* = E[y_mis | y_obs, r]. The following weights can then be used to compute E[θ, ψ | y_mis^(a), y_obs, r] for any value y_mis^(a):

    w_l = p(θ^(l), ψ^(l), y_mis^(a), y_obs, r) / p(θ^(l), ψ^(l), y_mis*, y_obs, r).    (8.41)
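To make the re-weighting in (8.39)-(8.40) concrete, the R sketch below computes the observed data log likelihood at each posterior draw by re-using a single Monte Carlo sample of y_mis. All argument names are illustrative assumptions: the user must supply rmis and dmis (a sampler and density for p(y_mis | y_obs, θ), assumed available in closed form, e.g., multivariate normal conditionals), p_r_given_y for the mdm, and the precomputed values of log p(y_obs | θ^(l)).

    ## Hypothetical sketch of the re-weighting computation (8.39)-(8.40).
    ## `draws` is a list of posterior draws, each a list with elements
    ## theta, psi, and log_p_yobs = log p(y_obs | theta).
    obs_loglik_sm <- function(draws, theta_star, rmis, dmis, p_r_given_y, L = 2000) {
      ymis <- replicate(L, rmis(theta_star), simplify = FALSE)  # one sample, reused
      sapply(draws, function(d) {
        ## importance weights (8.40) for re-using the theta_star sample at theta^(l)
        w <- vapply(ymis, function(y) dmis(y, d$theta) / dmis(y, theta_star), numeric(1))
        ## self-normalized estimate of E_{y_mis | y_obs, theta}[ p(r | y, psi) ]
        Ehat <- sum(w * vapply(ymis, function(y) p_r_given_y(y, d$psi), numeric(1))) / sum(w)
        d$log_p_yobs + log(Ehat)   # log of (8.39)
      })
    }

The self-normalized form of the weighted average is used here for numerical stability; it is one reasonable implementation choice, not the only one.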
Computation for pattern mixture models

In a pattern mixture formulation, the full data likelihood is factored as p(y | r, α) p(r | φ). For DIC_I, the observed data likelihood therefore takes the form p(r | φ) ∫ p(y | r, α) dy_mis, which can be computed in closed form as long as the pattern-specific distributions p(y | r, α) yield a closed form for ∫ p(y | r, α) dy_mis. Thus there should rarely be any computational problems with DIC_I for PMMs, and it can be computed in WinBUGS (except for multivariate normal specifications, as discussed earlier).

For DIC_II, we need to compute E[α, φ | y_mis, y_obs, r] for all sampled values of y_mis. There are two options. The first is to use a re-weighting strategy similar to that used in Chapter 6. For y_mis = y_mis^(a), the weights are

    w_l = p(α^(l), φ^(l), y_mis^(a), y_obs, r) / { p(y_mis^(a) | y_obs, r) p(α^(l), φ^(l), y_mis^(l), y_obs, r) },    (8.42)

with p(α^(l), φ^(l), y_mis^(l), y_obs, r) = p(y^(l) | α, r) p(α | r) p(r; φ). The term p(y_mis^(a) | y_obs, r) can be computed using Monte Carlo integration, as in the ignorable case. The alternative is to do one additional Gibbs sampler run with y_mis fixed at a likely value, say y_mis* = E[y_mis | y_obs, r]; the following weights can then be used to compute E[α, φ | y_mis^(a), y_obs, r] for any value y_mis^(a):

    w_l = p(α^(l), φ^(l), y_mis^(a), y_obs, r) / p(α^(l), φ^(l), y_mis*, y_obs, r).    (8.43)

This is the same approach suggested for computing DIC_II for SMs.

Computation for shared parameter models

SPMs pose a particular problem for computation of the DIC, similar to generalized linear mixed models (Example 2.2). Except in special cases (e.g., the model of Wu and Carroll, 1987), the joint distribution of (y, r) unconditional on the shared parameters,

    ∫ p(y | η, r) p(r | η) p(η) dη,

cannot be computed in closed form (i.e., the shared parameters cannot be integrated out analytically), and this integral is required to compute both DIC_I and DIC_II. To evaluate p(y, r | ω), a Monte Carlo integration could be done at each iteration to compute the likelihood integrated over η, but this would be extremely inefficient. A more computationally feasible approach is to use Laplace approximations (Tierney and Kadane, 1986) to approximate the likelihood integrated over η.
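As a concrete illustration of the Laplace step just described, the sketch below approximates one subject's integrated likelihood contribution in R. It is a generic sketch under the assumption that the log integrand is smooth and unimodal in η; the function name and interface are illustrative, and log_joint must be supplied by the user as log{p(y | η, r) p(r | η) p(η)}.

    ## Hypothetical sketch: Laplace approximation to log int exp{log_joint(eta)} d eta
    laplace_loglik <- function(log_joint, eta_init) {
      fit <- optim(eta_init, function(e) -log_joint(e), method = "BFGS", hessian = TRUE)
      q <- length(eta_init)
      ## fit$hessian is the negative Hessian of log_joint at the mode
      logdetH <- as.numeric(determinant(fit$hessian, logarithm = TRUE)$modulus)
      ## log integral ~ log_joint(mode) + (q/2) log(2*pi) - (1/2) log|H|
      -fit$value + 0.5 * q * log(2 * pi) - 0.5 * logdetH
    }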
For DIC_I, the observed data likelihood for an SPM is proportional to

    ∫∫ p(y | η, r) p(r | η) p(η) dy_mis dη.

The inner integral (over y_mis) is typically available in closed form. To evaluate the outer integral, we again recommend a Laplace approximation; this cannot be done within WinBUGS. For DIC_II, a re-weighting strategy similar to that used for SMs and PMMs can be implemented to compute E[ψ, θ | y_obs, y_mis^(a), r], with the intractable integral involving η again evaluated using a Laplace approximation.

To compare the observed data fit of SPMs to that of PMMs or SMs, the above approach is required. For comparisons among different shared parameter models, however, the likelihood conditional on the shared parameters, p(y | η, r) p(r | η), can be used (cf. the discussion of the DIC for random effects models in Chapter 3). This can be computed in WinBUGS for DIC_I (as the observed data likelihood conditional on η is available in closed form), but for DIC_II the re-weighting approaches are still necessary.

Comparison with ignorable models

Comparing the fit of MNAR and non-ignorable MAR models with ignorable MAR formulations is problematic because the mdm is not specified under ignorability. In this situation, we recommend assessing the fit of the full data response model, p(y | θ), to y_obs (as opposed to the fit of the full data model to (y_obs, r)). That is, we compute the DIC (I or II) using the likelihood based only on the full data response model, p(y | θ(ω)). For DIC_I, the observed data likelihood based on the full data response model is available in closed form for both selection and pattern mixture models (but not for SPMs). For DIC_II, re-weighting strategies (as discussed above) are needed to compute E[θ | y_obs, y_mis, r].

Summary

As discussed in Chapter 6, the relative merits and operating characteristics of the two forms of the DIC with incomplete data, DIC_I and DIC_II, need careful study, including the computational efficiency of the re-weighting approaches. For some models, the DIC can be computed entirely in WinBUGS; for others, parts of the DIC need to be computed using simple user-written functions in R. We provide examples of using the DIC in these models in Chapter 10.
8.5.2 PPL

The PPL is another criterion that can be used to assess the relative fit of models to the observed data. Unlike with the DIC, the problem of incompatible likelihoods when comparing MNAR with ignorable MAR specifications, or SPMs with SMs or PMMs, is absent when computing the PPL (as long as the relevant loss functions are not themselves likelihood-based). Since the missing data mechanism is modelled in non-ignorable models, the posterior predictive distribution for a completed dataset is

    p(y_rep, r_rep | y_obs, r) = ∫ p(y_rep, r_rep | ω) p(ω | y_obs, r) dω.    (8.44)

In the following, we focus on the Gelfand and Ghosh loss function (3.22) under squared error loss. As suggested in Chapter 6, we use this loss with a summary statistic T(y_i, r_i) for each subject i. There is a complication, however. Each sample of (y_rep, r_rep) from the posterior predictive distribution yields a replicated observed-data vector that does not have a componentwise correspondence with the observed y_obs, since r_rep ≠ r; in particular, Σ_j r_rep,ij ≠ Σ_j r_ij (i.e., a different number of observed responses). In addition, it is typically the case that E[Y_ij] is a function of t_ij. Statistics T(·, ·) should be constructed to circumvent these difficulties, for example by working with standardized residuals such as (y_ij − E(y_ij)) / Var(y_ij)^{1/2}, with some additional standardization for the number of observed responses, Σ_j r_ij.

An alternative approach is to use statistics that compare components of y and y_rep only where both are observed at each iteration; for example,

    T_i(y, r) = Σ_{j=1}^{J} w_j y_ij R*_ij,  where R*_ij = I{r_ij = r_rep,ij = 1}.

A third strategy is to condition on the observed r and proceed using statistics as described in Chapter 6; for example, define

    T_i = Σ_{j=1}^{J} w_j (y_ij − y_rep,ij) R_ij,

where the R_ij are the components of the observed r. This is similar in spirit to the recommendation for using the DIC to compare ignorable and non-ignorable models. The main conceptual problem with this approach is that it ignores the uncertainty in the random variable r and implicitly treats r differently from y, in that we replicate the entire vector y but not r. A final strategy is to use a loss based on the observed data log likelihood (a deviance loss) from the full data model; however, as mentioned earlier, this would make the PPL likelihood-based.
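The third strategy is straightforward to implement once replicated datasets are available. The following R sketch (illustrative names throughout; it is one possible implementation under the stated assumptions, not a prescribed one) computes draws of T_i using only the observed components and summarizes them with a Gelfand-Ghosh-style squared-error decomposition.

    ## Hypothetical sketch: T_i = sum_j w_j (y_ij - y_rep,ij) R_ij, conditioning
    ## on the observed r. `y` is n x J with NAs where r = 0; `y_rep` is an
    ## M x n x J array of replicates drawn from (8.44); `w` are user weights.
    ppl_summary <- function(y, r, y_rep, w = rep(1, ncol(r))) {
      M <- dim(y_rep)[1]
      Ti <- sapply(seq_len(M), function(m) {
        d <- (y - y_rep[m, , ]) * r       # zero out unobserved components
        d[is.na(d)] <- 0                  # NA * 0 cells from missing y
        as.vector(d %*% w)
      })                                  # n x M matrix of draws of T_i
      ## squared-error summary in the spirit of Gelfand and Ghosh:
      ## a fit term (T_i should be centred at zero) plus a predictive penalty
      c(G = sum(rowMeans(Ti)^2), P = sum(apply(Ti, 1, var)))
    }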
8.5.3 Posterior predictive checks

Throughout the text, we have recommended posterior predictive checks to assess the fit of specific aspects of a parametric model. In the setting of non-ignorable missingness, we typically want to assess the fit of the full data model to features of the observed data (y_obs, r), or to the observed responses y_obs or the missing data indicators r separately, depending on whether aspects of the full data model, the full data response model, or the missing data mechanism are of interest.

For the mdm, we can compute statistics that assess the agreement between r, the observed missing data indicators, and r_rep, the replicated missing data indicators. Since our focus is monotone missingness (dropout), we can compute the posterior predictive distribution of

    T_i = Σ_j r_ij − Σ_j r_rep,ij.

If the missing data mechanism fits well, we would expect this distribution to have a median around zero.
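A minimal R sketch of this check follows. The inputs (an n × J matrix r and an M × n × J array r_rep of replicates from the fitted model) are assumed to have been assembled from the MCMC output; the names are illustrative.

    ## Hypothetical sketch of the dropout check T_i = sum_j r_ij - sum_j r_rep,ij.
    mdm_check <- function(r, r_rep) {
      M <- dim(r_rep)[1]
      Tmat <- sapply(seq_len(M), function(m) rowSums(r) - rowSums(r_rep[m, , ]))
      quantile(Tmat, c(.025, .5, .975))   # median near zero suggests adequate fit
    }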
Checks on the full data response model are similar to those discussed in Chapter 6, where, for ignorable missingness, we assessed the fit of aspects of the full data response model. In particular, we recommended posterior predictive checks based on completed datasets, one reason being that under ignorability we did not (need to) specify the form of the missing data mechanism, p(r | y, ψ). However, even when we do specify the mdm, as in non-ignorable models, there are other problems in conducting checks based on replicated observed datasets, defined as y_obs,rep = {y_rep,ij : r_rep,ij = 1}. The reason is that in the original incomplete dataset we observe only the components of y corresponding to the non-zero components of r (as previously discussed in the context of the PPL), and the non-zero components of r will differ from the non-zero components of r_rep. Because we expect greater power to detect model departures with replicated observed datasets than with the replicated complete datasets discussed in Chapter 6, we suggest some approaches to deal with this.

There are several ways to conduct posterior predictive checks using replicated observed datasets, some of them mirroring the strategies suggested for the PPL. The first is to make comparisons at each iteration only for those components of y_obs whose replicated values have the corresponding components of r_rep non-zero; this implicitly attaches the indicators I{r_ij = r_rep,ij = 1} to y_ij and y_rep,ij. A second option is to compare the distributions of only the observed data, or appropriate summaries for each individual: at each iteration, we could compare the empirical cdfs of individual-level summaries of y_obs and y_obs,rep, for example ȳ_i,obs and ȳ_i,obs,rep. We might summarize the difference between the empirical cdfs by the largest deviation (the statistic used in the Kolmogorov-Smirnov test; Kolmogorov, 1933; Smirnov, 1939) and examine its distribution.

Statistics T can also be computed as functions of r and y together, essentially assessing the fit of aspects of the full data model. As an illustration, the following example suggests a possible check for shared parameter models that assesses the fit of the full-data model.

Example: Testing conditional independence in SPMs.
A common assumption in SPMs is conditional independence of (r, y) given the shared parameters η; in particular, p(r, y | η, ω) = p(r | η, ω) p(y | η, ω). We can assess this assumption using replicated complete datasets. Consider the SPM introduced earlier in this chapter (Section 8.4). We compare the correlation between y and a suitable function g(r), e.g., Σ_j r_ij, with the corresponding sample from the posterior predictive distribution. If the conditional independence assumption within the model is reasonable, the correlation between y and g(r) predicted by the model should not underestimate the correlation estimated from the observed data. If it does, this suggests that a function of Σ_j r_ij needs to be added to the full data response model.
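One concrete way to implement this check is sketched below, under the assumption that subject means over observed components are used as the summary of y on both the observed and replicated sides; all names are illustrative, and other summaries of y could be substituted.

    ## Hypothetical sketch of the conditional-independence check.
    ## `y` is n x J with NAs where r = 0; `y_rep`, `r_rep` are M x n x J
    ## arrays of replicates from the fitted SPM's MCMC output.
    ci_check <- function(y, r, y_rep, r_rep) {
      obs_mean <- function(Y, R) rowSums(Y * R, na.rm = TRUE) / rowSums(R)
      T_obs <- cor(obs_mean(y, r), rowSums(r))        # observed cor{ybar_i, g(r_i)}
      M <- dim(y_rep)[1]
      T_rep <- sapply(seq_len(M), function(m)
        cor(obs_mean(y_rep[m, , ], r_rep[m, , ]), rowSums(r_rep[m, , ])))
      mean(T_rep >= T_obs)   # small values suggest the model understates cor{y, g(r)}
    }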
8.6 Further reading

Models that directly model marginal covariate effects

Wilkins and Fitzmaurice (2006) developed what they termed a hybrid model for nonignorable longitudinal binary data. Specifically, they model the joint distribution of (y, r) using a canonical log-linear model (Fitzmaurice and Laird, 1993), which allows direct specification of the marginal means for the longitudinal binary data and for the dropout indicators. Within this setup, however, it can be hard to specify a sensible, parsimonious longitudinal dependence structure. Roy and Daniels (2006) extend earlier work by Roy (2003) on grouping dropout times into patterns using a latent class approach in the framework of mixture models. Using earlier ideas from Heagerty (1999), they construct the model to allow direct specification of the marginal means (marginalized over the random effects and the dropout distribution). See also the ordinal marginalized transition models of Keunbaik Lee referenced in Chapter 2.

Models that accommodate dropout and intermittent missingness

Several papers have discussed approaches to handle dropout and intermittent missingness simultaneously. Albert (2000) and Albert et al. (2002) constructed models with a multinomial random variable specified at each time, indicating still in study, intermittently missing, or dropped out. Lin, McCulloch, and Rosenheck (2004) deal with intermittent missingness by grouping the missing data patterns using a latent pattern mixture approach. Troxel et al. (1998) use a pseudo-likelihood approach that models the missingness indicator at time t conditional only on the potentially missing response at time t.

Semiparametric selection models for continuous longitudinal data

For continuous longitudinal responses, nonparametric specifications of the full data response model are not feasible. To gain flexibility, the error (residual) or random effects distributions can be specified nonparametrically using DPPs (e.g., Kleinman and Ibrahim, 1998). For semiparametric selection models, it is possible to saturate the model in the mdm, but prior elicitation for the full data response model would be difficult.

Model checking

An alternative approach to model checking is to examine residuals for the observed data, standardized with means and variances conditional on being observed. See Dobson and Henderson (2003) for a discussion of such approaches.
CHAPTER 9

Informative Priors and Sensitivity Analysis

9.1 Introduction

We begin this chapter with a series of quotes.

"... scientists are inveterate optimists. They ask questions of their data that their data cannot answer. Statisticians can either refuse to answer such questions, or they can explain what is needed over and above the data to yield an answer and be properly cautious in reporting their results." (Diggle, 1999)

"... with incomplete data, subjectivity should be controlled rather than avoided." (Molenberghs et al., 2001)

"Clearly the biggest challenge in conducting sensitivity analysis is the choice of one or more sensibly parameterized functions whose interpretation can be communicated to subject matter experts with sufficient clarity ..." (Scharfstein et al., 1999)

These three statements emphasize the importance of assessing the impact on inference of assumptions about which the data provide no information. A natural starting point for assessing these assumptions is MAR; the idea is then to move away from MAR in a sensible and coherent manner. In this context, we propose four principles to guide the assessment of these assumptions:

1. Delineate the missing data assumptions that cannot be verified by the data, and represent them in terms of unidentified parameters.

2. Parameterize or formulate these assumptions such that information about them can be elicited from subject matter experts (as in the Scharfstein quote) or imported from historical data, and such that there is a value corresponding to missing at random (MAR).

3. Quantify certainty or uncertainty about these non-identifiable aspects of the model.
4. Derive a coherent way to represent the sensitivity of the target inference to departures from these assumptions.

These principles can be implemented through the use of informative priors or by constructing a sensitivity analysis; in both cases, the specification is designed to characterize departures from MAR. It is important to be able to represent those features of the model and/or assumptions about which the observed data provide no information. The informative prior may be informed by expert opinion, historical data, or some combination of the two. The Bayesian setup is ideal for this because we can, at least in principle, quantify uncertainty about assumptions through these priors; it provides a coherent way to bring such information into the inference. Constructing a sensitivity analysis can be viewed as a special case of using informative priors in which we examine sensitivity to the choice of prior; in particular, we may examine the sensitivity of inference to a series of point mass priors (this corresponds to the sensitivity analysis approach of Robins and colleagues) or summarize the conditional posteriors.

The general approach proposed here is based on models that allow coherent exploration of sensitivity to missing data assumptions and that properly account for our uncertainty about them. We emphasize that any approach used should be coherent. By coherent, we mean the ability to move smoothly through the space of non-ignorable models, ideally indexed by one (or a few) nonidentifiable parameters. This can be accomplished by determining a model that fits the observed data well and then examining a continuum of specifications within that class. This stands in opposition to the incoherent approach of fitting several models that make very different assumptions about both the observed and missing data (for example, fitting a selection model, a pattern mixture model, and a shared parameter model, and assessing the sensitivity of the inferences of interest across them).

An obvious way to factor the joint distribution p(r, y | ω) that facilitates adherence to our four principles is

    p(r, y | ω) = p(y_mis | y_obs, r, ω_1) p(y_obs, r | ω_2).    (9.1)

Recall that in Chapter 8 we referred to the first component, p(y_mis | y_obs, r, ω_1), as the extrapolation distribution: the distribution of the missing data given the observed data. In the following, we refer to this factorization as the extrapolation factorization. It is often the case that some components of ω_1 are identified and some are not. As such, we partition ω_1 as ω_1 = (η, δ), where η is identified and δ is not.
Elicitation of informative priors and sensitivity analysis will then be based on δ alone. So, to properly make inferences under nonignorable dropout, we need to construct a (probability) distribution for likely values of δ, i.e., an informative prior distribution, and draw inference under this prior (Rubin, 1977).

A typical sensitivity analysis (see, e.g., Scharfstein et al., 1999) considers a range of values for the nonidentified parameters δ, fits the model at each of these values, and examines how the inference of interest varies with δ. As discussed above, this corresponds to assessing sensitivity to point mass prior specifications. Doing a sensitivity analysis using point mass priors avoids the need to specify weights for the different values in the elicited range (and beyond it); however, this approach provides neither a precise measure of uncertainty about the unverifiable missing data assumptions nor the advantage of a single inference.

In the next two sections, we provide details on how priors might be specified and describe convenient classes of models within which uncertainty about missing data assumptions can easily be addressed.

9.2 Pattern mixture models

The factorization that defines pattern mixture models makes them a natural class of models in which to conduct sensitivity analyses. Recall the PMM factorization,

    p(y, r | ω) = p(y | r, α(ω)) p(r | φ(ω)).    (9.2)

This can be further factored as

    p(y, r | ω) = p(y_mis | y_obs, r, α_1(ω)) p(y_obs | r, α_2(ω)) p(r | φ(ω)).

Here the first term is the extrapolation distribution (thus α_1 corresponds to ω_1), and the last two terms are identifiable from the observed data. This factorization therefore lends itself to conducting sensitivity analysis on the unidentified components of α_1.

Specification of priors

Recall that in (9.1), ω_2 is identified from the data, while only some components of ω_1 are: η is identified and δ is not. Sensitivity analysis and elicitation of informative priors should therefore be based on δ.
As such, we recommend the following prior formulation for the full data parameters:

    p(ω_1, ω_2) = p(η, δ, ω_2) = p(ω_2) p(η) p(δ | η),    (9.3)

where p(δ | η) is the prior that needs to be elicited. We give several examples of this in Chapter 10. This prior and the extrapolation factorization fit naturally within the framework of mixture models, with ω_2 = (φ, α_2) and ω_1 = α_1. The form of this prior is similar to those proposed in Rubin (1977), which were formulated in a non-longitudinal setting and parameterized in terms of how similar corresponding identified and non-identified parameters were across patterns. Under MAR, the component p(δ | η) is typically a point mass prior at the appropriate function of η. For elicitation (under MNAR), re-parameterization typically results in p(δ | η) = p(δ); see the examples in Chapter 10.

Example 9.1. Point mass priors under MAR.
Consider a bivariate response y_i = (y_i1, y_i2)^T with missingness in y_i2 indicated by R_i = 1. A pattern mixture formulation (refer to Example 8.5) might assume

    (y_i2 | y_i1, R_i = r) ~ N(B_0^(r) + B_1^(r) y_i1, σ_2^(r))
    (y_i1 | R_i = r) ~ N(µ_1^(r), σ_1^(r)),

where ω_2 = (µ_1^(0), µ_1^(1), σ_1^(0), σ_1^(1), B_0^(0), B_1^(0), σ_2^(0)) are the (identified) parameters of the observed data model and ω_1 = (B_0^(1), B_1^(1), σ_2^(1), B_0^(0), B_1^(0), σ_2^(0)) are the parameters of the extrapolation distribution. Moreover, ω_1 can be further partitioned into the identified parameters η = (B_0^(0), B_1^(0), σ_2^(0)), corresponding to p(y_i2 | y_i1, R_i = 0), and the unidentified parameters δ = (B_0^(1), B_1^(1), σ_2^(1)), corresponding to p(y_i2 | y_i1, R_i = 1). We saw in Section 7.3 that under nonignorable MAR,

    p(y_i2 | y_i1, R_i = 1) = p(y_i2 | y_i1, R_i = 0).    (9.4)

Equating these conditional distributions corresponds to setting δ = η, i.e., B_0^(1) = B_0^(0), B_1^(1) = B_1^(0), and σ_2^(1) = σ_2^(0). The implied prior for δ (under MAR) is

    p(δ | η) = I{δ = h(η)},    (9.5)

a point mass prior at a function h(·) of the identified parameters η, here with h(η) = (B_0^(0), B_1^(0), σ_2^(0)).
To conduct a sensitivity analysis, we can examine the conditional posteriors over a range of fixed values for δ. This corresponds to fitting the model using point mass priors for δ, p(δ | η) = I{δ = h(η)}, where h(η) is often a constant function of η. For example, in Example 9.1, set

    B_j^(1) = B_j^(0) + δ_j (j = 0, 1)  and  σ_2^(1) = ξ σ_2^(0).

In terms of a non-point-mass prior, this parameterization results in a conditional prior p(δ | η) that is independent of η, p(δ | η) = p(δ), where (with a slight abuse of notation) δ = (δ_0, δ_1, ξ). It is interesting to note that in this setting, if independent priors are specified for ξ and (δ_0, δ_1), then inference about the marginal mean of y_2 does not depend on ξ; this is mainly a result of the orthogonality between the B parameters and σ_2.
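To fix ideas, the following R sketch traces out the marginal mean of y_2 over a grid of point mass priors for (δ_0, δ_1) in the setting of Example 9.1. The identified inputs would in practice be posterior means computed from the observed data; the plug-in numbers here are purely illustrative, and ξ is omitted because it does not affect the marginal mean.

    ## Hypothetical sketch: marginal mean of y2 as a function of (delta0, delta1).
    ## Recall R = 1 indicates y2 missing in Example 9.1.
    mu2_given_delta <- function(delta0, delta1, pR1, B0, B1, mu1_R0, mu1_R1) {
      (1 - pR1) * (B0 + B1 * mu1_R0) +                  # R = 0: y2 observed, identified
        pR1 * ((B0 + delta0) + (B1 + delta1) * mu1_R1)  # R = 1: shifted extrapolation
    }
    grid <- expand.grid(delta0 = seq(-2, 2, 0.5), delta1 = seq(-0.5, 0.5, 0.25))
    grid$mu2 <- with(grid, mu2_given_delta(delta0, delta1,
                                           pR1 = 0.3, B0 = 1, B1 = 0.8,
                                           mu1_R0 = 5, mu1_R1 = 4.5))

Plotting grid$mu2 against (δ_0, δ_1) then displays, in miniature, the kind of conditional-posterior surface examined in the case studies of Chapter 10.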
9.3 Selection models

The prior (9.3) and the corresponding extrapolation factorization (9.1) are not well suited to parametric selection models. In parametric selection models (Section 7.2), all full data parameters are identified, and the nonidentified parameters δ of the extrapolation distribution are complex functions of the identified missing data mechanism parameters φ and the full data response model parameters θ. It therefore does not make sense to do a sensitivity analysis in δ, since in doing so we move away from the model that best fits the observed data. In place of prior (9.3), independent priors are typically specified for the missing data mechanism parameters φ and the full data response model parameters θ, i.e.,

    p(φ, θ) = p(φ) p(θ).    (9.6)

Using prior (9.6) in semiparametric selection models (Scharfstein et al., 2003) allows the construction of sensible sensitivity analyses, unlike the use of the same prior in parametric selection models. The key is that in semiparametric selection models, parameters that are nearly unidentified can be found. We work through a simple example to illustrate this and to clarify several other important issues along the way.

Example 9.2. Sensitivity analysis with semiparametric selection models for cross-sectional data.
Denote the response by y and the missingness indicator by r. Suppose we specify the following semiparametric selection model:

    Y ~ p(y | θ)
    logit P(R = 1 | y) = φ_0 + φ_1 y.    (9.7)

In Scharfstein et al. (2003), such a model was specified with a prior of the form (9.6). The distribution of the response was specified nonparametrically using a Dirichlet process prior (DPP) with hyperparameters θ. Clearly, the data provide no information about the missing data mechanism parameter φ_1, and the DPP on the distribution of the complete data response allows φ_1 to remain unidentified, unlike a parametric model for the full data response. As the sample size grows with the precision of the DPP held fixed, this approach shares a property of the semiparametric selection models of Robins and colleagues: implicitly, there is no imputation of missing values beyond the range of the observed responses.

Even though independent priors are specified for φ and θ, the corresponding priors on ω_1 and ω_2 (the parameters of the extrapolation factorization) are not independent (see Scharfstein et al., 2003). So, even with an independent (informative) prior on φ and a noninformative prior on θ, the data provide some minimal information about ω_1, through the identified parameters ω_2, via this dependent prior. This implies that p(φ_1) ≠ p(φ_1 | y, r), although the two are often still approximately equal in semiparametric selection models (see, e.g., Scharfstein et al., 2003).

Semiparametric selection models can therefore be a viable alternative to mixture models for sensitivity analyses. The underlying factorization into the full data response model and the missing data mechanism facilitates elicitation of priors directly on parameters of the missing data mechanism (e.g., φ_1 in (9.7)), which form a convenient set of parameters for elicitation (see Scharfstein et al., 2003, 2006). We provide an example in Chapter 10. The main stumbling block to implementation of these models is the feasibility of specifying a full data response model for (continuous) longitudinal data with many time points (cf. Chapter 8).

9.4 Elicitation of expert opinion and formulation of sensitivity analyses

As in Chapter 3, we refer the reader to O'Hagan and Garthwaite (2006), Chaloner (1996), and other references for a thorough discussion of methods for elicitation. As outlined in the previous sections and the previous chapter, elicitation of informative priors is most natural in mixture models and semiparametric selection models. In the former, elicitation is most naturally conducted on differences in parameters between patterns; see Example 9.1 for an example.
For the latter, elicitation is implemented on parameters that describe how dropout is related to changes in the unobserved responses, e.g., the hazard ratio of dropping out at time t per standard deviation change in the response; see the OASIS selection model analysis in Chapter 10.

9.5 A note on sensitivity analysis in fully parametric models

Various authors have assessed the sensitivity of inferences in incomplete data problems to distributional assumptions: on the full data response model in parametric selection models with distinct parameters (Kenward, 1998), and on the shared parameter distribution in SPMs. Such approaches do not adhere to the principles discussed in Section 9.1, as all the parameters in such models are identified. In some sense, such an exercise can be viewed as one of model selection, since one of the full data response distributions considered will fit the observed data best, and it is not an unreasonable exercise to find the best-fitting parametric model. For example, instead of assuming the full data response model is necessarily multivariate normal, more general models might be considered that contain the multivariate normal as a special case, e.g., multivariate t-distributions with the degrees of freedom estimated, or multivariate skew-normal distributions (Azzalini and Dalla Valle, 1996) with the skewness parameter(s) estimated.

9.6 Further reading

Some other approaches to sensitivity analysis focus only on local departures from MAR, typically in the context of parametric selection models. As discussed in this chapter, parametric selection models are not well suited to standard sensitivity analyses for longitudinal data with dropout. Verbeke et al. (2001) use the ideas of local influence from Cook (1986) to identify subjects that have a considerable impact on the missing data mechanism and full data response model parameters. Troxel, Ma, and Heitjan (2004) propose a measure of sensitivity to nonignorability based on a Taylor-series approximation to the parametric selection model likelihood, evaluated at the parameter estimates under ignorability. Unfortunately, both of these approaches rely on flawed parametric selection models (cf. Section 7.2).

An alternative approach has been proposed by Copas and Eguchi (2005). They suggest addressing model uncertainty, given that full data models make assumptions that are untestable from the observed data in incomplete data settings.
In particular, they suggest examining sensitivity to small departures from the model.
CHAPTER 10

Case studies: Missing not at random

10.1 Growth Hormone Study: Pattern mixture model

Overview

The growth hormone study is described in detail in Section 1.2. In short, the trial examined quadriceps strength in a cohort of 160 individuals over one year, with measurements taken at baseline, 6 months, and 12 months. Of the 160 randomized, only 111 (69%) had complete follow up. The observed data are summarized in Chapter 1. To illustrate various types of models, we confine attention to the exercise-only arm (E) (n = 40) and the growth hormone plus exercise arm (GH+E) (n = 38). On arm E, 7 individuals (18%) had only one measurement, 2 (5%) had two measurements, and 31 (78%) had all three. Missingness is more prevalent on the GH+E arm, with 12 (32%) having only one measurement, 4 (11%) having two, and 22 (58%) having three. Missingness is caused by dropout and follows a monotone pattern.

The objective is to compare mean quadriceps strength at 12 months between the two groups. We use a pattern mixture model approach. The PM model is first specified under MAR; we then elaborate the specification to allow MNAR using two additional nonidentified parameters, and we illustrate how to conduct and interpret a sensitivity analysis using these parameters.

Model specification

Our first analysis uses the pattern mixture model specified in Example 8.5. The relevant variables for the full-data model are

    (Y_1, Y_2, Y_3)^T = quadriceps strength measures at months 0, 6, 12
    Z = treatment group indicator (0 = E, 1 = GH+E)
    (R_1, R_2, R_3)^T = response indicators (1 = observed, 0 = missing).
We define patterns based on the observed follow-up time U, defined as U = Σ_j R_j. The PM specification under MAR is made separately for the observed and missing data. To avoid too many superscripts, we write normal distributions in terms of standard deviations rather than variances (e.g., σ rather than σ²). The full-data distribution, separately by pattern U ∈ {1, 2, 3} and treatment Z ∈ {0, 1}, is given as follows; components corresponding to missing data are marked by *, and their parameters are not identified:

    (Y_1 | Z = z, U = 1) ~ N(µ_z^(1), σ_z^(1))
    (Y_1 | Z = z, U = 2) ~ N(µ_z^(2), σ_z^(2))
    (Y_1 | Z = z, U = 3) ~ N(µ_z^(3), σ_z^(3))
    (Y_2 | Y_1, Z = z, U = 1) ~ N(α_0z^(1) + α_1z^(1) Y_1, τ_2z^(1))    *
    (Y_2 | Y_1, Z = z, U ≥ 2) ~ N(α_0z^(≥2) + α_1z^(≥2) Y_1, τ_2z^(≥2))
    (Y_3 | Y_1, Y_2, Z = z, U = 1) ~ N(β_0z^(1) + β_1z^(1) Y_1 + β_2z^(1) Y_2, τ_3z^(1))    *
    (Y_3 | Y_1, Y_2, Z = z, U = 2) ~ N(β_0z^(2) + β_1z^(2) Y_1 + β_2z^(2) Y_2, τ_3z^(2))    *
    (Y_3 | Y_1, Y_2, Z = z, U = 3) ~ N(β_0z^(3) + β_1z^(3) Y_1 + β_2z^(3) Y_2, τ_3z^(3))
    (U | Z = z) ~ Mult(φ_z)    (10.1)

The multinomial parameter is φ_z = (φ_1z, φ_2z, φ_3z), with components defined as φ_kz = P(U = k | Z = z) for z ∈ {0, 1}.

MAR constraints

Under MAR, the distribution of the missing data is identified using the constraints listed in (8.19) in Theorem 8.1. Denote by p_kz(y) the distribution p(y | U = k, Z = z). For the missing data in pattern U = 1, we have

    p_1z(y_2 | y_1) = p_{≥2,z}(y_2 | y_1)
    p_1z(y_3 | y_1, y_2) = p_3z(y_3 | y_1, y_2),

and for pattern U = 2 we have

    p_2z(y_3 | y_1, y_2) = p_3z(y_3 | y_1, y_2).

Because we have assumed the unidentified components of our model are normally distributed, MAR can be satisfied by equating means and variances.
For example, to identify p_1z(y_2 | y_1), we set

    α_0z^(1) + α_1z^(1) Y_1 = α_0z^(≥2) + α_1z^(≥2) Y_1
    τ_2z^(1) = τ_2z^(≥2).    (10.2)

In order for the first constraint to hold for all possible values of Y_1, we require

    α_0z^(1) = α_0z^(≥2)
    α_1z^(1) = α_1z^(≥2).    (10.3)

Similarly, to identify p_1z(y_3 | y_1, y_2) and p_2z(y_3 | y_1, y_2), we equate regression parameters and variance components as follows: for k = 1, 2,

    β_0z^(k) = β_0z^(3)
    β_1z^(k) = β_1z^(3)
    β_2z^(k) = β_2z^(3)
    τ_3z^(k) = τ_3z^(3).

All told, 22 constraints are needed to impose the MAR assumption for this model: for each treatment arm, there are 8 constraints on regression parameters and 3 on variance components.

Parameterization under MNAR

We can reparameterize the model using sensitivity parameters to embed the MAR constraint in a larger class of MNAR models for the full-data distribution. Each constraint is represented by a sensitivity parameter, denoted generically by Δ for constraints on regression parameters and by ξ for constraints on τ parameters. The regression parameter constraints in (10.3) are rewritten as

    α_0z^(1) = α_0z^(≥2) + Δ_0z^(1:2)
    α_1z^(1) = α_1z^(≥2) + Δ_1z^(1:2),

where the superscript (1:2) on Δ denotes a relation between parameters from patterns 1 and ≥2 (using a slight abuse of notation). The remaining sensitivity parameters on regression coefficients satisfy, for k = 1, 2,

    β_0z^(k) = β_0z^(3) + Δ_0z^(k:3)
    β_1z^(k) = β_1z^(3) + Δ_1z^(k:3)
    β_2z^(k) = β_2z^(3) + Δ_2z^(k:3).
For the variance components, we have

    τ_2z^(1) = ξ_2z^(1:2) τ_2z^(≥2)
    τ_3z^(1) = ξ_3z^(1:3) τ_3z^(3)
    τ_3z^(2) = ξ_3z^(2:3) τ_3z^(3).

From this point, (Δ, ξ) represents the full collection of Δ and ξ parameters listed above. The MNAR model includes MAR as a special case: setting Δ = 0 and ξ = 1 yields MAR.

At this point it is helpful to understand the model parameterization in terms of the full-data distribution. First, let (α_I, β_I, τ_I) denote the identified α, β, and τ parameters from (10.1), and let (α_NI, β_NI, τ_NI) represent the nonidentified parameters; the nonidentified parameters appear only in the distributions marked with *. Now partition ω as (ω_I, ω_NI, ω_S), where

    ω_I = (α_I, β_I, τ_I, µ, σ, φ)
    ω_NI = (α_NI, β_NI, τ_NI)
    ω_S = (Δ, ξ).

In this notation, the subscript S refers to sensitivity parameters. The nonidentified parameters ω_NI are deterministic functions of (ω_I, ω_S), but this parameterization nonetheless proves useful for specifying missing data mechanisms directly through the prior. The full-data likelihood is given by

    p(y, u | ω) = p(y | u, µ, σ, α, β, τ, Δ, ξ) p(u | φ)
                = p(y_mis | y_obs, u, µ, σ, α, β, τ, Δ, ξ) p(y_obs | u, µ, σ, α_I, β_I, τ_I) p(u | φ).

Notice that the sensitivity parameters (Δ, ξ) appear only in the missing data part of the model.

In general, the prior specification takes the form p(ω) = p(ω_NI | ω_I, ω_S) p(ω_I, ω_S). However, ω_NI is a deterministic function of (ω_S, ω_I); i.e., we have simply reparameterized (ω_I, ω_NI) in terms of (ω_I, ω_S). Hence p(ω) = p(ω_I, ω_S). Quite generally, the standard MAR constraint is imposed by any prior for ω_S with P(Δ = 0, ξ = 1) = 1. It is also possible to center p(ω_S) at MAR and introduce variance to reflect uncertainty about the assumption; for example, assuming Δ ~ N(0, νI) reflects a prior belief that MAR holds, with the strength of the belief governed by ν. To move beyond the MAR assumption, we can make alternative choices for p(Δ, ξ). A sensitivity analysis can be conducted by examining posterior inferences about the treatment effect (or any parameter of interest) over a bounded set of values (say S) for the sensitivity parameters.
Concretely, this involves summarizing posterior treatment effects based on point mass priors P(Δ = Δ*, ξ = ξ*) = 1, where (Δ*, ξ*) ∈ S. A practical question arises as to the choice of S, because its dimension can be high (in this case, there are 22 sensitivity parameters). In the analysis that follows, we first draw inferences using the MAR prior, and then illustrate the implementation of a sensitivity analysis.

Pattern mixture analysis with point mass MAR prior

The MAR prior for this example is constructed by assuming p(ω_I, ω_S) = p(ω_I) p(ω_S), placing diffuse priors on ω_I, and specifying p(ω_S) as a point mass at Δ = 0, ξ = 1. The identified parameters have the following independent priors (for simplicity we omit superscripts and subscripts, e.g., all µ parameters have the same prior; refer to (10.1) for specifics):

    µ ~ N(0, 10^6)
    α ~ N(0, 10^4)
    β ~ N(0, 10^4)
    σ² ~ N(0, 20²) I(0 < σ² < 10^4)
    τ² ~ N(0, 20²) I(0 < τ² < 10^4)
    φ ~ Dirichlet(1, 1, 1)

The priors for σ² and τ² are half-normal distributions (following the recommendations of Gelman, 2006). Posterior inference was implemented using two parallel Markov chains of 25,000 iterations each; we discarded the results from the first 5000 iterations per chain (burn-in) and based posterior inference on the remaining 40,000 draws.

Inference for the mean of each treatment group at each time point is summarized in Table 10.1. For comparison, we also include inferences from a standard multivariate normal specification (in the MVN model, a diffuse Wishart prior was used for the variance-covariance matrix of each treatment-specific distribution). As expected, the posterior means are very similar; posterior variability is slightly higher in the mixture model, which can be attributed to its larger number of parameters.

Analysis using MNAR priors

The pattern mixture model can be expanded to allow MNAR through appropriate specification of the sensitivity parameter prior p(Δ, ξ).
This prior conveys information about the degree to which the distribution of missing responses differs from that of observed responses. In general, priors on components of Δ shift the posterior means away from their MAR values, with Δ > 0 indicating that the mean response is higher for missing observations than for observed ones, relative to MAR. Under this parameterization of the PMM, the conditional variance parameters τ from the non-identified distributions do not appear in the posterior distribution of the marginal mean parameters; hence priors on components of ξ do not affect posterior inference about the marginal means.

Table 10.1 Posterior mean (s.d.) of quadriceps strength at each time point, stratified by treatment group. MVN = multivariate normal model for the joint distribution of responses; MAR = pattern mixture model under MAR constraints; MNAR-1 uses uniform priors on Δ parameters assumed common across treatments; MNAR-2 allows the Δ parameters to differ by treatment.

    Treatment   Month                 MVN         MAR         MNAR-1      MNAR-2
    E           0                     65 (4.2)    66 (4.6)    66 (4.6)    66 (4.6)
                6                     81 (4.4)    82 (4.7)    80 (4.9)    80 (4.9)
                12                       (3.7)    73 (4.0)    70 (4.3)    69 (4.6)
    GH+E        0                     69 (4.2)    69 (4.4)    69 (4.4)    69 (4.4)
                6                     81 (6.0)    81 (6.4)    78 (6.8)    78 (6.8)
                12                       (6.3)    78 (6.7)    73 (7.4)    72 (8.1)
    Difference at 12 mos.             5.7 (7.3)   5.4 (7.8)   3.1 (8.2)   2.6 (9.3)

There are two obstacles that must be overcome in specifying MNAR priors: the high dimension of the parameter space for (Δ, ξ) (in this case 22), and the need to make a specific choice of prior distribution. Even if a series of point mass priors will be used to conduct a sensitivity analysis, we must specify the range of values to include. Our approach is as follows. First, we constrain the parameter space: because our inference concerns only differences in marginal means, the sensitivity parameters ξ are not used, reducing the number of sensitivity parameters to 22 − 6 = 16. Next, we focus on departures from MAR in terms of the intercept sensitivity parameters {(Δ_0z^(1:2), Δ_0z^(1:3), Δ_0z^(2:3)) : z = 0, 1}, reducing the number of sensitivity parameters to 6. The justification for this choice is based mainly on interpretability: each Δ represents a difference between patterns in the conditional mean of Y_j given the previous Y's.
For example,

    Δ_0^(1:2) = E(Y_2 | Y_1, U = 1) − E(Y_2 | Y_1, U ≥ 2)

represents the difference in E(Y_2 | Y_1) between those with missing and those with observed Y_2 (suppressing the treatment index). Similarly,

    Δ_0^(1:3) = E(Y_3 | Y_2, Y_1, U = 1) − E(Y_3 | Y_2, Y_1, U = 3)
    Δ_0^(2:3) = E(Y_3 | Y_2, Y_1, U = 2) − E(Y_3 | Y_2, Y_1, U = 3).

As a final constraint, we assume Δ_0z^(1:3) = Δ_0z^(2:3) and denote the common parameter by Δ_0z^(·:3). The implication is that, relative to MAR, the difference in E(Y_3 | Y_1, Y_2) between those with observed and those with missing Y_3 is common across missing data patterns U = 1 and U = 2. Hence the reduced set of sensitivity parameters is given by Δ = {(Δ_0z^(1:2), Δ_0z^(·:3)) : z = 0, 1} and has dimension 4.

There are many approaches to presenting the results of a sensitivity analysis; in what follows, we represent sensitivity to departures from MAR in terms of posterior means and posterior probabilities for the treatment effect over a set of values S. To be specific, let θ = µ_31 − µ_30 denote the treatment effect, with θ > 0 if quadriceps strength on GH+E is greater than on E. The sensitivity analysis is captured by examining plots of Δ versus the posterior mean E(θ | Y_obs, U, Z, Δ) and of Δ versus the posterior probability P(θ > 0 | Y_obs, U, Z, Δ) over Δ ∈ S.

Analysis 1: Δ common across treatments

Our first sensitivity analysis assumes the Δ parameters are common between the treatments; hence, the plots will be contours of the posterior mean and posterior probability for θ over a two-dimensional domain (Δ_0^(1:2), Δ_0^(·:3)) ∈ S. To determine S, we fit the regressions [Y_2 | Y_1] and [Y_3 | Y_1, Y_2] separately by treatment group, and used the residual standard deviations to scale the range of Δ. Specifically, we assumed that, relative to MAR, quadriceps strength among those with missing responses could only be lower than among those with observed responses, and we set the maximum departure from MAR roughly equal to the largest estimated residual standard deviation. Based on Table 10.2, where the residual standard deviation ranges from 13 to 22 at time 2 and from 8.5 to 14 at time 3, we set S = [−20, 0] × [−15, 0].

To help with interpretation, it is useful to understand the effect of departures from MAR on the marginal means µ_2z = E(Y_2 | Z = z) and µ_3z = E(Y_3 | Z = z). For clarity we suppress the treatment indices.
Consider first µ_2; in the pattern mixture model,

    µ_2 = Σ_{k=1}^{3} φ_k E(Y_2 | U = k).    (10.4)

In this summation, E(Y_2 | U = 1) is not identified and is therefore a function of sensitivity parameters; specifically,

    E(Y_2 | U = 1) = E{E(Y_2 | Y_1, U = 1)}
                   = E(α_0^(1) + α_1^(1) Y_1 | U = 1)
                   = α_0^(1) + α_1^(1) µ^(1)
                   = (α_0^(≥2) + Δ_0^(1:2)) + (α_1^(≥2) + Δ_1^(1:2)) µ^(1).

Recall that we have constrained Δ_1^(1:2) = 0, so that E(Y_2 | U = 1) = α_0^(≥2) + Δ_0^(1:2) + α_1^(≥2) µ^(1). Referring back to (10.4), write µ_2 = µ_2(Δ_0^(1:2)) to emphasize its dependence on the sensitivity parameter. The effect of departures from MAR on µ_2 is captured by the difference

    µ_2(Δ_0^(1:2)) − µ_2(0) = φ_1 Δ_0^(1:2).    (10.5)

Hence the contribution of the sensitivity parameter to the shift in µ_2 is proportional to the fraction dropping out at U = 1. As with (10.4), the marginal mean of Y_3 is the weighted average µ_3 = Σ_{k=1}^{3} φ_k E(Y_3 | U = k). Because neither E(Y_3 | U = 1) nor E(Y_3 | U = 2) is identified, µ_3 depends on both sensitivity parameters; i.e., µ_3 = µ_3(Δ_0^(1:2), Δ_0^(·:3)). Calculations similar to those above can be used to show that

    µ_3(Δ_0^(1:2), Δ_0^(·:3)) − µ_3(0, 0) = (φ_1 + φ_2) Δ_0^(·:3) + φ_1 β_2^(3) Δ_0^(1:2).    (10.6)

Expressions (10.5) and (10.6) can be used in conjunction with the output in Table 10.2 to calibrate the range of the sensitivity parameters. At the boundary value (Δ_0^(1:2), Δ_0^(·:3)) = (−20, −15), we see that for z = 0 (E),

    µ_2(−20) − µ_2(0) = (7/40)(−20) = −3.5
    µ_3(−20, −15) − µ_3(0, 0) = −6.6,    (10.7)

and for z = 1 (GH+E),

    µ_2(−20) − µ_2(0) = (12/38)(−20) ≈ −6.3
    µ_3(−20, −15) − µ_3(0, 0) = −12.5.    (10.8)

Hence, when the sensitivity parameters are equivalent by treatment group, the dropout proportions determine which treatment group has greater sensitivity to departures from MAR.
The plots in Figure 10.1 show contours of the posterior means of µ_3z (z = 0, 1) and of the treatment difference θ, and of P(θ > 0). Because the dropout rate is higher on the GH+E arm, the gradient for µ_31 away from MAR is steeper than that for µ_30: the posterior mean of µ_31 ranges from 78 down to 68, while the corresponding range for µ_30 is narrower, beginning at 72.5. The posterior mean treatment effect ranges from 5 to 1, with associated posterior probabilities P(θ > 0) taking values from .74 to .56.

These plots represent conditional posterior inferences at fixed values Δ* ∈ S, and can therefore be viewed as a tool for exploring features of the unconditional posterior. They can be used as a diagnostic for assessing the range of sensitivity of conclusions to reasonable departures from assumptions. Ideally, however, posterior inference about the treatment effect should take uncertainty about Δ into account. If prior belief about the value of Δ is confined to the fact that it falls within S, and if each Δ ∈ S is assumed to have equal a priori probability, then we can assign a uniform prior over S and summarize the treatment effect using a single posterior distribution. Results from this approach are summarized in Table 10.1 under MNAR-1. As expected, the posterior marginal means shift downward. Uncertainty about Δ is reflected in the increased posterior SD of parameters that depend on Δ (the means at months 6 and 12) but, importantly, not of the fully identified parameters (the means at month 0). The posterior mean difference is 3.1 (posterior SD = 8.2), which is closer to zero but does not change the qualitative conclusion under MAR that quadriceps strength on the GH+E arm is similar to that on E.

Analysis 2: Δ differs by treatment

In the previous analysis, we assumed a common Δ across the treatment arms. However, it may be more reasonable to assume that the degree of departure from MAR differs between the two arms. Recall that the previous analysis was based on the sensitivity parameters
    Δ_0^(1:2) = E(Y_2 | Y_1, U = 1) − E(Y_2 | Y_1, U ≥ 2)
    Δ_0^(·:3) = E(Y_3 | Y_1, Y_2, U = 2) − E(Y_3 | Y_2, Y_1, U = 3)
              = E(Y_3 | Y_1, Y_2, U = 1) − E(Y_3 | Y_2, Y_1, U = 3).

In this analysis, we allow these parameters to vary by treatment but, to keep the dimension of Δ equal to 2, we combine them by assuming Δ_0z^(1:2) = Δ_0z^(·:3) = Δ_0z.

Table 10.2 Posterior mean (posterior SD) of regression parameters for pattern mixture model (10.1) fit under MAR constraints.

    Model                       Covariate   Parameter     E (z = 0)    GH+E (z = 1)
    (Y_2 | Y_1, U ≥ 2)          Int.        α_0z^(≥2)     23 (6.9)     14 (15)
                                Y_1         α_1z^(≥2)     .89 (.10)    .97 (.19)
                                            τ_2z^(≥2)
    (Y_3 | Y_1, Y_2, U = 3)     Int.        β_0z^(3)      11 (5.5)     5.3 (12)
                                Y_1         β_1z^(3)      .21 (.13)    .45 (.19)
                                Y_2         β_2z^(3)      .59 (.12)    .65 (.14)
                                            τ_3z^(3)

It is again helpful to quantify the effect of Δ_0z on the shift in the marginal means; arguing as before, it is straightforward to show that

    µ_2z(Δ_0z) − µ_2z(0) = Δ_0z φ_1z
    µ_3z(Δ_0z) − µ_3z(0) = Δ_0z {φ_2z + φ_1z (1 + β_2^(3))}.

We set S = [−20, 0] × [−20, 0], based on the largest residual standard deviation in Table 10.2. Using calculations similar to (10.7) and (10.8), the maximum shifts in µ_2z and µ_3z relative to MAR are

    µ_20(−20) − µ_20(0) = −3.5        µ_30(−20) − µ_30(0) = −6.6
    µ_21(−20) − µ_21(0) = −6.3        µ_31(−20) − µ_31(0) = −12.5.
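These four values can be verified directly from the displayed formulas by plugging in the empirical pattern proportions and the posterior means of β_2^(3) from Table 10.2; the following minimal R check assumes the empirical proportions 7/40, 2/40, 12/38, and 4/38 are the relevant plug-ins (the analysis itself uses posterior quantities, so agreement to one decimal is approximate).

    ## Quick check of the maximum shifts under Delta_0z = -20
    max_shift <- function(Delta, phi1, phi2, beta2) {
      c(mu2 = Delta * phi1,
        mu3 = Delta * (phi2 + phi1 * (1 + beta2)))
    }
    max_shift(-20, phi1 = 7/40,  phi2 = 2/40, beta2 = 0.59)  # E:    -3.5, about -6.6
    max_shift(-20, phi1 = 12/38, phi2 = 4/38, beta2 = 0.65)  # GH+E: about -6.3, -12.5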
Figure 10.2 shows the contours of the posterior mean of θ and of P(θ > 0), conditional on fixed values of (Δ_00, Δ_01) ∈ S. The contour plot of P(θ > 0) indicates the specific departures from MAR that would change the qualitative conclusions about the treatment effect. In particular, the part of the posterior surface having P(θ > 0) > .9 lies in the region where Δ_01 is near zero but Δ_00 is between −20 and −15; i.e., where dropouts on GH+E are roughly MAR but dropouts on E have quadriceps strength much lower than expected under MAR. As in Analysis 1, we can use an informative prior for Δ. For illustration we use a uniform prior on S = [−20, 0] × [−20, 0]; the posterior is summarized in Table 10.1 under MNAR-2. The results largely agree with Analysis 1, but the posterior variances are higher because of the wider range for Δ.

Summary of the pattern mixture analysis

Our strategy here has been to construct the full-data distribution as a PMM. The general model allows missingness to be MNAR, with MAR as a special case. The degree of departure from MAR is quantified by a set of sensitivity parameters (Δ, ξ). The sensitivity parameters are completely non-identified, so the fit of the model to the observed data is identical across the assumed missing data mechanisms.

Several key issues must be addressed when implementing a PMM analysis. First, the general model has a large number of sensitivity parameters; in practical settings, the dimension must be reduced, and we have given suggestions for doing so. Second, analyses based on examining conditional posteriors at fixed values of (Δ, ξ) over a pre-determined range S can be useful, but the specification of S can be subjective. Our approach has been to calibrate S using the posterior means of variance parameters from the MAR model; other approaches are certainly possible and will depend on context. It is important to ensure that choices for S do not extrapolate the missing data outside a plausible range (e.g., negative values of quadriceps strength). Third, it must be kept in mind that the contour plots generated for Analyses 1 and 2 represent conditional posteriors, and that, ideally, a final inference should be based on a well-motivated prior distribution for the non-identified sensitivity parameters. The priors for Δ used in Analyses 1 and 2 are informed solely by a range, with all values in the range weighted equally in our example. In practice, however, it is likely that more information about departures from MAR is available, whether from previous studies or from expert opinion. In the case study presented in Section 10.3, we illustrate the use of informative priors based on elicited expert opinion.
10.2 Pediatric AIDS trial: Mixture of varying coefficient models

Overview

In this example we use a mixture of varying coefficient models to analyze data from the pediatric AIDS clinical trial described in Chapter 1. Individuals were scheduled for follow up every 12 weeks, but the actual measurement times varied considerably from individual to individual. Dropout time was fixed at the time of the last observed measurement. The observed CD4 data, on the log scale, are shown in Figure 10.3, with several individual trajectories highlighted. From this small random sample of individuals, we see that those with longer follow up tend to have higher overall CD4 counts and less steep declines over time.

Our objective is to characterize the average CD4 count over time in both treatment arms, and to draw inference about the difference in mean CD4 at the end of the study period. We assume that the average log CD4 is linear in time, with slope and intercept that depend on dropout time. The conditional distributions (given dropout time) are then averaged over the distribution of dropout times, separately by treatment.

Figure 10.4 is an exploratory plot that can be used to motivate the mixture model. It shows individual-specific OLS slope estimates plotted against dropout time, and suggests a possibly nonlinear association between slope and dropout time. It should be kept in mind, however, that the slope estimates for earlier dropouts are based on fewer observations and tend to have higher variance.

Inference from the mixture of VCMs will be compared to inference under a standard regression model that assumes MAR. For a previous analysis of these data using a similar model, see Hogan et al. (2005). The Bayesian formulation given here derives mainly from Su and Hogan (2006), who provide a more extensive treatment, including a detailed sensitivity analysis.

Model specification: CD4 counts

The model used here is described more generally in Section 8.3.4; the specific formulation is similar to that given in Example 8.8. Treatment assignment is denoted by Z (1 = high-dose AZT, 0 = low-dose AZT). Let s > 0 denote time elapsed from the beginning of follow up, let Y(s) = log{CD4(s) + 1}, and let V > 0 denote the length of follow up (i.e., dropout time). Our objective is to characterize the distribution [Y(s) | Z] over the interval 0 < s < S, where S is the scheduled end of follow up.
the interval 0 < s < S, where S is the scheduled end of follow up. Figure 10.3 gives a sense of the joint distribution of Y(s) and V.

It turns out that, for the model we ultimately use, centering and rescaling the time axis leads to a more interpretable treatment effect and more stable implementation of posterior sampling. For centering point C and scale factor L, let t = (s − C)/L and U = (V − C)/L. In the application we use C = 139 and L = range(V) = max_i V_i − min_i V_i.

As with Example 8.8, we assume the process Y_i(t) is potentially observed at a set of points t_i = (t_i1, ..., t_i,n_i)^T, giving rise to the full-data vector Y_i = (Y_i(t_i1), ..., Y_i(t_i,n_i))^T = (Y_i1, ..., Y_i,n_i)^T. Conditional on dropout, we assume the log CD4 trajectory is linear in time. Let x_ij = (1, t_ij), so that X_i = (x_i1^T, ..., x_i,n_i^T)^T is the n_i × 2 design matrix for time. The full-data distribution p(y, u | z) is factored as p(y | u, z) p(u | z); the first factor follows

(Y_i | U_i = u, Z_i = z) ~ N(µ_iz(u), Σ_iz).        (10.9)

The conditional mean µ_iz(u) is an n_i × 1 vector having elements

µ_ijz(u) = E(Y_ij | u, z) = β_0z(u) + β_1z(u) t_ij,        (10.10)

where β_0z(u) and β_1z(u) are smooth but unspecified functions of u. On the rescaled time axis, the intercept β_0z(u) is the full-data mean of log CD4 count at t = 0 (i.e., at week s = C = 139) among those who drop out at time u; the slope β_1z(u) is the full-data mean change in log CD4 from baseline to end of the study among those dropping out at u. The variance Σ_iz is parameterized using a random effects structure

Σ_iz = Σ_iz(Ω, σ) = X_i Ω X_i^T + σ² I_{n_i},

where Ω is a 2 × 2 positive definite matrix and I_{n_i} is an identity matrix of dimension n_i.

The smooth functions β_0z(u) and β_1z(u) are represented in terms of penalized splines (Berry, Carroll, and Ruppert, 2002). Suppressing subscripts for now, we have

β(u; ξ, ψ) = ξ_0 + ξ_1 u + Σ_{k=1}^K ψ_k |u − s_k|³,        (10.11)

where s_1, ..., s_K are fixed knot points. In the full-data loglikelihood, we introduce a penalty term that shrinks the coefficients of (10.11) to induce smoothness; it takes the form λ φ^T D φ, where φ = (ξ, ψ). In our application we only penalize the higher-order coefficients ψ, so that

D = [ 0_{2×2}  0_{2×K}
      0_{K×2}  Q_{K×K} ],
where the (k, k′) entry of Q is |s_k − s_k′|³. Now write v(u) = (1, u) and w(u) = (|u − s_1|³, ..., |u − s_K|³), and let ψ~ = Q^{1/2} ψ and w~(u) = w(u) Q^{−1/2}. The penalized splines then have a linear model representation as

β(u; φ) = v(u) ξ + w~(u) ψ~,

where the ψ~ ~ N(0, (2λ)^{−1} I) are viewed as random effects whose variance is moderated by the scale λ associated with the penalty term. Applying this separately to each of the four functions (intercept and slope function for each treatment group), we have, for l = 0, 1 and z = 0, 1,

β_lz(u; φ) = β(u; φ_lz) = v(u) ξ_lz + w~(u) ψ~_lz,

with ψ~_lz ~ N(0, (2λ_lz)^{−1} I).

To emphasize the parameterization of the CD4 distribution conditional on dropout time, model (10.9) can be rewritten as

(Y_i | u, z) ~ N{ X_i β(u; φ_z, λ_z), Σ(Ω_z, σ_z) },

where φ_z = (ξ_0z, ξ_1z, ψ~_0z, ψ~_1z) and λ_z = (λ_0z, λ_1z). Priors for this part of the model are

ξ_lz ~ N(0, xx I)   (l = 0, 1; z = 0, 1)
(ψ~_lz | λ_lz) ~ N(0, (2λ_lz)^{−1} I)   (l = 0, 1; z = 0, 1)
λ_lz ~ IG(·, ·)   (l = 0, 1; z = 0, 1)
Ω_z^{−1} ~ Wishart(·, 3)
σ² ~ IG(·, ·).

Model specification: Dropout times

For treatment groups z = 0, 1, the distributions p(u | z) are left essentially unspecified, and we use a Bayesian nonparametric approach to characterize them. In particular, we use a Dirichlet process mixture of normal distributions, where (U_i | z) ~ N(α_iz, ν_z) and α_iz has a Dirichlet process distribution with a normal base measure. Technical details can be found in MacEachern and Müller (1998).
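As a concrete illustration of the penalized spline machinery, the sketch below fits one smooth β(u) by penalized least squares. It uses the truncated cubic basis (u − s_k)_+³, a common variant of the |u − s_k|³ basis in (10.11), for which the penalty matrix is simply the identity on the ψ block; the data and knot choices are hypothetical.

```python
import numpy as np

def spline_design(u, knots):
    """Columns for beta(u) = xi0 + xi1*u + sum_k psi_k (u - s_k)_+^3."""
    u = np.asarray(u, dtype=float)
    V = np.column_stack([np.ones_like(u), u])              # polynomial part v(u)
    W = np.maximum(u[:, None] - knots[None, :], 0.0) ** 3  # spline part w(u)
    return V, W

def fit_penalized(u, b, knots, lam):
    """Minimize ||b - V xi - W psi||^2 + lam * psi'psi; with this basis,
    D = blockdiag(0_{2x2}, I_K), so the penalty is a ridge on psi only."""
    V, W = spline_design(u, knots)
    C = np.hstack([V, W])
    D = np.diag([0.0, 0.0] + [1.0] * len(knots))
    return np.linalg.solve(C.T @ C + lam * D, C.T @ b)     # (xi0, xi1, psi)

# Hypothetical use: smooth individual OLS slopes against dropout time u.
rng = np.random.default_rng(1)
u = np.sort(rng.uniform(0.0, 1.0, 60))
b = -2.0 + 1.0 * u + rng.normal(0.0, 0.2, 60)
knots = np.quantile(u, np.linspace(0.1, 0.9, 10))
coef = fit_penalized(u, b, knots, lam=1.0)
```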
Analysis of pediatric AIDS data

The primary objective is to draw inference about the difference in mean (log) CD4 cell count at the end of follow up; denote this by θ. In the mixture model, this corresponds to the difference between treatment groups in the integrated slope functions β_1z(u),

θ = ∫ β_11(u) p_1(u) du − ∫ β_10(u) p_0(u) du.

Treatment effect is summarized using the posterior mean and standard deviation for θ and the posterior probability that θ > 0. In addition to the mixture model, we fit a standard random effects model that assumes MAR; the latter corresponds to a special case of the mixture model in which the β_lz(u) are constant functions of u.

Figure 10.5 shows posterior means for the smooth functions β_lz(u), together with posterior estimates of individual-specific intercepts and slopes. The plots of intercept versus dropout time indicate that mean CD4 is higher among those with longer follow up, with a more pronounced effect in the low dose arm; likewise, CD4 slopes are higher among those with longer follow up times.

Inferences about treatment-specific intercepts and slopes, and about treatment effect, are summarized in Table 10.3. In general, inferences under the mixture model indicate that the overall mean (intercept term) and the slope over time are lower under the MNAR assumption. The effect of the MNAR assumption (relative to MAR) is most pronounced in the low dose arm, where the association between CD4 count and dropout time is stronger (Figure 10.5). In fact, the shift in intercepts and slopes from MAR to MNAR exceeds two posterior SDs for all parameters except the intercept on high dose. The posterior SD is generally greater for the mixture model. The MNAR assumption also influences the treatment effect: under MAR, the posterior mean of θ is .37 (PSD = .21) with P(θ > 0) = .96, indicating superiority of high dose; under MNAR we have posterior mean −.01 (PSD = .25) and P(θ > 0) = .49, which is far more equivocal. A graphical summary of the difference in mean log CD4 as a function of follow up time is shown in Figure 10.6.
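The integral defining θ is straightforward to approximate by Monte Carlo once posterior draws are available. A sketch under invented slope functions and dropout distributions, both stand-ins for the fitted splines and DP mixtures:

```python
import numpy as np

rng = np.random.default_rng(2)

def slope(u, z):
    """Stand-in for the posterior-mean slope function beta_{1z}(u);
    in the real analysis this is the fitted penalized spline for arm z."""
    return -2.0 + 0.9 * u if z == 1 else -2.4 + 1.2 * u

# Fake draws from the dropout-time distributions p_z(u) on the rescaled axis.
u1 = rng.beta(3.0, 1.5, size=10_000)   # high dose arm
u0 = rng.beta(2.0, 2.0, size=10_000)   # low dose arm

# theta = int beta_{11}(u) p_1(u) du - int beta_{10}(u) p_0(u) du
theta_hat = slope(u1, 1).mean() - slope(u0, 0).mean()
```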
Table 10.3 Posterior means and standard deviations for model parameters in Example 8.8.

Dose   Parameter        REM (MAR)     VCM (MNAR)
Low    Intercept         5.6 (.13)     5.2 (.11)
       Slope            −1.6 (.14)    −2.1 (.13)
High   Intercept         5.4 (.14)     5.3 (.14)
       Slope            −1.9 (.16)    −2.1 (.19)
       Difference (θ)     .37 (.21)    −.01 (.25)
       P(θ > 0)           .96 (.19)     .49 (.50)

Summary

Studies like the pediatric AIDS trial require methods that can handle continuous-time dropout. The mixture model given here induces an MNAR dropout process by assuming the full-data response distribution, conditional on dropout time u, has mean that is linear in time; the intercept and slope parameters are smooth but unspecified functions of u. An advantage of the model is that the functional form of the missingness mechanism does not have to be known; it is characterized by functions that can be left unspecified. However, it is necessary to specify the form of E{Y(t) | u} as a function of t; here we assume it is linear, with intercept β_0(u) and slope β_1(u). A key consequence is that, conditionally on U = u, the mean of Y(t) has the same functional form before and after dropout; for the linear case, the implication is that the slope is the same. This property can be relaxed by expanding the model to accommodate sensitivity parameters that allow (for example) a different slope following dropout. For example, we can expand (10.10) so that

E(Y_ij | u, z) = β_0z(u) + β_1z(u) t_ij + ∆ (t_ij − u)_+,

where a_+ = I(a > 0). In this formulation, β_1z(u) is the slope prior to dropout time u and β_1z(u) + ∆ is the slope after u. More generally, ∆ will depend on u and on treatment group. Another generalization of the model allows the variance-covariance structure to vary smoothly with u; this can be accomplished using an appropriate parameterization of Σ. Details can be found in Su and Hogan (2006).

10.3 OASIS Study: Analysis via pattern mixture models

Data description

For each individual, the full data comprise the 4 × 1 response vector of smoking outcomes Y = (Y_1, Y_2, Y_3, Y_4)^T, obtained at months 1, 3, 6 and 12 following baseline; the corresponding vector of response indicators R = (R_1, R_2, R_3, R_4)^T; and treatment group Z (1 if randomized to enhanced intervention, 0 if standard intervention). The main objective is to compare smoking rates at month 12 between those randomized to
Z = 1 versus Z = 0. Inference is made in terms of the odds ratio

λ = odds(Y_4 = 1 | Z = 1) / odds(Y_4 = 1 | Z = 0).

The full-data model follows a mixture over dropout times U, where dropout is defined as the last observed measurement time; however, this trial has intermittent missingness, and U is defined as follows:

U = 1 if R = (1, 0, 0, 0)
U = 2 if R = (1, 1, 0, 0)
U = 3 if R ∈ {(1, 1, 1, 0), (1, 0, 1, 0)}
U = 4 if R ∈ {(1, 1, 1, 1), (1, 0, 1, 1), (1, 1, 0, 1), (1, 0, 0, 1)}

To handle intermittent missingness, we assume missingness is MAR within dropout pattern; that is, we assume p(y_mis | y_obs, u, z, r) = p(y_mis | y_obs, u, z), or that Y_mis is independent of R given (Y_obs, U, Z).

Figure 10.7 displays the number of subjects at each dropout time k and the smoking rate by dropout pattern from the observed data. For both treatment arms, those who complete the study (U = 4) show lower smoking rates than dropouts. Furthermore, for U ∈ {2, 3}, the smoking rate appears to increase preceding dropout in the ET arm. The relatively few subjects with U = 2 or U = 3 are combined into a single pattern which, with a slight abuse of notation, is labeled U = (2:3).

Model specification

The full-data response distribution is a mixture over the dropout times, factored as

p(y_mis, y_obs, u | z) = p(y_mis | y_obs, u, z) p(y_obs | u, z) p(u | z).

For the dropout distribution, we assume (U | Z = z) ~ Mult(ψ_1z, ψ_(2:3)z, ψ_4z). The component distributions follow

(Y_1 | U = k, Z = z) ~ Ber(µ^(k)_1z)
(Y_j | Y_{j−1}, U = k, Z = z) ~ Ber(φ^(k)_jz),        (10.12)

where

logit(φ^(k)_jz) = γ^(k)_jz + θ^(k)_jz y_{j−1}.        (10.13)

This model assumes that, within dropout pattern, the joint distribution of Y can be captured using a first-order serial dependence structure.
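The mapping from response indicators to dropout pattern is mechanical; a small helper implementing the definition above (since R_1 = 1 for everyone, U is simply the index of the last observed measurement):

```python
def dropout_pattern(r):
    """Map R = (R1, R2, R3, R4) to the dropout time U defined in the text."""
    if r[3] == 1:
        return 4          # (1,1,1,1), (1,0,1,1), (1,1,0,1), (1,0,0,1)
    if r[2] == 1:
        return 3          # (1,1,1,0), (1,0,1,0)
    if r[1] == 1:
        return 2          # (1,1,0,0)
    return 1              # (1,0,0,0)

# Intermittent example: observed at months 1 and 6 only, so U = 3.
assert dropout_pattern((1, 0, 1, 0)) == 3
```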
Table 10.4 Representation of the full-data model given by (10.13), on the logit scale. Empty cells cannot be identified by observed data. Treatment subscripts suppressed for clarity.

U          y   E(Y_2 | Y_1 = y)           E(Y_3 | Y_2 = y)           E(Y_4 | Y_3 = y)
U = (2:3)  0   γ^(2:3)_2                  γ^(2:3)_3
           1   γ^(2:3)_2 + θ^(2:3)_2      γ^(2:3)_3 + θ^(2:3)_3
U = 4      0   γ^(4)_2                    γ^(4)_3                    γ^(4)_4
           1   γ^(4)_2 + θ^(4)_2          γ^(4)_3 + θ^(4)_3          γ^(4)_4 + θ^(4)_4

Table 10.4 shows the structure of the model on the logit scale; empty cells represent quantities that cannot be identified by the observed data.

Identifying the model under MAR

As with the Growth Hormone example, model identification under MAR is based on condition (8.19) in Theorem 8.1. To satisfy MAR under the first-order dependence structure given by (10.13), we require

p_k(y_j | y_{j−1}, z) = p_{≥j}(y_j | y_{j−1}, z)

for each k < j and for j = 2, 3, 4. Again, this implies that the distribution (Y_j | Y_{j−1}, Z) is the same for those dropping out prior to j and those still in follow up at j. For those still in follow up at j, define φ^(≥j)_jz = E(Y_j | Y_{j−1}, Z = z, U ≥ j), and let

logit(φ^(≥j)_jz) = γ^(≥j)_jz + θ^(≥j)_jz Y_{j−1}.        (10.14)

It can be shown that, in this setting, (10.14) is compatible with the full-data model (10.13); in fact, the set of regression parameters in (10.14) is fully determined by those in (10.13).

Now consider identifying coefficients of (10.13) among those with U = k < T. The MAR constraint is satisfied when, for each j > k,

γ^(k)_jz = γ^(≥j)_jz,   θ^(k)_jz = θ^(≥j)_jz.

The MAR identification scheme is shown in Table 10.5.
Table 10.5 Identification of the full-data model (10.13) under MAR.

U          y   E(Y_2 | Y_1 = y)           E(Y_3 | Y_2 = y)           E(Y_4 | Y_3 = y)
U = 1      0   γ^(≥2)_2                   γ^(≥3)_3                   γ^(4)_4
           1   γ^(≥2)_2 + θ^(≥2)_2        γ^(≥3)_3 + θ^(≥3)_3        γ^(4)_4 + θ^(4)_4
U = (2:3)  0   γ^(2:3)_2                  γ^(2:3)_3                  γ^(4)_4
           1   γ^(2:3)_2 + θ^(2:3)_2      γ^(2:3)_3 + θ^(2:3)_3      γ^(4)_4 + θ^(4)_4
U = 4      0   γ^(4)_2                    γ^(4)_3                    γ^(4)_4
           1   γ^(4)_2 + θ^(4)_2          γ^(4)_3 + θ^(4)_3          γ^(4)_4 + θ^(4)_4

Parameterizing the full-data model under MNAR

A set of unidentified parameters can be used to reparameterize (10.13) to reflect departures from MAR. Specifically, among those who drop out at U = k < T, define, for each j > k, a pair of parameters (∆^(k)_0jz, ∆^(k)_1jz) that satisfy

γ^(k)_jz = γ^(≥j)_jz + ∆^(k)_0jz
θ^(k)_jz = θ^(≥j)_jz + (∆^(k)_1jz − ∆^(k)_0jz).

Each ∆ parameter represents a log odds ratio comparing the odds of smoking at time j, conditional on Y_{j−1}, between those who dropped out at U = k < j and those still in follow up at j. Specifically,

∆^(k)_yjz = log{ odds(Y_j = 1 | Y_{j−1} = y, Z = z, U = k) / odds(Y_j = 1 | Y_{j−1} = y, Z = z, U ≥ j) }.

Setting each ∆^(k)_yjz = 0 implies MAR. In the OASIS trial, the dimension of ∆ is 16. To reduce the parameter space, we make the simplifying assumption that ∆^(k)_yjz = ∆_y, leaving the two-dimensional parameter ∆ = (∆_0, ∆_1). The simplification implies that departures from MAR may differ depending on Y_{j−1}, but conditionally on Y_{j−1} are constant across measurement time, treatment group, and dropout time. Naturally, other simplifications can be considered, depending on the application.
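The ∆-parameterization is easy to operationalize: given the MAR-identified coefficients, a chosen (∆_0, ∆_1) produces the pattern-specific MNAR coefficients. A minimal sketch, with arbitrary parameter values:

```python
import math

def mnar_coefs(gamma_mar, theta_mar, d0, d1):
    """Apply the Delta offsets: gamma^(k) = gamma^(>=j) + Delta_0 and
    theta^(k) = theta^(>=j) + (Delta_1 - Delta_0); d0 = d1 = 0 recovers MAR."""
    return gamma_mar + d0, theta_mar + (d1 - d0)

def p_smoke(gamma, theta, y_prev):
    """P(Y_j = 1 | Y_{j-1} = y_prev) from the logistic form (10.13)."""
    return 1.0 / (1.0 + math.exp(-(gamma + theta * y_prev)))

# Delta_0 = log 2: among previous non-smokers, dropouts have twice the odds
# of smoking at the next visit relative to those still in follow up.
g, t = mnar_coefs(gamma_mar=0.5, theta_mar=1.0, d0=math.log(2.0), d1=0.0)
```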
Analysis under MAR assumptions

As a starting point, we present the analysis under the MAR assumption. First we assign a point mass at (∆_0, ∆_1) = (0, 0) for both treatment arms, which reflects the MAR assumption with 100% certainty. Next, bivariate normal distributions with mean µ = (0, 0) and covariance matrix Σ are assigned to ∆. For simplicity, we assume no correlation between ∆_0 and ∆_1. The strength of prior belief about MAR is encoded by the diagonal elements of Σ. To illustrate the effect of uncertainty about MAR, we compare inferences under two different covariance matrices, used for both treatment arms,

Σ_1 = diag(0.2², 0.2²),   Σ_2 = diag(0.8², 0.8²).

The matrix Σ_1 represents greater a priori certainty about MAR than Σ_2. The specific values in Σ_1 and Σ_2 were determined from the posterior standard deviations of the γ and θ parameters in model (10.14) under the standard MAR analysis: the standard deviations in Σ_1 and Σ_2 respectively correspond to half and twice the posterior standard deviations of the γ and θ parameters under MAR (in both treatment arms these are around 0.4 for both γ and θ). Hence the prior based on Σ_1 implies that both ∆_0 and ∆_1 lie in (−0.4, 0.4) with probability approximately 0.95; with Σ_2, the corresponding intervals are (−1.6, 1.6).

Table 10.6 shows the posterior means and standard deviations of the marginal smoking rates E(Y_t | Z = z) at each month, and the treatment odds ratio at month 12. Not surprisingly, incorporating uncertainty about MAR through Σ_1 and Σ_2 yields 95% credible intervals for treatment effect that are wider than under the point mass prior pr(∆_0 = ∆_1 = 0) = 1. Furthermore, the point mass MAR assumption and the MAR prior governed by Σ_1 (smaller variance) give almost identical results, whereas the MAR prior with larger variance (Σ_2) shows considerably more variability (see Figure 10.8).

Table 10.6 Posterior mean (sd) of marginal smoking rates E(Y_t | Z = z) at each month, and the treatment odds ratio at month 12, under the MAR assumption.

                 ET                                          ST
Month   MAR constraint  Normal Σ_1    Normal Σ_2    MAR constraint  Normal Σ_1    Normal Σ_2
1       0.83 (0.03)     0.83 (0.03)   0.83 (0.03)   0.85 (0.03)     0.85 (0.03)   0.85 (0.03)
3       0.87 (0.03)     0.87 (0.03)   0.87 (0.03)   0.85 (0.03)     0.85 (0.03)   0.85 (0.03)
6       0.82 (0.04)     0.82 (0.04)   0.81 (0.05)   0.84 (0.03)     0.84 (0.03)   0.84 (0.04)
12      0.78 (0.05)     0.78 (0.05)   0.77 (0.08)   0.88 (0.03)     0.88 (0.03)   0.88 (0.04)
OR, month 12 (95% CI):  (0.18, 1.09)  (0.20, 1.16)  (0.15, 1.50)

Analysis using informative prior distributions for ∆

This analysis is in progress.

10.4 OASIS Study: Analysis via selection models

A selection model factorization of the full-data model follows

p(y | ω) p(r | y, ω).

We will consider both parametric and semiparametric selection models here. As discussed in Chapters 8 and 9, semiparametric selection models are better suited to the use of informative priors for MNAR missingness, though they can be hard to specify for a large number of observation times and/or for continuous responses.
Parametric selection model specification

For the first factor, we assume a first-order stationary Markov model (the same model that was used within pattern in the mixture model formulation),

Y_1 ~ Ber(θ_1)
(Y_t | Y_{t−1}) ~ Ber(θ_{t|t−1}),

and impose the first-order dependence constraint via

logit(θ_{t|t−1}) = β_0t + β_1 y_{t−1}.

In general, the selection model can be allowed to depend on any aspect of the joint distribution of y, but we constrain it to depend only on the past and (possibly) current value. In particular, let R_t = 1 if Y_t is observed, and zero otherwise, and define π_t(y) = P(R_t = 1 | R_{t−1} = 1, Y = y). We assume

logit{π_t(y)} = λ_0 + λ_1 y_{t−1} + λ_2 y_t + λ_3 y_t y_{t−1}.        (10.15)

Note that λ_2 = λ_3 = 0 induces an ignorable (MAR) missing data mechanism.
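A short simulation sketch of the parametric selection model may help fix ideas; the parameter values are invented, and missingness is treated as monotone for simplicity. Setting lam2 = lam3 = 0 makes R depend only on the observed history, i.e., MAR.

```python
import numpy as np

rng = np.random.default_rng(3)
expit = lambda x: 1.0 / (1.0 + np.exp(-x))

def simulate_subject(theta1, beta0, beta1, lam, T=4):
    """One subject's (Y, R) under the first-order Markov response model and
    the missingness model (10.15); lam = (lam0, lam1, lam2, lam3)."""
    y = np.empty(T, dtype=int)
    r = np.ones(T, dtype=int)
    y[0] = rng.binomial(1, theta1)
    for t in range(1, T):
        y[t] = rng.binomial(1, expit(beta0[t] + beta1 * y[t - 1]))
        if r[t - 1] == 1:
            p_obs = expit(lam[0] + lam[1] * y[t - 1]
                          + lam[2] * y[t] + lam[3] * y[t] * y[t - 1])
            r[t] = rng.binomial(1, p_obs)
        else:
            r[t] = 0      # once missing, stay missing (monotone simplification)
    return y, r

y, r = simulate_subject(0.8, beta0=[0.0, 0.4, 0.3, 0.2], beta1=1.2,
                        lam=(1.5, 0.2, -0.5, 0.0))
```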
Semiparametric selection model specification

The missing data mechanism is specified as in (10.15), so the sensitivity parameters are again λ_2 and λ_3. Priors for these parameters were elicited indirectly, based on the mixture model formulation; see Lee and Hogan (2006). However, the full-data response model is replaced with

logit(θ_{t|t−1}) = β_0t + β_1t y_{t−1} + Σ_{k=2}^{t−1} β_{k,t} y_{t−k} + Σ_{l<j} β_{lj,t} y_{t−j} y_{t−l} + Σ_{r<s<p} β_{rsp,t} y_{t−r} y_{t−s} y_{t−p}.        (10.16)

This is a saturated model with 2⁴ = 16 parameters. To reduce the number of parameters, we center it at a first-order Markov model by placing shrinkage priors on all the parameters beyond the first-order ones:

β_{k,t} ~ N(0, τ_1), k = 2, ..., t − 1
β_{lj,t} ~ N(0, τ_2), l < j
β_{rsp,t} ~ N(0, τ_3), r < s < p.

Here we have allowed a separate variance for each of the first-, second-, and third-order interactions (τ_1, τ_2, and τ_3, respectively); note that since T = 4, interactions only go up to third order. For more details on this model, see Daniels and Scharfstein (2007). This model can be fit in WinBUGS. So the formulation here, as proposed in Chapter 8, is a nonparametric (or flexible parametric) model for the response, paired with a simple parametric model for the missing data mechanism. An alternative to (10.16), less suited to shrinkage, would be to specify a separate multinomial distribution (with 2⁴ = 16 categories) for each treatment.

Priors

The priors for the remaining parameters were specified as follows:

τ_j^{−1} ~ Γ(1, 1), j = 1, ..., 3
β_0t ~ N(0, 100), t = 2, ..., 4
β_1t ~ N(0, 100), t = 2, ..., 4
λ_j ~ N(0, 100), j = 0, 1.        (10.17)

The prior for the sensitivity parameters (λ_2, λ_3) can be found in Lee and Hogan (2006).
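To fix notation, here is a sketch of the saturated linear predictor (10.16) at t = 4 together with one draw from the shrinkage prior; the coefficient labels and the treatment of the first-order terms as fixed numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)

def eta4(y1, y2, y3, b):
    """Linear predictor (10.16) at t = 4: main effects of each lag, all
    pairwise products, and the triple product (b is a hypothetical dict)."""
    return (b["0"] + b["1"] * y3 + b["2"] * y2 + b["3"] * y1
            + b["12"] * y3 * y2 + b["13"] * y3 * y1 + b["23"] * y2 * y1
            + b["123"] * y3 * y2 * y1)

# Shrinkage toward first-order Markov: coefficients beyond the first-order
# terms get N(0, tau) priors, with tau_j^{-1} ~ Gamma(1, 1) as in (10.17).
tau = 1.0 / rng.gamma(1.0, 1.0, size=3)
b = {"0": 0.3, "1": 1.1,
     "2": rng.normal(0.0, np.sqrt(tau[0])), "3": rng.normal(0.0, np.sqrt(tau[0])),
     "12": rng.normal(0.0, np.sqrt(tau[1])), "13": rng.normal(0.0, np.sqrt(tau[1])),
     "23": rng.normal(0.0, np.sqrt(tau[1])),
     "123": rng.normal(0.0, np.sqrt(tau[2]))}
```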
Results

Posterior means and standard deviations of the marginal smoking rates by month and treatment are given in Table 10.7. Estimates of the first-order Markov parameters β_1t in (10.16) [not shown] suggested that the stationarity assumption in the parametric selection model was not reasonable (given the expert information introduced into the problem via the informative prior on (λ_2, λ_3)). The treatment OR for the semiparametric model was larger and more variable than for the parametric selection model, and was closer to the estimates obtained using the mixture model, which incorporated the same expert opinion but assumed a different full-data response model.

Table 10.7 Posterior mean (sd) of marginal smoking rates E(Y_t | Z = z) at each month, and the treatment odds ratio at month 12.

Arm   Month          Parametric SM     Semiparametric SM
ET    1              .83 (.03)         .83 (.03)
      3              .88 (.03)         .89 (.03)
      6              .84 (.04)         .84 (.04)
      12             ?? (.04)          .83 (.04)
ST    1              .85 (.03)         .85 (.03)
      3              .86 (.03)         .86 (.03)
      6              .85 (.03)         .85 (.03)
      12             ?? (.03)          .90 (.03)
OR (month 12)        .56 (.21, 1.17)   .62 (.24, 1.28)

comparison via DIC to be added
Figure 10.1 Contours of posterior means for the following parameters as a function of ∆ from Analysis 1: µ_30 and µ_31 (mean quad strength at 12 months), θ = µ_31 − µ_30 (treatment effect), and P(θ > 0).
Figure 10.2 Contours of posterior mean θ = µ_31 − µ_30 (difference in quad strength) and of P(θ > 0) as a function of ∆ = (∆_GH, ∆_E) from Analysis 2.
Figure 10.3 Summary of CD4 data on the log scale (vertical axis log_e(CD4 + 1), horizontal axis time in weeks).
Figure 10.4 Individual least squares (OLS) slopes vs. dropout times (follow-up time in weeks).
Figure 10.5 Posterior inference about the intercept and slope functions β_lz(u) vs. dropout time; panels show intercept and slope for the high and low dose arms against standardized follow-up time.
Figure 10.6 Treatment effects (difference in mean log(CD4 + 1)) under MAR and VCM.
Figure 10.7 The proportion smoking by dropout pattern at each month, stratified by treatment arm; the number in parentheses is the number of subjects in each pattern. Enhancement arm: u = 1 (54), u = 2 (12), u = 3 (16), u = 4 (67); Standard arm: u = 1 (41), u = 2 (9), u = 3 (10), u = 4 (89).
Figure 10.8 The posterior distribution of the treatment OR under different priors on (∆_0, ∆_1) (legend: MAR, Mixture, Normal(Σ_1), Normal(Σ_2)). The MAR assumption places mass at (∆_0, ∆_1) = (0, 0) with probability one.
CHAPTER 11

Application of missing data methods to problems in causal inference

11.1 Introduction

Potential outcomes: p(y_0, y_1, a | x). Here the potential outcomes are Y_0, the potential response under A = 0, and Y_1, the potential response under A = 1. We only observe one of Y_0 or Y_1. If the individual is randomized to A, then... However, it is not uncommon to be randomized to a treatment but then not comply with it. We will discuss two approaches for determining the causal effect of the treatment itself, as opposed to the intention-to-treat effect (the causal effect of being randomized to the treatment in a trial where there is non-compliance).

11.2 Framework

To understand the potential estimation problems, it is convenient to write down the full-data (structural) model, which directly models the potential outcomes,

y_ia = β_0 + β_1 a + ε_i.        (11.1)

Of course, when a = 1 we observe y_i1 and do not observe y_i0, and vice versa for a = 0. Clearly, some assumptions or constraints will be required to estimate the causal effect of interest, β_1. Since we only observe one of Y_0 and Y_1, it is common to write down the observed-data model

y_i = β_0 + β_1 A_i + ε_i        (11.2)
for an individual who was randomized to treatment A. Under perfect compliance, the estimate of β_1 from (11.2) will be an estimate of the causal effect of treatment A; under non-compliance, however, it will be a biased estimate of the causal effect. In the next two sections, we discuss two likelihood-based approaches, instrumental variables and principal stratification, for estimating causal effects such as β_1. We will also apply these approaches to data from the CTQ clinical trials.

11.3 Instrumental variables (IV)

We start with the cross-sectional setting and focus on randomized clinical trials with non-compliance. One way to identify the causal effect of treatment is to identify an instrumental variable (IV). A variable R is, with respect to model (11.1), an instrumental variable if

1. R_i is independent of ε_i;
2. E[A_i | R_i = r] is a nonconstant function of r.

If R_i is an instrument, the causal parameter β_1 can be estimated using the following set of regressions for the observed data:

Y_i = β_0 + β_1 A_i + ε_i        (11.3)
A_i = γ_0 + γ_1 R_i + δ_i.        (11.4)

It can be shown that, under a flat prior on the parameters (??), the β_1 component of the joint posterior mode is

β̂_1 = Cov(y, r) / Cov(a, r).        (11.5)

The typical semiparametric approach to estimating β_1 here is two-stage least squares (Angrist, Imbens, and Rubin, 1996), which corresponds to the MLE under normality. Notice from (11.5) that if γ_1 = 0, then the estimator for β_1 no longer exists (i.e., β_1 is no longer identified). In addition, if the relationship between the instrument r and the treatment a is weak, the estimate of β_1 will be unstable and the posterior for β_1 will have heavy tails (Lancaster, 2004, Ch. 8). Further discussion of weak instruments is given in Section 11.3.4. Recent work has proposed semiparametric Bayesian approaches (Chib and Hamilton, 2002; Conley et al., 2006).

data at one point for HERS
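The ratio form of (11.5) is one line of code; the check below simulates a randomized instrument with imperfect compliance and an unmeasured confounder (all values invented) to show the IV estimate recovering the causal effect where OLS would not.

```python
import numpy as np

def iv_estimate(y, a, r):
    """beta1_hat = Cov(y, r) / Cov(a, r), the single-instrument form of (11.5);
    numerically identical to two-stage least squares with one instrument."""
    return np.cov(y, r)[0, 1] / np.cov(a, r)[0, 1]

rng = np.random.default_rng(5)
n = 50_000
r = rng.binomial(1, 0.5, n)                           # randomization indicator
u = rng.normal(0.0, 1.0, n)                           # unmeasured confounder
a = rng.binomial(1, 0.2 + 0.5 * r + 0.2 * (u > 0))    # imperfect compliance
y = 1.0 + 2.0 * a + u + rng.normal(0.0, 1.0, n)       # true causal effect = 2
print(iv_estimate(y, a, r))   # close to 2; OLS of y on a would be biased
```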
11.3.1 Longitudinal extension

Comment on SUTVA.

For data collected over time, define A_it to be the treatment received at time t and Ā_it the history of treatment received up to and including time t. A linear structural model can be written as

Y_it(ā_t) = α_t + g(ā_t)^T θ + ε_it(ā_t),        (11.6)

where we now also allow the errors ε_it to depend on the history of treatment received. For this model, R_i is an instrumental variable if, for all t = 1, ..., T,

1. R_i is independent of Σ_{ā_t ∈ Ā_t} I(Ā_it = ā_t) ε_it(ā_t), the error actually realized under the observed treatment history;
2. E[g(Ā_it) | R_i = r] is a nonconstant function of r.

These conditions imply that R_i is an instrument for every time period. In the following, we extend this semiparametric model to a fully parametric model by assigning a multivariate normal distribution to the error terms, following the development in Chib and Hamilton (2000, 2002).

11.3.2 Parametric model for the longitudinal case

We first consider the joint distribution of the vector of potential outcomes and treatment conditional on the instrument r, p(y_0, y_1, a | r), where y_j = (y_1j, ..., y_Tj)^T. We can factor this joint distribution as

p(y_0, y_1, a | r) = p(y_0, y_1 | r) p(a | y_0, y_1, r).        (11.7)

The two conditions for an instrument above imply the following two conditions for our model:

1. p(y_0, y_1 | r) = p(y_0, y_1);
2. p(a | y_0, y_1, r) is nonconstant in r.

The first condition implies that the potential outcomes do not depend on the instrument. The second condition is that, conditional on the potential outcomes, the distribution of compliance depends on the instrument r; this is equivalent to condition 2 above, as the following calculation shows. In particular, we can write
E[a | r] = ∫ a p(a | y_0, y_1, r) p(y_0, y_1 | r) da dy_0 dy_1 = ∫ a p(a | y_0, y_1, r) p(y_0, y_1) da dy_0 dy_1.

From this expression it is clear that E[a | r] is a nonconstant function of r if and only if p(a | y_0, y_1, r) is, since E[a | r] is just a weighted average of p(a | y_0, y_1, r) with weights that do not depend on r.

We will use a multivariate probit formulation for the joint distribution of the treatment and the potential outcomes, similar to the ones used for the parametric selection models in Section 7.?. Before introducing the model, we introduce some additional notation. Let a*_it be the latent variable underlying the binary treatment indicator a_it in the probit formulation. At time t, we assume the following trivariate normal regression model:

(a*_it, y_it0, y_it1)^T ~ N( (γ_0t + γ_1t r_i, µ_0t, µ_1t)^T, Σ ),   Σ = [ 1    σ_12  σ_13
                                                                      σ_12  σ_22  σ_23
                                                                      σ_13  σ_23  σ_33 ].

We point out a few things about this model formulation. First, the (1, 1) element of Σ is fixed at one for identifiability. Second, the data provide no information about the covariance between the potential outcomes, σ_23 (Koop and Poirier, 1997), since the pair (y_it0, y_it1) is never jointly observed.

There are several ways to specify longitudinal dependence in this model. Here, mention random effects, big MVN, etc.; just include random effects, a la Chib above.

The identifiability of the covariances σ_12 and σ_13 comes through the instrument. In particular, if γ_1t = 0 (no instrument), then the covariance between a*_it and y_ita is inestimable, since we only observe y_ita when a_it = a. With the instrument, however, we can estimate the correlation between the residual (after regressing on the instrument) of a*_it and y_ita for a = 0, 1. This model can easily be extended to longitudinal binary responses using a probit formulation.
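A small simulation makes the non-identifiability of σ_23 visible: because only y_a is observed for each unit, the observed-data distribution is the same for any admissible value of Cov(y_0, y_1). All parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, gamma0, gamma1 = 100_000, -0.2, 0.8
r = rng.binomial(1, 0.5, n)

def observed_outcome(sigma23):
    """Draw (a*, y0, y1) from the trivariate normal model, then keep only
    the realized outcome y_a; sigma23 never touches the observables' law."""
    cov = np.array([[1.0, 0.3, 0.4],
                    [0.3, 1.0, sigma23],
                    [0.4, sigma23, 1.0]])
    eps = rng.multivariate_normal(np.zeros(3), cov, size=n)
    a = (gamma0 + gamma1 * r + eps[:, 0] > 0).astype(int)
    y0, y1 = 1.0 + eps[:, 1], 1.5 + eps[:, 2]
    return np.where(a == 1, y1, y0)

# Same observed-data summaries (up to Monte Carlo error) for any sigma23
# that keeps cov positive definite.
print(observed_outcome(0.0).mean(), observed_outcome(0.5).mean())
```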
11.3.3 Causal treatment effects

marginal effect recovered; extracting causal effects of interest.

The causal effect of the treatment at time t in the parametric model above, for any treatment history up to time t, is

E[Y_it1 − Y_it0] = µ_1t − µ_0t.        (11.8)

11.3.4 Weak instruments and informative priors

In the case of a weak instrument, we can obtain additional stability by using an informative prior on the regression coefficient γ_1. more from Tony's book

11.3.5 Sensitivity analyses for IV

Issue of non-identifiability of Cov(Y_0, Y_1); can do sensitivity analysis or assign a prior. Sketch possible sensitivity analysis to the exclusion restriction.

11.3.6 Some extensions

Sensitivity analysis to the exclusion restriction; causal effect for a given treatment history.

11.3.7 Case study: CTQ data or HERS?

Extension to longitudinal data with example: full HERS? or CTQ?

11.4 Principal stratification

Base on Frangakis stuff - only in binary case. Simple example? with binary with placebo-active (check FR, 2002 Biometrics). Example with Jason with CTQ. Sensitivity analysis - problem specific. Extension to longitudinal??
11.5 Further reading

Instrumental variables: continuous mixtures of normals using a DPP on the scale parameter (Chib, 2002); informative priors with weak instruments (Tony L.).

Other approaches: G-computation; Tony: quantile regression.
APPENDIX A

Notation
For a T × T symmetric matrix Σ with elements σ_jk,

vech(Σ) = (σ_11, σ_12, σ_22, σ_13, ..., σ_33, ..., σ_TT).
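For readers implementing this, a two-line helper matching the ordering above (stacking the upper triangle by columns); note this differs from the more common lower-triangle vech convention.

```python
import numpy as np

def vech(sigma):
    """(sigma_11, sigma_12, sigma_22, sigma_13, ..., sigma_TT), by columns."""
    return np.concatenate([sigma[: j + 1, j] for j in range(sigma.shape[0])])

# Example: vech(np.array([[1., 2.], [2., 3.]])) -> array([1., 2., 3.])
```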
References

[AC93] James H. Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 1993.
[AFWS02] Paul S. Albert, Dean A. Follmann, Shaohua A. Wang, and Edward B. Suh. A latent autoregressive model for longitudinal binary data subject to informative missingness. Biometrics, 58(3), 2002.
[AIR96] Joshua D. Angrist, Guido W. Imbens, and Donald B. Rubin. Identification of causal effects using instrumental variables (with discussion). Journal of the American Statistical Association, 91, 1996.
[Aka73] H. Akaike. Information theory and an extension of the maximum likelihood principle. In B.N. Petrov and F. Csaki, editors, Proceedings of the 2nd International Symposium on Information Theory. Akademiai Kiado, Budapest, 1973.
[Alb00] Paul S. Albert. A transitional model for longitudinal binary data subject to nonignorable missing data. Biometrics, 56(2), 2000.
[AV96] A. Azzalini and A. Dalla Valle. The multivariate skew-normal distribution. Biometrika, 83, 1996.
[AW00] Omar Aguilar and Mike West. Bayesian dynamic factor models and portfolio allocation. Journal of Business & Economic Statistics, 18(3), 2000.
[BA98] Kenneth P. Burnham and David Raymond Anderson. Model Selection and Inference: A Practical Information-Theoretic Approach. Springer-Verlag Inc, 1998.
[Bak95] Stuart G. Baker. Marginal regression for repeated binary data with outcome subject to non-ignorable non-response. Biometrics, 51, 1995.
[BB92] J. O. Berger and J. M. Bernardo. On the development of reference priors (Disc: P49-60). In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 4: Proceedings of the Fourth Valencia International Meeting. Clarendon Press [Oxford University Press], 1992.
[BC64] G.E.P. Box and D.R. Cox. An analysis of transformations. Journal of the Royal Statistical Society B, 26, 1964.
[BC93] N. E. Breslow and D. G. Clayton. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88:9–25, 1993.
[BCR02] Scott M. Berry, Raymond J. Carroll, and David Ruppert. Bayesian smoothing and regression splines for measurement error problems. Journal of the American Statistical Association, 97(457), 2002.
[BD05] C. Botts and M.J. Daniels. Likelihood approximations in Bayesian multiple curve fitting. Department of Statistics, University of Florida, tech report, 2005.
[Ber85] J. O. Berger. Statistical Decision Theory and Bayesian Analysis (second edition). Springer-Verlag Inc, 1985.
[Ber06] J. Bernardo. Reference analysis. Manuscript, 2006.
[BG98] Stephen P. Brooks and Andrew Gelman. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 1998.
[BK99] David J. Bartholomew and M. Knott. Latent Variable Models and Factor Analysis. Edward Arnold Publishers Ltd, 1999.
[BLZ94] Philip J. Brown, Nhu D. Le, and James V. Zidek. Inference for a covariance matrix. In Aspects of Uncertainty, pages 77–92, 1994.
[BMM00] John Barnard, Robert McCulloch, and Xiao-Li Meng. Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statistica Sinica, 10(4), 2000.
[BNC94] O. E. Barndorff-Nielsen and D. R. Cox. Inference and Asymptotics. Chapman & Hall Ltd, 1994.
[BO88] James O. Berger and Anthony O'Hagan. Ranges of posterior probabilities for unimodal priors with specified quantiles. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, editors, Bayesian Statistics 3. Clarendon Press [Oxford University Press], 1988.
[BP96] J. O. Berger and L. R. Pericchi. The intrinsic Bayes factor for linear models. In Bayesian Statistics 5: Proceedings of the Fifth Valencia International Meeting, pages 25–44, 1996.
[BRMZR97] Karen Bandeen-Roche, Diana L. Miglioretti, Scott L. Zeger, and Paul J. Rathouz. Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association, 92, 1997.
[BST05] J.O. Berger, W. Strawderman, and D. Tang. Posterior propriety and admissibility of hyperpriors in normal hierarchical models. Annals of Statistics, pages ????, 2005.
[CC96] Mary Kathryn Cowles and Bradley P. Carlin. Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association, 91, 1996.
[CE05] John Copas and Shinto Eguchi. Local model uncertainty and incomplete-data bias (with discussion). Journal of the Royal Statistical Society, Series B: Statistical Methodology, 67, 2005.
[CFRT05] G. Celeux, F. Forbes, C.P. Robert, and D.M. Titterington. Deviance information criteria for missing data models. Bayesian Analysis (in press), 2005.
[CG98a] Siddhartha Chib and Edward Greenberg. Analysis of multivariate probit models. Biometrika, 85, 1998.
[CG98b] Siddhartha Chib and Edward Greenberg. Analysis of multivariate probit models. Biometrika, 85, 1998.
[CH90] M. J. Crowder and D. J. Hand. Analysis of Repeated Measures. Chapman & Hall Ltd, 1990.
[CH00] Siddhartha Chib and Barton H. Hamilton. Bayesian analysis of cross-section and clustered data treatment models. Journal of Econometrics, 97(1):25–50, 2000.
[CH02] S. Chib and B.H. Hamilton. Semiparametric Bayes analysis of longitudinal data treatment models. Journal of Econometrics, 110:67–89, 2002.
[Cha96] Kathryn Chaloner. Elicitation of prior distributions. In Donald A. Berry and Dalene K. Stangl, editors, Bayesian Biostatistics. Marcel Dekker Inc, 1996.
[CI05] Q. Chen and J.G. Ibrahim. Semiparametric models for missing covariate and response data in regression models. Biometrics (in press), 2005.
[CISW03] Ming-Hui Chen, Joseph G. Ibrahim, Qi-Man Shao, and Robert E. Weiss. Prior elicitation for model selection and estimation in generalized linear mixed models. Journal of Statistical Planning and Inference, 111(1-2):57–76, 2003.
[CL00] Bradley P. Carlin and Thomas A. Louis. Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall Ltd, 2000.
[CLT96] Tom Y. M. Chiu, Tom Leonard, and Kam-Wah Tsui. The matrix-logarithmic covariance model. Journal of the American Statistical Association, 91, 1996.
[CM97] Cindy L. Christiansen and Carl N. Morris. Hierarchical Poisson regression modeling. Journal of the American Statistical Association, 92, 1997.
[Coo86] R. Dennis Cook. Assessment of local influence (with discussion). Journal of the Royal Statistical Society, Series B: Methodological, 48, 1986.
[CR87] D. R. Cox and N. Reid. Parameter orthogonality and approximate conditional inference (C/R: P18-39). Journal of the Royal Statistical Society, Series B: Methodological, 49:1–18, 1987.
[CR92] Paul J. Catalano and Louise M. Ryan. Bivariate latent variable models for clustered discrete and continuous outcomes. Journal of the American Statistical Association, 87, 1992.
[CR01] Vincent J. Carey and Bernard A. Rosner. Analysis of longitudinally observed irregularly timed multivariate outcomes: regression with focus on cross-component correlation. Statistics in Medicine, 20(1):21–31, 2001.
[CRC04] C.M. Crainiceanu, D. Ruppert, and R.J. Carroll. Spatially adaptive Bayesian P-splines with heteroscedastic errors. Johns Hopkins University, Dept. of Biostatistics Working Papers, Working Paper 61, 2004.
[Cre93] Noel A. C. Cressie. Statistics for Spatial Data. John Wiley & Sons, 1993.
[CRS95] R. J. Carroll, D. Ruppert, and L. A. Stefanski. Measurement Error in Nonlinear Models. Chapman and Hall, 1995.
[CRW01] Chin-Tsang Chiang, John A. Rice, and Colin O. Wu. Smoothing spline estimation for varying coefficient models with repeatedly measured dependent variables. Journal of the American Statistical Association, 96(454), 2001.
[CRW04] C.M. Crainiceanu, D. Ruppert, and M. Wand. Bayesian analysis for penalized spline regression using WinBUGS. Johns Hopkins University, Dept. of Biostatistics Working Papers, Working Paper 61, 2004.
[CRW05] C. M. Crainiceanu, D. Ruppert, and M. P. Wand. Bayesian analysis for penalized spline regression using WinBUGS. Technical report, Department of Biostatistics, Johns Hopkins University, 2005.
[Cza00] Claudia Czado. Multivariate regression analysis of panel data with binary outcomes applied to unemployment data. Statistical Papers, 41(3), 2000.
[Dan98] Michael J. Daniels. Computing posterior distributions for covariance matrices. In Sanford Weisberg, editor, Computing Science and Statistics: Dimension Reduction, Computational Complexity and Information. Proceedings of the 30th Symposium on the Interface. Interface Foundation of North America, 1998.
[Dan99a] Michael J. Daniels. A prior for the variance in hierarchical models. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 27, 1999.
[Dan99b] Michael J. Daniels. A prior for the variance in hierarchical models. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 27, 1999.
[Dan05] M.J. Daniels. Bayesian modelling of several covariance matrices. Tech report, 2005.
[Daw] A.P. Dawid.
[DG95] Gauri Sankar Datta and Jayanta Kumar Ghosh. On priors providing frequentist validity for Bayesian inference. Biometrika, 82:37–45, 1995.
[DG98] Marie Davidian and David M. Giltinan. Nonlinear Models for Repeated Measurement Data. Chapman & Hall Ltd, 1998.
[DG99] Michael J. Daniels and Constantine Gatsonis. Hierarchical generalized linear models in the analysis of variations in health care utilization. Journal of the American Statistical Association, 94:29–42, 1999.
[DG03] Marie Davidian and David M. Giltinan. Nonlinear models for repeated measurement data: an overview and update. Journal of Agricultural, Biological, and Environmental Statistics, 8(4), 2003.
[DGT94] Victor De Gruttola and Xin Ming Tu. Modelling progression of CD4-lymphocyte count and its relationship to survival time. Biometrics, 50, 1994.
[DH03] Angela Dobson and Robin Henderson. Diagnostics for joint longitudinal and dropout time modeling. Biometrics, 59(4), 2003.
[DHLZ02a] Peter Diggle, Patrick Heagerty, Kung-Yee Liang, and Scott L. Zeger. Analysis of Longitudinal Data. Oxford University Press, 2002.
[DHLZ02b] Peter Diggle, Patrick J. Heagerty, Kung-Yee Liang, and Scott L. Zeger. Analysis of Longitudinal Data, second edition. Oxford University Press, 2002.
[Dig88] Peter J. Diggle. An approach to the analysis of repeated measurements. Biometrics, 44, 1988.
[Dig99] Peter J. Diggle. Comment on "Adjusting for nonignorable drop-out using semiparametric nonresponse models". Journal of the American Statistical Association, 94, 1999.
[DK94] P. Diggle and M. G. Kenward. Informative drop-out in longitudinal data analysis (Disc: p73-93). Applied Statistics, 43:49–73, 1994.
[DK99a] Michael J. Daniels and Robert E. Kass. Nonconjugate Bayesian estimation of covariance matrices and its use in hierarchical models. Journal of the American Statistical Association, 94, 1999.
[DK99b] Michael J. Daniels and Robert E. Kass. Nonconjugate Bayesian estimation of covariance matrices and its use in hierarchical models. Journal of the American Statistical Association, 94, 1999.
[DK01a] Michael J. Daniels and Robert E. Kass. Shrinkage estimators for covariance matrices. Biometrics, 57(4), 2001.
[DK01b] Michael J. Daniels and Robert E. Kass. Shrinkage estimators for covariance matrices. Biometrics, 57(4), 2001.
[DKG01] I. DiMatteo, R.E. Kass, and C. Genovese. Bayesian curve fitting with free-knot splines. Biometrika, 88, 2001.
[DLR77] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (C/R: P22-37). Journal of the Royal Statistical Society, Series B: Methodological, 39:1–22, 1977.
[DMS98] D. G. T. Denison, B. K. Mallick, and A. F. M. Smith. Automatic Bayesian curve fitting. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 60, 1998.
[DN06] M.J. Daniels and S-L. Normand. Longitudinal profiling of health care units based on mixed multivariate patient outcomes. Biostatistics, 7:1–15, 2006.
[DP02] M.J. Daniels and M. Pourahmadi. Bayesian analysis of covariance matrices and dynamic models for longitudinal data. Biometrika, 89(3), 2002.
[Dun03] David B. Dunson. Dynamic latent trait models for multidimensional longitudinal data. Journal of the American Statistical Association, 98(463), 2003.
[DV98] Peter J. Diggle and Arūnas P. Verbyla. Nonparametric estimation of covariance structure in longitudinal data. Biometrics, 54, 1998.
[DZ03] Michael J. Daniels and Yan D. Zhao. Modeling the random effects covariance matrix in longitudinal data. Statistics in Medicine (in press), 2003.
[EC03] Lynn E. Eberly and George Casella. Estimating Bayesian credible intervals. Journal of Statistical Planning and Inference, 112(1-2), 2003.
[EG97] Mark D. Ecker and Alan E. Gelfand. Bayesian variogram modeling for an isotropic spatial process. Journal of Agricultural, Biological, and Environmental Statistics, 2, 1997.
[EM96] Paul H. C. Eilers and Brian D. Marx. Flexible smoothing with B-splines and penalties. Statistical Science, 11(2):89–121, 1996.
[ES98] Anders Ekholm and Chris Skinner. The Muscatine children's obesity data reanalysed using pattern mixture models. Journal of the Royal Statistical Society, Series C: Applied Statistics, 47, 1998.
[Esc95] Michael D. Escobar. Nonparametric Bayesian methods in hierarchical models. Journal of Statistical Planning and Inference, 43:97–106, 1995.
[Fer82] Thomas S. Ferguson. Sequential estimation with Dirichlet process priors. In Shanti S. Gupta and James O. Berger, editors, Statistical Decision Theory and Related Topics III, Volume 1. Academic Press, 1982.
[Fir87] D. Firth. Discussion of "Parameter orthogonality and approximate conditional inference". Journal of the Royal Statistical Society, Series B: Methodological, 49:22–23, 1987.
[FL93] Garrett M. Fitzmaurice and Nan M. Laird. A likelihood-based method for analysing longitudinal binary responses. Biometrika, 80, 1993.
[FL97] Garrett M. Fitzmaurice and Nan M. Laird. Regression models for mixed discrete and continuous responses with potentially missing values. Biometrics, 53, 1997.
[FLL94] Garrett M. Fitzmaurice, Nan M. Laird, and Stuart R. Lipsitz. Analysing incomplete longitudinal binary responses: a likelihood-based approach. Biometrics, 50, 1994.
[FLW04] Garret M. Fitzmaurice, Nan M. Laird, and James H. Ware. Applied Longitudinal Analysis. Wiley Interscience, 2004.
[FLZ96] Garrett M. Fitzmaurice, Nan M. Laird, and Gwendolyn E. P. Zahner. Multivariate logistic models for incomplete binary responses. Journal of the American Statistical Association, 91:99–108, 1996.
[FM03] Emilio Ferrer and John J. McArdle. Alternative structural models for multivariate longitudinal data analysis. Structural Equation Modeling, 10(4), 2003.
[FML95] Garrett M. Fitzmaurice, Geert Molenberghs, and Stuart R. Lipsitz. Regression models for longitudinal binary responses with informative dropouts. Journal of the Royal Statistical Society, Series B: Methodological, 57, 1995.
[FT96] Cheryl L. Faucett and Duncan C. Thomas. Simultaneously modelling censored survival data and repeatedly measured covariates: a Gibbs sampling approach. Statistics in Medicine, 15, 1996.
[Ful87] W. A. Fuller. Measurement Error Models. Wiley, 1987.
[Ful96] Wayne A. Fuller. Introduction to Statistical Time Series. John Wiley & Sons, 1996.
[FW95] Dean Follmann and Margaret Wu. An approximate generalized linear model with random effects for informative missing data (Corr: 97V53 p384). Biometrics, 51, 1995.
[GA01] Ralitza V. Gueorguieva and Alan Agresti. A correlated probit model for joint modeling of clustered binary and continuous responses. Journal of the American Statistical Association, 96(455), 2001.
[GDC92] A. E. Gelfand, D. K. Dey, and H. Chang. Model determination using predictive distributions, with implementation via sampling-based methods (with discussion). In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 4: Proceedings of the Fourth Valencia International Meeting. Clarendon Press [Oxford University Press], 1992.
[Gel05] A. Gelman. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis (in press), 2005.
[GG84] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 1984.
[GG98] Alan E. Gelfand and Sujit K. Ghosh. Model choice: a minimum posterior predictive loss approach. Biometrika, 85:1–11, 1998.
[GH97] Robert D. Gibbons and Donald Hedeker. Random effects probit and logistic regression models for three-level data. Biometrics, 53, 1997.
[GH03] Malay Ghosh and Jungeun Heo. Default Bayesian priors for regression models with first-order autoregressive residuals. Journal of Time Series Analysis, 24(3), 2003.
[GKO05] P.H. Garthwaite, J.B. Kadane, and A. O'Hagan. Statistical methods for eliciting probability distributions. Journal of the American Statistical Association, 100, 2005.
[GM97] E. I. George and R. E. McCulloch. Approaches for Bayesian variable selection. Statistica Sinica, 7, 1997.
[GMW04] P. Gustafson, Y.C. MacNab, and S. Wen. On the value of derivative evaluations and random walk suppression in Markov chain Monte Carlo algorithms. Statistics and Computing, 14:23–38, 2004.
[GR92] Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 1992.
[Gre95] Peter J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 1995.
[GS90] Alan E. Gelfand and Adrian F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 1990.
[Guo04] Wensheng Guo. Functional data analysis in longitudinal settings using smoothing splines. Statistical Methods in Medical Research, 13(1):49–62, 2004.
[Gus97] Paul Gustafson. Large hierarchical Bayesian analysis of multivariate survival data. Biometrics, 53, 1997.
[GVMV+05] A. Gelman, I. Van Mechelen, G. Verbeke, D.F. Heitjan, and M. Meulders. Multiple imputation for model checking: completed-data plots with missing latent data. Biometrics, 61:74–85, 2005.
[GW92] W. R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 41, 1992.
[Has70] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.
[HC96] James P. Hobert and George Casella. The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91, 1996.
[HDD00] Robin Henderson, Peter Diggle, and Angela Dobson. Joint modelling of longitudinal measurements and event time data. Biostatistics, 1(4), 2000.
[Hea99] Patrick J. Heagerty. Marginally specified logistic-normal models for longitudinal binary data. Biometrics, 55, 1999.
[Hea02] Patrick J. Heagerty. Marginalized transition models and likelihood inference for longitudinal categorical data. Biometrics, 58(2), 2002.
[Hec79] James J. Heckman. Sample selection bias as a specification error. Econometrica, 47, 1979.
[HG94] Donald Hedeker and Robert D. Gibbons. A random-effects ordinal regression model for multilevel analysis. Biometrics, 50, 1994.
[HK01] P.J. Heagerty and B.F. Kurland. Misspecified maximum likelihood estimates and generalised linear mixed models. Biometrika, 88, 2001.
[HK04] P. Heagerty and B. Kurland. Marginalized transition models for longitudinal binary data with ignorable and non-ignorable drop-out. Statistics in Medicine, 23, 2004.
[HL97] Joseph W. Hogan and Nan M. Laird. Mixture models for the joint distribution of repeated measures and event times. Statistics in Medicine, 16, 1997.
[HL04a] J.W. Hogan and T. Lancaster. Instrumental variables and marginal structural models for causal inference from longitudinal data. Statistical Methods in Medical Research, 13:17–48, 2004.
[HL04b] J.W. Hogan and J.Y. Lee. Structural quantile models for longitudinal observational studies with time-varying treatment. Statistica Sinica, 14, 2004.
[Hod98] James S. Hodges. Some algebra and geometry for hierarchical models, applied to diagnostics (with discussion). Journal of the Royal Statistical Society, Series B: Statistical Methodology, 60, 1998.
[HRK04] J.W. Hogan, J. Roy, and C. Korkontzelou. Biostatistics tutorial: handling dropout in longitudinal studies. Statistics in Medicine, 23, 2004.
[HRWY98] Donald R. Hoover, John A. Rice, Colin O. Wu, and Li-Ping Yang. Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika, 85, 1998.
[HTF01] Trevor Hastie, Robert Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag Inc, 2001.
[HW01] J.W. Hogan and F. Wang. Model selection for longitudinal repeated measures subject to informative dropout. Technical report, 2001.
[HZ00] Patrick J. Heagerty and Scott L. Zeger. Marginalized multilevel models and likelihood inference (with comments and a rejoinder by the authors). Statistical Science, 15(1):1–26, 2000.
[IC00] Joseph G. Ibrahim and Ming-Hui Chen. Power prior distributions for regression models. Statistical Science, 15(1):46–60, 2000.
[ID04] O. Ilk and M.J. Daniels. Marginalized transition random effects models for multivariate longitudinal binary data. Tech report, 2004.
[Jef61] Harold Jeffreys. Theory of Probability, 3rd edition. Oxford University Press, 1961.
[JHS+03] C. Y. Jones, J. W. Hogan, B. Snyder, R. S. Klein, A. Rompalo, P. Schuman, and C. C. J. Carpenter. Overweight and human immunodeficiency virus (HIV) in women: associations between HIV progression and changes in body mass index in women in the HIV Epidemiology Research Study cohort. Clinical Infectious Diseases, 2003.
[Jon93] R. H. Jones. Longitudinal Data with Serial Correlation: A State-Space Approach. Chapman and Hall, 1993.
[JS86] Robert I. Jennrich and Mark D. Schluchter. Unbalanced repeated-measures models with structured covariance matrices. Biometrics, 42, 1986.
[KDW+80] Joseph B. Kadane, James M. Dickey, Robert L. Winkler, Wayne S. Smith, and Stephen C. Peters. Interactive elicitation of opinion for a normal linear model. Journal of the American Statistical Association, 75, 1980.
[Ken98] Michael G. Kenward. Selection models for repeated measurements with non-random dropout: an illustration of sensitivity. Statistics in Medicine, 17, 1998.
[KH04] B.F. Kurland and P.J. Heagerty. Marginalized transition models for longitudinal binary data with ignorable and non-ignorable drop-out. Statistics in Medicine, 23, 2004.
[KHM03] H. Ko, J.W. Hogan, and K.H. Mayer. Estimating causal treatment effects from longitudinal HIV natural history studies using marginal structural models. Biometrics, 59, 2003.
[KI98a] Ken P. Kleinman and Joseph G. Ibrahim. A semi-parametric Bayesian approach to generalized linear mixed models. Statistics in Medicine, 17, 1998.
[KI98b] Ken P. Kleinman and Joseph G. Ibrahim. A semiparametric Bayesian approach to the random effects model. Biometrics, 54, 1998.
[KL51] S. Kullback and R. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.
[KMT03] M.G. Kenward, G. Molenberghs, and H. Thijs. Pattern-mixture models with proper time dependence. Biometrika, 90:53–71, 2003.
[KP97] Gary Koop and Dale J. Poirier. Learning about the across-regime correlation in switching regression models. Journal of Econometrics, 78, 1997.
[KPR+98] D.P. Kiel, J. Puhl, C.J. Rosen, K. Berg, J.B. Murphy, and D.B. MacLean. Lack of association between insulin-like growth factor I and body composition, muscle strength, physical performance or self-reported mobility among older persons with functional limitations. Journal of the American Geriatrics Society, 46, 1998.
[KW95] Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association, 90, 1995.
[KW96] Robert E. Kass and Larry Wasserman. The selection of prior distributions by formal rules (Corr: 1998V93 p412). Journal of the American Statistical Association, 91, 1996.
[KW98] Joseph B. Kadane and Lara J. Wolfson. Experiences in elicitation (Disc: P55-68). Journal of the Royal Statistical Society, Series D: The Statistician, 47:3–19, 1998.
[Lai04] N. M. Laird. Analysis of Longitudinal and Cluster-Correlated Data. Institute of Mathematical Statistics, 2004.
[Lan04] Tony Lancaster. An Introduction to Modern Bayesian Econometrics. Blackwell, 2004.
[Lav92] Michael Lavine. Some aspects of Polya tree distributions for statistical modelling. The Annals of Statistics, 20, 1992.
[Lav94] Michael Lavine. More aspects of Polya tree distributions for statistical modelling. The Annals of Statistics, 22, 1994.
[LB01] Jaeyong Lee and James O. Berger. Semiparametric Bayesian analysis of selection models. Journal of the American Statistical Association, 96(456), 2001.
[LC01] Xihong Lin and Raymond J. Carroll. Semiparametric regression for clustered data. Biometrika, 88(4), 2001.
[LD05] X. Liu and M.J. Daniels. Joint modelling of a continuous and binary longitudinal process. Tech report, 2005.
[Leh83] E.L. Lehmann. Theory of Point Estimation. Wadsworth and Brooks/Cole, 1983.
[LH92] Tom Leonard and John S. J. Hsu. Bayesian inference for a covariance matrix. The Annals of Statistics, 20, 1992.
[LI96] Purushottam W. Laud and Joseph G. Ibrahim. Predictive specification of prior model probabilities in variable selection. Biometrika, 83, 1996.
[Lin99] James K. Lindsey. Models for Repeated Measurements. Oxford University Press, 1999.
[Lit94] Roderick J. A. Little. A class of pattern-mixture models for normal incomplete data. Biometrika, 81, 1994.
[Liu01] Chuanhai Liu. Comment on "The art of data augmentation" (Pkg: P1-111). Journal of Computational and Graphical Statistics, 10(1):75–81, 2001.
[LLM04] J.C. Liechty, M.W. Liechty, and P. Müller. Bayesian correlation estimation. Biometrika, 91:1–14, 2004.
[LLT89] Kenneth L. Lange, Roderick J. A. Little, and Jeremy M. G. Taylor. Robust statistical modeling using the t-distribution. Journal of the American Statistical Association, 84, 1989.
[LMR04] Haiqun Lin, Charles E. McCulloch, and Robert A. Rosenheck. Latent pattern mixture models for informative intermittent missing data in longitudinal studies. Biometrics, 60(2), 2004.
[LR98] Chuanhai Liu and Donald B. Rubin. Ellipsoidally symmetric extensions of the general location model for mixed categorical and continuous data. Biometrika, 85, 1998.
[LR99] Roderick J. Little and Donald B. Rubin. Comment on "Adjusting for nonignorable drop-out using semiparametric nonresponse models". Journal of the American Statistical Association, 94, 1999.
[LR02] Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, 2002.
[LVV02] Philippe Lambert and François Vandenhende. A copula-based model for multivariate non-normal longitudinal data: analysis of a dose titration safety study on a new antidepressant. Statistics in Medicine, 21(21), 2002.
[LW82] Nan M. Laird and James H. Ware. Random-effects models for longitudinal data. Biometrics, 38, 1982.
[LW96] Roderick J. A. Little and Yongxiao Wang. Pattern-mixture models for multivariate incomplete data with covariates. Biometrics, 52:98–111, 1996.
[LW99] Jun S. Liu and Ying Nian Wu. Parameter expansion for data augmentation. Journal of the American Statistical Association, 94, 1999.
[LWC03] Hua Liang, Hulin Wu, and Raymond J. Carroll. The relationship between virologic and immunologic responses in AIDS clinical research using mixed-effects varying-coefficient models with measurement error. Biostatistics (Oxford), 4(2), 2003.
[LWK94] Jun S. Liu, Wing Hung Wong, and Augustine Kong. Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika, 81:27–40, 1994.
[LWP99] Xinhua Liu, Christine Waternaux, and Eva Petkova. Influence of human immunodeficiency virus infection on neurological impairment: an analysis of longitudinal binary data with informative drop-out. Journal of the Royal Statistical Society, Series C: Applied Statistics, 48, 1999.
[LZ86] Kung-Yee Liang and Scott L. Zeger. Longitudinal data analysis using generalized linear models. Biometrika, 73:13–22, 1986.
[LZ99] Xihong Lin and Daowen Zhang. Inference in generalized additive mixed models by using smoothing splines. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 61, 1999.
[MAK+99] B.H. Marcus, A.E. Albrecht, T.K. King, A.F. Parisi, B.M. Pinto, M. Roberts, R.S. Niaura, and D.B. Abrams. The efficacy of exercise as an aid for smoking cessation in women: a randomized controlled trial. Archives of Internal Medicine, 159, 1999.
[MC05] G. Müller and C. Czado. An autoregressive ordered probit model with application to high-frequency financial data. Journal of Computational and Graphical Statistics, 14(2), 2005.
[MG97] Rahul Mukerjee and Malay Ghosh. Second-order probability matching priors. Biometrika, 84, 1997.
[MHS+03] K. H. Mayer, J. W. Hogan, D. Smith, R. S. Klein, P. Schuman, J. B. Margolick, C. Korkontzelou, H. Farzedegan, D. Vlahov, and C. C. J. Carpenter. Clinical and immunologic progression in HIV-infected US women before and after the introduction of highly active antiretroviral therapy. Journal of Acquired Immune Deficiency Syndromes, 2003.
[Mig03] Diana L. Miglioretti. Latent transition regression for mixed outcomes. Biometrics, 59(3), 2003.
[MKL97] G. Molenberghs, M. G. Kenward, and E. Lesaffre. The analysis of longitudinal ordinal data with nonrandom drop-out. Biometrika, 84:33–44, 1997.
[MLH+05] B. H. Marcus, B. Lewis, J.W. Hogan, T. K. King, A.E. Albrecht, B. Bock, and A.F. Parisi. The efficacy of moderate-intensity exercise as an aid for smoking cessation in women: a randomized controlled trial. Nicotine and Tobacco Research, 2005.
[MMKD98] G. Molenberghs, B. Michiels, M. G. Kenward, and P. J. Diggle. Monotone missing data and pattern-mixture models. Statistica Neerlandica, 52, 1998.
[MN89] P. McCullagh and John A. Nelder. Generalized Linear Models. Chapman & Hall Ltd, 1989.
[MN99] P. McCullagh and John A. Nelder. Generalized Linear Models. Chapman & Hall Ltd, 1999.
[Mon83] John F. Monahan. Fully Bayesian analysis of ARMA time series models. Journal of Econometrics, 21, 1983.
[MR87] Bryan F. J. Manly and J. C. W. Rayner. The comparison of sample covariance matrices using likelihood ratio tests. Biometrika, 74, 1987.
[Nat01] Ranjini Natarajan. On the propriety of a modified Jeffreys's prior for variance components in binary random effects models. Statistics & Probability Letters, 51(4), 2001.
[Nel99] Roger B. Nelsen. An Introduction to Copulas. Springer-Verlag Inc, 1999.
[NK00] Ranjini Natarajan and Robert E. Kass. Reference Bayesian methods for generalized linear mixed models. Journal of the American Statistical Association, 95(449), 2000.
[NM95] Ranjini Natarajan and Charles E. McCulloch. A note on the existence of the posterior distribution for a class of mixed models for binomial responses. Biometrika, 82, 1995.
[NMK00] Ranjini Natarajan, Charles E. McCulloch, and Nicholas M. Kiefer. A Monte Carlo EM method for estimating multinomial probit models. Computational Statistics & Data Analysis, 34(1):33–50, 2000.
[NnANAZ00a] Vincente Núñez-Antón and Dale L. Zimmermann. Modeling nonstationary longitudinal data. Biometrics, 56(3), 2000.
[NnANAZ00b] Vincente Núñez-Antón and Dale L. Zimmermann. Modeling nonstationary longitudinal data. Biometrics, 56(3), 2000.
[OD04] S.M. O'Brien and D.B. Dunson. Bayesian multivariate logistic regression. Biometrics, 60, 2004.
[OH98] A. O'Hagan. Eliciting expert beliefs in substantial practical applications (Disc: P55-68). Journal of the Royal Statistical Society, Series D: The Statistician, 47:21–35, 1998.
[OO02] J.E. Oakley and A. O'Hagan. Uncertainty in prior elicitations: a nonparametric approach. Research Report No. 521/02, University of Sheffield, Dept. of Probability and Statistics, 2002.
[PD02] M. Pourahmadi and M. J. Daniels. Dynamic conditionally linear mixed models for longitudinal data. Biometrics, 58(1), 2002.
[PDP05] M. Pourahmadi, M.J. Daniels, and T. Park. Simultaneous modelling of several covariance matrices using the modified Choleski. Tech report, 2005.
[Pou99] Mohsen Pourahmadi. Joint mean-covariance models with applications to longitudinal data: unconstrained parameterisation. Biometrika, 86, 1999.
[Pou00] Mohsen Pourahmadi. Maximum likelihood estimation of generalised linear models for multivariate normal covariance matrix. Biometrika, 87(2), 2000.
[Pre03] S. James Press. Subjective and Objective Bayesian Statistics: Principles, Models, and Applications. John Wiley & Sons, 2003.
[RA01] Beth A. Reboussin and James C. Anthony. Latent class marginal regression models for modelling youthful drug involvement and its suspected influences. Statistics in Medicine, 20(4), 2001.
[RC00] David Ruppert and Raymond J. Carroll. Spatially-adaptive penalties for spline fitting. Australian & New Zealand Journal of Statistics, 42(2), 2000.
[RL00] Jason Roy and Xihong Lin. Latent variable models for longitudinal data with multiple continuous outcomes. Biometrics, 56(4), 2000.
[RLR99] Beth A. Reboussin, Kung-Yee Liang, and David M. Reboussin. Estimating equations for a latent transition model with multiple discrete indicators. Biometrics, 55, 1999.
[RR97] James M. Robins and Ya'acov Ritov. Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semi-parametric models. Statistics in Medicine, 16, 1997.
[RRS98] Andrea Rotnitzky, James M. Robins, and Daniel O. Scharfstein. Semiparametric regression for repeated outcomes with nonignorable nonresponse. Journal of the American Statistical Association, 93, 1998.
[RS97] G. O. Roberts and S. K. Sahu. Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society, Series B: Methodological, 59, 1997.
[RT02] H. J. Ribaudo and S. G. Thompson. The analysis of repeated multivariate binary quality of life data: a hierarchical model approach. Statistical Methods in Medical Research, 11(1):69–83, 2002.
[Rub77] Donald B. Rubin. Formalizing subjective notions about the effect of nonrespondents in sample surveys. Journal of the American Statistical Association, 72, 1977.
[Rub87] Donald B. Rubin. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, 1987.
[RWC03] D. Ruppert, M. P. Wand, and R. Carroll. Semiparametric Regression. Cambridge University Press, 2003.
[SBCvdL02] David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin, and Angelika van der Linde. Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society, Series B: Statistical Methodology, 64(4), 2002.
[Sch92] Mark D. Schluchter. Methods for the analysis of informatively censored longitudinal data (with discussion). Statistics in Medicine, 11, 1992.
[Sch97a] Joseph L. Schafer. Analysis of Incomplete Multivariate Data. Chapman & Hall Ltd, 1997.
[Sch97b] Mark J. Schervish. Theory of Statistics. Springer-Verlag Inc, 1997.
290 284 REFERENCES [SDR03] D.O. Scharfstein, M.J. Daniels, and J. Robins. Incorporating prior beliefs about selection bias into the analysis of randomized trials with missing outcomes. Biostatistics, 4: , [SH05] Li Su and J. W. Hogan. Likelihood-based semiparametric marginal regression for continuously measured longitudinal binary outcomes. Technical report, Center for Statistical Sciences, Brown University, [SHCD06] D.O. Scharfstein, M.E. Halloran, H. Chu, and M.J. Daniels. On estimation of vaccine efficacy using validation samples with selection bias. Biostatistics, 7:????, [SI03] Daniel O. Scharfstein and Rafael A. Irizarry. Generalized additive selection models for the analysis of Studies with potentially nonignorable missing outcome data. Biometrics, 59(3): , [SL96] Glen A. Satten and Jr Longini, Ira M. Markov chains with measurement error: Estimating the true course of a marker of the progression of human immunodeficiency virus disease (Disc: P ). Applied Statistics, 45: , [SLR99] Mary Sammel, Xihong Lin, and Louise Ryan. Multivariate linear mixed models for multiple outcomes. Statistics in Medicine, 18: , [SLW84] Robert Stiratelli, Nan Laird, and James H. Ware. Random-effects models for serial observations with binary response. Biometrics, 40: , [SR96] Mary Dupuis Sammel and Louise M. Ryan. Latent variable models with fixed effects. Biometrics, 52: , [SRL97] Mary Dupuis Sammel, Louise M. Ryan, and Julie M. Legler. Latent variable models for mixed discrete and continuous outcomes. Journal of the Royal Statistical Society, Series B: Methodological, 59: , [SRR99] Daniel O. Scharfstein, Andrea Rotnitzky, and James M. Robins. Adjusting for nonignorable drop-out using semiparametric nonresponse models (C/R: p ). Journal of the American Statistical Association, 94: , [SS05] X. Sun and D Sun. Estimation of a multivariate normal covariance matrix with staircase pattern data. Tech report, [STC97] J. P. Sy, J. M. G. Taylor, and W. G. Cumberland. A stochastic model for the analysis of bivariate longitudinal AIDS data. Biometrics, 53: , [Sut03] Brajendra C. Sutradhar. An overview on regression models for discrete longitudinal responses. Statistical Science, 18(3): , [Tan96] Martin Abba Tanner. Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. Springer- Verlag Inc, [TCS94] J. M. G. Taylor, W. G. Cumberland, and J. P. Sy. A stochastic model for analysis of longitudinal AIDS data. Journal of the American Statistical Association, 89: , [TD04] A.A. Tsiatis and M. Davidian. Joint modeling of longitudinal and timeto-even data: An overview. Statistica Sinica,??:????, [TG03] M. Trevisani and A.E. Gelfand. Inequalities between expected marginal
291 REFERENCES 285 log-likelihoods, with implications for likelihood-based model complexity and comparison measures. Canadian Journal of Statistics, 31: , [Tie94] Luke Tierney. Markov chains for exploring posterior distributions (Disc: P ). The Annals of Statistics, 22: , [TK86] Luke Tierney and Joseph B. Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81:82 86, [TLH98] Andrea B. Troxel, Stuart R. Lipsitz, and David P. Harrington. Marginal models for the analysis of longitudinal measurements with nonignorable non-monotone missing data. Biometrika, 85: , [TMH04] Andrea B. Troxel, Guoguang Ma, and Daniel F. Heitjan. An index of local sensitivity to nonignorability. Statistica Sinica, 14(4): , [TW87] Martin A. Tanner and Wing Hung Wong. The calculation of posterior distributions by data augmentation (C/R: P ). Journal of the American Statistical Association, 82: , [vdm01] David A. van Dyk and Xiao-Li Meng. The art of data augmentation (Pkg: P1-111). Journal of Computational and Graphical Statistics, 10(1):1 50, [VL96] Geert Verbeke and Emmanuel Lesaffre. A linear mixed-effects model with heterogeneity in the random-effects population. Journal of the American Statistical Association, 91: , [VM00] Geert Verbeke and Geert Molenberghs. Linear Mixed Models for Longitudinal Data. Springer-Verlag Inc, [VMT + 01] Geert Verbeke, Geert Molenberghs, Herbert Thijs, Emmanuel Lesaffre, and Michael G. Kenward. Sensitivity analysis for nonrandom dropout: A local influence approach. Biometrics, 57(1):7 14, [WB89] Margaret C. Wu and Kent R. Bailey. Estimation and comparison of changes in the presence of informative right censoring: Conditional linear model (Corr: V46 p889). Biometrics, 45: , [WC88] Margaret C. Wu and Raymond J. Carroll. Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process (Corr: V45 p1347; V47 p357). Biometrics, 44: , [WCK03] Frederick Wong, Christopher K. Carter, and Robert Kohn. Efficient estimation of covariance selection models. Biometrika, 90(4): , [WCL05] Naisyin Wang, Raymond J. Carroll, and Xihong Lin. Efficient semiparametric marginal estimation for longitudinal/clustered data. Journal of the American Statistical Association, 100(469): , [WF06] Kenneth J. Wilkins and Garrett M. Fitzmaurice. A hybrid model for nonignorable dropout in longitudinal binary responses. Biometrics, 62(1): , [WT97] Michael S. Wulfsohn and Anastasios A. Tsiatis. A joint model for survival and longitudinal data measured with error. Biometrics, 53: , [WT01] Yan Wang and Jeremy M. G. Taylor. Jointly modeling longitudinal and event time data with application to acquired immunodeficiency syndrome. Journal of the American Statistical Association, 96(455): , 2001.
292 286 REFERENCES [WT02] Yue Wang and Jeremy M. G. Taylor. A measure of the proportion of treatment effect explained by a surrogate marker. Biometrics, 58(4): , [WW02] Hulin Wu and Lang Wu. Identification of significant host factors for Hiv dynamics modelled by non-linear mixed-effects models. Statistics in Medicine, 21(5): , [XZ01] Jane Xu and Scott L. Zeger. Joint analysis of longitudinal data comprising repeated measures and times to events. Applied Statistics, 50(3): , [YB94] Ruoyong Yang and James O. Berger. Estimation of a covariance matrix using the reference prior. The Annals of Statistics, 22: , [ZD01] Daowen Zhang and Marie Davidian. Linear mixed models with flexible distribution of random effects for longitudinal data. Biometrics, 57(3): , [Zha99] Heping Zhang. Analysis of infant growth curves using multivariate adaptive splines. Biometrics, 55: , [Zha04] Daowen Zhang. Generalized linear mixed models with varying coefficients for longitudinal data. Biometrics, 60(1):8 15, [ZL92] Scott L. Zeger and Kung-Yee Liang. An overview of methods for the analysis of longitudinal data. Statistics in Medicine, 11: , [ZLRS98] Daowen Zhang, Xihong Lin, Jonathan Raz, and MaryFran Sowers. Semiparametric stochastic mixed models for longitudinal data. Journal of the American Statistical Association, 93: , 1998.
Index

DIC, 68
    effective number of parameters, 68
Likelihood, 45
    Information matrix, 48
    score function, 47
MCMC, 57
    convergence, 64
    data augmentation, 62
    Gibbs sampling, 58
    Metropolis-Hastings, 60
    mixing, 64
    Rao-Blackwellization, 66
posterior distribution, 48
Posterior predictive checks, 71
Posterior predictive loss, 70
Prior distributions, 49
    conjugate, 51
    elicitation, 74
    improper, 53
    informative, 56
    noninformative, 52
    shrinkage, 73
probit model
    multivariate, 33
Splines
    Bayes, 73