Missing Data in Longitudinal Studies: Dropout, Causal Inference, and Sensitivity Analysis

Michael J. Daniels, University of Florida
Joseph W. Hogan, Brown University

Contents

I Regression and Inference

1 Datasets
    Schizophrenia trial
    Growth hormone trial
    Smoking cessation trials
    HERS: HIV Natural History Study
    OASIS Study
    Pediatric AIDS Trial

2 Regression Models
    Introduction
    Generalized linear models
    Conditionally-specified models
    Directly specified (marginal) models
    Nonlinear and semiparametric regression
    Interpreting covariate effects
    Further reading

3 Bayesian inference
    Likelihood
    Prior distributions
    Computation of the posterior distribution
    Model comparison and model fit
    Semiparametric Bayes
    Further reading

4 Data Analysis
    Analysis of GH study using complete data (continuation of Data Example 1.2)
    Analysis of schizophrenia using complete data (continuation of Data Example 1.1)
    Analysis of CTQ I using complete data (continuation of Data Example 1.3)
    Analysis of HERS CD4 data (continuation of Data Example 1.4)

II Missing Data

5 Missing data mechanisms
    Introduction
    Full versus observed data
    Full-data models and missing data mechanism
    Assumptions about missing data mechanism
    MAR and dropout
    Posterior inference
    Posterior inference under ignorability
    Posterior inference under non-ignorability
    Summary
    Further reading

6 Inference under MAR
    General issues in model specification
    Posterior sampling using data augmentation
    Covariance structures for univariate longitudinal processes
    Covariate dependent covariance structures
    Multivariate processes
    Model comparison and fit
    Further reading

7 Data examples: Ignorability
    Re-analysis of GH study under MAR (cont. of Example 4.1)
    Analysis of schizophrenia clinical trial under MAR (cont. of Example 4.2)
    Analysis of CTQ I using all the data under MAR (cont. of Example 4.3)
    Analysis of weekly smoking and weight change from CTQ II using all the data under MAR

8 Models under nonignorable dropout
    Introduction
    Selection models
    Mixture models
    Shared parameter models (SPM)
    Model comparisons and assessing model fit
    Further reading

9 Informative Priors and Sensitivity analysis
    Introduction
    Pattern mixture models
    Selection models
    Elicitation of expert opinion and formulation of sensitivity analyses
    A note on sensitivity analysis in fully parametric models
    Further reading

10 Case studies: Missing not at random
    Growth hormone study
    Pediatric AIDS Trial
    OASIS Study: Analysis via pattern mixture models
    OASIS Study: Analysis via selection models

11 Application of missing data methods to problems in causal inference
    Introduction
    Framework
    Instrumental variables (IV)
    Principal stratification
    Further reading

A Notation
References
Index

PART I Regression and Inference

General introduction goes here.

CHAPTER 1 Datasets

This chapter describes, in detail, several datasets that are used throughout the book to motivate the material and to illustrate data analysis methods. For each dataset, we describe the associated study, the primary research goals to be met by analyses, and the key complications presented by missing data. Empirical summaries are given here; for detailed analyses we provide references to subsequent chapters.

These datasets derive primarily from our own collaborations, introducing the potential for selection bias in the coverage of topics. However, we have selected them because they cover a range of study designs and missing data issues. Although the data are primarily from clinical trials, we have XX observational studies. The datasets have both continuous and discrete longitudinal endpoints for analysis. Both continuous-time and discrete-time dropout processes are covered here. In XX of the studies, the goal is to capture the joint distribution of two processes evolving simultaneously. For motivating and illustrating issues related to causal inference, we use an observational study and a clinical trial with noncompliance.

It is our hope that readers can find within these examples connections to their own work or data-analytic problem of interest. The literature on missing data continues to grow, and it is impossible to address every situation with representative examples, so we have included at the end of each chapter a list of supplementary reading to assist readers in their navigation of current research in this field.

1.1 Dose-finding trial of an experimental treatment for schizophrenia

Study and data

These data were collected as part of a randomized, double-blind clinical trial of a new pharmacologic treatment of schizophrenia (Lapierre, Nair, Chouinard et al., 1990). Other published analyses of these data can be found in Hogan and Laird (1997) and Cnaan, Laird and Slasor (1997). The trial compares three doses of the new treatment (low, medium, high) to the standard dose of haloperidol, an effective antipsychotic that has known side effects. At the time of the trial, the experimental therapy was thought to have similar antipsychotic effectiveness with fewer side effects; the trial was designed to find the appropriate dosing level.

The study enrolled 245 patients at 13 different centers, and randomized them to one of the four treatment arms. The intended length of follow up was six weeks, with measures taken weekly except for week 5. Schizophrenia severity was assessed using the Brief Psychiatric Rating Scale, or BPRS, a sum of scores of 18 items that reflect behaviors, mood and feelings (Overall and Gorham, 1988). The scores ranged from 0 to 108, with higher scores indicating higher severity. A minimum score of 20 was required for entry into the study.

Questions of interest

The primary objective is to compare mean change from baseline to week 6 among the four treatment groups.

Missing data

Dropout in this study was substantial; only 139 of the 245 participants had a measurement at week 6. The mean BPRS on each treatment arm showed differences between dropouts and completers. Reasons for dropout included adverse experience (e.g., side effects), lack of treatment effect, and withdrawal for unspecified reasons. Reasons such as lack of observed treatment effect are clearly related to the primary efficacy outcome, but others such as adverse events and participant withdrawal may not be.

Figure 1: Graphs (4 panels), separating completers from dropouts
Figure 2: KM curve of dropout by tx arm
Table 1: Dropout by reason, stratified by treatment arm.

Need to insert material in chapters 5 and 7 about handling combinations of informative and noninformative dropout.
Reference to appropriate material on extrapolations
Re-add some material on intermittent missingness in the presence of dropout, in Chapters 5 or 7
What about centers? Include as covariate?

Data analyses

In Section XXX we analyze these data using a standard random effects model for longitudinal data, fitted under the missing at random (MAR) constraints. Because valid inference under MAR depends on correct model specification, we emphasize the role of variance-covariance specification under MAR. For some individuals, dropout occurs for reasons that clearly are related to outcome, and for others the connection between dropout and outcome is less clear (e.g. dropout due to adverse side effects). Hence the MAR assumption may not be entirely appropriate, and in Section 9.X, we show how to handle combinations of MAR and MNAR dropout using a pattern mixture model; some attention is given in this example to assumptions being made about intermittently missing values.

Need to do MAR analysis
Need to do Ch 9 analysis

1.2 Clinical trial of recombinant human growth hormone (rhGH) for increasing muscle strength in the elderly

Study and data

The data come from a randomized clinical trial conducted to examine the effects of recombinant human growth hormone (rhGH) therapy for building and maintaining muscle strength in the elderly (Kiel, Puhl, Rosen et al., 1998). The study enrolled 161 participants and randomized them to one of four treatment arms: placebo (P), growth hormone only (GH), exercise plus placebo (E), and exercise plus growth hormone (E+GH). Various muscle strength measures were recorded at baseline, six months, and 12 months. Here, we focus on mean quadriceps strength (QS), measured as the maximum foot-pounds of torque that can be exerted against resistance provided by a mechanical device.

Questions of interest

The primary objective of our analyses is to compare mean QS at month 12 in the four treatment arms among all those randomized (i.e., draw inference about the intention to treat effect).

Missing data

Roughly 75% of randomized individuals completed all 12 months of follow-up, and most of the dropout was thought to be related to the unobserved responses at the dropout times. Table 1.1 summarizes mean and standard deviation of QS for available follow up data, both aggregated and stratified by dropout time.

Data analyses

These data are used several times throughout the book to illustrate various models. In Section ?? we analyze data on completers only to illustrate multivariate normal regression; in Section ?? we fit pattern mixture models under the missing at random (MAR) constraint, and illustrate the use of interior family constraints (Molenberghs et al., 1998; Kenward and Molenberghs, 2000 [need cite]); in Section ?? we use more general pattern mixture models that permit missingness not at random (MNAR) and sensitivity analyses.

1.3 Clinical Trials of Exercise as an Aid to Smoking Cessation in Women: The Commit to Quit Studies

Studies and data

The Commit to Quit studies were randomized clinical trials to examine the impact of exercise on the ability to quit smoking. The women in each study were aged 18-65, smoked 5 or more cigarettes per day for at least one year, and participated in moderate or vigorous intensity activity for less than ninety minutes per week. The first trial (hereafter CTQ I) enrolled 281 women and tested the effect on smoking cessation of supervised vigorous exercise versus equivalent staff contact time (Marcus, Albrecht, King et al., 1999); the second trial (hereafter CTQ II) enrolled XXX female smokers and was designed to examine the effect of moderate partially supervised exercise (Marcus, Lewis, Hogan et al., 2005). Other analyses of these data can be found in Hogan, Roy and Korkontzelou (2004) [add ref], who illustrate weighted regression using inverse propensity scores, and Roy and Hogan (2007), who use principal stratification methods to infer the causal effect of compliance with vigorous exercise.
In each study, smoking cessation was assessed weekly using self-report, with confirmation via laboratory testing of saliva and exhaled carbon monoxide. As is typical in short-term intervention trials for smoking cessation, the target date for quitting smoking followed an initial run-in period during which the intervention was administered but the participants were not asked to quit smoking. In CTQ I, measurements on smoking status were taken weekly for 12 weeks; women were asked to quit at week 5. In CTQ II, total follow up lasted 8 weeks, and women were asked to quit in week [...].

Table 1.1 [need treatment labels] Growth hormone trial: Sample means (standard deviations) stratified by treatment group and dropout pattern k. Patterns defined by last visit (0: baseline; 1: 3 months; 2: 6 months), and $n_k$ is the number in pattern k.

    Treatment   k     n_k   Month
                            (26)
                            (15)   68 (26)
                            (24)   90 (32)   88 (32)
                All         (25)   87 (32)   88 (32)

                            (17)
                            (33)   81 (42)
                            (22)   64 (21)   63 (20)
                All         (23)   66 (25)   63 (20)

                            (32)
                            (52)   86 (51)
                            (24)   81 (25)   73 (21)
                All         (26)   82 (26)   73 (21)

                            (29)
                            (19)   62 (31)
                            (23)   62 (20)   63 (19)
                All         (24)   62 (22)   63 (19)

Defining treatment effects

In each study, the question of interest was whether the intervention under study reduced the rate of smoking. This can be answered in terms of the effect of randomization to treatment versus randomization to control, or in terms of the effect of complying with treatment versus not complying. The former can be answered using an intention to treat analysis, contrasting outcomes based on treatment arm assignment. The latter poses additional challenges, but can be addressed using methods for inferring causal effects; these include instrumental variables, propensity score methods, and principal stratification. These topics are addressed in some detail in Chapter ??; key references include Angrist, Imbens and Rubin (1996) [need cite], Robins (XX) [need cite] and Frangakis and Rubin [need cite].

In our analyses, we frame the treatment effect in terms of either (a) time-averaged weekly cessation rate following the target quit date or (b) cessation rate at the final week of follow up. More details about analyses are given below.

Missing data

Each of the studies had substantial dropout; in CTQ I, XX percent (XX/XX) dropped out on the exercise arm, and XX percent (XX/XX) on the control arm; for CTQ II the proportions were XX on exercise, and XX on control. Figure XX shows, for CTQ I, weekly cessation rates for all observed data, and then stratified by dropout status (yes/no), making clear that dropout is related at least to observed smoking status during the study. There also exists some empirical support for the notion that dropout is related to missing outcomes in smoking cessation studies, in the sense that dropouts are more likely to be smoking once they have withdrawn from the study (Lichtenstein et al., 19XX) [need citation].

Data analyses

Analysis of CTQ I under standard MAR assumptions

In Chapters XX, we use the CTQ I data to infer the effect of being randomized to either exercise or control under the standard MAR assumption; inferences are compared to other approaches such as complete-case analysis and the common practice of assuming dropouts are smokers.

Analysis of CTQ II using auxiliary variables

Weight change is generally associated with smoking cessation. In Chapter XX, we illustrate the use of auxiliary information on longitudinal

weight changes to inform the distribution of smoking cessation outcomes in making treatment comparisons in CTQ II. The weight change data are incorporated through a joint model for longitudinal smoking cessation and weight, and the marginal distribution of smoking cessation is used for treatment comparisons.

Inferring the causal effect of exercise in CTQ I

An attractive feature of many behavioral intervention trials is that compliance with the intervention is directly observed; in CTQ I, participants attended on-site sessions to participate in exercise and counseling. In Chapter XX, we use the method of principal stratification to estimate the causal effect, on week-12 cessation rate, of attending the exercise sessions versus attending the educational sessions (see also Roy and Hogan, 2007). This effect is estimated for the subpopulation (stratum) who would comply with either intervention, if offered.

Graph of observed smoking rates each study
Graph stratified by dropouts and non-dropouts for each study
Graph of weight and smoking status in CTQ II?

1.4 Natural history of HIV infection in women: HIV Epidemiology Research Study (HERS) Cohort

Have to decide what we will do with these data. So far we have summarized CD4 using spline models. I suppose that is useful for the discussion about MAR but I'm not sure where else that gets us.

Study and data

The HIV Epidemiology Research Study (HERS) was a longitudinal cohort study of the natural history of HIV in women. Between 1993 and 1996, the HERS enrolled 1310 women who were either HIV-positive or at high risk for infection; 871 were HIV-positive at study entry. Every six months for up to five years, several outcomes were recorded for each participant, including standard measures of immunologic function and viral burden, plus a comprehensive set of measures characterizing health status and behavioral patterns (e.g., body mass index, depression status, drug use behavior).
Our analyses of HERS data will focus on modeling

CD4 progression in the presence of dropout and mortality, and on estimating the effect of treatment on CD4 count.

Many analyses of HERS, addressing a variety of topics, have been published in both the medical and statistical literature. A general study of CD4 and viral load progression can be found in Mayer, Hogan, Smith et al. (2003); covariation of CD4 and body mass index is investigated in Jones, Hogan, Snyder et al. (2003); development and application of methods for estimating the effect of time-varying treatment can be found in Ko, Hogan and Mayer (2003), Hogan and Lee (2004), Hogan and Lancaster (2004), and Roy et al. (2006). [check biblio for citations of hogan/lancaster and Roy et al]

CD4 or depression over time, with population smoother? Should align with HAART?
Here, could potentially look at depression as a function of Race within CD4 200, where dropout seems to matter.
KM plots of dropout, or plots showing dropout vs non-dropouts.
Need something on mortality... (a) KM plot of time to death, treating dropout and non-HIV death as censoring??

Study objectives

The HERS is a multi-site study with substudies numbering in the hundreds; relative to their scope, our objectives for illustrating data analyses are necessarily simplified. We carry out two analyses of data from HERS: in the first, our interest is in characterizing the trajectory of CD4 [or depression or something else...?]; in the second, we are interested in quantifying the causal effect of highly active antiretroviral therapy (HAART) on CD4 count. The latter question requires methodology for handling a time-varying nonrandomized treatment.

Missing data issues

All individuals enrolled in HERS were scheduled for five-year follow up (12 visits); of the 871 HIV-positive women, XXX completed all 12 visits; XXX dropped out and XXX died before completing the study. Reason for death is classified as HIV-related or not. Having dropout due to death necessitates careful definition of the target quantities for inference, which is given more detailed attention in Chapter XX. [need to put in placeholder for section - put it in the section.]

Data analyses

In Chapter XX, we illustrate the use of regression splines under MAR, with emphasis on the importance of choosing an appropriate variance-covariance model. Regression splines are used because the actual dates of the measurements, rather than just visit number, are available. The analysis in Chapter XX uses the method of instrumental variables to draw inferences about the effect of time-varying HAART on CD4 cell count. A similar analysis using moment-based methods can be found in Hogan and Lancaster (2004).

1.5 Clinical trial of smoking cessation among substance abusers: OASIS Study

Data description

The OASIS trial compared standard versus enhanced counseling intervention for various behaviors such as smoking and alcohol abuse among substance abusers [check - or alc abusers?]; the focus for our analyses is on smoking cessation. The trial enrolled XX individuals, randomized to standard versus more intensive counseling [details from JYL paper]. Follow up occurred at one, three and six months following randomization.

Table of cessation rates under completers only and filling in dropouts as smokers; include percent dropout; from JYL paper

Analysis objectives

The primary goal of our analysis is comparison, by treatment randomization, of smoking cessation rates at 12 months post baseline (i.e., intention to treat effect).

Missing data issues

Dropout rate was relatively high in the OASIS study (XX percent on standard, XX percent on the enhanced intervention). In our analyses we do not distinguish between dropout reason or type.

Data analyses

These data are analyzed in detail in Chapter XX, using models that allow for MNAR dropout. The first analysis uses a pattern mixture model where, conditional on dropout time, the longitudinal smoking outcomes follow a Markov transition model. The model is fit under MAR assumptions, then elaborated to allow for MNAR mechanisms. Sensitivity analyses and the use of informative priors are illustrated. The second analysis uses a semiparametric selection model approach, also allowing for MNAR dropout. The two models are compared in terms of treatment effect inference and qualitative characteristics.

1.6 Equivalence trial of competing doses of AZT in HIV-infected children: Protocol 128 of the AIDS Clinical Trials Group

Data description

Study of XX children randomized to two interventions. Dropout rate is XX percent. Data have been analyzed in other papers, including Hogan and Laird (1996), Hogan and Daniels (2000), and Hogan, Lin and Herman (2004).

Tables and figures: Graph of all data, with highlighted profiles showing dropouts tend to have lower slopes etc. Graph of OLS slopes vs dropout time

Inferential objectives

Compare difference in CD4 change by end of study.

Missing data issues

Dropout for various reasons. Here we will treat dropouts as the same, but for an analysis that considers reasons for dropout see Hogan and Laird (1996) and Hogan and Daniels (2000).

Data analyses

These data are analyzed in Chapter XX using a mixture of varying coefficient models, which is compared to the standard random effects approach and to the conditional linear model of Wu and Bailey (1988).

CHAPTER 2 Regression Models for Longitudinal Data

2.1 Introduction

Longitudinal data

The material in this book is organized around regression models for repeated measurements. Appealing to first principles, one can think of longitudinal data as arising from the joint evolution of response and covariates,

$$\{Y_i(t), \mathbf{x}_i(t) : t \geq 0\}.$$

If the process is observed at a discrete set of time points $\mathbf{t} = (t_1, \ldots, t_J)^T$ that is common to all individuals, the resulting response data can be written in terms of the $J \times 1$ vector

$$\mathbf{Y}_i = \{Y_i(t) : t \in \mathbf{t}\} = (Y_{i1}, \ldots, Y_{iJ})^T.$$

The covariate process $\{\mathbf{x}_i(t)\}$ is $p \times 1$. At time $t_j$, the observed covariates are collected in the vector $\mathbf{x}_{ij} = (x_{ij1}, \ldots, x_{ijp})^T$. Hence the full collection of observed covariates is contained in the $J \times p$ matrix

$$\mathbf{X}_i = \begin{pmatrix} \mathbf{x}_{i1}^T \\ \mathbf{x}_{i2}^T \\ \vdots \\ \mathbf{x}_{iJ}^T \end{pmatrix}.$$

When the set of observation times is common to all individuals, we say the responses are balanced or temporally aligned. It is sometimes the case that observation times are unbalanced, or temporally misaligned, in that they vary by subject, in which case the times are $t_{i1}, \ldots, t_{iJ_i}$ and the dimensions of $\mathbf{Y}_i$ and $\mathbf{X}_i$ are $J_i \times 1$ and $J_i \times p$, respectively.

In regression, we are interested in characterizing the effect of covariates

$\mathbf{X}$ on a longitudinal dependent variable $\mathbf{Y}$. Formally, we wish to draw inference about the joint distribution of the vector $\mathbf{Y}_i$ of responses, conditionally on $\mathbf{X}_i$,

$$[\mathbf{Y}_i \mid \mathbf{X}_i] = [Y_{i1}, \ldots, Y_{iJ_i} \mid \mathbf{X}_i].$$

Likelihood-based regression models for longitudinal data require a specification of this joint distribution using a model $f(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta})$. The parameter $\boldsymbol{\theta}$ is a finite-dimensional vector of parameters indexing the model; it might include regression coefficients, variance components, and parameters indexing serial correlation.

The joint distribution of responses can be specified directly or indirectly. Directly specified models are written in terms of the marginal mean at each measurement occasion or time point $t$, together with a model for the variance-covariance structure. Indirectly specified models typically use a multilevel format, for example involving subject-specific random effects or latent variables $\mathbf{b}_i$ to partition within- and between-subject variation. The usual strategy is to specify the joint distribution of responses and random effects, factored as

$$[\mathbf{Y}_i, \mathbf{b}_i \mid \mathbf{X}_i] = [\mathbf{Y}_i \mid \mathbf{b}_i, \mathbf{X}_i] \, [\mathbf{b}_i \mid \mathbf{X}_i].$$

The distribution of interest, $[\mathbf{Y}_i \mid \mathbf{X}_i]$, is obtained by integrating over the $\mathbf{b}_i$. Both directly- and indirectly-specified models are common for modeling longitudinal data, and in our review we will give several examples.

Regression models

The literature on regression models for longitudinal data is vast, and we make no attempt to be comprehensive here. Our review is designed to highlight predominant approaches to regression modeling, emphasizing those models used in later chapters. Readers are referred to Diggle et al. [DHLZ02b], Fitzmaurice et al. [FLW04], Laird [Lai04], Jones [Jon93], Davidian and Giltinan [DG98], Crowder and Hand [CH90], Verbeke and Molenberghs [VM00], and Lindsey [Lin99] for a variety of perspectives.
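Returning to the factorization of $[\mathbf{Y}_i, \mathbf{b}_i \mid \mathbf{X}_i]$ described earlier in this section, the following is a minimal sketch of what "integrating over the $\mathbf{b}_i$" involves in one simple case. It assumes, purely for illustration, a logistic model with a scalar normal random intercept; the function names and the values of `eta` and `tau` are hypothetical and are not taken from any analysis in this book. The integral is approximated by Gauss-Hermite quadrature and checked against a Monte Carlo average.

```python
import numpy as np

def expit(u):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-u))

def marginal_prob(eta, tau, n_nodes=40):
    """Approximate P(Y = 1 | x), the integral of expit(eta + b) against
    the N(0, tau^2) density of b, by Gauss-Hermite quadrature."""
    z, w = np.polynomial.hermite.hermgauss(n_nodes)  # nodes/weights for exp(-z^2)
    # Change of variables b = sqrt(2) * tau * z maps the normal density
    # onto the Gauss-Hermite weight function, up to a factor 1/sqrt(pi).
    return float(np.sum(w * expit(eta + np.sqrt(2.0) * tau * z)) / np.sqrt(np.pi))

eta, tau = 1.0, 2.0                      # hypothetical linear predictor and random-effect SD
p_marginal = marginal_prob(eta, tau)     # probability averaged over b
p_conditional = expit(eta)               # probability for a subject with b = 0

# Monte Carlo check of the same integral
rng = np.random.default_rng(0)
b = rng.normal(0.0, tau, size=400_000)
p_mc = float(expit(eta + b).mean())

print(p_conditional, p_marginal, p_mc)
```

In this sketch the marginal probability lies between 0.5 and the conditional probability at $b = 0$, which illustrates why the marginal distribution implied by a conditionally specified model need not share the form of the conditional model.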
As we review several different regression models, the intent is to give the reader a sense of the rich variety of models that can be used to characterize longitudinal data, and to illustrate that these fit coherently into a single framework. As a result, missing data strategies described in later chapters can be applied very generally. Specific models described here will be familiar to those with experience analyzing longitudinal data (e.g. multivariate normal regression model, random effects models), but

others represent fairly new developments (e.g. marginalized transition models [Hea02], regression splines [EM96, LZ99, RWC03]). Here we focus on specification and interpretation; Chapter 3 covers various aspects of inference.

Because many regression models for longitudinal data have their foundation in the generalized linear model (GLM) for cross-sectional data [MN89], our review begins with a concise description of GLMs. Coverage of models for longitudinal data begins with random effects models; these build directly on the GLM structure by introducing individual-level random effects to capture between-subject variation. Conditionally on the random effects, within-subject variation can be described by a simpler model, such as a GLM. Random effects models are very attractive in that they naturally partition variation in the dependent variable into its between- and within-subject components, and they can be used to model both balanced and unbalanced data. At the same time, there is sometimes the disadvantage that the implied marginal distribution of responses is opaque.

An alternative to random effects models is directly-specified models of the joint marginal distribution of responses (Section 2.4). Frequently referred to as marginal models, directly-specified models have a natural construction when the error distribution is multivariate normal, but for binary, count, and other discrete data, the choice of an appropriate joint distribution is less obvious. Our review touches on some recent developments for discrete longitudinal responses, such as the marginalized transition model [Hea02] and others. For a detailed review of likelihood-based models of multivariate discrete responses, see Chapter 11 of Diggle et al. [DHLZ02a] and Chapter 7 of Laird [Lai04].

For all models covered in the first part of the chapter, the regression function is linear in covariates and takes a known functional form.
Section 2.5 describes models in which the regression function can be nonlinear, either through a known function of the covariates or through an unspecified smooth function. The latter type of model is typically called semiparametric, because the regression is left unspecified but distributional assumptions are made about the error structure. Nonlinear and semiparametric models have a close connection to the GLM structure; our discussion of these models emphasizes that connection and illustrates that regression models as a whole can be very generally characterized [HTF01, RWC03]. The final element of our review concerns interpretation of covariate effects in longitudinal models. Because the response and covariates change with time, models of longitudinal data afford the opportunity to infer

both within- and between-subject covariate effects; however, the importance of underlying assumptions to the interpretation of covariate effects should not be underestimated. Section 2.6 discusses three key aspects of interpretation and specification for longitudinal models: cross-sectional versus longitudinal effects of a time-varying covariate, marginal (population-averaged) versus conditional (subject-specific) covariate effects, and the assumptions governing the use of time-varying covariates.

Full vs. observed data

Throughout Chapters 2 and 3, the models refer to a full-data distribution. The distinction between full and observed data is particularly important when drawing inference from incomplete longitudinal data. We define the full data as those observations intended to be collected on a pre-specified interval, such as $[0, T]$. For example, if intended collection times $t_1, \ldots, t_J$ are common to all individuals, then the full response and covariate data are $(Y_{i1}, \mathbf{X}_{i1}), \ldots, (Y_{iJ}, \mathbf{X}_{iJ})$, where $Y_{ij} = Y_i(t_j)$ and $\mathbf{X}_{ij} = \mathbf{X}_i(t_j)$.

In most applications, interest lies in the effect of covariates on the mean structure. When data are fully observed, the variance and covariance models can frequently be treated as nuisance parameters. Correct specification of variance and covariance allows more efficient use of the data, but it is not always necessary for obtaining proper inferences about mean parameters. When data are not fully observed, variance-covariance specification takes on heightened importance because missing data will effectively be imputed or extrapolated from observed data, based on modeling assumptions. For longitudinal data, unobserved responses will be imputed from observed responses for the same individual; the assumed correlation structure will usually have considerable influence on the imputation.
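To make the influence of the assumed correlation structure concrete, here is a minimal sketch that assumes, for illustration only, a multivariate normal model with three visits in which the third response is missing. The imputed value is the conditional mean $E(Y_{\text{mis}} \mid Y_{\text{obs}})$ under three hypothetical covariance structures (AR(1), exchangeable, independence); the helper names and all numerical values are assumptions introduced here, not quantities from the datasets in Chapter 1.

```python
import numpy as np

def conditional_mean(mu, Sigma, obs_idx, mis_idx, y_obs):
    """E(Y_mis | Y_obs = y_obs) when Y ~ N(mu, Sigma)."""
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]   # covariance among observed responses
    S_mo = Sigma[np.ix_(mis_idx, obs_idx)]   # covariance of missing with observed
    resid = np.asarray(y_obs, dtype=float) - mu[obs_idx]
    return mu[mis_idx] + S_mo @ np.linalg.solve(S_oo, resid)

def ar1(J, rho):
    """AR(1) correlation matrix: corr(Y_j, Y_k) = rho^|j - k|."""
    idx = np.arange(J)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def exchangeable(J, rho):
    """Exchangeable correlation matrix: corr(Y_j, Y_k) = rho for j != k."""
    return (1.0 - rho) * np.eye(J) + rho * np.ones((J, J))

mu = np.zeros(3)                 # hypothetical mean vector
y_obs = np.array([2.0, 2.0])     # visits 1 and 2 observed, visit 3 missing
obs_idx, mis_idx = [0, 1], [2]

imp_ar1 = conditional_mean(mu, ar1(3, 0.5), obs_idx, mis_idx, y_obs)
imp_exch = conditional_mean(mu, exchangeable(3, 0.5), obs_idx, mis_idx, y_obs)
imp_indep = conditional_mean(mu, np.eye(3), obs_idx, mis_idx, y_obs)

print(imp_ar1, imp_exch, imp_indep)
```

With the same observed responses and the same mean, the three structures give three different imputed values, and under independence the observed responses carry no information about the missing one.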
This theme recurs throughout the book, and therefore our review pays particular attention to aspects of variance-covariance specification. In Chapter 5, we expand the definition of full data to include random variables such as dropout time that characterize the missing data process.

Additional notation

Random variables and their realizations are denoted by Roman letters (e.g., $X$, $x$), and parameters are represented by Greek letters (e.g. $\alpha$, $\theta$).

Vector- and matrix-valued random variables and parameters are represented using boldface (e.g. $\mathbf{x}$, $\mathbf{Y}$, $\boldsymbol{\beta}$, $\boldsymbol{\Sigma}$). For any matrix or vector $\mathbf{A}$, we use $\mathbf{A}^T$ to denote transpose. If $\mathbf{A}$ is invertible, then $\mathbf{A}^{-1}$ is its inverse and $\mathbf{S} = \mathbf{A}^{1/2}$ is the lower triangular matrix square root such that $\mathbf{S}\mathbf{S}^T = \mathbf{A}$. A full listing of notational conventions appears in the Appendix.

2.2 Generalized linear models for cross-sectional data

The generalized linear model (GLM) forms the foundation for many approaches to regression with multivariate responses, such as longitudinal or clustered data. Models such as random effects or mixed effects models, latent variable and latent class models, and regression splines, all highly flexible and general, are based on the GLM framework. Moment-based methods such as generalized estimating equations (GEE) also follow directly from the GLM for cross-sectional data [LZ86].

The GLM is a regression model for a dependent variable arising from the exponential family of distributions,

$$f(y \mid \theta, \psi) = \exp\{(y\theta - b(\theta))/a(\psi) + c(y, \psi)\},$$

where $a$, $b$ and $c$ are known functions, $\theta$ is the canonical parameter, and $\psi$ is a scale parameter. The exponential family includes several commonly used distributions, such as normal, Poisson, binomial, and gamma. It can be readily shown that

$$E(Y) = b'(\theta), \qquad \mathrm{var}(Y) = a(\psi)\, b''(\theta)$$

(see McCullagh and Nelder [MN89] for details).

The effect of covariates $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})^T$ can be modeled by introducing the linear predictor $\eta_i(\mathbf{x}_i, \boldsymbol{\beta}) = \mathbf{x}_i^T \boldsymbol{\beta}$, where $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^T$ is a vector of regression coefficients. Now define $\mu_i = \mu(\mathbf{x}_i, \boldsymbol{\beta}) = E(Y_i \mid \mathbf{x}_i)$. A smooth, monotone function $g$ links the mean $\mu_i$ to the linear predictor $\eta_i$ via

$$g(\mu_i) = \eta_i = \mathbf{x}_i^T \boldsymbol{\beta}. \qquad (2.1)$$

In many exponential family distributions, it is possible to identify a link function $g$ such that $\mathbf{X}^T \mathbf{Y}$ is the sufficient statistic for $\boldsymbol{\beta}$ (here, $\mathbf{X}$ is the $n \times p$ design matrix and $\mathbf{Y} = (Y_1, \ldots, Y_n)^T$ is the $n \times 1$ vector of responses).
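The role of $\mathbf{X}^T \mathbf{Y}$ can be seen numerically in a small sketch. The example below assumes simulated Poisson data and a log link, with hypothetical coefficient values and a hypothetical helper `fit_poisson` written for this illustration; it is not code from the book. Under this link the score is $\mathbf{X}^T(\mathbf{y} - \boldsymbol{\mu})$, so the data enter only through $\mathbf{X}^T \mathbf{y}$, and at the maximum likelihood estimate the fitted means reproduce that statistic.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # n x p design matrix
beta_true = np.array([0.3, 0.5])                       # hypothetical coefficients
y = rng.poisson(np.exp(X @ beta_true))                 # simulated Poisson responses

def fit_poisson(X, y, n_iter=50, tol=1e-10):
    """Newton-Raphson for Poisson regression with log link."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)             # involves y only through X^T y
        info = X.T @ (mu[:, None] * X)     # X^T W X with W = diag(mu)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

beta_hat = fit_poisson(X, y)
mu_hat = np.exp(X @ beta_hat)

# At the estimate, the score equations X^T (y - mu_hat) = 0 hold.
print(X.T @ y)
print(X.T @ mu_hat)
```

The two printed vectors should agree up to numerical tolerance, which is the sense in which the fit depends on the responses only through $\mathbf{X}^T \mathbf{y}$.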
With such a link (the canonical link), the canonical parameter is $\theta = \eta$. Examples are well known and widespread: for the Poisson distribution, the canonical parameter is $\log(\mu)$; for the binomial distribution, it is the log odds (logit), $\log\{\mu/(1 - \mu)\}$.

Although canonical links are sometimes convenient, their use is not necessary to form a GLM. In general, a GLM requires only the specification of a mean and a variance function, conditionally on covariates. The mean follows (2.1), and the variance is given by $v(\mu_i, \phi) = \phi h(\mu_i)$, where $h(\cdot)$ is some function of the mean and $\phi > 0$ is a scale factor. Certain choices of $g$ and $h$ will yield likelihood score equations for common parametric regression models based on exponential family distributions. For example, setting $g(\mu) = \log\{\mu/(1 - \mu)\}$, $h(\mu) = \mu(1 - \mu)$ and $\phi = 1$ yields logistic regression under a Bernoulli distribution $[Y_i \mid x_i]$. Similarly, Poisson regression can be specified by setting $g(\mu) = \log \mu$, $h(\mu) = \mu$ and $\phi = 1$.

2.3 Conditionally specified (random effects) models

Conditionally specified models using random effects or latent variables provide a highly flexible class of models for handling longitudinal data. A defining characteristic of these models is that they impose structure on marginal variance and correlation using individual-specific random effects or latent variables. The models can be applied either to balanced or unbalanced response patterns, and can be used to capture key features of both between- and within-subject variation using relatively few parameters.

A standard approach is to specify a regression model that includes subject-level random effects or latent variables $b$, and then to assume that conditionally on the latent variables, the distribution $[Y \mid X, b]$ has a simple form (e.g., its elements are independent). Integrating out the random effects yields marginal correlations between the $\{Y_{ij}\}$ within subject [BK99].

Many models, regression and otherwise, can be represented using a random effects or latent variable formulation.
These include standard random effects regression models for responses that are continuous [LW82, Dig88] or discrete [SLW84, GH97, HG94, NMK00]; see Breslow and Clayton [BC93] and Daniels and Gatsonis [DG99] for an overview. This class of models also includes regression models with factor-analytic and latent class structures [SR96, AW00, SL96, RLR99, RA01]. See Bartholomew and Knott [BK99] for a full account.

Here we briefly review conditionally-specified regression models where conditioning is done on random effects; these models also are known by a variety of names, including mixed effects models, random effects models, and random coefficient models. We use the term random effects models.

The most common random effects models for longitudinal data specify the joint distribution $[Y_i, b_i \mid X_i, \theta]$ as

$$[Y_i \mid b_i, X_i, \theta_1]\,[b_i \mid X_i, \theta_2].$$

The parameter $\theta_1$ captures the conditional effect of $X$ on $Y$. The marginal distribution $[Y_i \mid X_i]$ is obtained by integrating $b_i$ out of the joint distribution, and is indexed by the full set of parameters $\theta = (\theta_1, \theta_2)$.

Random effects models based on GLMs

By including random effects, generalized linear models can be used to model longitudinal and clustered data. For common distributions such as Bernoulli and Poisson, the GLM with random effects can be written in terms of the conditional mean and variance. The conditional mean takes the form

$$g\{E(Y_{ij} \mid x_{ij}, z_{ij}, b_i)\} = g(\mu^b_{ij}) = x_{ij}\beta + z_{ij} b_i,$$

where $g(\cdot)$ is a link function and $z_{ij}$ is a design matrix for the subject-specific random effects. This representation of the conditional mean motivates the term mixed-effects model because the coefficients quantify both population-level ($\beta$) and individual-level ($b_i$) effects. The conditional variance is

$$V^b_{ij} = \mathrm{var}(Y_{ij} \mid x_{ij}, z_{ij}, b_i) = \phi h(\mu^b_{ij}).$$

Finally, within-subject correlation is specified through a covariance function

$$C^b_{ijk}(\gamma) = \mathrm{cov}(Y_{ij}, Y_{ik} \mid x_{ij}, x_{ik}, b_i, \gamma).$$

In many cases it is assumed that $C^b_{ijk} = 0$ for $j \neq k$; i.e., that the random effects capture relevant within-subject correlation (after averaging over their distribution), but this assumption may not always be appropriate for longitudinal responses. At the second level, the random effects $b_i$ follow some distribution such as multivariate normal.
The model for the marginal joint distribution of $(Y_{i1}, \ldots, Y_{iJ} \mid X_i)$ is obtained by integrating over $b_i$,

$$f(y_1, \ldots, y_J \mid X_i, \theta) = \int f(y_1, \ldots, y_J \mid X_i, b_i, \theta_1)\, dF(b_i \mid X_i, \theta_2).$$
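To make the integral concrete, the sketch below evaluates it numerically for a hypothetical subject with $J = 3$ binary responses, a logit link, and a scalar normal random intercept, assuming the responses are independent given $b_i$. The linear predictor values `eta`, the standard deviation `sd`, and the use of a simple trapezoidal rule are illustrative assumptions rather than anything prescribed by the text.

```python
import math
from itertools import product


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def normal_pdf(b, sd):
    return math.exp(-0.5 * (b / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def conditional_prob(y, eta, b):
    """f(y_1, ..., y_J | b): product of independent Bernoulli terms given b."""
    prob = 1.0
    for y_j, eta_j in zip(y, eta):
        mu = expit(eta_j + b)          # conditional mean on the probability scale
        prob *= mu if y_j == 1 else 1.0 - mu
    return prob


def marginal_prob(y, eta, sd, n_grid=4001, width=10.0):
    """Integrate f(y | b) against the N(0, sd^2) density with the trapezoidal rule."""
    lo = -width * sd
    step = 2.0 * width * sd / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        b = lo + k * step
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += weight * conditional_prob(y, eta, b) * normal_pdf(b, sd)
    return total * step


eta = [-0.5, 0.0, 0.5]   # hypothetical fixed-effect linear predictors x_ij * beta
sd = 1.5                 # hypothetical random-intercept standard deviation

# Marginal probability of each of the 2^J response patterns.
probs = {y: marginal_prob(y, eta, sd) for y in product([0, 1], repeat=len(eta))}
total = sum(probs.values())

# Marginal moments of the first two responses.
p1 = sum(p for y, p in probs.items() if y[0] == 1)
p2 = sum(p for y, p in probs.items() if y[1] == 1)
p12 = sum(p for y, p in probs.items() if y[0] == 1 and y[1] == 1)
cov12 = p12 - p1 * p2

print("sum of pattern probabilities:", total)
print("marginal cov(Y_1, Y_2):", cov12)
```

Although the responses are independent given $b_i$, the marginal covariance is positive, which illustrates how integrating out the random effect induces within-subject correlation.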

The relationship between marginal and conditional (random effects) models is important to understand, particularly as it relates to interpreting covariate effects. In what follows we give several examples to illustrate.

Random effects models for continuous response

A natural choice for modeling continuous or measured responses is the normal distribution. In random effects models, allowing both within- and between-subject variation to follow a normal distribution, or more generally a Gaussian process, affords considerable modeling flexibility while retaining interpretability.

Example 2.1. Normal random effects model for continuous responses.

A common model for continuous longitudinal responses is the normal random effects model. This model illustrates well the concept of an indirectly-specified joint distribution because the variance-covariance structure in $[Y_i \mid X_i, \theta]$ is a by-product of the assumed random effects distribution. Like many random effects models, it is easiest to describe in two stages.

At the first stage, the responses $Y_i$ are normal conditionally on a $q \times 1$ vector of random effects $b_i$,

$$[Y_i \mid X_i, b_i, \theta_1] \sim N(\mu^b_i, \Sigma^b_i),$$

where superscript $b$ is added to emphasize that the mean and covariance are conditional on $b_i$. To incorporate covariate effects, let $\mu^b_i = X_i \beta + Z_i b_i$, where $Z_i$ is the design matrix for random effects. The variance matrix $\Sigma^b_i = \Sigma^b_i(\phi)$ captures within-subject variation and is parameterized by the $r \times 1$ vector $\phi$ of nonredundant parameters. Hence $\theta_1 = (\beta, \phi)$. When $Z_i \subseteq X_i$ (the columns of $Z_i$ are a subset of those of $X_i$), as is usually the case, the $b_i$ can be thought of as error terms for regression coefficients, which gives rise to the term random coefficient model. For example, if $X_i = Z_i$, we obtain a random-coefficients model,

$$\mu^b_i = X_i \beta_i = X_i(\beta + b_i), \tag{2.2}$$

where the random effects $b_i$ can be interpreted as individual-specific deviations from $\beta$.
The within-subject variance $\Sigma^b_i(\phi)$ usually has a simplified structure, parameterized through a covariance function $C_{ijk}(\phi)$. For example, an exponential structure takes the form

$$C_{ijk}(\phi) = \sigma^2 \rho^{|t_{ij} - t_{ik}|},$$

where $\phi = (\sigma^2, \rho)$ and $0 \le \rho \le 1$.

At the second level, the random effects are assigned a distribution that can depend on covariates. The (multivariate) normal is a common choice,

$$[b_i \mid X_i] \sim N(0, \Omega),$$

where $\Omega = \Omega(\eta)$ is a $q \times q$ variance matrix indexed by $\eta$ (hence $\theta_2 = \eta$). It also is possible to allow $\eta$ to depend on individual-level covariates through appropriate specifications [DZ03]; this is covered in more detail in Chapter 6.

The marginal distribution of $Y_i$ follows the multivariate normal distribution

$$[Y_i \mid X_i, Z_i, \theta] \sim N(X_i \beta,\; Z_i \Omega Z_i^T + \Sigma^b_i). \tag{2.3}$$

The marginal variance $\mathrm{var}(Y_i \mid X_i)$ is indirectly specified because it depends on parameters from both $[Y_i \mid X_i, b_i]$ and $[b_i \mid X_i]$. Moreover, we see by comparing (2.2) and (2.3) that $\beta$ can be interpreted both as a marginal and a conditional effect of $X_i$ on $Y_i$.

A version of this model is used to analyze data described in Example 1.1, a longitudinal clinical trial comparing three doses of an antipsychotic to the standard of care in schizophrenia patients. The analysis appears in the corresponding Data Analysis example.

Random effects models for discrete responses

Random effects specifications can be very useful for modeling longitudinal discrete responses, where the joint distribution rarely takes an obvious form and principles from generalized linear models are not easily applied. In the case of longitudinal binary data, for example, it is straightforward to show that the joint distribution of a $J$-dimensional response variable can be represented by a multinomial distribution with $2^J$ categories. When $J$ is appreciably large, however, parameter constraints must be imposed to make modeling practical. See Laird [Lai04], Chapter 7 for a more detailed discussion.
Compared to direct specification of the joint distribution, random effects models offer the advantages of being parsimonious, providing a natural decomposition of sources of variation, and applying equally well to balanced and unbalanced response profiles. The regression parameters represent covariate effects in the conditional rather than marginal joint distribution of $Y$, however, and because the link functions are nonlinear transformations of the mean (e.g., log, logit), these do not generally coincide. Therefore care must be taken when interpreting regression effects. The logistic regression with normal random effects illustrates several of these points rather well.

Example 2.2. Logistic regression with random effects.

As in Example 2.1, a logistic random effects model is specified in terms of the joint distribution

$$[Y_i, b_i \mid X_i, \theta] = [Y_i \mid X_i, b_i, \theta_1]\,[b_i \mid X_i, \theta_2],$$

where $\theta = (\theta_1, \theta_2)$. The conditional distribution of each component in $Y_i$ follows the Bernoulli model,

$$[Y_{ij} \mid x_{ij}, b_i, \theta_1] \sim \mathrm{Ber}(\mu^b_{ij}),$$

where $g$ is the logit link and

$$g(\mu^b_{ij}) = x_{ij}\beta + z_{ij} b_i \tag{2.4}$$

(hence $\theta_1 = \beta$). The random effects distribution follows

$$[b_i \mid X_i, \theta_2] \sim N(0, \Omega),$$

so $\theta_2 = \Omega$.

The parameter $\beta$ characterizes the conditional, or subject-specific, effect of $X_i$ on $Y_i$. By contrast, the marginal or population-averaged distribution $[Y_i \mid X_i, \theta]$ must be obtained by integrating over $b_i$. The marginal mean $\mu_{ij}(\beta, \Omega) = E(Y_{ij} \mid x_{ij}, \beta, \Omega)$ is

$$\mu_{ij}(\beta, \Omega) = \int \mu^b_{ij}(\beta)\, dF(b_i \mid x_{ij}, \Omega) = \int \frac{\exp(x_{ij}\beta + z_{ij} b_i)}{1 + \exp(x_{ij}\beta + z_{ij} b_i)}\, dF(b_i \mid x_{ij}, \Omega).$$

The marginal effect of $X_i$ differs from the conditional effect in that it is a function of both $\beta$ and $\Omega$, and on the logit scale, it is no longer linear. Zeger and Liang [ZL92] show that in some cases, the marginal effect in the logit-normal model is approximately linear on the logit scale, and differs from the conditional effect by a scale factor that depends on $\Omega$; i.e., $g(\mu_{ij}) \approx x_{ij}\beta^*$, where $\beta^* = \beta k(\Omega)$ and $k : R^q \to R^p$ is a known function.
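The gap between the conditional and marginal effects can be seen numerically. The sketch below takes a hypothetical random intercept logistic model with a single binary covariate, computes the marginal mean at $x = 0$ and $x = 1$ by trapezoidal integration, and reports the marginal log odds ratio alongside the conditional coefficient. The comparison value uses the scale factor $(1 + c^2\sigma^2)^{-1/2}$ with $c = 16\sqrt{3}/(15\pi)$, a commonly quoted approximation for the logit-normal model; that constant comes from the general literature and is an assumption here, not something stated in the text above. All parameter values are illustrative.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def marginal_mean(eta, sd, n_grid=4001, width=10.0):
    """Trapezoidal approximation to E[expit(eta + b)] with b ~ N(0, sd^2)."""
    lo = -width * sd
    step = 2.0 * width * sd / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        b = lo + k * step
        density = math.exp(-0.5 * (b / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        total += weight * expit(eta + b) * density
    return total * step


beta0, beta1 = -1.0, 1.0   # hypothetical conditional (subject-specific) coefficients
sd = 1.5                   # hypothetical random-intercept standard deviation

mu_x0 = marginal_mean(beta0, sd)           # marginal P(Y = 1 | x = 0)
mu_x1 = marginal_mean(beta0 + beta1, sd)   # marginal P(Y = 1 | x = 1)

# Population-averaged log odds ratio for x = 1 versus x = 0.
beta_star = logit(mu_x1) - logit(mu_x0)

# Assumed approximation: beta_star is roughly beta1 / sqrt(1 + c^2 * sd^2).
c = 16.0 * math.sqrt(3.0) / (15.0 * math.pi)
beta_approx = beta1 / math.sqrt(1.0 + c ** 2 * sd ** 2)

print("conditional effect:", beta1)
print("marginal effect (numerical):", beta_star)
print("marginal effect (approximation):", beta_approx)
```

With these values the marginal log odds ratio is smaller than the conditional coefficient, and raising `sd` widens the gap.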
In many cases the population-averaged effect (denoted $\beta^*$) is attenuated relative to the subject-specific effect $\beta$; for example, when $q = 1$ (random intercept model), $b_i$ is normally distributed, $b_i$ is independent of $X_i$, and $X_i$ has a single time-constant covariate $x_i$, then $|\beta^*| \le |\beta|$, with the difference $|\beta - \beta^*|$ increasing with $\mathrm{var}(b_i)$. Interpreting the marginal and conditional effects is considered further in Section 2.6.

In Chapter 4, we use this model to characterize the effect of a behavioral intervention on weekly smoking cessation status using longitudinal binary data from a recent clinical trial. The data are described in Dataset ?? and analyzed in Data Analysis ??.

Examples 2.1 and 2.2 assume the random effects $b_i$ follow a normal distribution; this is not necessary and in many cases it may be inappropriate or incorrect. Zhang and Davidian [ZD01] describe models where the random effects distribution belongs to a flexible class of densities that includes the normal as a special case. Verbeke and Lesaffre [VL96] describe random effects distributions that follow discrete mixtures of normal distributions. For some simple models, exploratory analysis can help ascertain whether a normal or other symmetric distribution is suitable for describing the random effects. In other cases more formal methods of model choice may be needed.

2.4 Directly specified (marginal) models

This section reviews the family of models in which the joint distribution $[Y \mid X]$ is directly specified by a model $f(y \mid x, \theta)$. Usually the most challenging aspect of model specification is finding a suitable parameterization for the correlation and/or covariance, particularly when observations are unbalanced in time or when the number of observations per subject is large relative to sample size. In these cases, sensible decisions about dimension reduction must be made.

For continuous data that can be characterized using a normal distribution or Gaussian process, model specification (though not necessarily selection) can be reasonably straightforward, owing to the natural separation of mean and variance parameters in the normal distribution.
The analyst can focus efforts separately on sensible models for mean and variance/covariance structures.

Other types of data pose more significant challenges to the process of direct specification due to a lack of obvious choices for joint distribution models. Unlike the normal distribution, which generalizes naturally to the multivariate and even stochastic process setting, common distributions like gamma, binomial and Poisson do not have obvious multivariate analogues. The main problems are that the mean shares parameters with the variance and, even for simple specifications, with the covariance. Another potential problem is that unlike with the normal model, higher order associations do not necessarily follow from pairwise associations, hence they need to be specified or explicitly constrained [FL93]. The joint distribution of $J$ binary responses, for example, has $2^J - J - 1$ parameters governing the association structure. With count data, appropriately specifying even a simple correlation structure is not immediately obvious.

This section describes various approaches to direct model specification, illustrated with examples from the normal and binomial distributions. The first examples use the normal distribution. For longitudinal binary data, we describe an extension of the log-linear model that allows transparent interpretation of both the mean and serial correlation. Another useful approach to modeling association in binary data is the multivariate probit model, which exploits properties of the normal distribution by assuming the binary variables are manifestations of an underlying normally-distributed latent process.

Multivariate normal and Gaussian process models

The multivariate normal distribution provides a highly flexible starting point for modeling continuous response data, both temporally aligned and misaligned. It also is useful for handling situations where the number of observation times is large relative to the number of units being followed. The most straightforward situation is where data are temporally aligned and $n \gg J$, allowing both the mean and variance to be unstructured. When responses are temporally misaligned, or when $J$ is large relative to $n$, structure must be imposed.

A key characteristic of the normal distribution that allows for flexible modeling across a wide variety of settings is that the mean and variance have separate parameters. The next two examples illustrate a variety of model specifications using the normal distribution. Each assumes that the response variable $Y$, or a suitable transformation, is well-characterized by the normal model.

Example 2.3. Multivariate normal regression for temporally aligned observations.

Assume that observations on the primary response variable are taken at a fixed schedule of times $t_1, \ldots, t_J$. For a response vector $Y_i = (Y_{i1}, \ldots, Y_{iJ})^T$ with associated $J \times p$ covariate matrix $X_i = (x_{i1}^T, \ldots, x_{iJ}^T)^T$, the multivariate normal regression is written as

$$[Y_i \mid X_i, \theta] \sim N(\mu_i, \Sigma_i),$$

where $\mu_i$ is $J \times 1$ and $\Sigma_i$ is $J \times J$. The mean $E(Y_i \mid X_i)$ follows a regression

$$\mu_i(\beta) = \mu(X_i, \beta) = X_i \beta,$$

where $\beta$ is a $p \times 1$ vector of regression coefficients. The covariance is parameterized with a vector of nonredundant parameters $\phi$. To emphasize that the covariance $\mathrm{var}(Y_i \mid X_i)$ may depend on $X_i$ through $\phi$, we sometimes write $\Sigma_i(\phi) = \Sigma(X_i, \phi)$. If $\Sigma$ is assumed constant across individuals, it has $J(J+1)/2$ unique parameters, but structure can be imposed to reduce this number [JS86]. As an alternative to leaving $\Sigma$ fully parameterized, common structures for longitudinal data include banded or Toeplitz (with a common parameter along each off-diagonal), and autoregressive correlations of pre-specified order. The model also can be extended to allow $\Sigma$ to depend on covariates [DP02, NnANAZ00a, Pou00].

This model is written in general terms, and the $X_i$ matrix can be arbitrary. For example, it can include information about measurement time, baseline covariates and the like. If we set $x_{ij} = (1, t_j)^T$ and $\beta = (\beta_0, \beta_1)^T$, then $\beta_1$ corresponds to the average slope over time, where the average is taken over the population from which the sample of individuals is drawn. When $J$ is small enough, $x_{ij}$ can include a vector of time indicators, allowing the mean to be unstructured in time.

In Data Analysis 4.1, this model is used to analyze data from a clinical trial of recombinant human growth hormone for increasing muscle strength in the elderly. The data are fully described in Dataset 1.2.

In the previous example, it is sometimes possible to allow both the mean and variance to remain unstructured in time because of the relatively few time points and covariate levels. When time points are temporally misaligned, or when the number of observation times is large relative to the sample size, information at the unique measurement times will be sparse and additional structure needs to be imposed. Our focus in the next example is on covariance parameterization in terms of a covariance function.
Further details can be found in Diggle et al. [DHLZ02a], Chapter 4.

Example 2.4. Multivariate normal regression model for temporally misaligned observations.

When observations are temporally misaligned, the main difference in model specification lies in the covariance parameterization. As with Example 2.3, a normal distribution may be assumed, but with covariance $\Sigma_i$ whose dimension and structure depend on the number and timing of observations for individual $i$. The joint distribution follows

$$[Y_i \mid X_i, \theta] \sim N(\mu_i, \Sigma_i).$$
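One way to see how a covariance function produces a $\Sigma_i$ of subject-specific dimension is the sketch below. It borrows the exponential structure $\sigma^2 \rho^{|t_{ij} - t_{ik}|}$ from Example 2.1 as one possible choice; the observation times, the subject labels, and the parameter values are hypothetical.

```python
def exponential_cov(times, sigma2, rho):
    """Covariance matrix with (j, k) entry sigma2 * rho ** |t_j - t_k|."""
    return [[sigma2 * rho ** abs(t_j - t_k) for t_k in times] for t_j in times]


# Hypothetical, temporally misaligned observation times for two subjects.
obs_times = {
    "subject_1": [0.0, 1.0, 2.0],
    "subject_2": [0.0, 0.5, 1.5, 3.0, 4.0],
}
sigma2, rho = 2.0, 0.6   # shared covariance parameters phi = (sigma^2, rho)

# Each subject gets a matrix sized by its own number of observations,
# while all subjects share the same two covariance parameters.
cov = {sid: exponential_cov(t, sigma2, rho) for sid, t in obs_times.items()}

for sid, matrix in cov.items():
    print(sid, "dimension:", len(matrix), "x", len(matrix[0]))
    for row in matrix:
        print("  ", [round(v, 3) for v in row])
```

The design point is that only two parameters are estimated regardless of how many distinct measurement times appear in the data, in contrast to the $J(J+1)/2$ parameters of an unstructured matrix.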

Introduction to mixed model and missing data issues in longitudinal studies

Introduction to mixed model and missing data issues in longitudinal studies Introduction to mixed model and missing data issues in longitudinal studies Hélène Jacqmin-Gadda INSERM, U897, Bordeaux, France Inserm workshop, St Raphael Outline of the talk I Introduction Mixed models

More information

A Basic Introduction to Missing Data

A Basic Introduction to Missing Data John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item

More information

Problem of Missing Data

Problem of Missing Data VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians;

More information

Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis

Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis Int. Journal of Math. Analysis, Vol. 5, 2011, no. 1, 1-13 Review of the Methods for Handling Missing Data in Longitudinal Data Analysis Michikazu Nakai and Weiming Ke Department of Mathematics and Statistics

More information

Overview. Longitudinal Data Variation and Correlation Different Approaches. Linear Mixed Models Generalized Linear Mixed Models

Overview. Longitudinal Data Variation and Correlation Different Approaches. Linear Mixed Models Generalized Linear Mixed Models Overview 1 Introduction Longitudinal Data Variation and Correlation Different Approaches 2 Mixed Models Linear Mixed Models Generalized Linear Mixed Models 3 Marginal Models Linear Models Generalized Linear

More information

PATTERN MIXTURE MODELS FOR MISSING DATA. Mike Kenward. London School of Hygiene and Tropical Medicine. Talk at the University of Turku,

PATTERN MIXTURE MODELS FOR MISSING DATA. Mike Kenward. London School of Hygiene and Tropical Medicine. Talk at the University of Turku, PATTERN MIXTURE MODELS FOR MISSING DATA Mike Kenward London School of Hygiene and Tropical Medicine Talk at the University of Turku, April 10th 2012 1 / 90 CONTENTS 1 Examples 2 Modelling Incomplete Data

More information

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

More information

Chapter 1. Longitudinal Data Analysis. 1.1 Introduction

Chapter 1. Longitudinal Data Analysis. 1.1 Introduction Chapter 1 Longitudinal Data Analysis 1.1 Introduction One of the most common medical research designs is a pre-post study in which a single baseline health status measurement is obtained, an intervention

More information

Statistical modelling with missing data using multiple imputation. Session 4: Sensitivity Analysis after Multiple Imputation

Statistical modelling with missing data using multiple imputation. Session 4: Sensitivity Analysis after Multiple Imputation Statistical modelling with missing data using multiple imputation Session 4: Sensitivity Analysis after Multiple Imputation James Carpenter London School of Hygiene & Tropical Medicine Email: james.carpenter@lshtm.ac.uk

More information

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Overview Missingness and impact on statistical analysis Missing data assumptions/mechanisms Conventional

More information

Analysis of Correlated Data. Patrick J. Heagerty PhD Department of Biostatistics University of Washington

Analysis of Correlated Data. Patrick J. Heagerty PhD Department of Biostatistics University of Washington Analysis of Correlated Data Patrick J Heagerty PhD Department of Biostatistics University of Washington Heagerty, 6 Course Outline Examples of longitudinal data Correlation and weighting Exploratory data

More information

Handling missing data in Stata a whirlwind tour

Handling missing data in Stata a whirlwind tour Handling missing data in Stata a whirlwind tour 2012 Italian Stata Users Group Meeting Jonathan Bartlett www.missingdata.org.uk 20th September 2012 1/55 Outline The problem of missing data and a principled

More information

A Mixed Model Approach for Intent-to-Treat Analysis in Longitudinal Clinical Trials with Missing Values

A Mixed Model Approach for Intent-to-Treat Analysis in Longitudinal Clinical Trials with Missing Values Methods Report A Mixed Model Approach for Intent-to-Treat Analysis in Longitudinal Clinical Trials with Missing Values Hrishikesh Chakraborty and Hong Gu March 9 RTI Press About the Author Hrishikesh Chakraborty,

More information

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September

More information

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

More information

Poisson Models for Count Data

Poisson Models for Count Data Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

More information

Bayesian Approaches to Handling Missing Data

Bayesian Approaches to Handling Missing Data Bayesian Approaches to Handling Missing Data Nicky Best and Alexina Mason BIAS Short Course, Jan 30, 2012 Lecture 1. Introduction to Missing Data Bayesian Missing Data Course (Lecture 1) Introduction to

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

BayesX - Software for Bayesian Inference in Structured Additive Regression

BayesX - Software for Bayesian Inference in Structured Additive Regression BayesX - Software for Bayesian Inference in Structured Additive Regression Thomas Kneib Faculty of Mathematics and Economics, University of Ulm Department of Statistics, Ludwig-Maximilians-University Munich

More information

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

More information

Handling attrition and non-response in longitudinal data

Handling attrition and non-response in longitudinal data Longitudinal and Life Course Studies 2009 Volume 1 Issue 1 Pp 63-72 Handling attrition and non-response in longitudinal data Harvey Goldstein University of Bristol Correspondence. Professor H. Goldstein

More information

Nominal and ordinal logistic regression

Nominal and ordinal logistic regression Nominal and ordinal logistic regression April 26 Nominal and ordinal logistic regression Our goal for today is to briefly go over ways to extend the logistic regression model to the case where the outcome

More information

CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA

CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA Examples: Multilevel Modeling With Complex Survey Data CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA Complex survey data refers to data obtained by stratification, cluster sampling and/or

More information

Multiple Imputation for Missing Data: A Cautionary Tale

Multiple Imputation for Missing Data: A Cautionary Tale Multiple Imputation for Missing Data: A Cautionary Tale Paul D. Allison University of Pennsylvania Address correspondence to Paul D. Allison, Sociology Department, University of Pennsylvania, 3718 Locust

More information

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS Examples: Regression And Path Analysis CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS Regression analysis with univariate or multivariate dependent variables is a standard procedure for modeling relationships

More information

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives

More information

Regression Modeling Strategies

Regression Modeling Strategies Frank E. Harrell, Jr. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis With 141 Figures Springer Contents Preface Typographical Conventions

More information

PS 271B: Quantitative Methods II. Lecture Notes

PS 271B: Quantitative Methods II. Lecture Notes PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.

More information

Gaussian Conjugate Prior Cheat Sheet

Gaussian Conjugate Prior Cheat Sheet Gaussian Conjugate Prior Cheat Sheet Tom SF Haines 1 Purpose This document contains notes on how to handle the multivariate Gaussian 1 in a Bayesian setting. It focuses on the conjugate prior, its Bayesian

More information

Econometrics Simple Linear Regression

Econometrics Simple Linear Regression Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight

More information

Interpretation of Somers D under four simple models

Interpretation of Somers D under four simple models Interpretation of Somers D under four simple models Roger B. Newson 03 September, 04 Introduction Somers D is an ordinal measure of association introduced by Somers (96)[9]. It can be defined in terms

More information

Longitudinal Data Analysis

Longitudinal Data Analysis Longitudinal Data Analysis Acknowledge: Professor Garrett Fitzmaurice INSTRUCTOR: Rino Bellocco Department of Statistics & Quantitative Methods University of Milano-Bicocca Department of Medical Epidemiology

More information

Missing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random

Missing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random [Leeuw, Edith D. de, and Joop Hox. (2008). Missing Data. Encyclopedia of Survey Research Methods. Retrieved from http://sage-ereference.com/survey/article_n298.html] Missing Data An important indicator

More information

Basics of Statistical Machine Learning

Basics of Statistical Machine Learning CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

The Probit Link Function in Generalized Linear Models for Data Mining Applications

The Probit Link Function in Generalized Linear Models for Data Mining Applications Journal of Modern Applied Statistical Methods Copyright 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 164-169 1538 9472/13/$95.00 The Probit Link Function in Generalized Linear Models for Data Mining Applications

More information

Logit Models for Binary Data

Logit Models for Binary Data Chapter 3 Logit Models for Binary Data We now turn our attention to regression models for dichotomous data, including logistic regression and probit analysis. These models are appropriate when the response

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

More information

Reject Inference in Credit Scoring. Jie-Men Mok

Reject Inference in Credit Scoring. Jie-Men Mok Reject Inference in Credit Scoring Jie-Men Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business

Linear Classification. Volker Tresp Summer 2015

Linear Classification. Volker Tresp Summer 2015 Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

Bayesian Statistics in One Hour. Patrick Lam

Bayesian Statistics in One Hour. Patrick Lam Bayesian Statistics in One Hour Patrick Lam Outline Introduction Bayesian Models Applications Missing Data Hierarchical Models Outline Introduction Bayesian Models Applications Missing Data Hierarchical

Web-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni

Web-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni 1 Web-based Supplementary Materials for Bayesian Effect Estimation Accounting for Adjustment Uncertainty by Chi Wang, Giovanni Parmigiani, and Francesca Dominici In Web Appendix A, we provide detailed

Models for Longitudinal and Clustered Data

Models for Longitudinal and Clustered Data Models for Longitudinal and Clustered Data Germán Rodríguez December 9, 2008, revised December 6, 2012 1 Introduction The most important assumption we have made in this course is that the observations

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Beckman HLM Reading Group: Questions, Answers and Examples Carolyn J. Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Linear Algebra Slide 1 of

Logistic Regression (1/24/13)

Logistic Regression (1/24/13) STA63/CBB540: Statistical methods in computational biology Logistic Regression (1/24/13) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is a model for regression used

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way

CHAPTER 12 EXAMPLES: MONTE CARLO SIMULATION STUDIES

CHAPTER 12 EXAMPLES: MONTE CARLO SIMULATION STUDIES Examples: Monte Carlo Simulation Studies CHAPTER 12 EXAMPLES: MONTE CARLO SIMULATION STUDIES Monte Carlo simulation studies are often used for methodological investigations of the performance of statistical

Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén www.statmodel.com. Table Of Contents

Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén www.statmodel.com. Table Of Contents Mplus Short Courses Topic 2 Regression Analysis, Exploratory Factor Analysis, Confirmatory Factor Analysis, And Structural Equation Modeling For Categorical, Censored, And Count Outcomes Linda K. Muthén

Dr James Roger. GlaxoSmithKline & London School of Hygiene and Tropical Medicine.

Dr James Roger. GlaxoSmithKline & London School of Hygiene and Tropical Medicine. American Statistical Association Biopharm Section Monthly Webinar Series: Sensitivity analyses that address missing data issues in Longitudinal studies for regulatory submission. Dr James Roger. GlaxoSmithKline

Fitting Subject-specific Curves to Grouped Longitudinal Data

Fitting Subject-specific Curves to Grouped Longitudinal Data Fitting Subject-specific Curves to Grouped Longitudinal Data Djeundje, Viani Heriot-Watt University, Department of Actuarial Mathematics & Statistics Edinburgh, EH14 4AS, UK E-mail: vad5@hw.ac.uk Currie,

Credit Risk Models: An Overview

Credit Risk Models: An Overview Credit Risk Models: An Overview Paul Embrechts, Rüdiger Frey, Alexander McNeil ETH Zürich © 2003 (Embrechts, Frey, McNeil) A. Multivariate Models for Portfolio Credit Risk 1. Modelling Dependent Defaults:

13. Poisson Regression Analysis

13. Poisson Regression Analysis 136 Poisson Regression Analysis 13. Poisson Regression Analysis We have so far considered situations where the outcome variable is numeric and Normally distributed, or binary. In clinical work one often

SUMAN DUVVURU STAT 567 PROJECT REPORT

SUMAN DUVVURU STAT 567 PROJECT REPORT SUMAN DUVVURU STAT 567 PROJECT REPORT SURVIVAL ANALYSIS OF HEROIN ADDICTS Background and introduction: Current illicit drug use among teens is continuing to increase in many countries around the world.

LOGIT AND PROBIT ANALYSIS

LOGIT AND PROBIT ANALYSIS LOGIT AND PROBIT ANALYSIS A.K. Vasisht I.A.S.R.I., Library Avenue, New Delhi 110 012 amitvasisht@iasri.res.in In dummy regression variable models, it is assumed implicitly that the dependent variable Y

Covariance and Correlation

Covariance and Correlation Covariance and Correlation (© Robert J. Serfling. Not for reproduction or distribution) We have seen how to summarize a data-based relative frequency distribution by measures of location and spread, such

Imputation of missing data under missing not at random assumption & sensitivity analysis

Imputation of missing data under missing not at random assumption & sensitivity analysis Imputation of missing data under missing not at random assumption & sensitivity analysis S. Jolani Department of Methodology and Statistics, Utrecht University, the Netherlands Advanced Multiple Imputation,

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

Introduction to Fixed Effects Methods

Introduction to Fixed Effects Methods Introduction to Fixed Effects Methods 1 1.1 The Promise of Fixed Effects for Nonexperimental Research... 1 1.2 The Paired-Comparisons t-test as a Fixed Effects Method... 2 1.3 Costs and Benefits of Fixed

FIXED EFFECTS AND RELATED ESTIMATORS FOR CORRELATED RANDOM COEFFICIENT AND TREATMENT EFFECT PANEL DATA MODELS

FIXED EFFECTS AND RELATED ESTIMATORS FOR CORRELATED RANDOM COEFFICIENT AND TREATMENT EFFECT PANEL DATA MODELS FIXED EFFECTS AND RELATED ESTIMATORS FOR CORRELATED RANDOM COEFFICIENT AND TREATMENT EFFECT PANEL DATA MODELS Jeffrey M. Wooldridge Department of Economics Michigan State University East Lansing, MI 48824-1038

TUTORIAL IN BIOSTATISTICS Handling drop-out in longitudinal studies

TUTORIAL IN BIOSTATISTICS Handling drop-out in longitudinal studies STATISTICS IN MEDICINE Statist. Med. 2004; 23:1455–1497 (DOI: 10.1002/sim.1728) TUTORIAL IN BIOSTATISTICS Handling drop-out in longitudinal studies Joseph W. Hogan, Jason Roy and Christina Korkontzelou

An Application of the G-formula to Asbestos and Lung Cancer. Stephen R. Cole. Epidemiology, UNC Chapel Hill. Slides: www.unc.

An Application of the G-formula to Asbestos and Lung Cancer. Stephen R. Cole. Epidemiology, UNC Chapel Hill. Slides: www.unc. An Application of the G-formula to Asbestos and Lung Cancer Stephen R. Cole Epidemiology, UNC Chapel Hill Slides: www.unc.edu/~colesr/ 1 Acknowledgements Collaboration with David B. Richardson, Haitao

Regression III: Advanced Methods

Regression III: Advanced Methods Lecture 16: Generalized Additive Models Regression III: Advanced Methods Bill Jacoby Michigan State University http://polisci.msu.edu/jacoby/icpsr/regress3 Goals of the Lecture Introduce Additive Models

Department of Epidemiology and Public Health Miller School of Medicine University of Miami

Department of Epidemiology and Public Health Miller School of Medicine University of Miami Department of Epidemiology and Public Health Miller School of Medicine University of Miami BST 630 (3 Credit Hours) Longitudinal and Multilevel Data Wednesday-Friday 9:00 10:15PM Course Location: CRB 995

Marketing Mix Modelling and Big Data P. M Cain

Marketing Mix Modelling and Big Data P. M Cain 1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

What's New in Econometrics? Lecture 8 Cluster and Stratified Sampling

What's New in Econometrics? Lecture 8 Cluster and Stratified Sampling What's New in Econometrics? Lecture 8 Cluster and Stratified Sampling Jeff Wooldridge NBER Summer Institute, 2007 1. The Linear Model with Cluster Effects 2. Estimation with a Small Number of Groups and

Applied Missing Data Analysis in the Health Sciences. Statistics in Practice

Applied Missing Data Analysis in the Health Sciences. Statistics in Practice Brochure More information from http://www.researchandmarkets.com/reports/2741464/ Applied Missing Data Analysis in the Health Sciences. Statistics in Practice Description: A modern and practical guide

Analyzing Structural Equation Models With Missing Data

Analyzing Structural Equation Models With Missing Data Analyzing Structural Equation Models With Missing Data Craig Enders* Arizona State University cenders@asu.edu based on Enders, C. K. (2006). Analyzing structural equation models with missing data. In G.

Organizing Your Approach to a Data Analysis

Organizing Your Approach to a Data Analysis Biost/Stat 578 B: Data Analysis Emerson, September 29, 2003 Handout #1 Organizing Your Approach to a Data Analysis The general theme should be to maximize thinking about the data analysis and to minimize

ANNUITY LAPSE RATE MODELING: TOBIT OR NOT TOBIT? 1. INTRODUCTION

ANNUITY LAPSE RATE MODELING: TOBIT OR NOT TOBIT? 1. INTRODUCTION ANNUITY LAPSE RATE MODELING: TOBIT OR NOT TOBIT? SAMUEL H. COX AND YIJIA LIN ABSTRACT. We devise an approach, using tobit models for modeling annuity lapse rates. The approach is based on data provided

INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition)

INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) INDIRECT INFERENCE (prepared for: The New Palgrave Dictionary of Economics, Second Edition) Abstract Indirect inference is a simulation-based method for estimating the parameters of economic models. Its

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015.

Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015. Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015.. Show that the function 0 for x < x+ F (x) = 4 for x < for x

Markov Chain Monte Carlo Simulation Made Simple

Markov Chain Monte Carlo Simulation Made Simple Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April 2, 2003 1 Markov Chain Monte Carlo (MCMC) simulation is a powerful technique to perform numerical

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data

Sensitivity analysis of longitudinal binary data with non-monotone missing values

Sensitivity analysis of longitudinal binary data with non-monotone missing values Biostatistics (2004), 5, 4, pp. 531–544 doi: 10.1093/biostatistics/kxh006 Sensitivity analysis of longitudinal binary data with non-monotone missing values PASCAL MININI Laboratoire GlaxoSmithKline, Unité Méthodologie

1 Teaching notes on GMM 1.

1 Teaching notes on GMM 1. Bent E. Sørensen January 23, 2007 1 Teaching notes on GMM 1. Generalized Method of Moment (GMM) estimation is one of two developments in econometrics in the 80ies that revolutionized empirical work in

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm

Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

Multiple Choice Models II

Multiple Choice Models II Multiple Choice Models II Laura Magazzini University of Verona laura.magazzini@univr.it http://dse.univr.it/magazzini Laura Magazzini (@univr.it) Multiple Choice Models II 1 / 28 Categorical data Categorical

arxiv:1301.2490v1 [stat.ap] 11 Jan 2013

arxiv:1301.2490v1 [stat.ap] 11 Jan 2013 The Annals of Applied Statistics 2012, Vol. 6, No. 4, 1814–1837 DOI: 10.1214/12-AOAS555 © Institute of Mathematical Statistics, 2012 ADDRESSING MISSING DATA MECHANISM

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

Bayesian Adaptive Designs for Early-Phase Oncology Trials

Bayesian Adaptive Designs for Early-Phase Oncology Trials The University of Hong Kong 1 Bayesian Adaptive Designs for Early-Phase Oncology Trials Associate Professor Department of Statistics & Actuarial Science The University of Hong Kong The University of Hong

A LONGITUDINAL AND SURVIVAL MODEL WITH HEALTH CARE USAGE FOR INSURED ELDERLY. Workshop

A LONGITUDINAL AND SURVIVAL MODEL WITH HEALTH CARE USAGE FOR INSURED ELDERLY. Workshop A LONGITUDINAL AND SURVIVAL MODEL WITH HEALTH CARE USAGE FOR INSURED ELDERLY Ramon Alemany Montserrat Guillén Xavier Piulachs Lozada Riskcenter - IREA Universitat de Barcelona http://www.ub.edu/riskcenter

Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out

Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out Sandra Taylor, Ph.D. IDDRC BBRD Core 23 April 2014 Objectives Baseline Adjustment Introduce approaches Guidance

II. DISTRIBUTIONS distribution normal distribution. standard scores

II. DISTRIBUTIONS distribution normal distribution. standard scores Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,

Multilevel Models for Longitudinal Data. Fiona Steele

Multilevel Models for Longitudinal Data. Fiona Steele Multilevel Models for Longitudinal Data Fiona Steele Aims of Talk Overview of the application of multilevel (random effects) models in longitudinal research, with examples from social research Particular

Multivariate Analysis (Slides 13)

Multivariate Analysis (Slides 13) Multivariate Analysis (Slides 13) The final topic we consider is Factor Analysis. A Factor Analysis is a mathematical approach for attempting to explain the correlation between a large set of variables

1 Prior Probability and Posterior Probability

1 Prior Probability and Posterior Probability Math 541: Statistical Theory II Bayesian Approach to Parameter Estimation Lecturer: Songfeng Zheng 1 Prior Probability and Posterior Probability Consider now a problem of statistical inference in which

How To Model The Fate Of An Animal

How To Model The Fate Of An Animal Models Where the Fate of Every Individual is Known This class of models is important because they provide a theory for estimation of survival probability and other parameters from radio-tagged animals.

Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University

Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University 1 Outline Missing data definitions Longitudinal data specific issues Methods Simple methods Multiple

Missing data in randomized controlled trials (RCTs) can

Missing data in randomized controlled trials (RCTs) can EVALUATION TECHNICAL ASSISTANCE BRIEF for OAH & ACYF Teenage Pregnancy Prevention Grantees May 2013 Brief 3 Coping with Missing Data in Randomized Controlled Trials Missing data in randomized controlled

Guideline on missing data in confirmatory clinical trials

Guideline on missing data in confirmatory clinical trials 2 July 2010 EMA/CPMP/EWP/1776/99 Rev. 1 Committee for Medicinal Products for Human Use (CHMP) Guideline on missing data in confirmatory clinical trials Discussion in the Efficacy Working Party June 1999/

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review 15.062 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

Dealing with Missing Data

Dealing with Missing Data Dealing with Missing Data Roch Giorgi email: roch.giorgi@univ-amu.fr UMR 912 SESSTIM, Aix Marseille Université / INSERM / IRD, Marseille, France BioSTIC, APHM, Hôpital Timone, Marseille, France January

UNDERSTANDING THE TWO-WAY ANOVA

UNDERSTANDING THE TWO-WAY ANOVA UNDERSTANDING THE TWO-WAY ANOVA We have seen how the one-way ANOVA can be used to compare two or more sample means in studies involving a single independent variable. This can be extended to two independent variables

A hidden Markov model for criminal behaviour classification

A hidden Markov model for criminal behaviour classification RSS2004 p.1/19 A hidden Markov model for criminal behaviour classification Francesco Bartolucci, Institute of economic sciences, Urbino University, Italy. Fulvia Pennoni, Department of Statistics, University

Module 3: Correlation and Covariance

Module 3: Correlation and Covariance Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašić University of Delaware Sarajevo Graduate School of Business Often our interest in data analysis

SAMPLE SELECTION BIAS IN CREDIT SCORING MODELS

SAMPLE SELECTION BIAS IN CREDIT SCORING MODELS SAMPLE SELECTION BIAS IN CREDIT SCORING MODELS John Banasik, Jonathan Crook Credit Research Centre, University of Edinburgh Lyn Thomas University of Southampton ssm0 The Problem We wish to estimate an
