Analysis of Correlated Data. Patrick J. Heagerty PhD Department of Biostatistics University of Washington




Course Outline

- Examples of longitudinal data
- Correlation and weighting
- Exploratory data analysis
  - between- and within-person variation
  - correlation / covariance
- Regression analysis
  - Models for the mean
  - Models for the correlation
  - Linear mixed model
  - GEE

Course Outline (continued)

- Missing data
- Binary and count data
  - Models for the mean
  - Models for the correlation
  - Generalized linear mixed model
  - GEE
  - Transition models
- Time-dependent covariates
- Multilevel models
- Design considerations

Bio & Notes

Patrick J Heagerty, Professor, University of Washington
Collaborative roles: RWJ, VA ERIC, NIAMS MCRC, K

Books:
- Diggle, Heagerty, Liang & Zeger, Analysis of Longitudinal Data, Oxford, 2002
- van Belle, Fisher, Heagerty & Lumley, Biostatistics, Wiley, 2004 (introductory chapter on LDA)

Course Notes & Slides

- UW Biostat 57 = PhD applied core sequence; Winter 1999 and later offerings
- UM Epi 766 = Longitudinal Data Analysis / Epi Summer (with VA/UW Biostat/Epi)
- Second Seattle Symposium (with S Zeger), Fall
- RAND short course, Fall
- NICHD short course, Fall 2003

Longitudinal Data Analysis

INTRODUCTION to EXAMPLES AND ISSUES

Outline

- Examples of longitudinal data
  - between- and within-person variation
  - correlation / covariance
- Scientific motivation
  - Opportunities
  - Issues
- Time scales
  - Cross-sectional contrasts
  - Longitudinal contrasts
- Correlation and weighting
  - Impact of correlation
  - Weighted estimation

Continuous Longitudinal Data

Example 1: Cystic Fibrosis and Lung Function

- There is a large registry of cystic fibrosis patient data
- Annual measurements include standard pulmonary function measures: FVC, FEV1
- primary outcome: FEV1 percent predicted
- covariates: age, gender, genotype

Q: Does change in lung function differ by gender and/or genotype?

[Figure: FEV1 percent predicted versus age for the cystic fibrosis registry data]

Continuous Longitudinal Data

Example 2: Beta-carotene and Vitamin E

- a phase II study to ascertain the pharmacokinetics of beta-carotene supplementation and the subsequent impact on vitamin E levels
- primary outcome: plasma measures taken monthly for 3 months prior to, 9 months during, and 3 months after supplementation
- covariates: dose (0, 15, 30, 45, 60 mg/day) and time

Q: What is the time course? Dose-response? Relationship between beta-carotene and vitamin E?

[Figure: beta-carotene (solid) and vitamin E (dashed) versus months since randomization for eight subjects at dose 45 mg/day]

[Figure: the same eight subjects at dose 45 mg/day, with beta-carotene plotted on the log scale]

Categorical Longitudinal Data

Example 3: Maternal Stress and Child Morbidity

- daily indicators of stress (maternal) and illness (child)
- primary outcome: illness, utilization
- covariates: employment, stress

Q: association between employment, stress and morbidity?


[Figure: percent reporting illness and percent reporting stress by day, separately for employed and unemployed mothers]

[Figure: daily illness (child) and stress (maternal) indicator series for a sample of individual mother-child pairs]

Continuous Longitudinal Data

Example 4: PANSS Data

- PANSS is a standard symptom assessment for schizophrenic patients
- This study compares different doses of a new agent to a standard agent and to placebo
- primary outcome: PANSS
- covariates: treatment, time

Q: What's the best treatment?

[Figure: PANSS score versus time, with subjects grouped by time of last visit (dropout time)]

Longitudinal Data

In longitudinal studies of health, we typically observe two distinct kinds of outcomes:

- Times of clinical or other key events
- Repeated values of markers of the health status of participants

In general terms, the scientific question is how explanatory variables affect the times to clinical events and the level of, or change in, health-status markers over time.

The relationship between the event times and the markers can also be of interest, e.g. using markers as predictors of, or surrogates for, the clinical event time.

Longitudinal Studies

Benefits of longitudinal studies:

1. Incident events are recorded
- Measure the new occurrence of disease
- Timing of disease onset can be correlated with recent changes in patient exposure and/or with chronic exposure

2. Prospective ascertainment of exposure
- Participants can have their exposure status recorded at multiple follow-up visits
- This can alleviate recall bias
- Temporal order of exposures and outcomes is observed

3. Measurement of individual change in outcomes
- A key strength of a longitudinal study is the ability to measure change in outcomes and/or exposure at the individual level
- Longitudinal studies provide the opportunity to observe individual patterns of change

4. Separation of time effects: Cohort, Period, Age
- When studying change over time there are many time scales to consider:
  - cohort scale is the time of birth, such as 1945 or 1963
  - period is the current time, such as 2004
  - age is (period - cohort)
- A longitudinal study with times t_1, t_2, ..., t_n can characterize multiple time scales, such as age and cohort effects, using covariates derived from the calendar time and birth year: the age of subject i at time t_j is age_ij = (t_j - birth_i), and cohort is cohort_ij = birth_i

5. Control for cohort effects
- In a cross-sectional study the comparison of subgroups of different ages combines the effects of aging and the effects of different cohorts
- That is, a comparison of outcomes measured in 2003 among 58-year-old subjects and among 40-year-old subjects reflects both the fact that the groups differ by 18 years (aging) and the fact that the subjects were born in different eras
- In a longitudinal study the cohort under study is fixed, and thus changes in time are not confounded by cohort differences

A nice overview of LDA opportunities in respiratory epidemiology is presented in Weiss and Ware (1996); Lebowitz (1996) discusses age, period, and cohort effects.

Longitudinal Studies

The benefits of a longitudinal design are not without cost. There are several challenges posed.

Challenges of longitudinal studies:

1. Participant follow-up
- Risk of bias due to incomplete follow-up, or drop-out of study participants
- If subjects who are followed to the planned end of study differ from subjects who discontinue follow-up, then a naive analysis may provide summaries that are not representative of the original target population

2. Analysis of correlated data
- Statistical analysis of longitudinal data requires methods that can properly account for the intra-subject correlation of response measurements
- If such correlation is ignored, then inferences such as statistical tests or confidence intervals can be grossly invalid

3. Time-varying covariates
- Although longitudinal designs offer the opportunity to associate changes in exposure with changes in the outcome of interest, the direction of causality can be complicated by feedback between the outcome and the exposure
- Example: MSCM with stress and illness
- Although scientific interest generally lies in the effect of exposure on health, reciprocal influence between exposure and outcome poses analytical difficulty when trying to separate the effect of exposure on health from the effect of health on exposure
- How to choose the exposure lag? e.g. Is it the air pollution today, yesterday, or last week that is the important predictor of morbidity today?

Longitudinal Studies

The Scientific Opportunity
- Observe individual changes over time
- Characterize the time-course of disease

Outcome Measures
- A single outcome at a fixed follow-up time
- The time until an event occurs
- Repeated measures taken over time

Motivation

Cystic Fibrosis and Pulmonary Function

Several specific aspects are of interest:
1. What is the rate of decline in FEV1?
2. Is the time course different for males and females?
3. Is the time course different for F508 homozygous subjects?

Reference: Davis PB (1997) Journal of Pediatrics

Data

ID = patient id
FEV1 = percent-predicted forced expiratory volume in 1 second
AGE = age (years)
GENDER = sex (=male, =female)
PSEUDOA = infection with Pseudomonas aeruginosa (=no, 3=yes)
F508 = genotype (1=homozygous, 2=heterozygous, 3=none)
PANCREAT = pancreatic enzyme supplementation (=no, =yes)

EDA: Numerical Summaries

[Table: total number of subjects; distribution of the number of observations per subject (n_i); male/female counts; number of F508 mutations]

EDA: Numerical Summaries

[Stem-and-leaf display of age at entry, with median and quartiles]

[Figures: individual FEV1 trajectories versus age for a sample of subjects, with Pseudomonas aeruginosa infection (PA) and pancreatic enzyme supplementation (PC) status indicated]

[Figure: FEV1 versus age, with overall fitted slope]

[Figure: FEV1 versus age-at-entry (cross-sectional slope) and FEV1 change versus age-since-entry (longitudinal slope)]

Distinguishing Cross-sectional and Longitudinal Associations

Cross-sectional data:

Y_i1 = β_C x_i1 + ε_i1,   i = 1, ..., m    (1)

β_C represents the difference in average Y across two sub-populations which differ by one unit in x.

Longitudinal data:

Y_ij = β_C x_i1 + β_L (x_ij - x_i1) + ε_ij,   j = 1, ..., n_i;  i = 1, ..., m    (2)

When j = 1, the two equations are the same; β_C has the same cross-sectional interpretation.

Subtract the equations above to obtain:

(Y_ij - Y_i1) = β_L (x_ij - x_i1) + (ε_ij - ε_i1)

β_L represents the expected change in Y per unit change in x.
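The decomposition above can be checked numerically. The sketch below simulates from model (2) and recovers β_C and β_L by least squares on the baseline covariate and the within-subject change; all numerical values are illustrative, not from the CF data.

```python
import numpy as np

# Simulate data from model (2) and recover beta_C and beta_L by
# regressing Y_ij on the baseline covariate x_i1 and the
# within-subject change (x_ij - x_i1).  All values are illustrative.
rng = np.random.default_rng(0)
m, n_obs = 500, 4             # subjects, observations per subject
beta_C, beta_L = 2.0, -1.0    # true cross-sectional and longitudinal effects

x1 = rng.normal(10, 2, size=m)         # baseline covariate x_i1
x = x1[:, None] + np.arange(n_obs)     # covariate increases over follow-up
eps = rng.normal(0, 1, size=(m, n_obs))
y = beta_C * x1[:, None] + beta_L * (x - x1[:, None]) + eps

# Two-column design: baseline value and within-subject change
X = np.column_stack([np.repeat(x1, n_obs), (x - x1[:, None]).ravel()])
coef, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
print(coef)   # approximately [2.0, -1.0]
```

Fitting only the pooled covariate x_ij (ignoring the decomposition) would mix the two effects; the two-column design separates them.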

[Figure: cross-sectional data, reading score versus age]

[Figure: response versus months, read cross-sectionally, longitudinally, and both]

[Figures: change in FEV1 versus age, by male/female and by F508 genotype group]

EDA Summary

Observations:
- Systematic trends: time, gender, F508
- Random variation: individual, observation

Questions:
- Two time scales?
- Estimation / testing for rates of decline?
- Other?

Longitudinal Data Analysis

INTRODUCTION to CORRELATION and WEIGHTING

Longitudinal Data

The basic statistical problem is that variables from a given individual are correlated over time.

(generic) Q: So what?

(-) ignoring dependence can lead to invalid inference
(-) often limited information regarding dependence
(+) can observe change for individuals over time
(+) variety of statistical approaches are available

Longitudinal Data

We need to account for the dependence.

(generic) Q: How?

1. Choice of Model
2. Choice of Estimator
3. Choice of Summaries

Dependent Data and Proper Variance Estimates

Let X_ij = 0 denote placebo assignment and X_ij = 1 denote active treatment.

(1) Consider (Y_i1, Y_i2) with (X_i1, X_i2) = (0, 0) for i = 1:n and (X_i1, X_i2) = (1, 1) for i = (n+1):2n.

μ̂_0 = (1/(2n)) Σ_{i=1}^{n} Σ_{j=1}^{2} Y_ij

μ̂_1 = (1/(2n)) Σ_{i=n+1}^{2n} Σ_{j=1}^{2} Y_ij

var(μ̂_1 - μ̂_0) = (1/n) {σ² (1 + ρ)}

Scenario 1

subject    control                    treatment
ID = 1     Y_1,1 (t1), Y_1,2 (t2)
ID = 2     Y_2,1 (t1), Y_2,2 (t2)
ID = 3     Y_3,1 (t1), Y_3,2 (t2)
ID = 4                                Y_4,1 (t1), Y_4,2 (t2)
ID = 5                                Y_5,1 (t1), Y_5,2 (t2)
ID = 6                                Y_6,1 (t1), Y_6,2 (t2)

Dependent Data and Proper Variance Estimates

(2) Consider (Y_i1, Y_i2) with (X_i1, X_i2) = (0, 1) for i = 1:n and (X_i1, X_i2) = (1, 0) for i = (n+1):2n.

μ̂_0 = (1/(2n)) { Σ_{i=1}^{n} Y_i1 + Σ_{i=n+1}^{2n} Y_i2 }

μ̂_1 = (1/(2n)) { Σ_{i=1}^{n} Y_i2 + Σ_{i=n+1}^{2n} Y_i1 }

var(μ̂_1 - μ̂_0) = (1/n) {σ² (1 - ρ)}

Scenario 2

subject    control        treatment
ID = 1     Y_1,1 (t1)     Y_1,2 (t2)
ID = 2     Y_2,1 (t1)     Y_2,2 (t2)
ID = 3     Y_3,1 (t1)     Y_3,2 (t2)
ID = 4     Y_4,2 (t2)     Y_4,1 (t1)
ID = 5     Y_5,2 (t2)     Y_5,1 (t1)
ID = 6     Y_6,2 (t2)     Y_6,1 (t1)

Dependent Data and Proper Variance Estimates

If we simply had 2n independent observations on treatment (X = 1) and 2n independent observations on control, then we would obtain:

var(μ̂_1 - μ̂_0) = σ²/(2n) + σ²/(2n) = (1/n) σ²

Q: What is the impact of dependence relative to the situation where all (2n + 2n) observations are independent?

(1) positive dependence, ρ > 0, results in a loss of precision (scenario 1)
(2) positive dependence, ρ > 0, results in an improvement in precision (scenario 2)!

Therefore:

1. Dependent data impacts proper statements of precision
2. Dependent data may increase or decrease standard errors depending on the design
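The two variance formulas above can be verified by Monte Carlo. A minimal sketch, with illustrative choices of n, σ, and ρ (not values from the slides):

```python
import numpy as np

# Monte Carlo check: var(mu1_hat - mu0_hat) = (sigma^2/n)(1 + rho) for the
# parallel-groups design, and (sigma^2/n)(1 - rho) for the crossover.
rng = np.random.default_rng(1)
n, sigma, rho, reps = 20, 1.0, 0.6, 20000
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])

# Y has shape (reps, 2n, 2): replicated studies, 2n subjects,
# each with correlated measurements at times 1 and 2.
Y = rng.multivariate_normal([0.0, 0.0], cov, size=(reps, 2 * n))

# Scenario 1: subjects 1..n on control at both times, n+1..2n on treatment
mu0 = Y[:, :n, :].mean(axis=(1, 2))
mu1 = Y[:, n:, :].mean(axis=(1, 2))
diffs_par = mu1 - mu0

# Scenario 2 (crossover): subjects 1..n control then treatment, rest reversed
mu0c = (Y[:, :n, 0].sum(axis=1) + Y[:, n:, 1].sum(axis=1)) / (2 * n)
mu1c = (Y[:, :n, 1].sum(axis=1) + Y[:, n:, 0].sum(axis=1)) / (2 * n)
diffs_cross = mu1c - mu0c

print(np.var(diffs_par), sigma**2 / n * (1 + rho))    # both near 0.08
print(np.var(diffs_cross), sigma**2 / n * (1 - rho))  # both near 0.02
```

With ρ = 0.6 the crossover design's variance is one quarter of the parallel design's, illustrating point 2 above.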

Weighted Estimation

Consider the situation where subjects report both the number of attempts and the number of successes: (Y_i, N_i)

Examples:
- live born (Y_i) in a litter (N_i)
- condoms used (Y_i) in sexual encounters (N_i)
- SAEs (Y_i) among total surgeries (N_i)

Q: How do we combine these data from i = 1:m subjects to estimate a common rate (proportion) of successes?

Weighted Estimation

Proposal 1:  p̂_1 = Σ_i Y_i / Σ_i N_i

Proposal 2:  p̂_2 = (1/m) Σ_i (Y_i / N_i)
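A small worked example of the two proposals, using hypothetical (Y_i, N_i) values (subject 1: 1 success in 10 attempts; subject 2: 4 in 5):

```python
# Two pooling proposals on hypothetical success/attempt data.
Y = [1, 4]   # successes Y_i
N = [10, 5]  # attempts  N_i

p1 = sum(Y) / sum(N)                             # pool all attempts
p2 = sum(y / n for y, n in zip(Y, N)) / len(Y)   # average subject-level rates
print(p1, p2)   # 0.333..., 0.45
```

The estimators disagree whenever subjects with different rates contribute different numbers of attempts, which is what motivates the weighting discussion that follows.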

Weighted Estimation

Note: Each of these estimators, p̂_1 and p̂_2, can be viewed as a weighted estimator of the form:

p̂_w = { Σ_i w_i (Y_i / N_i) } / { Σ_i w_i }

We obtain p̂_1 by letting w_i = N_i, corresponding to equal weight given to each binary outcome Y_ij, where Y_i = Σ_{j=1}^{N_i} Y_ij.

We obtain p̂_2 by letting w_i = 1, corresponding to equal weight given to each subject.

Q: What's optimal?

Weighted Estimation

A: Whatever weights are closest to 1/variance of (Y_i / N_i) (statistical theory: Gauss-Markov).

If subjects are perfectly homogeneous, then

V(Y_i) = N_i p(1 - p)

and p̂_1 is best.

If subjects are heterogeneous then, for example,

V(Y_i) = N_i p(1 - p) {1 + (N_i - 1) ρ}

and an estimator closer to p̂_2 is best.

Summary

1. Longitudinal (dependent) data are common (and interesting!)
2. Inference must account for the dependence
3. The choice of weighting will depend on the variance/covariance of the response variables

Summary

Methodological Issues in the Study of Cognitive Decline, Morris et al., AJE (1999):

- In modeling associations between risk factors and change in cognitive function, the usual analytic issues are complicated by the need to consider time in the analysis.
- The potential for both fruitful investigation and misinterpretation of the data is increased.
- Analyses that make the most effective use of the data usually require the most advanced methods of longitudinal analysis.

Summary

Methodological Issues in the Study of Cognitive Decline, Morris et al., AJE (1999):

- Use of the longitudinal designs and advanced statistical models described here is well worth the extra effort required.
- Meeting the special challenges of measuring change in cognitive function should improve our ability to identify risk factors for cognitive decline and other diseases related to aging.
- Many of these issues apply to any study of disease processes entailing change with age.

Some References: Books

- Diggle PJ, Heagerty PJ, Liang K-Y, Zeger SL (2002) Analysis of Longitudinal Data, Second Edition. Oxford University Press.
- Fitzmaurice GM, Laird NM, Ware JH (2004) Applied Longitudinal Analysis. Wiley.
- Verbeke G, Molenberghs G (2000) Linear Mixed Models for Longitudinal Data. Springer.
- Singer JD, Willett JB (2003) Applied Longitudinal Data Analysis. Oxford University Press.

Longitudinal Data Analysis

INTRODUCTION to REGRESSION APPROACHES

Linear Mixed Model

- Regression model: the mean response as a function of covariates (systematic variation)
- Random effects: variation from subject to subject in trajectory (random between-subject variation)
- Within-subject variation: variation of individual observations over time (random within-subject variation)

Scientific Questions as Regression

Questions concerning the rate of decline refer to the time slope for FEV1:

E[FEV1 | X = age, gender, f508] = β_0(X) + β_1(X) · time

Time Scales:
- Let age = age-at-entry, age_i1
- Let agel = time-since-entry, age_ij - age_i1

CF Regression Model

Model:

E[FEV1 | X_i] = β_0 + β_1 · age + β_2 · agel
              + β_3 · female + β_4 · 1{f508 = 2} + β_5 · 1{f508 = 3}
              + β_6 · female · agel + β_7 · 1{f508 = 2} · agel + β_8 · 1{f508 = 3} · agel
            = β_0(X_i) + β_1(X_i) · agel

Intercept:

          f508=1                  f508=2                        f508=3
male      β_0 + β_1·age           β_0 + β_1·age + β_4           β_0 + β_1·age + β_5
female    β_0 + β_1·age + β_3     β_0 + β_1·age + β_3 + β_4     β_0 + β_1·age + β_3 + β_5

Slope:

          f508=1          f508=2              f508=3
male      β_2             β_2 + β_7           β_2 + β_8
female    β_2 + β_6       β_2 + β_6 + β_7     β_2 + β_6 + β_8
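The way β_0 through β_8 combine into group-specific intercepts and slopes can be made concrete in code. This sketch uses indicator coding 1{f508 = 2} and 1{f508 = 3} with f508 = 1 as the reference group (one plausible reading of the slide's coding), and the coefficient values are invented purely for illustration, not fitted estimates:

```python
# Hypothetical coefficients: NOT fitted values, chosen only to show
# how the beta's combine across gender and genotype groups.
beta = dict(b0=90.0, b1=-1.0, b2=-2.0, b3=-3.0, b4=5.0,
            b5=8.0, b6=-0.5, b7=1.0, b8=2.0)

def intercept(age, female, f508, b=beta):
    """beta_0 + beta_1*age + beta_3*female + beta_4*1{f508==2} + beta_5*1{f508==3}"""
    return (b["b0"] + b["b1"] * age + b["b3"] * female
            + b["b4"] * (f508 == 2) + b["b5"] * (f508 == 3))

def slope(female, f508, b=beta):
    """beta_2 + beta_6*female + beta_7*1{f508==2} + beta_8*1{f508==3}"""
    return (b["b2"] + b["b6"] * female
            + b["b7"] * (f508 == 2) + b["b8"] * (f508 == 3))

# e.g. a female, f508 == 2 subject entering at age 10:
print(intercept(10, 1, 2), slope(1, 2))  # 82.0 -1.5
```

Each cell of the two tables above corresponds to one call of these functions.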

[Figure: fitted FEV1 versus age for gender groups (f508=1), male versus female]

[Figure: fitted FEV1 versus age for genotype groups (male), f508=1, 2, 3]

Define:

Y_ij = FEV1 for subject i at time t_ij
X_i = (X_i1, ..., X_i,n_i)
X_ij = (X_ij,1, X_ij,2, ..., X_ij,p) = age, agel, gender, genotype

Issue: Response variables measured on the same subject are correlated:

cov(Y_ij, Y_ik) ≠ 0

Some Notation

It is useful to have notation for the stack of data that corresponds to each subject.

Let n_i denote the number of observations for subject i. Define the response vector:

Y_i = (Y_i1, Y_i2, ..., Y_i,n_i)ᵀ

If the subjects are observed at a common set of times t_1, t_2, ..., t_m, then E(Y_ij) = μ_j denotes the mean of the population at time t_j.

Dependence and Correlation

Recall that observations are termed independent when deviation in one variable does not predict deviation in the other variable.

- Given two subjects with the same age and gender, the blood pressure of one patient is not predictive of the blood pressure of the other.

Observations are called dependent, or correlated, when one variable does predict the value of another variable.

- The LDL cholesterol of a given patient at age 57 is predictive of that same patient's LDL cholesterol at a later age.

Dependence and Correlation

Recall: The variance of a variable Y_ij (fix time t_j for now) is defined as:

σ_j² = E[(Y_ij - μ_j)²] = E[(Y_ij - μ_j)(Y_ij - μ_j)]

The variance measures the average squared distance of an observation from the mean.

Dependence and Correlation

Define: The covariance of two variables Y_ij and Y_ik (fix t_j and t_k) is:

σ_jk = E[(Y_ij - μ_j)(Y_ik - μ_k)]

The covariance measures whether, on average, departures in one variable, Y_ij - μ_j, go together with departures in a second variable, Y_ik - μ_k.

In the simple linear regression of Y_ij on Y_ik, the regression coefficient β_1 in E(Y_ij | Y_ik) = β_0 + β_1 Y_ik is the covariance divided by the variance of Y_ik:

β_1 = σ_jk / σ_k²
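The slope/covariance relationship can be verified numerically. A sketch with an illustrative 2x2 covariance matrix (any positive-definite choice works):

```python
import numpy as np

# Check that the least-squares slope of Y_ij on Y_ik equals
# sigma_jk / sigma_k^2, using simulated correlated pairs.
rng = np.random.default_rng(2)
cov = np.array([[2.0, 1.2],
                [1.2, 1.5]])   # var(Y_ik)=2.0, cov=1.2, var(Y_ij)=1.5
yk, yj = rng.multivariate_normal([0.0, 0.0], cov, size=100_000).T

slope_hat = np.polyfit(yk, yj, 1)[0]     # fitted beta_1 from yj ~ yk
print(slope_hat, cov[0, 1] / cov[0, 0])  # both approximately 0.6
```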

Dependence and Correlation

Define: The correlation of two variables Y_ij and Y_ik (fix t_j and t_k) is:

ρ_jk = E[(Y_ij - μ_j)(Y_ik - μ_k)] / (σ_j σ_k)

The correlation is a measure of dependence that takes values between -1 and +1.

Recall that a correlation of 0 implies that the two measures are unrelated (linearly).

Recall that a correlation of 1 implies that the two measures fall perfectly on a line: one exactly predicts the other!

Why interest in covariance and/or correlation?

Recall that in the variance calculations above, the standard error for the sample mean difference μ̂_1 - μ̂_0 depends on ρ.

In general, a statistical model for the outcomes Y_i = (Y_i1, Y_i2, ..., Y_i,n_i) requires the following:

1. Means: μ_j
2. Variances: σ_j²
3. Covariances: σ_jk, or correlations ρ_jk

Therefore, one approach to making inferences based on longitudinal data is to construct a model for each of these three components.

Something new to model

cov(Y_i) =

[ var(Y_i1)            cov(Y_i1, Y_i2)    ...    cov(Y_i1, Y_i,n_i) ]
[ cov(Y_i2, Y_i1)      var(Y_i2)          ...    cov(Y_i2, Y_i,n_i) ]
[ ...                                                               ]
[ cov(Y_i,n_i, Y_i1)   cov(Y_i,n_i, Y_i2) ...    var(Y_i,n_i)       ]

=

[ σ_1²                  ρ_12 σ_1 σ_2          ...    ρ_1,n_i σ_1 σ_n_i ]
[ ρ_12 σ_1 σ_2          σ_2²                  ...    ρ_2,n_i σ_2 σ_n_i ]
[ ...                                                                  ]
[ ρ_1,n_i σ_1 σ_n_i     ρ_2,n_i σ_2 σ_n_i     ...    σ_n_i²            ]
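Assembling cov(Y_i) from per-time standard deviations σ_j and correlations ρ_jk is a one-liner; the numbers below are illustrative:

```python
import numpy as np

# Build cov(Y_i) from standard deviations sigma_j and correlations rho_jk.
sd = np.array([4.0, 5.0, 6.0])         # sigma_1, sigma_2, sigma_3
R = np.array([[1.0, 0.8, 0.6],
              [0.8, 1.0, 0.8],
              [0.6, 0.8, 1.0]])        # correlations rho_jk
Sigma = np.outer(sd, sd) * R           # Sigma[j, k] = rho_jk * sigma_j * sigma_k
print(Sigma)
# The diagonal recovers the variances sigma_j^2: 16, 25, 36
```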

Mean and Covariance Models for FEV1

Models:

E(Y_ij | X_i) = μ_ij   (regression)

cov(Y_i | X_i) = Σ_i = Z_i D Z_iᵀ (between subjects) + R_i (within subjects)

Q: What are appropriate covariance models for the FEV1 data?

Wide Data

[Data display: first rows of the wide-format residual matrix rmat, one row per subject and one column per age bin (age 8 through age 24), with NA marking ages at which a subject was not observed]

[Figure: pairwise scatterplots of residuals across the age bins age 8 through age 24]

[Tables: empirical covariance matrix and empirical correlation matrix of the residuals across age bins]

[Table: number of observation pairs available for each pair of age bins]
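Displays like the three above can be computed from a wide residual array with missing visits. A sketch, assuming pandas is available; the residual values here are fabricated, not the CF data:

```python
import numpy as np
import pandas as pd

# Wide format: rows = subjects, columns = age bins, NaN = no visit.
rmat = pd.DataFrame([[ 0.3,    0.9, np.nan],
                     [ 1.0,    1.2,  0.8],
                     [np.nan, -0.4, -0.3],
                     [ 0.5, np.nan,  0.2],
                     [-0.2,    0.1,  0.4]],
                    columns=["age8", "age10", "age12"])

print(rmat.cov())    # empirical covariances, pairwise-complete
print(rmat.corr())   # empirical correlations, pairwise-complete
obs = rmat.notna().astype(int)
pairs = obs.T @ obs  # number of observation pairs per pair of age bins
print(pairs)
```

`DataFrame.cov` and `DataFrame.corr` use pairwise deletion by default, which matches the "number of pairs" bookkeeping in the table above.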

How to build models for correlation?

1. Mixed models (random effects)
- within-subject similarity due to sharing a trajectory

2. Serial correlation
- close in time implies strong similarity
- correlation decreases as time separation increases

Linear Mixed Model

- Regression model: the mean response as a function of covariates (systematic variation)
- Random effects: variation from subject to subject in trajectory (random between-subject variation)
- Within-subject variation: variation of individual observations over time (random within-subject variation)

[Figure: response trajectories over months for two subjects]

Levels of Analysis

We first consider the distribution of measurements within subjects:

  Y_ij = β_{0,i} + β_{1,i} t_ij + e_ij,    e_ij ~ N(0, σ²)

  E[Y_ij | X_i, β_i] = β_{0,i} + β_{1,i} t_ij
                     = [1, time_ij] (β_{0,i}, β_{1,i})^T

or, stacking a subject's observations, E[Y_i | X_i, β_i] = X_i β_i.

Levels of Analysis

We can equivalently separate the subject-specific regression coefficients into the average coefficient and the specific departure for subject i:

  β_{0,i} = β_0 + b_{0,i}
  β_{1,i} = β_1 + b_{1,i}

This allows another perspective:

  Y_ij = β_{0,i} + β_{1,i} t_ij + e_ij
       = (β_0 + β_1 t_ij) + (b_{0,i} + b_{1,i} t_ij) + e_ij

  E[Y_i | X_i, β_i] = X_i β + X_i b_i

where X_i β is the mean model and X_i b_i is the between-subject deviation.
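This decomposition is easy to simulate from, which makes the two levels concrete. A sketch with illustrative parameter values (not estimates from the FEV data):

```python
import numpy as np

# Sketch: simulate Y_ij = (beta0 + beta1 t_ij) + (b_{0,i} + b_{1,i} t_ij) + e_ij.
# Parameter values are illustrative assumptions.
rng = np.random.default_rng(1)
beta0, beta1 = 2.0, 0.3                 # population-average intercept and slope
D = np.array([[0.5, 0.05],
              [0.05, 0.01]])            # covariance of (b_{0,i}, b_{1,i})
sigma = 0.3                             # within-subject SD
t = np.arange(8.0, 17.0, 2.0)           # hypothetical ages 8, 10, 12, 14, 16

n = 200
b = rng.multivariate_normal([0.0, 0.0], D, size=n)   # subject-specific deviations
e = rng.normal(0.0, sigma, size=(n, len(t)))         # within-subject errors
Y = (beta0 + beta1 * t) + (b[:, [0]] + b[:, [1]] * t) + e
```

Averaging the simulated trajectories across subjects recovers the mean line β_0 + β_1 t, while individual rows of Y scatter around it according to D and σ².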

[Figure: sample of subject-specific FEV lines plotted against age (years)]

[Figure: intercept deviations (b_{0,i}) versus slope deviations (b_{1,i})]

Levels of Analysis

Next we consider the distribution of patterns (parameters) among subjects:

  β_i ~ N(β, D),   or equivalently   b_i ~ N(0, D)

  Y_i = X_i β + X_i b_i + e_i

where X_i β is the mean model, X_i b_i is the between-subject variation, and e_i is the within-subject variation.

[Figure: four panels of simulated lines β_0 + β_1·time — fixed intercept/fixed slope; random intercept/fixed slope; fixed intercept/random slope; random intercept/random slope]

Between-subject Variation

We can use the idea of random effects to allow different types of between-subject heterogeneity. The magnitude of heterogeneity is characterized by D:

  b_i = (b_{0,i}, b_{1,i})^T

  var(b_i) = D = [ D_00  D_01 ]
                 [ D_01  D_11 ]

Between-subject Variation

The components of D can be interpreted as:

D_00: the typical subject-to-subject deviation in the overall level of the response.

D_11: the typical subject-to-subject deviation in the change (time slope) of the response.

D_01: the covariance between individual intercepts and slopes. If positive, then subjects with high levels also have high rates of change; if negative, then subjects with high levels have low rates of change.

[Figure (repeated): four panels of simulated lines — fixed intercept/fixed slope; random intercept/fixed slope; fixed intercept/random slope; random intercept/random slope]

Between-subject Variation: Examples

No random effects:

  Y_ij = β_0 + β_1 t_ij + e_ij
       = [1, time_ij] β + e_ij

Random intercepts:

  Y_ij = (β_0 + β_1 t_ij) + b_{0,i} + e_ij
       = [1, time_ij] β + [1] b_{0,i} + e_ij

Random intercepts and slopes:

  Y_ij = (β_0 + β_1 t_ij) + b_{0,i} + b_{1,i} t_ij + e_ij
       = [1, time_ij] β + [1, time_ij] b_i + e_ij

Mixed Models and Covariances/Correlation

Q: What is the correlation between outcomes Y_ij and Y_ik under these random effects models?

Random Intercept Model

  Y_ij = β_0 + β_1 t_ij + b_{0,i} + e_ij
  Y_ik = β_0 + β_1 t_ik + b_{0,i} + e_ik

  var(Y_ij) = var(b_{0,i}) + var(e_ij) = D_00 + σ²

  cov(Y_ij, Y_ik) = cov(b_{0,i} + e_ij, b_{0,i} + e_ik) = D_00

Mixed Models and Covariances/Correlation

Random Intercept Model

  corr(Y_ij, Y_ik) = D_00 / [√(D_00 + σ²) · √(D_00 + σ²)]
                   = D_00 / (D_00 + σ²)
                   = between var / (between var + within var)

Therefore, any two outcomes have the same correlation. It doesn't depend on the specific times, nor on the distance between the measurements.

This is the exchangeable correlation model.

Assuming: var(e_ij) = σ², and cov(e_ij, e_ik) = 0.
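The exchangeable correlation D_00 / (D_00 + σ²) can be checked by simulation. A sketch with illustrative values (the mean structure is omitted since it does not affect the correlation):

```python
import numpy as np

# Sketch: under the random intercept model, corr(Y_ij, Y_ik) = D00 / (D00 + sigma2)
# for every pair of times. Verify by simulation with illustrative values.
rng = np.random.default_rng(2)
D00, sigma2 = 4.0, 1.0
rho_theory = D00 / (D00 + sigma2)        # exchangeable correlation = 0.8

n, m = 20000, 4                          # subjects, times per subject
b0 = rng.normal(0.0, np.sqrt(D00), size=(n, 1))              # shared intercept
Y = b0 + rng.normal(0.0, np.sqrt(sigma2), size=(n, m))       # + within-subject noise
rho_hat = np.corrcoef(Y[:, 0], Y[:, 3])[0, 1]                # any pair gives ~0.8
```

Picking any other pair of columns gives essentially the same estimate, which is exactly the exchangeable property.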

Mixed Models and Covariances/Correlation

Random Intercept and Slope Model

  Y_ij = (β_0 + β_1 t_ij) + (b_{0,i} + b_{1,i} t_ij) + e_ij
  Y_ik = (β_0 + β_1 t_ik) + (b_{0,i} + b_{1,i} t_ik) + e_ik

  var(Y_ij) = var(b_{0,i} + b_{1,i} t_ij) + var(e_ij)
            = D_00 + 2 D_01 t_ij + D_11 t_ij² + σ²

  cov(Y_ij, Y_ik) = cov[(b_{0,i} + b_{1,i} t_ij + e_ij), (b_{0,i} + b_{1,i} t_ik + e_ik)]
                  = D_00 + D_01 (t_ij + t_ik) + D_11 t_ij t_ik

Mixed Models and Covariances/Correlation

Random Intercept and Slope Model

  ρ_ijk = corr(Y_ij, Y_ik)
        = [D_00 + D_01 (t_ij + t_ik) + D_11 t_ij t_ik] /
          √[(D_00 + 2 D_01 t_ij + D_11 t_ij² + σ²)(D_00 + 2 D_01 t_ik + D_11 t_ik² + σ²)]

Therefore, two outcomes may not have the same correlation. The correlation depends on the specific times for the observations, and does not have a simple form.

Assuming: var(e_ij) = σ², and cov(e_ij, e_ik) = 0.
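Evaluating ρ_ijk at a few time pairs shows that, unlike the exchangeable model, the correlation changes with the times even when the gap between them is fixed. A sketch with illustrative parameter values:

```python
import numpy as np

# Sketch: correlation under the random intercept + slope model depends on the
# specific pair of times, not just their separation. Values are illustrative.
D00, D01, D11, sigma2 = 1.0, 0.1, 0.05, 0.5

def var_y(t):
    # var(Y_ij) = D00 + 2*D01*t + D11*t^2 + sigma2
    return D00 + 2 * D01 * t + D11 * t**2 + sigma2

def corr_y(tj, tk):
    # rho_ijk = cov / sqrt(var_j * var_k)
    cov = D00 + D01 * (tj + tk) + D11 * tj * tk
    return cov / np.sqrt(var_y(tj) * var_y(tk))

# Two pairs with the same 2-year separation give different correlations:
r1 = corr_y(8.0, 10.0)
r2 = corr_y(14.0, 16.0)
```

Here r1 and r2 differ even though both pairs are two time units apart, illustrating why this model is not exchangeable.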

Linear Mixed Model

Regression model: the mean response as a function of covariates (systematic variation).

Random effects: variation from subject to subject in trajectory (random between-subject variation).

Within-subject variation: variation of individual observations over time (random within-subject variation).

Covariance Models: Serial Models

Linear mixed models assume that each subject follows his/her own line.

In some situations the dependence is more "local", meaning that observations close in time are more similar than those far apart in time.

Covariance Models

Define:

  e_i1 ~ N(0, σ²)
  e_ij = ρ e_{i,j−1} + ɛ_ij,    ɛ_ij ~ N(0, σ² (1 − ρ²))

This leads to autocorrelated errors:

  cov(e_ij, e_ik) = σ² ρ^|j−k|

[Figure: simulated error series y versus months for 20 subjects (ID = 1–20) with serial correlation ρ = 0.5]

[Figure: simulated error series y versus months for 20 subjects (ID = 1–20) with serial correlation ρ = 0.9]

[Figure: response over months for two subjects, showing each subject's line, the average line, and one subject's dropout]

Linear Mixed Model

Regression model: the mean response as a function of covariates (systematic variation).

Random effects: variation from subject to subject in trajectory (random between-subject variation).

Within-subject variation: variation of individual observations over time (random within-subject variation).