Lecture 27: Introduction to Correlated Binary Data


Dipankar Bandyopadhyay, Ph.D.
BMTRY 711: Analysis of Categorical Data, Spring 2011
Division of Biostatistics and Epidemiology
Medical University of South Carolina

Introduction

Previously, we discussed correlated data in the context of a matched-pair design. Correlated data are common in:

1. Longitudinal studies: the same subject is followed over time.
2. Cluster randomized trials: treatment is assigned to groups, and the responses of members within a group tend to be correlated.
3. Family studies: individuals within a family are more similar (genetically and environmentally).

While we could have spent a semester on this topic alone, we will move briefly through a methods-driven examination.

Motivating Data

- We will look at the Six Cities study of the health effects of air pollution (Ware et al. 1984).
- The data analyzed are the 16 selected cases in Lipsitz, Fitzmaurice, et al. (1994).
- The binary response is the wheezing status of 16 children at ages 9, 10, 11, and 12 years.
- The probability of wheezing at each age is modeled by logistic regression, using the explanatory variables city of residence, age, and maternal smoking status at that age.

General summary points to consider

- Data between subjects are assumed to be independent.
- Data within a subject are assumed to be dependent.
- The dependence is modeled as a covariance (or correlation) pattern.

Covariance Patterns

You will learn more about covariance pattern models in Multivariate Analysis; for now, consider the following structures:

- Compound symmetry (or exchangeable): the correlation is the same for all pairs of outcomes within a subject, i.e., $\mathrm{corr}(y_{ij}, y_{ik}) = \rho$ for $j \neq k$, and $\mathrm{corr}(y_{ij}, y_{i'k}) = 0$ for $i \neq i'$.
- Unstructured: $\mathrm{corr}(y_{ij}, y_{ik}) = \rho_{jk}$.

The compound symmetry model is a good place to begin:

- You estimate the fewest correlation parameters.
- Compound symmetry is often used in sample size calculations.

In our example, the four binary responses (one for each age) for an individual child are assumed to be equally correlated, implying an exchangeable correlation structure.
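To make the two structures concrete, here is a small Python sketch that builds both working correlation matrices for the $t_i = 4$ ages in our example. The function names and the value $\rho = 0.3$ are illustrative assumptions, not estimates from the lecture.

```python
def exchangeable(t, rho):
    """t x t correlation matrix with a common off-diagonal value rho."""
    return [[1.0 if j == k else rho for k in range(t)] for j in range(t)]


def unstructured(t, rhos):
    """t x t correlation matrix with a separate rho_jk for each pair j < k.

    rhos maps (j, k) with j < k to the correlation for that pair.
    """
    return [
        [1.0 if j == k else rhos[(min(j, k), max(j, k))] for k in range(t)]
        for j in range(t)
    ]


def n_unstructured_params(t):
    """Number of distinct correlations in an unstructured t x t matrix."""
    return t * (t - 1) // 2


T = 4          # four ages: 9, 10, 11, 12
RHO = 0.3      # hypothetical value, for illustration only

R_exch = exchangeable(T, RHO)

# Hypothetical pairwise values: every pair gets its own parameter.
pair_rhos = {(j, k): 0.1 * (j + k + 1) for j in range(T) for k in range(j + 1, T)}
R_un = unstructured(T, pair_rhos)

for row in R_exch:
    print(row)
print("exchangeable parameters: 1")
print("unstructured parameters:", n_unstructured_params(T))
```

With four time points, the exchangeable structure uses one correlation parameter while the unstructured one uses six, which is why compound symmetry is the natural starting point for a data set with only 16 clusters.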

Generalized Estimating Equations (GEE)

- GEE is a generalization of the GLM.
- GEE differs from a GLM in that the distribution of the outcome is not completely specified.
- GEE is known as a marginal model. A marginal model is appropriate when inference on group effects (population effects) is of interest. Group effects may include a "treatment" effect.
- Solutions are obtained from the estimating equations (also known as score equations), which for outcomes in the exponential family can be written as

  $$S(\beta) = \sum_{i=1}^{n} \frac{\partial E(Y_i \mid X_i)}{\partial \beta} \left[ \frac{Y_i - E(Y_i \mid X_i)}{\mathrm{Var}(Y_i \mid X_i)} \right]$$

  The estimate $\hat{\beta}$ is the value that solves $S(\hat{\beta}) = 0$.
- The estimation of the variance-covariance matrix is more complicated, and the solutions are obtained iteratively.
- For the purpose of this class, let's rely on GENMOD for the calculations.
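As a rough illustration of what "solving the score equations iteratively" means, the following Python sketch evaluates the score for the simplest possible case: an intercept-only logistic model with independent observations. This is an assumption made to keep the example short; it is not the full GEE with a working correlation, and it is not how GENMOD is implemented. The 19 events out of 64 observations are taken from the wheeze data; the function names are hypothetical.

```python
import math


def expit(eta):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def score(beta0, ys):
    """Score S(beta0) = sum of dmu/dbeta * (y - mu) / Var(y).

    For the logit link, dmu/dbeta = mu(1 - mu) and Var(y) = mu(1 - mu),
    so each term reduces to (y - mu).
    """
    mu = expit(beta0)
    dmu = mu * (1.0 - mu)
    var = mu * (1.0 - mu)
    return sum(dmu * (y - mu) / var for y in ys)


def solve_score(ys, beta0=0.0, tol=1e-10, max_iter=50):
    """Fisher scoring: repeat beta <- beta + S(beta) / I(beta) until stable."""
    for _ in range(max_iter):
        mu = expit(beta0)
        info = len(ys) * mu * (1.0 - mu)  # expected information
        step = score(beta0, ys) / info
        beta0 += step
        if abs(step) < tol:
            break
    return beta0


# 19 wheezing events among 64 child-age observations (treated as independent here)
ys = [1] * 19 + [0] * 45
beta_hat = solve_score(ys)

print("beta_hat      :", round(beta_hat, 4))
print("fitted P(Y=1) :", round(expit(beta_hat), 4))
print("score at root :", score(beta_hat, ys))
```

In this special case the root has a closed form, $\hat{\beta}_0 = \log(19/45)$, so the iteration can be checked by hand; with covariates and a working correlation matrix no closed form exists, which is why software is needed.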

Notation

Before we develop our model, let's formalize some more notation.

- $i = 1, \dots, N$ subjects
- $j = 1, \dots, t_i$ observations (I like $t$ to represent time; others may use $n_i$)
- $Y_i = [y_{i1}, y_{i2}, \dots, y_{it_i}]^\top$ is the $t_i \times 1$ vector of responses for subject $i$
- $y_{i1}$ is the response for subject $i$ at time 1, $y_{i2}$ is the response for subject $i$ at time 2, etc.
- Recall that $t_i$ is the maximum observed time for subject $i$; this does not have to be the same for all subjects.
- For our example, $t_i = 4$ for all $i$.

Covariate notation

- $x_{ij} = [x_{ij1}, x_{ij2}, \dots, x_{ijp}]^\top$ is the $p \times 1$ covariate vector for subject $i$ at time $j$
- $X_i = [x_{i1}, x_{i2}, \dots, x_{it_i}]^\top$ is the $t_i \times p$ matrix of covariates for subject $i$
- $\beta$ is the $p \times 1$ vector of true population parameters

NOTES:

- Covariates are typically either static (the same at all time measurements, e.g., gender, race) or time-dependent (changing over time, e.g., drug dose, smoking status).
- This notation accommodates both kinds of covariates.

Our data

    data six;
      input case city $ @@;          /* time-independent covariates */
      do i = 1 to 4;                 /* "time" */
        input age smoke wheeze @@;   /* time-dependent covariates */
        output;
      end;
    datalines;
    1 portage 9 0 1 10 0 1 11 0 1 12 0 0
    2 kingston 9 1 1 10 2 1 11 2 0 12 2 0
    3 kingston 9 0 1 10 0 0 11 1 0 12 1 0
    4 portage 9 0 0 10 0 1 11 0 1 12 1 0
    5 kingston 9 0 0 10 1 0 11 1 0 12 1 0
    6 portage 9 0 0 10 1 0 11 1 0 12 1 0
    7 kingston 9 1 0 10 1 0 11 0 0 12 0 0
    8 portage 9 1 0 10 1 0 11 1 0 12 2 0
    9 portage 9 2 1 10 2 0 11 1 0 12 1 0
    10 kingston 9 0 0 10 0 0 11 0 0 12 1 0
    11 kingston 9 1 1 10 0 0 11 0 1 12 0 1
    12 portage 9 1 0 10 0 0 11 0 0 12 0 0
    13 kingston 9 1 0 10 0 1 11 1 1 12 1 1
    14 portage 9 1 0 10 2 0 11 1 0 12 2 1
    15 kingston 9 1 0 10 1 0 11 1 0 12 2 1
    16 portage 9 1 1 10 1 1 11 2 0 12 1 0
    ;
    run;
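The DATA step reads one wide record per child and writes four long records, one per age. For readers who want to check the reshaping outside SAS, a rough Python equivalent might look like the following; the names `RAW` and `to_long` are illustrative and not part of the lecture.

```python
RAW = """\
1 portage 9 0 1 10 0 1 11 0 1 12 0 0
2 kingston 9 1 1 10 2 1 11 2 0 12 2 0
3 kingston 9 0 1 10 0 0 11 1 0 12 1 0
4 portage 9 0 0 10 0 1 11 0 1 12 1 0
5 kingston 9 0 0 10 1 0 11 1 0 12 1 0
6 portage 9 0 0 10 1 0 11 1 0 12 1 0
7 kingston 9 1 0 10 1 0 11 0 0 12 0 0
8 portage 9 1 0 10 1 0 11 1 0 12 2 0
9 portage 9 2 1 10 2 0 11 1 0 12 1 0
10 kingston 9 0 0 10 0 0 11 0 0 12 1 0
11 kingston 9 1 1 10 0 0 11 0 1 12 0 1
12 portage 9 1 0 10 0 0 11 0 0 12 0 0
13 kingston 9 1 0 10 0 1 11 1 1 12 1 1
14 portage 9 1 0 10 2 0 11 1 0 12 2 1
15 kingston 9 1 0 10 1 0 11 1 0 12 2 1
16 portage 9 1 1 10 1 1 11 2 0 12 1 0
"""


def to_long(raw):
    """Turn each wide line (case, city, 4 x (age, smoke, wheeze)) into 4 rows."""
    rows = []
    for line in raw.strip().splitlines():
        fields = line.split()
        case, city = int(fields[0]), fields[1]      # time-independent covariates
        rest = [int(v) for v in fields[2:]]
        for k in range(0, len(rest), 3):            # one triple per age
            age, smoke, wheeze = rest[k:k + 3]      # time-dependent values
            rows.append(
                {"case": case, "city": city, "age": age,
                 "smoke": smoke, "wheeze": wheeze}
            )
    return rows


rows = to_long(RAW)
n_obs = len(rows)
n_events = sum(r["wheeze"] for r in rows)
n_subjects = len({r["case"] for r in rows})
print(n_obs, n_events, n_subjects)  # prints: 64 19 16
```

The counts agree with the GENMOD output for this data set: 64 observations and 19 events, coming from only 16 children.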

Ignoring Clustering

- Here, we have 4 observations per individual.
- What happens if we assume we have 64 independent observations (4 outcomes for each of 16 children)?

Here is the code:

    proc genmod data=six desc;
      class case city;
      model wheeze = city age smoke / dist=bin;
    run;

Selected Results

    Model Information
    Data Set              WORK.SIX
    Distribution          Binomial
    Link Function         Logit
    Dependent Variable    wheeze

    Number of Observations Read    64
    Number of Observations Used    64
    Number of Events               19
    Number of Trials               64

Note: the output is saying our N is 64, although we only have 16 participants.

    Analysis Of Parameter Estimates

                                     Standard
    Parameter          DF  Estimate     Error   Pr > ChiSq
    Intercept           1    1.2597    2.6104       0.6294
    city kingston       1    0.1391    0.5527       0.8013
    city portage        0    0.0000    0.0000        .
    age                 1   -0.2003    0.2508       0.4245
    smoke               1   -0.1284    0.4102       0.7544

Repeated Statement

- We need to tell SAS that we have correlated data (or repeated observations).
- We do this by using the REPEATED statement:

    proc genmod data=six desc;
      class case city;
      model wheeze = city age smoke / dist=bin;
      repeated subject=case / type=exch;
    run;

Selected Results

You get the same model-based information:

    Data Set              WORK.SIX
    Distribution          Binomial
    Link Function         Logit
    Dependent Variable    wheeze

    Number of Observations Read    64
    Number of Observations Used    64
    Number of Events               19
    Number of Trials               64

So you have to go to the end of the report for the GEE summary.

Selected Results

    GEE Model Information
    Correlation Structure           Exchangeable
    Subject Effect                  case (16 levels)
    Number of Clusters              16
    Correlation Matrix Dimension    4
    Maximum Cluster Size            4
    Minimum Cluster Size            4

    Analysis Of GEE Parameter Estimates
    Empirical Standard Error Estimates

                               Standard     95% Confidence
    Parameter        Estimate     Error         Limits              Z   Pr > |Z|
    Intercept          1.2751    3.0561   -4.7148    7.2650      0.42     0.6765
    city kingston      0.1223    0.6882   -1.2266    1.4713      0.18     0.8589
    city portage       0.0000    0.0000    0.0000    0.0000       .        .
    age               -0.2036    0.2789   -0.7502    0.3431     -0.73     0.4655
    smoke             -0.0935    0.3613   -0.8016    0.6145     -0.26     0.7957
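The Z statistics, confidence limits, and p-values in the GEE table are Wald quantities, so they can be reproduced (up to rounding of the printed estimates) from the estimate and empirical standard error alone. The Python sketch below does this as a check; the dictionary and function names are illustrative.

```python
from statistics import NormalDist

# (estimate, empirical standard error) copied from the GEE output
gee = {
    "Intercept": (1.2751, 3.0561),
    "city kingston": (0.1223, 0.6882),
    "age": (-0.2036, 0.2789),
    "smoke": (-0.0935, 0.3613),
}

std_normal = NormalDist()             # standard normal, mean 0 and sd 1
z_crit = std_normal.inv_cdf(0.975)    # about 1.96 for a 95% interval


def wald_summary(estimate, se):
    """Return (z, lower, upper, two-sided p) for a Wald test of beta = 0."""
    z = estimate / se
    lower = estimate - z_crit * se
    upper = estimate + z_crit * se
    p = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    return z, lower, upper, p


results = {name: wald_summary(est, se) for name, (est, se) in gee.items()}

for name, (z, lower, upper, p) in results.items():
    print(f"{name:14s} z={z:6.2f}  95% CI=({lower:8.4f}, {upper:8.4f})  p={p:.4f}")
```

Small differences in the fourth decimal place are expected, because SAS computes these quantities from unrounded estimates and standard errors.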

Comparison of Estimates

                        Regular MLE              GEE
                        (64 indep obs)           (16 clusters of 4)
                                   Standard                 Standard
    Parameter          Estimate       Error     Estimate       Error
    Intercept            1.2597      2.6104       1.2751      3.0561
    city kingston        0.1391      0.5527       0.1223      0.6882
    city portage         0.0000      0.0000       0.0000      0.0000
    age                 -0.2003      0.2508      -0.2036      0.2789
    smoke               -0.1284      0.4102      -0.0935      0.3613

Conclusion: the parameter estimates are approximately equal; the standard errors are wrong under regular MLE.

However, for this example, the effect isn't that dramatic (see Kleinbaum & Klein, Ch. 11, for a more dramatic example).
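One quick way to read the comparison is to look at the ratio of the GEE standard error to the regular MLE standard error for each parameter. The short Python sketch below computes these ratios from the printed values; the variable names are illustrative.

```python
# Standard errors copied from the comparison table
mle_se = {"Intercept": 2.6104, "city kingston": 0.5527, "age": 0.2508, "smoke": 0.4102}
gee_se = {"Intercept": 3.0561, "city kingston": 0.6882, "age": 0.2789, "smoke": 0.3613}

# Ratio > 1: ignoring clustering understated the standard error.
# Ratio < 1: ignoring clustering overstated it.
ratio = {name: gee_se[name] / mle_se[name] for name in mle_se}

for name, r in ratio.items():
    direction = "understated" if r > 1 else "overstated"
    print(f"{name:14s} GEE/MLE SE ratio = {r:.3f}  (regular MLE {direction} the SE)")
```

For these data the ratios are roughly 1.17, 1.25, and 1.11 for the intercept, city, and age, but about 0.88 for smoke, so ignoring the clustering does not bias the standard errors in a single direction.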

Properties of GEE

GEE estimates have desirable asymptotic properties. For correctly specified models, as the number of clusters $K$ gets large, the estimates are

1. Consistent: $\hat{\beta} \to \beta$ as $K \to \infty$
2. Asymptotically normal: $\hat{\beta} \sim$ normal as $K \to \infty$

- "Correctly specified" means that the correct link and correlation have been specified.
- However, GEE is robust to misspecification of the correlation pattern.
- The closer the specified correlation is to the true correlation, the more efficient the estimates (smaller standard errors).

Model Testing

- Gone are the likelihood-based methods: we have not specified our likelihood and are using quasi-likelihood.
- Recall that in the GLM slides we formulated the score equations.
- With GEE, we are using score-like equations, since we have not fully specified the likelihood (we've only specified the variance, based on the binomial, and the correlation of the outcomes).
- We can compute score-like and Wald tests.

Other forms of correlated data

- Correlated data also arise in survey research that follows a cluster sampling approach: randomly select clusters (e.g., families) and survey the family members.
- This clustering needs to be taken into account in the analysis.
- You can use SUDAAN, GEE, or the newer SAS procedures (SURVEYFREQ, SURVEYLOGISTIC, etc.).