How to use SAS for Logistic Regression with Correlated Data



Similar documents
Fitting Nonlinear Mixed Models with the New NLMIXED Procedure Russell D. Wolfinger, SAS Institute Inc., Cary, NC

Introduction to mixed model and missing data issues in longitudinal studies

VI. Introduction to Logistic Regression

SAS Software to Fit the Generalized Linear Model

Lecture 19: Conditional Logistic Regression

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis

Lecture 14: GLM Estimation and Logistic Regression

Statistics Graduate Courses

ln(p/(1-p)) = α +β*age35plus, where p is the probability or odds of drinking

Nominal and ordinal logistic regression

Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses

SPPH 501 Analysis of Longitudinal & Correlated Data September, 2012

MATH5885 LONGITUDINAL DATA ANALYSIS

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Chapter 29 The GENMOD Procedure. Chapter Table of Contents

Linear Classification. Volker Tresp Summer 2015

LOGIT AND PROBIT ANALYSIS

Introduction to Fixed Effects Methods

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Sensitivity Analysis in Multiple Imputation for Missing Data

Handling attrition and non-response in longitudinal data

Multinomial and Ordinal Logistic Regression

Paper D Ranking Predictors in Logistic Regression. Doug Thompson, Assurant Health, Milwaukee, WI

Multiple Choice Models II

Overview. Longitudinal Data Variation and Correlation Different Approaches. Linear Mixed Models Generalized Linear Mixed Models

Model Fitting in PROC GENMOD Jean G. Orelien, Analytical Sciences, Inc.

Factorial experimental designs and generalized linear models

Design and Analysis of Phase III Clinical Trials

Problem of Missing Data

Chapter 39 The LOGISTIC Procedure. Chapter Table of Contents

Lecture 3: Linear methods for classification

15 Ordinal longitudinal data analysis

Are you looking for the right interactions? Statistically testing for interaction effects with dichotomous outcome variables

PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

Lecture 18: Logistic Regression Continued

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Two Correlated Proportions (McNemar Test)

Study Design Sample Size Calculation & Power Analysis. RCMAR/CHIME April 21, 2014 Honghu Liu, PhD Professor University of California Los Angeles

The Probit Link Function in Generalized Linear Models for Data Mining Applications

Analyzing Clinical Trial Data via the Bayesian Multiple Logistic Random Effects Model

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Imputing Missing Data using SAS

STATISTICAL MODELING OF LONGITUDINAL SURVEY DATA WITH BINARY OUTCOMES

10 Dichotomous or binary responses

Biostatistics Short Course Introduction to Longitudinal Studies

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

Study Design and Statistical Analysis

ESALQ Course on Models for Longitudinal and Incomplete Data

Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study)

SP10 From GLM to GLIMMIX-Which Model to Choose? Patricia B. Cerrito, University of Louisville, Louisville, KY

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

Generalized Linear Models

Including the Salesperson Effect in Purchasing Behavior Models Using PROC GLIMMIX

A.1 SAS EXAMPLES. Chapter 1: Introduction

Dealing with Missing Data

Survey Analysis: Options for Missing Data

III. INTRODUCTION TO LOGISTIC REGRESSION. a) Example: APACHE II Score and Mortality in Sepsis

Abbas S. Tavakoli, DrPH, MPH, ME 1 ; Nikki R. Wooten, PhD, LISW-CP 2,3, Jordan Brittingham, MSPH 4

A Basic Introduction to Missing Data

Logit Models for Binary Data

Model Selection and Claim Frequency for Workers Compensation Insurance

Multilevel Modeling of Complex Survey Data

Poisson Regression or Regression of Counts (& Rates)

Tips for surviving the analysis of survival data. Philip Twumasi-Ankrah, PhD

Missing data and net survival analysis Bernard Rachet

Statistical Machine Learning

11. Analysis of Case-control Studies Logistic Regression

Prevalence odds ratio or prevalence ratio in the analysis of cross sectional data: what is to be done?

What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling

Lesson 1: Comparison of Population Means Part c: Comparison of Two- Means

Lecture 25. December 19, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University.

STA 4273H: Statistical Machine Learning

A Bayesian hierarchical surrogate outcome model for multiple sclerosis

Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios

Basic Statistical and Modeling Procedures Using SAS

Multivariate Logistic Regression

MEU. INSTITUTE OF HEALTH SCIENCES COURSE SYLLABUS. Biostatistics

Analysis of Correlated Data. Patrick J. Heagerty PhD Department of Biostatistics University of Washington

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13

Prediction for Multilevel Models

Milk Data Analysis. 1. Objective Introduction to SAS PROC MIXED Analyzing protein milk data using STATA Refit protein milk data using PROC MIXED

Regression 3: Logistic Regression

Generalized Linear Mixed Models

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

BIO 226: APPLIED LONGITUDINAL ANALYSIS COURSE SYLLABUS. Spring 2015

Latent Class Regression Part II

Models for Longitudinal and Clustered Data

Least-Squares Intersection of Lines

Two Tools for the Analysis of Longitudinal Data: Motivations, Applications and Issues

Transcription:

How to use SAS for Logistic Regression with Correlated Data Oliver Kuss Institute of Medical Epidemiology, Biostatistics, and Informatics Medical Faculty, University of Halle-Wittenberg, Halle/Saale, Germany

Contents 1. Introduction 2. The Data 3. The Logistic Regression Model with Correlated Data 4. Methods of Estimation in SAS 5. Comparison of Methods 6. Conclusion 7. References SAS and all other SAS Institute Inc. Product or service names are registered trademarks or trademarks of SAS Institute Inc. In the USA and other countries. indicates USA registration.

1. Introduction Logistic regression is the standard analyzing tool for binary responses Reasons: - Ease of interpretation of parameters - Prognoses for the event of interest are possible - Software is available (LOGISTIC, GENMOD, PROBIT, CATMOD,...) Crucial assumption in standard logistic regression: Observations are independent

However, many study designs in applied sciences give rise to correlated data/responses: For example: - Subjects are followed over time and responses are assessed at different time points - Subjects are treated under different experimental conditions - Several responses are measured at the same subject - Subjects are observed in logical units (families, communities, clinics) Analysis for discrete responses is more complicated than for continuous responses

2. The Data Multicenter randomized controlled clinical trial, conducted in eight different clinics (Beitler/Landis, 1985, Wolfinger, 1999) Purpose of study: Assess the effect of a topical cream treatment on curing nonspecific infections. In each of the eight clinics, the number of treated and the number of successfully cured persons were recorded for treatment and control: data infection; input clinic treatment x n; datalines; 1 1 11 36 1 0 10 37... 8 1 4 6 8 0 6 7 run;

A crude analysis: Ignore the fact that the data were observed in different clinics, collapse the data in a single 2x2-table, and measure the treatment effect by the odds ratio: data infection2(drop=i); set infection; do i=1 to n; if i<= x then cure=1; if i > x then cure=0; status=2-cure; output; end; run; proc freq data=infection2; tables treatment*cure / relrisk; run;

(Partial) PROC FREQ Output: Statistics for Table of treatment by cure Estimates of the Relative Risk (Row1/Row2) Type of Study Value 95% Confidence Limits ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Case-Control (Odds Ratio) 1.4979 0.9151 2.4518 Moderate, non-significant (p=0.11) benefit for treatment. The odds for curing the infection is 50% higher in the treatment group. However, this ignores the effect of clinics completely. We might suspect that different features of the clinics (personnel, environment, typical population) might influence the treatment effect. This also implies a correlation of patients from the same clinic.

3. The Logistic Regression Model with Correlated Data There are two different groups of statistical models for binary responses that account for correlation in a different style and whose estimated parameters have different interpretations (Diggle/Liang/Zeger, 1994): Some Notation: Marginal Models and Random Effects Models Let Y ij, (i=1,..., n, j=1,..., n i ) denote whether patient j in clinic i was cured (Y ij = 1: yes, Y ij = 0: no) and x ij whether patient j in clinic i was in the treatment or in the control group (x ij = 1: treatment, x ij = 0: control) The response Y ij is assumed to follow a Bernoulli distribution with cure probability p ij.

Marginal Model In a marginal model the treatment effect is modelled separately from the within-clinic correlation Model equations: logit(p ij ) = b 0 + b treat x ij Var(Y ij ) = p ij (1- p ij ) Corr(Y ij,y ik ) = α - Interpretation of parameters is analogous to standard logistic regression - Correlation is regarded a nuisance parameter and is not estimated - Correlation is assumed to be constant between patients from the same clinic and identical within clinics - The interpretation does not depend on the single clinic but rather averages the treatment effect across clinics (? Population-averaged)

Random effects Model In a random effects model it is assumed that there is natural heterogeneity between clinics and that this heterogeneity can be modelled by a probabilistic distribution Model equation: with u i ~ N(0,? 2 ) logit(p ij u i ) = b 0 + b treat x ij + u i - To be more specific, this is actually a random intercept logistic regression model - The correlation of patients from the same clinic arises from their sharing specific but unobserved properties of the respective clinic. Given u i, the responses from the same clinic are independent. - By considering the intercept random but the treatment effect fixed, we assume that the treatment effect is identical across clinics, but there is a single individual baseline cure probability in each clinic (? Subject-specific)

4. Methods of Estimation in SAS 4.1 The GENMOD Procedure - The GENMOD procedure fits Generalized Linear Models (McCullagh/Nelder, 1989) - Since Version 6.12 it also allows the modelling of correlated data via the REPEATED-Statement - The implemented estimation procedure is GEE (Liang/Zeger, 1986) - It estimates a marginal model proc genmod data=infection2 descending order=data; class treatment clinic; model cure=treatment / d=bin link=logit; repeated subject=clinic / type=cs; estimate "treatment" treatment 1-1 / exp; run;

4.2 The %GLIMMIX Macro - The %GLIMMIX macro was written by Russ Wolfinger from SAS Institute and is available from the SAS homepage - It is designed for the analysis of Generalized Linear Mixed Models (GLMM) - Our random intercept logistic regression model is a GLMM - Several estimation methods are possible - Iteratively fits a linear mixed model to a pseudo response (Wolfinger/O Connell, 1993) %include "...\glmm800.sas"; %glimmix(data=infection2, stmts = %str(class clinic; model cure = treatment / solution cl; random clinic; parms (0) (1.0);), error=binomial,link=logit,procopt=order=data );run;

4.3 The %NLINMIX Macro - The %NLINMIX macro was also written by Russ Wolfinger from SAS Institute and is available from the SAS homepage - It is actually designed for the analysis of nonlinear mixed models but as our model is also a nonlinear model, we can use it (Wolfinger/Lin, 1997) - There were substantial changes between Version 6.12 and Version 8 - Several estimation methods are possible

%include "...\nlmm800.sas"; %nlinmix(data=infection2, model =%str( num = exp(b0 + b_treat*treatment + u); den = 1 + num; predv = num/den; ), parms =%str(b0=-0.7142 b_treat=0.404), derivs=%str( d_b0 = num /(den*den); d_b_treat = treatment*num /(den*den); d_u = num /(den*den); ), stmts = %str( class clinic; model pseudo_cure= d_b0 d_b_treat / noint solution cl; random d_u / subject=clinic cl solution; ), procopt=empirical, expand=eblup ); run;

4.4 The NLMIXED Procedure - The %GLIMMIX and the %NLINMIX use approximations to the likehood function and thus yield only approximate ML estimators - The NLMIXED procedure maximizes the likelihood directly by numerical integration methods (Gaussian Quadrature) and thus gives exact ML estiamtors - We also use the fact here that our model is a nonlinear mixed effect model proc nlmixed data=infection2; parms b0=-0.7142 b_treat=0.404 s2u=2; eta = b0 + b_treat*treatment + u; expeta = exp(eta); p = expeta/(1+expeta); model cure ~ binary(p); random u ~ normal(0,s2u) subject=clinic; run;

4.5 The PHREG/LOGISTIC Procedure - We can also use conditional ML estimation for a random effects model - This removes the random effect completely from the likelihood function - It turns out that the conditional likelihood function in our case is equivalent to that one in stratified (or matched) case-control studies and so we can use the PHREG procedure proc phreg data=infection2; model status*cure(0)=treatment / ties=discrete; strata clinic; run;

However, the PHREG procedure yields only asymptotic conditional ML estimators and we can use the LOGISTIC procedure for an exact conditional analysis (Derr, 2000) proc logistic data=infection2 descending exactonly; class clinic / param=ref; model cure=clinic treatment; exact treatment / estimate=both; run;

4.6 Other methods There are still some other estimation methods that could be used: - Meta-Analysis (MIXED procedure) - Nonparametric ML analysis - MCMC (Stochastic integration, Implemented very poorly in SAS )

5. Comparison of Methods Back to the data set: Method OR [95%-CI] PROC FREQ 1.498 [0.915; 2.452] PROC GENMOD 1.740 [1.102; 2.747] %GLIMMIX 2.069 [1.176; 3.641] %NLINMIX 2.021 [1.076; 3.794] PROC NLMIXED 2.093 [1.162; 3.771] PROC PHREG 2.130 [1.177; 3.855] PROC LOGISTIC 2.130 [1.137; 4.079]

6. Conclusion - The SAS System offers a large number of options for estimating logistic regression models with correlated data - It is difficult to give general recommendations which of the methods to use because this depends on (a) the data at hand and (b) on the desired interpretation of parameters (population-averaged vs. subject-specific) - In our data set we feel most comfortable with the results from the NLMIXED procedure and from the conditional ML analysis.

7. References - Beitler, P.J., Landis, J.R. (1985), A Mixed-effects Model for Categorical Data, Biometrics, 41, 991-1000. - Derr, R.E. (2000), Performing Exact Logistic Regression with the SAS System, Proceedings of the 25 th Annual SAS Users Group International Conference (SUGI 25), 254-25. - Diggle, P.J., Liang, K.-Y., Zeger, S.L. (1994), Analysis of Longitudinal Data, Oxford University Press, Oxford. - Liang, K.-Y., Zeger, S.L. (1986), Longitudinal Data Analysis Using Generalized Linear Models, Biometrika, 73, 13-22. - McCullagh, P., Nelder J.A. (1989), Generalized Linear Models, Chapman and Hall, New York. - Wolfinger, R.D., O Connell, M. (1993), Generalized linear models: a pseudo-likelihood approch, Journal of Statistical Computation and Simulation, 48, 233-243. - Wolfinger, R.D., Lin, X. (1997), Two Taylor-series approximation methods for nonlinear mixed models, Computational Statistics & Data Analysis, 25, 465-490. - Wolfinger, R.D. (1999), Fitting Nonlinear Mixed Models with the new NLMIXED Procedure, Proceedings of the 24 th Annual SAS Users Group International Conference (SUGI 24), 287-24.