Lecture 18: Logistic Regression Continued




Lecture 18: Logistic Regression Continued
Dipankar Bandyopadhyay, Ph.D.
BMTRY 711: Analysis of Categorical Data, Spring 2011
Division of Biostatistics and Epidemiology
Medical University of South Carolina

Lecture 18: Logistic Regression Continued p. 1/104

Maximum Likelihood Estimation for Logistic Regression

Consider the general logistic regression model

logit(p_i) = β_0 + β_1 x_{i1} + ... + β_K x_{iK},   with Y_i ~ Bern(p_i), i = 1, ..., n.

The likelihood is

L(β_0, β_1, ..., β_K) = Π_{i=1}^n p_i^{y_i} (1 − p_i)^{(1 − y_i)}.

Then, we maximize log L(β_0, β_1, ..., β_K) to get the MLEs. The log-likelihood is

log[L(β_0, β_1, ..., β_K)] = β_0 Σ_{i=1}^n y_i + β_1 Σ_{i=1}^n x_{i1} y_i + ... + β_K Σ_{i=1}^n x_{iK} y_i − Σ_{i=1}^n log(1 + e^{β_0 + β_1 x_{i1} + ... + β_K x_{iK}}).

Lecture 18: Logistic Regression Continued p. 2/104
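As a bridge to the iterative fitting discussed on the next slide, the following standard score equations (not shown on the original slide) are what the algorithm solves; they follow by differentiating the log-likelihood above with respect to each β_k (taking x_{i0} = 1):

∂ log L / ∂β_k = Σ_{i=1}^n x_{ik} (y_i − p_i) = 0,   k = 0, 1, ..., K.

Because p_i depends on β through the logistic function, these equations have no closed-form solution and are solved iteratively (e.g., by Newton-Raphson or Fisher scoring).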

With some data, it is possible that the data is so sparse (most people respond 0 or most people respond 1) that there is no solution, but if there is a solution, it is unique and the maximum. In practice, if there is no solution, your logistic regression package will say something like "Convergence not reached after 25 iterations." However, if the iterative approach converges, then the MLE is obtained. In the next lectures, we will discuss other methods that may be more appropriate with sparse data (conditional logistic regression, exact methods).

Large Sample Distribution: In large samples, β̂ has approximate distribution

β̂ ~ N_{K+1}( β, [ Σ_{i=1}^n p_i (1 − p_i) x_i x_i' ]^{−1} ).

Lecture 18: Logistic Regression Continued p. 3/104

Sample Size Requirements for Approximate Normality of the MLEs

Asymptotics: The sample size needed to get approximately normal estimates using a general logistic regression model depends on the covariates and the true underlying probabilities, p_i. As a rule of thumb, for the parameters of a given model to be approximately normal, you would like

(total sample size n) / (# parameters in model) ≥ 15.

Thus, for a model like

p_i = e^{β_0 + β_1 x_{i1}} / (1 + e^{β_0 + β_1 x_{i1}}),

you could have n = 30 different values of x_i and get approximately normal estimates of (β_0, β_1), since 30/2 = 15 ≥ 15.

Lecture 18: Logistic Regression Continued p. 4/104
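As a quick illustration of this rule of thumb (an added check, using the esophageal cancer analysis presented later in this lecture): the final model there has 8 parameters (intercept, 5 age dummies, TOB, ALC) and n = 975 observations, so

n / (# parameters) = 975 / 8 ≈ 122 ≥ 15,

which comfortably satisfies the guideline.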

Confidence Intervals

For confidence intervals and test statistics, you can use

V̂ar(β̂) = [ Σ_{i=1}^n p̂_i (1 − p̂_i) x_i x_i' ]^{−1}.

In particular, V̂ar(β̂_k) is the (k+1, k+1) element of V̂ar(β̂) (since there is an intercept, it is not the k-th, but the (k+1)-st). Thus, 95% (large sample) confidence intervals are formed via

β̂_k ± 1.96 sqrt( V̂ar(β̂)_{(k+1),(k+1)} ).

Lecture 18: Logistic Regression Continued p. 5/104
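A worked example (for illustration; the numbers come from the esophageal cancer model fit later in this lecture, where β̂_ALC = 0.2548 with standard error 0.0265):

0.2548 ± 1.96 × 0.0265 = 0.2548 ± 0.0519 → (0.203, 0.307),

and exponentiating the endpoints gives an approximate 95% interval for the odds ratio per 10 g/day of alcohol: (e^{0.203}, e^{0.307}) = (1.22, 1.36).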

Test Statistics

For the logistic model, a test of interest is whether x_{ik} affects the probability of success, H_0: β_k = 0, or equivalently, whether the probability of success is independent of x_{ik}.

Wald Statistic

The most obvious test statistic is the Wald statistic:

Z = β̂_k / sqrt( V̂ar(β̂_k) ) ~ N(0, 1)

in large samples under the null.

Lecture 18: Logistic Regression Continued p. 6/104
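Worked example (numbers taken from the esophageal cancer output later in this lecture): for alcohol, Z = 0.2548 / 0.0265 ≈ 9.6, so Z² ≈ 92.5, which agrees (up to rounding of the displayed standard error) with the Wald chi-square of 92.8 that SAS reports for ALC.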

Likelihood Ratio Statistic

The likelihood ratio statistic is

G² = 2{ log[L(β̂_0, β̂_1, ..., β̂_K) | H_A] − log[L(β̃_0, β̃_1, ..., β̃_k = 0, ..., β̃_K) | H_0] }
   = 2 Σ_{i=1}^n [ y_i log(p̂_i / p̃_i) + (1 − y_i) log( (1 − p̂_i) / (1 − p̃_i) ) ] ~ χ²_1,

where p̂_i is the estimate under the alternative and p̃_i is the estimate under the null.

Lecture 18: Logistic Regression Continued p. 7/104

Score Test Statistic

The score statistic for H_0: β_k = 0 is based on

X² = [ Σ_{i=1}^n x_{ik} (y_i − p̃_i) ]² / V̂ar[ Σ_{i=1}^n x_{ik} (y_i − p̃_i) ].

In general logistic regression, this statistic does not have the simple form that it had earlier, mainly because we have to iterate to get the estimates of β under the null (and thus to get the p̃_i).

Lecture 18: Logistic Regression Continued p. 8/104

Goodness-of-Fit

When we discussed goodness-of-fit statistics earlier, we looked at the likelihood ratio statistic for the given model versus the saturated model. However, when the underlying data are Bernoulli, Y_i ~ Bern(p_i), a saturated model, i.e., a model in which we have a different parameter for each individual, is not informative, as we now show. In particular, since Bern(p_i) = Bin(1, p_i), our estimate of p_i for the saturated model is

p̂_i = Y_i / 1 = { 1 if Y_i = 1;  0 if Y_i = 0 },

which is either 0 or 1.

Lecture 18: Logistic Regression Continued p. 9/104

Now, the likelihood at the MLE for the saturated model is

Π_{i=1}^n p̂_i^{y_i} (1 − p̂_i)^{1 − y_i}.

For individual i, if y_i = 1, then p̂_i = 1 and

p̂_i^{y_i} (1 − p̂_i)^{1 − y_i} = 1^1 · 0^0 = 1.

Similarly, if y_i = 0, then p̂_i = 0 and

p̂_i^{y_i} (1 − p̂_i)^{1 − y_i} = 0^0 · 1^1 = 1.

Then, the likelihood at the MLE for the saturated model is

Π_{i=1}^n 1 = 1,

and the log-likelihood equals log(1) = 0.

Lecture 18: Logistic Regression Continued p. 10/104
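A one-line consequence worth noting (it follows directly from the slide above, though it is not on the original slide): because the saturated log-likelihood is 0 for ungrouped Bernoulli data, the deviance of any fitted model reduces to

D² = −2 log L(model) = −2 Σ_{i=1}^n [ y_i log p̂_i + (1 − y_i) log(1 − p̂_i) ],

so it depends only on the fitted probabilities, not on any grouping of the data.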

Suppose, even if the underlying data are coded as Bernoulli for your computer program, that, in reality, the underlying data actually arose as product binomial, i.e., many subjects have the same covariate vector x_i. In particular, suppose the underlying data are actually made up of J binomials, and there are n_j subjects with the same covariate vector x_j, j = 1, ..., J, with n = Σ_{j=1}^J n_j, so we have

Y_j ~ Bin(n_j, p_j).

Since the product Bernoulli and product binomial likelihoods are proportional (the same except for combinatorial terms not depending on β), we would get the same MLE and standard error estimates; however, for goodness-of-fit, we would use the product binomial likelihood. With x_j the covariate vector associated with (binomial) group j, we want to test the fit of the model

logit(p_j) = x_j' β

versus the saturated model (in which we estimate a different p_j for each j).

Lecture 18: Logistic Regression Continued p. 11/104

The Deviance

As with logistic models discussed earlier, the likelihood ratio statistic for a given model M_1 with estimates p̂_j versus a saturated model (in which the estimate of p_j is y_j / n_j) is often called the deviance, denoted by D²:

D²(M_1) = 2{ log[L | Sat] − log[L | M_1] }
        = 2 Σ_{j=1}^J [ y_j log( y_j / (n_j p̂_j) ) + (n_j − y_j) log( (n_j − y_j) / (n_j (1 − p̂_j)) ) ]
        = 2 Σ_{j=1}^J Σ_{k=1}^2 O_jk log( O_jk / E_jk ) ~ χ²_P

under the null, where E_j1 = n_j p̂_j, E_j2 = n_j (1 − p̂_j), and

P = (# parameters in saturated model) − (# parameters in M_1).

The deviance D² is often used as a measure of overall goodness-of-fit, and is a test statistic for the terms left out of the model.

Lecture 18: Logistic Regression Continued p. 12/104

Score Statistic

Again, sometimes the score statistic, or Pearson's chi-square, is used to look at the goodness-of-fit for a given model versus the saturated model:

X² = Σ_{j=1}^J [y_j − n_j p̂_j]² / ( n_j p̂_j (1 − p̂_j) )
   = Σ_{j=1}^J { [y_j − n_j p̂_j]² / (n_j p̂_j) + [(n_j − y_j) − n_j (1 − p̂_j)]² / (n_j (1 − p̂_j)) }
   = Σ_{j=1}^J Σ_{k=1}^2 (O_jk − E_jk)² / E_jk ~ χ²_P.

For the deviance D² and Pearson's chi-square (X²) to be asymptotically chi-square, so that they can be used as statistics to measure the fit of the model, you need each n_j to be large (formally, n_j → ∞).

Lecture 18: Logistic Regression Continued p. 13/104

Sample Size for Approximate Chi-square D²

For a simple (2 × 2) table, recall we said that, for the chi-square approximation for the deviance (likelihood ratio) and Pearson's chi-square to be valid, we should have 75% of the E_jk ≥ 5. In terms of the logistic regression model, this means that 75% of the strata should have

E_j1 = n_j p̂_j ≥ 5  and  E_j2 = n_j (1 − p̂_j) ≥ 5,

or, for stratum j,

E_j1 + E_j2 ≥ 5 + 5,  i.e.,  E_j1 + E_j2 ≥ 10.

Note, though, that

E_j1 + E_j2 = n_j p̂_j + n_j (1 − p̂_j) = n_j ≥ 10.

Lecture 18: Logistic Regression Continued p. 14/104

Thus, as a rough rule-of-thumb, for the chi-square approximation for the deviance and Pearson's chi-square to be valid, we should have 75% of the n_j ≥ 10. However, n_j is often small. For example: 1) the covariates may be continuous, so that n_j = 1; 2) the model may have a lot of covariates (so that very few individuals have the same pattern), and most individuals will have different x_j's. In these cases, the deviance and Pearson's chi-square will not be approximately chi-square. We will discuss other statistics that can be used in this situation.

Lecture 18: Logistic Regression Continued p. 15/104

Example: The Esophageal Cancer Data

There are six age levels, four levels of alcohol, and four levels of tobacco. In theory, there are 96 = 6 × 4 × 4 strata, but 8 strata have no observations, leaving J = 88 strata for model fitting. With 975 observations (200 cases and 775 controls), this means we have about 11 observations per stratum, which, on the surface, appears to be a good number. However, looking more closely, we see that only 31 of the 88 strata (35%) have n_j ≥ 10, so we may want to be a little careful when using D².

Also, with the saturated model having this many degrees of freedom (88), D² will have power against a lot of alternatives, but not necessarily high power. For example, if we left one marginally significant term out of the model (with 1 df), D² probably won't be able to detect a bad fit.

We will fit models to assess trends in cancer with alcohol and tobacco consumption, and their interaction. We will discuss modelling the data a little later.

Lecture 18: Logistic Regression Continued p. 16/104

GOF with Nested Models

For Bernoulli data, the parameter estimates are OK, but since each group j has n_j = 1, D² will not be asymptotically chi-square, and could give you a very deceptive idea of the fit. In this case, you could do as before: fit a broader model than the given model, but not the saturated model.

Lecture 18: Logistic Regression Continued p. 17/104

Likelihood Ratio Statistic for Nested Models

Suppose Y_i ~ Bern(p_i) = Bin(1, p_i). Sometimes you can look at a broader model than the one of interest to test for goodness-of-fit. In general, to test the fit of a given model, you can put interaction terms and/or square terms in the model and test their significance. For example, suppose you want to see if Model 1 fits:

Model 1:  p_i = e^{β_0 + β_1 x_i} / (1 + e^{β_0 + β_1 x_i}).

This model is nested in Model 2:

Model 2:  p_i = e^{β_0 + β_1 x_i + β_2 z_i} / (1 + e^{β_0 + β_1 x_i + β_2 z_i}).

We want to test H_0: β_2 = 0.

Lecture 18: Logistic Regression Continued p. 18/104

Recall, the deviance D² is sort of like a SUMS of SQUARES ERROR (error in the given model versus the saturated), and a smaller model will always have the same or more error than the bigger model. As with log-linear and logistic models discussed earlier, to test for significance of parameters in Model 2 versus Model 1, you can use

D²(M_2 | M_1) = D²(M_1) − D²(M_2)
             = 2{ log L(Sat) − log L(M_1) } − 2{ log L(Sat) − log L(M_2) }
             = 2{ log L(M_2) − log L(M_1) },

which is the change in D² for Model 2 versus Model 1, where log L(M) denotes the maximized log-likelihood under model M. If the smaller model fits, in large samples,

D²(M_2 | M_1) ~ χ²_M,

where M parameters are set to 0 in the smaller model.

Lecture 18: Logistic Regression Continued p. 19/104

Even though we need binomial data (n_j large) for D² to be approximately chi-square, the same is not true for the difference D²(M_2 | M_1). Each individual in the study could have a different x_i (i.e., n_j = 1), and D²(M_2 | M_1) will still be approximately a chi-square test statistic for β_2 = 0, since log L(Sat) subtracts out in the difference. Using differences in D²'s is just a trick to get the likelihood ratio statistic

2{ log L(M_2) − log L(M_1) };

in particular, the log-likelihood for the saturated model subtracts out, and does not affect the asymptotics.

Lecture 18: Logistic Regression Continued p. 20/104

Basically, regardless of the size of n_j, if you are comparing

Model 1:  p_i = e^{β_0 + β_1 x_i} / (1 + e^{β_0 + β_1 x_i})

versus

Model 2:  p_i = e^{β_0 + β_1 x_i + β_2 z_i} / (1 + e^{β_0 + β_1 x_i + β_2 z_i}),

a Taylor series approximation can be used to show that

D²(M_2 | M_1) ≈ [ β̂_2 / sqrt( V̂ar(β̂_2) ) ]² ~ χ²_1.

Lecture 18: Logistic Regression Continued p. 21/104

Using G²

As before, another popular statistic is G², which is the likelihood ratio test statistic for whether the parameters, except the intercept, are 0 (i.e., the significance of the parameters in the model). For G², the larger model always has the bigger G², since it has more parameters (sort of like SUMS of SQUARES REGRESSION). Again, to test for significance of parameters in Model 2 versus Model 1, you can use

G²(M_2 | M_1) = G²(M_2) − G²(M_1),

which is the change in G² for Model 2 versus Model 1. Thus, the likelihood ratio statistic for two nested models can be calculated using either G² or D².

Lecture 18: Logistic Regression Continued p. 22/104

Analysis of Esophageal Data

We treated AGE, TOBACCO, and ALCOHOL both as ordered (quantitative) and as not ordered (dummy variables). When treating age as ordered, we assigned the values

AGE = 30 if 25-34; 40 if 35-44; 50 if 45-54; 60 if 55-64; 70 if 65-74; 80 if 75+.

Tobacco is given in units of 10 g/day:

TOB = 0.5, 1.5, 2.5, or 4.0.

Lecture 18: Logistic Regression Continued p. 23/104

Alcohol is given in units of 10 g/day:

ALC = 2, 6, 10, or 15.

And, of course,

Y = 1 if CASE, 0 if CONTROL.

We also fit models with dummy variables for AGE (AGEGRP), TOBACCO (TOBGRP), and ALCOHOL (ALCGRP).

Lecture 18: Logistic Regression Continued p. 24/104

Summary of Model Fits

Since age is considered a possible confounding factor, and there is enough data, all models contain the 5 dummy variables for the main effect of AGE (AGEGRP), i.e., the basic model is

logit(p_i) = β_0 + Σ_{j=1}^5 α_j a_ij,

where the a_ij are 5 dummy variables for the six age levels. We then added TOBACCO (TOBGRP) and ALCOHOL (ALCGRP) to this basic model (next page). In some models, TOBACCO and ALCOHOL are treated as ordinal, and in others, non-ordinal. When looking at interactions, some models have AGE as ordinal in the interactions, although still non-ordered (dummy variables) for the main effects.

Lecture 18: Logistic Regression Continued p. 25/104

    COVARIATES FITTED                          Hypothesis Tested/
 #  (in addition to AGEGRP)      df   D^2     Interpretation
--- ---------------------------  ---  ------  --------------------------------------
 1  TOBGRP + ALCGRP               76   82.34  Non-ordered main effects
 2  TOBGRP + ALC                  78   87.51  Linear effect of alcohol
 3  TOBGRP + ALC + ALC^2          77   87.01  Linear & quad effects of alcohol
 4  ALCGRP + TOB                  78   84.53  Linear effect of tobacco
 5  ALCGRP + TOB + TOB^2          77   83.73  Linear & quad effects of tobacco
 6  ALC + TOB                     80   89.02  Linear effects of tobacco and alcohol

Lecture 18: Logistic Regression Continued p. 26/104

 7  ALC + TOB + ALC*TOB           79   88.05  Linear effects of tobacco and alcohol
                                              + alc/tob interaction
 8  ALCGRP + TOBGRP + ALC*TOB     75   81.37  Non-ordered main effects, but ordered
                                              alcohol + alc/tob interaction
 9  ALCGRP + TOBGRP + ALC*AGE     75   80.08  Non-ordered main effects, but ordered
                                              alcohol slope depends on age
10  ALCGRP + TOBGRP + TOB*AGE     75   82.33  Non-ordered main effects, but ordered
                                              tobacco slope depends on age

The p-values for these D^2 are all between .21 and .32

Lecture 18: Logistic Regression Continued p. 27/104

As stated, the sample size may be large enough to support approximate chi-square distributions for the D²'s if the given model fits. However, these D²'s will not necessarily have high power to detect where the model does not fit. None of these D²'s is significant, suggesting all of the models fit adequately.

Comparing models 1 and 2, there is some evidence that the increase in the log-odds ratio with alcohol may not be purely linear:

D²(TOBGRP + ALC) − D²(TOBGRP + ALCGRP) = 87.51 − 82.34 = 5.17,  df = 78 − 76 = 2,  p = .08.

This is because ALCGRP, which comprises 3 dummy variables for ALCOHOL, is equivalent to having ALC, ALC², and ALC³ in the model. Thus, this likelihood ratio statistic compares a model with ALC to one with (ALC, ALC², and ALC³).

Lecture 18: Logistic Regression Continued p. 28/104

However, when comparing models 2 and 3, which looks for a quadratic effect (ALC²), we find that ALC² is not significant:

D²(TOBGRP + ALC) − D²(TOBGRP + ALC + ALC²) = 87.51 − 87.01 = 0.50,  df = 78 − 77 = 1,  p = .48,

so the deviation from a straight line in alcohol detected in the first test may be due to chance.

Linearity of trend with tobacco also seems adequate when comparing Model 4 with Models 1 and 5 (p > .05 in both cases). Also, none of the models with interactions appears to add anything significant over the models with just the main effects of tobacco and alcohol (p > .05 in all cases).

Also, Model 6, which contains just linear effects for each of alcohol and tobacco, fits the data nearly as well as Model 1, which has 4 more parameters for these effects:

D²(TOB + ALC) − D²(TOBGRP + ALCGRP) = 89.02 − 82.34 = 6.68,  df = 80 − 76 = 4,  p = .154.

Lecture 18: Logistic Regression Continued p. 29/104

Final Model

The model we suggest is Model 6,

logit(p_i) = β_0 + Σ_{j=1}^5 α_j a_ij + β_TOB TOB_i + β_ALC ALC_i,

where the a_ij are 5 dummy variables for the six age levels, tobacco is given in units of 10 g/day (TOB = 0.5, 1.5, 2.5, or 4.0), and alcohol is given in units of 10 g/day (ALC = 2, 6, 10, or 15).

Lecture 18: Logistic Regression Continued p. 30/104

The estimates from the final model are (this is from Proc Logistic):

                           Standard    Wald
Parameter   DF  Estimate   Error       Chi-Square   Pr > ChiSq
Intercept    1   -2.6068   0.4322      36.3812      <.0001
age 25-34    1   -4.6680   1.1680      15.9720      <.0001
age 35-44    1   -2.8131   0.5539      25.7905      <.0001
age 45-54    1   -1.0542   0.4438       5.6411       0.0175
age 55-64    1   -0.5047   0.4292       1.3827       0.2396
age 65-74    1    0.0317   0.4383       0.0052       0.9424
tob          1    0.4094   0.0921      19.7734      <.0001
alc          1    0.2548   0.0265      92.8054      <.0001

Lecture 18: Logistic Regression Continued p. 31/104

We estimate that the odds of esophageal cancer increases by a factor of exp(ˆβ ALC ) = exp(.255) = 1.29 for every additional 10 grams of alcohol consumed per day. We estimate that the odds of esophageal cancer increases by a factor of exp(ˆβ TOB ) = exp(.409) = 1.51 for every additional 10 grams of tobacco consumed per day. Lecture 18: Logistic Regression Continued p. 32/104
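For illustration (an added computation using the tobacco estimates above), an approximate 95% confidence interval for the tobacco odds ratio is

exp(0.4094 ± 1.96 × 0.0921) = exp(0.229, 0.590) = (1.26, 1.80),

so the factor of 1.51 per additional 10 grams of tobacco per day is estimated with moderate precision.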

SAS Proc Logistic

data one;
  input age $ alcohol $ tobacco $ cases controls;
  if alcohol = '0-39'   then alc = 2;
  if alcohol = '40-79'  then alc = 6;
  if alcohol = '80-119' then alc = 10;
  if alcohol = '120+'   then alc = 15;
  if tobacco = '0-9'    then tob = .5;
  if tobacco = '10-19'  then tob = 1.5;
  if tobacco = '20-29'  then tob = 2.5;
  if tobacco = '30+'    then tob = 4.0;
  count = cases;    y = 1; output;
  count = controls; y = 0; output;
  drop cases controls;

Lecture 18: Logistic Regression Continued p. 33/104

cards;
25-34 0-39   0-9   0 40
25-34 0-39   10-19 0 10
25-34 0-39   20-29 0  6
25-34 0-39   30+   0  5
25-34 40-79  0-9   0 27
25-34 40-79  10-19 0  7
25-34 40-79  20-29 0  4
25-34 40-79  30+   0  7
25-34 80-119 0-9   0  2
25-34 80-119 10-19 0  1
<<see website for the data>>

Lecture 18: Logistic Regression Continued p. 34/104

proc print noobs;
  var age alc tob y count;
run;

/* PROC PRINT OUTPUT */

AGE    ALC  TOB  Y  COUNT
25-34    2  0.5  1      0
25-34    2  0.5  0     40
25-34    2  1.5  1      0
25-34    2  1.5  0     10
25-34    2  2.5  1      0
25-34    2  2.5  0      6
25-34    2  4.0  1      0
25-34    2  4.0  0      5
  MORE

proc logistic descending;
  class age (PARAM=ref);
  model y = age tob alc / aggregate scale=d   /* specify for deviance */ ;
  freq count;
run;

Lecture 18: Logistic Regression Continued p. 35/104

/* SELECTED OUTPUT */

Deviance and Pearson Goodness-of-Fit Statistics

Criterion   DF   Value     Value/DF   Pr > ChiSq
Deviance    80   89.0166   1.1127     0.2297
Pearson     80   91.6997   1.1462     0.1748

Analysis of Maximum Likelihood Estimates

                           Standard    Wald
Parameter   DF  Estimate   Error       Chi-Square   Pr > ChiSq
Intercept    1   -2.6068   0.4322      36.3812      <.0001
age 25-34    1   -4.6680   1.1680      15.9720      <.0001
age 35-44    1   -2.8131   0.5539      25.7905      <.0001
age 45-54    1   -1.0542   0.4438       5.6411       0.0175
age 55-64    1   -0.5047   0.4292       1.3827       0.2396
age 65-74    1    0.0317   0.4383       0.0052       0.9424
tob          1    0.4094   0.0921      19.7734      <.0001
alc          1    0.2548   0.0265      92.8054      <.0001

Lecture 18: Logistic Regression Continued p. 36/104

Hosmer-Lemeshow Goodness-of-Fit Statistic when the n_j's are small

Suppose the responses are Bernoulli and are not naturally grouped into binomials, either because there is a continuous covariate or because there are many covariates, so that grouping would lead to sparse strata; i.e., suppose that more than 25% of the n_j < 10. Then assume

Y_i ~ Bern(p_i) = Bin(1, p_i).

We want to determine if the model

p_i = e^{β_0 + β_1' x_i} / (1 + e^{β_0 + β_1' x_i})

is a good fit (we are thinking of x_i as a vector).

Lecture 18: Logistic Regression Continued p. 37/104

One possible way is to fit a broader model (with interactions and/or squared, cubic, etc. terms) and see if those extra terms are significant. Hosmer and Lemeshow suggest forming G (usually 10) "extra terms" or "extra covariates" based on combinations of the covariates x_i in the logistic regression model. You again test whether these extra covariates are significant.

Lecture 18: Logistic Regression Continued p. 38/104

Simple Example

For example, suppose there is one continuous covariate x_i, and we fit the model

p_i = e^{β_0 + β_1 x_i} / (1 + e^{β_0 + β_1 x_i}),

where there are i = 1, ..., n observations in the dataset. We could then form 10 extra terms or groups based on deciles of the covariate x_i, i.e., we form 10 groups of approximately equal size. The first group contains the n/10 subjects with the smallest values of x_i, the second group contains the n/10 subjects with the next smallest values of x_i, and ... the last group contains the n/10 subjects with the highest values of x_i. Suppose we define the 9 indicators (the last one is redundant)

I_ig = 1 if individual i is in group g, 0 otherwise.

Lecture 18: Logistic Regression Continued p. 39/104

Then, to test goodness-of-fit, we consider the alternative model

logit(p_i) = β_0 + β_1 x_i + γ_1 I_i1 + ... + γ_9 I_i9.

If the model logit(p_i) = β_0 + β_1 x_i is appropriate, then

γ_1 = ... = γ_9 = 0.

Then, to test the fit of the model, we could use a likelihood ratio statistic, a score statistic, or a Wald statistic for

H_0: γ_1 = ... = γ_9 = 0,

and each of these statistics would be approximately chi-square with 9 degrees of freedom if the model fits.

Lecture 18: Logistic Regression Continued p. 40/104

General Logistic Model

Now, we are looking at the fit of the general logistic model

p_i = e^{β_0 + β_1' x_i} / (1 + e^{β_0 + β_1' x_i}),

in which the covariates can be continuous, and a method of forming the extra covariates is not obvious. Hosmer and Lemeshow suggest that we form groups based on "deciles of risk": if

p̂_i = e^{β̂_0 + β̂_1' x_i} / (1 + e^{β̂_0 + β̂_1' x_i})

is the predicted probability from the given model, then Hosmer and Lemeshow suggest forming groups based on deciles of p̂_i, i.e.,

Lecture 18: Logistic Regression Continued p. 41/104

We form 10 groups of approximately equal size. The first group contains the n/10 subjects with the smallest values of p̂_i, the second group contains the n/10 subjects with the next smallest values of p̂_i, and ... the last group contains the n/10 subjects with the highest values of p̂_i. You can show that, if there is just one covariate x_i, then we get equivalent groups whether we base the grouping on x_i or p̂_i (because p̂_i is a monotone transformation of x_i).

Lecture 18: Logistic Regression Continued p. 42/104

A Model for which to Look at the Fit

Again, we are looking at the fit of the general model

p_i = e^{β_0 + β_1' x_i} / (1 + e^{β_0 + β_1' x_i}).

Suppose we define the G group indicators

I_ig = 1 if individual i (i.e., p̂_i) is in group g, 0 otherwise,

where the groups are based on deciles of risk. Then, to test goodness-of-fit, we consider the alternative model

logit(p_i) = β_0 + β_1 x_i + γ_1 I_i1 + ... + γ_9 I_i9.

Effectively, we are forming an alternative model used to test the fit of the given model. Even though I_ig is based on the random quantities p̂_i, Moore and Spruill (1975) showed that, asymptotically, we can treat the partition as based on the true p_i (and thus we can treat I_ig as a fixed covariate).

Lecture 18: Logistic Regression Continued p. 43/104

If the model logit(p_i) = β_0 + β_1 x_i is appropriate, then

γ_1 = ... = γ_{G−1} = 0.

Then, to test the fit of the model, we could use a likelihood ratio statistic, a score statistic, or a Wald statistic for

H_0: γ_1 = ... = γ_{G−1} = 0,

and each of these statistics would be approximately chi-square with (G − 1) degrees of freedom if the model fits. The score statistic only requires the estimate of (β_0, β_1) under the null, but both the likelihood ratio and the Wald statistic require the estimates of the γ_g from the alternative model.

Lecture 18: Logistic Regression Continued p. 44/104
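As a concrete illustration of this decile-of-risk construction, here is a minimal SAS sketch (not from the original slides). It assumes the arthritis data (ARTH) used later in this lecture, and the output dataset/variable names (pred, phat, decile) are illustrative; in practice, the LACKFIT option shown later does this grouping automatically.

proc logistic data=arth descending;
  model y = SEX AGE TRT x;
  output out=pred prob=phat;        /* predicted probabilities p-hat */
run;

proc rank data=pred groups=10 out=ranked;
  var phat;
  ranks decile;                     /* decile of risk: 0,...,9 */
run;

proc logistic data=ranked descending;
  class decile (param=ref);
  model y = SEX AGE TRT x decile;   /* the decile effects play the role of the gamma's */
run;

The joint (Wald or likelihood ratio) test that the 9 decile coefficients are zero is the goodness-of-fit test described above.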

The Hosmer-Lemeshow Statistic

Hosmer and Lemeshow (1982) suggest using Pearson's chi-square based on the grouping of observations:

X²_HL = Σ_{g=1}^G [O_g − E_g]² / { n_g (E_g/n_g) [1 − E_g/n_g] }
      = Σ_{g=1}^G Σ_{k=1}^2 [O_gk − E_gk]² / E_gk,

where, for failure (Y = 0), O_g1 = O_g, E_g1 = E_g, and, for success (Y = 1), O_g2 = n_g − O_g, E_g2 = n_g − E_g. Basically, this is Pearson's (Hosmer and Lemeshow) chi-square applied to the (G × 2) table of observed and expected counts in the G groups.

Lecture 18: Logistic Regression Continued p. 45/104

Example Arthritis Clinical Trial Recall the example of an arthritis clinical trial comparing the drug auranofin and placebo therapy for the treatment of rheumatoid arthritis (Bombardier, et al., 1986). The response of interest is the self-assessment of arthritis, classified as (0) poor or (1) good. Individuals were randomized into one of the two treatment groups after baseline self-assessment of arthritis (with the same 2 levels as the response). The dataset contains 293 patients who were observed at both baseline and 13 weeks. The data from 25 cases are shown below: Lecture 18: Logistic Regression Continued p. 46/104

We are interested in a pretest-posttest analysis, in which we relate the individual's Bernoulli response

Y_i = 1 if good at 13 weeks, 0 if poor at 13 weeks,

to the covariates:
1. BASELINE self-assessment (denoted as x)
2. AGE IN YEARS
3. GENDER
4. TREATMENT

Lecture 18: Logistic Regression Continued p. 47/104

In particular, the model is logit(p i ) = β 0 + β 1 x i + β SEX SEX i + β AGE AGE i + β TRT TRT i where the covariates are age in years at baseline (AGE i ), sex (SEX i, 1=male, 0=female), treatment (TRT i, 1 = auranofin, 0 = placebo), and x i is baseline response (1 = good, 0 = poor) The main question is whether the treatment increases the odds of a more favorable response, after controlling for baseline response; secondary questions are whether the response differs by age and sex. Lecture 18: Logistic Regression Continued p. 48/104

The Hosmer-Lemeshow statistic can be obtained from SAS Proc Logistic (via the LACKFIT option). By default, SAS attempts to form 10 groups of approximately equal size n/10. Of course, in this dataset we have 293 observations, so we cannot get exactly the same number of observations in each group; different software packages (such as STATA) may differ slightly in the partition.

Lecture 18: Logistic Regression Continued p. 49/104

SAS Proc Logistic

The following ASCII file is in the current directory and is called art.dat:

1 54 1 0 1
0 41 0 1 1
...............
1 60 0 1 0
1 63 1 0 0

/* SAS STATEMENTS */

DATA ARTH;
  infile 'art.dat';
  input SEX AGE TRT x y;
run;

proc logistic descending;
  model y = SEX AGE TRT x / lackfit;
run;

Lecture 18: Logistic Regression Continued p. 50/104

/* SELECTED OUTPUT */

Response Profile

Ordered           Total
Value     y       Frequency
  1       1         234
  2       0          59

Probability modeled is y=1.

Analysis of Maximum Likelihood Estimates

                           Standard    Wald
Parameter   DF  Estimate   Error       Chi-Square   Pr > ChiSq
Intercept    1   0.3327    0.8409       0.1566       0.6923
SEX          1   0.2168    0.3389       0.4095       0.5222
AGE          1  -0.00530   0.0144       0.1361       0.7122
TRT          1   0.7005    0.3136       4.9891       0.0255
x            1   1.4231    0.3102      21.0533      <.0001

Lecture 18: Logistic Regression Continued p. 51/104

Partition for the Hosmer and Lemeshow Test

                      y = 1                  y = 0
Group   Total   Observed  Expected   Observed  Expected
  1      29        13       15.51       16       13.49
  2      29        17       18.02       12       10.98
  3      29        23       20.99        6        8.01
  4      29        25       23.18        4        5.82
  5      28        24       23.47        4        4.53
  6      31        30       26.21        1        4.79
  7      29        23       25.24        6        3.76
  8      29        28       26.28        1        2.72
  9      30        26       27.45        4        2.55
 10      30        25       27.65        5        2.35

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square   DF   Pr > ChiSq
  12.9408     8       0.1139

Lecture 18: Logistic Regression Continued p. 52/104
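To connect this table to the X²_HL formula given earlier (a worked check, not on the original slide): group 1 contributes

(13 − 15.51)²/15.51 + (16 − 13.49)²/13.49 = 0.41 + 0.47 ≈ 0.87,

and summing the analogous contributions over all 10 groups reproduces the reported value of 12.94 on 10 − 2 = 8 degrees of freedom.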

Results

Recall, the β's are interpreted as conditional log odds ratios for a one-unit increase in the covariate, given that all of the other covariates are the same. We estimate that the odds of a good response increase by:

1. a factor of exp(β̂_TRT) = exp(0.7005) = 2.015 for those on treatment (this is significant);
2. a factor of exp(β̂_SEX) = exp(0.2168) = 1.242 for males (not significant);
3. a factor of exp(β̂_AGE) = exp(−0.00530) = 0.995 for every year older (not significant);
4. a factor of exp(β̂_X) = exp(1.423) = 4.150 for those who also had a good pre-treatment response at baseline (this is significant).

Lecture 18: Logistic Regression Continued p. 53/104

In SAS, the Hosmer and Lemeshow goodness-of-fit statistic is

X² = 12.9408 with 10 − 2 = 8 df, p = 0.1139.

The H-L test indicates the model fit is OK.

Lecture 18: Logistic Regression Continued p. 54/104

Using D²

Suppose, in the previous model, you want to see if there is a treatment by baseline response interaction. You can use Proc Logistic to do this.

Reduced Model:

proc logistic descending;
  model y = SEX AGE TRT x / aggregate=(SEX AGE TRT x) scale=1;
run;

/* SELECTED OUTPUT */

Deviance and Pearson Goodness-of-Fit Statistics

Criterion   DF    Value      Value/DF   Pr > ChiSq
Deviance    158   180.6392   1.1433     0.1048
Pearson     158   186.7077   1.1817     0.0590

Lecture 18: Logistic Regression Continued p. 55/104

Full Model:

proc logistic descending;
  model y = SEX AGE TRT x TRT*x / aggregate=(SEX AGE TRT x) scale=1;
run;

/* SELECTED OUTPUT */

Deviance and Pearson Goodness-of-Fit Statistics

Criterion   DF    Value      Value/DF   Pr > ChiSq
Deviance    157   178.4911   1.1369     0.1153
Pearson     157   183.9152   1.1714     0.0698

Number of unique profiles: 163

Lecture 18: Logistic Regression Continued p. 56/104

To calcluate D 2, Proc Logistic uses the aggregate option to figure out the number of distinct covariate patterns (strata) corresponding to the observed combinations of the covariates (SEX AGE TRT x), and uses that as the saturated model. There are 163 patterns in this dataset: POPULATION PROFILES Sample Sample SEX AGE TRT X Size ---------------------------------- 1 0 22 0 1 1... 10 0 35 1 1 5 11 0 36 1 1 1 12 0 39 1 1 1 13 0 41 0 1 2... 163 1 66 0 0 1 Lecture 18: Logistic Regression Continued p. 57/104

With 293 observations and 163 distinct covariate patterns, we only have an average of 293/163 = 1.8 observations per stratum, which is not enough to justify an approximate chi-square distribution for D². However, if you want to test for no treatment by baseline response interaction, you can use the Wald statistic:

                            Standard   Wald
Parameter   DF   Estimate   Error      Chi-Square   Pr > ChiSq
TRT*x        1   -0.9213    0.6297     2.1405       0.1435

Lecture 18: Logistic Regression Continued p. 58/104

Or, the likelihood ratio statistic,

D²(SEX, AGE, TRT, X) − D²(SEX, AGE, TRT, X, TRT×X) = 180.64 − 178.49 = 2.15 ~ χ²_1,

and get almost identical results to the Wald statistic. Again, even though D²(SEX, AGE, TRT, X) and D²(SEX, AGE, TRT, X, TRT×X) are not approximately chi-square, their difference is.

Lecture 18: Logistic Regression Continued p. 59/104
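A quick check of the Taylor-series relationship stated earlier (a worked computation, not on the original slide): the Wald statistic is (−0.9213 / 0.6297)² = (−1.463)² ≈ 2.14, essentially the same as the deviance difference of 2.15.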

Additional Slides on the Saturated Model

To see the saturated model, consider the following example based on the esophageal cancer data:

proc means;
  class age;
  var y;
  freq count;
run;

The above code calculates the average of the 1's in each age group, i.e., the observed p̂ for that group.

Summary

Analysis Variable: y

age      N Obs    N     Mean
25-34      116   116   0.0086207
35-44      199   199   0.0452261
45-54      213   213   0.2159624
55-64      242   242   0.3140496
65-74      161   161   0.3416149
75+         44    44   0.2954545

Lecture 18: Logistic Regression Continued p. 60/104

Model-Derived Estimates

To easily estimate the model-predicted probabilities, consider using the reference coding:

proc logistic descending;
  class age (param=ref);
  model y = age / aggregate scale=1;
  freq count;
run;

Then, since each AGE category gets a dummy code, p̂ is equal to

P(Y = 1 | Age Group) = e^{α + β_{Age Group}} / (1 + e^{α + β_{Age Group}}).

Parameter Estimates

                            Standard
Parameter   DF   Estimate   Error
Intercept    1   -0.8690    0.3304
age 25-34    1   -3.8759    1.0573
age 35-44    1   -2.1808    0.4749
age 45-54    1   -0.4203    0.3700
age 55-64    1    0.0878    0.3583
age 65-74    1    0.2129    0.3699

Lecture 18: Logistic Regression Continued p. 61/104

For age 25-34,

P(Y = 1 | 25-34) = e^{−0.8690 − 3.8759} / (1 + e^{−0.8690 − 3.8759}) = 0.008620964,

and

P(Y = 1 | 75+) = e^{−0.8690} / (1 + e^{−0.8690}) = 0.295462424.

Since the model-predicted probabilities match the observed probabilities, you have a zero-degrees-of-freedom test and a perfect fit. To illustrate goodness of fit, we need a non-saturated model.

Lecture 18: Logistic Regression Continued p. 62/104

Introduction of Tobacco and Alcohol Levels Using PROC GENMOD

proc genmod descending;
  class age;
  model y = age tob alc / dist=bin link=logit;
  freq count;
run;

Here, we want to observe the Deviance/DF ratio to check for under/over-dispersion. Recall, for properly dispersed data (and hence a good model fit) we want the deviance-to-DF ratio to be around 1. For this model, the ratio is

Criteria For Assessing Goodness Of Fit

Criterion   DF    Value      Value/DF
Deviance    967   710.5516   0.7348

Our data appear underdispersed, meaning that our model predicts greater variance than we observe. This is in part an artifact, since we are treating each observation as a stratum. When we calculate GOF, we need to group the data.

Lecture 18: Logistic Regression Continued p. 63/104

Parameter Estimates Using PROC LOGISTIC

proc logistic descending;
  class age (PARAM=ref);
  model y = age tob alc / aggregate scale=1;
  freq count;
run;

Here, we have specified SCALE=1. This means that we are multiplying the variance-covariance matrix by a constant of 1 (i.e., the matrix is not adjusted; note that with the data aggregated into strata, the data now appear overdispersed).

GOF statistics

Deviance and Pearson Goodness-of-Fit Statistics

Criterion   Value     DF   Value/DF   Pr > ChiSq
Deviance    89.0166   80   1.1127     0.2297
Pearson     91.6997   80   1.1462     0.1748

Number of unique profiles: 88

Lecture 18: Logistic Regression Continued p. 64/104

Before Scaling: Going Back to GOF

We specified AGGREGATE to calculate the change in deviance automatically. Recall, this defines the J subtables for which a saturated model can be fit; thus, you can calculate the change in deviance. G² is easy to calculate, since you can fit a model with only the intercept and look at the increase in likelihood with the added parameters.

SCALE and AGGREGATE must be used together. Options for SCALE are:
1 (or NONE): no scaling
D: var-cov matrix scaled by the deviance
P: var-cov matrix scaled by Pearson's χ²
c: any constant c (1 is a special case)

Lecture 18: Logistic Regression Continued p. 65/104

Quasi-Likelihood Estimation

Scaling the var-cov matrix is a common method of correcting for poorly predicted variance (overdispersion or underdispersion). Note that the parameter estimates are not changed by this method; however, their standard errors are adjusted for overdispersion, affecting their significance tests. Since we are altering the estimating equations, we are no longer calculating true maximum likelihood estimates; the revised estimating equations are sometimes known as quasi-likelihood equations. Quasi-likelihood estimating equations rely on only the mean and variance specifications (as opposed to the full distribution of the outcome). Quasi-likelihood forms the basis of a variety of estimation approaches, including Generalized Estimating Equations. Quasi-likelihood estimating equations perform well given large sample sizes.

Lecture 18: Logistic Regression Continued p. 66/104

Scaled Estimates

You can use either GENMOD or LOGISTIC:

proc genmod descending;
  class age;
  model y = age tob alc / dist=bin link=logit aggregate scale=d;
  freq count;
run;

proc logistic descending;
  class age (PARAM=ref);
  model y = age tob alc / aggregate scale=d;
  freq count;
run;

Lecture 18: Logistic Regression Continued p. 67/104

GENMOD Results

Criteria For Assessing Goodness Of Fit

Criterion            DF    Value       Value/DF
Deviance             80    89.0166     1.1127
Scaled Deviance      80    80.0000     1.0000  *******
Pearson Chi-Square   80    91.7017     1.1463
Scaled Pearson X2    80    82.4131     1.0302
Log Likelihood            -319.2895

Note: Value/DF now equals 1 for the scaled deviance, i.e., properly dispersed. Note also that Value/DF for the unscaled deviance is 1.1127, which indicates that, when the 88 unique strata are considered, the data are overdispersed, meaning that Var(Y) is greater than predicted by our model.

Lecture 18: Logistic Regression Continued p. 68/104

Parameter Estimates - GENMOD

Analysis Of Parameter Estimates

                            Standard
Parameter   DF   Estimate   Error
Intercept    1   -2.6068    0.4322
age 25-34    1   -4.6681    1.1681
age 35-44    1   -2.8131    0.5539
age 45-54    1   -1.0542    0.4438
age 55-64    1   -0.5047    0.4292
age 65-74    1    0.0317    0.4383
age 75+      0    0.0000    0.0000
tob          1    0.4094    0.0921
alc          1    0.2548    0.0265
Scale        0    1.0548    0.0000

NOTE: The scale parameter was estimated by the square root of DEVIANCE/DOF.

Lecture 18: Logistic Regression Continued p. 69/104

USING PROC LOGISTIC

Deviance and Pearson Goodness-of-Fit Statistics

Criterion   Value     DF   Value/DF   Pr > ChiSq
Deviance    89.0166   80   1.1127     0.2297
Pearson     91.6997   80   1.1462     0.1748

Number of unique profiles: 88

Testing Global Null Hypothesis: BETA=0

Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   250.6831      7     <.0001

Note: In GENMOD, you get the scaled statistics for GOF, but LOGISTIC doesn't print them. I guess since they would equal the DF, they are not that exciting to look at.

Lecture 18: Logistic Regression Continued p. 70/104

Parameter Estimates

NOTE: The covariance matrix has been multiplied by the heterogeneity factor (Deviance / DF) 1.11271.

Analysis of Maximum Likelihood Estimates

                            Standard
Parameter   DF   Estimate   Error
Intercept    1   -2.6068    0.4322
age 25-34    1   -4.6680    1.1680
age 35-44    1   -2.8131    0.5539
age 45-54    1   -1.0542    0.4438
age 55-64    1   -0.5047    0.4292
age 65-74    1    0.0317    0.4383
tob          1    0.4094    0.0921
alc          1    0.2548    0.0265

Lecture 18: Logistic Regression Continued p. 71/104

Unscaled Estimates

Testing Global Null Hypothesis: BETA=0

Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   278.9369      7     <.0001

Analysis of Maximum Likelihood Estimates

                            Standard
Parameter   DF   Estimate   Error
Intercept    1   -2.6068    0.4097
age 25-34    1   -4.6680    1.1073
age 35-44    1   -2.8131    0.5251
age 45-54    1   -1.0542    0.4208
age 55-64    1   -0.5047    0.4069
age 65-74    1    0.0317    0.4155
tob          1    0.4094    0.0873
alc          1    0.2548    0.0251

Lecture 18: Logistic Regression Continued p. 72/104
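To make the scaling explicit (a worked check, not on the original slides): the scale factor is sqrt(Deviance/DF) = sqrt(89.0166/80) = sqrt(1.11271) ≈ 1.0548, and multiplying the unscaled standard errors above by this factor reproduces the scaled standard errors on the previous slide, e.g., 1.0548 × 0.0873 ≈ 0.0921 for tob and 1.0548 × 0.0251 ≈ 0.0265 for alc.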

Residuals and Regression Diagnostics

Additional factors to assess when determining goodness of fit: sometimes you can look at residuals to see where the model does not fit well. When the responses are binary, but not binomial, a residual can be defined as

e_i = (Y_i − p̂_i) / sqrt( p̂_i (1 − p̂_i) ).

In large samples, we can replace p̂_i in e_i by p_i to give

R_i = (Y_i − p_i) / sqrt( p_i (1 − p_i) ).

Lecture 18: Logistic Regression Continued p. 73/104

If the model is correctly specified, in large samples this residual will have mean 0:

E(e_i) = E[ (Y_i − p̂_i) / sqrt( p̂_i (1 − p̂_i) ) ] ≈ ( E(Y_i) − p_i ) / sqrt( p_i (1 − p_i) ) = ( p_i − p_i ) / sqrt( p_i (1 − p_i) ) = 0,

and variance 1:

Var(e_i) = Var[ (Y_i − p̂_i) / sqrt( p̂_i (1 − p̂_i) ) ] ≈ Var(Y_i − p_i) / ( p_i (1 − p_i) ) = p_i (1 − p_i) / ( p_i (1 − p_i) ) = 1.

Lecture 18: Logistic Regression Continued p. 74/104

If the true model is E(Y_i) = π_i ≠ p_i, then the residual has mean

E(e_i) ≈ ( E(Y_i) − p_i ) / sqrt( p_i (1 − p_i) ) = ( π_i − p_i ) / sqrt( p_i (1 − p_i) ) ≠ 0,

and variance

Var(e_i) ≈ Var(Y_i − p_i) / ( p_i (1 − p_i) ) = π_i (1 − π_i) / ( p_i (1 − p_i) ) ≠ 1.

Lecture 18: Logistic Regression Continued p. 75/104

Thus, if we just look at the simple average of these residuals,

(1/n) Σ_{i=1}^n e_i,

and we do not get around 0, it gives us an idea that the model does not fit well. And, if we just look at the average of the squared residuals,

(1/n) Σ_{i=1}^n e_i²,

and we do not get around 1, then we know that something may be wrong with the model. If the model fits, then

Var(e_i) ≈ E(e_i²) ≈ 1.

Lecture 18: Logistic Regression Continued p. 76/104

However, since Y_i is always 0 or 1, regardless of the sample size (since Y_i is Bernoulli, and does not change with sample size), the individual observed residuals are not really that informative when the data are binary (Cox). Note that the ordinary residuals Y_i − p̂_i are not as informative, since you can show that

Σ_{i=1}^n (Y_i − p̂_i) = 0

as long as there is an intercept in the model. In a moment, we will talk about ways to look at the residuals when the data are not grouped. First we consider the grouped case.

Lecture 18: Logistic Regression Continued p. 77/104

Grouped Binomials

Suppose the data actually arise as J binomials, where y_j is the number of successes and n_j the sample size, with

p̂_j = e^{x_j' β̂} / (1 + e^{x_j' β̂}).

The (unadjusted) Pearson residuals are

e_j = (y_j − n_j p̂_j) / sqrt( n_j p̂_j (1 − p̂_j) ).

In large samples, we can replace p̂_j by p_j, and

e_j ≈ (y_j − n_j p_j) / sqrt( n_j p_j (1 − p_j) ).

Lecture 18: Logistic Regression Continued p. 78/104

If the model is correctly specified, in large samples (n_j large), this residual will have mean 0:

E(e_j) ≈ ( E(Y_j) − n_j p_j ) / sqrt( n_j p_j (1 − p_j) ) = ( n_j p_j − n_j p_j ) / sqrt( n_j p_j (1 − p_j) ) = 0,

and variance 1:

Var(e_j) ≈ Var(Y_j − n_j p_j) / ( n_j p_j (1 − p_j) ) = n_j p_j (1 − p_j) / ( n_j p_j (1 − p_j) ) = 1.

Also, Y_j will become more normal as n_j gets large, and, if the model fits,

e_j ≈ N(0, 1).

Then, if the model fits, we would expect 95% of the residuals to be between −2 and 2, and to be approximately normal when plotted.

Lecture 18: Logistic Regression Continued p. 79/104

If the true model is E(Y_j) = n_j π_j ≠ n_j p_j, then the residual has mean

µ_j = E(e_j) = ( E(Y_j) − n_j p_j ) / sqrt( n_j p_j (1 − p_j) ) = ( n_j π_j − n_j p_j ) / sqrt( n_j p_j (1 − p_j) ) ≠ 0,

and variance

σ_j² = Var(e_j) = Var(Y_j − n_j p_j) / ( n_j p_j (1 − p_j) ) = n_j π_j (1 − π_j) / ( n_j p_j (1 − p_j) ) ≠ 1.

Lecture 18: Logistic Regression Continued p. 80/104

Still, Y_j will become more normal as n_j gets large, but

e_j ≈ N(µ_j, σ_j²),

and plots can sometimes reveal departures from N(0, 1). Note that Pearson's chi-square is

X² = Σ_{j=1}^J e_j².

You can generate reams of regression-diagnostic output using the INFLUENCE option in the MODEL statement.

Lecture 18: Logistic Regression Continued p. 81/104

Residuals for Esophageal Data

data two;
  input age $ alcohol $ tobacco $ cases controls;
  n = cases + controls;
  if alcohol = '0-39'   then alc = 2;
  if alcohol = '40-79'  then alc = 6;
  if alcohol = '80-119' then alc = 10;
  if alcohol = '120+'   then alc = 15;
  if tobacco = '0-9'    then tob = .5;
  if tobacco = '10-19'  then tob = 1.5;
  if tobacco = '20-29'  then tob = 2.5;
  if tobacco = '30+'    then tob = 4.0;
  if age = '25-34' then age1 = 1; else age1 = 0;
  if age = '35-44' then age2 = 1; else age2 = 0;
  if age = '45-54' then age3 = 1; else age3 = 0;
  if age = '55-64' then age4 = 1; else age4 = 0;
  if age = '65-74' then age5 = 1; else age5 = 0;
cards;

Lecture 18: Logistic Regression Continued p. 82/104

proc logistic data=two;
  model cases/n = age1-age5 tob alc / aggregate scale=d INFLUENCE;
  output out=dinf prob=p resdev=dr h=pii reschi=pr difchisq=difchi;
run;

Note: We switched to the events/trials syntax to reduce the size of the output data set. Now, since we have 96 rows of data, we'll get 96 records in the output data set DINF.

Lecture 18: Logistic Regression Continued p. 83/104

Summary of Pearson Residuals

proc univariate data=dinf;
  var pr;
run;

Location                      Variability
Mean      -0.00283            Std Deviation        1.02665
Median    -0.18818            Variance             1.05401
Mode       .                  Range                6.37022
                              Interquartile Range  1.13606

Lecture 18: Logistic Regression Continued p. 84/104

Quantile        Estimate
100% Max        4.133526
99%             4.133526
95%             2.029202
90%             1.168725
75% Q3          0.481946
50% Median     -0.188184
25% Q1         -0.654117
10%            -1.091350
5%             -1.243294
1%             -2.236690
0% Min         -2.236690

Lecture 18: Logistic Regression Continued p. 85/104
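Reading this output against the criteria on the earlier residual slides (an interpretive note, not on the original slide): the mean of the Pearson residuals is −0.0028 ≈ 0 and their variance is 1.054 ≈ 1, consistent with an adequately fitting model, while the maximum residual of about 4.13 suggests one or two strata that the model fits less well.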

New to SAS 9.1

ods graphics on;
ods pdf file="lec18.pdf" style=journal;

proc logistic data=two;
  graphics all;
  model cases/n = age1-age5 tob alc / aggregate scale=d INFLUENCE;
run;

ods pdf close;
ods graphics off;

For the PDF output, see the class website (too comprehensive to include in the lecture).

Lecture 18: Logistic Regression Continued p. 86/104

Collinearity and Confounding in Logistic Regression 1. Confounding Note, confounding can be a problem in logistic regression with many covariates. One covariate confounds the relationship between the response and another covariate if it is related to both the response and covariate. Suppose you want to control for a possible confounding factor that is not of interest itself. If, in the fitted model, the confounding factor is not significant, but it changes the significance and estimated odds ratio for the covariates of interest, then you should always keep the confounding factor in the model (Breslow and Day, Hosmer and Lemeshow). Lecture 18: Logistic Regression Continued p. 87/104

2. Collinearity

Strictly speaking, collinearity refers to correlation among the covariates. Thus, if there is collinearity in a dataset, there will very often also be confounding, since many of the covariates will be related to both the response and other covariates. Sometimes collinearity is unavoidable, as when we have both x_{ik} and x_{ik}² as covariates (such as age and age²), which are usually highly correlated. In extreme cases, where two covariates are very highly correlated, we get unstable fitted equations, symptomatic of collinearity among the covariates. When a pair of covariates are collinear, estimated coefficients may even change signs when the covariates are in the model together versus in the model separately.

Lecture 18: Logistic Regression Continued p. 88/104

With logistic regression, one symptom of multicollinearity is failure of the Newton-Raphson algorithm to converge, because of infinite regression coefficients, very large standard errors, and/or coefficients that change wildly under different model specifications. When there is collinearity, one should consider which of the collinear variables is most important.

Lecture 18: Logistic Regression Continued p. 89/104

Model Building Strategies Building logistic regression models when there are many possible covariates can be a bewildering experience. There can be many interactions to consider, categorical vs. continuous covariates, data transformations, etc. It is often useful to work hierarchically, looking at increasingly more complex structures of nested models, using test statistics (likelihood ratio, score or Wald) in deciding which covariates are important or not important in predicting response Lecture 18: Logistic Regression Continued p. 90/104

Some Helpful Steps

Often, you first look at the relationship between the response and each covariate separately: 1) with Pearson's chi-square (or Fisher's exact test) for a categorical covariate; 2) if the covariate is ordered or continuous, you often look at the simple logistic regression

logit(p) = β_0 + β_1 x

and test for significance of β_1 in the regression (or use an exact ordinal test). These relationships are sometimes called univariate (one covariate) or bivariate (1 response versus 1 covariate) relationships.

Lecture 18: Logistic Regression Continued p. 91/104
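A minimal SAS sketch of this univariate screening (an illustration, not from the original slides; the dataset name ARTH and the variables follow the arthritis example used below):

proc freq data=arth;
  tables y*trt y*sex / chisq;   /* Pearson chi-square for the categorical covariates */
run;

proc logistic data=arth descending;
  model y = age;                /* simple logistic regression for a continuous covariate */
run;

The chi-square p-values and the Wald test for β_1 from these fits are the univariate screens referred to above.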

Example: Arthritis Clinical Trial

We are interested in seeing how the binary response

Y = 1 if good at 13 weeks, 0 if poor at 13 weeks

is affected by the covariates:

1. BASELINE self-assessment: X_1 = 1 if good at BASELINE, 0 if poor at BASELINE
2. AGE IN YEARS
3. GENDER: SEX = 1 if male, 0 if female
4. TREATMENT: TRT = 1 if auranofin, 0 if placebo

Lecture 18: Logistic Regression Continued p. 92/104

Univariate Relationships

                                          Pearson's
Covariate            % GOOD        DF     Chi-Square   P-value
---------            ------        --     ----------   -------
Baseline   GOOD       87.50         1       22.849      0.000
           POOR       63.44
AGE        21-47      80.61         2        3.636      0.162
           48-56      86.08
           57-66      75.00
GENDER     MALE       80.75         1        0.382      0.536
           FEMALE     77.50
TREATMENT  AURANOFIN  84.93         1        4.648      0.031
           PLACEBO    74.83

Lecture 18: Logistic Regression Continued p. 93/104

Treatment and baseline response are significant, whereas the other two are not. Instead of continuous age, we formed three age groups (based on terciles):

Category   Age Range
   0        21-47
   1        48-56
   2        57-66

Alternatively, we could have run a logistic regression with continuous age as a covariate.

Lecture 18: Logistic Regression Continued p. 94/104

Multivariate Models

After carefully looking at the univariate analyses as a screening tool, you can run a multivariate analysis including all covariates thought to be important from the univariate analyses. Hosmer and Lemeshow recommend including in the multivariate analysis any covariate that had a p-value less than .25 in a univariate analysis. Hosmer and Lemeshow also recommend including other variables known to be important (treatment, exposure variables, possible confounding variables, etc.) that may not be significant.

Lecture 18: Logistic Regression Continued p. 95/104

Consider the importance of each covariate in the model, and look to see whether some could be deleted or others need to be added (via test statistics, usually). Once you feel close to a final model, look more carefully for possible interactions, recoding of variables that might be helpful, addition of quadratic terms, or other transformations, etc. If possible, meet with an investigator on the subject matter to see if the model is biologically plausible. Lecture 18: Logistic Regression Continued p. 96/104

Notes There is more than ONE final model In complex datasets, we often will present the results of several related models. What is important is that you write up your model building strategy in a fair and descriptive way. Statistical Significance is not the only reason to keep a covariate in the model. If a covariate is thought (and shown by others) to be a confounding variable, for the association between the exposure and disease, then it should be kept in. You may also be interested in p values for covariates which are not significant, so you often leave them in the model to show they are not significant. Lecture 18: Logistic Regression Continued p. 97/104