Examining a Fitted Logistic Model




STAT 536 Lecture 16

Deviance Test for Lack of Fit

The data below describes the male birth fraction (male births / total births) over the years 1931 to 1990. A simple logistic model was fit as follows:

> glfit <- glm( cbind(mc,fc) ~ year, family=binomial)
> summary(glfit)
.....
    Null deviance: 80.252 on 59 degrees of freedom
Residual deviance: 78.385 on 58 degrees of freedom

> 1-pchisq(glfit$dev,glfit$df.resid)
[1] 0.03853689
> glfitsat <- glm( cbind(mc,fc) ~ factor(year), family=binomial)
> anova(glfit,glfitsat,test="chi")
Analysis of Deviance Table

Model 1: cbind(mc, fc) ~ year
Model 2: cbind(mc, fc) ~ factor(year)
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1        58     78.385
2         0      0.000 58   78.385   0.03854 *
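The two forms of the deviance test above can be reproduced on simulated data. The following is a minimal sketch, not the lecture's analysis: the yearly totals, the true probabilities and the seed are assumptions made purely for illustration, and only the variable names (mc, fc, year, glfit, glfitsat) follow the session above.

```r
# Simulated stand-in for the birth data (NOT the lecture's data).
set.seed(536)
year  <- 1931:1990
total <- rep(5000, length(year))                 # hypothetical births per year
p     <- plogis(0.05 - 0.0005 * (year - 1960))   # hypothetical true male fraction
mc    <- rbinom(length(year), total, p)          # male counts
fc    <- total - mc                              # female counts

# Linear-in-year logistic model for grouped binomial data
glfit <- glm(cbind(mc, fc) ~ year, family = binomial)

# Deviance lack-of-fit test: residual deviance against a chi-square
# distribution on the residual degrees of freedom
p_dev <- pchisq(deviance(glfit), df.residual(glfit), lower.tail = FALSE)

# The same test written as a comparison with the saturated model,
# which has one parameter per year
glfitsat <- glm(cbind(mc, fc) ~ factor(year), family = binomial)
tab      <- anova(glfit, glfitsat, test = "Chisq")
p_anova  <- tab[2, ncol(tab)]                    # p-value is the last column

c(deviance_test = p_dev, anova_test = p_anova)
```

Because the saturated model reproduces the observed proportions exactly, its residual deviance is zero and the two p-values agree. Using lower.tail = FALSE is numerically preferable to 1 - pchisq(...) when the p-value is very small.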

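Examination of Residuals

Here is a plot of the Pearson residuals, which are defined based on the fitted values $\hat{y}_i = m_i \hat{\pi}_i$ (letting $m_i$ be the total number of births in year $i$) as

$$\frac{y_i - \hat{y}_i}{\sqrt{\mathrm{Var}(\hat{y}_i)}} \qquad (1)$$

where the variance is estimated by the binomial variance $m_i \hat{\pi}_i (1 - \hat{\pi}_i)$.

Another commonly used form of residual is the deviance residual, defined as

$$\pm \sqrt{2 \left[ y_i \log\left(\frac{y_i}{\hat{y}_i}\right) + (n_i - y_i) \log\left(\frac{n_i - y_i}{n_i - \hat{y}_i}\right) \right]} \qquad (2)$$

taking the sign of $y_i - \hat{y}_i$, where $n_i$ denotes the same binomial total as $m_i$ above. The plot below illustrates the convergence of the two definitions for large values of $n_i$.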
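Both residual definitions can be checked against R's built-in residuals() method. This is a minimal sketch on simulated grouped binomial data; the covariate x, the totals m and the coefficients are assumptions chosen only for illustration.

```r
set.seed(1)
x <- seq(-1, 1, length.out = 40)
m <- rep(200, length(x))                         # binomial totals m_i (hypothetical)
y <- rbinom(length(x), m, plogis(0.2 + 0.5 * x)) # observed counts y_i
fit <- glm(cbind(y, m - y) ~ x, family = binomial)

pihat <- fitted(fit)                             # fitted probabilities
yhat  <- m * pihat                               # fitted counts

# Pearson residual, equation (1), with the estimated binomial variance
r_pearson <- (y - yhat) / sqrt(m * pihat * (1 - pihat))

# Deviance residual, equation (2); a term with a zero count contributes 0
xlogy <- function(a, b) ifelse(a == 0, 0, a * log(a / b))
inner <- 2 * (xlogy(y, yhat) + xlogy(m - y, m - yhat))
r_dev <- sign(y - yhat) * sqrt(pmax(inner, 0))   # pmax guards against rounding

# Largest discrepancy from R's own residuals (should be near machine precision)
err_pearson <- max(abs(r_pearson - residuals(fit, type = "pearson")))
err_dev     <- max(abs(r_dev     - residuals(fit, type = "deviance")))
c(pearson = err_pearson, deviance = err_dev)
```

With totals of this size the two kinds of residual are also close to each other, which is the large-sample agreement described above.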

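The plot hints at the possible existence of non-linearity as a source of the lack of fit. A fifth-order polynomial fit yields the following plot. The test of residual deviance yields the following:

> 1-pchisq(glfitp$dev,glfit$df.resid)
[1] 0.1483860

Note that this call takes its degrees of freedom from the linear fit glfit (58) rather than from glfitp. The polynomial fit estimates more parameters and so has fewer residual degrees of freedom; glfitp$df.resid would be the matching reference value.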
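A sketch of the same comparison, with each model's deviance referred to its own residual degrees of freedom, is given here on simulated data. The data-generating curve and seed are assumptions. A natural cubic spline fit via ns() from the splines package is included alongside as a common alternative to a high-order polynomial; it is not part of the session above.

```r
library(splines)                                 # ships with R; provides ns()

set.seed(16)
year  <- 1931:1990
total <- rep(5000, length(year))                 # hypothetical yearly totals
yc    <- (year - 1960) / 10                      # centred, scaled year
mc    <- rbinom(length(year), total,
                plogis(0.05 - 0.02 * yc + 0.01 * yc^2))  # mildly curved truth
fc    <- total - mc

glfit  <- glm(cbind(mc, fc) ~ year,             family = binomial)
glfitp <- glm(cbind(mc, fc) ~ poly(year, 5),    family = binomial)
glfits <- glm(cbind(mc, fc) ~ ns(year, df = 5), family = binomial)

# Lack-of-fit p-value from a model's own residual deviance and df
lof <- function(fit) {
  pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)
}

data.frame(
  model    = c("linear", "poly(5)", "ns(5)"),
  df.resid = c(df.residual(glfit), df.residual(glfitp), df.residual(glfits)),
  p.value  = c(lof(glfit), lof(glfitp), lof(glfits))
)

# Nested comparison: does the polynomial improve on the linear fit?
anova(glfit, glfitp, test = "Chisq")
```

The linear model is nested in the fifth-order polynomial, so the final anova() call is a valid analysis-of-deviance comparison on 4 degrees of freedom.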

Use of polynomials to model non-linearity can be problematic due to their non-robustness. The use of splines is generally recommended instead.

Goodness of Fit tests in the absence of replication

A number of tests for lack of fit are available in CRAN packages, including library(MKmisc) and library(Design). The latter library (from Frank Harrell, author of Regression Modeling Strategies) provides its own functions for logistic regression.

We return to the ICU mortality (APACHE score) data from last day. The most widely used (but not the most powerful) test is the Hosmer-Lemeshow test. The test statistic is calculated by first partitioning the observations by deciles of fitted values, $\hat{\pi}_i$. Within each decile $j$, one calculates

$$O_j = \sum_{i \in j} y_i, \qquad E_j = \sum_{i \in j} \hat{y}_i,$$

and, letting $n_j$ represent the number of observations in that group (which will be roughly $n/10$), we calculate

$$H = \sum_{j=1}^{10} \frac{(O_j - E_j)^2}{n_j \bar{\pi}_j (1 - \bar{\pi}_j)}$$

where $\bar{\pi}_j = E_j / n_j$.

> attach(tdf)
> library(Design)
> dd <- datadist(tdf)
> options(datadist="dd")
> lrmfit <- lrm( discharge ~ reason*apache, x=TRUE,y=TRUE)
> library(MKmisc)

> HLgof.test( predict(lrmfit,type="fitted"), as.integer(lrmfit$y=="d"))

Hosmer-Lemeshow C statistic
X-squared = 1.9453, df = 8, p-value = 0.9826

Hosmer-Lemeshow H statistic
X-squared = 7.7566, df = 8, p-value = 0.4576

> residuals(lrmfit,type="gof")
Sum of squared errors     Expected value|H0                    SD
            7.8114008             7.6517094             0.2591992
                    Z                     P
            0.6160952             0.5378317

Assessing the Strength of Relationships in Logistic Regression

If one treats $y$ as representing a diagnostic result (1 = Positive) and the fitted linear predictors $\hat{\eta}_i$ as a continuous diagnostic indicator, we can use the idea of area under the curve (AUC) to capture the strength of the relationship.

> rocplot(lrmfit$y,predict(lrmfit))

Harrell's library(Design) provides automatic re-scaling of explanatory variables to aid in interpreting the magnitude of logistic regression coefficients and odds ratios.

> summary(lrmfit)
 Factor       Low High Diff. Effect  S.E.  95% LL  95% UL
 apache        10   24    14   2.03  1.72   -1.33    5.40
  Odds Ratio   10   24    14   7.63    NA    0.26  221.17
[ output truncated ]
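The decile-based statistic H defined earlier and the AUC of the linear predictor can both be computed directly from a fitted model. The following is a minimal sketch in base R on simulated binary data; the data, coefficients and seed are assumptions. It does not call HLgof.test or rocplot, and it uses the conventional reference distribution of chi-square on 10 - 2 = 8 degrees of freedom, which matches the df = 8 reported in the output above.

```r
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))              # simulated binary outcome
fit   <- glm(y ~ x, family = binomial)
pihat <- fitted(fit)                             # fitted probabilities

## Calibration: Hosmer-Lemeshow statistic from deciles of fitted values
breaks <- quantile(pihat, probs = seq(0, 1, by = 0.1))
grp    <- cut(pihat, breaks = breaks, include.lowest = TRUE)

O     <- tapply(y, grp, sum)                     # observed events O_j
E     <- tapply(pihat, grp, sum)                 # expected events E_j
nj    <- tapply(y, grp, length)                  # group sizes n_j
pibar <- E / nj                                  # mean fitted probability

H       <- sum((O - E)^2 / (nj * pibar * (1 - pibar)))
p_value <- pchisq(H, df = 10 - 2, lower.tail = FALSE)

## Discrimination: AUC of the linear predictor
eta <- predict(fit)                              # default type is "link"
n1  <- sum(y == 1)
n0  <- sum(y == 0)

# Rank (Mann-Whitney) form; rank() assigns mid-ranks to ties
r        <- rank(eta)
auc_rank <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Same quantity by counting concordant (positive, negative) pairs
cmp       <- outer(eta[y == 1], eta[y == 0], "-")
auc_pairs <- mean((cmp > 0) + 0.5 * (cmp == 0))

c(H = H, p.value = p_value, auc = auc_rank)
```

The AUC is the proportion of (positive, negative) pairs in which the positive case has the larger linear predictor, with ties counted as one half. The two calculations agree, and the rank form avoids building the full pairwise matrix.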

Model Building Strategies

The key assumptions of the logistic regression model are (i) independence of the $y_i$'s and (ii) correct specification of the relationship between $\pi_i$ and the explanatory variables. The latter depends on the validity of the link specification and on the appropriateness of the linear predictor. One particular issue that must be addressed is the potential utility of transforming continuous variables to improve the quality of the fit. Residual plots are often used for this purpose.

The data analysed below describe the occurrence of bleeding in patients enrolled in a clinical trial testing the efficacy of two protocols for treating blood clots (thromboses) using heparin, an anti-clotting drug. Bleeding is often a side-effect of heparin therapy. Physicians had noticed that older women seemed to be susceptible to bleeds. Weight is also a factor, as is a measure of the patient's innate clotting tendency, the activated partial thromboplastin time (aptt for short), which is the time taken for clots to form in a laboratory blood sample test. Patients with longer aptt values are more susceptible to bleeding.

Here are deviance residual plots after fitting a model with age, sex, weight and aptt. The dichotomous nature of logistic regression residuals makes it almost impossible to discern any pattern in such plots.

Generalized Additive Modeling is an alternative approach to examining functional form, developed by Hastie and Tibshirani. Iterative non-parametric fits are performed using scatter-plot smoothing to estimate the additive components. The algorithm produces both estimates and an assessment of the statistical significance of deviation from linearity.

Call: gam(formula = any.bld ~ gender + s(weight) + s(age) + s(aptt0), family = binomial)
[ output truncated ]
            Df Npar Df Npar Chisq P(Chi)
(Intercept)  1
gender       1
s(weight)    1       3     4.6494 0.1994
s(age)       1       3     2.4205 0.4898
s(aptt0)     1       3     6.1242 0.1057
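A comparable additive fit can be sketched on simulated data. This example makes several assumptions that should be read as such: it uses the mgcv package (shipped with R) rather than the gam function that produced the output above, so the summary layout differs and there is no "Npar" column; the data frame d, its variables and the true curve are invented for illustration; and the linear-versus-smooth comparison through anova() is only an approximate test.

```r
library(mgcv)                                    # recommended package shipped with R

set.seed(2024)
n <- 600
d <- data.frame(
  gender = factor(sample(c("F", "M"), n, replace = TRUE)),
  age    = runif(n, 30, 85),
  weight = runif(n, 45, 120)
)
# Hypothetical truth: linear in weight, curved in age
eta <- -1 + 0.4 * (d$gender == "F") - 0.01 * (d$weight - 80) +
  0.002 * (d$age - 60)^2
d$bleed <- rbinom(n, 1, plogis(eta))

# Additive logistic model with smooth terms for the continuous covariates
gfit <- gam(bleed ~ gender + s(weight) + s(age),
            family = binomial, data = d, method = "REML")
stab <- summary(gfit)$s.table                    # edf, Ref.df, Chi.sq, p-value
stab

# Approximate check of non-linearity in age: linear term vs. smooth term
glin <- gam(bleed ~ gender + s(weight) + age,
            family = binomial, data = d, method = "REML")
anova(glin, gfit, test = "Chisq")
```

In the mgcv summary an effective degrees of freedom (edf) close to 1 indicates that the estimated smooth is nearly linear, which plays a role similar to a non-significant non-parametric term in the output above.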