VI. Introduction to Logistic Regression



Similar documents
Basic Statistical and Modeling Procedures Using SAS

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541

SAS Software to Fit the Generalized Linear Model

Beginning Tutorials. PROC FREQ: It s More Than Counts Richard Severino, The Queen s Medical Center, Honolulu, HI OVERVIEW.

Overview Classes Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Lecture 14: GLM Estimation and Logistic Regression

Statistics, Data Analysis & Econometrics

Lecture 19: Conditional Logistic Regression

11. Analysis of Case-control Studies Logistic Regression

Generalized Linear Models

Guido s Guide to PROC FREQ A Tutorial for Beginners Using the SAS System Joseph J. Guido, University of Rochester Medical Center, Rochester, NY

Logit Models for Binary Data

Multinomial and Ordinal Logistic Regression

Abbas S. Tavakoli, DrPH, MPH, ME 1 ; Nikki R. Wooten, PhD, LISW-CP 2,3, Jordan Brittingham, MSPH 4

Chapter 29 The GENMOD Procedure. Chapter Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Lecture 18: Logistic Regression Continued

Examples of Using R for Modeling Ordinal Data

Ordinal Regression. Chapter

13. Poisson Regression Analysis

Logistic Regression (1/24/13)

Poisson Models for Count Data

Two Correlated Proportions (McNemar Test)

The Probit Link Function in Generalized Linear Models for Data Mining Applications

Unit 12 Logistic Regression Supplementary Chapter 14 in IPS On CD (Chap 16, 5th ed.)

Poisson Regression or Regression of Counts (& Rates)

This can dilute the significance of a departure from the null hypothesis. We can focus the test on departures of a particular form.

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Cool Tools for PROC LOGISTIC

Simple Linear Regression Inference

Common Univariate and Bivariate Applications of the Chi-square Distribution

ABSTRACT INTRODUCTION

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

SUGI 29 Statistics and Data Analysis

Bivariate Statistics Session 2: Measuring Associations Chi-Square Test

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

Categorical Data Analysis

Discrete Distributions

Latent Class Regression Part II

Developing Risk Adjustment Techniques Using the System for Assessing Health Care Quality in the

Chapter 39 The LOGISTIC Procedure. Chapter Table of Contents

MATH4427 Notebook 2 Spring MATH4427 Notebook Definitions and Examples Performance Measures for Estimators...

Multivariate Logistic Regression

ln(p/(1-p)) = α +β*age35plus, where p is the probability or odds of drinking

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Regression with a Binary Dependent Variable

Statistics 305: Introduction to Biostatistical Methods for Health Sciences

Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén Table Of Contents

How to set the main menu of STATA to default factory settings standards

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

Examining a Fitted Logistic Model

Model Fitting in PROC GENMOD Jean G. Orelien, Analytical Sciences, Inc.

Predicting the US Presidential Approval and Applying the Model to Foreign Countries.

Chapter 7: Simple linear regression Learning Objectives

Point Biserial Correlation Tests

Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares

PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY

Logit and Probit. Brad Jones 1. April 21, University of California, Davis. Bradford S. Jones, UC-Davis, Dept. of Political Science

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

SAS Syntax and Output for Data Manipulation:

Using Stata for Categorical Data Analysis

SUMAN DUVVURU STAT 567 PROJECT REPORT

Statistical Machine Learning

ADVANCED FORECASTING MODELS USING SAS SOFTWARE

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

Regression 3: Logistic Regression

Probability Calculator

Final Exam Practice Problem Answers

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS

5 Modeling Survival Data with Parametric Regression

GEEs: SAS Syntax and Examples

1.1. Simple Regression in Excel (Excel 2010).

Week 5: Multiple Linear Regression

LOGIT AND PROBIT ANALYSIS

Module 14: Missing Data Stata Practical

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Weight of Evidence Module

Statistical Functions in Excel

LOGISTIC REGRESSION ANALYSIS

GLM I An Introduction to Generalized Linear Models

Estimation of σ 2, the variance of ɛ

Nominal and ordinal logistic regression

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Statistics and Data Analysis

MORE ON LOGISTIC REGRESSION

Gamma Distribution Fitting

Penalized regression: Introduction

Additional sources Compilation of sources:

6 Variables: PD MF MA K IAH SBS

Odds ratio, Odds ratio test for independence, chi-squared statistic.

Paper Evaluation of methods to determine optimal cutpoints for predicting mortgage default Abstract Introduction

Linear Classification. Volker Tresp Summer 2015

Handling missing data in Stata a whirlwind tour

Logistic (RLOGIST) Example #1

Strategies for Developing and Delivering an Effective Online SAS Programming Course Justina M. Flavin, Independent Consultant, San Diego, CA

Logistic Regression (a type of Generalized Linear Model)

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

A model to predict churn

Transcription:

VI. Introduction to Logistic Regression We turn our attention now to the topic of modeling a categorical outcome as a function of (possibly) several factors. The framework of generalized linear models unifies a family of regression models that includes (but is not limited to) logistic, Poisson, and linear regression.

Linear Regression: Modeling the Mean Recall that linear regression involves modeling the mean of some outcome variable as a function of one or more explanatory variables. That is, we have a sample Y 1,, Y n of independent measures, where the i th subject in our sample has p explanatory variables x i1, x i2,, x ip, and E(Y i ) = μ i. The linear regression model specifies that Y i = β 0 + β 1 x i1 + β 2 x i2 + + β p x ip + ε i, where ε i ~ N(0, σ 2 ), for i = 1,,n. Then E(Y i x i1, x i2,, x ip ) = β 0 + β 1 x i1 + + β p x ip, and Var(Y i ) = σ 2. This model assumes that the mean of the outcome variable changes linearly with respect to the explanatory variables.

The Three Components of a Generalized Linear Model Whereas with linear regression, we model the mean of the outcome variable directly, a generalized linear model is constructed to model the effects of the covariates on a function of the mean. There are hence three parts or components that comprise a generalized linear model: 1. The random component, which specifies the distribution of the outcome variable. 2. The systematic component, which represents a function of the covariates that will link to the outcome variable. 3. The link function, which determines how the mean of the outcome variable relates to the covariates.

Generalized Linear Models for Binary Data We have a sample Y 1,, Y n of independent binary outcome measurements. The i th subject in our sample has p explanatory variables x i1, x i2,, x ip. Suppose that P(Y i = 1) = π i and P(Y i = 0) = 1 π i. Hence, E(Y i ) = π i. The random component in this case is clearly binomial. For the purpose of this class, we will always assume that the systematic component is simply a linear combination of the covariates, or β 0 + β 1 x i1 + + β p x ip. The remaining question is: how do we model π i as a function of the covariates (i.e., what is the link function)?

Link Functions for the Binomial Distribution Suppose that we assume that E(Y i ) = π i = β 0 + β 1 x i1 + + β p x ip. We call this the identity link. Does this model have any practical shortcomings? Since the systematic component can take on any value, we often prefer using a link that t will constrain π i to the interval lbetween 0 and 1. The so-called logistic (also called the logit or log-odds) link g(π i ) = log[π i /(1 π i )] accomplishes this. There are other links (e.g., the probit), but we will focus mainly on the logit.

Example VI.A The Cache County Memory Study (CCMS) has followed approximately 5100 elderly l men and women continually since 1995. The project s focus has been to better understand genetic and environmental modifiers of dementia risk particularly Alzheimer s disease and cognitive health. The ε4 allele of the APOE gene is a reported risk factor for AD. The data in the following table summarize findings (thus far) from the CCMS relative to APOE and AD. AD No AD Total 1 ε4 Allele 308 1296 1604 No ε4 262 3096 3358 Total 570 4392 4962

Example VI.A Consider a generalized linear model in this case. Note that these data can be viewed as a sample of outcome measures Y 1,,YY 4962, where Y i = 1 if the i th individual was diagnosed with AD, and Y i = 0 otherwise. Moreover, we have a single covariate X i, which is 1 if the i th individual has at least one copy of the APOE ε4 allele, and 0 otherwise. The model looks something like this: logit(π i ) = log[π i /(1 π i )] = β 0 + β 1 X i

Example VI.A (cont d) First, how do we interpret the coefficients of this model? What is logit(π i X i = 1)? What is logit(π i X i = 0)? What is the log odds ratio with respect to AD risk, comparing those with at least one ε4 to those with no ε4? The regression coefficient of a binary covariate in a logistic regression model represents the log odds ratio comparing the group identified as 1 to the group identified as 0. More generally, the coefficient of any arbitrary covariate in a logistic regression model represents the log odds ratio for subjects who differ by one unit with respect to the covariate.

Fitting the Generalized Linear Model for the Alzheimer s Data in SAS We obtain parameter estimates for a generalized linear model using the method of maximum likelihood these estimates typically y cannot be computed in closed form. The following SAS program shows how to read the AD data into SAS, and then obtain a fit for the regression coefficients in the model of Example VI.A. options ls=79 nodate; data; input e4 ad count; cards; 1 1 308 1 0 1296 0 1 262 0 0 3096 ;; proc genmod descending; model ad=e4 / dis=bin link=logit type3; weight count; run; proc sort; by descending ad descending e4; run; f proc freq order=data; tables e4*ad / chisq relrisk; weight count; run;

The FREQ Procedure SAS Output Stat 5100 Linear Regression and Time Series Table of e4 by ad e4 ad Frequency Percent Row Pct Col Pct 1 0 Total ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 1 308 1296 1604 6.21 26.12 32.33 19.20 80.80 54.04 29.51 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ 0 262 3096 3358 5.28 62.39 67.67 7.80 92.20 45.96 70.49 ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ Total 570 4392 4962 11.49 88.51 100.00 Statistics for Table of e4 by ad Statistic DF Value Prob ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Chi-Square 1 138.7375 <.0001 Likelihood Ratio Chi-Square 1 129.98029802 <.0001 Continuity Adj. Chi-Square 1 137.6186 <.0001 Mantel-Haenszel Chi-Square 1 138.7095 <.0001 Phi Coefficient 0.1672 Contingency Coefficient 0.1649 Cramer's V 0.1672

SAS Output (cont d) Stat 5100 Linear Regression and Time Series Fisher's Exact Test ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Cell (1,1) Frequency (F) 308 Left-sided Pr <= F 1.0000 Right-sided id d Pr >= F 3.366E-30366E 30 Table Probability (P) 6.085E-30 Two-sided Pr <= P 4.889E-30 Estimates of the Relative Risk (Row1/Row2) Type of Study Value 95% Confidence Limits ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ Case-Control (Odds Ratio) 2.8083 2.3527 3.3522 Cohort (Col1 Risk) 2.4611 2.1106 2.8697 Cohort (Col2 Risk) 0.8764 0.8540 0.8993

The GENMOD Procedure Model Information SAS Output (cont d) Data Set WORK.DATA1 Distribution Binomial Link Function Logit Dependent Variable ad Scale Weight Variable count Number of Observations Read 4 Number of Observations Used 4 Sum of Weights 4962 Number of Events 2 Number of Trials 4 Response Profile Ordered Total Value ad Frequency 1 1 570 2 0 4392 PROC GENMOD is modeling the probability that ad='1'. Criteria For Assessing Goodness Of Fit Criterion DF Value Value/DF Deviance 2 3408.7579 1704.3790 Scaled Deviance 2 3408.7579 1704.3790 Pearson Chi-Square 2 4962.0000 2481.0000 Scaled Pearson X2 2 4962.0000 2481.0000 Log Likelihood -1704.3790 Algorithm converged. Stat 5100 Linear Regression and Time Series

SAS Output (cont d) Stat 5100 Linear Regression and Time Series Analysis Of Parameter Estimates Standard Wald 95% Chi- Parameter DF Estimate Error Confidence Limits Square Pr > ChiSq Intercept 1 2 4695 0 0643 2 5956 2 3434 1473 15 < 0001 Intercept 1-2.4695 0.0643-2.5956-2.3434 1473.15 <.0001 e4 1 1.0326 0.0903 0.8556 1.2096 130.69 <.0001 Scale 0 1.0000 0.0000 1.0000 1.0000

Example VI.A (cont d) The SAS output t indicates that t ˆ 2.4695 and ˆ 1.0326. 0 2 1 According to the fit of the regression model, what are the estimated log odds of AD for someone with no APOE ε4 allele? What are the estimated log odds given for someone with at least one high-risk allele? Wh t i th ti t d l dd ti f AD i k i 4 What is the estimated log odds ratio of AD risk comparing ε4 carriers to non-carriers? How does this estimate compare with the sample odds ratio computed using the data in the 2 x 2 table?