13. Poisson Regression Analysis



Similar documents
Overview Classes Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

11. Analysis of Case-control Studies Logistic Regression

Generalized Linear Models

VI. Introduction to Logistic Regression

Linda K. Muthén Bengt Muthén. Copyright 2008 Muthén & Muthén Table Of Contents

Poisson Models for Count Data

Logistic regression modeling the probability of success

Simple Linear Regression Inference

Multiple logistic regression analysis of cigarette use among high school students

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

Tips for surviving the analysis of survival data. Philip Twumasi-Ankrah, PhD

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Logistic Regression (a type of Generalized Linear Model)

Regression III: Advanced Methods

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

SAS Software to Fit the Generalized Linear Model

Examining a Fitted Logistic Model

Introduction to Statistics and Quantitative Research Methods

Part 2: Analysis of Relationship Between Two Variables

2. Simple Linear Regression

Correlation. What Is Correlation? Perfect Correlation. Perfect Correlation. Greg C Elvers

Example: Boats and Manatees

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

SP10 From GLM to GLIMMIX-Which Model to Choose? Patricia B. Cerrito, University of Louisville, Louisville, KY

Advanced Statistical Analysis of Mortality. Rhodes, Thomas E. and Freitas, Stephen A. MIB, Inc. 160 University Avenue. Westwood, MA 02090

LOGISTIC REGRESSION ANALYSIS

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

3.1 Solving Systems Using Tables and Graphs

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

Univariate Regression

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

Logit Models for Binary Data

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

APPLICATION OF LINEAR REGRESSION MODEL FOR POISSON DISTRIBUTION IN FORECASTING

7.1 The Hazard and Survival Functions

Multivariate Logistic Regression

Unit 12 Logistic Regression Supplementary Chapter 14 in IPS On CD (Chap 16, 5th ed.)

Chapter 7: Simple linear regression Learning Objectives

MGT 267 PROJECT. Forecasting the United States Retail Sales of the Pharmacies and Drug Stores. Done by: Shunwei Wang & Mohammad Zainal

Regression 3: Logistic Regression

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

Missing data and net survival analysis Bernard Rachet

SPSS Guide: Regression Analysis

Poisson Regression or Regression of Counts (& Rates)

MULTIPLE REGRESSION WITH CATEGORICAL DATA

Examples of Using R for Modeling Ordinal Data

A Basic Introduction to Missing Data

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Outline. Dispersion Bush lupine survival Quasi-Binomial family

POLYNOMIAL AND MULTIPLE REGRESSION. Polynomial regression used to fit nonlinear (e.g. curvilinear) data into a least squares linear regression model.

New Work Item for ISO Predictive Analytics (Initial Notes and Thoughts) Introduction

Statistics 305: Introduction to Biostatistical Methods for Health Sciences

Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares

TRINITY COLLEGE. Faculty of Engineering, Mathematics and Science. School of Computer Science & Statistics

Basic Statistical and Modeling Procedures Using SAS

Logistic Regression (1/24/13)

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

1 Basic ANOVA concepts

This chapter will demonstrate how to perform multiple linear regression with IBM SPSS

Simple Regression Theory II 2010 Samuel L. Baker

2 Sample t-test (unequal sample sizes and unequal variances)

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

Methods for Meta-analysis in Medical Research

LOGIT AND PROBIT ANALYSIS

Module 5: Multiple Regression Analysis

Introduction to General and Generalized Linear Models

The Probit Link Function in Generalized Linear Models for Data Mining Applications

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

Guide to Biostatistics

Tests for Two Survival Curves Using Cox s Proportional Hazards Model

Model Fitting in PROC GENMOD Jean G. Orelien, Analytical Sciences, Inc.

HLM software has been one of the leading statistical packages for hierarchical

III. INTRODUCTION TO LOGISTIC REGRESSION. a) Example: APACHE II Score and Mortality in Sepsis

Chapter 3 Quantitative Demand Analysis

Multiple Linear Regression

IAPRI Quantitative Analysis Capacity Building Series. Multiple regression analysis & interpreting results

Scatter Plot, Correlation, and Regression on the TI-83/84

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

5. Multiple regression

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

12.5: CHI-SQUARE GOODNESS OF FIT TESTS

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Nominal and ordinal logistic regression

Ordinal Regression. Chapter

Introduction to Quantitative Methods

Confidence Intervals for the Difference Between Two Means

Econometrics Simple Linear Regression

Week 5: Multiple Linear Regression

Correlation and Simple Linear Regression

Parametric Survival Models

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

GENERALIZED LINEAR MODELS IN VEHICLE INSURANCE

Module 3: Correlation and Covariance

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

Non-Inferiority Tests for Two Means using Differences

SUMAN DUVVURU STAT 567 PROJECT REPORT

Transcription:

136 Poisson Regression Analysis 13. Poisson Regression Analysis We have so far considered situations where the outcome variable is numeric and Normally distributed, or binary. In clinical work one often encounters situations where the outcome variable is numeric, but in the form of counts. Often it is a count of rare events such as the number of new cases of lung cancer occurring in a population over a certain period of time. The aim of regression analysis in such instances is to model the dependent variable Y as the estimate of outcome using some or all of the explanatory variables (in mathematical terminology estimating the outcome as a function of some explanatory variables. When the response variable had a Normal distribution we found that its mean could be linked to a set of explanatory variables using a linear function like Y = β + β 1 X 1 + β 2 X 2.+ β k X k. In the case of binary regression the fact that probability lies between -1 imposes a constraint. The normality assumption of multiple linear regression is lost, and so also is the assumption of constant variance. Without these assumptions the F and t tests have no basis. The solution was to use the logistic transformation of the probability p or logit p, such that log e (p/1 p) = β + β 1 Χ 1 + β 2 Χ 2.β n Χ n. The β coefficients could now be interpreted as increasing or decreasing the log odds of an event, and expβ (the odds multiplier) could be used as the odds ratio for a unit increase or decrease in the explanatory variable. In survival analysis we used the natural logarithm of the hazard ratio, that is log e h(t)/h (t) = β +β 1 X 1 +..+ β n X n When the response variable is in the form of a count we face a yet different constraint. Counts are all positive integers and for rare events the Poisson distribution (rather than the Normal) is more appropriate since the Poisson mean >. So the logarithm of the response variable is linked to a linear function of explanatory variables such that log e (Y) = β + β 1 Χ 1 + β 2 Χ 2 etc. and so Y = (e β ) (e β1χ1 ) (e β2χ2 ).. etc. In other words, the typical Poisson regression model expresses the log outcome rate as a linear function of a set of predictors. Assumptions in Poisson Regression The assumptions include: 1. Logarithm of the disease rate changes linearly with equal increment increases in the exposure variable. 2. Changes in the rate from combined effects of different exposures or risk factors are multiplicative. 3. At each level of the covariates the number of cases has variance equal to the mean. 4. Observations are independent. Methods to identify violations of assumption (3) i.e. to determine whether variances are too large or too small include plots of residuals versus the mean at different levels of the predictor variable. Recall that in the case of normal linear regression, diagnostics of the model used plots of residuals against fits (fitted values). This means that the same diagnostics can be used in the case of Poisson Regression.

Research methods II: Multivariate Analysis 137 The examples below illustrate the use of Poisson Regression. Example 1 Births by caesarean section are said to be more frequent in private (fee paying) hospitals as compared to non-fee paying public hospitals. Data about total annual births and the number of caesarean sections carried out were obtained from the records of 4 private hospitals and 16 public hospitals. These are tabulated below: Births Hospital Caesareans type 236 8 739 1 16 97 1 15 2371 1 23 39 1 5 679 1 13 26 4 1272 1 19 3246 1 33 194 1 19 357 1 1 18 1 16 127 1 22 28 2 257 1 22 138 2 52 1 18 151 1 21 275 1 24 192 1 9 = Private 1 = Public Hospital Hospitals The result of Poisson regression analysis is described below: [ In Genstat Stats Regression Analysis Generalized Linear General Model in the Analysis field Distribution Poisson ] We first regress the response variable Caesarean Sections against one explanatory variable viz. number of births ; then add one more explanatory variable number of obstetricians, and finally add a third variable in the form of indicator variable for public hospital (i.e. Public hospital =1; Private Hospital =) in the regression analysis.

138 Poisson Regression Analysis ***** Regression Analysis ***** Response variate: CAESAREA Distribution: Poisson Link function: Log Fitted terms: Constant, BIRTHS *** Summary of analysis *** mean deviance d.f. deviance deviance ratio Regression 1 63.575488916 63.575488916 63.58 Residual 18 36.414789139 2.2343841 Total 19 99.9927855 5.262646213 Change -1-63.575488916 63.575488916 63.58 * MESSAGE: The following units have large standardized residuals: 13 2.25 14-2.82 16-2.94 17 2.17 * MESSAGE: The following units have high leverage: 9.41 19.21 *** Estimates of regression coefficients *** estimate s.e. t(*) Constant 2.132.12 2.95 BIRTHS.445.54 8.17 * MESSAGE: s.e.s are based on dispersion parameter with value 1 The regression equation may now be written as: Log e (Y) = β + β 1 X 1 On substituting the values of Y and X, the equation can be written as: Log e (Caesarean) = 2.132 +.441 Births Which leads to caesarean = (e) 2.132 x (e).4 Births Recall that we have used poisson regression to model the data and thereby obtained estimates of caesarean sections based on just one explanatory variable viz. number of births. We next add a second term to the model, - type of hospital and obtain the following. ***** Regression Analysis ***** Response variate: CAESAREA Distribution: Poisson Link function: Log Fitted terms: Constant, BIRTHS, HOSP_TYP *** Summary of analysis ***

Research methods II: Multivariate Analysis 139 mean deviance d.f. deviance deviance ratio Regression 2 81.9517766 4.97553883 4.98 Residual 17 18.392449 1.61129438 Total 19 99.9927855 5.262646213 Change -2-81.9517766 4.97553883 4.98 * MESSAGE: ratios are based on dispersion parameter with value 1 * MESSAGE: The following units have large standardized residuals: 5-2.48 * MESSAGE: The following units have high leverage: 9.41 *** Estimates of regression coefficients *** estimate s.e. t(*) Constant 1.351.249 5.43 BIRTHS.3261.63 5.41 HOSP_TYP 1.45.272 3.84 MESSAGE: s.e.s are based on dispersion parameter with value 1 This is the full model for which the regression equation may be written as: Log (Y) = β + β 1 X 1 + β 2 X 2 On substituting the values we have Log e (Y) = 1.351 +.33 Births + 1.45 Hosptype for Public Hospitals (For private hospitals X 2 is and β 2 X 2 = ). This leads to caesarean sections = (e) 1.351 x (e).33 births x (e) 1.45 Public Hospitals Caesarean sections in Public Hospitals = 3.86 x 1 x 2.84 = 1.97 =approximately 11. Caesarean sections in Private Hospitals = 3.86 x 1 = 3.86 = Approximately 4. Conclusion According to evidence presented all things being equal it is the other way round. Caesarean sections are about twice as common in Public Hospitals than in Private ones. Of the two models presented which one gives best estimates? For the answer we look at the tables which give the values for deviance. Deviance serves the same purpose as sum of squares in multiple linear regression. An important use of deviance, and difference between deviances, is in the comparison of fits of two models when additional explanatory variable(s) get added to the initial simple model. In the case of the first analysis using just one explanatory variable, the deviance explained by the regression is 63.575. This changes to 81.95 when a second variable, Hospital Type is added to the model. The difference is 18.375 (81.95 63.575 = 18.375) and the difference in degree of freedom is 1. Looking this up in the table of χ 2 at 1 degree of freedom, we find that the result is significant (P <.1). We can opt for this model.

14 Poisson Regression Analysis The next step is to check for the fit of the model by carrying out diagnostic plots of deviance against fits Plot of Residuals against Fittefd Values 2. Residuals 1.5 1..5. -.5-1. -1.5-2. -2.5 1 2 3 Fitted Values Second Example to illustrate Poisson Regression Analysis A cohort of subjects, some non-smokers and others smokers, was observed for several years. The number of cases of cancer of the lung diagnosed among the different categories was recorded. Data regarding the number of years of smoking were also obtained from each individual. For each category the person-years of observation were calculated. The investigators wish to address the question of the relative risks of smoking. The following records were kept. CIGS No. Person Per years years Day smoking CASES 15 1366 1 25 5969 35 3512 45 1421 55 826 2 5 15 3121 5 25 2288 5 35 1648 1 5 45 927 5 55 66 11 15 3577 11 25 2546 1 11 35 1826 11 45 988 2 11 55 449 3 16 15 4317 16 25 3185 16 35 1893 16 45 849 2 16 55 28 5

Research methods II: Multivariate Analysis 141 2 15 5683 2 25 5483 1 2 35 3646 5 2 45 1567 9 2 55 416 7 27 15 342 27 25 429 4 27 35 3529 9 27 45 149 1 27 55 284 3 4 15 67 4 25 1482 4 35 1336 6 4 45 556 7 4 55 14 1 In the above data set the average number of cigarettes smoked per day represents the daily dose, and the years of smoking together with the average number of cigarettes smoked daily represents the total dose inhaled over time. Both appear to be related to the outcome as illustrated in the charts below. Cases of Cancer of the Lung by average number of cigarettes smoked 1 CASES 5 1 2 3 4 CIGS_DAY Cases of Cancer of the Lung by number of years of smoking 1 CASES 5 15 25 35 45 55 SMOKING_YR

142 Poisson Regression Analysis Since over a number of years of observation some cases of cancer of the lung can be expected to arise from causes not related to smoking, we use this as our base line (uninformative) model and perform the first analysis as below ***** Regression Analysis ***** Response variate: CASES Distribution: Poisson Link function: Log Fitted terms: Constant, PERSONYR *** Summary of analysis *** mean deviance d.f. deviance deviance ratio Regression 1 8.74385332 8.74385332 8.74 Residual 33 128.546991211 3.89536337 Total 34 137.29844242 4.379667 Change -1-8.74385332 8.74385332 8.74 * MESSAGE: ratios are based on dispersion parameter with value 1 *** Estimates of regression coefficients *** estimate s.e. t(*) Constant 1.28.169 7.16 PERSONYR -.1921.711-2.7 * MESSAGE: s.e.s are based on dispersion parameter with value 1 We next perform the second analysis using the full model as below: ***** Regression Analysis ***** Response variate: CASES Distribution: Poisson Link function: Log Fitted terms: Constant, PERSONYR, CIGS_DAY, SMOKING_ *** Summary of analysis *** mean deviance d.f. deviance deviance ratio Regression 3 63.168816931 21.5627231 21.6 Residual 31 74.12227311 2.39133139 Total 34 137.29844242 4.379667 Change -3-63.168816931 21.5627231 21.6 * MESSAGE: ratios are based on dispersion parameter with value 1 *** Estimates of regression coefficients *** estimate s.e. t(*) Constant -4.669.988-4.72 PERSONYR.41.14 3.94 CIGS_DAY.559.1 5.58 SMOKING_.888.166 5.34 MESSAGE: s.e.s are based on dispersion parameter with value 1

Research methods II: Multivariate Analysis 143 The difference in the deviance is 63.168 8.743853 = 54.425 at 3 1 = 2 degrees of freedom. Entering the table for values of χ 2 at 2 degrees of freedom, we get a highly significant P value for 54.425. This indicates a good fit for the model. We accept the model and obtain a regression equation which could be written as Log cases = α + β 1 Personyears + β 2 Cigarettes/day +β 3 Years of Smoking = 4.669 +.41Personyears +.559Cigarettes/day +.888Years of Smoking Cases = (e) 4.669 (e).41personyears (e).559cigarettes/day.888years of smoking (e) The equation is useful for estimating the relative risk of developing lung cancer by the number of cigarettes smoked (i.e. strength of the dose), or by number of years of smoking (total dose). For example, all things being equal the relative risk of smoking 25 cigarettes/day as compared to 15 can be estimated at (e).559x25 (e).559x15 = (e).559x1, and so on. Similar estimates could be used for calculating the relative risk by different years of smoking. A plot of residuals against fitted values is shown below: Plot of Residuals against Fittefd Values 3 2 1 Residuals -1-2 -3-4 -5-6 5 Fitted Values 1 Summary The typical Poisson regression model expresses the natural logarithm of the event or outcome of interest as a linear function of a set of predictors. The dependent variable is a count of the occurrences of interest e.g. the number of cases of a disease that occur over a period of follow-up. Typically, one can estimate a rate ratio associated with a given predictor or exposure. A measure of the goodness of fit of the Poisson regression model is obtained by using the deviance statistic of a base-line model against a fuller model.