Logistic and Poisson Regression: Modeling Binary and Count Data. Statistics Workshop Mark Seiss, Dept. of Statistics


Logistic and Poisson Regression: Modeling Binary and Count Data Statistics Workshop Mark Seiss, Dept. of Statistics March 3, 2009

Presentation Outline
1. Introduction to Generalized Linear Models
2. Binary Response Data: Logistic Regression Model
3. Count Response Data: Poisson Regression Model
4. Variable Significance: Likelihood Ratio Test

Reference Material
- Short Course Presentation and Data from Examples: www.lisa.stat.vt.edu/short_courses.php
- Categorical Data Analysis, Alan Agresti. Examples found with SAS Code at www.stat.ufl.edu/~aa/cda/cda.html
- UCLA Statistical Consulting Website: www.ats.ucla.edu/stat/ Detailed examples of statistical analysis of data using SAS, SPSS, Stata, R, etc.

Generalized Linear Models
- Generalized linear models (GLMs) extend ordinary regression to non-normal response distributions.
- Model, for i = 1 to n: $g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$, where $\mu_i = E(Y_i)$ and $g$ is the link function.
- Why do we use GLMs? Linear regression assumes that the response is normally distributed. GLMs allow for analysis when it is not reasonable to assume the data are normally distributed.

Generalized Linear Models: Predictor Variables
- Two types: continuous and categorical
- Continuous predictor variables
  - Examples: time, grade point average, test score, etc.
  - Coded with one parameter
- Categorical predictor variables
  - Examples: sex, political affiliation, marital status, etc.
  - The actual value assigned to a category is not important. Ex) Sex: Male/Female, M/F, 1/2, 0/1, etc.
  - Coded differently than continuous variables

Generalized Linear Models: Categorical Predictor Variables cont.
- Consider a categorical predictor variable with L categories.
- One category is selected as the reference category. The assignment of the reference category is arbitrary.
- The variable is represented by L-1 dummy variables, which keeps the model identifiable.
- Two types of coding: dummy and effect

Generalized Linear Models: Summary
- Generalized Linear Models
- Continuous and Categorical Predictor Variables
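The two coding schemes can be illustrated with a minimal Python sketch. The category names and the choice of the last category as reference are assumptions made only for this illustration; the slides do not give an example.

```python
def dummy_code(level, levels):
    """Dummy coding: L-1 indicator columns; the reference level is all zeros."""
    reference = levels[-1]  # assumption: last level is the reference category
    return [1 if level == other else 0 for other in levels if other != reference]


def effect_code(level, levels):
    """Effect coding: same columns, but the reference level is -1 in every column."""
    reference = levels[-1]
    if level == reference:
        return [-1] * (len(levels) - 1)
    return [1 if level == other else 0 for other in levels if other != reference]


levels = ["single", "married", "divorced"]  # hypothetical categories, L = 3
for level in levels:
    print(level, dummy_code(level, levels), effect_code(level, levels))
```

Either scheme uses L-1 columns, so the intercept plus the coded columns stay linearly independent.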

Generalized Linear Models Questions/Comments

Logistic Regression
- Consider a binary response variable: a variable with two outcomes.
- One outcome is represented by a 1 and the other by a 0.
- Examples:
  - Does the person have a disease? Yes or No
  - Who is the person voting for? McCain or Obama
  - Outcome of a baseball game? Win or loss

Logistic Regression: Example Data Set
- Response variable: Admission to Grad School (Admit), 0 if admitted, 1 if not admitted
- Predictor variables:
  - GRE Score (gre): continuous
  - University Prestige (topnotch): 1 if prestigious, 0 otherwise
  - Grade Point Average (gpa): continuous

Logistic Regression: First 10 Observations of the Data Set

ADMIT   GRE   TOPNOTCH   GPA
1       380   0          3.61
0       660   1          3.67
0       800   1          4
0       640   0          3.19
1       520   0          2.93
0       760   0          3
0       560   0          2.98
1       400   0          3.08
0       540   0          3.39
1       700   1          3.92

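Logistic Regression
- Consider the logistic regression model $\operatorname{logit}(\pi(x_i)) = \log\frac{\pi(x_i)}{1 - \pi(x_i)} = \beta_0 + \beta_1 x_i$.
- GLM with binomial random component and logit link, g(µ) = logit(µ).
- The range of values for π(x_i) is 0 to 1.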
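A short Python sketch of the logit link and its inverse shows why the fitted probabilities stay between 0 and 1 whatever the value of the linear predictor. The function names are chosen for this illustration only.

```python
import math


def logit(p):
    """Logit link: log odds of a probability p with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Inverse link: maps any real linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


for eta in (-5.0, 0.0, 5.0):
    print(eta, inverse_logit(eta))
```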

Logistic Regression: Interpretation of Coefficient β (Odds Ratio)
- The odds ratio compares the odds of one event with the odds of another event.
- Say the probability of Event 1 is π_1 and the probability of Event 2 is π_2. Then the odds ratio of Event 1 to Event 2 is $\frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)}$.
- The odds ratio ranges from 0 to infinity.
- A value between 0 and 1 indicates the odds of Event 2 are greater.
- A value between 1 and infinity indicates the odds of Event 1 are greater.
- A value equal to 1 indicates the events are equally likely.

Logistic Regression: Interpretation of Coefficient β (Odds Ratio) cont.
- From our logistic regression model with a single continuous variable, the ratio of the odds of Y=0 for X+1 and X is $\frac{e^{\beta_0 + \beta_1 (x+1)}}{e^{\beta_0 + \beta_1 x}} = e^{\beta_1}$.
- From our logistic regression model with a single two-category variable with effect coding (+1 for one category, -1 for the other), the ratio of the odds of Y=0 from one category to another is $\frac{e^{\beta_0 + \beta_1}}{e^{\beta_0 - \beta_1}} = e^{2\beta_1}$.
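As a small worked example of the definition, the following Python sketch computes an odds ratio from two probabilities. The probabilities 0.75 and 0.5 are made up for illustration.

```python
def odds(p):
    """Odds of an event with probability p, where 0 <= p < 1."""
    return p / (1.0 - p)


def odds_ratio(p1, p2):
    """Odds of Event 1 divided by the odds of Event 2."""
    return odds(p1) / odds(p2)


# Hypothetical probabilities: odds of 3 to 1 versus odds of 1 to 1.
print(odds_ratio(0.75, 0.5))
```

Here the odds of Event 1 are three times the odds of Event 2, even though its probability is only 1.5 times as large.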

Logistic Regression: Single Continuous Predictor Variable (GPA)

Generalized Linear Model Fit
Response: Admit
Modeling P(Admit=0)
Distribution: Binomial
Link: Logit
Observations (or Sum Wgts) = 400

Whole Model Test
Model        -LogLikelihood   L-R ChiSquare   DF   Prob>ChiSq
Difference   6.50444839       13.0089         1    0.0003
Full         243.48381
Reduced      249.988259

Goodness Of Fit
Statistic   ChiSquare   DF    Prob>ChiSq
Pearson     401.1706    398   0.4460
Deviance    486.9676    398   0.0015
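The L-R ChiSquare in the Whole Model Test can be reproduced from the two -LogLikelihood values. The sketch below is a hand check in Python, not the software's own routine; it relies on the fact that for 1 degree of freedom the chi-square upper-tail probability equals erfc(sqrt(x/2)).

```python
import math

# -LogLikelihood values copied from the Whole Model Test above.
neg_loglik_reduced = 249.988259  # intercept-only model
neg_loglik_full = 243.48381      # model with GPA

# Likelihood ratio statistic: twice the drop in -LogLikelihood.
lr_chisq = 2.0 * (neg_loglik_reduced - neg_loglik_full)

# Upper-tail chi-square probability for 1 degree of freedom.
p_value = math.erfc(math.sqrt(lr_chisq / 2.0))

print(round(lr_chisq, 4), round(p_value, 4))
```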

Logistic Regression: Single Continuous Predictor Variable (GPA) cont.

Effect Tests
Source   DF   L-R ChiSquare   Prob>ChiSq
GPA      1    13.008897       0.0003

Parameter Estimates
Term        Estimate    Std Error   L-R ChiSquare   Prob>ChiSq   Lower CL    Upper CL
Intercept   -4.357587   1.0353175   19.117873       <.0001       -6.433355   -2.367383
GPA         1.0511087   0.2988695   13.008897       0.0003       0.4742176   1.6479411

Interpretation of the Parameter Estimate:
Exp{1.0511087} = 2.86 = odds ratio between the odds at x+1 and the odds at x, for all x.
The ratio of the odds of being admitted between a person with a 3.0 GPA and a person with a 2.0 GPA is equal to 2.86. Equivalently, the odds for the person with the 3.0 are 2.86 times the odds for the person with the 2.0.
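The odds ratio and an interval for it follow from exponentiating the estimate and its confidence limits. The Python sketch below uses the values from the output above; the half-point comparison is an extra illustration that is not in the slides.

```python
import math

# Estimate and confidence limits for GPA, copied from the output above.
beta_gpa = 1.0511087
lower_cl, upper_cl = 0.4742176, 1.6479411

# Odds ratio for a one-unit increase in GPA.
odds_ratio = math.exp(beta_gpa)

# Exponentiating the limits on beta gives limits on the odds ratio.
odds_ratio_ci = (math.exp(lower_cl), math.exp(upper_cl))

# For a change of c units the odds ratio is exp(c * beta); here c = 0.5.
odds_ratio_half_point = math.exp(0.5 * beta_gpa)

print(round(odds_ratio, 2), odds_ratio_ci, odds_ratio_half_point)
```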

Logistic Regression: Single Categorical Predictor Variable (Top Notch)

Generalized Linear Model Fit
Response: Admit
Modeling P(Admit=0)
Distribution: Binomial
Link: Logit
Observations (or Sum Wgts) = 400

Whole Model Test
Model        -LogLikelihood   L-R ChiSquare   DF   Prob>ChiSq
Difference   3.53984692       7.0797          1    0.0078
Full         246.448412
Reduced      249.988259

Goodness Of Fit
Statistic   ChiSquare   DF    Prob>ChiSq
Pearson     400.0000    398   0.4624
Deviance    492.8968    398   0.0008

Logistic Regression: Single Categorical Predictor Variable (Top Notch) cont.

Effect Tests
Source     DF   L-R ChiSquare   Prob>ChiSq
TOPNOTCH   1    7.0796939       0.0078

Parameter Estimates
Term          Estimate    Std Error   L-R ChiSquare   Prob>ChiSq   Lower CL    Upper CL
Intercept     -0.525855   0.138217    14.446085       0.0001       -0.799265   -0.255667
TOPNOTCH[0]   -0.371705   0.138217    7.0796938       0.0078       -0.642635   -0.099011

Interpretation of the Parameter Estimate:
Exp{2*-.371705} = 0.4755 = odds ratio between the odds of admittance for a student from a less prestigious university and the odds of admittance for a student from a more prestigious university.
The odds of being admitted from a less prestigious university are 0.48 times the odds of being admitted from a more prestigious university.

Logistic Regression: Summary
- Introduction to the Logistic Regression Model
- Interpretation of the Parameter Estimates β
- Odds Ratio
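The factor of 2 in the exponent comes from effect coding. The Python sketch below assumes the TOPNOTCH[0] term enters as +1 for TOPNOTCH = 0 and -1 for TOPNOTCH = 1, which is consistent with the Exp{2*...} calculation above; the variable names are chosen for this illustration.

```python
import math

# Estimates copied from the output above.
intercept = -0.525855
beta_topnotch0 = -0.371705

# Assumed effect coding: +1 for TOPNOTCH = 0, -1 for TOPNOTCH = 1.
eta_less = intercept + beta_topnotch0  # less prestigious university
eta_more = intercept - beta_topnotch0  # more prestigious university

odds_less = math.exp(eta_less)
odds_more = math.exp(eta_more)
odds_ratio = odds_less / odds_more     # equals exp(2 * beta_topnotch0)

# Fitted probabilities of the modeled outcome in each group.
prob_less = odds_less / (1.0 + odds_less)
prob_more = odds_more / (1.0 + odds_more)

print(round(odds_ratio, 4), prob_less, prob_more)
```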

Logistic Regression Questions/Comments

Poisson Regression
- Consider a count response variable.
- The response variable is the number of occurrences in a given time frame.
- Outcomes are equal to 0, 1, 2, ...
- Examples:
  - Number of penalties during a football game.
  - Number of customers who shop at a store on a given day.
  - Number of car accidents at an intersection.

Poisson Regression: Example Data Set
- Response variable: Number of Days Absent (integer)
- Predictor variables:
  - Gender: 1 if Female, 2 if Male
  - Ethnicity: 6 ethnic categories
  - School: 1 if School 1, 2 if School 2
  - Math Test Score: continuous
  - Language Test Score: continuous
  - Bilingual Status: 4 bilingual categories

Poisson Regression: First 10 Observations from the Example Data Set

Obs   GENDER   Ethnicity   School   Math Score   Lang. Score   Bilingual.status   Days Absent
1     2        4           1        56.988830    42.45086      2                  4
2     2        4           1        37.094160    46.82059      2                  4
3     1        4           1        32.275460    43.56657      2                  2
4     1        4           1        29.056720    43.56657      2                  3
5     1        4           1        6.748048     27.24847      3                  3
6     1        4           1        61.654280    48.41482      0                  13
7     1        4           1        56.988830    40.73543      2                  11
8     2        4           1        10.390490    15.35938      2                  7
9     2        4           1        50.527950    52.11514      2                  10
10    2        6           1        49.472050    42.45086      0                  9

Poisson Regression
- Consider the Poisson log-linear model $\log(\mu_i) = \beta_0 + \beta_1 x_i$.
- GLM with Poisson random component and log link, g(µ) = log(µ).
- Predicted response values fall between 0 and +∞.

Poisson Regression: Interpretation of Coefficient β
- From our Poisson regression model with a single continuous variable, the relationship between the predicted response at value x and value x+1 is $\mu(x+1) = \mu(x)\, e^{\beta_1}$.
- From our Poisson regression model with a single two-category variable with effect coding (+1 for one category, -1 for the other), the relationship between the predicted response from one category to another is $\mu_{(+1)} = \mu_{(-1)}\, e^{2\beta_1}$.

Poisson Regression: Single Continuous Predictor Variable (Math Score)

Generalized Linear Model Fit
Response: number days absent
Distribution: Poisson
Link: Log
Observations (or Sum Wgts) = 316

Whole Model Test
Model        -LogLikelihood   L-R ChiSquare   DF   Prob>ChiSq
Difference   39.619507        79.2390         1    <.0001
Full         1595.98854
Reduced      1635.60805

Goodness Of Fit
Statistic   ChiSquare   DF    Prob>ChiSq
Pearson     3080.403    314   0.0000
Deviance    2330.581    314   <.0001

Poisson Regression: Single Continuous Predictor Variable (Math Score) cont.

Effect Tests
Source          DF   L-R ChiSquare   Prob>ChiSq
ctbs math nce   1    79.239014       <.0001

Parameter Estimates
Term            Estimate    Std Error   L-R ChiSquare   Prob>ChiSq   Lower CL    Upper CL
Intercept       2.3020999   0.0627765   1044.4013       <.0001       2.1780081   2.424086
ctbs math nce   -0.011568   0.0012941   79.239014       <.0001       -0.014101   -0.009029

Interpretation of the parameter estimate:
Exp{-0.011568} = 0.9885 = multiplicative effect on the expected number of days absent for an increase of 1 in the Math Score.
Fabricated example: if a student with a math score of 50 is expected to miss 5 days, then another student with a math score of 51 is expected to miss 5*0.9885 = 4.94 days.
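The multiplicative interpretation can be checked directly from the fitted model. This Python sketch uses the estimates above; the function name and the ten-point comparison are additions for illustration only.

```python
import math

# Estimates copied from the output above.
intercept = 2.3020999
beta_math = -0.011568


def expected_days_absent(math_score):
    """Fitted Poisson mean under the log link: exp(intercept + beta * score)."""
    return math.exp(intercept + beta_math * math_score)


# Multiplicative effect of a 1-point increase in math score.
rate_ratio = math.exp(beta_math)

# The ratio of fitted means one point apart is the same at every score.
ratio_check = expected_days_absent(51) / expected_days_absent(50)

# A 10-point increase multiplies the mean by exp(10 * beta).
ten_point_ratio = math.exp(10 * beta_math)

print(rate_ratio, ratio_check, ten_point_ratio)
```

Because the link is the log, effects multiply rather than add, so the same ratio applies whether the comparison is 50 versus 51 or 20 versus 21.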

Poisson Regression: Single Categorical Predictor Variable (Gender)

Generalized Linear Model Fit
Response: number days absent
Distribution: Poisson
Link: Log
Observations (or Sum Wgts) = 316

Whole Model Test
Model        -LogLikelihood   L-R ChiSquare   DF   Prob>ChiSq
Difference   22.6810514       45.3621         1    <.0001
Full         1612.927
Reduced      1635.60805

Goodness Of Fit
Statistic   ChiSquare   DF    Prob>ChiSq
Pearson     2877.292    314   0.0000
Deviance    2364.458    314   <.0001

Poisson Regression: Single Categorical Predictor Variable (Gender) cont.

Effect Tests
Source   DF   L-R ChiSquare   Prob>ChiSq
GENDER   1    45.362103       <.0001

Parameter Estimates
Term        Estimate    Std Error   L-R ChiSquare   Prob>ChiSq   Lower CL    Upper CL
Intercept   1.743096    0.023734    3155.5494       0.0000       1.6962023   1.7892445
GENDER[1]   0.1586429   0.023734    45.362103       <.0001       0.1122479   0.2053005

Interpretation of the parameter estimate:
Exp{2*0.1586} = 1.3733 = multiplicative effect on the expected number of days absent of being female rather than male.
If a male student is expected to miss X days, then a female student is expected to miss 1.3733*X days.

Poisson Regression: Summary
- Introduction to the Poisson Regression Model
- Interpretation of β
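The fitted means for each gender can be recovered from the two estimates. As a sketch in Python, assuming effect coding with GENDER[1] entering as +1 for females (Gender = 1) and -1 for males (Gender = 2), consistent with the Exp{2*...} calculation above:

```python
import math

# Estimates copied from the output above.
intercept = 1.743096
beta_gender1 = 0.1586429

# Assumed effect coding: +1 for female (Gender = 1), -1 for male (Gender = 2).
mean_female = math.exp(intercept + beta_gender1)
mean_male = math.exp(intercept - beta_gender1)

# Ratio of expected days absent, female to male: exp(2 * beta).
rate_ratio = mean_female / mean_male

print(mean_female, mean_male, rate_ratio)
```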

Likelihood Ratio Test: Deviance
- Let L(µ; y) = maximum of the log likelihood for the model, and L(y; y) = maximum of the log likelihood for the saturated model.
- Deviance = D(y; µ) = -2 [L(µ; y) - L(y; y)]
- Tests the null hypothesis that the model holds against the alternative of the saturated model, which fits the observed values exactly.
- The deviance has an asymptotic chi-squared distribution with N - p degrees of freedom, where p is the number of parameters in the model.
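For the Poisson model the definition above reduces to 2 * sum[y*log(y/µ) - (y - µ)]. The Python sketch below evaluates that formula for the ten Days Absent values listed in the example data, using an intercept-only fit (whose fitted mean is the sample mean) as a stand-in model. It illustrates the formula only and does not reproduce the workshop output.

```python
import math


def poisson_deviance(y, mu):
    """Poisson deviance: 2 * sum[y*log(y/mu) - (y - mu)], with 0*log(0) taken as 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total


# Days Absent for the first 10 observations of the example data set.
days_absent = [4, 4, 2, 3, 3, 13, 11, 7, 10, 9]

# Intercept-only Poisson fit: every fitted value is the sample mean.
sample_mean = sum(days_absent) / len(days_absent)
fitted_null = [sample_mean] * len(days_absent)

# Saturated model: fitted values equal the observations, so the deviance is 0.
fitted_saturated = [float(v) for v in days_absent]

print(poisson_deviance(days_absent, fitted_null))
print(poisson_deviance(days_absent, fitted_saturated))
```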

Likelihood Ratio Test: Nested Models
- Model 1: model with p predictor variables {X_1, X_2, X_3, ..., X_p} and vector of fitted values µ_1
- Model 2: model with q < p predictor variables {X_1, X_2, X_3, ..., X_q} and vector of fitted values µ_2
- Model 2 is nested within Model 1 if all predictor variables found in Model 2 are included in Model 1, i.e. the set of predictor variables in Model 2 is a subset of the set of predictor variables in Model 1.
- Model 2 is a special case of Model 1: all the coefficients associated with X_{q+1}, X_{q+2}, X_{q+3}, ..., X_p are equal to zero.

Likelihood Ratio Test
- Null hypothesis: there is not a significant difference between the fit of the two models.
- Null hypothesis for nested models: the predictor variables in Model 1 that are not found in Model 2 are not significant to the model fit.
- Alternate hypothesis for nested models: the predictor variables in Model 1 that are not found in Model 2 are significant to the model fit.
- Likelihood Ratio Statistic = -2 [L(µ_2; y) - L(µ_1; y)] = D(y; µ_2) - D(y; µ_1), the difference of the deviances of the two models.
- D(y; µ_2) ≥ D(y; µ_1) always holds, which implies LRT ≥ 0.
- Under the null hypothesis, the LRT is asymptotically distributed chi-squared with p - q degrees of freedom.

Likelihood Ratio Test: Theoretical Example
- 3 predictor variables: 1 continuous (X_1), 1 categorical with 4 categories (X_2, X_3, X_4), 1 categorical with 2 categories (X_5)
- Model 1: predictor variables {X_1, X_2, X_3, X_4, X_5}
- Model 2: predictor variables {X_1, X_5}
- Null hypothesis: the variable with 4 categories is not significant to the model (β_2 = β_3 = β_4 = 0)
- Alternate hypothesis: the variable with 4 categories is significant
- Likelihood Ratio Statistic = D(y; µ_2) - D(y; µ_1), the difference of the deviance statistics from the two models
- Chi-squared distribution with 5 - 2 = 3 degrees of freedom
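The theoretical example can be turned into a small Python calculation. The two deviance values below are invented purely for illustration, and `chi2_sf` is a helper written for this sketch: it builds the chi-square upper-tail probability for integer degrees of freedom from the standard closed forms for 1 and 2 degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)               # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))    # closed form for df = 1
    # Step up two degrees of freedom at a time.
    while 2 * a < df:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Hypothetical deviances, not taken from the workshop output.
deviance_model2 = 120.0  # reduced model {X1, X5}
deviance_model1 = 110.5  # full model {X1, X2, X3, X4, X5}

lr_statistic = deviance_model2 - deviance_model1
degrees_of_freedom = 5 - 2
p_value = chi2_sf(lr_statistic, degrees_of_freedom)

print(lr_statistic, degrees_of_freedom, p_value)
```

A small p-value would lead to rejecting the null hypothesis that β_2 = β_3 = β_4 = 0.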

Likelihood Ratio Test
Consider the model with GPA, GRE, and Top Notch as predictor variables.

Generalized Linear Model Fit
Response: Admit
Modeling P(Admit=0)
Distribution: Binomial
Link: Logit
Observations (or Sum Wgts) = 400

Whole Model Test
Model        -LogLikelihood   L-R ChiSquare   DF   Prob>ChiSq
Difference   10.9234504       21.8469         3    <.0001
Full         239.064808
Reduced      249.988259

Goodness Of Fit
Statistic   ChiSquare   DF    Prob>ChiSq
Pearson     396.9196    396   0.4775
Deviance    478.1296    396   0.0029
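The deviances reported in the workshop output allow one further nested comparison that the slides do not show: the GPA-only model (deviance 486.9676) against this three-predictor model (deviance 478.1296). The Python sketch below tests whether GRE and Top Notch jointly add to the fit; for 2 degrees of freedom the chi-square upper-tail probability is exactly exp(-x/2).

```python
import math

# Deviances copied from the Goodness Of Fit tables in the workshop output.
deviance_gpa_only = 486.9676  # reduced model: GPA
deviance_full = 478.1296      # full model: GPA, GRE, Top Notch

# Likelihood ratio statistic: difference of the two deviances.
lr_statistic = deviance_gpa_only - deviance_full

# Two parameters (GRE and Top Notch) are dropped in the reduced model.
degrees_of_freedom = 2

# Closed-form chi-square upper tail for 2 degrees of freedom.
p_value = math.exp(-lr_statistic / 2.0)

print(round(lr_statistic, 4), degrees_of_freedom, round(p_value, 4))
```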

Likelihood Ratio Test: Variable Selection cont.

Effect Tests
Source     DF   L-R ChiSquare   Prob>ChiSq
TOPNOTCH   1    2.2143635       0.1367
GPA        1    4.2909753       0.0383
GRE        1    5.4555484       0.0195

Parameter Estimates
Term          Estimate    Std Error   L-R ChiSquare   Prob>ChiSq   Lower CL    Upper CL
Intercept     -4.382202   1.1352224   15.917859       <.0001       -6.657167   -2.197805
TOPNOTCH[0]   -0.218612   0.1459266   2.2143635       0.1367       -0.503583   0.070142
GPA           0.6675556   0.3252593   4.2909753       0.0383       0.0356956   1.3133755
GRE           0.0024768   0.0010702   5.4555484       0.0195       0.0003962   0.0046006
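To see how the fitted model is used, the Python sketch below computes the fitted probability of the modeled outcome, P(Admit=0), for the first student in the data set (GRE 380, Top Notch 0, GPA 3.61). It assumes the TOPNOTCH[0] term is effect coded (+1 for Top Notch = 0, -1 for Top Notch = 1), as in the single-predictor example; the function name is hypothetical.

```python
import math

# Parameter estimates copied from the output above.
intercept = -4.382202
beta_topnotch0 = -0.218612
beta_gpa = 0.6675556
beta_gre = 0.0024768


def prob_admit_eq_0(gre, topnotch, gpa):
    """Fitted P(Admit=0) under the logit link, assuming effect coding for Top Notch."""
    effect = 1.0 if topnotch == 0 else -1.0  # assumed coding: +1 / -1
    eta = intercept + beta_topnotch0 * effect + beta_gpa * gpa + beta_gre * gre
    return 1.0 / (1.0 + math.exp(-eta))


# First observation of the example data set.
p_first = prob_admit_eq_0(gre=380, topnotch=0, gpa=3.61)

# Same scores, but from a prestigious university.
p_if_topnotch = prob_admit_eq_0(gre=380, topnotch=1, gpa=3.61)

print(p_first, p_if_topnotch)
```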

Likelihood Ratio Test Questions/Comments