Logistic and Poisson Regression: Modeling Binary and Count Data
Statistics Workshop
Mark Seiss, Dept. of Statistics
March 3, 2009

Presentation Outline
  1. Introduction to Generalized Linear Models
  2. Binary Response Data - Logistic Regression Model
  3. Count Response Data - Poisson Regression Model
  4. Variable Significance - Likelihood Ratio Test

Reference Material
  Short Course Presentation and Data from Examples
    www.lisa.stat.vt.edu/short_courses.php
  Categorical Data Analysis, Alan Agresti
    Examples with SAS code at www.stat.ufl.edu/~aa/cda/cda.html
  UCLA Statistical Consulting Website
    www.ats.ucla.edu/stat/
    Detailed examples of statistical analysis of data using SAS, SPSS, Stata, R, etc.

Generalized Linear Models
  Generalized linear models (GLMs) extend ordinary regression to non-normal response distributions.
  Model, for i = 1 to n:
    $$ g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} $$
    where µ_i is the mean of the i-th response and g is the link function
  Why do we use GLMs?
    Linear regression assumes that the response is normally distributed.
    GLMs allow for analysis when it is not reasonable to assume the data are normally distributed.

Generalized Linear Models
Predictor Variables
  Two types: continuous and categorical
  Continuous predictor variables
    Examples: time, grade point average, test score, etc.
    Coded with one parameter
  Categorical predictor variables
    Examples: sex, political affiliation, marital status, etc.
    The actual value assigned to a category is not important
      Ex) Sex: Male/Female, M/F, 1/2, 0/1, etc.
    Coded differently than continuous variables

Generalized Linear Models
Categorical Predictor Variables cont.
  Consider a categorical predictor variable with L categories
    One category is selected as the reference category
      Assignment of the reference category is arbitrary
    The variable is represented by L-1 dummy variables
      Using L-1 rather than L variables keeps the model identifiable
  Two types of coding: dummy and effect
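
As a minimal sketch of the two coding schemes, the Python below builds the L-1 columns for a three-category variable. The category labels, the choice of reference category, and the function names `dummy_code` and `effect_code` are illustrative assumptions, not part of the workshop material. Under dummy coding the reference category maps to all zeros; under effect coding it maps to all -1.

```python
def dummy_code(value, levels, reference):
    """Return L-1 indicator columns; the reference level maps to all zeros."""
    others = [lv for lv in levels if lv != reference]
    return [1 if value == lv else 0 for lv in others]


def effect_code(value, levels, reference):
    """Same columns as dummy coding, except the reference level maps to all -1."""
    others = [lv for lv in levels if lv != reference]
    if value == reference:
        return [-1] * len(others)
    return [1 if value == lv else 0 for lv in others]


# Hypothetical marital-status categories; "divorced" is an arbitrary reference.
levels = ["single", "married", "divorced"]
for lv in levels:
    print(lv, dummy_code(lv, levels, "divorced"), effect_code(lv, levels, "divorced"))
```

With two categories, effect coding reduces to a single column taking the values +1 and -1, which is why a factor of 2 appears when comparing the two categories.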

Generalized Linear Models Summary Generalized Linear Models Continuous and Categorical Predictor Variables

Generalized Linear Models Questions/Comments

Logistic Regression
  Consider a binary response variable.
    Variable with two outcomes
    One outcome represented by a 1 and the other represented by a 0
  Examples:
    Does the person have a disease?    Yes or No
    Who is the person voting for?      McCain or Obama
    Outcome of a baseball game?        Win or Loss

Logistic Regression
Logistic Regression Example Data Set
  Response variable: Admission to Grad School (Admit)
    0 if admitted, 1 if not admitted
  Predictor variables:
    GRE Score (gre): continuous
    University Prestige (topnotch): 1 if prestigious, 0 otherwise
    Grade Point Average (gpa): continuous

Logistic Regression
First 10 Observations of the Data Set
  ADMIT   GRE   TOPNOTCH   GPA
  1       380   0          3.61
  0       660   1          3.67
  0       800   1          4
  0       640   0          3.19
  1       520   0          2.93
  0       760   0          3
  0       560   0          2.98
  1       400   0          3.08
  0       540   0          3.39
  1       700   1          3.92

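Logistic Regression
  Consider the logistic regression model
    $$ \operatorname{logit}(\pi(x_i)) = \log\left(\frac{\pi(x_i)}{1 - \pi(x_i)}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} $$
    where π(x_i) is the probability of the modeled outcome for observation i
  GLM with binomial random component and logit link, g(µ) = logit(µ)
  Range of values for π(x_i) is 0 to 1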
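
The logit link and its inverse can be sketched in a few lines of Python. The function names `logit` and `inv_logit` are chosen here for illustration. The inverse maps any value of the linear predictor back into the interval from 0 to 1, which is the range stated on the slide.

```python
import math


def logit(p):
    """Log odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# The linear predictor can be any real number; the probability stays in (0, 1).
for eta in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(eta, round(inv_logit(eta), 4))
```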

Logistic Regression
Interpretation of Coefficient β - Odds Ratio
  The odds ratio compares the odds of one event with the odds of another event.
  Say the probability of Event 1 is π1 and the probability of Event 2 is π2. Then the odds ratio of Event 1 to Event 2 is:
    $$ \text{Odds Ratio} = \frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)} $$
  The odds ratio ranges from 0 to infinity
    A value between 0 and 1 indicates the odds of Event 2 are greater
    A value between 1 and infinity indicates the odds of Event 1 are greater
    A value equal to 1 indicates the events are equally likely
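
A small numeric sketch of the odds ratio may help. The probabilities below are hypothetical values chosen for illustration, and the helper names `odds` and `odds_ratio` are not from the workshop.

```python
def odds(p):
    """Odds of an event with probability p, for 0 <= p < 1."""
    return p / (1.0 - p)


def odds_ratio(p1, p2):
    """Odds of Event 1 divided by the odds of Event 2."""
    return odds(p1) / odds(p2)


# Hypothetical probabilities: Event 1 has even odds, Event 2 has odds of 1 to 3.
pi_1, pi_2 = 0.5, 0.25
print(odds(pi_1), odds(pi_2), odds_ratio(pi_1, pi_2))

# Equal probabilities give an odds ratio of 1.
print(odds_ratio(0.4, 0.4))
```

Here the odds ratio is about 3, so the odds of Event 1 are greater, in line with the interpretation of values between 1 and infinity.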

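Logistic Regression
Interpretation of Coefficient β - Odds Ratio cont.
  From our logistic regression model with a single continuous variable, the ratio of the odds of Y=0 at X+1 to the odds of Y=0 at X is
    $$ \frac{\text{odds}(Y=0 \mid X+1)}{\text{odds}(Y=0 \mid X)} = e^{\beta} $$
  From our logistic regression model with a single two-category variable with effect coding, the ratio of the odds of Y=0 from one category to another is
    $$ \frac{\text{odds}(Y=0 \mid \text{category coded } +1)}{\text{odds}(Y=0 \mid \text{category coded } -1)} = e^{2\beta} $$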
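
The two odds ratios follow directly from the logit link. The short derivation below is a sketch that assumes a single predictor with intercept \beta_0 and slope \beta, and writes z for the effect-coded variable, which takes the value +1 for one category and -1 for the other.

```latex
\begin{aligned}
\text{odds}(x) &= \frac{\pi(x)}{1 - \pi(x)} = e^{\beta_0 + \beta x} \\[4pt]
\frac{\text{odds}(x+1)}{\text{odds}(x)}
  &= \frac{e^{\beta_0 + \beta (x+1)}}{e^{\beta_0 + \beta x}} = e^{\beta} \\[4pt]
\frac{\text{odds}(z = +1)}{\text{odds}(z = -1)}
  &= \frac{e^{\beta_0 + \beta}}{e^{\beta_0 - \beta}} = e^{2\beta}
\end{aligned}
```

The intercept cancels in both ratios, so the continuous-variable odds ratio does not depend on the starting value x.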

Logistic Regression
Single Continuous Predictor Variable - GPA
  Generalized Linear Model Fit
    Response: Admit    Modeling P(Admit=0)
    Distribution: Binomial
    Link: Logit
    Observations (or Sum Wgts) = 400
  Whole Model Test
    Model        -LogLikelihood   L-R ChiSquare   DF   Prob>ChiSq
    Difference   6.50444839       13.0089         1    0.0003
    Full         243.48381
    Reduced      249.988259
  Goodness Of Fit
    Statistic   ChiSquare   DF    Prob>ChiSq
    Pearson     401.1706    398   0.4460
    Deviance    486.9676    398   0.0015

Logistic Regression
Single Continuous Predictor Variable - GPA cont.
  Effect Tests
    Source   DF   L-R ChiSquare   Prob>ChiSq
    GPA      1    13.008897       0.0003
  Parameter Estimates
    Term        Estimate    Std Error   L-R ChiSquare   Prob>ChiSq   Lower CL    Upper CL
    Intercept   -4.357587   1.0353175   19.117873       <.0001       -6.433355   -2.367383
    GPA         1.0511087   0.2988695   13.008897       0.0003       0.4742176   1.6479411
  Interpretation of the parameter estimate:
    Exp{1.0511087} = 2.86 = odds ratio between the odds at x+1 and the odds at x, for all x
    The ratio of the odds of being admitted for a person with a 3.0 GPA to the odds for a person with a 2.0 GPA is 2.86. Equivalently, the odds for the person with the 3.0 are 2.86 times the odds for the person with the 2.0.
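
The interpretation can be checked numerically from the reported estimates. The sketch below copies the intercept and GPA coefficient from the parameter table; the function name `odds_admitted` is an illustrative choice. It shows that the odds ratio for a one-point GPA increase is the same at any starting GPA.

```python
import math

# Estimates copied from the Parameter Estimates table for the GPA model.
intercept = -4.357587
beta_gpa = 1.0511087


def odds_admitted(gpa):
    """Odds of the modeled outcome (Admit=0, i.e. admitted) at a given GPA."""
    return math.exp(intercept + beta_gpa * gpa)


ratio_3_vs_2 = odds_admitted(3.0) / odds_admitted(2.0)
ratio_4_vs_3 = odds_admitted(4.0) / odds_admitted(3.0)

# Each of these rounds to 2.86, the value reported on the slide.
print(round(ratio_3_vs_2, 2), round(ratio_4_vs_3, 2), round(math.exp(beta_gpa), 2))
```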

Logistic Regression
Single Categorical Predictor Variable - Top Notch
  Generalized Linear Model Fit
    Response: Admit    Modeling P(Admit=0)
    Distribution: Binomial
    Link: Logit
    Observations (or Sum Wgts) = 400
  Whole Model Test
    Model        -LogLikelihood   L-R ChiSquare   DF   Prob>ChiSq
    Difference   3.53984692       7.0797          1    0.0078
    Full         246.448412
    Reduced      249.988259
  Goodness Of Fit
    Statistic   ChiSquare   DF    Prob>ChiSq
    Pearson     400.0000    398   0.4624
    Deviance    492.8968    398   0.0008

Logistic Regression
Single Categorical Predictor Variable - Top Notch cont.
  Effect Tests
    Source     DF   L-R ChiSquare   Prob>ChiSq
    TOPNOTCH   1    7.0796939       0.0078
  Parameter Estimates
    Term          Estimate    Std Error   L-R ChiSquare   Prob>ChiSq   Lower CL    Upper CL
    Intercept     -0.525855   0.138217    14.446085       0.0001       -0.799265   -0.255667
    TOPNOTCH[0]   -0.371705   0.138217    7.0796938       0.0078       -0.642635   -0.099011
  Interpretation of the parameter estimate:
    Exp{2*-0.371705} = 0.4755 = odds ratio between the odds of admittance for a student from a less prestigious university and the odds of admittance for a student from a more prestigious university
    The odds of being admitted from a less prestigious university are 0.48 times the odds of being admitted from a more prestigious university.
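
The factor of 2 in the exponent comes from effect coding. The sketch below assumes, consistently with the slide's Exp{2*estimate} calculation, that the TOPNOTCH[0] column is +1 for topnotch = 0 and -1 for topnotch = 1. The function names are illustrative.

```python
import math

# Estimates copied from the Parameter Estimates table for the Top Notch model.
intercept = -0.525855
beta_topnotch0 = -0.371705  # coefficient labeled TOPNOTCH[0]


def prob_admitted(code):
    """P(Admit=0) for an effect code of +1 (topnotch = 0) or -1 (topnotch = 1)."""
    eta = intercept + beta_topnotch0 * code
    return 1.0 / (1.0 + math.exp(-eta))


def odds(p):
    return p / (1.0 - p)


p_less = prob_admitted(+1)  # less prestigious university (assumed coded +1)
p_more = prob_admitted(-1)  # more prestigious university (assumed coded -1)
odds_ratio = odds(p_less) / odds(p_more)

print(round(p_less, 3), round(p_more, 3), round(odds_ratio, 4))
```

The linear predictors for the two categories differ by twice the coefficient, so the odds ratio equals Exp{2*estimate}.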

Logistic Regression Summary Introduction to the Logistic Regression Model Interpretation of the Parameter Estimates β Odds Ratio

Logistic Regression Questions/Comments

Poisson Regression
  Consider a count response variable.
    Response variable is the number of occurrences in a given time frame.
    Outcomes equal to 0, 1, 2, ...
  Examples:
    Number of penalties during a football game.
    Number of customers who shop at a store on a given day.
    Number of car accidents at an intersection.

Poisson Regression
Poisson Regression Example Data Set
  Response variable: Number of Days Absent (integer)
  Predictor variables:
    Gender: 1 if Female, 2 if Male
    Ethnicity: 6 ethnic categories
    School: 1 if School 1, 2 if School 2
    Math Test Score: continuous
    Language Test Score: continuous
    Bilingual Status: 4 bilingual categories

Poisson Regression
First 10 Observations from the Poisson Regression Example Data Set
  Obs   GENDER   Ethnicity   School   Math Score   Lang. Score   Bilingual.status   Days Absent
  1     2        4           1        56.988830    42.45086      2                  4
  2     2        4           1        37.094160    46.82059      2                  4
  3     1        4           1        32.275460    43.56657      2                  2
  4     1        4           1        29.056720    43.56657      2                  3
  5     1        4           1        6.748048     27.24847      3                  3
  6     1        4           1        61.654280    48.41482      0                  13
  7     1        4           1        56.988830    40.73543      2                  11
  8     2        4           1        10.390490    15.35938      2                  7
  9     2        4           1        50.527950    52.11514      2                  10
  10    2        6           1        49.472050    42.45086      0                  9

Poisson Regression
  Consider the Poisson log-linear model
    $$ \log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} $$
  GLM with Poisson random component and log link, g(µ) = log(µ)
  Predicted response values fall between 0 and +∞

Poisson Regression
Interpretation of Coefficient β
  From our Poisson regression model with a single continuous variable, the relationship between the predicted response at value x and at value x+1 is
    $$ \mu(x+1) = e^{\beta}\,\mu(x) $$
  From our Poisson regression model with a single two-category variable with effect coding, the relationship between the predicted response from one category to another is
    $$ \mu(\text{category coded } +1) = e^{2\beta}\,\mu(\text{category coded } -1) $$

Poisson Regression
Single Continuous Predictor Variable - Math Score
  Generalized Linear Model Fit
    Response: number days absent
    Distribution: Poisson
    Link: Log
    Observations (or Sum Wgts) = 316
  Whole Model Test
    Model        -LogLikelihood   L-R ChiSquare   DF   Prob>ChiSq
    Difference   39.619507        79.2390         1    <.0001
    Full         1595.98854
    Reduced      1635.60805
  Goodness Of Fit
    Statistic   ChiSquare   DF    Prob>ChiSq
    Pearson     3080.403    314   0.0000
    Deviance    2330.581    314   <.0001

Poisson Regression
Single Continuous Predictor Variable - Math Score cont.
  Effect Tests
    Source          DF   L-R ChiSquare   Prob>ChiSq
    ctbs math nce   1    79.239014       <.0001
  Parameter Estimates
    Term            Estimate    Std Error   L-R ChiSquare   Prob>ChiSq   Lower CL    Upper CL
    Intercept       2.3020999   0.0627765   1044.4013       <.0001       2.1780081   2.424086
    ctbs math nce   -0.011568   0.0012941   79.239014       <.0001       -0.014101   -0.009029
  Interpretation of the parameter estimate:
    Exp{-0.011568} = 0.9885 = multiplicative effect on the expected number of days absent for an increase of 1 in the math score
  Fabricated example:
    If a student with a math score of 50 is expected to miss 5 days, then another student with a math score of 51 is expected to miss 5*0.9885 = 4.94 days.
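
The multiplicative interpretation can be checked by evaluating the fitted mean at two adjacent math scores. The sketch below copies the two estimates from the parameter table; the function name `expected_days_absent` is an illustrative choice.

```python
import math

# Estimates copied from the Parameter Estimates table for the math score model.
intercept = 2.3020999
beta_math = -0.011568


def expected_days_absent(math_score):
    """Fitted Poisson mean under the log link: exp(intercept + beta * score)."""
    return math.exp(intercept + beta_math * math_score)


mu_50 = expected_days_absent(50.0)
mu_51 = expected_days_absent(51.0)
multiplier = mu_51 / mu_50

# The ratio of adjacent fitted means equals Exp{estimate} at any starting score.
print(round(mu_50, 2), round(mu_51, 2), round(multiplier, 4))
```

Because the multiplier is slightly below 1, each additional math score point lowers the expected number of days absent by a little over 1 percent.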

Poisson Regression
Single Categorical Predictor Variable - Gender
  Generalized Linear Model Fit
    Response: number days absent
    Distribution: Poisson
    Link: Log
    Observations (or Sum Wgts) = 316
  Whole Model Test
    Model        -LogLikelihood   L-R ChiSquare   DF   Prob>ChiSq
    Difference   22.6810514       45.3621         1    <.0001
    Full         1612.927
    Reduced      1635.60805
  Goodness Of Fit
    Statistic   ChiSquare   DF    Prob>ChiSq
    Pearson     2877.292    314   0.0000
    Deviance    2364.458    314   <.0001

Poisson Regression
Single Categorical Predictor Variable - Gender cont.
  Effect Tests
    Source   DF   L-R ChiSquare   Prob>ChiSq
    GENDER   1    45.362103       <.0001
  Parameter Estimates
    Term        Estimate    Std Error   L-R ChiSquare   Prob>ChiSq   Lower CL    Upper CL
    Intercept   1.743096    0.023734    3155.5494       0.0000       1.6962023   1.7892445
    GENDER[1]   0.1586429   0.023734    45.362103       <.0001       0.1122479   0.2053005
  Interpretation of the parameter estimate:
    Exp{2*0.1586} = 1.3733 = multiplicative effect on the expected number of days absent of being female rather than male
    If a male student is expected to miss X days, then a female student is expected to miss 1.3733*X days.
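
As a rough check of the gender interpretation, the sketch below computes the fitted means for each gender from the reported estimates. It assumes, in line with the slide's Exp{2*estimate} calculation, that the GENDER[1] column is +1 for females (gender = 1) and -1 for males (gender = 2). The variable names are illustrative.

```python
import math

# Estimates copied from the Parameter Estimates table for the gender model.
intercept = 1.743096
beta_gender1 = 0.1586429  # coefficient labeled GENDER[1]

# Effect coding (assumed): +1 for female, -1 for male.
mu_female = math.exp(intercept + beta_gender1 * (+1))
mu_male = math.exp(intercept + beta_gender1 * (-1))
multiplier = mu_female / mu_male

print(round(mu_female, 2), round(mu_male, 2), round(multiplier, 4))
```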

Poisson Regression Summary Introduction to the Poisson Regression Model Interpretation of β

Likelihood Ratio Test
Deviance
  Let
    L(µ; y) = maximum of the log likelihood for the model
    L(y; y) = maximum of the log likelihood for the saturated model
  Deviance = D(y; µ) = -2 [L(µ; y) - L(y; y)]
  Tests the null hypothesis that the model fits as well as the saturated model, which reproduces the observed values exactly
  Deviance has an asymptotic chi-squared distribution with N - p degrees of freedom, where N is the number of observations and p is the number of parameters in the model.
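
For ungrouped 0/1 responses the saturated model sets each fitted probability equal to the observed response, so its log likelihood is 0 and the deviance reduces to twice the negative log likelihood. The sketch below illustrates this with hypothetical responses and fitted probabilities. The function names are illustrative, and the final lines compare against the GPA logistic model output reported earlier.

```python
import math


def bernoulli_log_likelihood(y, mu):
    """Log likelihood of 0/1 responses y under fitted probabilities mu."""
    return sum(math.log(m) if yi == 1 else math.log(1.0 - m)
               for yi, m in zip(y, mu))


def binary_deviance(y, mu):
    """Deviance for ungrouped binary data, where the saturated log likelihood is 0."""
    saturated_log_likelihood = 0.0
    return -2.0 * (bernoulli_log_likelihood(y, mu) - saturated_log_likelihood)


# Hypothetical responses and fitted probabilities, for illustration only.
y = [1, 0, 0, 1, 0]
mu = [0.7, 0.2, 0.4, 0.6, 0.1]
print(round(binary_deviance(y, mu), 4))

# GPA model output: -LogLikelihood (Full) = 243.48381, Deviance = 486.9676.
deviance_from_output = 2.0 * 243.48381
print(round(deviance_from_output, 4))
```

This shortcut applies only to 0/1 data. For the Poisson models the saturated log likelihood is not 0, so the reported deviance is not twice the negative log likelihood.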

Likelihood Ratio Test
Nested Models
  Model 1: model with p predictor variables {X1, X2, X3, ..., Xp} and vector of fitted values µ1
  Model 2: model with q < p predictor variables {X1, X2, X3, ..., Xq} and vector of fitted values µ2
  Model 2 is nested within Model 1 if all predictor variables found in Model 2 are included in Model 1,
    i.e. the set of predictor variables in Model 2 is a subset of the set of predictor variables in Model 1
  Model 2 is a special case of Model 1: all the coefficients associated with X(q+1), X(q+2), ..., Xp are equal to zero

Likelihood Ratio Test
Likelihood Ratio Test
  Null hypothesis: there is no significant difference between the fit of the two models.
  Null hypothesis for nested models: the predictor variables in Model 1 that are not found in Model 2 are not significant to the model fit.
  Alternative hypothesis for nested models: the predictor variables in Model 1 that are not found in Model 2 are significant to the model fit.
  Likelihood Ratio Statistic = -2 [L(µ2; y) - L(µ1; y)] = D(y; µ2) - D(y; µ1)
    Difference of the deviances of the two models
    D(y; µ2) ≥ D(y; µ1) always holds, which implies LRT ≥ 0
    Under the null hypothesis, LRT is asymptotically chi-squared distributed with p - q degrees of freedom
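
The Whole Model Test for the GPA logistic model is an instance of this statistic. The sketch below recomputes it from the two reported -LogLikelihood values, assuming the row labeled Reduced is the intercept-only model (Model 2) and the row labeled Full is the model with GPA (Model 1). With 1 degree of freedom the chi-squared upper tail can be written with the complementary error function, so only the standard library is needed.

```python
import math

# -LogLikelihood values copied from the GPA model's Whole Model Test.
neg_loglik_reduced = 249.988259  # assumed: intercept-only model (Model 2)
neg_loglik_full = 243.48381      # model with GPA (Model 1)

# LRT = -2 [L(mu2; y) - L(mu1; y)] = 2 * (negLL_reduced - negLL_full)
lrt = 2.0 * (neg_loglik_reduced - neg_loglik_full)

# For 1 degree of freedom, P(chi-squared > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lrt / 2.0))

print(round(lrt, 4), round(p_value, 4))
```

The result agrees with the reported L-R ChiSquare of 13.0089 and Prob>ChiSq of 0.0003.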

Likelihood Ratio Test
Theoretical Example of Likelihood Ratio Test
  3 predictor variables: 1 continuous (X1), 1 categorical with 4 categories (X2, X3, X4), 1 categorical with 2 categories (X5)
  Model 1: predictor variables {X1, X2, X3, X4, X5}
  Model 2: predictor variables {X1, X5}
  Null hypothesis: the variable with 4 categories is not significant to the model (β2 = β3 = β4 = 0)
  Alternative hypothesis: the variable with 4 categories is significant
  Likelihood Ratio Statistic = D(y; µ2) - D(y; µ1)
    Difference of the deviance statistics from the two models
    Chi-squared distribution with 5 - 2 = 3 degrees of freedom

Likelihood Ratio Test
Likelihood Ratio Test
  Consider the model with GPA, GRE, and Top Notch as predictor variables
  Generalized Linear Model Fit
    Response: Admit    Modeling P(Admit=0)
    Distribution: Binomial
    Link: Logit
    Observations (or Sum Wgts) = 400
  Whole Model Test
    Model        -LogLikelihood   L-R ChiSquare   DF   Prob>ChiSq
    Difference   10.9234504       21.8469         3    <.0001
    Full         239.064808
    Reduced      249.988259
  Goodness Of Fit
    Statistic   ChiSquare   DF    Prob>ChiSq
    Pearson     396.9196    396   0.4775
    Deviance    478.1296    396   0.0029
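
With more than one degree of freedom the p-value needs a general chi-squared tail probability. The sketch below is a standard-library implementation for integer degrees of freedom, built from the closed forms for 1 and 2 degrees of freedom and the recurrence for the upper incomplete gamma function. The function name `chi2_sf` is an illustrative choice, and in practice a statistics library would normally be used instead. It is applied to the three-predictor Whole Model Test above, again assuming the Reduced row is the intercept-only model.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable, integer df >= 1."""
    half = x / 2.0
    if df % 2 == 1:
        tail, a = math.erfc(math.sqrt(half)), 0.5  # closed form for df = 1
    else:
        tail, a = math.exp(-half), 1.0             # closed form for df = 2
    # Recurrence: Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while 2 * a < df:
        tail += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return tail


# -LogLikelihood values copied from the Whole Model Test above.
lrt = 2.0 * (249.988259 - 239.064808)
p_value = chi2_sf(lrt, 3)

print(round(lrt, 4), p_value < 0.0001)
```

The statistic matches the reported 21.8469 on 3 degrees of freedom, and the tail probability falls below 0.0001, as in the output.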

Likelihood Ratio Test
Variable Selection - Likelihood Ratio Test cont.
  Effect Tests
    Source     DF   L-R ChiSquare   Prob>ChiSq
    TOPNOTCH   1    2.2143635       0.1367
    GPA        1    4.2909753       0.0383
    GRE        1    5.4555484       0.0195
  Parameter Estimates
    Term          Estimate    Std Error   L-R ChiSquare   Prob>ChiSq   Lower CL    Upper CL
    Intercept     -4.382202   1.1352224   15.917859       <.0001       -6.657167   -2.197805
    TOPNOTCH[0]   -0.218612   0.1459266   2.2143635       0.1367       -0.503583   0.070142
    GPA           0.6675556   0.3252593   4.2909753       0.0383       0.0356956   1.3133755
    GRE           0.0024768   0.0010702   5.4555484       0.0195       0.0003962   0.0046006

Likelihood Ratio Test Questions/Comments