Goodness of Fit Tests for Categorical Data: Comparing Stata, R and SAS


Goodness of Fit Tests for Categorical Data: Comparing Stata, R and SAS
Rino Bellocco (1,2), Sc.D., and Sara Algeri (1), MS
(1) University of Milano-Bicocca, Milan, Italy; (2) Karolinska Institutet, Stockholm, Sweden
VIII Italian Stata User Meeting, San Servolo, Venice, Italy, November 17-18, 2011

Content
1 Inference in Logistic Regression
2 Model Definition
3 The Deviance
4 Likelihood Ratio Test Implementation
5 References

1 Inference in Logistic Regression

Steps
- Definition of exposure, confounders, and interaction
- Model building
- Likelihood-based theory: estimation, confidence intervals, and testing
- Goodness of fit

2 Model Definition

Unit of Analysis
In a generalized linear model whose covariates are, or can be restricted to, categorical variables, the unit of analysis can be either:
- subjects, or
- groups of subjects.

Unit of Analysis: The Subjects
Consider a logistic regression model for a binary outcome $Y$:
$$\ln\left[\frac{\pi(x)}{1-\pi(x)}\right] = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \qquad (1)$$
- The units of analysis are subjects.
- The data layout has one record per subject (individual format).
- The goal is to predict $\pi(x) = P(Y = 1 \mid x)$, the probability of success given the covariates $x = (x_1, \dots, x_p)$.
The log-likelihood function is therefore
$$\sum_{i=1}^{n} \left\{ y_i \ln[\pi(x_i)] + (1-y_i)\ln[1-\pi(x_i)] \right\} \qquad (2)$$
where $n$ is the total number of observations.

Unit of Analysis: Groups of Subjects
- The analytical units are groups of subjects that share the same covariate pattern.
- The quantity $\pi(x)$ now refers to the proportion of successes in each group, which is the quantity to be estimated.
- The data layout has one record per covariate pattern (events-trials format).
The log-likelihood function (2) can then be written as
$$\sum_{j=1}^{K} \left\{ s_j \ln[\pi(x_j)] + (m_j - s_j)\ln[1-\pi(x_j)] \right\} \qquad (3)$$
where
- $K$ = total number of possible (observed) covariate patterns,
- $s_j$ = number of successes for the $j$-th covariate pattern,
- $m_j$ = total number of individuals with the $j$-th covariate pattern.
Despite the different data structures, the parameter estimates are the same.
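The equivalence of the two layouts can be checked numerically. The sketch below is only an illustration in Python (not one of the three packages compared in the talk); the records, the pattern labels "a" and "b", and the probabilities are all hypothetical. It evaluates the individual-format log-likelihood (2) and the grouped form (3) for the same data.

```python
import math
from collections import defaultdict

def loglik_individual(records, prob):
    """Equation (2): sum over subjects of y*ln(pi) + (1 - y)*ln(1 - pi)."""
    total = 0.0
    for pattern, y in records:
        p = prob[pattern]
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def group_by_pattern(records):
    """Collapse subject-level records into {pattern: (s_j, m_j)}."""
    counts = defaultdict(lambda: [0, 0])
    for pattern, y in records:
        counts[pattern][0] += y   # successes s_j
        counts[pattern][1] += 1   # group size m_j
    return {pattern: (s, m) for pattern, (s, m) in counts.items()}

def loglik_grouped(groups, prob):
    """Equation (3): sum over patterns of s*ln(pi) + (m - s)*ln(1 - pi)."""
    return sum(s * math.log(prob[x]) + (m - s) * math.log(1 - prob[x])
               for x, (s, m) in groups.items())

# Hypothetical data: (covariate pattern, binary outcome) for seven subjects.
records = [("a", 1), ("a", 0), ("a", 1), ("b", 0), ("b", 0), ("b", 1), ("b", 0)]
# Hypothetical model probabilities for each pattern.
prob = {"a": 0.6, "b": 0.3}

groups = group_by_pattern(records)
ll_ind = loglik_individual(records, prob)
ll_grp = loglik_grouped(groups, prob)
print(groups)
print(ll_ind, ll_grp)
```

Because (3) is just (2) with identical terms collected, the two values agree for any choice of probabilities, which is why maximizing either one gives the same coefficients.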

3 The Deviance

The Deviance Test Statistic
One method for assessing goodness of fit is the deviance statistic $D^2$:
$$D^2 = 2\left\{ \ln[L_s(\hat\beta)] - \ln[L_m(\hat\beta)] \right\} \qquad (4)$$
where
- $\ln[L_m(\hat\beta)]$ = maximized log-likelihood of the fitted model,
- $\ln[L_s(\hat\beta)]$ = maximized log-likelihood of the saturated model.
This quantity compares the values predicted by the fitted model with those predicted by "the most complete model we could fit". A large value of $D^2$ is evidence of lack of fit.

The Deviance Test Statistic: Asymptotic Distribution
Under specific regularity conditions, $D^2$ converges asymptotically to a $\chi^2$ distribution with $h$ degrees of freedom, where $h$ is the difference between the number of parameters in the saturated model and the number of parameters in the model being considered:
$$D^2 \sim \chi^2(h) \qquad (5)$$
Thus we can test the null hypothesis
$$H_0: \beta_h = 0,$$
that is, that the $h$ parameters the saturated model adds to the fitted model are all zero. $H_0$ is rejected when
$$D^2 \geq \chi^2_{1-\alpha}(h),$$
where $\alpha$ is the fixed level of significance.
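The p-value of this test is the upper-tail probability of a $\chi^2(h)$ variable at the observed $D^2$. As a rough, package-independent cross-check, the Python sketch below computes that tail for integer degrees of freedom using only the standard library. The function name `chi2_sf` is an illustrative choice, not part of any package used in the talk. The two calls at the end use the deviances and degrees of freedom reported in the implementation slides.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi2_df >= x) for integer df >= 1.

    Uses the finite-sum form of the regularized upper incomplete gamma
    function Q(df/2, x/2). Each term is evaluated on the log scale so that
    large df (thousands) do not overflow.
    """
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        k, total = 0.0, 0.0                       # even df: pure finite sum
    else:
        k, total = 0.5, math.erfc(math.sqrt(h))   # odd df: start from Q(1/2, h)
    while k < df / 2.0:
        # term = h^k * exp(-h) / Gamma(k + 1)
        total += math.exp(k * math.log(h) - h - math.lgamma(k + 1.0))
        k += 1.0
    return min(total, 1.0)

# Sanity check against a familiar critical value: 3.841459 is the 95th
# percentile of chi2(1), so the tail should be close to 0.05.
print(chi2_sf(3.841459, 1))

# Values reported later in the talk.
p_casewise = chi2_sf(2228.91282, 2196)     # casewise approach
p_contingency = chi2_sf(131.4183066, 9)    # contingency table approach
print(p_casewise, p_contingency)
```

A small p-value leads to rejecting $H_0$, that is, to concluding that the fitted model fits noticeably worse than the saturated one.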

The Deviance Test Statistic: Asymptotic Distribution (cont.)
- If $H_0$ cannot be rejected, we conclude that the fit of the model of interest is substantially similar to that of the most complete model that can be built.
- The deviance test is, to all intents and purposes, a likelihood ratio test that compares two nested models in terms of their log-likelihoods. Every model we can build is nested within the saturated model.

Saturated Model
- The saturated model is the largest model we can fit, and it predicts the outcome of interest perfectly.
- This definition does not lead to a unique specification, but we can identify three approaches to specifying it:
  - Casewise approach
  - Contingency table approach
  - Collapsing approach

Saturated Model: Casewise Approach
- When the unit of analysis is the subject, the saturated model has as many parameters as there are observations ($n$ = number of subjects).
- The saturated model fits the data perfectly, so its log-likelihood (2) is always equal to zero.
- Consequently, the deviance statistic (4), here denoted $G^2$, becomes
$$G^2 = -2\ln[L_m(\hat\beta)] = -2\sum_{i=1}^{n} \left\{ y_i \ln\left[\frac{\pi(x_i)}{1-\pi(x_i)}\right] + \ln[1-\pi(x_i)] \right\}$$
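A short derivation may help show why the saturated log-likelihood vanishes in the casewise setting. It assumes the usual convention $0 \ln 0 = 0$ and writes $\pi_s$ and $\pi_m$ for the probabilities under the saturated and fitted models.

```latex
% Casewise saturated model: one parameter per subject, so the fitted
% probability reproduces the observed outcome, \pi_s(x_i) = y_i \in \{0, 1\}.
\begin{align*}
\ln L_s
  &= \sum_{i=1}^{n} \bigl\{ y_i \ln y_i + (1 - y_i)\ln(1 - y_i) \bigr\}
   = 0
   && \text{(each term is } 1\ln 1 + 0\ln 0 = 0\text{)} \\
D^2
  &= 2\bigl\{ \ln L_s - \ln L_m \bigr\}
   = -2 \ln L_m \\
  &= -2\sum_{i=1}^{n} \bigl\{ y_i \ln \pi_m(x_i) + (1 - y_i)\ln[1 - \pi_m(x_i)] \bigr\} \\
  &= -2\sum_{i=1}^{n} \Bigl\{ y_i \ln\frac{\pi_m(x_i)}{1 - \pi_m(x_i)} + \ln[1 - \pi_m(x_i)] \Bigr\}
\end{align*}
```

The last line is the form of (2) rearranged around the logit, which is the expression given for $G^2$.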

Saturated Model: Casewise Approach (cont.)
- This approach arises with continuous covariates, where the number of covariate patterns is close to the number of subjects ($n \approx K$).
- In that setting $D^2$ cannot be approximated by a $\chi^2$ distribution, because the number of parameters in the saturated model grows with the sample size.
- It may therefore be useful to adopt one of the following approaches.

Saturated Model: Contingency Table and Collapsing Approaches
- The units of analysis are the groups of subjects defined by the covariate patterns.
- The saturated model corresponds to a model with $K$ parameters, where $K$ is the number of possible covariate patterns.
- In these situations $n > K$, the log-likelihood of the saturated model is not equal to zero, and $D^2$ is
$$D^2 = 2\left\{ \ln[L_s(\hat\beta)] - \ln[L_m(\hat\beta)] \right\} = 2\sum_{j=1}^{K} \left\{ s_j \ln\left[\frac{\pi_s(x_j)}{\pi_m(x_j)}\right] + (m_j - s_j)\ln\left[\frac{1-\pi_s(x_j)}{1-\pi_m(x_j)}\right] \right\}$$

Saturated Model: Contingency Table and Collapsing Approaches (cont.)
- $\pi_s(x_j)$ = proportion of successes for the $j$-th covariate pattern predicted by the saturated model,
- $\pi_m(x_j)$ = proportion of successes for the $j$-th covariate pattern predicted by the fitted model.
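For computation it can be convenient to substitute the saturated prediction explicitly. The lines below are a sketch under the standard assumption that a saturated model with one parameter per covariate pattern reproduces the observed proportion, $\pi_s(x_j) = s_j / m_j$, together with the convention $0 \ln 0 = 0$ for patterns in which everyone or no one succeeds.

```latex
% Grouped saturated model: \pi_s(x_j) = s_j / m_j for every pattern j.
\begin{align*}
D^2
  &= 2\sum_{j=1}^{K} \Bigl\{ s_j \ln\frac{\pi_s(x_j)}{\pi_m(x_j)}
     + (m_j - s_j)\ln\frac{1 - \pi_s(x_j)}{1 - \pi_m(x_j)} \Bigr\} \\
  &= 2\sum_{j=1}^{K} \Bigl\{ s_j \ln\frac{s_j}{m_j\,\pi_m(x_j)}
     + (m_j - s_j)\ln\frac{m_j - s_j}{m_j\,[1 - \pi_m(x_j)]} \Bigr\}
\end{align*}
```

In words, each pattern contributes an "observed times log(observed / expected)" term for successes and one for failures, where the expected counts are $m_j \pi_m(x_j)$ and $m_j [1 - \pi_m(x_j)]$.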

Saturated Model: Contingency Table and Collapsing Approaches. What Is the Difference?
- If the covariate patterns are based on all the covariates available in the data set, we are following the contingency table approach.
- If the covariate patterns are based only on the variables in the model of interest, we are following the collapsing approach.
As the following application shows, different covariate pattern specifications lead to different results in terms of both likelihood and deviance.

4 Likelihood Ratio Test Implementation

Setup
- We implement the deviance test under the three approaches presented above, using three of the most common software packages: Stata (version 12.0), R (version 2.13.1), and SAS (version 9.2).
- In all three packages, the default saturated model is the one with as many parameters as there are records in the data set. The data structure (individual format or events-trials format) therefore cannot be neglected.

Data
The data come from the Titanic disaster of April 15, 1912.
- 2201 subjects
- Outcome: survival (1 = survivor, 0 = deceased, or number of survivors)
- Covariates:
  - sex (male or female)
  - economic status (first-class passenger, second-class passenger, third-class passenger, or crew)
  - age (adult or child)
These covariates define 16 possible covariate patterns, of which 14 are observed.

Casewise Approach
- In all three packages, the default saturated model is the one with as many parameters as there are records in the data set.
- Applying the usual logistic regression procedures to the individ data set (analytical units = subjects) therefore gives the deviance test under the casewise definition of the saturated model.

Casewise Approach with Stata

    . xi: glm survival i.sex i.status, family(binomial) link(logit)

    Generalized linear models                          No. of obs      =       2201
    Optimization     : ML                              Residual df     =       2196
                                                       Scale parameter =          1
    Deviance         =   2228.91282                    (1/df) Deviance =   1.014988
    Pearson          =  2228.798854                    (1/df) Pearson  =   1.014936

    Variance function: V(u) = u*(1-u)                  [Bernoulli]
    Link function    : g(u) = ln(u/(1-u))              [Logit]

                                                       AIC             =   1.017225
    Log likelihood   =  -1114.45641                    BIC             =  -14672.97

    ------------------------------------------------------------------------------
                 |                 OIM
        survival |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         _Isex_2 |  -2.421328   .1390931   -17.41   0.000    -2.693946   -2.148711
      _Istatus_2 |   .8808128   .1569718     5.61   0.000     .5731537    1.188472
      _Istatus_3 |  -.0717844   .1709268    -0.42   0.675    -.4067948     .263226
      _Istatus_4 |  -.7774228   .1423145    -5.46   0.000    -1.056354   -.4984916
           _cons |   1.187396   .1574664     7.54   0.000      .878767    1.496024
    ------------------------------------------------------------------------------
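The link between this output and the casewise formula can be verified by hand: with a saturated log-likelihood of zero, the deviance is simply minus twice the reported log likelihood, and the residual degrees of freedom are the number of records minus the number of estimated coefficients. A minimal arithmetic check in Python, using only numbers copied from the output above:

```python
# Figures copied from the Stata glm output for the individual-format data.
log_lik = -1114.45641   # "Log likelihood"
n_obs = 2201            # "No. of obs"
n_params = 5            # _cons, _Isex_2, _Istatus_2, _Istatus_3, _Istatus_4

# Casewise deviance: the saturated log-likelihood is 0, so D^2 = -2 ln L_m.
deviance = -2.0 * log_lik
# Degrees of freedom: parameters of the saturated model (one per record)
# minus parameters of the fitted model.
residual_df = n_obs - n_params

print(f"deviance = {deviance:.5f}, residual df = {residual_df}")
```

Both figures agree with the "Deviance" and "Residual df" lines of the output.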

Casewise Approach with Stata (cont.)

    * chiprob(df, x) is the older name of the upper-tail function chi2tail(df, x)
    . scalar dev = e(deviance)
    . scalar df = e(df)
    . di "GOF casewise  D^2=" dev " df=" df " p-value= " chiprob(df, dev)
    GOF casewise  D^2=2228.9128 df=2196 p-value=.30705384

Casewise Approach with R

    > Model <- glm(survival ~ sex + status, data = individ,
    +              family = binomial(link = logit))
    > Model

    Call:  glm(formula = survival ~ sex + status, family = binomial(link = logit),
        data = individ)

    Coefficients:
     (Intercept)       sexmale   statusfirst  statussecond   statusthird
         1.18740      -2.42133       0.88081      -0.07178      -0.77742

    Degrees of Freedom: 2200 Total (i.e. Null);  2196 Residual
    Null Deviance:      2769
    Residual Deviance: 2229     AIC: 2239

Casewise Approach with R (cont.)

    > dev <- deviance(Model)          # residual deviance of the fitted model
    > df <- df.residual(Model)        # residual degrees of freedom
    > p_value <- 1 - pchisq(dev, df)  # upper-tail chi-squared probability
    > print(matrix(c("GOF casewise approach", " ", "G^2", round(dev, 4), "df", df,
    +                "p-value", round(p_value, 4)), nrow = 4, ncol = 2, byrow = TRUE))
         [,1]                    [,2]
    [1,] "GOF casewise approach" " "
    [2,] "G^2"                   "2228.9128"
    [3,] "df"                    "2196"
    [4,] "p-value"               "0.3071"

Casewise Approach with SAS

    proc logistic data=individ;
      class status sex;
      model survival/n = status sex / scale=none;
    run;

    (long estimation section omitted)

    Deviance and Pearson Goodness-of-Fit Statistics

    Criterion        Value      DF    Value/DF    Pr > ChiSq
    Deviance     2228.9128    2196      1.0150        0.3071
    Pearson      2228.6553    2196      1.0149        0.3084

    Number of events/trials observations: 2201

Contingency Table Approach
- The contingency table approach is obtained by applying the procedures already introduced for the casewise approach to the grouped data set, which has one record for each observed covariate pattern.
- Care is needed in specifying the outcome, which is now expressed as the number of survivors.

Contingency Table Approach with Stata

    . xi: glm survival i.sex i.status if n>0, family(binomial n) link(logit)

    Generalized linear models                          No. of obs      =         14
    Optimization     : ML                              Residual df     =          9
                                                       Scale parameter =          1
    Deviance         =  131.4183066                    (1/df) Deviance =   14.60203
    Pearson          =  127.8463371                    (1/df) Pearson  =   14.20515

    Variance function: V(u) = u*(1-u/n)                [Binomial]
    Link function    : g(u) = ln(u/(n-u))              [Logit]

                                                       AIC             =   13.43138
    Log likelihood   = -89.01967223                    BIC             =   107.6668

    ------------------------------------------------------------------------------
                 |                 OIM
        survival |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         _Isex_2 |  -2.421328   .1390931   -17.41   0.000    -2.693946   -2.148711
      _Istatus_2 |   .8808128   .1569718     5.61   0.000     .5731537    1.188472
      _Istatus_3 |  -.0717844   .1709268    -0.42   0.675    -.4067948     .263226
      _Istatus_4 |  -.7774228   .1423145    -5.46   0.000    -1.056354   -.4984916
           _cons |   1.187396   .1574664     7.54   0.000      .878767    1.496024
    ------------------------------------------------------------------------------

    . scalar dev = e(deviance)
    . scalar df = e(df)
    . di "GOF contingency  G^2=" dev " df=" df " p-value= " chiprob(df, dev)
    GOF contingency  G^2=131.41831 df=9 p-value= 6.058e-24
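The coefficients are identical to the casewise fit, yet the reported log likelihood changes from -1114.46 to -89.02. One plausible explanation is that the grouped (binomial) likelihood includes the binomial coefficients $\binom{m_j}{s_j}$, which do not depend on $\beta$. The Python sketch below checks this reading. It rests on two assumptions: that the authors' grouped data match the counts in R's built-in `Titanic` table (entered by hand here), and that the coefficients map to covariates as in the R output (`sexmale`, `statusfirst`, `statussecond`, `statusthird`, with female and crew as reference levels).

```python
import math

# Assumed data: (status, sex, age) -> (survivors s_j, group size m_j),
# taken from R's built-in Titanic table; 14 observed covariate patterns.
GROUPED = {
    ("first", "male", "child"): (5, 5),
    ("second", "male", "child"): (11, 11),
    ("third", "male", "child"): (13, 48),
    ("first", "female", "child"): (1, 1),
    ("second", "female", "child"): (13, 13),
    ("third", "female", "child"): (14, 31),
    ("first", "male", "adult"): (57, 175),
    ("second", "male", "adult"): (14, 168),
    ("third", "male", "adult"): (75, 462),
    ("crew", "male", "adult"): (192, 862),
    ("first", "female", "adult"): (140, 144),
    ("second", "female", "adult"): (80, 93),
    ("third", "female", "adult"): (76, 165),
    ("crew", "female", "adult"): (20, 23),
}

# Coefficients copied from the output above (reference: female, crew).
INTERCEPT = 1.187396
BETA_SEX = {"female": 0.0, "male": -2.421328}
BETA_STATUS = {"crew": 0.0, "first": 0.8808128,
               "second": -0.0717844, "third": -0.7774228}

def fitted_prob(status, sex):
    """Inverse logit of the linear predictor for one covariate pattern."""
    eta = INTERCEPT + BETA_SEX[sex] + BETA_STATUS[status]
    return 1.0 / (1.0 + math.exp(-eta))

kernel = 0.0      # equation (3): the part that depends on beta
constant = 0.0    # sum of log binomial coefficients
for (status, sex, _age), (s, m) in GROUPED.items():
    p = fitted_prob(status, sex)
    kernel += s * math.log(p) + (m - s) * math.log(1.0 - p)
    constant += math.log(math.comb(m, s))

full = kernel + constant
print(f"kernel log-likelihood  = {kernel:.4f}")
print(f"with binomial constant = {full:.4f}")
```

Under these assumptions the kernel reproduces the individual-format log likelihood and the full value reproduces the grouped one, so the two fits differ only by a constant that cancels in the estimation but not in the comparison with the saturated model.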

Contingency Table Approach with R

    > fail <- grouped$n - grouped$survival
    > Model <- glm(cbind(survival, fail) ~ sex + status, data = grouped,
    +              family = binomial(link = logit))
    > Model

    Call:  glm(formula = cbind(survival, fail) ~ sex + status,
        family = binomial(link = logit), data = grouped)

    Coefficients:
     (Intercept)       sexmale   statusfirst  statussecond   statusthird
         1.18740      -2.42133       0.88081      -0.07178      -0.77742

    Degrees of Freedom: 13 Total (i.e. Null);  9 Residual
    Null Deviance:      672
    Residual Deviance: 131.4    AIC: 188

Contingency Table Approach with R (cont.)

    > dev <- deviance(Model)          # residual deviance of the grouped fit
    > df <- df.residual(Model)        # 14 records minus 5 coefficients
    > p_value <- 1 - pchisq(dev, df)
    > print("GOF contingency table approach")
    [1] "GOF contingency table approach"
    > print(matrix(c("GOF contingency approach", " ", "G^2", round(dev, 4), "df", df,
    +                "p-value", round(p_value, 4)), nrow = 4, ncol = 2, byrow = TRUE))
         [,1]                       [,2]
    [1,] "GOF contingency approach" " "
    [2,] "G^2"                      "131.4183"
    [3,] "df"                       "9"
    [4,] "p-value"                  "0"

Contingency Table Approach with SAS

    proc logistic data=grouped;
      class status sex;
      model survival/n = status sex / scale=none;
    run;

    (long estimation section omitted)

    Deviance and Pearson Goodness-of-Fit Statistics

    Criterion        Value      DF    Value/DF    Pr > ChiSq
    Deviance      131.4183       9     14.6020        <.0001
    Pearson       127.8383       9     14.2043        <.0001

    Number of events/trials observations: 14

Collapsing Approach
- The covariate patterns are now based only on the covariates in the fitted model.
- We therefore cannot rely on the default options of the previous procedures.
- In both Stata and R this requires user-written programs or functions that are not part of the standard distribution, whereas in SAS we can use the aggregate option.

Collapsing Approach with Stata

    . xi: logit survival i.sex i.status
    i.sex             _Isex_1-2           (_Isex_1 for sex==female omitted)
    i.status          _Istatus_1-4        (_Istatus_1 for status==crew omitted)

    Logistic regression                               Number of obs   =       2201
                                                      LR chi2(4)      =     540.54
                                                      Prob > chi2     =     0.0000
    Log likelihood = -1114.4564                       Pseudo R2       =     0.1952

    ------------------------------------------------------------------------------
        survival |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         _Isex_2 |  -2.421328   .1390931   -17.41   0.000    -2.693946   -2.148711
      _Istatus_2 |   .8808128   .1569718     5.61   0.000     .5731537    1.188472
      _Istatus_3 |  -.0717844   .1709268    -0.42   0.675    -.4067948     .263226
      _Istatus_4 |  -.7774228   .1423145    -5.46   0.000    -1.056354   -.4984916
           _cons |   1.187396   .1574664     7.54   0.000      .878767    1.496024
    ------------------------------------------------------------------------------

Collapsing Approach with Stata (cont.)

    * ldev: deviance goodness-of-fit test over the covariate patterns of the
    * model just fitted with logit (run it immediately after estimation)
    program define ldev
        version 12.0
        tempvar n d d2 j
        * deviance residual and covariate-pattern number for each record
        predict `d', de
        predict `n', n
        generate `d2' = (`d')^2
        * keep one squared residual per covariate pattern
        sort `n'
        quietly by `n': generate `j' = _n
        quietly summarize `d2' if `j' == 1
        display
        display in green "Logistic model deviance goodness-of-fit test"
        display
        display in green "       number of observations = " in yellow %7.0f = e(N)
        display in green " number of covariate patterns = " in yellow %7.0f = r(N)
        display in green "     deviance goodness-of-fit = " in yellow %10.2f = r(sum)
        display in green "           degrees of freedom = " /*
            */ in yellow %7.0f = (r(N) - e(df_m) - 1)
        display in green "                  Prob > chi2 = " /*
            */ in yellow %12.4f = chiprob((r(N) - e(df_m) - 1), r(sum))
    end

Collapsing Approach with Stata (cont.)

    . ldev

    Logistic model deviance goodness-of-fit test

           number of observations =    2201
     number of covariate patterns =       8
         deviance goodness-of-fit =   65.18
               degrees of freedom =       3
                      Prob > chi2 =  0.0000

Collapsing Approach with R

    collapsing_approach <- function(model) {
      y <- model$y                        # binary response of the fitted model
      x <- model$model[, -1]              # covariates used in the fitted model
      nx <- dim(x)[2]                     # number of covariates
      name <- names(model$model[, -1])
      # Saturated model: all interactions among the covariates in the model
      fmla <- as.formula(paste("y~", paste("(", paste(name, collapse = "+"), ")^", nx)))
      m <- glm(fmla, data = model$data, family = binomial(link = logit))
      lS <- logLik(m)
      devS <- -2 * as.numeric(lS)         # -2 log-likelihood of the saturated model
      dfS <- attr(lS, "df")
      G2 <- model$deviance - devS         # deviance relative to the collapsed saturated model
      df <- dfS - attr(logLik(model), "df")
      p_value <- 1 - pchisq(G2, df)
      print(matrix(c("GOF collapsing approach", " ", "G^2", round(G2, 4), "df", df,
                     "p-value", round(p_value, 4)), nrow = 4, ncol = 2, byrow = TRUE))
    }

Collapsing Approach with R (cont.)

    > Model <- glm(survival ~ sex + status, data = individ,
    +              family = binomial(link = logit))
    > Model

    Call:  glm(formula = survival ~ sex + status, family = binomial(link = logit),
        data = individ)

    Coefficients:
     (Intercept)       sexmale   statusfirst  statussecond   statusthird
         1.18740      -2.42133       0.88081      -0.07178      -0.77742

    Degrees of Freedom: 2200 Total (i.e. Null);  2196 Residual
    Null Deviance:      2769
    Residual Deviance: 2229     AIC: 2239

    > collapsing_approach(Model)
         [,1]                      [,2]
    [1,] "GOF collapsing approach" " "
    [2,] "G^2"                     "65.1798"
    [3,] "df"                      "3"
    [4,] "p-value"                 "0"

Collapsing Approach with SAS

    proc logistic data=individ;
      class status sex;
      model survival/n = status sex / scale=none aggregate;
    run;

    (long estimation section omitted)

    Deviance and Pearson Goodness-of-Fit Statistics

    Criterion        Value      DF    Value/DF    Pr > ChiSq
    Deviance       65.1798       3     21.7266        <.0001
    Pearson        60.8752       3     20.2917        <.0001

    Number of unique profiles: 8
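The contingency table and collapsing results can be compared side by side with a small package-independent sketch. The Python code below assumes that the grouped data match the counts in R's built-in `Titanic` table (entered by hand) and reuses the coefficients reported in the slides; the helper names `deviance` and `collapse` are illustrative only. It evaluates the grouped deviance once over the 14 observed (status, sex, age) patterns and once after collapsing over age, the covariate that is not in the model.

```python
import math
from collections import defaultdict

# Assumed data: (status, sex, age) -> (survivors s_j, group size m_j).
GROUPED = {
    ("first", "male", "child"): (5, 5),
    ("second", "male", "child"): (11, 11),
    ("third", "male", "child"): (13, 48),
    ("first", "female", "child"): (1, 1),
    ("second", "female", "child"): (13, 13),
    ("third", "female", "child"): (14, 31),
    ("first", "male", "adult"): (57, 175),
    ("second", "male", "adult"): (14, 168),
    ("third", "male", "adult"): (75, 462),
    ("crew", "male", "adult"): (192, 862),
    ("first", "female", "adult"): (140, 144),
    ("second", "female", "adult"): (80, 93),
    ("third", "female", "adult"): (76, 165),
    ("crew", "female", "adult"): (20, 23),
}

# Coefficients copied from the slides (reference levels: female, crew).
INTERCEPT = 1.187396
BETA_SEX = {"female": 0.0, "male": -2.421328}
BETA_STATUS = {"crew": 0.0, "first": 0.8808128,
               "second": -0.0717844, "third": -0.7774228}
N_PARAMS = 5

def fitted_prob(status, sex):
    eta = INTERCEPT + BETA_SEX[sex] + BETA_STATUS[status]
    return 1.0 / (1.0 + math.exp(-eta))

def xlogy(count, ratio):
    """count * ln(ratio), with the convention 0 * ln(0) = 0."""
    return count * math.log(ratio) if count > 0 else 0.0

def deviance(groups):
    """Grouped deviance with saturated proportions s_j / m_j."""
    total = 0.0
    for key, (s, m) in groups.items():
        p = fitted_prob(key[0], key[1])   # key starts with (status, sex)
        total += xlogy(s, s / (m * p))
        total += xlogy(m - s, (m - s) / (m * (1.0 - p)))
    return 2.0 * total

def collapse(groups, keep):
    """Merge patterns that agree on the key positions listed in `keep`."""
    merged = defaultdict(lambda: [0, 0])
    for key, (s, m) in groups.items():
        new_key = tuple(key[i] for i in keep)
        merged[new_key][0] += s
        merged[new_key][1] += m
    return {k: (s, m) for k, (s, m) in merged.items()}

collapsed = collapse(GROUPED, keep=(0, 1))   # keep status and sex, drop age

for label, groups in (("contingency", GROUPED), ("collapsing", collapsed)):
    print(label, "patterns =", len(groups),
          "deviance =", round(deviance(groups), 4),
          "df =", len(groups) - N_PARAMS)
```

Under these assumptions the two runs reproduce the deviances and degrees of freedom reported by the three packages. The fitted model is the same in both; only the saturated model it is compared with has changed.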

5 References

References
- Agresti, A. (2007), An Introduction to Categorical Data Analysis, John Wiley & Sons, Inc.
- Dickman, P. (1998), Evaluating the goodness-of-fit of logistic regression models with examples in SAS PROC LOGISTIC, Technical report, Karolinska Institute.
- Hosmer, D. W.; Hosmer, T.; Le Cessie, S. & Lemeshow, S. (1997), A comparison of Goodness-of-Fit Tests for the Logistic Regression Model.
- Hosmer, D. W.; Taber, S. & Lemeshow, S. (1991), The Importance of Assessing the Fit of Logistic Regression Models: A Case Study.
- Kleinbaum, D. G. & Klein, M. (2010), Logistic Regression: A Self-Learning Text, Springer.
- Kuss, O. (2001), Global Goodness-of-Fit Tests in Logistic Regression with Sparse Data, Technical report, University of Halle-Wittenberg.
- Long, J. S. & Freese, J. (2000), Scalar Measure of Fit for Regression Models, Indiana University and University of Wisconsin-Madison.
- Simonoff, J. S. (1998), Logistic Regression, Categorical Predictors, and Goodness-of-Fit: It Depends on Who You Ask.