Paper 1485-2014
SAS Global Forum

Measures of Fit for Logistic Regression

Paul D. Allison, Statistical Horizons LLC and the University of Pennsylvania

ABSTRACT

One of the most common questions about logistic regression is "How do I know if my model fits the data?" There are many approaches to answering this question, but they generally fall into two categories: measures of predictive power (like R-square) and goodness-of-fit tests (like the Pearson chi-square). This presentation looks first at R-square measures, arguing that the optional R-squares reported by PROC LOGISTIC might not be optimal. Measures proposed by McFadden and Tjur appear to be more attractive. As for goodness of fit, the popular Hosmer and Lemeshow test is shown to have some serious problems. Several alternatives are considered.

INTRODUCTION

One of the most frequent questions I get about logistic regression is "How can I tell if my model fits the data?" Often the questioner is expressing a genuine interest in knowing whether a model is a good model or a not-so-good model. But a more common motivation is to convince someone else--a boss, an editor, or a regulator--that the model is OK.

There are two very different approaches to answering this question. One is to get a statistic that measures how well you can predict the dependent variable based on the independent variables. I'll refer to these kinds of statistics as measures of predictive power. Typically, they vary between 0 and 1, with 0 meaning no predictive power whatsoever and 1 meaning perfect predictions. Predictive power statistics available in PROC LOGISTIC include R-square, the area under the ROC curve, and several rank-order correlations. Obviously, the higher the better, but there is rarely a fixed cut-off that distinguishes an acceptable model from one that is not acceptable.

The other approach to evaluating model fit is to compute a goodness-of-fit statistic. With PROC LOGISTIC, you can get the deviance, the Pearson chi-square, or the Hosmer-Lemeshow test.
These are formal tests of the null hypothesis that the fitted model is correct, and their output is a p-value--again a number between 0 and 1 with higher values indicating a better fit. In this case, however, a p-value below some specified level (say, .05) would indicate that the model is not acceptable.

What many researchers fail to realize is that measures of predictive power and goodness-of-fit statistics are testing very different things. It is not at all uncommon for models with very high R-squares to produce unacceptable goodness-of-fit statistics. And conversely, models with very low R-squares can fit the data very well according to goodness-of-fit tests. As I'll explain in more detail later, what goodness-of-fit statistics are testing is not how well you can predict the dependent variable, but whether you could do even better by making the model more complicated--specifically, adding non-linearities, adding interactions, or changing the link function.

The goal of this paper is to discuss several issues that I've been grappling with over the years regarding both predictive power statistics and goodness-of-fit statistics. With one exception, I make no claim to originality here. Rather, I'm simply trying to make sense out of a rather complicated literature, and to distill it into some practical recommendations. I begin with measures of predictive power, and I'm going to focus exclusively on R-square measures. I don't mean to imply that these are either better or worse than alternative measures like the area under the ROC curve. But I personally happen to like R-square statistics just because they are so familiar from the context of ordinary linear regression.

R² STATISTICS FOR LOGISTIC REGRESSION

There are many different ways to calculate R² for logistic regression and, unfortunately, no consensus on which one is best. Mittlböck and Schemper (1996) reviewed 12 different measures; Menard (2000) considered several others.
The two methods that are most often reported in statistical software appear to be one proposed by McFadden (1974) and another that is usually attributed to Cox and Snell (1989), along with its corrected version. The Cox-Snell R² (both corrected and uncorrected) was actually discussed earlier by Maddala (1983) and by Cragg and Uhler (1970). Cox-Snell is the optional R² reported by PROC LOGISTIC. PROC QLIM reports eight different R² measures, including both Cox-Snell and McFadden. Among other statistical packages that I'm familiar with, Statistica reports the Cox-Snell measures. JMP reports both McFadden and Cox-Snell. SPSS reports the Cox-Snell measures for binary logistic regression but McFadden's measure for multinomial and ordered logit.
For years, I've been recommending the Cox-Snell R² over the McFadden R², but I've recently concluded that that was a mistake. I now believe that McFadden's R² is a better choice. However, I've also learned about another R² that has good properties, a lot of intuitive appeal, and is easily calculated. At the moment, I like it better than the McFadden R², but I'm not prepared to make a definitive recommendation at this point. Here are some details.

Logistic regression is, of course, estimated by maximizing the likelihood function. Let L0 be the value of the likelihood function for a model with no predictors, and let LM be the likelihood for the model being estimated. McFadden's R² is defined as

   R²McF = 1 - ln(LM) / ln(L0)

where ln(.) is the natural logarithm. The rationale for this formula is that ln(L0) plays a role analogous to the residual sum of squares in linear regression. Consequently, this formula corresponds to a proportional reduction in "error variance." It is sometimes referred to as a "pseudo" R².

The Cox and Snell R² is

   R²C&S = 1 - (L0 / LM)^(2/n)

where n is the sample size. The rationale for this formula is that, for normal-theory linear regression, it is an identity. In other words, the usual R² for linear regression depends on the likelihoods for the models with and without predictors by precisely this formula. It is appropriate, then, to describe this as a "generalized" R² rather than a pseudo R². By contrast, the McFadden R² does not have the OLS R² as a special case. I've always found this property of the Cox-Snell R² to be very attractive, especially because the formula can be naturally extended to other kinds of regression estimated by maximum likelihood, like negative binomial regression for count data or Weibull regression for survival data.

It is well known, however, that the big problem with the Cox-Snell R² is that it has an upper bound that is less than 1.0. Specifically, the upper bound is 1 - L0^(2/n).
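For readers who want to experiment outside of SAS, here is a small Python sketch (mine, not part of the paper) of the quantities just defined: McFadden's R², the Cox-Snell R², the Cox-Snell upper bound, and the max-rescaled version (discussed below). The log-likelihood values are hypothetical, chosen only for illustration.

```python
import math

def mcfadden_r2(lnL0, lnLM):
    # R2_McF = 1 - ln(L_M) / ln(L_0)
    return 1.0 - lnLM / lnL0

def cox_snell_r2(lnL0, lnLM, n):
    # R2_C&S = 1 - (L_0 / L_M)^(2/n), computed on the log scale for stability
    return 1.0 - math.exp((2.0 / n) * (lnL0 - lnLM))

def cs_upper_bound(lnL0, n):
    # Upper bound of the Cox-Snell R2: 1 - L_0^(2/n)
    return 1.0 - math.exp((2.0 / n) * lnL0)

def max_rescaled_r2(lnL0, lnLM, n):
    # Nagelkerke's ad hoc correction: divide by the upper bound
    return cox_snell_r2(lnL0, lnLM, n) / cs_upper_bound(lnL0, n)

# Hypothetical log-likelihoods for a sample of n = 751
lnL0, lnLM, n = -513.7, -406.8, 751
print(round(mcfadden_r2(lnL0, lnLM), 3))
print(round(cox_snell_r2(lnL0, lnLM, n), 3))
print(round(max_rescaled_r2(lnL0, lnLM, n), 3))
```

Because everything is computed on the log scale, the sketch works even when the raw likelihoods would underflow.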
This can be a lot less than 1.0, and it depends only on p, the marginal proportion of cases with events:

   upper bound = 1 - [p^p (1-p)^(1-p)]²

I have not seen this formula anywhere else, so it may be the only original thing in this paper. The upper bound reaches a maximum of .75 when p=.5. By contrast, when p=.9 (or .1), the upper bound is only .48. For those who want an R² that behaves like a linear-model R², this is deeply unsettling.

There is a simple correction, and that is to divide R²C&S by its upper bound, which produces the R² attributed to Nagelkerke (1991) and which is labeled in SAS output as the "max-rescaled R²." But this correction is purely ad hoc, and it greatly reduces the theoretical appeal of the original R²C&S. I also think that the values it typically produces are misleadingly high, especially compared with what you get from just doing OLS with the binary dependent variable. (Some might view this as a feature, however). So, with some reluctance, I've decided to cross over to the McFadden camp. As Menard (2000) argued, it satisfies almost all of Kvalseth's (1985) eight criteria for a good R². When the marginal proportion is around .5, the McFadden R² tends to be a little smaller than the uncorrected Cox-Snell R². When the marginal proportion is nearer to 0 or 1, the McFadden R² tends to be larger.

But there is another R², recently proposed by Tjur (2009), that I'm inclined to prefer over McFadden's. It has a lot of intuitive appeal, its upper bound is 1.0, and it is closely related to R² definitions for linear models. It is also easy to calculate. The definition is very simple: for each of the two categories of the dependent variable, calculate the mean of the predicted probabilities of an event. Then, take the absolute value of the difference between those two means. That's it! The motivation should be clear. If a model makes good predictions, the cases with events should have high predicted values and the cases without events should have low predicted values.
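Tjur's statistic is simple enough to compute in a few lines. Here is an illustrative Python sketch (mine, not from the paper) with made-up predicted probabilities:

```python
def tjur_r2(y, yhat):
    # Tjur's coefficient of discrimination: the mean predicted probability
    # among events minus the mean among non-events, in absolute value
    events = [p for yi, p in zip(y, yhat) if yi == 1]
    nonevents = [p for yi, p in zip(y, yhat) if yi == 0]
    return abs(sum(events) / len(events) - sum(nonevents) / len(nonevents))

# Toy data: events tend to get higher predicted probabilities
y    = [1, 1, 1, 0, 0, 0]
yhat = [0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
print(round(tjur_r2(y, yhat), 3))  # mean .7 among events, .3 among non-events
```

A model that assigns the same predicted probability to everyone gets a Tjur R² of 0; perfect separation of events and non-events gives 1.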
Tjur also showed that his R² (which he called the "coefficient of discrimination") is equal to the arithmetic mean of two R² formulas based on squared residuals, and equal to the geometric mean of two other R²s based on squared residuals.

Here is an example of how to calculate Tjur's statistic in SAS. I used a well-known data set on labor force participation of 751 married women (Mroz 1987). The dependent variable INLF is coded 1 if a woman was in the labor force, otherwise 0. A logistic regression model was fit with six predictors. Here is the code:

proc logistic data=my.mroz;
  model inlf(desc) = kidslt6 age educ huswage city exper;
  output out=a pred=yhat;
proc ttest data=a;
  class inlf;
  var yhat;
run;

The OUTPUT statement produces a new data set called A with predicted probabilities stored in a new variable called YHAT. PROC TTEST is a convenient way to compute the mean of the predicted probabilities for each category of the dependent variable, and to take their difference. The output is shown in Table 1. Ignoring the sign of the difference, the Tjur R² is .2575. By contrast, the Cox-Snell R² is .2477, and the max-rescaled R² is .332. McFadden's R² is .208. The squared correlation between the observed and predicted values is .257.

The TTEST Procedure
Variable: yhat (Estimated Probability)

INLF          N      Mean   Std Dev   Std Err   Minimum   Maximum
0           325    0.4212    0.2238    0.0124    0.0160     0.959
1           426    0.6787    0.2119    0.0103    0.1103     0.960
Diff (1-2)        -0.2575    0.2171    0.0160

Table 1. PROC TTEST Output to Compute Tjur's R².

One possible objection to the Tjur R² is that, unlike Cox-Snell and McFadden, it is not based on the quantity being maximized, namely, the likelihood function. As a result, it is possible that adding a variable to the model could reduce the Tjur R². But Kvalseth (1985) argued that it is actually preferable that R² not be based on a particular estimation method. In that way, it can legitimately be used to compare predictive power for models that generate their predictions using very different methods. For example, one might want to compare predictions based on logistic regression with those based on a linear model or on a classification tree method. Another potential complaint is that the Tjur R² cannot be easily generalized to ordinal or nominal logistic regression. For McFadden and Cox-Snell, the generalization is trivial.

CLASSIC GOODNESS-OF-FIT STATISTICS

I now turn to goodness-of-fit (GOF) tests, which can help you decide whether your model is correctly specified. GOF tests produce a p-value. If it is low (say, below .05), you reject the model. If it is high, then your model passes the test. Classic GOF tests are readily available for logistic regression when the data can be aggregated or grouped into unique "profiles."
Profiles are groups of cases that have exactly the same values on the predictors. For example, suppose we fit a model to the Mroz data with just two predictor variables, CITY (1=urban, 0=nonurban) and KIDSLT6, which has integer values ranging from 0 to 3. There are then eight profiles, corresponding to the eight cells in the cross-classification of CITY by KIDSLT6. After fitting the model, we can get an observed number of events and an expected number of events for each profile. There are two well-known statistics for comparing the observed number with the expected number: the deviance and Pearson's chi-square. Here is how to get them with PROC LOGISTIC:

proc logistic data=my.mroz;
  model inlf(desc) = kidslt6 city / aggregate scale=none;
run;

The AGGREGATE option says to aggregate the data into profiles based on the values of the predictor variables. The SCALE=NONE option requests the deviance and the Pearson chi-square, based on those profiles. Here are the results.
Deviance and Pearson Goodness-of-Fit Statistics

Criterion     Value    DF    Value/DF    Pr > ChiSq
Deviance     4.1109     5      0.8222        0.5336
Pearson      3.9665     5      0.7933        0.5543

Number of unique profiles: 8

Table 2. PROC LOGISTIC Output of GOF Statistics

For both statistics, the chi-squares are low relative to the degrees of freedom, and the p-values are high. This is exactly what we want to see. There is no evidence to reject the null hypothesis, which is that the fitted model is correct.

Now let's take a closer look at these two statistics. The formula for the deviance is

   G² = 2 Σj Oj log(Oj / Ej)

where each j is a cell in the 2-way contingency table with each row being a profile and each column being one of the two categories of the dependent variable. Oj is the observed frequency and Ej is the expected frequency based on the fitted model. If Oj = 0, the entire term in the summation is set to 0. The degrees of freedom is the number of profiles minus the number of estimated parameters. The Pearson chi-square is calculated as

   X² = Σj (Oj - Ej)² / Ej

If the fitted model is correct, both statistics have approximately a chi-square distribution, with the approximation improving as the sample gets larger.

But what exactly are these statistics testing? This is easiest to see for the deviance, which is a likelihood ratio test comparing the fitted model to a "saturated" model that perfectly fits the data. In our example, a saturated model would treat KIDSLT6 as a CLASS variable, and would also include the interaction of KIDSLT6 and CITY. Here is the code for that model, with the GOF output in Table 3.

proc logistic data=my.mroz;
  class kidslt6;
  model inlf(desc) = kidslt6 city kidslt6*city / aggregate scale=none;
run;

Deviance and Pearson Goodness-of-Fit Statistics

Criterion     Value    DF    Value/DF    Pr > ChiSq
Deviance     0.0000     0           .             .
Pearson      0.0000     0           .             .

Table 3. PROC LOGISTIC Output for a Saturated Model

So the answer to the question "What are GOF tests testing?" is simply this: they are testing whether there are any non-linearities or interactions. You can always produce a satisfactory fit by adding enough interactions and non-linearities.
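The two formulas above can be sketched directly in Python (an illustration of mine, not the paper's code; the observed and expected counts below are hypothetical):

```python
import math

def deviance_and_pearson(obs, exp):
    # obs, exp: parallel lists of observed and expected counts, one entry
    # per cell of the profile-by-outcome table. Cells with O_j = 0
    # contribute nothing to the deviance, matching the convention above.
    g2 = 2.0 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)
    x2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    return g2, x2

# Hypothetical 2-profile, 2-outcome table: (events, non-events) per profile
g2, x2 = deviance_and_pearson([10, 20, 15, 5], [12, 18, 13, 7])
print(round(g2, 3), round(x2, 3))
```

As the text notes, both statistics would then be referred to a chi-square distribution with (number of profiles minus number of estimated parameters) degrees of freedom.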
But do you really need them to properly represent the data? GOF tests are designed to answer that question.

A related issue is whether the link function is correct. Is it logit, probit, complementary log-log, or something else entirely? Note that in a saturated model, the link function is irrelevant. It is only when you suppress interactions or non-linearities that the link function becomes an issue. For example, it is possible (although unusual) that interactions
that are needed for a logit model could disappear when you fit a complementary log-log model.

Both the deviance and the Pearson chi-square have good properties when the expected number of events and the expected number of non-events for each profile is at least 5. But most contemporary applications of logistic regression use data that do not allow for aggregation into profiles, because the model includes one or more continuous (or nearly continuous) predictors. That is certainly true for the Mroz data when you include age, education, husband's wage, and years of experience in the model. When there is only one case per profile, both the deviance and the Pearson chi-square have distributions that depart markedly from a true chi-square distribution, yielding p-values that may be wildly inaccurate. In fact, with only one case per profile, the deviance does not depend on the observed values at all, making it utterly useless as a GOF test (McCullagh 1985).

What can we do? Hosmer and Lemeshow (1980) proposed grouping cases together according to their predicted values from the logistic regression model. Specifically, the predicted values are arrayed from lowest to highest, and then separated into several groups of approximately equal size. Ten groups is the standard recommendation. For each group, we calculate the observed number of events and non-events, as well as the expected number of events and non-events. The expected number of events is just the sum of the predicted probabilities for all the individuals in the group. And the expected number of non-events is the group size minus the expected number of events. Pearson's chi-square is then applied to compare observed counts with expected counts. The degrees of freedom is the number of groups minus 2. As with the classic GOF tests, low p-values suggest rejection of the model.

For the Mroz data, here is the code for a model with six predictors:

proc logistic data=my.mroz;
  model inlf(desc) = kidslt6 age educ huswage city exper / lackfit;
run;

The LACKFIT option requests the Hosmer-Lemeshow (HL) test.
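The grouping-and-comparison procedure just described can be sketched in Python (an illustration of mine, not the paper's SAS), which also makes the role of the number of groups explicit:

```python
def hosmer_lemeshow(y, yhat, groups=10):
    # Sort cases by predicted probability and split into roughly
    # equal-sized groups; then compare observed and expected counts of
    # events and non-events in each group with a Pearson chi-square.
    pairs = sorted(zip(yhat, y))
    n = len(pairs)
    hl = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        obs1 = sum(yi for _, yi in chunk)
        exp1 = sum(p for p, _ in chunk)  # expected events = sum of probabilities
        obs0, exp0 = len(chunk) - obs1, len(chunk) - exp1
        hl += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return hl  # refer to chi-square with (groups - 2) df
```

Rerunning with a different `groups` value will generally change the statistic, which is the arbitrariness at issue in the discussion that follows.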
Results are in Table 4.

Partition for the Hosmer and Lemeshow Test

                   INLF = 1               INLF = 0
Group   Total   Observed   Expected   Observed   Expected
  1       75       14        10.05       61        64.95
  2       75       19        19.58       56        55.42
  3       75       26        26.77       49        48.23
  4       75       24        34.16       51        40.84
  5       75       48        41.42       27        33.58
  6       75       53        47.32       22        27.68
  7       75       49        52.83       26        22.17
  8       75       54        58.87       21        16.13
  9       75       68        65.05        7         9.95
 10       76       71        69.94        5         6.06

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square    DF    Pr > ChiSq
   15.6061     8        0.0484

Table 4. Hosmer-Lemeshow Results from PROC LOGISTIC.
The p-value is just below .05, suggesting that we may need some interactions or non-linearities in the model.

The HL test seems like a clever solution, and it has become the de facto standard for almost all software packages. But it turns out to have serious problems. The most troubling problem is that results can depend markedly on the number of groups, and there is no theory to guide the choice of that number. This problem did not become apparent until some software packages (but not SAS) started allowing you to specify the number of groups, rather than just using 10. When I estimated this model in Stata, for example, with the default number of 10 groups, I got an HL chi-square of 15.52 with 8 df, yielding a p-value of .0499--almost the same as what we just got in SAS. But if we specify 9 groups, the p-value rises to .11. With 11 groups, the p-value is .64. Clearly, it is not acceptable for the results to depend so greatly on minor changes that are completely arbitrary. Examples like this are easy to come by.

But wait, there's more. One would hope that adding a statistically significant interaction or non-linearity to a model would improve its fit, as judged by the HL test. But often that doesn't happen. Suppose, for example, that we add the square of EXPER (labor force experience) to the model, allowing for non-linearity in the effect of experience. The squared term is highly significant (p=.002). But with 9 groups, the HL chi-square increases from 11.65 (p=.11) in the simpler model to 13.34 (p=.06) in the more complex model. Thus, the HL test suggests that we'd be better off with the model that excludes the squared term.

The reverse can also happen. Quite frequently, adding a non-significant interaction or non-linearity to a model will substantially improve the HL fit. For example, I added the interaction of EDUC and EXPER to the basic model above. The product term had a p-value of .68, clearly not statistically significant. But the HL chi-square (based on 10 groups) declined from 15.52 (p=.05) to 9.19 (p=.33). Again, unacceptable behavior.
I am certainly not the first person to point out these problems. In fact, in a 1997 paper, Hosmer, Lemeshow and others acknowledged that the HL test had several drawbacks, although that hasn't stopped other people from using it. But if the HL test is not good, then how can we assess the fit of the model? It turns out that there has been quite a lot of work on this topic, and many alternative tests have been proposed--so many that it is rather difficult to figure out which ones are useful. In the remainder of this paper, I will review some of the literature on these tests, and I will recommend four of them that I think are worthy of consideration.

NEW GOODNESS-OF-FIT TESTS

Many of the proposed tests are based on alternative ways of grouping the data (Tsiatis 1980, Pigeon and Heyse 1991, Pulkstenis and Robinson 2002, Xie et al. 2008, Liu et al. 2012). Once the data have been grouped, a standard Pearson chi-square is calculated to evaluate the discrepancy between predicted and observed counts within the groups. The main problem with these kinds of tests is that the grouping process usually requires significant effort and attention by the data analyst, and there is a certain degree of arbitrariness in how it is done. What most analysts want is a test that can be easily and routinely implemented. And since there are several tests that fulfill that requirement, I shall restrict my attention to tests that can be calculated when there is only one case per profile and no grouping of observations. Based on my reading of the literature, I am prepared to recommend four statistics for widespread use:

1. Standardized Pearson Test. With ungrouped data, the formula for the classic Pearson chi-square test is

   X² = Σi (yi - π̂i)² / [π̂i (1 - π̂i)]

where yi is the dependent variable with values of 0 or 1, and π̂i is the predicted probability that yi = 1, based on the fitted model. As we've just discussed, the problem with the classic Pearson GOF test is that it does not have a chi-square distribution when the data are not grouped.
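The classic ungrouped Pearson statistic is a one-liner; the following Python sketch (mine, not the paper's) computes it from the outcomes and fitted probabilities:

```python
def pearson_x2_ungrouped(y, phat):
    # X^2 = sum over i of (y_i - p_i)^2 / (p_i * (1 - p_i))
    return sum((yi - p) ** 2 / (p * (1.0 - p)) for yi, p in zip(y, phat))

# With p_i = 0.5 for every case, each term is 0.25 / 0.25 = 1,
# so the statistic equals the sample size
print(pearson_x2_ungrouped([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))
```

As the text explains, with one case per profile this raw statistic cannot be referred to a chi-square distribution; it must first be standardized.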
But Osius and Rojek (1992) showed that X² has an asymptotic normal distribution with a mean and standard deviation that they derived. Subtracting the mean and dividing by the standard deviation yields a test statistic that has approximately a standard normal distribution under the null hypothesis. McCullagh (1985) derived a different mean and standard deviation after conditioning on the vector of estimated regression coefficients. In practice, these two versions of the standardized Pearson are nearly identical, especially in larger samples. Farrington (1996) also proposed a modified X² test, but his test does not work when there is only one case per profile. For the remainder of this paper, I shall refer to the standardized Pearson test as simply the Pearson test.
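The standardization step is the same for several of the tests below: subtract the derived mean, divide by the derived standard deviation, and refer the result to a standard normal. Here is a Python sketch (mine; the mean and standard deviation must come from the Osius-Rojek or McCullagh formulas, which are not reproduced here):

```python
import math

def standardized_two_sided_p(stat, mean, sd):
    # z = (statistic - mean) / sd, then a two-sided normal p-value:
    # P(|Z| > |z|) = erfc(|z| / sqrt(2))
    z = (stat - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))
```

Note the two-sided form, which is the version recommended by Osius and Rojek.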
2. Unweighted Sum of Squares. Copas (1989) proposed the test statistic

   USS = Σi=1..n (yi - π̂i)²

This statistic also has an asymptotic normal distribution under the null hypothesis, and Hosmer et al. (1997) showed how to get its mean and standard deviation. As with the Pearson test, subtracting the mean and dividing by the standard deviation yields a standard normal test statistic.

3. Information Matrix Test. White (1982) proposed a general approach to testing for model misspecification by comparing two different estimates of the covariance matrix of the parameter estimates (the negative inverse of the information matrix), one based on first derivatives of the log-likelihood function and the other based on second derivatives. If the fitted model is correct, the expected values of these two estimators should be the same. Orme (1988, 1990) showed how to apply this method to test models for binary data. The test statistic is

   IM = Σj=0..p Σi=1..n (yi - π̂i)(1 - 2π̂i) xij²

where the xij's are the p predictor variables in the model and xi0 = 1. After standardization with an appropriate variance, this statistic has approximately a chi-square distribution with p+1 degrees of freedom under the null hypothesis.

4. Stukel Test. Stukel (1988) proposed a generalization of the logistic regression model that has two additional parameters, thereby allowing either for asymmetry in the curve or for a different rate of approach to the (0,1) bounds. Special cases of the model include (approximately) the complementary log-log model and the probit model. The logistic model can be tested against this more general model by a simple procedure. Let gi be the linear predictor from the fitted model, that is, gi = xi'b, where xi is the vector of covariate values for individual i and b is the vector of estimated coefficients. Then create two new variables:

   za = gi² if gi >= 0, otherwise za = 0
   zb = gi² if gi < 0, otherwise zb = 0.

Add these two variables to the logistic regression model and test the null hypothesis that both of their coefficients are equal to 0.
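Sketches of these ingredients in Python (my illustration, not the paper's code; the USS and IM sums still require the standardization described in the text before they yield p-values):

```python
def uss(y, phat):
    # Copas's unweighted sum of squares: sum of (y_i - p_i)^2
    return sum((yi - p) ** 2 for yi, p in zip(y, phat))

def im_core(y, phat, X):
    # Raw information-matrix sum over covariates j and cases i:
    # (y_i - p_i) * (1 - 2 p_i) * x_ij^2, where each row of X
    # starts with the constant term 1
    return sum((yi - p) * (1.0 - 2.0 * p) * xij ** 2
               for yi, p, row in zip(y, phat, X) for xij in row)

def stukel_terms(g):
    # za carries g^2 on the upper half of the logit scale, zb on the lower
    za = [gi ** 2 if gi >= 0 else 0.0 for gi in g]
    zb = [gi ** 2 if gi < 0 else 0.0 for gi in g]
    return za, zb
```

The `stukel_terms` output would be appended as two extra columns of the design matrix, after which the joint test on their coefficients proceeds as described.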
Stukel proposed a score test, but there is no obvious reason to prefer that to a Wald test or a likelihood ratio test. Note that in many data sets, gi is either never greater than 0 or never less than 0. In such cases, only one z variable will be needed.

IMPLEMENTING THE TESTS

As we'll see, Stukel's test is easily performed in SAS without much difficulty. The others are not quite so easy. Fortunately, Oliver Kuss has written a SAS macro that will calculate these and other tests. In fact, he presented a paper on that macro at SUGI 26 in 2001. Currently, the macro can be downloaded at https://github.com/friendly/sas-macros/blob/master/goflogit.sas. Unfortunately, there is a major problem with this macro that I will explain later.

Let's apply these tests to the Mroz data used earlier. Recall that the HL test with ten groups yielded a p-value of .048, suggesting a need for interactions or non-linearities in the model. Here is the code for doing the Stukel test:

proc logistic data=my.mroz;
  model inlf(desc) = kidslt6 age educ huswage city exper;
  output out=a xbeta=xb;
data b;
  set a;
  za=xb**2*(xb>=0);
  zb=xb**2*(xb<0);
  num=1;
proc logistic data=b;
  model inlf(desc) = kidslt6 age educ huswage city exper za zb;
  test za=0, zb=0;
run;

We first fit the model of interest using PROC LOGISTIC. The OUTPUT statement produces a new data set A that contains all the variables in the model plus the new variable XB, which is the linear predictor based on the fitted
model. In the DATA step that follows, the two new variables needed for the Stukel test are created. In addition, NUM=1 creates a new variable that will be needed for the GOFLOGIT macro. The second PROC LOGISTIC step estimates the extended model with the two new variables, and tests the null hypothesis that both ZA and ZB have coefficients of 0. This produced a Wald chi-square of 0.12 (2 df), yielding a p-value of .94. A likelihood ratio chi-square (the difference in the -2 log L for the two models) produced almost identical results. Clearly there is no evidence against the model.

To calculate the other GOF statistics, we call the GOFLOGIT macro with the following statement:

%goflogit(data=b, y=inlf, xlist=kidslt6 age educ huswage city exper, trials=num)

The macro fits the logistic regression model with the dependent variable specified in Y= and the independent variables specified in XLIST=. TRIALS=NUM is necessary because the macro is designed to calculate GOF statistics for either grouped or ungrouped data. For ungrouped data, the number of trials must be set to 1, which is why I created the NUM=1 variable in the earlier DATA step. The output is shown in Table 5.

Results from the Goodness-of-Fit Tests

TEST                       Value    p-value
Standard Pearson Test    751.049      0.41
Standard Deviance        813.773      0.038
Osius-Test                 0.003      0.499
McCullagh-Test             0.029      0.489
Farrington-Test            0.000      1.000
IM-Test                   11.338      0.125
RSS-Test                 136.935      0.876

Table 5. Output from GOFLOGIT Macro

The first two tests are the classic Pearson and deviance statistics, with p-values that can't be trusted with ungrouped data. Osius and McCullagh are two different versions of the standardized Pearson. As noted earlier, the Farrington test is not appropriate for ungrouped data. What the macro labels as the RSS test is what I'm calling the USS test. The only test yielding a p-value less than .05 is the standard deviance but, as I said earlier, this test is useless for ungrouped data because it doesn't depend on the observed values of y.
The Farrington test is also useless because, with ungrouped data, it is always equal to 0. Notice that the Osius and McCullagh tests are very close, which has been the case with every data set that I've looked at. As reported here, the IM test is a chi-square statistic with df=7 (the number of covariates in the model, plus 1). The RSS value is just the sum of the squared residuals. Calculation of its p-value requires subtracting its approximate mean and dividing by its approximate standard deviation, and referring the result to a standard normal distribution.

PROPERTIES OF THE TESTS

Simulation results show that all these tests have about the right "size." That is, if a correct model is fitted, the proportion of times that the model is rejected is about the same as the chosen alpha level, say, .05. So, in that sense, all the tests are properly testing the same null hypothesis. But then we must ask two related questions: what sorts of departures from the model are these tests sensitive to, and how much power do they have to detect various alternatives? We can learn a little from theory and a little from simulation results.

Theory. When the data are naturally grouped, the classic Pearson and deviance tests are truly omnibus tests. That is, they respond to any non-linearity, interaction, or deviation from the specified link function. The newer tests appear to be more specific. For example, by its very design, the Stukel test should do well in detecting departures from the logit link function. Similarly, Osius and Rojek (1992) showed that the Pearson test can be derived as a score test for a parameter in a different generalization of the logit model. They describe this test as a powerful test against particular "alternatives concerning the link [function]." For the IM test, Chesher (1984) demonstrated that it is equivalent to a score test for the alternative hypothesis that the regression coefficients vary across individuals, rather than being the same for everyone. Similarly, Copas (1989) showed that the USS test can be derived as a score test for the alternative hypothesis that the πi's are
independent random draws from a distribution with constant variance and means determined by the xi's. These results suggest that both the IM and USS tests should be particularly sensitive to unobserved heterogeneity.

Simulation. There have been three major simulation studies designed to assess the power of goodness-of-fit tests for logistic regression: Hosmer et al. (1997), Hosmer and Hjort (2002) and Kuss (2002). For convenience, I'll refer to them as H+, HH, and K. Of the four statistics under consideration here, H+ includes the Pearson, USS and Stukel. HH only considers the Pearson and USS. K includes Pearson, USS and IM. All three studies use only sample sizes of 100 and 500. Here is a summary of their results for various kinds of departure from the standard logistic model:

Quadratic vs. linear effect of a covariate. H+ report that Pearson and USS have moderate power for N=100 and very good power (above 90% under most conditions) for N=500. Power for Stukel is similar but somewhat lower. HH get similar results for Pearson and USS. K, on the other hand, found no power for Pearson, and moderate to good power for USS and IM, with USS noticeably better than IM for N=100.

Interaction vs. linear effects of two covariates. H+ found virtually no power for all tests under all conditions. But they set up the simulation incorrectly, in my judgment. HH reported power of about 40% for both Pearson and USS at N=100 for a very strong interaction. At N=500, the power was over 90%. For weaker interactions, power ranged between 5% and 70%. K did not examine interactions. None of the simulations examined the power of Stukel or IM for testing interactions.

Alternative link functions. H+ found that Pearson, USS and Stukel generally had very low power at N=100 and only small to moderate power at N=500. Stukel was the best of the three. HH report similar results for USS and Pearson. Comparing logistic with complementary log-log, K found no power for Pearson and moderate power for IM and USS, with IM somewhat better.

Missing covariate and overdispersion.
K found that neither Pearson, USS nor IM had any appreciable power to detect these kinds of departures.

Discussion. The most alarming thing about these simulation studies is the inconsistency between Kuss and the other two studies regarding the performance of the Pearson test. H+ and HH found that Pearson had reasonably good power to detect several different kinds of misspecification. Kuss, on the other hand, found that the Pearson test had virtually no power to test either a quadratic model or a misspecified link function. Unfortunately, I believe that Kuss's simulations, which were based on his GOFLOGIT macro, have a major flaw. For the standardized Pearson statistics, he used a one-sided test rather than the two-sided test recommended by Osius and Rojek, and also by Hosmer and Lemeshow in the 2013 edition of their classic text, Applied Logistic Regression. When I replicated Kuss's simulations using a two-sided test, the results (shown below) were consistent with those of H+ and HH. There is also a flaw in the H+ simulations. For their interaction models, the fitted models removed the interaction and the main effect of one of the two variables. This does not yield a valid test of the interaction.

NEW SIMULATIONS

To address problems with previous simulations, I ran new simulations that included all the GOF tests considered here. Whenever appropriate, I tried to replicate the basic structure of the simulations used in previous studies. As reported in those studies, all the tests rejected the null hypothesis at about the nominal level when the fitted model was correct. So I shall only report the estimated power of the tests to reject the null hypothesis when the fitted model is incorrect. For each condition, 500 samples were drawn.

Linear vs. quadratic. The correct model was logit(π) = β0 + β1x + β2x². Values of the coefficients were the same as those used by Hosmer and Hjort. Coefficients were varied to emphasize or deemphasize the quadratic component, in four configurations: very low, low, medium and high. The variable x was uniformly distributed between -3 and 3. The fitted model deleted x².
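As an illustration of the simulation setup (my sketch, not the author's code, with placeholder coefficients rather than the Hosmer-Hjort values), the data-generating step for the quadratic condition looks like this:

```python
import math
import random

def simulate_quadratic(n, b0, b1, b2, seed=0):
    # x ~ Uniform(-3, 3); y ~ Bernoulli(p) with
    # logit(p) = b0 + b1*x + b2*x^2
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-3.0, 3.0)
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x + b2 * x * x)))
        data.append((x, 1 if rng.random() < p else 0))
    return data
```

A power estimate is then the fraction of simulated samples in which the chosen test, applied to the model that omits x², rejects at the .05 level.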
Sample sizes of 100 and 500 were examined. The linear model was rejected if the p-value for the GOF test fell below .05. In addition to the new GOF tests, I checked the power of the standard Wald chi-square test for β2, the coefficient for x². Table 6 shows the proportion of times that the linear model was rejected, i.e., estimates of the power of each test. When the quadratic effect is very low, none of the tests had any appreciable power. For the low quadratic effect, we see pretty good power at N=500, and some power at N=100. For the medium quadratic effect, N=500 gives near perfect power for all tests, but just moderate power at N=100. In this condition, the Stukel test seems noticeably weaker than the others. For the high quadratic effect, we see very good power at N=100, and 100 percent rejection at N=500. The Wald chi-square for the quadratic term generally does better than the GOF tests, especially at N=500.
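As an aside, the data-generating process for this quadratic scenario is easy to sketch. The simulations here were run in SAS; what follows is a hypothetical Python translation, with illustrative coefficient values rather than the actual Hosmer-Hjort configurations:

```python
import math
import random

def simulate_quadratic(n, b0, b1, b2, seed=2014):
    """Draw one sample of size n from logit(p) = b0 + b1*x + b2*x^2,
    with x uniform on (-3, 3), as in the linear-vs-quadratic scenario."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-3.0, 3.0)
        eta = b0 + b1 * x + b2 * x * x    # true (quadratic) linear predictor
        p = 1.0 / (1.0 + math.exp(-eta))  # inverse-logit link
        y = 1 if rng.random() < p else 0  # Bernoulli outcome
        data.append((x, y))
    return data

# The deliberately misspecified analysis would then fit logit(p) = b0 + b1*x,
# omitting x^2, and apply each GOF test to that fitted model.
sample = simulate_quadratic(500, b0=0.0, b1=0.5, b2=0.25)
```

Repeating this for 500 samples per condition and recording how often each test's p-value falls below .05 yields power estimates of the kind reported here.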
Quadratic Effect   Very Low        Low             Medium          High
N                  100    500      100    500      100    500      100    500
Osius              0.068  0.106    0.38   0.840    0.604  0.990    0.83   1.000
McCullagh          0.07   0.108    0.344  0.844    0.616  0.990    0.84   1.000
USS                0.064  0.066    0.348  0.890    0.654  0.994    0.858  1.000
IM                 0.048  0.070    0.9    0.87     0.584  0.994    0.86   1.000
Stukel             0.030  0.064    0.19   0.866    0.436  0.99     0.708  1.000
Wald X²            0.044  0.104    0.380  0.91     0.636  1.000    0.980  1.000

Table 6. Power Estimates for Detecting a Quadratic Effect

Linear vs. interaction. The correct model was logit(π) = β0 + β1x + β2d + β3xd. For the predictor variables, x was uniformly distributed between -3 and +3, d was dichotomous with values of -1 and +1, and the two variables were independent. Coefficients were chosen to represent varying levels of interaction. The fitted model deleted the product term xd. Sample sizes of 100 and 500 were examined. The linear model was rejected if the p-value for the GOF test fell below .05. Table 7 shows the proportion of times that the linear model was rejected.

Interaction   Very Low      Low           Medium        High          Very High
N             100    500    100    500    100    500    100    500    100    500
Osius         0.083  0.086  0.130  0.6    0.11   0.414  0.338  0.570  0.497  0.639
McCullagh     0.093  0.086  0.138  0.64   0.15   0.416  0.348  0.574  0.501  0.639
USS           0.079  0.086  0.130  0.54   0.11   0.406  0.340  0.566  0.499  0.631
IM            0.03   0.054  0.077  0.310  0.168  0.518  0.30   0.658  0.545  0.745
Stukel        0.059  0.14   0.114  0.664  0.74   0.906  0.41   0.95   0.634  0.964
Wald X²       0.10   0.46   0.34   0.950  0.666  1.000  0.864  1.000  0.966  1.000

Table 7. Power Estimates for Detecting an Interaction

In Table 7, we see that power to detect interaction with GOF tests is generally on the low side. Of the five new tests, Stukel clearly outperforms the others, especially at N=500. IM generally comes in second. But by comparison, the standard Wald chi-square test for the interaction is far superior to any of these tests. This illustrates the general principle that, while GOF tests may be useful in detecting unanticipated departures from the model, tests that target specific departures from the model are often much more powerful.

Incorrect Link Function.
Most software packages for binary regression offer only three link functions: logit, probit and complementary log-log. So the practical issue is whether GOF tests can discriminate among these three. Logit and probit curves are both symmetrical, so it is very hard to distinguish them. Instead, I'll focus on logit vs. complementary log-log (which is asymmetrical). The true model was linear in the complementary log-log: log(−log(1 − π)) = β0 + β1x, with x uniformly distributed between -3 and 3, β0 = 0 and β1 = .81. The fitted model was a standard logistic model. Results for the GOF tests are shown in Table 8. For N=100, none of the tests is any good. For N=500, the standardized Pearson tests are awful, USS is marginal, and IM and Stukel are half decent. Things look a little different in the last two columns, however, where I increased the sample size to 1,000. When the coefficient stays the same, IM and Stukel are still the best, although the others are much improved. But when I reduced the coefficient of x by half, the Pearson statistics look better than IM and Stukel. Why the reversal? As
others have noted, the Pearson statistic may be particularly sensitive to cases where the predicted value is near 1 or 0 and the observed value is in the opposite direction. That is because each squared residual gets weighted by 1/[π̂(1 − π̂)], which will be large when the predicted values are near 0 or 1. When β1 = .81, many of the predicted values are near 0 or 1. But when β1 = .405, a much smaller fraction of the predicted values are near 0 or 1. This suggests that the earlier simulations should also explore variation in the range of predicted values.

N            100 (β1 = .81)   500 (β1 = .81)   1000 (β1 = .81)   1000 (β1 = .405)
Osius        0                0.11             0.574             0.48
McCullagh    0                0.09             0.548             0.48
USS          0.054            0.290            0.586             0.430
IM           0.076            0.55             0.884             0.350
Stukel       0.036            0.478            0.878             0.35

Table 8. Power Estimates for Detecting an Incorrect Link Function

CLOSING POINTS

All of the new GOF tests with ungrouped data are potentially useful in detecting misspecification. For detecting interaction, the Stukel test was markedly better than the others. But it was somewhat weaker for detecting quadratic effects. None of the tests was great at distinguishing a logistic model from a complementary log-log model. The Pearson tests were much worse than the others when many predicted probabilities were close to 1 or 0, and better than the others when predicted probabilities were concentrated in the midrange. This suggests that more elaborate simulations are needed for a comparative evaluation of these statistics.

Tests for specific kinds of misspecification may be much more powerful than global GOF tests. This was particularly evident for interactions. For many applications a targeted approach may be the way to go.

I recommend using all these GOF tests. If your model passes all of them, you can feel relieved. If any one of them is significant, it is probably worth doing targeted tests. As with any GOF tests, when the sample size is quite large, it may not be possible to find any reasonably parsimonious model with a p-value greater than .05.
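As a concrete note on the two-sided issue raised earlier: for a standardized Pearson statistic that is asymptotically standard normal, as in the Osius-Rojek formulation, the two-sided p-value takes one line. Here is a minimal Python sketch using only the standard library; the function name is mine, not part of any macro:

```python
import math

def two_sided_p(z):
    """Two-sided p-value for an asymptotically N(0, 1) test statistic,
    such as a standardized Pearson goodness-of-fit statistic.
    Uses Phi(z) = (1 + erf(z / sqrt(2))) / 2 for the normal CDF."""
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

# A one-sided test reports only 1 - Phi(z), so it misses large negative
# departures entirely; the two-sided version catches both tails.
print(round(two_sided_p(1.96), 3))  # about .05, the conventional cutoff
```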
If you use the GOFLOGIT macro, modify it to calculate two-sided p-values for the Osius and McCullagh versions of the standardized Pearson statistic.

REFERENCES

Chesher, A. (1984) Testing for neglected heterogeneity. Econometrica 52: 865-872.

Cragg, J.G. and R.S. Uhler (1970) The demand for automobiles. The Canadian Journal of Economics 3: 386-406.

Copas, J.B. (1989) Unweighted sum of squares test for proportions. Applied Statistics 38: 71-80.

Cox, D.R. and E.J. Snell (1989) Analysis of Binary Data. Second Edition. Chapman & Hall.

Farrington, C.P. (1996) On assessing goodness of fit of generalized linear models to sparse data. Journal of the Royal Statistical Society, Series B 58: 344-366.

Hosmer, D.W. and N.L. Hjort (2002) Goodness-of-fit processes for logistic regression: Simulation results. Statistics in Medicine 21: 2723-2738.

Hosmer, D.W., T. Hosmer, S. le Cessie and S. Lemeshow (1997) A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine 16: 965-980.
Hosmer, D.W. and S. Lemeshow (1980) A goodness-of-fit test for the multiple logistic regression model. Communications in Statistics A10: 1043-1069.

Hosmer, D.W. and S. Lemeshow (2013) Applied Logistic Regression, 3rd Edition. New York: Wiley.

Kvalseth, T.O. (1985) Cautionary note about R². The American Statistician 39: 279-285.

Kuss, O. (2001) A SAS/IML macro for goodness-of-fit testing in logistic regression models with sparse data. Paper 265-26 presented at SAS Users Group International 26.

Kuss, O. (2002) Global goodness-of-fit tests in logistic regression with sparse data. Statistics in Medicine 21: 3789-3801.

Liu, Y., P.I. Nelson and S.S. Yang (2012) An omnibus lack of fit test in logistic regression with sparse data. Statistical Methods & Applications 21: 437-452.

McFadden, D. (1974) Conditional logit analysis of qualitative choice behavior. Pp. 105-142 in P. Zarembka (ed.), Frontiers in Econometrics. Academic Press.

Maddala, G.S. (1983) Limited Dependent and Qualitative Variables in Econometrics. Cambridge University Press.

McCullagh, P. (1985) On the asymptotic distribution of Pearson's statistics in linear exponential family models. International Statistical Review 53: 61-67.

Menard, S. (2000) Coefficients of determination for multiple logistic regression analysis. The American Statistician 54: 17-24.

Mittlböck, M. and M. Schemper (1996) Explained variation in logistic regression. Statistics in Medicine 15: 1987-1997.

Mroz, T.A. (1987) The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. Econometrica 55: 765-799.

Orme, C. (1988) The calculation of the information matrix test for binary data models. The Manchester School 54(4): 370-376.

Orme, C. (1990) The small-sample performance of the information-matrix test. Journal of Econometrics 46: 309-331.

Osius, G. and D. Rojek (1992) Normal goodness-of-fit tests for multinomial models with large degrees-of-freedom. Journal of the American Statistical Association 87: 1145-1152.

Nagelkerke, N.J.D. (1991) A note on a general definition of the coefficient of determination. Biometrika 78: 691-692.

Pigeon, J.G. and J.F. Heyse
(1999) An improved goodness of fit test for probability prediction models. Biometrical Journal 41: 71-82.

Press, S.J. and S. Wilson (1978) Choosing between logistic regression and discriminant analysis. Journal of the American Statistical Association 73: 699-705.

Pulkstenis, E. and T.J. Robinson (2002) Two goodness-of-fit tests for logistic regression models with continuous covariates. Statistics in Medicine 21: 79-93.

Stukel, T.A. (1988) Generalized logistic models. Journal of the American Statistical Association 83: 426-431.

Tjur, T. (2009) Coefficients of determination in logistic regression models - A new proposal: The coefficient of discrimination. The American Statistician 63: 366-372.

Tsiatis, A.A. (1980) A note on a goodness-of-fit test for the logistic regression model. Biometrika 67: 250-251.

White, H. (1982) Maximum likelihood estimation of misspecified models. Econometrica 50: 1-25.

Xie, X.J., J. Pendergast and W. Clarke (2008) Increasing the power: A practical approach to goodness-of-fit test for logistic regression models with continuous predictors. Computational Statistics & Data Analysis 52: 2703-2713.
CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Name: Paul D. Allison
Organization: University of Pennsylvania and Statistical Horizons LLC
Address: 3718 Locust Walk
City, State ZIP: Philadelphia, PA 19104-6299
Work Phone: 215-898-6717
Email: allison@statisticalhorizons.com
Web: www.pauldallison.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.