Proceedings of the WSEAS International Conference on ENVIRONMENT, MEDICINE and HEALTH SCIENCES

Validation and Performance Analysis of Binary Logistic Regression Model

SOHEL RANA¹, HABSHAH MIDI², AND S. K. SARKAR³
¹,²,³ Laboratory of Applied and Computational Statistics, Institute for Mathematical Research, University Putra Malaysia, 43400 Serdang, Selangor, MALAYSIA
E-mail: ¹srana_stat@yahoo.com, ²habshahmidi@gmail.com, ³sarojeu@yahoo.com

Abstract: Application of logistic regression modeling techniques without subsequent performance analysis of the predictive ability of the fitted model can yield poorly fitting results that inaccurately predict outcomes on new subjects. Model validation is possibly the most important step in the model building sequence. Model validity refers to the stability and reasonableness of the logistic regression coefficients, the plausibility and usability of the fitted logistic regression function, and the ability to generalize inferences drawn from the analysis. The aim of this study is to evaluate and measure how effectively the fitted logistic regression model describes the outcome variable both in the sample and in the population. A straightforward and fairly popular split-sample approach has been used here to validate the model. Different summary measures of goodness-of-fit and other supplementary indices of predictive ability of the fitted model indicate that the fitted binary logistic regression model can be used to predict new subjects.

Keywords: Validation, training sample, deviance, prediction error rate, ROC curve.

1 Introduction
Over the last decade, the binary logistic regression model has become, in many fields, the standard method of data analysis. An important problem is whether the results of the logistic regression analysis on the sample can be extended to the corresponding population. If this happens, then we say that the model has a good fit, and we refer to this question as a model validation analysis [6]. Application of modeling techniques without subsequent performance analysis of the obtained models can yield poorly fitting results that inaccurately predict outcomes on new subjects. Model validation is possibly the most important step in the model building sequence. It is also one of the most overlooked.
Model validity refers to the stability and reasonableness of the logistic regression coefficients, the plausibility and usability of the fitted logistic regression function, and the ability to generalize inferences drawn from the analysis. Often the validation of a model seems to consist of nothing more than quoting the Cox and Snell [4] R² or Nagelkerke [9] adjusted R² statistic, as well as the Correct Classification Rate (CCR) from the fit, which measures the fraction of the total variability in the response that is accounted for by the model. Unfortunately, a high R² value and a high percentage of CCR in a logistic regression model do not guarantee that the model fits the data well. A model that does not fit the data well cannot provide good answers to the underlying prediction or scientific questions under investigation. Hence validation is a useful and necessary part of the model-building process [7].

There are many statistical tools for model validation in binary logistic regression, but the primary tool for most process modeling applications is the analysis of summary measures of goodness-of-fit. Different types of summary measures of goodness-of-fit from a fitted model provide information on the adequacy of different aspects of the model. Logistic regression with binary data is an area in which graphical residual analysis can be difficult to interpret as a model validation [3]. The most accredited methods for obtaining a good internal validation of a model's performance are data-splitting, repeated data-splitting, the jackknife technique and bootstrapping. To validate the fitted model, the study used the data-splitting technique. This is a straightforward and fairly popular approach in which the data are randomly split into two parts: one to develop the model, and another to measure its performance.

The purpose of this study is to present a comprehensive approach to the internal validation of logistic regression as a predictive model. Our focus is to measure the predictive performance of a model, i.e. its ability to accurately predict the outcome variable on new subjects. Thus the aim of this study is to assess the goodness-of-fit of a given model, and to determine whether the model can be used to predict the outcome of a new subject not included in the original or training sample.

ISSN: 1790-5125, ISBN: 978-960-474-170-0

2 Materials and Methods
The Bangladesh Demographic and Health Survey (BDHS-2004) is a part of the worldwide Demographic and Health Surveys program and a source of population and health data for policymakers and the research community. In the survey a total of 11,440 eligible women furnished their responses. This analysis considers only the 2,212 eligible women who have two living children and are able to bear and desire more children, during the period of the global two-children campaign. The variables age of the respondent, fertility preference, place of residence, highest year of education, working status and expected number of children are considered in the analysis.

The variable fertility preference involves responses to the question "Would you like to have (a/another) child?". The responses, coded 0 for "no more" and 1 for "have another", form the binary response variable (Y) in the analysis. The age of the respondent (X1), place of residence (X2), coded 0 for urban and 1 for rural, highest year of education (X3), working status of the respondent (X4), coded 0 for not working and 1 for working, and expected number of children (X5), coded 0 for two or less and 1 for more than two, are considered as covariates in the binary logistic regression model.

The data-splitting approach has been used to validate the fitted model. Since the sample size is large enough, the data are split into two sets. The study randomly selected 1,349 (60%) observations as a training sample and the remaining 863 (40%) observations as a validation sample [6], because the validation data set needs to be smaller than the model-building or training data set. First, we use the training sample to fit the model.
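The random split described above can be sketched as follows. This is a minimal illustration in plain Python, not the procedure or software used in the study; the function name `split_sample`, the seed, and the stand-in record list are assumptions made for the example.

```python
import random

def split_sample(records, train_fraction=0.6, seed=0):
    """Randomly split records into a training part and a validation part."""
    rng = random.Random(seed)            # fixed seed makes the split reproducible
    indices = list(range(len(records)))
    rng.shuffle(indices)                 # random order of all record positions
    n_train = round(train_fraction * len(records))
    train = [records[i] for i in indices[:n_train]]
    valid = [records[i] for i in indices[n_train:]]
    return train, valid

# Hypothetical stand-in for the 2,212 survey records (integers instead of rows).
records = list(range(2212))
train, valid = split_sample(records)
print(len(train), len(valid))
```

With an exact 60% fraction the sketch yields 1,327 training and 885 validation records, which is close to, but not the same as, the 1,349 and 863 observations reported in the study.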
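The fitted model is then taken as it is, applied to the validation sample, and evaluated by different summary measures of goodness-of-fit.

3 Fitting of the model for Training Sample
Consider a collection of $p$ explanatory variables denoted by the vector $X' = (X_1, X_2, \ldots, X_p)$, and let the conditional probability that the outcome is present be denoted by $P(Y = 1 \mid X) = \pi$. The logit of having $Y = 1$ is modeled as a linear function of the explanatory variables as

$$\ln\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p; \quad 0 \le \pi \le 1 \tag{1}$$

where the function

$$\pi = \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)}$$

is known as the logistic function.

Suppose $(y_1, y_2, \ldots, y_n)$ are $n$ independent random observations corresponding to the random variables $(Y_1, Y_2, \ldots, Y_n)$. Since $Y_i$ is a Bernoulli random variable, the probability function of $Y_i$ is

$$f_i(Y_i) = \pi_i^{Y_i}(1-\pi_i)^{1-Y_i}; \quad Y_i = 0 \text{ or } 1; \quad i = 1, 2, \ldots, n.$$

As the $Y_i$ are assumed to be independent, the likelihood function is given by

$$g(Y_1, Y_2, \ldots, Y_n) = \prod_{i=1}^{n} f_i(Y_i) = \prod_{i=1}^{n} \pi_i^{Y_i}(1-\pi_i)^{1-Y_i}$$

and the log-likelihood function is

$$L(\beta_0, \beta_1, \ldots, \beta_p) = l = \sum_{i=1}^{n} Y_i(\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}) - \sum_{i=1}^{n} \ln\{1 + \exp(\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip})\} \tag{2}$$

The well-known Newton-Raphson iterative method, which in this setting is known as the Iteratively Reweighted Least Squares (IRLS) algorithm, can be used to maximize the log-likelihood in equation (2).

Table 1 shows the coefficients $\hat\beta$, their standard errors, the Wald chi-square statistics, the associated p-values, and the odds ratios $\exp(\hat\beta)$. To determine the worth of the individual regressors in logistic regression, the Wald statistic is used, defined as

$$W_j = \frac{\hat\beta_j^2}{[S.E.(\hat\beta_j)]^2}, \quad j = 1, 2, \ldots, 5 \quad [2].$$

Under the null hypothesis $H_0: \beta_j = 0$, the statistic $W_j$ is approximately distributed as chi-square with a single degree of freedom. The Wald chi-square statistics in Table 1 agree reasonably well with the assumption that all the individual predictors contribute significantly to predicting the response variable.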
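The Newton-Raphson iteration and the Wald statistic described above can be sketched in a few lines. The following is a minimal, dependency-free Python illustration, not the software used in the study; the function names `invert` and `fit_logistic` and the toy data are assumptions made for the example.

```python
import math

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_logistic(X, y, max_iter=25, tol=1e-10):
    """Newton-Raphson fit; returns coefficients and Wald chi-square statistics."""
    rows = [[1.0] + list(x) for x in X]           # prepend the intercept column
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        pi = [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, r))))
              for r in rows]
        # Score vector X'(y - pi) and information matrix X'WX, W = diag(pi(1 - pi)).
        score = [sum(r[j] * (yi - p) for r, yi, p in zip(rows, y, pi))
                 for j in range(k)]
        info = [[sum(r[a] * r[b] * p * (1.0 - p) for r, p in zip(rows, pi))
                 for b in range(k)] for a in range(k)]
        cov = invert(info)                        # estimated covariance of beta
        step = [sum(cov[a][b] * score[b] for b in range(k)) for a in range(k)]
        if max(abs(s) for s in step) < tol:       # converged: keep beta and cov
            break
        beta = [b + s for b, s in zip(beta, step)]
    wald = [beta[j] ** 2 / cov[j][j] for j in range(k)]   # W_j = beta_j^2 / SE_j^2
    return beta, wald

# Toy data (hypothetical): one binary covariate, eight subjects.
X = [[0], [0], [0], [0], [1], [1], [1], [1]]
y = [0, 0, 0, 1, 0, 1, 1, 1]
beta, wald = fit_logistic(X, y)
print(beta, wald)
```

For this toy table the maximum likelihood estimates have a closed form (intercept $-\ln 3$, slope $2\ln 3$), which gives a convenient check on the iteration. The covariance matrix is the inverse of the information matrix, so each Wald statistic is the squared coefficient divided by the corresponding diagonal entry.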
Table 1 Analysis of maximum likelihood estimates

Variable    Coefficient β   Standard error   Wald chi-square statistic   df   p-value   Odds Ratio Exp(β)
X1          -0.053          0.0              .534                        1    0.000     0.949
X2          0.45            0.46             9.55                        1    0.00      .57
X3          -0.085          0.08             .690                        1    0.000     0.99
X4          -0.449          0.67             7.76                        1    0.007     0.638
X5          .453            0.58             4.058                       1    0.000     .68
Intercept   0.389           0.343            .90                         1    0.56      .476

The likelihood ratio test is performed to test the overall significance of all coefficients in the model on the basis of the test statistic

$$G = -2[(\ln L_0) - (\ln L_1)] \tag{3}$$

where $L_0$ is the likelihood of the null model and $L_1$ is the likelihood of the fitted model containing all five covariates. Under the null hypothesis $H_0: \beta_1 = \beta_2 = \cdots = \beta_5 = 0$, the statistic $G$ follows a chi-square distribution with 5 degrees of freedom and measures how well the independent variables jointly affect the response variable. In the study, $G = 403.733$ with $p < 0.001$, which indicates that, as a whole, the independent variables contribute significantly to predicting the response variable.

To assess the overall goodness-of-fit, Hosmer and Lemeshow [5] and Lemeshow and Hosmer [10] proposed grouping based on the values of the estimated probabilities. Using this grouping strategy, the Hosmer-Lemeshow goodness-of-fit statistic $\hat{C}$, under the usual notation, is

$$\hat{C} = \sum_{k=1}^{g} \frac{(o_k - n'_k \bar\pi_k)^2}{n'_k \bar\pi_k (1 - \bar\pi_k)} \tag{4}$$

where $o_k$ is the observed number of positive responses, $n'_k$ the number of subjects and $\bar\pi_k$ the average estimated probability in the $k$th group. Hosmer and Lemeshow [5] demonstrated that, under the null hypothesis that the fitted logistic regression model is the correct model, the distribution of the statistic $\hat{C}$ is well approximated by the chi-square distribution with $g - 2$ degrees of freedom. This test is more reliable and robust than the traditional chi-square test [1]. The value of the Hosmer-Lemeshow goodness-of-fit statistic computed from the frequencies is $\hat{C} = 5.09$, and the corresponding p-value computed from the chi-square distribution with 8 degrees of freedom is 0.74. The large p-value signifies that there is no significant difference between the observed and the predicted values of the outcome. This indicates that the model fits reasonably well.
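The grouping strategy behind equation (4) can be illustrated with a short sketch. It is a plain-Python approximation under stated assumptions, not the routine used in the study: the function name `hosmer_lemeshow` is hypothetical, groups are formed by sorting the fitted probabilities and cutting them into near-equal blocks, and every group is assumed to be non-empty with a mean probability strictly between 0 and 1.

```python
def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow statistic from outcomes y (0/1) and fitted probabilities p."""
    order = sorted(range(len(p)), key=lambda i: p[i])   # rank subjects by p
    n = len(order)
    stat = 0.0
    for k in range(groups):
        members = order[k * n // groups:(k + 1) * n // groups]  # near-equal groups
        n_k = len(members)
        o_k = sum(y[i] for i in members)                # observed positives
        pbar_k = sum(p[i] for i in members) / n_k       # mean fitted probability
        stat += (o_k - n_k * pbar_k) ** 2 / (n_k * pbar_k * (1.0 - pbar_k))
    return stat

# Toy data (hypothetical): two groups of four subjects each.
y = [0, 0, 1, 0, 1, 1, 0, 1]
p = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
print(hosmer_lemeshow(y, p, groups=2))
```

In the toy data each group has four subjects, an observed count that differs from its expectation by 0.2, and a variance term of 0.64, so each group contributes 0.0625 to the statistic. In practice the result would be compared with a chi-square distribution with $g - 2$ degrees of freedom, as stated above.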
The other supplementary summary measures of goodness-of-fit, namely a Cox and Snell R² of 0.26, a Nagelkerke adjusted R² of 0.35 and a predicted correct classification rate of 77.4%, indicate that the model fits the data at an acceptable level. Thus the fitted binary logistic response function from the training sample is

$$\hat\pi = [1 + \exp(-0.389 + 0.053 X_1 - 0.45 X_2 + 0.085 X_3 + 0.449 X_4 - .453 X_5)]^{-1} \tag{5}$$

Suppose that the validation sample consists of $n_v$ observations $(y_i, x_i)$, $i = 1, 2, \ldots, n_v$, which may be grouped into $J_v$ covariate patterns. If some subjects have the same value of $x$, then $J_v < n_v$. We denote the number of subjects with $x = x_j$ by $m_j$, $j = 1, 2, \ldots, J_v$. It follows that $\sum_j m_j = n_v$. Let $y_j$ denote the number of positive responses among the $m_j$ subjects with covariate pattern $x = x_j$, for $j = 1, 2, \ldots, J_v$. For the validation sample under study, the number of covariate patterns is $J_v = 626$. The logistic probability for the $j$th covariate pattern is $\hat\pi_j$, the value of the previously estimated logistic model in equation (5) evaluated at the covariate pattern $x_j$ from the validation sample. These quantities become the basis for the computation of summary measures of fit such as the Hosmer-Lemeshow goodness-of-fit statistic, the prediction error rate and the area under the Receiver Operating Characteristic (ROC) curve. Each of these summary measures of goodness-of-fit is considered in turn in the following.

3.1 Hosmer-Lemeshow Goodness-of-fit Test
The Hosmer-Lemeshow goodness-of-fit test may be used to obtain a summary test statistic for the validation sample. Let $n_k$ denote the approximately $n_v/g$, or $n_v/10$, subjects in the $k$th decile. Let $O_k = \sum y_j$ be the number of positive responses among the covariate patterns falling in the $k$th decile. The estimate of the expected value of $O_k$ under the assumption that the fitted model is correct is $E_k = \sum m_j \hat\pi_j$. Thus the Hosmer-Lemeshow test statistic is obtained as the Pearson chi-square statistic computed from the observed and expected frequencies as

$$C_v = \sum_{k=1}^{g} \frac{(O_k - E_k)^2}{n_k \bar\pi_k (1 - \bar\pi_k)} \tag{8}$$

where $\bar\pi_k = \sum m_j \hat\pi_j / n_k$. The subscript $v$ has been added to $C$ to emphasize that the statistic has been calculated from a validation sample. Under the hypothesis that the model is correct, and the assumption that each $E_k$ is sufficiently large for each term in $C_v$ to be distributed as $\chi^2(1)$, it follows that $C_v$ is distributed as $\chi^2(10)$. The results presented in Table 2 indicate that the model fits quite well.

Table 2 Hosmer-Lemeshow goodness-of-fit chi-square statistic
Columns: Decile (k), Mean predicted prob., Total observation, Observed positive response (O_k), Expected positive response (E_k); χ² = 5.57, p-value = 0.85

.077734 63 7 4.89594
.378498 63 8.68454
3.0333 63 6.8093
4.3775 6 4.36649
5.34397 63 0.66644
6.537998 6 35 33.359
7.84834 63 49 5.44379
8.750874 63 43 45.68
9.99760 6 5 5.65
0.98966 6 76 75.53588

Table 3 Predicted classification table based on Training sample and Validation sample taking 0.5 as cutoff

Training Sample
                     Expected (Y)
Observed (Y)         0      1      Total
No more (0)          785    66     851
Have another (1)     239    259    498

Validation Sample
Columns: Expected (Y) 0, Expected (Y) 1, Total
No more (0)          307 48
Have another (1)     58 50 08

3.2 Validation of Prediction Error Rate
The classification table may then be used to compute statistics such as the prediction error rate, the area under the ROC curve, and the positive and negative predictive power. The reliability of the prediction error rate observed in the training data set is examined by applying the chosen prediction rule to a validation data set. If the new prediction error rate is about the same as that for the training data set, then the latter gives a reliable indication of the predictive ability of the fitted binary logistic regression model and the chosen prediction rule. If the new data lead to a considerably higher prediction error rate, then the fitted binary logistic regression model and the chosen prediction rule do not predict new observations as well as originally indicated [8].
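The two validation indices discussed in this section, the prediction error rate and the area under the ROC curve, can be sketched as follows. This is a plain-Python illustration rather than the authors' computation; the function names are hypothetical, and the training counts 785, 66, 239 and 259 are an assumed reading of Table 3, chosen because they sum to the 1,349 training observations.

```python
def prediction_error_rate(y, p, cutoff=0.5):
    """Share of cases whose predicted class (p >= cutoff) differs from the outcome."""
    wrong = sum((prob >= cutoff) != bool(obs) for obs, prob in zip(y, p))
    return wrong / len(y)

def error_rate_from_table(n00, n01, n10, n11):
    """Error rate from a 2x2 table; n01 and n10 are the misclassified counts."""
    return (n01 + n10) / (n00 + n01 + n10 + n11)

def roc_auc(y, p):
    """Area under the ROC curve: probability that a positive case receives a
    higher fitted probability than a negative case (ties count one half)."""
    pos = [prob for obs, prob in zip(y, p) if obs == 1]
    neg = [prob for obs, prob in zip(y, p) if obs == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Assumed training counts from Table 3 (observed 0/1 by predicted 0/1).
train_error = error_rate_from_table(785, 66, 239, 259)
print(round(100 * train_error, 1))

# Toy data (hypothetical) for the case-level functions.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(prediction_error_rate(y, p), roc_auc(y, p))
```

Under the assumed counts, 305 of 1,349 training cases are misclassified, an error rate of about 22.6%. The pairwise definition of the area under the curve is quadratic in the sample size, which is acceptable for samples of this order; a rank-based formula would be the usual choice for much larger data.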
In the current study, the fitted logistic response function based on the training sample, given in (5), was used to calculate the estimated probabilities for the 626 cases of the validation data set. The chosen prediction rule is applied to the estimated probabilities as follows: predict 1 if $\hat\pi \ge 0.5$ and predict 0 if $\hat\pi < 0.5$. The percent prediction error rate for the validation sample given in Table 3 is 26.9, while the rate for the training sample was 22.6. Thus the total prediction error rate for the validation sample is not considerably higher than that for the training sample, and we may conclude that the training rate is a reliable indicator of the predictive capability of the fitted logistic regression model.

The area under the ROC curve is another summary measure of the model's predictive power. In the present study the area under the ROC curve for the training sample was 0.80, while the area for the validation sample is 0.72. The area under the ROC curve for the validation sample is only slightly smaller than that for the training sample, so the predictive ability of the fitted logistic response function for new subjects may be considered acceptable.

4 Discussion and Conclusion
Model validation is done to ascertain whether predicted values from the model are likely to accurately predict responses on future subjects. Internal validation involves fitting and validating the model by carefully splitting one series of subjects into a training set and a validation set. The study evaluated the model performance on the validation data set based on the model developed on the training set. Comprehensive approaches to the validation of the predictive logistic regression model have been introduced in the study. Different summary measures of goodness-of-fit and indices have been used to calibrate the model. Summary measures such as the Hosmer-Lemeshow goodness-of-fit test suggest that the fitted logistic regression model has significant predictive ability for future subjects. The prediction error rate in the validation of the model is not high. The area under the ROC curve for the training sample was 0.80 and it decreased by 0.08 to 0.72 for the validation sample, which indicates that the predictive ability of the fitted model is good. Thus different summary measures of goodness-of-fit and other supplementary indices of predictive ability of the fitted model indicate that the fitted binary logistic regression model can be used to predict future subjects.

References
[1] A. Agresti, Categorical Data Analysis, Wiley InterScience, New York, 2002.
[2] A. Wald, Tests of statistical hypotheses concerning several parameters when the number of observations is large, Transactions of the American Mathematical Society, Vol.54, 1943, pp. 426-482.
[3] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Chapman and Hall/CRC, 1983.
[4] D. R. Cox and E. J. Snell, The Analysis of Binary Data, 2nd edition, Chapman and Hall, London, 1989.
[5] D. W. Hosmer and S. Lemeshow, A goodness-of-fit test for the multiple logistic regression models, Communications in Statistics, Vol.A10, 1980, pp. 1043-1069.
[6] F. E. Harrell, K. L. Lee and D. B. Mark, Tutorial in Biostatistics: Multivariable prognostic models: Issues in developing models, evaluating assumptions and measuring and reducing errors, Statistics in Medicine, Vol.15, 1996, pp. 361-387.
[7] J. Shao, Linear Model Selection by Cross-Validation, Journal of the American Statistical Association, Vol.80, No.422, 1993, pp. 486-494.
[8] M. H. Kutner, C. J. Nachtsheim, J. Neter and W. Li, Applied Linear Statistical Models, Fifth Edition, McGraw-Hill, Irwin, 2005.
[9] N. J. D. Nagelkerke, A note on the general definition of the coefficient of determination, Biometrika, Vol.78, 1991, pp. 691-692.
[10] S. Lemeshow and D. W. Hosmer, The use of goodness-of-fit statistics in the development of logistic regression models, American Journal of Epidemiology, Vol.115, 1982, pp. 92-106.