Validation and Performance Analysis of Binary Logistic Regression Model




Proceedings of the WSEAS International Conference on ENVIRONMENT, MEDICINE and HEALTH SCIENCES

SOHEL RANA^1, HABSHAH MIDI^2, AND S. K. SARKAR^3
[1,2,3] Laboratory of Applied and Computational Statistics, Institute for Mathematical Research, University Putra Malaysia, 43400 Serdang, Selangor, MALAYSIA
E-mail: ^1 srana_stat@yahoo.com, ^2 habshahmidi@gmail.com, ^3 sarojeu@yahoo.com

Abstract: Application of logistic regression modeling techniques without subsequent performance analysis regarding the predictive ability of the fitted model can result in poorly fitting models that inaccurately predict outcomes on new subjects. Model validation is possibly the most important step in the model building sequence. Model validity refers to the stability and reasonableness of the logistic regression coefficients, the plausibility and usability of the fitted logistic regression function, and the ability to generalize inferences drawn from the analysis. The aim of this study is to evaluate and measure how effectively the fitted logistic regression model describes the outcome variable both in the sample and in the population. A straightforward and fairly popular split-sample approach has been used here to validate the model. Different summary measures of goodness-of-fit and other supplementary indices of predictive ability of the fitted model indicate that the fitted binary logistic regression model can be used to predict new subjects.

Keywords: Validation, training sample, deviance, prediction error rate, ROC curve

1 Introduction
Over the last decade, the binary logistic regression model has become, in many fields, the standard method of data analysis. An important problem is whether the results of the logistic regression analysis on the sample can be extended to the corresponding population. If this happens, then we say that the model has a good fit, and we refer to this question as a model validation analysis [6]. Application of modeling techniques without subsequent performance analysis of the obtained models can result in poorly fitting models that inaccurately predict outcomes on new subjects. Model validation is possibly the most important step in the model building sequence. It is also one of the most overlooked.
Model validity refers to the stability and reasonableness of the logistic regression coefficients, the plausibility and usability of the fitted logistic regression function, and the ability to generalize inferences drawn from the analysis. Often the validation of a model seems to consist of nothing more than quoting the Cox and Snell [4] R² or Nagelkerke [9] adjusted R² statistic, which measures the fraction of the total variability in the response that is accounted for by the model, together with the Correct Classification Rate (CCR) from the fit. Unfortunately, a high R² value and a high CCR in a logistic regression model do not guarantee that the model fits the data well. A model that does not fit the data well cannot provide good answers to the underlying prediction or scientific questions under investigation. Hence validation is a useful and necessary part of the model-building process [7].

There are many statistical tools for model validation in binary logistic regression, but the primary tool for most process modeling applications is the analysis of summary measures of goodness-of-fit. Different types of summary measures of goodness-of-fit from a fitted model provide information on the adequacy of different aspects of the model. Logistic regression with binary data is an area in which graphical residual analysis can be difficult to interpret as a model validation [3]. The most accredited methods for obtaining a good internal validation of a model's performance are data-splitting, repeated data-splitting, the jackknife technique and bootstrapping. In order to validate the fitted model, the study used the data-splitting technique. This is a straightforward and fairly popular approach in which the data are randomly split into two parts: one to develop the model, and another to measure its performance.

The purpose of this study is to present a comprehensive approach to the internal validation of logistic regression as a predictive model. Our focus is to measure the predictive performance of a model, i.e. its ability to accurately predict the outcome variable on new subjects. Thus the aim of this study is to assess the goodness-of-fit of a given model, and to determine whether the model can be used to predict the outcome of a new subject not included in the original or training sample.

ISSN: 1790-5125, ISBN: 978-960-474-170-0

2 Materials and Methods
The Bangladesh Demographic and Health Survey (BDHS-2004) is a part of the worldwide Demographic and Health Surveys program and a source of population and health data for policymakers and the research community. In the survey a total of 11,440 eligible women furnished their responses. In this analysis, only the 2,212 eligible women who have two living children and who are able to bear and desire more children are considered, during the period of the global two-children campaign. The variables age of the respondent, fertility preference, place of residence, highest year of education, working status and expected number of children are considered in the analysis. The variable fertility preference involves responses to the question, "Would you like to have (a/another) child?" The responses, coded 0 for "no more" and 1 for "have another", form the binary response variable (Y) in the analysis. The covariates in the binary logistic regression model are the age of the respondent (X1); place of residence (X2), coded 0 for urban and 1 for rural; highest year of education (X3); working status of the respondent (X4), coded 0 for not working and 1 for working; and expected number of children (X5), coded 0 for two or less and 1 for more than two.

The data-splitting approach has been used to validate the fitted model. Since the sample size is large enough, the data are split into two sets. The study selected 1349 (60%) observations randomly as a training sample and the remaining 863 (40%) observations as a validation sample [6], because the validation data set will need to be smaller than the model-building or training data set. First, we use the training sample to fit the model.
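A random split of this kind can be sketched in a few lines of Python. This is an illustration only: the function name `split_sample`, the seed, and the use of row indices in place of survey records are assumptions made here, and the sketch does not reproduce the particular random draw used in the study.

```python
import random


def split_sample(records, n_train, seed=0):
    """Randomly partition records into a training part and a validation part.

    n_train is the number of records placed in the training sample; the
    remaining records form the validation sample. The seed is arbitrary.
    """
    rng = random.Random(seed)      # local generator, global state untouched
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical stand-in for the 2,212 eligible women: row indices only.
training, validation = split_sample(range(2212), n_train=1349)
print(len(training), len(validation))
```

Fixing the seed makes the partition repeatable, which matters when the same training and validation sets have to be reused across several goodness-of-fit measures.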
Then we take the fitted model as it is, apply it to the validation sample, and evaluate the model's performance by different summary measures of goodness-of-fit.

3 Fitting of the model for Training Sample
Consider a collection of p explanatory variables denoted by the vector X' = (X_1, X_2, ..., X_p), and let the conditional probability that the outcome is present be denoted by P(Y = 1 | X) = π. Then the logit of having Y = 1 is modeled as a linear function of the explanatory variables as

$$\ln\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p; \qquad 0 \le \pi \le 1 \tag{1}$$

where the function

$$\pi = \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)}$$

is known as the logistic function. Let (y_1, y_2, ..., y_n) be the n independent random observations corresponding to the random variables (Y_1, Y_2, ..., Y_n). Since Y_i is a Bernoulli random variable, the probability function of Y_i is

$$f(Y_i) = \pi_i^{Y_i}(1-\pi_i)^{1-Y_i}; \qquad Y_i = 0 \text{ or } 1; \quad i = 1, 2, \ldots, n.$$

As the Y_i are assumed to be independent, the likelihood function is given by

$$g(Y_1, Y_2, \ldots, Y_n) = \prod_{i=1}^{n} \pi_i^{Y_i}(1-\pi_i)^{1-Y_i}$$

and the log-likelihood function is

$$\ln L(\beta_0, \beta_1, \ldots, \beta_p) = \sum_{i=1}^{n} Y_i(\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}) - \sum_{i=1}^{n} \ln\{1 + \exp(\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip})\} \tag{2}$$

The well-known Newton-Raphson iterative method can be used to maximize (2); in this setting it is known as the Iteratively Reweighted Least Squares (IRLS) algorithm. Table 1 shows the coefficients β, their standard errors, the Wald chi-square statistics, the associated p-values, and the odds ratios exp(β). In order to determine the worth of an individual regressor in logistic regression, the Wald statistic is defined as

$$W_j = \left[\frac{\hat\beta_j}{S.E.(\hat\beta_j)}\right]^2, \qquad j = 1, 2, \ldots, 5$$

[2]. Under the null hypothesis H_0: β_j = 0, the statistic W_j is approximately distributed as chi-square with a single degree of freedom. The Wald chi-square statistics in Table 1 agree reasonably well with the assumption that all the individual predictors make a significant contribution to predicting the response variable.
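The Newton-Raphson fit and the Wald statistic described above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the software used in the study: the helper names (`solve`, `score_and_information`, `fit_logistic`, `wald_test`) and the toy data are hypothetical, and no safeguards for separated data or ill-conditioned matrices are included.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                      # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def score_and_information(X, y, beta):
    """Gradient of the log-likelihood and the information matrix X'WX."""
    k = len(beta)
    p = [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row))))
         for row in X]
    score = [sum((yi - pi) * row[j] for row, yi, pi in zip(X, y, p))
             for j in range(k)]
    info = [[sum(pi * (1.0 - pi) * row[a] * row[b] for row, pi in zip(X, p))
             for b in range(k)] for a in range(k)]
    return score, info


def fit_logistic(covariates, y, max_iter=50, tol=1e-10):
    """Newton-Raphson fit; returns coefficients (intercept first) and SEs."""
    X = [[1.0] + list(row) for row in covariates]
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        score, info = score_and_information(X, y, beta)
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    _, info = score_and_information(X, y, beta)
    # Entry j of column j of the inverse information is the variance of beta_j.
    se = [math.sqrt(solve(info, [1.0 if i == j else 0.0 for i in range(k)])[j])
          for j in range(k)]
    return beta, se


def wald_test(beta_j, se_j):
    """Wald chi-square statistic and its p-value on one degree of freedom."""
    w = (beta_j / se_j) ** 2
    return w, math.erfc(math.sqrt(w / 2.0))   # upper tail of chi-square(1)


# Toy data (hypothetical): one binary covariate, with 1 of 4 positive
# responses when x = 0 and 3 of 4 positive responses when x = 1.
x = [[0], [0], [0], [0], [1], [1], [1], [1]]
y = [1, 0, 0, 0, 1, 1, 1, 0]
beta, se = fit_logistic(x, y)
w, p_value = wald_test(beta[1], se[1])
```

With a single binary covariate the fit can be checked by hand: the slope equals the log odds ratio of the 2×2 table, here 2 ln 3, and its variance is the sum of the reciprocal cell counts, 8/3.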

Table 1 Analysis of maximum likelihood estimates

Variable    Coefficient β   Standard error   Wald chi-square   df   p-value   Odds ratio Exp(β)
X1          -0.053          0.011            .534              1    0.000     0.949
X2          0.45            0.146            9.55              1    0.002     1.57
X3          -0.085          0.018            .690              1    0.000     0.919
X4          -0.449          0.167            7.76              1    0.007     0.638
X5          2.453           0.158            241.058           1    0.000     11.628
Intercept   0.389           0.343            1.290             1    0.256     1.476

The likelihood ratio test is performed to test the overall significance of all coefficients in the model on the basis of the test statistic

$$G = -2\left[\ln L_0 - \ln L_1\right] \tag{3}$$

where L_0 is the likelihood of the null model and L_1 is the likelihood of the fitted model containing all five covariates. Under the null hypothesis H_0: β_1 = β_2 = ... = β_5 = 0, the statistic G follows a chi-square distribution with 5 degrees of freedom and measures how well the independent variables affect the response variable. In the study, G = 403.733 with p < 0.001, which indicates that, as a whole, the independent variables make a significant contribution to predicting the response variable.

In order to assess the overall goodness-of-fit, Hosmer and Lemeshow [5] and Lemeshow and Hosmer [10] proposed grouping based on the values of the estimated probabilities. Using this grouping strategy, the Hosmer-Lemeshow goodness-of-fit statistic Ĉ, under the usual notation, is

$$\hat{C} = \sum_{k=1}^{g} \frac{(o_k - n_k\bar\pi_k)^2}{n_k\bar\pi_k(1-\bar\pi_k)} \tag{4}$$

where g is the number of groups, n_k is the number of subjects in the kth group, o_k is the observed number of positive responses in that group, and π̄_k is the average estimated probability in that group. Hosmer and Lemeshow [5] demonstrated that, under the null hypothesis that the fitted logistic regression model is the correct model, the distribution of the statistic Ĉ is well approximated by the chi-square distribution with g−2 degrees of freedom. This test is more reliable and robust than the traditional chi-square test [1]. The value of the Hosmer-Lemeshow goodness-of-fit statistic computed from the frequencies is Ĉ = 5.09, and the corresponding p-value computed from the chi-square distribution with 8 degrees of freedom is 0.74. The large p-value signifies that there is no significant difference between the observed and the predicted values of the outcome. This indicates that the model fits reasonably well.
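The grouping statistic in (4) and its chi-square p-value can be sketched as follows. Several choices here are assumptions for illustration: the function names are hypothetical, groups are formed by sorting the estimated probabilities and cutting them into g groups of (nearly) equal size, and the p-value helper uses a closed form that is valid only for an even number of degrees of freedom (which covers the 8 and 10 degrees of freedom used in this paper). Statistical packages may treat ties and group boundaries differently.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail chi-square probability for even df.

    For df = 2m the tail is exp(-x/2) * sum_{i=0}^{m-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 0.0
    for i in range(df // 2):
        total += term
        term *= half / (i + 1)
    return math.exp(-half) * total


def hosmer_lemeshow(probs, outcomes, groups=10):
    """Hosmer-Lemeshow statistic from estimated probabilities and 0/1 outcomes.

    Assumes every group is non-empty and has a mean probability strictly
    between 0 and 1; otherwise a term of the sum is undefined.
    """
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    stat = 0.0
    for k in range(groups):
        chunk = pairs[k * n // groups:(k + 1) * n // groups]
        n_k = len(chunk)
        observed = sum(y for _, y in chunk)       # o_k
        expected = sum(p for p, _ in chunk)       # n_k * mean probability
        mean_p = expected / n_k
        stat += (observed - expected) ** 2 / (n_k * mean_p * (1.0 - mean_p))
    return stat


# Hypothetical two-group example: ten subjects at probability 0.2 with four
# positive responses, ten subjects at probability 0.7 with seven.
probs = [0.2] * 10 + [0.7] * 10
outcomes = [1] * 4 + [0] * 6 + [1] * 7 + [0] * 3
c_hat = hosmer_lemeshow(probs, outcomes, groups=2)
```

In the toy example only the first group is miscalibrated (4 observed against 2 expected), so the statistic reduces to (4 − 2)² / (10 × 0.2 × 0.8) = 2.5.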
The other supplementary summary measures of goodness-of-fit, such as the Cox and Snell R² of 0.26, the Nagelkerke adjusted R² of 0.35 and the predicted correct classification rate of 77.4%, indicate that the model fits the data at an acceptable level. Thus the fitted binary logistic response function from the training sample is

$$\hat\pi = \left[1 + \exp(-0.389 + 0.053X_1 - 0.45X_2 + 0.085X_3 + 0.449X_4 - 2.453X_5)\right]^{-1} \tag{5}$$

Suppose that the validation sample consists of n_v observations (y_i, x_i), i = 1, 2, ..., n_v, which may be grouped into J_v covariate patterns. If some subjects have the same value of x, then J_v < n_v. We denote the number of subjects with x = x_j by m_j, j = 1, 2, ..., J_v. It follows that Σ m_j = n_v. Let y_j denote the number of positive responses among the m_j subjects with covariate pattern x = x_j, for j = 1, 2, ..., J_v. For the validation sample under study, the number of covariate patterns is J_v = 626. The logistic probability for the jth covariate pattern is π_j, the value of the previously estimated logistic model in equation (5) evaluated at the covariate pattern x_j from the validation sample. These quantities become the basis for the computation of summary measures of fit such as the Hosmer-Lemeshow goodness-of-fit statistic, the prediction error rate and the area under the Receiver Operating Characteristic (ROC) curve. Each of these summary measures of goodness-of-fit is considered in turn in the following.

3.1 Hosmer-Lemeshow Goodness-of-fit Test
The Hosmer-Lemeshow goodness-of-fit test may be used to obtain a summary test statistic for the validation sample. Let n_k denote the approximately n_v/g, or n_v/10, subjects in the kth decile. Let O_k = Σ y_j be the number of positive responses among the covariate patterns falling in the kth decile. The estimate of the expected value of O_k, under the assumption that the fitted model is correct, is E_k = Σ m_j π_j. Thus the Hosmer-Lemeshow test statistic is obtained as the Pearson chi-square statistic computed from the observed and expected frequencies as

$$C_v = \sum_{k=1}^{g} \frac{(O_k - E_k)^2}{n_k\bar\pi_k(1-\bar\pi_k)} \tag{8}$$

where $\bar\pi_k = \sum_j m_j\hat\pi_j / n_k$. The subscript v has been added to C to emphasize that the statistic has been calculated from a validation sample. Under the hypothesis that the model is correct, and the assumption that each E_k is sufficiently large for each term in C_v to be distributed as χ²(1), it follows that C_v is distributed as χ²(10). The results presented in Table 2 indicate that the model seems to fit quite well.

Table 2 Hosmer-Lemeshow goodness-of-fit chi-square statistic

Decile (k)   Mean predicted prob. (π̄_k)   Total observations   Observed positive responses (O_k)   Expected positive responses (E_k)
1            .077734                      63                   7                                   4.89594
2            .378498                      63                                                       8.68454
3            .0333                        63                                                       6.8093
4            .3775                        6                                                        4.36649
5            .34397                       63                                                       0.66644
6            .537998                      6                    35                                  33.359
7            .84834                       63                   49                                  5.44379
8            .750874                      63                   43                                  45.68
9            .99760                       6                    5                                   5.65
10           .98966                       6                    76                                  75.53588

χ² = 5.57, p-value = 0.85

Table 3 Predicted classification table based on the training sample and the validation sample, taking 0.5 as cutoff

                      Training sample               Validation sample
                      Expected (Y)                  Expected (Y)
Observed (Y)          0      1      Total           0      1      Total
No more (0)           785    66     851             307    111    418
Have another (1)      239    259    498             58     150    208

3.2 Validation of Prediction Error Rate
The classification table may then be used to compute statistics such as the prediction error rate, the area under the ROC curve, and the positive and negative predictive power. The reliability of the prediction error rate observed in the training data set is examined by applying the chosen prediction rule to a validation data set. If the new prediction error rate is about the same as that for the training data set, then the latter gives a reliable indication of the predictive ability of the fitted binary logistic regression model and the chosen prediction rule. If the new data lead to a considerably higher prediction error rate, then the fitted binary logistic regression model and the chosen prediction rule do not predict new observations as well as originally indicated [8].
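The quantities named in this subsection can be computed directly from estimated probabilities and observed outcomes. The sketch below is illustrative only: the function names and the toy data are hypothetical, the 0.5 cutoff follows the caption of Table 3, the area under the ROC curve is computed by its pairwise-comparison (Mann-Whitney) form, and the training-sample counts are taken from Table 3 as best they can be read.

```python
def classification_table(probs, outcomes, cutoff=0.5):
    """Cross-tabulate observed outcomes (0/1) against predictions at a cutoff.

    Keys are (observed, predicted) pairs; the prediction is 1 when the
    probability is at least the cutoff and 0 otherwise.
    """
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for p, y in zip(probs, outcomes):
        table[(y, 1 if p >= cutoff else 0)] += 1
    return table


def error_rate(table):
    """Share of subjects whose predicted class differs from the observed one."""
    wrong = table[(0, 1)] + table[(1, 0)]
    return wrong / sum(table.values())


def roc_area(probs, outcomes):
    """Area under the ROC curve as the proportion of (positive, negative)
    pairs in which the positive subject has the higher probability; ties
    count one half."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Small hypothetical example.
probs = [0.1, 0.4, 0.35, 0.8]
outcomes = [0, 0, 1, 1]
toy_table = classification_table(probs, outcomes)
toy_error = error_rate(toy_table)
toy_auc = roc_area(probs, outcomes)

# Training-sample counts read from Table 3, keyed as (observed, predicted).
training_table = {(0, 0): 785, (0, 1): 66, (1, 0): 239, (1, 1): 259}
training_error = error_rate(training_table)
```

The pairwise form of the area is quadratic in the sample size, which is fine for samples of a few thousand; a rank-based computation would be the usual choice for larger data.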
In the current study, the fitted logistic response function based on the training sample, given in (5), was used to calculate the estimated probabilities for the 626 cases of the validation data set. The chosen prediction rule is applied to the estimated probabilities as follows: predict 1 if π̂ ≥ 0.5 and predict 0 if π̂ < 0.5. The percent prediction error rate for the validation sample given in Table 3 is 26.9, while the rate for the training sample was 22.6. Thus the total prediction error rate for the validation sample is not considerably higher than that for the training sample, and we may conclude that it is a reliable indicator of the predictive capability of the fitted logistic regression model. The area under the ROC curve is another summary measure of the model's predictive power. In the present study the area under the ROC curve for the training sample was 0.80, while the area for the validation sample is 0.72. The area under the ROC curve for the validation sample is smaller than that for the training sample, but the predictive ability of the fitted logistic response function for new subjects may still be considered acceptable.

4 Discussion and Conclusion
Model validation is done to ascertain whether predicted values from the model are likely to accurately predict responses on future subjects. Internal validation involves fitting and validating the model by carefully splitting one series of subjects into a training set and a validating set. The study evaluated the model performance on the validating data set based on the model developed on the training set. Comprehensive approaches to the validation of the predictive logistic regression model have been introduced in the study. Different summary measures of goodness-of-fit and indices have been used to calibrate the model. Summary measures such as the Hosmer-Lemeshow goodness-of-fit test suggest that the fitted logistic regression model has significant predictive ability for future subjects. The prediction error rate for validation of the model is not so high. The area under the ROC curve for the training sample was 0.80, and it decreased by 0.08 to 0.72 for the validation sample, which indicates that the predictive ability of the fitted model is good. Thus different summary measures of goodness-of-fit and other supplementary indices of predictive ability of the fitted model indicate that the fitted binary logistic regression model can be used to predict future subjects.

References
[1] A. Agresti, Categorical Data Analysis, Wiley InterScience, New York, 2002.
[2] A. Wald, Tests of statistical hypotheses concerning several parameters when the number of observations is large, Transactions of the American Mathematical Society, Vol. 54, 1943, pp. 426-482.
[3] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Chapman and Hall/CRC, 1993.
[4] D. R. Cox and E. J. Snell, The Analysis of Binary Data, 2nd edition, Chapman and Hall, London, 1989.
[5] D. W. Hosmer and S. Lemeshow, A goodness-of-fit test for the multiple logistic regression model, Communications in Statistics, Vol. A10, 1980, pp. 1043-1069.
[6] F. E. Harrell, K. L. Lee and D. B. Mark, Tutorial in Biostatistics: Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors, Statistics in Medicine, Vol. 15, 1996, pp. 361-387.
[7] J. Shao, Linear Model Selection by Cross-Validation, Journal of the American Statistical Association, Vol. 88, No. 422, 1993, pp. 486-494.
[8] M. H. Kutner, C. J. Nachtsheim, J. Neter and W. Li, Applied Linear Statistical Models, Fifth Edition, McGraw-Hill, Irwin, 2005.
[9] N. J. D. Nagelkerke, A note on the general definition of the coefficient of determination, Biometrika, Vol. 78, 1991, pp. 691-692.
[10] S. Lemeshow and D. W. Hosmer, The use of goodness-of-fit statistics in the development of logistic regression models, American Journal of Epidemiology, Vol. 115, 1982, pp. 92-106.