Useful and unuseful summaries of regression models

Transcription

1 Tutorial 5 b. Todeschii Useful ad uuseful summaries of regressio models oberto Todeschii Milao Chemometrics ad QSA esearch Group - Dept. of Evirometal Scieces, Uiversit of Milao-Bicocca, P.za della Scieza Milao (Ital) I the scietific papers, it is ver commo to preset the summar of a regressio model i the followig form: logp = V V = 15 r = s = F = 18.1 where logp is the studied experimetal respose ad V1 ad V are two idepedet variables for which a quatitative relatioship with the respose has bee searched for. The umbers i the equatio are called regressio coefficiets, beig the first oe the itercept of the regressio model. is the umber of samples used for buildig the regressio equatio, r is the multiple correlatio coefficiet, s is the residual stadard deviatio (or error stadard deviatio), F the calculated value of the F-ratio test. I several cases, r = is replaced b = or b = or b = %, beig the last two idices (coefficiet of determiatio) the squared r () value ad the last oe the correspodig percetage of the variace explaied b the regressio model. For evaluatig the qualit of the summar, mathematical defiitios of the quatities ivolved i the previous example are give. esidual Sum of Squares (SS) It is the sum of the squared differece betwee the experimetal respose ad the respose calculated b the regressio model: ( ˆ ) i i (1) SS = If SS is equal to zero the model is perfect, i.e. for all the samples, the calculated resposes coicide with the experimetal resposes. Obviousl, SS also depeds o the measure uit used for the respose. I practice, for the same model, if ou multipl the experimetal respose for 10, SS is 100 times greater, beig a squared quatit. Total Sum of Squares (TSS) pag. 1

2 Tutorial 5 b. Todeschii It is the total variace that a regressio model ca explai ad is used as a referece quatit to calculate stadardized qualit parameters. Also deoted as SSY, it is the sum of the squared differeces betwee the experimetal resposes ad the average experimetal respose: ( i ) TSS SSY = () TSS is assumed as a theoretical referece model where for each experimetal respose a costat value is calculated as the average experimetal respose. As for SS, also TSS depeds o the measure uit used for the respose. Derived regressio parameters for evaluatig the goodess of fit From the previous defiitios of SS ad TSS, the followig quatities are usuall defied: TSS = MSS + SS (3) where TSS ad SS are the quatities defied above ad MSS is the Model Sum of Squares. All these three quatities are sums of squares ad the are alwas positive quatities: TSS 0 MSS 0 SS 0 The coefficiet of determiatio ad the multiple correlatio coefficiet are defied as: ( ) ( ˆ i i ) MSS SS = = 1 = TSS TSS r ( i ) (4) or = 0 1 The Mea square error (or residual mea square) s ad the residual stadard deviatio (or residual stadard error) s are defied as: s SS SS = s = p ' p ' (5) where is the umber of samples, p' the umber of model parameters (ofte give b p variables plus the itercept). Other smbols that ca be commol used for the residual stadard deviatio are SD, SE, ad s. Fiall, the F-ratio test i regressio is defied as the ratio betwee the variace explaied b the model to the residual variace, both scaled b the correspodig degrees of freedom: / ( p ' 1) ( ) MSS F = SS / p ' It must be observed that, for the same umber of objects ad variables, if SS decreases (i.e. a better model is obtaied), the both ad F icrease mootoicall with SS, while the residual mea square s decreases mootoicall with SS. This meas that the best goodess of fit ca be equall evaluated b usig a of the three parameters (max( ), max(f) ad mi(s ). (6) pag.

3 Tutorial 5 b. Todeschii Moreover, it must be oted that oe of these parameters is related i a wa to the model predictio power: the are all ol related to the goodess of fit. About r, ad First of all, it must be observed that, due to the mathematical properties of, the goodess of fit of the followig ested models ( ) ( ) ( ) logp = f V1, V (model 1) logp = f V1, V, V 3 (model ) logp = f V1, V, V 3, V 4 (model 3) ( p ) logp = f V1, V, V 3, V 4,, V (model p) ca alwas raked as: (mod. p) (mod.3) (mod.) (mod.1) i.e. the value of a model A costituted b the same variables of aother model B plus a variable is alwas greater tha (or at least equal to) the value of the model B. This effect is due to the possible presece (ad usuall probable) of chace correlatio, i.e. eve a completel uuseful variable (or a radom variable) ca ehace the value, but ever decreases it. Therefore, whe the aim is to select the best model i a populatio of models or select the best variables withi a set of variables or compare differet models, or get models cotaiig ol relevat variables, the parameter caot be used (either r or, obviousl). I order to overcome these drawbacks, Adjusted ( ADJ) ca be used i place of, sice icreasig umber of uecessar variables gives a pealt to the. It is defied as: = ( ) p 1 ADJ 1 1 Note that Adjusted coicides with for model cotaiig just oe idepedet variable. However, remember that eve adjusted is ot related i a wa to the real model predictio abilit. (7) About the error stadard deviatio The error stadard deviatio s is a statistical estimate of the error also accoutig for the model degrees of freedom. This quatit is i the same uit of the respose ad is based o the fittig performace of the model, i.e. SS. A more direct measure of the average error of the respose estimates is the Stadard Deviatio Error i Calculatio (SDEC or SEC): SDEC SEC = SS (8) pag. 3

4 Tutorial 5 b. Todeschii where is the umber of samples. About the F-ratio test I the few last decades, the majorit of regressio model summaries iclude the F value. Aalogousl to adjusted, higher the F value better the model. This use of the F value i evaluatig a regressio model is correct, but some problems arise whe oe thik that the F value was origiall proposed as a statistical test, that is the calculated F eeds to be compared with a critical value at some probabilit level i order to draw decisios. I effect, the ull hpothesis H 0 of the F-ratio test i regressio states that all the regressio coefficiets are equal to zero (i.e. o regressio model is obtaied) agaist the alterative hpothesis H 1 that at least oe of the regressio coefficiets is differet from zero (i.e. a regressio model is obtaied). Therefore, i multivariate aalsis, where more tha oe idepedet variable is used, this test is ver weak ad, i practice, completel uuseful: i fact, we wat all the variables icluded i the model to be relevat (or statisticall sigificat) for the respose we are modellig. egressio parameters related to the predictio power (goodess of predictio) I statistics ad chemometrics several validatio techiques have bee proposed i the last 0 ears i order to estimate the model predictio capabilit. The model predictio capabilit is somethig differet from the model fitess capabilit, i.e. the abilit of the model to estimate the respose for objects that do ot participate to the calculated model. Here, the attetio is stressed o the most simple wa to obtai some measures of model abilit, i.e. the leave-oe-out cross-validatio techique. This techique is ofte too optimistic with respect to the "true" predictio power of the models; however, it ca be easil carried out ad allows model compariso as well as variable selectio. The measures of model predictive abilit are ver similar to those defied for the goodess of fit ad are based o the Predictive Error Sum of Squares (PESS) which substitutes the esidual Sum of Squares (SS): Predictive Error Sum of Squares (PESS) It is the sum of the squared differeces betwee the experimetal respose ad the respose predicted b the regressio model, i.e. for a object that was ot used for model estimatio. It is defied as: PESS = ( ˆ ) i i / i (9) where the otatio i/i idicates that the respose is predicted b a model estimated whe the i-th sample was left out from the traiig set. pag. 4

5 Tutorial 5 b. Todeschii Therefore, a equivalet parameter to ca be defied, usig i place of SS the quatit PESS. This parameter is called cross-validated ad the accepted smbols are CV or Q : ( ˆ i i / i ) PESS CV Q = 1 = 1 Q 1 TSS ( i ) For bad predictive models, Q ca assume eve egative values whe PESS is greater tha TSS, meaig that i predictio the model performs worse tha the o-model estimate, i.e. the mea respose of the traiig set. (10) Note. A discussio about differet Q fuctios was recetl published i the refereces give below, where a modified geeral Q fuctio is also proposed. 1) Cosoi, V., Ballabio, D. ad Todeschii,. (009) Commets o the defiitio of the Q parameter for QSA validatio. Joural of Chemical Iformatio ad Modelig, 49, ) Cosoi, V., Ballabio, D. ad Todeschii,. (010) Evaluatio of model predictive abilit b exteral validatio techiques. J. Chemometrics, 4, The predictive squared error (PSE) ad the stadard deviatio error i predictio (SDEP or SEP) o the modeled respose are give b: PESS PESS PSE = SDEP SEP = The characteristic behaviour of ad Q is show i the figure below. (11) As it ca be oted, values alwas icrease whe oe variable at-a-time is added to the previous model variables; o the other side, Q values icrease util useful variables are added to the pag. 5

6 Tutorial 5 b. Todeschii previous model variables; whe ois variables are added to the model, Q values decrease, i.e. the predictio power of the model is lowered. Therefore, the optimal complexit of the model the best model predictive power - ca be chose b simpl lookig at the maximum value of Q (obviousl, the same model rakig is obtaied lookig at the miimum values of PSE or SDEP). Examples of some questioable regressio summaries I order to better explai the differet meaigs of the parameters used for a summar of regressio models, some examples take from literature are give below. Example 1 Compariso / selectio of regressio models I a paper, five differet models have bee calculated for modellig bidig affiities of 49 compouds, usig five differet sets of variables, each of te differet descriptors. The summar of the five models is collected i the Table below. Here Q is the so-called "qualit idex" ad ot the cross-validated Q parameter (compare equatios 10 ad 1 ad example give below). 1. = 49 r = s = F = Q = = CV = = 49 r = 0.93 s = F = 5.56 Q = 3.06 = CV = = 49 r = s = 0.43 F = 19.3 Q =.160 = CV = = 49 r = s = F = 8.45 Q =.940 = CV = = 49 r = 0.94 s = 0.41 F = 4.40 Q =.86 = CV = First of all, it there is redudat iformatio due to the presece of both r ad values. Moreover, havig all the models the same umber of samples (49) ad model parameters (10+itercept), the degrees of freedom for the umerator ad deomiator of the F-ratio are 10 ad 38, respectivel, with a F critical value (at 5%, 1% ad 0.1% of cofidece) of.09,.81 ad 3.90, respectivel: usig the F test, all the models are more tha acceptable at a level of cofidece (see also the above paragraph About the F-ratio test). O the basis of (ad r), the models ca be raked (from the best to the worst) with the followig order: 5, 4,, 3, 1. The same order should be obtaied b rakig the models o the basis of s (lower the value, better the model) ad b usig the qualit idex Q (see example ). As for, also these two parameters deped o SS, TSS beig a costat, provided that the measure uit of the respose was ot chaged from oe model to aother. However, the model rakig is:, 4, 5, 3, 1 i both cases, thus idicatig the presece of some mistakes i reportig the results or i the calculatios. Disregardig the previous wrog rakigs, the uique reasoable model rakig is that obtaied b CV: 4,, 5, 3, 1, i.e. evaluatig the model predictive power. pag. 6

7 Tutorial 5 b. Todeschii The Authors claim that model 5 is better tha model 4 ad that it is the best oe because "Itroductio of the ew... [idex]... i QSA model for σ receptor icreased the predictive value of the model.": this setece is false. Example - The qualit idex Aother iterestig model, reported i literature, is the followig: logp = W = 16 r = s = Q = I 1994, a qualit idex Q for regressio was defied as: Q = s where ad s are the measures of goodess of fit defied i equatios (4) ad (5), respectivel. Several people have used this qualit idex without a criticism (see also example 1). However, it ca be easil demostrated that it is based o a ver doubtful statistics. I effect, Q ( p ') ( ) ( ) ( ') (1) MSS = = TSS MSS p ' p ' 1 p ' p ' 1 = = F = F s SS SS TSS p TSS TSS where F' is ( ) ( ) ( p ) ( ) / ' 1 / 1 F ' = MSS p ' 1 = MSS SS / p ' SS / p ' ( ) ( ) F ' Q = TSS Therefore, the Q idex is related to a modified F test, which takes ito accout ol the degrees of freedom of the model residual sum of squares. Moreover, Q also depeds o the total sum of squares TSS, which is costat for a give respose. However, if the measure uit of the respose is chaged (for example, multiplied b 10), the TSS value will icrease 100 times ad the qualit decreases cosequetl. O the cotrar, if the respose is divided b 10, the qualit icreases proportioall. I fact, if the respose is defied as multiplied b a scalig factor c, the total sum of squares is: ( i ) ( i ) (xx) TSS = c c = c Therefore, the qualit idex Q ca be also expressed i a simpler form as: Q 1 F ' Q = c TSS ( ') ( ') p p = = = SS / ( p ') ( 1 ) TSS ( 1 ) c ( i ) pag. 7

8 Tutorial 5 b. Todeschii thus, showig that, for objects ad p variables, Q icreases mootoicall (but ot liearl) with (as well as with the decreasig of s ), beig TSS costat. However, if a chage i the respose scale is performed, the qualit chages as 1/c. But, i the case of the model show above, aother problem arises: i fact, the Authors cofoud the qualit idex Q with the cross-validated Q (or Q or CV) parameter, the attributig to the qualit idex predictive properties which o the cotrar are completel abset: "This qualit factor Q is defied as the ratio of the correlatio coefficiet (r) to the stadard deviatio (S) i.e., Q = r/s. This factor accouts for the predictive power of the model.". As it ca be easil observed, oe of the parameters i the qualit idex defiitio is i some wa related to the predictio power of the model, but is (of course) related to. Example 3 - Negative r values Look at the model summar below: logk = Sz i = 16 r = s = F = I this model, r (the multiple correlatio coefficiet) is preseted with the mius sig, due to the iverse pairwise correlatio betwee the variables logk i (the respose) ad Sz (the descriptor). The reported r is wrog, beig the multiple correlatio coefficiet (see equatios 3 ad 4) alwas positive because it represets (or is related to) the explaied variace of the model. The correct expressio is r = , while the iformatio related to the existig iverse correlatio betwee logk i ad Sz is well highlighted b the sig of the regressio coefficiet. This is aother error which sometimes appears i usig r or due to the coufoudig of the multiple correlatio coefficiet (just r or ) with the bivariate correlatio r, which represets the correlatio betwee a pair of variables ad rages betwee 1 ad +1, defied as: xi i (, ) = xi r x i (13) Of course, for bivariate models, the absolute value of r(x,) coicides with. A suggested summar of a regressio model From the above cosideratios, it seems quite reasoable to characterize the models with a simple but effective summar, separatig the predictio ad the fitess qualities of the models. Moreover, i order to avoid cofusios, it is suggested to use capital i place of r, leavig to the latter the meaig of pairwise correlatio betwee two variables. Moreover, the square values of both ad Q give a explicit idea of the modeled variace. Oce the model has bee accepted, a better pag. 8

9 Tutorial 5 b. Todeschii iformatio about the average error (i fittig ad predictio) o the modeled respose is give b the parameters SDEC ad SDEP. Fiall, i order to avoid pit discussios with some referees, sometimes the F-ratio test ca be also icluded, givig the degrees of freedom ad the correspodig probabilit. logp = V V = 15 Q = 93.6% SDEP = 0.79 LOO = 97.65% SDEC = 0.81 Obviousl, other estimates of the predictio power performed o a exteral test set (reportig also the umber of objects i the test set) or usig leave-ma-out (together with the percetage of objects left out i each step) or boostrap or other validatio techiques are welcome ad should be also reported, together with other specific iformatio about the adopted validatio techique. ( ) ( ) = 15 Q 0% = 91.13% Q 30% = 90.3% Q = 90.56% LMO LMO BOOT = 5 Q = SDEP = 0.87 EXT EXT pag. 9