1. Introduction

Regression analysis is a statistical methodology that utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other, or others. This methodology is widely used in business, the social and behavioral sciences, and the biological sciences, including agriculture and fishery research. For example, fish weight at harvest can be predicted by utilizing the relationship between fish weight and other growth-affecting factors such as water temperature, dissolved oxygen, free carbon dioxide, etc. There are other situations in fishery where relationships among variables can be exploited through regression analysis.

Regression analysis serves three major purposes: (1) description, (2) control and (3) prediction. We frequently use equations to summarize or describe a set of data. Regression analysis is helpful in developing such equations. For example, we may collect a considerable amount of fish growth data and data on a number of biotic and abiotic factors, and a regression model would probably be a much more convenient and useful summary of those data than a table or even a graph. Besides prediction, regression models may be used for control purposes. A cause-and-effect relationship may not be necessary if the equation is to be used only for prediction. In this case it is only necessary that the relationships that existed in the original data used to build the regression equation are still valid.

A functional relation between two variables is expressed by a mathematical formula. If X denotes the independent variable and Y the dependent variable, a functional relation is of the form $Y = f(X)$. Given a particular value of X, the function f indicates the corresponding value of Y. A statistical relation, unlike a functional relation, is not a perfect one. In general, the observations for a statistical relation do not fall directly on the curve of relationship.

Depending on the nature of the relationship between X and Y, regression approaches may be classified into two broad categories, viz., linear regression models and nonlinear regression models. The response variable is generally related to other causal variables through some parameters.
Models that are linear in these parameters are known as linear models, whereas in nonlinear models the parameters appear nonlinearly. Linear models are generally satisfactory approximations for most regression applications. There are occasions, however, when an empirically indicated or a theoretically justified nonlinear model is more appropriate. In the present lecture we shall consider the fitting of linear models only.

2. Linear Regression Models

We consider a basic linear model where there is only one predictor variable and the regression function is linear. The extension to models with more than one predictor variable is straightforward. The model can be stated as follows:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad (1)$$

where $Y_i$ is the value of the response variable in the ith trial, $\beta_0$ and $\beta_1$ are parameters, $X_i$ is a known constant, namely, the value of the predictor variable in the ith trial, $\varepsilon_i$ is a random error term with mean zero and variance $\sigma^2$, and $\varepsilon_i$ and $\varepsilon_j$ ($i \ne j$) are uncorrelated so that their covariance is zero.

Regression model (1) is said to be simple, linear in the parameters, and linear in the predictor variable. It is simple in that there is only one predictor variable, linear in the parameters because no parameter appears as an exponent or is multiplied or divided by another parameter, and linear in the predictor variable because this variable appears only in the first power. A model that is linear in the parameters and in the predictor variable is also called a first-order model.

2.1 Meaning of Regression Parameters

The parameters $\beta_0$ and $\beta_1$ in regression model (1) are called regression coefficients. $\beta_1$ is the slope of the regression line. It indicates the change in the mean of the probability distribution of Y per unit increase in X. The parameter $\beta_0$ is the Y intercept of the regression line. When the scope of the model includes X = 0, $\beta_0$ gives the mean of the probability distribution of Y at X = 0. When the scope of the model does not cover X = 0, $\beta_0$ does not have any particular meaning as a separate term in the regression model.

2.2 Method of Least Squares

To find good estimates of the regression parameters $\beta_0$ and $\beta_1$, we employ the method of least squares. For the observations $(X_i, Y_i)$ of each case, the method of least squares considers the deviation of $Y_i$ from its expected value, $Y_i - \beta_0 - \beta_1 X_i$. In particular, the method of least squares requires that we consider the sum of the n squared deviations. This criterion is denoted by Q:

$$Q = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2 \qquad (2)$$

According to the method of least squares, the estimators of $\beta_0$ and $\beta_1$ are those values $b_0$ and $b_1$, respectively, that minimize the criterion Q for the given observations. Using the analytical approach, it can be shown for regression model (1) that the values of $b_0$ and $b_1$ that minimize Q for any particular set of sample data are given by the following simultaneous equations:

$$\sum Y_i = n b_0 + b_1 \sum X_i$$
$$\sum X_i Y_i = b_0 \sum X_i + b_1 \sum X_i^2$$

These two equations are called normal equations and can be solved for $b_0$ and $b_1$:
$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}, \qquad b_0 = \frac{1}{n}\left(\sum Y_i - b_1 \sum X_i\right) = \bar{Y} - b_1 \bar{X},$$

where $\bar{X}$ and $\bar{Y}$ are the means of the $X_i$ and the $Y_i$ observations, respectively.

2.3 Properties of Fitted Regression Line

Once the parameter estimates are obtained, the fitted line is

$$\hat{Y}_i = b_0 + b_1 X_i \qquad (3)$$

The ith residual is the difference between the observed value $Y_i$ and the corresponding fitted value $\hat{Y}_i$, i.e., $e_i = Y_i - \hat{Y}_i$. The estimated regression line (3) fitted by the method of least squares has a number of properties worth noting.

1. The sum of the residuals is zero, $\sum e_i = 0$.
2. The sum of the squared residuals, $\sum e_i^2$, is a minimum.
3. The sum of the observed values $Y_i$ equals the sum of the fitted values $\hat{Y}_i$, $\sum Y_i = \sum \hat{Y}_i$.
4. The sum of the weighted residuals is zero when the ith residual is weighted by the level of the predictor variable in the ith trial: $\sum X_i e_i = 0$.
5. The sum of the weighted residuals is zero when the ith residual is weighted by the fitted value of the response variable in the ith trial: $\sum \hat{Y}_i e_i = 0$.
6. The regression line always goes through the point $(\bar{X}, \bar{Y})$.

2.4 Estimation of Error Term Variance $\sigma^2$

The variance $\sigma^2$ of the error terms $\varepsilon_i$ in regression model (1) needs to be estimated to obtain an indication of the variability of the probability distribution of Y. In addition, a variety of inferences concerning the regression function and the prediction of Y require an estimate of $\sigma^2$. Denote by $SSE = \sum (Y_i - \hat{Y}_i)^2 = \sum e_i^2$ the error sum of squares or residual sum of squares. Then an estimate of $\sigma^2$ is given by
$$\hat{\sigma}^2 = \frac{SSE}{n - p}, \qquad (4)$$

where p is the total number of parameters involved in the model. We also denote this quantity by MSE.

2.5 Inferences in Linear Models

Frequently, we are interested in drawing inferences about $\beta_1$, the slope of the regression line. At times, tests concerning $\beta_1$ are of interest, particularly one of the form:

$$H_0: \beta_1 = 0 \qquad H_1: \beta_1 \ne 0$$

The reason for interest in testing whether or not $\beta_1 = 0$ is that, when $\beta_1 = 0$, there is no linear association between Y and X. For the normal error regression model, the condition $\beta_1 = 0$ implies even more: not only is there no linear association between Y and X, there is no relation of any kind between Y and X, since the probability distributions of Y are then identical at all levels of X. An explicit test of the alternatives is based on the test statistic

$$t = \frac{b_1}{s(b_1)},$$

where $s(b_1)$ is the standard error of $b_1$, calculated as

$$s(b_1) = \sqrt{\frac{MSE}{\sum (X_i - \bar{X})^2}}.$$

The decision rule with this test statistic, when controlling the level of significance at $\alpha$, is:

if $|t| \le t(1 - \alpha/2;\, n - p)$, conclude $H_0$;
if $|t| > t(1 - \alpha/2;\, n - p)$, conclude $H_1$.

Tests for other parameters can be carried out similarly.

2.6 Prediction of New Observations

The new observation on Y to be predicted is viewed as the result of a new trial, independent of the trials on which the regression analysis is based. We denote the level of X for the new trial as $X_h$ and the new observation on Y as $Y_h$. Of course, we assume that the underlying regression model applicable for the basic sample data continues to be appropriate for the new observation.

The distinction between estimation of the mean response and prediction of a new response is basic. In the former case, we estimate the mean of the distribution of Y. In the present case, we predict an individual outcome drawn from the distribution of Y. Of course, the great majority of individual outcomes deviate from the mean response, and this must be taken into account by the procedure for predicting $Y_{h(new)}$. We denote by $\hat{Y}_h$ the predicted new observation and by $\sigma^2(\hat{Y}_h)$ the variance of $\hat{Y}_h$. An unbiased estimator of $\sigma^2(\hat{Y}_h)$ is given by
$$\hat{\sigma}^2(\hat{Y}_h) = \hat{\sigma}^2 + s^2(\hat{Y}_h),$$

where $s^2(\hat{Y}_h)$ is the estimated variance of the fitted value at $X_h$, given by

$$s^2(\hat{Y}_h) = \hat{\sigma}^2\left[\frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]. \qquad (5)$$

A prediction interval for the new observation can be constructed by using the t statistic, namely, $\hat{Y}_h \pm t(1 - \alpha/2;\, n - p)\,\hat{\sigma}(\hat{Y}_h)$.

2.7 Measure of Fitting, $R^2$

There are times when the degree of linear association is of interest in its own right. Here we describe one descriptive measure that is frequently used in practice to describe the degree of linear association between Y and X.

Denote by $SSTO = \sum (Y_i - \bar{Y})^2$ the total sum of squares, which measures the variation in the observations $Y_i$, or the uncertainty in predicting Y, when no account of the predictor variable X is taken. Similarly, SSE measures the variation in the $Y_i$ when a regression model utilizing the predictor variable X is employed. A natural measure of the effect of X in reducing the variation in Y, i.e., in reducing the uncertainty in predicting Y, is to express the reduction in variation (SSTO - SSE = SSR) as a proportion of the total variation:

$$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO} \qquad (6)$$

The measure $R^2$ is called the coefficient of determination, $0 \le R^2 \le 1$. In practice $R^2$ is not likely to be 0 or 1 but somewhere between these limits. The closer it is to 1, the greater is said to be the degree of linear association between X and Y.

2.8 Diagnostics and Remedial Measures

When a regression model is considered for an application, we can usually not be certain in advance that the model is appropriate for that application. Any one, or several, of the features of the model, such as linearity of the regression function or normality of the error terms, may not be appropriate for the particular data at hand. Hence, it is important to examine the aptness of the model for the data before inferences based on that model are undertaken. In this section we discuss some simple graphic methods for studying the appropriateness of a model, as well as some remedial measures that can be helpful when the data are not in accordance with the conditions of the regression model.

2.8.1 Departures From Model to be Studied

We shall consider the following seven important types of departures from the linear regression model with normal errors:

(i) The linearity of the regression function.
(ii) The constancy of error variance.
(iii) The independence of error terms.
(iv) Presence of one or a few outlier observations.
(v) The normal distribution of error terms.
(vi) One or several important predictor variables have been omitted from the model.
(vii) Presence of multicollinearity.

2.8.2 Graphical Tests for Model Departures

Nonlinearity of Regression Model

Whether a linear regression function is appropriate for the data being analyzed can be studied from a residual plot against the predictor variable or, equivalently, from a residual plot against the fitted values. Figure 1(a) shows a prototype situation of the residual plot against X when a linear regression model is appropriate. The residuals then fall within a horizontal band centred around 0, displaying no systematic tendencies to be positive or negative. Figure 1(b) shows a prototype situation of a departure from the linear regression model that indicates the need for a curvilinear regression function. Here the residuals tend to vary in a systematic fashion between being positive and negative.

[Figure 1: prototype residual plots, panels (a), (b), (c) and (d)]
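The residual plots of this section are built from the fitted values and residuals of the least squares fit described in Sections 2.2 to 2.7. As a minimal sketch in plain Python, the function below computes $b_0$, $b_1$, the residuals, MSE, $R^2$ and the t statistic for the slope, and then checks three of the residual properties of Section 2.3 numerically. The function name `fit_simple_linear` and the five data points are hypothetical and serve only as an illustration.

```python
import math

def fit_simple_linear(x, y):
    """Least squares fit of Y = b0 + b1*X with the usual summary quantities."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope, solution of the normal equations
    b0 = y_bar - b1 * x_bar             # intercept
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    sse = sum(e ** 2 for e in residuals)
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    mse = sse / (n - 2)                 # equation (4) with p = 2 parameters
    se_b1 = math.sqrt(mse / sxx)        # standard error of the slope
    return {
        "b0": b0, "b1": b1, "fitted": fitted, "residuals": residuals,
        "mse": mse, "r2": 1.0 - sse / ssto, "t_b1": b1 / se_b1,
    }

# Hypothetical illustration data (not taken from the lecture example).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = fit_simple_linear(x, y)

# Properties 1, 4 and 5 of the fitted line: these sums should vanish
# up to floating point rounding.
sum_e = sum(fit["residuals"])
sum_xe = sum(xi * ei for xi, ei in zip(x, fit["residuals"]))
sum_fe = sum(fi * ei for fi, ei in zip(fit["fitted"], fit["residuals"]))

print(fit["b0"], fit["b1"], fit["mse"], fit["r2"], fit["t_b1"])
print(sum_e, sum_xe, sum_fe)
```

Plotting `fit["residuals"]` against `x` or against `fit["fitted"]` gives the residual plots discussed here; the t statistic would still have to be compared with a tabulated t value, which this sketch does not compute.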
Nonconstancy of Error Variance

Plots of residuals against the predictor variable or against the fitted values are helpful not only to study whether a linear regression function is appropriate but also to examine whether the variance of the error terms is constant. The prototype plot in Figure 1(a) exemplifies residual plots when the error term variance is constant. Figure 1(c) shows a prototype picture of a residual plot when the error variance increases with X. In many biological science applications, departures from constancy of the error variance tend to be of the megaphone type.

Presence of Outliers

Outliers are extreme observations. Residual outliers can be identified from residual plots against X or $\hat{Y}$.

Nonindependence of Error Terms

Whenever data are obtained in a time sequence or some other type of sequence, such as for adjacent geographical areas, it is a good idea to prepare a sequence plot of the residuals. The purpose of plotting the residuals against time or some other type of sequence is to see if there is any correlation between error terms that are near each other in the sequence. A prototype residual plot showing a time-related trend effect is presented in Figure 1(d), which portrays a linear time-related trend effect. When the error terms are independent, we expect the residuals in a sequence plot to fluctuate in a more or less random pattern around the base line 0.

Nonnormality of Error Terms

Small departures from normality do not create any serious problems. Major departures, on the other hand, should be of concern. The normality of the error terms can be studied informally by examining the residuals in a variety of graphic ways.

Comparison of frequencies: when the number of cases is reasonably large, one approach is to compare actual frequencies of the residuals against expected frequencies under normality. For example, one can determine whether, say, about 90% of the residuals fall between $\pm 1.645\sqrt{MSE}$.

Normal probability plot: Still another possibility is to prepare a normal probability plot of the residuals. Here each residual is plotted against its expected value under normality.
A plot that is nearly linear suggests agreement with normality, whereas a plot that departs substantially from linearity suggests that the error distribution is not normal.

Omission of Important Predictor Variables

Residuals should also be plotted against variables omitted from the model that might have important effects on the response. The purpose of this additional analysis is to determine whether there are any key variables that could provide important additional descriptive and predictive power to the model. The residuals are plotted against the additional predictor
variable to see whether or not the residuals tend to vary systematically with the level of the additional predictor variable.

2.8.3 Statistical Tests for Model Departures

Graphical analysis of residuals is inherently subjective. Nevertheless, subjective analysis of a variety of interrelated residual plots will frequently reveal difficulties with the model more clearly than particular formal tests.

Tests for Randomness

A run test is frequently used to test for lack of randomness in the residuals arranged in time order. Another test, specially designed for lack of randomness in least squares residuals, is the Durbin-Watson test. The Durbin-Watson test assumes the first-order autoregressive error model. The test consists of determining whether or not the autocorrelation coefficient ($\rho$, say) is zero. The usual test alternatives considered are:

$$H_0: \rho = 0 \qquad H_1: \rho > 0$$

The Durbin-Watson test statistic D is obtained by using ordinary least squares to fit the regression function, calculating the ordinary residuals $e_t = Y_t - \hat{Y}_t$, and then calculating the statistic:

$$D = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} \qquad (7)$$

Exact critical values are difficult to obtain, but Durbin and Watson have obtained lower and upper bounds $d_L$ and $d_U$ such that a value of D outside these bounds leads to a definite decision. The decision rule for testing between the alternatives is:

if $D > d_U$, conclude $H_0$;
if $D < d_L$, conclude $H_1$;
if $d_L \le D \le d_U$, the test is inconclusive.

Small values of D lead to the conclusion that $\rho > 0$.

Tests for Normality

Correlation Test for Normality: In addition to visually assessing the approximate linearity of the points plotted in a normal probability plot, a formal test for normality of the error terms can be conducted by calculating the coefficient of correlation between the residuals $e_i$ and
their expected values under normality. A high value of the correlation coefficient is indicative of normality.

Kolmogorov-Smirnov test: The Kolmogorov-Smirnov test is used to decide if a sample comes from a population with a specific distribution. The Kolmogorov-Smirnov (K-S) test is based on the empirical distribution function (ECDF). Given N ordered data points $Y_1, Y_2, \ldots, Y_N$, the ECDF is defined as $E_N = n(i)/N$, where $n(i)$ is the number of points less than $Y_i$ and the $Y_i$ are ordered from smallest to largest value. This is a step function that increases by 1/N at the value of each ordered data point. If the empirical distribution function of, say, 100 normal random numbers is plotted together with a normal cumulative distribution function, the K-S test is based on the maximum distance between these two curves.

An attractive feature of this test is that the distribution of the K-S test statistic itself does not depend on the underlying cumulative distribution function being tested. Another advantage is that it is an exact test (the chi-square goodness-of-fit test depends on an adequate sample size for the approximations to be valid). Despite these advantages, the K-S test has several important drawbacks:

1. It only applies to continuous distributions.
2. It tends to be more sensitive near the center of the distribution than at the tails.
3. Perhaps the most serious limitation is that the distribution must be fully specified. That is, if location, scale, and shape parameters are estimated from the data, the critical region of the K-S test is no longer valid. It typically must be determined by simulation.

Due to limitations 2 and 3 above, many analysts prefer to use the Anderson-Darling goodness-of-fit test. The Kolmogorov-Smirnov test is defined by:
$H_0$: The data follow a specified distribution.
$H_1$: The data do not follow the specified distribution.

The Kolmogorov-Smirnov test statistic is defined as

$$D = \max_{1 \le i \le N}\left(F(Y_i) - \frac{i-1}{N},\; \frac{i}{N} - F(Y_i)\right) \qquad (9)$$

where F is the theoretical cumulative distribution of the distribution being tested, which must be a continuous distribution (i.e., no discrete distributions such as the binomial or Poisson), and it must be fully specified (i.e., the location, scale, and shape parameters cannot be estimated from the data).

The hypothesis regarding the distributional form is rejected if the test statistic, D, is greater than the critical value obtained from a table. There are several variations of these tables in the literature that use somewhat different scalings for the K-S test statistic and critical regions. These alternative formulations should be equivalent, but it is necessary to ensure that the test statistic is calculated in a way that is consistent with how the critical values were tabulated.

Anderson-Darling Test: The Anderson-Darling test is used to test if a sample of data came from a population with a specific distribution. It is a modification of the Kolmogorov-Smirnov (K-S) test and gives more weight to the tails than does the K-S test. The K-S test is distribution free in the sense that the critical values do not depend on the specific distribution being tested. The Anderson-Darling test makes use of the specific distribution in calculating critical values. This has the advantage of allowing a more sensitive test and the disadvantage that critical values must be calculated for each distribution. Currently, tables of critical values are available for the normal, lognormal, exponential, Weibull, extreme value type I, and logistic distributions.

The Anderson-Darling test is defined as:

$H_0$: The data follow a specified distribution.
$H_1$: The data do not follow the specified distribution.

The Anderson-Darling test statistic is defined as

$$A^2 = -N - S, \quad \text{where} \quad S = \sum_{i=1}^{N} \frac{2i - 1}{N}\left[\ln F(Y_i) + \ln\bigl(1 - F(Y_{N+1-i})\bigr)\right] \qquad (10)$$

F is the cumulative distribution function of the specified distribution. Note that the $Y_i$ are the ordered data. The critical values for the Anderson-Darling test are dependent on the specific distribution that is being tested.
Tabulated values and formulas are available in the literature for a few specific distributions (normal, lognormal, exponential, Weibull, logistic, extreme value type 1). The test is a one-sided test and the hypothesis that the distribution is of a specific form is rejected if the test statistic, A, is greater than the critical value.
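The test statistics defined in equations (7), (9) and (10) are simple to compute once the residuals are available. The following is a minimal sketch in plain Python. The function names, the sample residuals and the choice of a standard normal reference distribution are assumptions made for illustration only; critical values ($d_L$, $d_U$ and the K-S and Anderson-Darling tables) are not computed and must still be taken from published tables.

```python
import math

def normal_cdf(y, mu=0.0, sigma=1.0):
    """Cumulative distribution function of a fully specified normal distribution."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

def durbin_watson(e):
    """Durbin-Watson D of equation (7); e holds the residuals in time order."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(v ** 2 for v in e)
    return num / den

def ks_statistic(data, cdf):
    """K-S statistic of equation (9) against the fully specified cdf."""
    y = sorted(data)
    n = len(y)
    # With 0-based index i, the 1-based terms (i-1)/N and i/N become i/n and (i+1)/n.
    return max(max(cdf(v) - i / n, (i + 1) / n - cdf(v)) for i, v in enumerate(y))

def anderson_darling(data, cdf):
    """Anderson-Darling A^2 = -N - S of equation (10)."""
    y = sorted(data)
    n = len(y)
    s = sum(
        (2 * i + 1) / n * (math.log(cdf(y[i])) + math.log(1.0 - cdf(y[n - 1 - i])))
        for i in range(n)
    )
    return -n - s

# Hypothetical residuals in time order, tested against a standard normal
# distribution (an assumed, fully specified reference distribution).
residuals = [0.3, -0.5, 0.1, 0.8, -0.2, -0.6, 0.4, -0.3]
d_w = durbin_watson(residuals)
d_ks = ks_statistic(residuals, normal_cdf)
a_sq = anderson_darling(residuals, normal_cdf)
print(d_w, d_ks, a_sq)
```

Passing the reference distribution as a function keeps the K-S and Anderson-Darling routines independent of the normal case, in line with the requirement that F be fully specified before the test is carried out.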
Tests for Constancy of Error Variance

Modified Levene Test: The test is based on the variability of the residuals. Let $e_{i1}$ denote the ith residual for group 1 and $e_{i2}$ the ith residual for group 2. We use $n_1$ and $n_2$ to denote the sample sizes of the two groups, where $n_1 + n_2 = n$. Further, we use $\tilde{e}_1$ and $\tilde{e}_2$ to denote the medians of the residuals in the two groups. The modified Levene test uses the absolute deviations of the residuals around their group median, denoted by $d_{i1}$ and $d_{i2}$:

$$d_{i1} = |e_{i1} - \tilde{e}_1|, \qquad d_{i2} = |e_{i2} - \tilde{e}_2|.$$

With this notation, the two-sample t test statistic becomes:

$$t_L^* = \frac{\bar{d}_1 - \bar{d}_2}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \qquad (11)$$

where $\bar{d}_1$ and $\bar{d}_2$ are the sample means of the $d_{i1}$ and $d_{i2}$, respectively, and the pooled variance $s^2$ is:

$$s^2 = \frac{\sum (d_{i1} - \bar{d}_1)^2 + \sum (d_{i2} - \bar{d}_2)^2}{n - 2}. \qquad (12)$$

If the error terms have constant variance and $n_1$ and $n_2$ are not too small, $t_L^*$ follows approximately the t distribution with $n - 2$ degrees of freedom. Large absolute values of $t_L^*$ indicate that the error terms do not have constant variance.

White Test: The White test is a statistical test of whether the residual variance in a regression model is constant, that is, a test for homoscedasticity. This test, and an estimator for heteroscedasticity-consistent standard errors, were proposed by Halbert White in 1980. These methods have become extremely widely used, making this paper one of the most cited articles in economics.

To test for constant variance one undertakes an auxiliary regression analysis: the squared residuals from the original regression model are regressed onto a new set of regressors, which contains the original regressors, the cross-products of the regressors and the squared regressors. One then inspects the $R^2$. The LM test statistic is the product of the $R^2$ value and the sample size:

$$LM = n \cdot R^2 \qquad (13)$$

This follows a chi-square distribution, with degrees of freedom equal to the number of estimated parameters (in the auxiliary regression) minus one.

Tests for Outlying Observations

(i) Elements of Hat Matrix: The hat matrix is defined as $H = X(X'X)^{-1}X'$, where X is the matrix of explanatory variables. Larger diagonal elements of H indicate data points that are outlying with respect to the explanatory variables.
(ii) $WSSD_i$: $WSSD_i$ is an important statistic to locate points that are remote in x-space. $WSSD_i$ measures the weighted sum of squared distances of the ith point from the center of the data. Generally, if the $WSSD_i$ values progress smoothly from small to large, there are probably no extremely remote points. However, if there is a sudden jump in the magnitude of $WSSD_i$, this often indicates that one or more extreme points are present.

(iii) Cook's $D_i$: Cook's $D_i$ is designed to measure the shift in $\hat{y}$ when the ith observation is not used in the estimation of parameters. $D_i$ follows approximately $F(p, n-p)(1 - \alpha)$. The lower 10% point of this distribution is taken as a reasonable cut off (more conservative users suggest the 50% point). The cut off for $D_i$ can be taken as $4/n$.

(iv) $DFFITS_i$: DFFITS is used to measure the difference in the ith component of $\hat{y} - \hat{y}_{(i)}$. It is suggested that $|DFFITS_i| \ge 2\sqrt{(p+1)/n}$ may be used to flag influential observations.

(v) $DFBETAS_{j(i)}$: Cook's $D_i$ reveals the impact of the ith observation on the entire vector of the estimated regression coefficients. The influential observations for an individual regression coefficient are identified by $DFBETAS_{j(i)}$, $j = 1, 2, \ldots, p+1$, where each $DFBETAS_{j(i)}$ is the standardized change in $b_j$ when the ith observation is deleted.

(vi) $COVRATIO_i$: The impact of the ith observation on the variance-covariance matrix of the estimated regression coefficients is measured by the ratio of the determinants of the two variance-covariance matrices. Thus, COVRATIO reflects the impact of the ith observation on the precision of the estimates of the regression coefficients. Values near 1 indicate that the ith observation has little effect on the precision of the estimates. A value of COVRATIO greater than 1 indicates that the deletion of the ith observation decreases the precision of the estimates; a ratio less than 1 indicates that the deletion of the observation increases the precision of the estimates. Influential points are indicated by $|COVRATIO_i - 1| > 3(p+1)/n$.

(vii) $FVARATIO_i$: This statistic detects the change in the variance of $\hat{y}_i$ when an observation is deleted. A value near 1 indicates that the ith observation has negligible effect on the variance of $\hat{y}_i$.
A value greater than 1 indicates that deletion of the ith observation decreases the precision of the estimates; a value less than 1 indicates that it increases the precision of the estimates.

Tests for Multicollinearity

The use and interpretation of a multiple regression model depends implicitly on the assumption that the explanatory variables are not strongly interrelated. In most regression applications the explanatory variables are not orthogonal. Usually the lack of orthogonality is not serious enough to affect the analysis. However, in some situations the explanatory
variables are so strongly interrelated that the regression results are ambiguous. Typically, it is impossible to estimate the unique effects of individual variables in the regression equation. The estimated values of the coefficients are very sensitive to slight changes in the data and to the addition or deletion of variables in the equation. The regression coefficients have large sampling errors, which affect both inference and forecasting based on the regression model. The condition of severe non-orthogonality is also referred to as the problem of multicollinearity.

The presence of multicollinearity has a number of potentially serious effects on the least squares estimates of regression coefficients, as mentioned above. Some of the effects may be easily demonstrated. Multicollinearity also tends to produce least squares estimates $b_j$ that are too large in absolute value.

Detection of Multicollinearity

Let $R = (r_{ij})$ and $R^{-1} = (r^{ij})$ denote the simple correlation matrix and its inverse. Let $\lambda_i$, $i = 1, 2, \ldots, p$ ($\lambda_p \le \lambda_{p-1} \le \ldots \le \lambda_1$) denote the eigenvalues of R. The following are common indicators of relationships among independent variables:

1. Simple pair-wise correlations, $|r_{ij}| \approx 1$.
2. The squared multiple correlation coefficients, $R_i^2 = 1 - 1/r^{ii} > 0.9$, where $R_i^2$ denotes the squared multiple correlation coefficient for the regression of $x_i$ on the remaining x variables.
3. The variance inflation factors, $VIF_i = r^{ii} > 10$.
4. Eigenvalues, $\lambda_i \approx 0$.

The first of these indicators, the simple correlation coefficients $r_{ij}$ between pairs of independent variables, may detect a simple relationship between $x_i$ and $x_j$. Thus $|r_{ij}| \approx 1$ implies that the ith and jth variables are nearly proportional. The second set of indicators, $R_i^2$, the squared multiple correlation coefficient for the regression of $x_i$ on the remaining x variables, indicates the degree to which $x_i$ is explained by a linear combination of all of the other input variables. The third set of indicators is the diagonal elements of the inverse matrix, which have been labeled the Variance Inflation Factors, $VIF_i$.
The term arises by noting that with standardized data (mean zero and unit sum of squares), the variance of the least squares estimate of the ith coefficient is proportional to $r^{ii}$. The cut off $VIF_i > 10$ is probably based on the simple relation between $R_i^2$ and $VIF_i$. That is, $VIF_i > 10$ corresponds to $R_i^2 > 0.9$.
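The variance inflation factors can be read directly from the diagonal of the inverse correlation matrix, and the relation $R_i^2 = 1 - 1/VIF_i$ then gives the squared multiple correlations. The sketch below does this in plain Python with a small Gauss-Jordan inversion. The function names and the three predictor columns are hypothetical; `x2` is constructed to be nearly proportional to `x1` so that the $VIF > 10$ rule is triggered.

```python
import math

def correlation_matrix(columns):
    """Simple correlation matrix R = (r_ij) of the predictor columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def variance_inflation_factors(columns):
    """VIF_i = r^{ii}, the ith diagonal element of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[i][i] for i in range(len(columns))]

# Hypothetical predictors: x2 is nearly proportional to x1, x3 is unrelated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
x3 = [5.0, 3.0, 6.0, 2.0, 7.0, 4.0]

vifs = variance_inflation_factors([x1, x2, x3])
r_squared = [1.0 - 1.0 / v for v in vifs]      # R_i^2 = 1 - 1/VIF_i
flagged = [i + 1 for i, v in enumerate(vifs) if v > 10.0]
print(vifs, r_squared, flagged)
```

The eigenvalues of R are not computed here; in practice a numerical library routine would be used for that indicator.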
2.8.4 Overview of Remedial Measures

If the simple regression model (1) is not appropriate for a data set, there are two basic choices:

1. Abandon regression model (1) and develop and use a more appropriate model.
2. Employ some transformation on the data so that regression model (1) is appropriate for the transformed data.

Each approach has advantages and disadvantages. The first approach may entail a more complex model that could yield better insights, but may also lead to more complex procedures for estimating the parameters. Successful use of transformations, on the other hand, leads to relatively simple methods of estimation and may involve fewer parameters than a complex model, an advantage when the sample size is small. Yet transformations may obscure the fundamental interconnections between the variables, though at times they may illuminate them.

Nonlinearity of Regression Function

When the regression function is not linear, a direct approach is to modify regression model (1) by altering the nature of the regression function. For instance, a quadratic regression function might be used,

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$$

or an exponential regression function:

$$Y_i = \gamma_0 \gamma_1^{X_i} + \varepsilon_i.$$

When the nature of the regression function is not known, exploratory analysis that does not require specifying a particular type of function is often useful.

Nonconstancy of Error Variance

When the error variance is not constant but varies in a systematic fashion, a direct approach is to modify the method to allow for this and use the method of weighted least squares to obtain the estimates of the parameters.

Transformation is another way of stabilizing the variance. We first consider transformations for linearizing a nonlinear regression relation when the distribution of the error terms is reasonably close to a normal distribution and the error terms have approximately constant variance. In this situation, a transformation on X should be attempted. The reason why a transformation on Y may not be desirable here is that a transformation on Y, such as $Y' = \sqrt{Y}$, may materially change the shape of the distribution and may lead to substantially differing error term variance.

The following transformations are generally applied for stabilizing variance.
(1) when the error variance is rapidly increasing, $Y' = \log_{10} Y$ or $Y' = \sqrt{Y}$;
(2) when the error variance is slowly increasing, $Y' = Y^2$ or $Y' = \exp(Y)$;
(3) when the error variance is decreasing, $Y' = 1/Y$ or $Y' = \exp(-Y)$.

Box-Cox Transformations: It is difficult to determine which transformation of Y is most appropriate for correcting skewness of the distribution of error terms, unequal error variance, and nonlinearity of the regression function. The Box-Cox transformation automatically identifies a transformation from the family of power transformations on Y. The family of power transformations is of the form $Y' = Y^{\lambda}$, where $\lambda$ is a parameter to be determined from the data. Using a standard computer programme it can be determined easily.

Nonindependence of Error Terms

When the error terms are correlated, a direct approach is to work with a model that calls for correlated error terms. A simple remedial transformation that is often helpful is to work with first differences.

Nonnormality of Error Terms

Lack of normality and non-constant error variance frequently go hand in hand. Fortunately, it is often the case that the same transformation that helps stabilize the variance is also helpful in approximately normalizing the error terms. It is therefore desirable that the transformation for stabilizing the error variance be utilized first, and then the residuals studied to see if serious departures from normality are still present.

Omission of Important Variables

When residual analysis indicates that an important predictor variable has been omitted from the model, the solution is to modify the model.

Outlying Observations

Outliers can create great difficulty. When we encounter one, our first suspicion is that the observation resulted from a mistake or other extraneous effect. On the other hand, outliers may convey significant information, as when an outlier occurs because of an interaction with another predictor omitted from the model. A safe rule frequently suggested is to discard an outlier only if there is direct evidence that it represents an error in recording, a miscalculation, a malfunctioning of equipment, or a similar type of circumstance. When outlying observations are present, use of the least squares and maximum likelihood estimates for regression model (1) may lead to serious distortions in the estimated regression function.
When the outlying observations do not represent recording errors and should not be discarded, it may be desirable to use an estimation procedure that places less emphasis on such outlying observations. Robust regression falls under such methods.
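Returning to the Box-Cox family $Y' = Y^{\lambda}$ discussed earlier in this section, one common way of determining $\lambda$ from the data, assumed here and not necessarily the procedure used by any particular computer programme, is a grid search: for each candidate $\lambda$ the response is transformed in a standardized form (scaled by the geometric mean so that error sums of squares are comparable across $\lambda$), model (1) is fitted, and the $\lambda$ with the smallest SSE is retained. The function names and the noise-free exponential data below are hypothetical.

```python
import math

def sse_simple_regression(x, y):
    """Error sum of squares of the least squares line fitted to (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

def box_cox_standardized(y, lam):
    """Standardized power transformation; requires all y > 0.

    k2 is the geometric mean of y. lam = 0 corresponds to the log transform.
    """
    n = len(y)
    k2 = math.exp(sum(math.log(v) for v in y) / n)
    if lam == 0:
        return [k2 * math.log(v) for v in y]
    k1 = 1.0 / (lam * k2 ** (lam - 1.0))
    return [k1 * (v ** lam - 1.0) for v in y]

def choose_lambda(x, y, grid):
    """Return the grid value of lambda giving the smallest SSE."""
    return min(grid, key=lambda lam: sse_simple_regression(x, box_cox_standardized(y, lam)))

# Hypothetical data: Y grows exponentially in X, so log Y is exactly linear in X.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [math.exp(0.5 + 0.3 * xi) for xi in x]
grid = [k / 2.0 for k in range(-4, 5)]          # -2.0, -1.5, ..., 2.0
best_lambda = choose_lambda(x, y, grid)
print(best_lambda)
```

For these constructed data the search selects the logarithmic transformation. With real data the SSE curve is usually flat near its minimum, so a convenient nearby value such as 0, 0.5 or -1 is often preferred to the exact minimizer.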
Multicollinearity

(i) Collection of additional data: Collecting additional data has been suggested as one of the methods of combating multicollinearity. The additional data should be collected in a manner designed to break up the multicollinearity in the existing data.

(ii) Model respecification: Multicollinearity is often caused by the choice of model, such as when two highly correlated regressors are used in the regression equation. In these situations some respecification of the regression equation may lessen the impact of multicollinearity. One approach to respecification is to redefine the regressors. For example, if $x_1$, $x_2$ and $x_3$ are nearly linearly dependent, it may be possible to find some function such as $x = (x_1 + x_2)/x_3$ or $x = x_1 x_2 x_3$ that preserves the information content in the original regressors but reduces the multicollinearity.

(iii) Ridge Regression: When the method of least squares is used, parameter estimates are unbiased. A number of procedures have been developed for obtaining biased estimators of regression coefficients to tackle the problem of multicollinearity. One of these procedures is ridge regression. The ridge estimators are found by solving a slightly modified version of the normal equations, in which a small quantity is added to each of the diagonal elements of the $X'X$ matrix.

Example

Table 1

Case   X1      X2      X3       Y
1      .980    0.37    9.998    57.70
2      4.95    .08     6.776    59.96
3      5.53    5.305   .947     56.66
4      5.33    4.738   4.0      55.767
5      5.34    7.038   .053     5.7
6      7.49    5.98    -0.055   60.446
7      5.46    .737    4.657    60.75
8      .80     0.663   3.048    37.447
9      7.039   5.3     0.57     60.974
10     3.7     .039    8.738    55.70
11     6.5     .7      .0       59.89
12     4.340   4.077   5.545    54.07
13     .93     .643    9.33     53.99
14     4.3     0.40    .04      4.896
15     5.      .0      6.49     63.64
16     5.740   0.6     -.69     45.798
17     4.958   4.85    4.       58.699
18     4.5     3.53    8.453    50.086
19     6.39    9.698   -.74     48.890
20     6.45    3.9     .45      6.3
21     3.535   7.65    3.85     45.65
22     4.99    4.474   5.       53.93
23     5.837   5.753   .087     55.799
24     6.565   8.546   8.974    56.74
25     3.3     8.589   4.0      43.45
26     5.949   8.90    -0.48    50.706
Table : Idcators of Ifluetal Observatos Case r t ti*=s.t/s h D WSSD 0.460 0.89 0.8 0.5 0.005 39*.53 0.73 0.74 0.093 0.03 3 0.377 0.5 0.0 0.048 0.00 4 0.044 0.05 0.06 0.04 0.000 5-0.56-0.46-0.4 0.053 0.000 3 6.00 0.6 0.60 0.55 0.07 0 7 0.389 0.6 0. 0.08 0.00 7 8 0.3 0.088 0.086 0.30 0.00 4 9 0.43 0.6 0.56 0.55 0.003 8 0 0.589 0.355 0.347 0.47 0.005 3-3.30 -.0 -.93 0.73 0.4 4-0.406-0.3-0.6 0.053 0.00 3 3 0.94 0.8 0.7 0.63 0.00 4 4-0.68-0.64-0.6 0.75 0.00 3 5 0.80 0.476 0.469 0. 0.007 5 6-0.48-0.95-0.89 0.77 0.005 6 7 3.756.34.343 0.04 0.048 0 8-6.07-3.589-5.436 0.4 0.4 8 9 -.98-0.77-0.79 0.60 0.05 4 0.6 0.666 0.658 0.4 0.04 0.449 0.66 0.59 0.9 0.003 0.79 0.453 0.444 0.055 0.003 3 3-0.060-0.035-0.03 0.059 0.000 3 4 0.574.8.88 0.97 4.409 9 5 0.68 0.63 0.58 0.59 0.00 9 6-0.606-0.356-0.350 0.0 0.004 Case Table 3: Idcators of Ifluetal Observatos Cov Rato Dffts Itercep t X X X3 DFBETAS.5 0.48 0.056-0.053-0.006 0.006.03 0.3 0.06-0.04-0.04-0.050 3.54 0.047-0.005 0.00-0.008-0.007 4.57 0.005 0.000 0.000-0.00 0.000 5.67-0.033-0.00-0.00-0.006 0.006 6.33 0.58-0.095 0.3-0.04-0.050 7.99 0.068-0.005 0.05-0.036-0.005 8.7 0.057 0.07-0.034 0.06-0.006 9.408 0.09-0.030 0.048-0.035-0.03 0.380 0.44 0.058-0.058-0.04 0.06 8
0.639 -.004-0.54-0.045 0.776 0.55.60-0.054-0.07 0.04 0.04 0.000 3.435 0.05 0.07-0.9-0.004 0.03 4.45-0.074-0.06 0.03-0.35 0.05 5.35 0.75-0.008 0.033-0.05 0.00 6.44-0.34-0.04 0.04-0.044 0.047 7 0.496 0.48 0.06-0.7-0.07-0.046 8 0.40 -.945 0.36-0.308-0.0 -.77 9.30-0.34 0.03-0.045-0.080 0.094 0.5 0.36-0.055 0.097-0.05-0.05.350 0.095 0.054-0.06 0.04-0.08.8 0.08 0.05-0.048-0.08-0.00 3.79-0.008 0.00-0.00 0.00 0.00 4.75 4.30-3.64 3.76 3.80 3.934 5.46 0.069 0.03-0.039 0.09-0.003 6.309-0.7 0.000-0.007-0.06 0.043

Table 4: Regression Coefficients and Summary Statistics

Description b0 b b b3 s R Max M Max VIF e.v. R All Data (=6) 8. 3.56 -.63 0.34.80 0.94.8 0.0 0.65 Delete (, 7, 8) 7.7 3.66 -.79 0.40 0.5 0.99.85 0.0 0.65 Delete (4) 30.9.39 -.4-0.36.78 0.94 30.64 0.07 0.97 Delete (, 7, 8, 4.7.79 -. -0.6 0.50 0.99 7.9 0.003 0.99 4) 0 Rdge k=0.05 4.8 3. -.73 0.5 0.66 0.99 0.0 0.053 0.90 (=) Delete X3 (=) 9.50 3.03 -.00 0.49 0.99.0 0.863 0.0