SIMPLE LINEAR REGRESSION

Transcription

Documents prepared for use in course B01.1305, New York University, Stern School of Business

Fictitious example, n = 10
This shows the arithmetic for fitting a simple linear regression.

Summary of simple regression arithmetic
This document shows the formulas for simple linear regression, including the calculations for the analysis of variance table.

Another example of regression arithmetic
This example illustrates the use of wolf tail lengths to assess weights. Yes, these data are fictitious.

An illustration of residuals
This example shows an experiment relating the height of suds in a dishpan to the quantity of soap placed into the water. This also shows how you can get Minitab to list the residuals.

The simple linear regression model
This section shows the very important linear regression model. It's very helpful to understand the distinction between parameters and estimates.

Regression noise terms
What are those epsilons all about? What do they mean? Why do we need to use them?

More about noise in a regression
Random noise obscures the exact relationship between the dependent and independent variables. Here are pictures showing the consequences of increasing noise standard deviation. There is a technical discussion of the consequences of measurement noise in an independent variable. This entire discussion is done for simple regression, but the ideas carry over in a complicated way to multiple regression.

Does regression indicate causality?
This shows a convincing relationship between X and Y. Do you think that this should be interpreted as cause and effect?

An interpretation for residuals
The residuals in this example have a very concrete interpretation.

Elasticity
The economic notion of elasticity is generally obtained from linear regression. Here's how.

Summary of regression notions for one predictor
This is a quick one-page summary of what we are trying to do with a simple regression.

The residual versus fitted plot
Checking the residual versus fitted plot is now standard practice in doing linear regressions.

An example of the residual versus fitted plot
This shows that the methods explored in the preceding section can be useful for real data problems. Indeed, the expanding residuals situation is very common.

Transforming the dependent variable
Why does taking the log of the dependent variable cure the problem of expanding residuals? The math is esoteric, but these pages lay out the details for you.

The correlation coefficient
These pages provide the calculation formulas for finding the correlation coefficient. There is also a discussion of interpretation, along with a detailed look at the role of the correlation coefficient in making investment diversification decisions. Part of this section is a prelude to the discussion of the regression effect (below).

Covariance
The covariance calculation is part of the arithmetic used to obtain a correlation. Covariances are not that easy to interpret.

The regression effect
The regression effect is everywhere. What is it? Why does it happen? The correlation coefficient has the very important role of determining the rate of regression back to average.

Cover photo: Montauk lighthouse
Revised 4 AUG 2004
Gary Simon, 2004

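FICTITIOUS EXAMPLE, n = 10

Consider a set of 10 data points.

[Data listing: x and Y; the values are missing from this transcription.]

Begin by finding the five sums. Four of them can be recovered:

$\sum x_i = 43$, $\sum y_i = 102$, $\sum y_i^2 = 1{,}094$, $\sum x_i y_i = 476$

[The value of $\sum x_i^2$ is only partly legible.]

Then find $\bar{x} = 43/10 = 4.3$ and $\bar{y} = 102/10 = 10.2$.

Next,

$S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$

$S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 476 - \frac{(43)(102)}{10} = 37.4$

$S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = 1{,}094 - \frac{102^2}{10} = 53.6$

This leads to $b_1 = S_{xy}/S_{xx}$ and then $b_0 = \bar{y} - b_1 \bar{x}$. [The numeric values of $S_{xx}$, $b_1$ and $b_0$ are missing from this transcription.]

The regression line can be reported as $Y = b_0 + b_1 x$ with the computed coefficients. If the spurious precision annoys you, report the line with rounded coefficients instead.

The quantity $S_{yy}$ was not used here. It has many other uses in regression calculations, so it is worth the trouble to find it early in the work.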

SUMMARY OF SIMPLE REGRESSION ARITHMETIC

Here are the calculations needed to do a simple regression.

Aside: The word "simple" here refers to the use of just one x to predict y. Problems in which two or more variables are used to predict y are called multiple regression.

The input data are $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. The outputs in which we are interested (so far) are the values of $b_1$ (estimated regression slope) and $b_0$ (estimated regression intercept). These will allow us to write the fitted regression line $Y = b_0 + b_1 x$.

(1) Find the five sums $\sum x_i$, $\sum y_i$, $\sum x_i^2$, $\sum y_i^2$, $\sum x_i y_i$, each taken over $i = 1, \ldots, n$.

(2) Find the five expressions

    $\bar{x} = \frac{1}{n}\sum x_i$,   $\bar{y} = \frac{1}{n}\sum y_i$,

    $S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$,

    $S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$,

    $S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$.

(3) Give the slope estimate as $b_1 = \frac{S_{xy}}{S_{xx}}$ and the intercept estimate as $b_0 = \bar{y} - b_1 \bar{x}$.

(4) For later use, record $S_{yy|x} = S_{yy} - \frac{(S_{xy})^2}{S_{xx}}$.

Virtually all the calculations for simple regression are based on the five quantities found in step (2).

The regression fitting procedure is known as least squares. It gets this name because the resulting values of $b_0$ and $b_1$ minimize the expression $\sum_{i=1}^{n} \left( y_i - [b_0 + b_1 x_i] \right)^2$. This is a good criterion to optimize for many reasons, but understanding these reasons will force us to go into the regression model.
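Steps (1) through (4) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the way Minitab does the computation. The function name `simple_regression` and the five data points are made up for the example.

```python
def simple_regression(xs, ys):
    """Least-squares fit of y = b0 + b1*x using the five-sums arithmetic."""
    n = len(xs)
    # Step (1): the five sums.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Step (2): the means and the corrected sums of squares and products.
    x_bar = sum_x / n
    y_bar = sum_y / n
    s_xx = sum_xx - sum_x ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    s_yy = sum_yy - sum_y ** 2 / n
    # Step (3): slope and intercept.
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Step (4): the residual sum of squares, recorded for later use.
    s_yy_given_x = s_yy - s_xy ** 2 / s_xx
    return {"b0": b0, "b1": b1, "s_xx": s_xx, "s_xy": s_xy,
            "s_yy": s_yy, "s_yy_given_x": s_yy_given_x}


# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = simple_regression(xs, ys)

# The minimized least-squares criterion should equal S_yy|x from step (4).
sse = sum((y - (fit["b0"] + fit["b1"] * x)) ** 2 for x, y in zip(xs, ys))
print(fit["b0"], fit["b1"], sse, fit["s_yy_given_x"])
```

The last two printed numbers agree, which is the sense in which step (4) records the minimized value of the least-squares criterion.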

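As an example, consider a data set with n = 10 and with

$\sum x_i = 200$, $\sum x_i^2 = 4{,}250$, $\sum y_i = 1{,}000$, $\sum y_i^2 = 106{,}250$, $\sum x_i y_i = 20{,}750$

It follows that

$\bar{x} = \frac{200}{10} = 20$,   $\bar{y} = \frac{1{,}000}{10} = 100$

$S_{xx} = 4{,}250 - \frac{200^2}{10} = 250$

$S_{xy} = 20{,}750 - \frac{200 \times 1{,}000}{10} = 750$

It follows next that $b_1 = \frac{S_{xy}}{S_{xx}} = \frac{750}{250} = 3$ and $b_0 = \bar{y} - b_1 \bar{x} = 100 - 3(20) = 40$.

The fitted regression line would be given as $Y = 40 + 3x$.

We could note also $S_{yy} = 106{,}250 - \frac{1{,}000^2}{10} = 6{,}250$. Then $S_{yy|x} = 6{,}250 - \frac{750^2}{250} = 4{,}000$.

We use $S_{yy|x}$ to get $s_\varepsilon$, the estimate of the noise standard deviation. The relationship is $s_\varepsilon = \sqrt{\frac{S_{yy|x}}{n-2}}$, and here that value is $\sqrt{\frac{4{,}000}{10-2}} = \sqrt{500} \approx 22.36$.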

In fact, we can use these simple quantities to compute the regression analysis of variance table. The table is built on the identity

$SS_{\text{total}} = SS_{\text{regression}} + SS_{\text{residual}}$

The quantity $SS_{\text{residual}}$ is often named $SS_{\text{error}}$.

The subscripts are often abbreviated. Thus, you will see reference to $SS_{\text{tot}}$, $SS_{\text{regr}}$, $SS_{\text{resid}}$, and $SS_{\text{err}}$.

For the simple regression case, these are computed as

$SS_{\text{tot}} = S_{yy}$

$SS_{\text{regr}} = \frac{(S_{xy})^2}{S_{xx}}$

$SS_{\text{resid}} = S_{yy} - \frac{(S_{xy})^2}{S_{xx}}$

The analysis of variance table for simple regression is set up as follows:

Source of Variation   Degrees of freedom   Sum of Squares           Mean Squares                       F
Regression            1                    (Sxy)^2 / Sxx            (Sxy)^2 / Sxx                      MS Regression / MS Resid
Residual              n - 2                Syy - (Sxy)^2 / Sxx      [Syy - (Sxy)^2 / Sxx] / (n - 2)
Total                 n - 1                Syy

For the data set used here, the analysis of variance table would be

Source of Variation   Degrees of freedom   Sum of Squares   Mean Squares   F
Regression            1                    2,250            2,250          4.50
Residual              8                    4,000            500
Total                 9                    6,250

Just for the record, let's note some other computations commonly done for regression. The information given next applies to regressions with K predictors. To see the forms for simple regression, just use K = 1 as needed.

The estimate for the noise standard deviation is the square root of the mean square in the residual line. This is $\sqrt{500} \approx 22.36$, as noted previously. The symbol $s$ is frequently used for this, as are $s_{Y|X}$ and $s_\varepsilon$.

The $R^2$ statistic is the ratio $\frac{SS_{\text{regr}}}{SS_{\text{tot}}}$, which is here $\frac{2{,}250}{6{,}250} = 0.36$.

The standard deviation of Y can be given as $s_Y = \sqrt{\frac{SS_{\text{tot}}}{n-1}}$, which is here $\sqrt{\frac{6{,}250}{9}} \approx 26.35$.

It is sometimes interesting to compare $s_\varepsilon$ (the estimate for the noise standard deviation) to $s_Y$ (the standard deviation of Y). It can be shown that the ratio of these is

$\frac{s_\varepsilon}{s_Y} = \sqrt{\frac{n-1}{n-1-K}\left(1 - R^2\right)}$

The quantity $1 - \left(\frac{s_\varepsilon}{s_Y}\right)^2 = 1 - \frac{n-1}{n-1-K}\left(1 - R^2\right)$ is called the adjusted $R^2$ statistic, $R^2_{\text{adj}}$.
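The analysis of variance arithmetic can be checked with a short Python sketch. It assumes the five sums for the n = 10 example are 200, 1,000, 4,250, 106,250 and 20,750, which are the values consistent with the stated results (slope 3, intercept 40, residual sum of squares 4,000). The variable names are illustrative.

```python
from math import sqrt

# The five sums for the n = 10 example (assumed values, see the note above).
n = 10
sum_x, sum_y = 200.0, 1000.0
sum_xx, sum_yy, sum_xy = 4250.0, 106250.0, 20750.0

s_xx = sum_xx - sum_x ** 2 / n          # 250
s_xy = sum_xy - sum_x * sum_y / n       # 750
s_yy = sum_yy - sum_y ** 2 / n          # 6250

# Sums of squares for the analysis of variance table.
ss_tot = s_yy
ss_regr = s_xy ** 2 / s_xx              # 2250
ss_resid = ss_tot - ss_regr             # 4000

K = 1                                   # number of predictors
df_regr, df_resid, df_tot = K, n - 1 - K, n - 1
ms_regr = ss_regr / df_regr
ms_resid = ss_resid / df_resid          # 500
f_stat = ms_regr / ms_resid             # 4.5

s_eps = sqrt(ms_resid)                  # estimate of the noise standard deviation
s_y = sqrt(ss_tot / df_tot)             # standard deviation of Y
r_sq = ss_regr / ss_tot                 # 0.36
r_sq_adj = 1.0 - (s_eps / s_y) ** 2     # 0.28

print(f"F = {f_stat:.2f}, s_eps = {s_eps:.2f}, s_Y = {s_y:.2f}")
print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}")
```

The adjusted value comes straight from the ratio of the two standard deviations, so it can never exceed the ordinary R-squared.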

ANOTHER EXAMPLE OF REGRESSION ARITHMETIC

The following data are found in the file X:\SOR\B0305\M\WOLVES.MTP:

[Data listing: columns TLength and Weight; the values are missing from this transcription.]

These refer to the tail lengths (in inches) and the weights (in pounds) of 10 wolves. The idea is to predict weight from tail length.

Here are some useful summaries:

[Minitab output: Descriptive Statistics (N, Mean, Median, Tr Mean, StDev, SE Mean, Min, Max, Q1, Q3) for TLength and Weight, followed by the Pearson correlation of TLength and Weight. The numeric entries are missing from this transcription.]

Here are the results of a regression request:

Regression Analysis: Weight versus TLength

[Minitab output: the regression equation, the coefficient table (Predictor, Coef, SE Coef, T, P), the Analysis of Variance table (Source, DF, SS, MS, F, P) and the Unusual Observations list. Most numeric entries are missing from this transcription. The summary line reads S = 22.36, R-Sq = 36.0%, R-Sq(adj) = 28.0%.]

R denotes an observation with a large standardized residual

We must, of course, examine scatterplots.

Formally, the regression activity is using the model

$\text{WEIGHT}_i = \beta_0 + \beta_1 \text{TLENGTH}_i + \varepsilon_i$,

where $i = 1, 2, \ldots, 10$, where $\beta_0$ and $\beta_1$ are unknown parameters, and where $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_{10}$ are statistical noise terms. It is assumed that the noise terms are independent with mean 0 and unknown standard deviation $\sigma$.

The fitted regression equation is that obtained from the computer output. Namely, it's

$\widehat{\text{WEIGHT}} = 40 + 3\,\text{TLENGTH}$.

Here $b_0 = 40$ is the estimate of $\beta_0$, and $b_1 = 3$ is the estimate of $\beta_1$. (We sometimes replace the symbols $b_0$ and $b_1$ by $\hat{\beta}_0$ and $\hat{\beta}_1$.)

If you wish to check the computational formulas, use x for the TLength variable and use y for the Weight variable. Then, it happens that

$\sum x_i = 200$, $\sum x_i^2 = 4{,}250$, $\sum y_i = 1{,}000$, $\sum y_i^2 = 106{,}250$, $\sum x_i y_i = 20{,}750$

It follows that

$\bar{x} = \frac{200}{10} = 20$,   $\bar{y} = \frac{1{,}000}{10} = 100$

$S_{xx} = 4{,}250 - \frac{200^2}{10} = 250$

$S_{xy} = 20{,}750 - \frac{200 \times 1{,}000}{10} = 750$

It follows then that $b_1 = \frac{S_{xy}}{S_{xx}} = \frac{750}{250} = 3$ and $b_0 = \bar{y} - b_1 \bar{x} = 100 - 3(20) = 40$.

AN ILLUSTRATION OF RESIDUALS

The data below give the suds height in millimeters as a function of grams of soap used in a standard dishpan.

[Data listing: columns SOAP and SUDS; the values are missing from this transcription.]

Let's fit the ordinary regression model and examine the residuals. You can arrange to have the residuals saved by doing Stat > Regression > Regression > Storage [ Residuals  OK ]

Here is the regression output:

Regression Analysis

[Minitab output: the regression equation for SUDS on SOAP, the coefficient table (Predictor, Coef, StDev, T, P) and the Analysis of Variance table (Source, DF, SS, MS, F, P). Most numeric entries are missing from this transcription. The summary line reads S = [?].835, R-Sq = 98.0%, R-Sq(adj) = 97.8%.]

The fitted model is SUDS-hat = b0 + b1 SOAP, with the coefficients taken from the output. Using this fitted model we can get the residuals as

e_i = SUDS_i - [ b0 + b1 SOAP_i ]
    = [ Actual SUDS value for point i ] - [ Retro-fit SUDS value for point i ]

For instance, for point 1 the soap value is 3.5, so this residual is SUDS_1 - [ b0 + b1 (3.5) ].

Actually, our Storage request to Minitab did the arithmetic. The residuals were left in a new column in Minitab's Data window under the name RESI1. (Residuals from subsequent regressions would have different names, and you also have the option of editing the name RESI1.) Here are the actual values for this data set:

[Data listing: columns SOAP, SUDS and RESI1; the values are missing from this transcription.]
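The residual arithmetic that Minitab stores can be sketched in Python as follows. The numbers below are made up for illustration; they are not the SOAP and SUDS values from the handout, and the function name `fit_and_residuals` is hypothetical.

```python
def fit_and_residuals(xs, ys):
    """Fit y = b0 + b1*x by least squares, then return fitted values and residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]             # retro-fit values
    resid = [y - f for y, f in zip(ys, fitted)]    # actual minus retro-fit
    return b0, b1, fitted, resid


# Made-up soap amounts (grams) and suds heights (mm), for illustration only.
soap_grams = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0]
suds_mm = [33.0, 42.0, 45.0, 51.0, 53.0, 61.0, 62.0]

b0, b1, fitted, resid = fit_and_residuals(soap_grams, suds_mm)
for x, y, e in zip(soap_grams, suds_mm, resid):
    print(f"{x:5.1f} {y:6.1f} {e:8.3f}")

# Two built-in properties of least-squares residuals:
# they sum to zero, and they are uncorrelated with the predictor.
resid_sum = sum(resid)
resid_dot_x = sum(x * e for x, e in zip(soap_grams, resid))
```

Both `resid_sum` and `resid_dot_x` come out as zero apart from rounding, whatever data are used.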

THE SIMPLE LINEAR REGRESSION MODEL

The data for a Y-on-X regression problem come in the form $(x_1, Y_1), (x_2, Y_2), \ldots, (x_n, Y_n)$. These may be conveniently laid out in a matrix or spreadsheet:

Case   x     Y
1      x_1   Y_1
2      x_2   Y_2
...    ...   ...
n      x_n   Y_n

The word "case" might be replaced by "point" or "data point" or "sequence number" or might even be completely absent. The labels x and Y could be other names, such as "year" or "sales."

In a data file in Minitab, the values for the x's and y's will be actual numbers, rather than algebra symbols. In an Excel spreadsheet, these could be either numbers or implicit values.

If a computer program is asked for the regression of Y on x, then numeric calculations will be done. These calculations have something to say about the regression model, which we discuss now.

The most common linear regression model is this.

The values $x_1, x_2, \ldots, x_n$ are known non-random quantities which are measured without error. If in fact the x values really are random, then we assume that they are fixed once we have observed them. This is a verbal sleight of hand; technically we say we are doing the analysis conditional on the x's.

The Y-values are independent of each other, and they are related to the x's through the model equation

$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ for $i = 1, 2, 3, \ldots, n$

The symbols $\beta_0$ and $\beta_1$ in the model equation are nonrandom unknown parameters.

The symbols $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are called the "statistical noise" or "errors." The $\varepsilon$-values prevent us from seeing the exact linear relationship between x and Y. These $\varepsilon$-values are unobserved random quantities. They are assumed to be statistically independent of each other, and they are assumed to have expected value zero. It is also assumed that (using SD for standard deviation) $SD(\varepsilon_1) = SD(\varepsilon_2) = \cdots = SD(\varepsilon_n) = \sigma_\varepsilon$. The symbol $\sigma_\varepsilon$ is another nonrandom unknown parameter.

The calculations that we will do for a regression will make statements about the model. For example, the estimated regression slope $b_1 = \frac{S_{xy}}{S_{xx}}$ is an estimate of the parameter $\beta_1$. Here is a summary of a few regression calculations, along with the statements that they make about the model.

Calculation                                                 What it means
b1 = Sxy / Sxx  (beta1-hat used also)                       Estimate of regression slope beta1
b0 = ybar - b1 xbar  (beta0-hat used also)                  Estimate of regression intercept beta0
Residual mean square                                        Estimate of sigma^2
Root mean square residual (standard error of regression)   Estimate of sigma
Standard error of an estimated coefficient                  Estimate of the standard deviation of that coefficient
t (of an estimated coefficient)                             Estimated coefficient, divided by its standard error

NOISE IN A REGRESSION

The linear regression model with one predictor says that

$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ for $i = 1, 2, \ldots, n$

The $\varepsilon$'s represent noise terms. These are assumed to be drawn from a population with mean 0 and with standard deviation $\sigma$.

Let's make the initial observation that if $\sigma = 0$, then all the $\varepsilon$'s are zero and we should see the line exactly. Here is such a situation:

[Figure: scatterplot of Y versus X with every point exactly on a straight line.]

Indeed, if you do the regression computations, you'll get to see the true line exactly.

[Figure: Regression Plot of Y versus X. S = 0, R-Sq = 100.0%, R-Sq(adj) = 100.0%.]

The equation of the true line is revealed exactly. (Its coefficients are missing from this transcription.)

Now, what if there really were some noise? Suppose that $\sigma$ = [?]0. The picture below shows what might happen.

[Figure: Regression Plot of Y versus X. R-Sq = 94.0%, R-Sq(adj) = 93.[?]%.]

The points stray from the true line. As a result, the fitted line we get is somewhat different from the true line.

What would happen if we had large, disturbing noise? Suppose that $\sigma$ = [?]50. The picture below shows this problem:

[Figure: Regression Plot of Y versus X. S = [?]83.8, R-Sq = [?].8%, R-Sq(adj) = [?]0.7%.]

You might notice the change in the vertical scale! We didn't do a very good job of finding the correct line. The points in this picture are so scattered that it's not even clear that we have any relationship at all between Y and x.
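The effect of the noise standard deviation can be explored with a small simulation. The sketch below is only illustrative: the true line (intercept 100, slope 2), the sample size and the sigma values are made-up assumptions, since the values behind the pictures are not recoverable here. The same seed is used for each sigma, so the noise pattern is the same and only its size changes.

```python
import random
from math import sqrt


def fit_line(xs, ys):
    """Return (b0, b1, s_eps, r_sq) for the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    ss_resid = max(s_yy - s_xy ** 2 / s_xx, 0.0)   # guard against tiny negative rounding
    return b0, b1, sqrt(ss_resid / (n - 2)), 1.0 - ss_resid / s_yy


def simulate(sigma, seed=1, beta0=100.0, beta1=2.0, n=30):
    """Generate Y = beta0 + beta1*x + noise with SD sigma, then fit the line."""
    rng = random.Random(seed)
    xs = [float(i) for i in range(1, n + 1)]
    # Standard normal draws scaled by sigma.
    ys = [beta0 + beta1 * x + sigma * rng.gauss(0.0, 1.0) for x in xs]
    return fit_line(xs, ys)


for sigma in (0.0, 5.0, 50.0):
    b0, b1, s_eps, r_sq = simulate(sigma)
    print(f"sigma={sigma:5.1f}  b0={b0:8.2f}  b1={b1:6.3f}  s={s_eps:7.2f}  R-Sq={r_sq:.1%}")
```

With sigma = 0 the fit returns the true line exactly. As sigma grows, the estimate s tracks sigma and the fitted coefficients wander further from the true values.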

MORE ABOUT NOISE IN A REGRESSION

There are many contexts in which regression analysis is used to estimate fixed and variable costs for complicated processes. The following data set involves the quantities produced and the costs for the production of a livestock food mix for each of [?]0 days. The quantities produced were measured in the obvious way, and the costs were calculated directly as labor costs + raw material costs + lighting + heating + equipment costs. The equipment costs were computed by amortizing purchase costs over the useful lifetimes, and the other costs are reasonably straightforward.

In fact, the actual fixed cost (per day) was $[?],500, and the variable cost was $[?]00/ton. Thus the exact relationship we should see is

Cost = $[?],500 + ($[?]00/ton) × Quantity.

Here is a picture of this exact relationship:

[Figure: True cost versus Quantity (tons), a straight line.]

It happens, however, that there is statistical noise in assessing cost, and this noise has a standard deviation of $[?]00. Schematically, we can think of our original picture as being spread out with vertical noise:

[Figure: the True cost line versus Quantity (tons), with vertical spread around the line.]

Here then are the data which we actually see:

[Data listing: columns Quantity and Cost; the values are missing from this transcription.]

The quantities are in tons, and the costs are in dollars.

Here is a scatterplot for the actual data:

[Figure: "Costs in dollars to produce feed quantities in tons", Cost versus Quantity. Footnote: there is a noise standard deviation of $[?]00 in computing costs.]

The footnote shows that in the process of assessing costs, there is noise with a standard deviation of $[?]00. In spite of this noise, the picture is fairly clean. The fitted regression line is

Cost-hat = $[?],088 + ($[?]0/ton) × Quantity.

The value of R-squared is 9[?].7%, so we know that this is a good regression. We would assess the daily fixed cost at $[?],088, and we would assess the variable cost at $[?]0/ton. Please bear in mind that this discussion hinges on knowing the exact fixed and variable costs and knowing about the $[?]00 noise standard deviation; in other words, this is a simulation in which we really know the facts. An analyst who sees only these data would not know the exact answer. Of course, the analyst would compute s_ε = $83.74, so that

Quantity                   True value    Value estimated from data
Fixed cost                 $[?],500      b0 = $[?],088
Variable cost              $[?]00/ton    b1 = $[?]0/ton
Noise standard deviation   $[?]00        s_ε = $83.74

All in all, this is not bad.

As an extension of this hypothetical exercise, we might ask how the data would behave with a $[?]00 standard deviation associated with assessing costs. Here is that scatterplot:

[Figure: "Cost in dollars to produce feed quantities in tons", Cost versus Quantity (tons). Footnote: there is a noise standard deviation of $[?]00 in computing costs.]

For this scatterplot, the fitted regression equation is

Cost-hat = $3,[?]90 + ($[?]65/ton) × Quantity.

Also for this regression we have R-squared = 55.4%. Our estimates of fixed and variable costs are still statistically unbiased, but they are infected with more noise. Thus, our fixed cost estimate of $3,[?]90 and our variable cost estimate of $[?]65/ton are not all that good. Of course, one can overcome the larger standard deviation in computing the cost by taking more data. For this problem, the analyst would see s_ε = $[?]0.[?]0.

Quantity                   True value    Value estimated from data
Fixed cost                 $[?],500      b0 = $3,[?]90
Variable cost              $[?]00/ton    b1 = $[?]65/ton
Noise standard deviation   $[?]00        s_ε = $[?]0.[?]0

This is not nearly as good as the above, but this may be more typical.

It is important to note that noise in assessing cost, the vertical variable, still gives us a statistically valid procedure. The uncertainty can be overcome with a larger sample size.

We will now make a distinction between noise in the vertical direction (noise in computing cost) and noise in the horizontal direction (noise in measuring quantity).

A more serious problem occurs when the horizontal variable, here quantity produced, is not measured exactly. It is certainly plausible that one might make such measuring errors when dealing with merchandise such as livestock feed. For these data, the set of [?]0 quantities has a standard deviation of [?].39 tons. This schematic illustrates the notion that our quantities, the horizontal variable, might not be measured precisely:

[Figure: the True cost line versus Quantity (tons), with horizontal spread around the line.]

Here is a picture showing the hypothetical situation in which costs experienced a standard deviation of measurement of $[?]00 while the feed quantities had a standard deviation of measurement of [?].5 tons.

[Figure: "Cost in dollars to produce feed quantities in tons", Cost versus measured Quantity. Footnote: there is a noise standard deviation of $[?]00 in computing costs, and quantities have been measured with a SD of [?].5 tons.]

For this picture the relationship is much less convincing. In fact, the fitted regression equation is

Cost-hat = $7,[?]5[?] + ($[?]74.[?]0/ton) × Quantity.

Also, this has s_ε = $[?]5.60. This has not helped:

Quantity                   True value    Value estimated from data
Fixed cost                 $[?],500      b0 = $7,[?]5[?]
Variable cost              $[?]00/ton    b1 = $[?]74.[?]0/ton
Noise standard deviation   $[?]00        s_ε = $[?]5.60

The value of R-squared here is 34.0%, which suggests that the fit is not good.

Clearly, we would like both cost and quantity to be assessed perfectly. However,

    noise in measuring costs leaves our procedure valid (unbiased) but with imprecision that can be overcome with large sample sizes

    noise in measuring quantities makes our procedure biased

The data do not generally provide clues as to the situation.

Here then is a summary of our situation. Suppose that the relationship is

    True cost = $\beta_0 + \beta_1 \times$ True quantity

where $\beta_0$ is the fixed cost and $\beta_1$ is the variable cost.

Suppose that we observe

    Y = True cost + $\varepsilon$

where $\varepsilon$ represents the noise in measuring or assessing the cost, with standard deviation $\sigma_\varepsilon$, and

    x = True quantity + $\zeta$

where $\zeta$ represents the noise in measuring or assessing the quantity, with standard deviation $\sigma_\zeta$.

Let us also suppose that the True quantities themselves are drawn from a population with mean $\mu_x$ and standard deviation $\sigma_x$.

You will do least squares to find the fitted line $Y = b_0 + b_1 x$.

It happens that $b_1$, the sample version of the variable cost, estimates $\beta_1 \frac{\sigma_x^2}{\sigma_x^2 + \sigma_\zeta^2}$.

Of course, if $\sigma_\zeta = 0$ (no measuring error in the quantities), then $b_1$ estimates $\beta_1$. It is important to observe that if $\sigma_\zeta > 0$, then $b_1$ is biased closer to zero.

It happens that $b_0$, the sample version of the fixed cost, estimates $\beta_0 + \beta_1 \mu_x \frac{\sigma_\zeta^2}{\sigma_x^2 + \sigma_\zeta^2}$.

If $\sigma_\zeta = 0$, then $b_0$ correctly estimates the fixed cost $\beta_0$.

The impact in accounting problems is that we will tend to underestimate the variable cost and overestimate the fixed cost.
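The two bias formulas can be checked numerically. The Python sketch below computes the quantities that the slope and intercept actually estimate, and then compares them with a large simulated sample. Every number in it (fixed cost 1,500, variable cost 100, mean quantity 40, and the three standard deviations) is a made-up assumption for illustration, not a value from the handout, and the function names are hypothetical.

```python
import random


def attenuation_targets(beta0, beta1, mu_x, sigma_x, sigma_zeta):
    """What b1 and b0 estimate when x is measured with noise SD sigma_zeta."""
    lam = sigma_x ** 2 / (sigma_x ** 2 + sigma_zeta ** 2)
    slope_target = beta1 * lam
    intercept_target = beta0 + beta1 * mu_x * (1.0 - lam)
    return slope_target, intercept_target


def simulate_fit(beta0, beta1, mu_x, sigma_x, sigma_zeta, sigma_eps, n, seed=0):
    """Simulate noisy costs and noisy quantities, then fit by least squares."""
    rng = random.Random(seed)
    true_q = [rng.gauss(mu_x, sigma_x) for _ in range(n)]
    ys = [beta0 + beta1 * q + rng.gauss(0.0, sigma_eps) for q in true_q]
    xs = [q + rng.gauss(0.0, sigma_zeta) for q in true_q]   # measured quantity
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


# Hypothetical parameter values, for illustration only.
params = dict(beta0=1500.0, beta1=100.0, mu_x=40.0, sigma_x=2.0, sigma_zeta=1.5)
slope_target, intercept_target = attenuation_targets(**params)
b0_hat, b1_hat = simulate_fit(sigma_eps=100.0, n=20000, **params)

print(f"slope: true 100, target {slope_target:.1f}, fitted {b1_hat:.1f}")
print(f"intercept: true 1500, target {intercept_target:.1f}, fitted {b0_hat:.1f}")
```

Under these assumptions the slope target is 64 rather than 100 and the intercept target is 2,940 rather than 1,500, and the fitted values from the large sample land near the targets, not near the true costs. A larger sample does not remove this bias.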

You can see that the critical ratio here is $\frac{\sigma_\zeta^2}{\sigma_x^2}$, the ratio of the variance of the noise in x relative to the variance of the population from which the x's are drawn.

In the real situation, you've got one set of data, and you have no idea about the values of $\beta_0$, $\beta_1$, $\sigma_x$, $\sigma_\zeta$, or $\sigma_\varepsilon$. If you have a large value of $R^2$, say over 90%, then you can be pretty sure that $b_1$ and $b_0$ are useful as estimates of $\beta_1$ and $\beta_0$. If the value of $R^2$ is not large, you simply do not know whether to attribute this to a large $\sigma_\varepsilon$, to a large $\sigma_\zeta$, or to both.

Here Quantity is the independent variable X and Cost is the dependent variable Y. The four possible situations are these:

Small $\sigma_\zeta / \sigma_x$ (quantity measured precisely relative to its background variation)
    Small $\sigma_\varepsilon$ (cost measured precisely): $b_0$ and $b_1$ nearly unbiased with their own standard deviations low; $R^2$ will be large.
    Large $\sigma_\varepsilon$ (cost measured imprecisely): $b_0$ and $b_1$ nearly unbiased but their own standard deviations may be large; $R^2$ will not be large.

Large $\sigma_\zeta / \sigma_x$ (quantity measured imprecisely relative to its background variation)
    Small $\sigma_\varepsilon$ (cost measured precisely): $b_1$ seriously biased downward and $b_0$ seriously biased upward; $R^2$ will not be large.
    Large $\sigma_\varepsilon$ (cost measured imprecisely): $b_1$ seriously biased downward and $b_0$ seriously biased upward; $R^2$ will not be large.

Do you have any recourse here?

If you know or suspect that $\sigma_\varepsilon$ will be large, meaning poor precision in assessing costs, you can simply recommend a larger sample size.

If you know or suspect that $\sigma_\zeta$ will be large relative to $\sigma_x$, there are two possible actions:

    By obtaining multiple readings of x for a single true quantity, it may be possible to estimate $\sigma_\zeta$ and thus undo the bias. You will need to obtain the services of a serious statistical expert, and he or she should certainly be well paid.

    You can spread out the x-values so as to enlarge $\sigma_x$ (presumably without altering the value of $\sigma_\zeta$). In the situation of our animal feed example, it may be procedurally impossible to do this.

DOES REGRESSION SHOW CAUSALITY?

The following data file shows information on 30 male MBA candidates at the University of Pittsburgh. The first column gives height in inches, and the second column gives the monthly income of the initial post-MBA job. (These appeared in the Wall Street Journal, 30 DEC 86.)

[Data listing: height and monthly income; the values are missing from this transcription.]

Here is a scatterplot:

[Figure: scatterplot of Income versus Height.]

This certainly suggests some form of relationship! The results of the regression are these:

Regression Analysis

[Minitab output: the regression equation for Income on Height, the coefficient table, the Analysis of Variance table, the Unusual Observations list, and the Predicted Values for New Observations at Height = 66.5. Many numeric entries are missing from this transcription. Entries that can be read: Constant has Coef -45[?].[?], StDev 418.5, T -1.08, P 0.290; R-Sq = 71.4%, R-Sq(adj) = 70.3%; for the new observation, Fit 2885.8 with 95.0% CI (2833.3, 2938.3).]

R denotes an observation with a large standardized residual

The section giving fitted values corresponds to HEIGHT_new = 66.5. Note that Minitab lists the 66.5 value in its output; this is to remind you that you asked for this particular prediction.

We see that the fitted equation is INCOME-hat = b0 + b1 HEIGHT, with slope b1 = 50.20. The obvious interpretation is that each additional inch of height is worth $50.20 per month. Can we believe any cause-and-effect here?

The R-squared value is reasonably large, so that this is certainly a useful regression.

Suppose that you wanted a 95% confidence interval for the true slope β1. This would be given as b1 ± t(α/2; n-2) × SE(b1), which is 50.20 ± 2.0484 × SE(b1), or 50.20 ± 12.307. You should be able to locate the slope estimate and its standard error in the listing above.

Suppose that you'd like to make a prediction for a person with height 66.5. Minitab will give this to you, and reminds you by repeating the value 66.5 in its Session window. You can see from the above that the fit (or point prediction) is 2,885.8. You could of course have obtained this as b0 + b1 (66.5).

Minitab provides several other facts near this 2,885.8 figure. The only thing likely to be useful to you is identified as the 95.0% PI, meaning 95% prediction interval. This is (2,681.5, 3,090.[?]), meaning that you're 95% sure that the INCOME for a person 66.5 inches tall would be between $2,681.5 and $3,090.[?].

The residual-versus-fitted plot must of course be examined. It's not shown here, just to save space, but it should be noted that this plot showed no difficulties.

AN INTERPRETATION FOR RESIDUALS

The data listed below give two numbers for each of 50 middle-level managers at a particular company. The first number is annual salary, and the second number is years of experience.

We are going to examine the relationship between salary and years of experience. Then we'll use the residuals to identify individuals whose salary is out of line with their experience.

[Data listing: annual salary and years of experience for each manager; the values are missing from this transcription.]

Note that the dependent variable is SALARY, and the independent variable is YEARS, meaning years of experience.

Here is a scatterplot showing these data:

[Figure: scatterplot of SALARY versus YEARS.]

Certainly there seems to be a relationship!

We can get the regression work by doing this:

Stat > Regression > Regression [ Response: SALARY  Predictors: YEARS  Storage [ Residuals  OK ]  OK ]

There are other interesting features of these values, and this exercise will not exhaust the work that might be done. Here are the regression results:

Regression Analysis

[Minitab output: the regression equation for SALARY on YEARS, the coefficient table (Predictor, Coef, SE Coef, T, P), the Analysis of Variance table and the Unusual Observations list (one observation flagged X, two flagged R). Most numeric entries are missing from this transcription. The summary line reads S = 864[?], R-Sq = 78.7%, R-Sq(adj) = 78.2%.]

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

There is more information here than we need right now, but you can at least read the fitted regression line as

SALARY-hat = [?],369 + [?],4[?] YEARS

with the interpretation that a year of experience is worth $[?],4[?].

Let's give a 95% confidence interval for the regression slope. This particular topic will be pursued vigorously later. The slope is estimated as [?],4[?].3, and its standard error is given in the SE Coef column of the output. A standard error, or SE, is simply a data-based estimate of a standard deviation of a statistic. Thus, the 95% confidence interval is [?],4[?].3 ± t(0.025; 48) × SE. This precision is clearly not called for, so we can give the interval as simply [?],4[?] ± [?]33. You might observe that this is very close to (estimate) ± 2 SE.

The value of S, meaning s_ε, is 8,64[?]. The interpretation is that the regression explains salary to within a noise with standard deviation $8,64[?]. The standard deviation of SALARY (for which the computation is not shown) is $18,530; the regression would thus have to be appraised as quite successful.

The fitted value for the i-th person in the list is SALARY-hat_i = [?],369 + [?],4[?] YEARS_i, and the residual for the i-th person is SALARY_i - SALARY-hat_i, which we will call e_i. Clearly the value of e_i indicates how far above or below the regression line this person is.

The Minitab request Storage [ Residuals ] caused the residuals to be calculated and saved. These will appear in a new column in the spreadsheet, RESI1. (Subsequent uses would create RESI2, RESI3, and so on.) Here are the values, displayed next to the original data:

[Data listing: columns SALARY, YEARS and RESI1; the values are missing from this transcription.]

It's easy to pick out the most extreme residuals. These are -$[?]7,665.4 (point 45, a [?]0-year person with salary $36,530) and $[?]7,83[?].[?] (point 35, an 8-year person with $99,[?]39). These two points are reported after the regression as being unusual observations, and they are noted to have large residuals. The observation flagged X is a high influence point, but that issue is not the subject of this document.

ELASTICITY

Let's think of the role of b1 in the fitted model Y = b0 + b1 x. This says that as x goes up by one unit, Y (tends to) go up by b1. In fact, it's precisely what we call dy/dx in calculus. This measures the sensitivity of Y to x.

Suppose that x is the price at which something is sold and Y is the quantity that clears the market. It helps to rewrite this as Q = b0 + b1 P. We'd expect b1 to be negative, since the quantities consumed will usually decrease as the price rises.

If currently the price is P0 and currently the quantity is Q0, then Q0 = b0 + b1 P0. Now a change of price from P0 to P0 + θ leads to a change in quantity from Q0 to b0 + b1 (P0 + θ) = b0 + b1 P0 + b1 θ = Q0 + b1 θ. Thus, the change in quantity is b1 θ.

The proportional change ratio

    - (change in Q / Q0) / (change in P / P0)

is called an elasticity. (We use the minus sign since they tend to move in opposite directions, and we like positive elasticities.) Of course, we can simplify this a bit, writing it finally as -b1 P0 / Q0.

Suppose that a demand "curve" has equation Q = 400,000 - 2,000 P. We put "curve" in quotes as this equation is actually a straight line. Indeed the illustrations in most microeconomics texts use straight lines.

Suppose that the current price is P0 = $60. The quantity corresponding to this is Q0 = 400,000 - 2,000 × 60 = 280,000.

Now suppose that the price rises by 1%, to $60.60. The quantity is now 400,000 - 2,000 × 60.60 = 278,800. This is a decrease of 1,200 in quantity consumed, which is a decrease of 1,200 / 280,000 ≈ 0.43%. Thus a 1% increase in price led to a decrease in quantity of 0.43%, so we would give the elasticity as 0.43. This is less than 1, so that the demand is inelastic.

We could of course obtain this as -b1 P0 / Q0 = 2,000 × 60 / 280,000 ≈ 0.43.

One approach follows through on the "1% change in price" idea. The other approach simply uses -b1 P0 / Q0. For a straight-line demand function the two agree apart from rounding; for a curved demand function they will be very close but not exactly equal.

Observe that the elasticity calculation depends here on where we started. Suppose that we had started at P0 = $80. The quantity consumed at this price of $80 is Q0 = 400,000 - 2,000 × 80 = 240,000.

An increase in price by 1%, meaning to $80.80, leads to a new quantity of 400,000 - 2,000 × 80.80 = 238,400. This is a decrease of 1,600 in quantity consumed. In percentage terms, this decrease is 1,600 / 240,000 ≈ 0.67%. We would give the elasticity as 0.67. This is also found as -b1 P0 / Q0 = 2,000 × 80 / 240,000 ≈ 0.67.

We can see that the elasticity for this demand curve Q = 400,000 - 2,000 P depends on where we start.

[Figure: the demand line, QTTY versus PRICE, with the annotations "Elasticity = 0.43 here" and "Elasticity = 0.67 here".]

You might check that at starting price P0 = $120, the elasticity would be computed as 1.50, and we would now claim that the demand is (highly) elastic.

Details: Find Q0 = 400,000 - 2,000 × 120 = 160,000. The 1% increase in price would lead to a consumption decrease of 1.20 × 2,000 = 2,400. This is a percentage decrease of 2,400 / 160,000 = 0.015 = 1.5%.

What is the shape of a demand curve on which the elasticity is the same at all prices? This is equivalent to asking about the curve for which (dQ/Q) / (dP/P) = -c, where c is constant. (The minus sign is used simply as a convenience; of course c will be positive.) This condition can be expressed as

    dQ/Q = -c dP/P

for which the solution is log Q = -c log P + d. This uses the calculus fact that d/dt log f(t) = f'(t) / f(t). The result can be re-expressed as

    Q = e^(log Q) = e^(-c log P + d) = m P^(-c)

where m = e^d. The picture below shows the graph of Q = 4,000,000 P^(-0.8).

[Figure: QTTY versus PRICE for Q = 4,000,000 P^(-0.8). "This curve has elasticity 0.80 at every price."]

The equation log Q = -c log P + d is a simple linear relationship between log Q and log P. Thus, if we base our work on a regression of log(quantity) on log(price), then we can claim that the resulting elasticity is the same at every price.

If you began by taking logs of BOTH variables and fitted the regression (log-on-log), you'd get the elasticity directly from the slope (with no need to worry about P0 or Q0). That is, in a log-on-log regression, the elasticity is exactly -b1.

Now, where would we get the information to actually perform one of these regressions? It would be wonderful to have prices and quantities in a number of different markets which are believed to be otherwise homogeneous. For example, we could consider per capita cigarette sales in states with different tax rates. Unfortunately, it is rather rare that we can get cleanly organized price and quantity information on a geographic basis.

It is more common by far to have price and quantity information over a large area at many points in time. In such a case, we would also have to worry about the possibility that the market changes over time.
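The elasticity calculations can be reproduced with a short Python sketch. It assumes the straight-line demand function is Q = 400,000 - 2,000 P, the coefficient consistent with the elasticities 0.43 and 0.67 quoted in the text, and it uses Q = 4,000,000 P^(-0.8) for the constant-elasticity curve. The function names are illustrative.

```python
def linear_demand(p, b0=400_000.0, b1=-2_000.0):
    """Straight-line demand: Q = b0 + b1 * P (b1 is negative)."""
    return b0 + b1 * p


def point_elasticity_linear(p0, b0=400_000.0, b1=-2_000.0):
    """Elasticity -b1 * P0 / Q0 for the straight-line demand function."""
    q0 = b0 + b1 * p0
    return -b1 * p0 / q0


def percent_change_elasticity(demand, p0, pct=0.01):
    """Elasticity from the 'raise the price by pct' approach."""
    q0 = demand(p0)
    q1 = demand(p0 * (1.0 + pct))
    return -((q1 - q0) / q0) / pct


def constant_elasticity_demand(p, m=4_000_000.0, c=0.8):
    """Q = m * P**(-c), whose elasticity is c at every price."""
    return m * p ** (-c)


for price in (60.0, 80.0, 120.0):
    e_point = point_elasticity_linear(price)
    e_pct = percent_change_elasticity(linear_demand, price)
    print(f"P0 = {price:6.2f}: -b1*P0/Q0 = {e_point:.4f}, 1% approach = {e_pct:.4f}")

for price in (20.0, 60.0, 120.0):
    e_pct = percent_change_elasticity(constant_elasticity_demand, price)
    print(f"constant-elasticity curve at P0 = {price:6.2f}: {e_pct:.4f}")
```

For the straight line the two approaches agree, and the elasticity rises with the starting price. For the power curve the 1% approach gives nearly the same value, a little under 0.80, at every price; the small gap is there because a 1% step is not infinitesimal.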

SUMMARY OF REGRESSION NOTIONS WITH ONE PREDICTOR

Measures of quality for a regression:

\( R^2 \): want big (as close as possible to 100%)
\( s_\varepsilon \): want small
\( s_\varepsilon / s_Y \): want small (as close as possible to 0%)
F: want big (up to \( \infty \))
t statistic for \( b_1 \): want far away from 0

Finally, here's the summary of what to do with the regression game with one independent variable.

Begin by making a scatterplot of (X, Y). You might see the need to transform. Indications: excessive curvature or extreme clustering of points in one region of the plot.

Note SD(Y).

Perform regression of Y on X. Note \( R^2 \), t for \( b_1 \), and \( s_\varepsilon \).

Another useful summary is the correlation between x and Y. Formally, this is computed as \( r = \dfrac{S_{xy}}{\sqrt{S_{xx} \, S_{yy}}} \). It happens that \( b_1 \) and r have the same sign.

Plot residual versus fitted. If you have pathologies, correct them and start over.

Use the regression to make your relevant inference. We'll check up on this later.
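The quality measures listed above can all be computed from the sums \( S_{xx} \), \( S_{xy} \), and \( S_{yy} \). The following is a minimal sketch using the standard simple-regression formulas; the data values and the function name `regression_summary` are made up for illustration.

```python
import math


def regression_summary(xs, ys):
    """Quality measures for a simple regression of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    b1 = sxy / sxx                      # estimated slope
    b0 = y_bar - b1 * x_bar             # estimated intercept
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

    s_eps = math.sqrt(sse / (n - 2))    # estimate of the noise SD
    s_y = math.sqrt(syy / (n - 1))      # SD(Y), ignoring x
    return {
        "b0": b0,
        "b1": b1,
        "R_sq": 1.0 - sse / syy,
        "s_eps": s_eps,
        "s_eps_over_s_y": s_eps / s_y,
        "F": (syy - sse) / (sse / (n - 2)),
        "t_b1": b1 / (s_eps / math.sqrt(sxx)),
        "r": sxy / math.sqrt(sxx * syy),
    }


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.3, 4.1, 5.8, 8.4, 9.9, 12.2, 13.8, 16.1]

summary = regression_summary(xs, ys)
for name, value in summary.items():
    print(f"{name:>16}: {value:.4f}")
```

In a regression with one predictor, F equals the square of the t statistic for the slope and \( R^2 \) equals \( r^2 \), so these measures carry overlapping information.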

THE RESIDUAL VERSUS FITTED PLOT

It is now standard practice to examine the plot of the residuals against the fitted values to check for appropriateness of the regression model. Patterns in this plot are used to detect violations of assumptions. The plot is obtained easily in most computer packages.

If you are working in Minitab, click on Stat > Regression > Regression, indicate the dependent and independent variable(s), then click on Graphs, and check off Residuals versus fits. In the resulting plot, you can identify individual points by using Editor > Brush.

You HOPE that plot looks like this:

[Figure: Residuals Versus the Fitted Values; Residual against Fitted Value, with no visible pattern.]

This picture would be described as patternless. There are a number of common pathologies that you will encounter. The next picture shows curvature:

[Figure: Residuals Versus the Fitted Values; Residual against Fitted Value, with a curved pattern.]

The cure for curvature consists of revising the model to be curved in one or more of the predictor variables. Generally this is done by using \( X^2 \) (in addition to X), but it can also be done by replacing X by \( \log X \), by \( \sqrt{X} \), or some other nonlinear function.
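One way to see the cure for curvature at work is to fit the same response against X and against log X and compare the residual patterns. The sketch below uses made-up data in which Y is truly linear in log X, with a small repeating pattern standing in for noise. The "curvature index" is a hypothetical diagnostic chosen for this illustration, not a standard statistic: it is the correlation between the residuals and the squared distance of the fitted values from their mean, so it is far from 0 when the residual plot bends.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def correlation(us, vs):
    """Pearson correlation between two equal-length lists."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    suv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    suu = sum((u - u_bar) ** 2 for u in us)
    svv = sum((v - v_bar) ** 2 for v in vs)
    return suv / math.sqrt(suu * svv)


def curvature_index(xs, ys):
    """Correlation of residuals with squared centered fitted values."""
    b0, b1 = fit_line(xs, ys)
    fitted = [b0 + b1 * x for x in xs]
    resid = [y - f for y, f in zip(ys, fitted)]
    f_bar = sum(fitted) / len(fitted)
    bend = [(f - f_bar) ** 2 for f in fitted]
    return correlation(resid, bend)


# Hypothetical data: Y is linear in log X, plus a small repeating disturbance.
cycle = [0.05, -0.025, -0.05, 0.025, 0.0]
xs = [float(i) for i in range(1, 31)]
ys = [5.0 + 3.0 * math.log(x) + cycle[i % 5] for i, x in enumerate(xs)]

index_raw = curvature_index(xs, ys)                           # Y on X
index_cured = curvature_index([math.log(x) for x in xs], ys)  # Y on log X
print("Y on X:    ", round(index_raw, 2))
print("Y on log X:", round(index_cured, 2))
```

The index for the regression on X should be strongly negative (residuals low at both ends, high in the middle), while the index after replacing X by log X should be much closer to zero.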

A very frequent problem is that of residuals that expand rightward on the residual versus fitted plot.

[Figure: Residuals Versus the Fitted Values; Residual against Fitted Value, with scatter widening to the right.]

On this picture there is considerably more scatter on the right side. It appears that large values of Y also have widely scattered residuals. Since the residuals are supposed to be independent of all other parts of the problem, this picture suggests that there is a violation of assumptions. This problem can be cured by replacing Y by its logarithm.

The picture below shows a destructive outlier.

[Figure: Residuals Versus the Fitted Values; Residual against Fitted Value, with one isolated point at the lower right.]

The point at the lower right has a very unusual value in the X-space, and it also has an inappropriate Y. This point must be examined. If it is incorrect, it should be corrected. If the correct value cannot be determined, the point must certainly be removed. If the point is in fact correct as coded, then it must be set aside. You can observe that this point is masking a straight-line relationship which would otherwise be incorporated in the regression coefficient estimates.
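The expanding-residuals pattern and its logarithmic cure can be imitated with a small artificial data set. This is a sketch under stated assumptions: the data are made up so that log Y is linear in X with a repeating disturbance in place of random noise, and `spread_ratio` is a hypothetical helper that compares the root-mean-square residual in the upper half of the fitted values with that in the lower half.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def spread_ratio(xs, ys):
    """RMS residual for the larger fitted values over that for the smaller."""
    b0, b1 = fit_line(xs, ys)
    pairs = sorted((b0 + b1 * x, y - (b0 + b1 * x)) for x, y in zip(xs, ys))
    half = len(pairs) // 2

    def rms(chunk):
        return math.sqrt(sum(e * e for _, e in chunk) / len(chunk))

    return rms(pairs[half:]) / rms(pairs[:half])


# Hypothetical data: multiplicative disturbance, so scatter grows with the level.
cycle = [0.30, -0.15, -0.30, 0.15, 0.0]
xs = [float(i) for i in range(1, 31)]
ys = [math.exp(10.0 + 0.05 * x + cycle[i % 5]) for i, x in enumerate(xs)]

ratio_raw = spread_ratio(xs, ys)                         # Y on X
ratio_log = spread_ratio(xs, [math.log(y) for y in ys])  # log Y on X
print("spread ratio, Y:    ", round(ratio_raw, 2))
print("spread ratio, log Y:", round(ratio_log, 2))
```

A ratio well above 1 corresponds to residuals that fan out to the right; a ratio near 1 corresponds to the patternless plot one hopes for.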

In the picture below, there is a very unusual point, but it is not destructive.

[Figure: Residuals Versus the Fitted Values; Residual against Fitted Value, with one unusual point.]

The unusual point in this plot must be checked, of course, but it is not particularly harmful to the regression. At worst, it slightly elevates the value of \( b_0 \), the estimated intercept.

On the picture below, there are several vertical stripes. This indicates that the X-values consisted of a small finite number (here 4) of different patterns.

[Figure: Residuals Versus the Fitted Values; Residual against Fitted Value, with four vertical stripes.]

The required action depends on the discreteness in the X-values that caused the stripes.

If the X-values are quantitative, then the regression is appropriate. The user should check for curvature, of course.

If the X-values are numerically-valued steps of an ordered categorical variable, then the regression is appropriate. The user should check for curvature.

If the X-values are numerically-valued levels of a qualitative non-ordinal categorical variable, then the problem should be recast as an analysis of variance or as an analysis of covariance.

Finally, we have a pattern in which imperfect stripes go across the residual-versus-fitted plot.

[Figure: Residuals Versus the Fitted Values; Residual against Fitted Value, with roughly horizontal stripes.]

These stripes can be generally horizontal, as above, or they can be oblique:

[Figure: Residuals Versus the Fitted Values; Residual against Fitted Value, with oblique stripes.]

These pictures suggest that there are only two values of Y. In such a situation, the appropriate tool is logistic regression. If there are three or more such stripes, indicating the same number of possible Y-values, then the recommended technique is multiple logistic regression. Multiple logistic regression can be either ordinal or nominal according to whether Y is ordinal or nominal.
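The stripes have a simple arithmetic explanation: each residual is Y minus the fitted value, so when Y takes only the values 0 and 1, every point lies on one of the two lines residual = -fitted or residual = 1 - fitted. A minimal sketch with made-up two-valued data:

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical two-valued response: mostly 0 for small x, mostly 1 for large x.
xs = [float(i) for i in range(1, 21)]
ys = [0.0] * 8 + [1.0, 0.0, 1.0, 0.0] + [1.0] * 8

b0, b1 = fit_line(xs, ys)
fitted = [b0 + b1 * x for x in xs]
resid = [y - f for y, f in zip(ys, fitted)]

# residual + fitted recovers Y, so each point belongs to stripe 0 or stripe 1.
stripe = [round(e + f) for e, f in zip(resid, fitted)]
print(sorted(set(stripe)))  # prints [0, 1]
```

Within each stripe the residual falls one unit for every unit increase in the fitted value, which is why the stripes look oblique when the fitted values cover a wide range and nearly horizontal when they cover a narrow one.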

AN EXAMPLE OF THE RESIDUAL VERSUS FITTED PLOT

The file X:\SOR\B0305\M\SALARY.MTP gives SALARY and YEARS of experience for a number of middle-level executives. Let's first find the regression of SALARY on YEARS. The data set consists of 50 points and looks like this:

[Data listing with columns CASE, SALARY, YEARS.]

Here is a scatterplot:

[Figure: scatterplot of SALARY against YEARS.]

The regression results from Minitab are these:

The regression equation is
SALARY = 11369 + 2141 YEARS

Predictor    Coef    SE Coef    T    P
Constant
YEARS

S = 8642    R-Sq = 78.7%    R-Sq(adj) = 78.2%

Analysis of Variance

Source            DF    SS    MS    F    P
Regression
Residual Error
Total

Unusual Observations

Obs    YEARS    SALARY    Fit    SE Fit    Residual    St Resid

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

The fitted regression equation is certainly \( \widehat{\text{SALARY}} = 11,369 + 2,141 \ \text{YEARS} \). What interpretation can we give for the estimated regression slope? The slope 2,141 suggests that each year of experience translates into $2,141 of salary.

Here is the residual versus fitted plot. It shows a common pathological pattern.

[Figure: Residuals Versus the Fitted Values (response is SALARY); Residual against Fitted Value.]

The residuals spread out to a greater degree at the right end of the graph. This is very common. This same observation might have been made in the original plot of Salary vs Years. In the case of simple regression (just one independent variable) the residual-versus-fitted plot is a rescaled and tilted version of the original plot.

We'll handle this violation of model assumptions by making a transformation on SALARY. The usual solution is to use the logarithm of salary. Here, we'll let LSALARY be the log (base e) of SALARY. The output from Minitab is this:

Regression Analysis

The regression equation is
LSALARY = 9.84 + 0.0500 YEARS

Predictor    Coef    SE Coef    T    P
Constant
YEARS

S = 0.1541    R-Sq = 86.4%    R-Sq(adj) = 86.1%

Analysis of Variance

Source        DF    SS    MS    F    P
Regression
Error
Total

Unusual Observations

Obs    YEARS    LSALARY    Fit    StDev Fit    Residual    St Resid

R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.

The residual-versus-fitted plot for this revised regression is shown next:

[Figure: Residuals Versus the Fitted Values (response is LSALARY); Residual against Fitted Value.]

This plot shows that the problem of expanding residuals has been cured. Let's also note that in the original regression, \( R^2 \) = 78.7%, whereas in the regression using LSALARY, \( R^2 \) = 86.4%.

The fitted regression, after rounding, is now \( \widehat{\text{LSALARY}} = 9.84 + 0.05 \ \text{YEARS} \). This suggests the interpretation that each year of experience is worth 0.05 in the logarithm of salary. You can exponentiate the expression above to get

\[ e^{\widehat{\text{LSALARY}}} = e^{9.84 + 0.05\,\text{YEARS}} \quad \text{or} \quad \widehat{\text{SALARY}} = e^{9.84} \left( e^{0.05} \right)^{\text{YEARS}} \]

In this form, increasing YEARS by 1 causes the fitted salary to be multiplied by \( e^{0.05} \). You can use a calculator to get a number out of this, but we have a simple approximation \( e^{0.05} \approx 1.05 \). Thus, accumulating a year of experience suggests that salary is to be multiplied by 1.05; this is a 5% raise.

You might be helped by this useful approximation. If t is near zero, then \( e^{t} \approx 1 + t \). Thus, getting a coefficient of 0.05 in this regression leads directly to the 5% raise interpretation.
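The approximation \( e^{t} \approx 1 + t \) is easy to check numerically. The short sketch below compares the two sides for a few small values of t and converts a log-salary slope into the exact percentage raise it implies; the function name `raise_per_year` is made up for this illustration.

```python
import math


def raise_per_year(slope):
    """Exact percentage raise per year implied by a log-salary slope."""
    return 100.0 * (math.exp(slope) - 1.0)


for t in (0.01, 0.05, 0.10, 0.25):
    exact = math.exp(t)
    approx = 1.0 + t
    print(f"t = {t:.2f}   e^t = {exact:.4f}   1 + t = {approx:.4f}   "
          f"error = {exact - approx:.4f}")

print(f"slope 0.05 implies a raise of {raise_per_year(0.05):.2f}% per year")
```

For a slope of 0.05 the exact multiplier is about 1.0513, so reading the coefficient directly as a 5% raise is off by only about an eighth of a percentage point. The approximation gets worse as t moves away from zero.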

Let's reconsider the consequences of our work in regressing SALARY on YEARS. The regression model for this problem (in original units) was

\[ \text{SALARY}_i = \beta_0 + \beta_1 \, \text{YEARS}_i + \varepsilon_i \qquad [1] \]

The noise terms \( \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n \) were assumed to be a sample from a population with mean zero and with standard deviation \( \sigma \). Our fitted regression was

\[ \widehat{\text{SALARY}} = 11,369 + 2,141.3 \ \text{YEARS} \]

The estimate of \( \sigma \) was computed as 8,641.9. This was labeled S by Minitab, but you'll also see the symbols \( s \) or \( s_{Y|x} \) or \( s_\varepsilon \).

Predicted salaries can be obtained by direct plug-in. For someone with 5 years of experience, the prediction would be \( 11,369 + (2,141.3 \times 5) \approx 22,076 \).

Here is a short table showing some predictions; this refers to the column Predicted SALARY with basic model [1].

YEARS  Predicted SALARY   Predicted SALARY   95% prediction interval   95% prediction interval
       with basic         with logarithm     for SALARY, using         for SALARY, using
       model [1]          model [2]          basic model [1]           logarithm model [2]
  5    22,076             24,130             4,020.8 to 40,130.7       17,487.6 to 33,…
 10    32,782             30,980             15,037.6 to 50,527.…      22,577.3 to 42,…
 15    43,488             39,775             25,910.5 to 61,067.…      29,07… to 54,…
 20    54,195             51,066             36,635.5 to 71,7…         37,335.5 to 69,…
 25    64,902             65,563             47,2… to 82,…             47,8… to 89,877.…

It happens that model [1] will lead to a residual versus fitted plot with the very common pattern of expanding residuals. The cure comes in replacing SALARY with LSALARY and considering the model

\[ \text{LSALARY}_i = \beta_0 + \beta_1 \, \text{YEARS}_i + \varepsilon_i \qquad [2] \]

For this model, the residual versus fitted plot shows a benign pattern, and we are able to believe the assumption that the \( \varepsilon_i \)'s are a sample from a population with mean 0 and standard deviation \( \sigma \). The LOG here is base e.

Please note that we have recycled notation. The parameters \( \beta_0 \), \( \beta_1 \), and \( \sigma \), along with the random variables \( \varepsilon_1 \) through \( \varepsilon_n \), do not have the same meanings in [1] and [2].

The fitted model corresponding to [2] is, after rounding, \( \widehat{\text{LSALARY}} = 9.84 + 0.05 \ \text{YEARS} \). For this model, the predicted log-salary for someone with 5 years of experience is found by plugging YEARS = 5 into the fitted equation. In original units, exponentiating that value gives about 24,130. This is the first entry in the column Predicted SALARY with logarithm model [2].

So what's the big deal about making a distinction between models [1] and [2]? The fitted values are different, with model [2] giving larger values at the ends and smaller values in the middle. Here are the major answers:

(a) The difference of about $4,000 between the models is not trivial. The two models are certainly not giving very similar answers.

(b) Model [1] has assumed equal noise standard deviations throughout the entire range of YEARS values. The plot of (SALARY, YEARS) suggests that this is not realistic. The residual versus fitted plot for model [1] makes this painfully obvious. The consequences will be seen in predictions. Examine the column 95% prediction interval for SALARY, using basic model [1]. Each of these prediction intervals is about $36,000 wide. This may be realistic for those people with high seniority (high values of YEARS), but it's clearly off the mark for those with low seniority.

(c) Model [2] has assumed equal noise standard deviations, in terms of LSALARY, throughout the entire range of YEARS values. This is more believable. In the column 95% prediction interval for SALARY, using logarithm model [2], the lengths vary, with the longer intervals associated with longer seniority.
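The contrast between the two kinds of prediction interval can be reproduced on artificial data. The sketch below does not use the salary file: it builds 30 made-up (YEARS, SALARY) points with a multiplicative disturbance, fits both models, and uses the standard prediction-interval formula for simple regression. The multiplier 2.048 is the approximate upper 2.5% point of the t distribution with 28 degrees of freedom, hard-coded here because the standard library has no t quantile function. All names are hypothetical.

```python
import math

T_CRIT = 2.048  # approximate t quantile for 95% intervals with n - 2 = 28 df


def fit(xs, ys):
    """Return (b0, b1, s, n, x_bar, sxx) for a simple regression of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    return b0, b1, math.sqrt(sse / (n - 2)), n, x_bar, sxx


def prediction_interval(model, x_new):
    """95% prediction interval for a new response at x_new."""
    b0, b1, s, n, x_bar, sxx = model
    centre = b0 + b1 * x_new
    half = T_CRIT * s * math.sqrt(1.0 + 1.0 / n + (x_new - x_bar) ** 2 / sxx)
    return centre - half, centre + half


# Hypothetical data: log salary is linear in years, with a repeating disturbance.
cycle = [0.30, -0.15, -0.30, 0.15, 0.0]
years = [float(i) for i in range(1, 31)]
salary = [math.exp(10.0 + 0.05 * x + cycle[i % 5]) for i, x in enumerate(years)]

basic_model = fit(years, salary)
log_model = fit(years, [math.log(v) for v in salary])

basic_widths = []
log_widths = []
for x_new in (5, 10, 15, 20, 25):
    lo1, hi1 = prediction_interval(basic_model, x_new)
    lo2, hi2 = (math.exp(v) for v in prediction_interval(log_model, x_new))
    basic_widths.append(hi1 - lo1)
    log_widths.append(hi2 - lo2)
    print(x_new, round(hi1 - lo1), round(hi2 - lo2))
```

The intervals from the model in original units have nearly the same width at every value of YEARS, while the intervals from the logarithm model, once converted back to original units by exponentiating the endpoints, widen as YEARS increases.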

TRANSFORMING THE DEPENDENT VARIABLE

Consider the linear regression problem with data \( (x_1, Y_1), (x_2, Y_2), \ldots, (x_n, Y_n) \). We form the model

\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \qquad i = 1, 2, \ldots, n \]

The assumptions which accompany this model include these statements about the \( \varepsilon_i \)'s:

The noise terms \( \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n \) are independent of each other and of all the other symbols in the problem.

The noise terms \( \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n \) are drawn from a population with mean zero.

The noise terms \( \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n \) are drawn from a population with standard deviation \( \sigma \).

Situations in which \( \text{SD}(\varepsilon_i) \) appears to systematically vary would violate these assumptions. The residual versus fitted plot can detect these situations; this plot shows the points \( (e_1, \hat{Y}_1), (e_2, \hat{Y}_2), \ldots, (e_n, \hat{Y}_n) \).

We will sometimes see that the residuals have greater variability when the fitted values are large. The recommended cure is that \( Y_i \) be replaced by \( \log(Y_i) \). If some of the \( Y_i \)'s are zero or negative, we'd use \( \log(Y_i + c) \) for some c big enough to make all values of \( Y_i + c \) positive.

Why does this work? It's a bit of insanely tricky math, but it's fun.

Suppose that the residual versus fitted plot suggests that \( \text{SD}(Y_i) \) is big when \( \beta_0 + \beta_1 x_i \) is big. We'll use the symbol \( \mu_i = \beta_0 + \beta_1 x_i \) for the expected value of \( Y_i \). The observation is that \( \text{SD}(Y_i) \) is big when \( \mu_i \) is big.

There are many ways of operationalizing the statement "\( \text{SD}(Y_i) \) is big when \( \mu_i \) is big." The description that will work for us is

\[ \text{SD}(Y_i) = \alpha \, \mu_i \]

That is, the standard deviation grows proportional to the mean. The symbol \( \alpha \) is just a proportionality constant, and it's quite irrelevant to the ultimate solution.

Let's seek a functional transformation \( Y \rightarrow g(Y) \) that will solve our problem. The symbol g represents a function to be found; perhaps we'll decide \( g(t) = t^3 \) or \( g(t) = \cos(\pi t) \) or \( g(t) = \dfrac{t}{t+1} \) or something else.
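Before working through the mathematics, the claim can be checked by simulation. The sketch below assumes a particular way of generating data with \( \text{SD}(Y) = \alpha \mu \): each Y is \( \mu (1 + \alpha Z) \) with Z standard normal and \( \alpha = 0.10 \). The means, the sample size, the seed, and the function name are all arbitrary choices made for illustration.

```python
import math
import random
import statistics

random.seed(0)   # fixed seed so the run is repeatable
ALPHA = 0.10     # assumed proportionality constant in SD(Y) = ALPHA * mean


def sd_raw_and_log(mu, n=20_000):
    """Sample SD of Y and of log(Y) when SD(Y) is proportional to the mean."""
    ys = [mu * (1.0 + ALPHA * random.gauss(0.0, 1.0)) for _ in range(n)]
    return statistics.stdev(ys), statistics.stdev([math.log(y) for y in ys])


results = {mu: sd_raw_and_log(mu) for mu in (10.0, 100.0, 1000.0)}
for mu, (sd_y, sd_log_y) in results.items():
    print(f"mean {mu:7.1f}   SD(Y) = {sd_y:8.3f}   SD(log Y) = {sd_log_y:.4f}")
```

The standard deviation of Y grows by a factor of about ten each time the mean does, while the standard deviation of log Y stays near \( \alpha \) at every mean, which is the behavior the transformation is meant to deliver.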

It's convenient to drop the symbol i for now. We'll be talking about the general phenomenon rather than about our specific data points.

By "solve our problem" we are suggesting these two notions:

If \( \mu_k \neq \mu_\ell \) then \( E(g(Y_k)) \neq E(g(Y_\ell)) \); that is, the function g preserves differentness of means (expected values).

If \( \mu_k \neq \mu_\ell \) then \( \text{SD}(g(Y_k)) \approx \text{SD}(g(Y_\ell)) \); that is, the function g allows \( Y_k \) and \( Y_\ell \) to have unequal means but approximately equal standard deviations.

We're going to use two mathematical facts and one statistical fact.

MATH FACT 1: If g is any well-behaved function, then

\[ g(y) \approx g(\mu) + g'(\mu) \, (y - \mu) \]

This is just Taylor's theorem. It can be thought of as an approximate version of the mean value theorem in calculus. The symbol \( \mu \) can be any convenient value. When we use this with a random variable Y, we'll take \( \mu \) as the mean of the random variable.

MATH FACT 2: If the function g has a derivative for which \( g'(t) = \dfrac{k}{t} \) then the function \( g(t) = \log t \) is one possible solution. (There are many possible solutions, but this one is simplest. Also, we'll use base-e logs. The most detailed solution would be \( g(t) = A + k \log t \).)
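Math Fact 1 can be tried out on the function that Math Fact 2 points to. The sketch below takes \( g(t) = \log t \), whose derivative is \( 1/t \), and compares \( \log y \) with the first-order approximation \( \log \mu + (y - \mu)/\mu \) around an arbitrary value \( \mu = 100 \). The function name `first_order` and the chosen numbers are illustrative only.

```python
import math


def first_order(g, g_prime, mu, y):
    """First-order Taylor approximation g(mu) + g'(mu) * (y - mu)."""
    return g(mu) + g_prime(mu) * (y - mu)


def log_derivative(t):
    """Derivative of the natural log, the k = 1 case of g'(t) = k / t."""
    return 1.0 / t


mu = 100.0
for y in (90.0, 100.0, 110.0, 150.0):
    exact = math.log(y)
    approx = first_order(math.log, log_derivative, mu, y)
    print(f"y = {y:5.1f}   log y = {exact:.4f}   approximation = {approx:.4f}")
```

The approximation is exact at \( y = \mu \), very close for values within about 10% of \( \mu \), and noticeably worse at \( y = 150 \), which is why the argument is described as approximate.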