Introduction to Regression


 Claire Griffin
 2 years ago
1 Introduction to Regression
2 Regression is a means of predicting a dependent variable based on one or more independent variables. This is done by fitting a line or surface to the data points that minimizes the total error. The line or surface is called the regression model or equation. In this first section we will only work with simple linear bivariate regression (lines).
4 Homoscedasticity: equal variances. Heteroscedasticity: unequal variances. The value of one variable may increase at an increasing rate or decrease at a decreasing rate (a nonlinear relationship). Such data may be transformed to meet the linearity assumption.
5 $\hat{y} = a + bx$, where a is the intercept, b is the slope, x is the observed value, and ŷ is the predicted value.
6 More formally: In simple bivariate linear regression there are the following parameters:
ŷ: the predicted value.
a: the y-axis intercept (also called the constant).
b: the slope of the line (the change in the dependent or y variable for each unit change in the independent or x variable).
x: the observed value of the independent variable.
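As a minimal sketch, the equation can be written as a small Python function. The name `predict` and the coefficient values are illustrative assumptions and do not come from the slides.

```python
def predict(x, a, b):
    """Return the predicted value y-hat = a + b*x for one observed x."""
    return a + b * x

# Hypothetical intercept and slope, chosen only for illustration.
a, b = 2.2, 0.6

# Predicted values for a few observed x values.
xs = [1.0, 2.0, 3.0]
y_hats = [predict(x, a, b) for x in xs]
print(y_hats)
```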
7 [Figure: true line and sample line.]
8 Each observation of the dependent (y) variable may be expressed as the predicted value plus a residual (error):

$$y = a + bx + e = \hat{y} + e$$

where y is the actual value, ŷ is the predicted value, and e is the residual (error), the difference between the true value and the predicted value:

$$e = y - \hat{y}$$
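The decomposition of each observation into a predicted value and a residual can be sketched as follows. The five data points are made up, and the coefficients are assumed to be the fitted values for those points.

```python
# Hypothetical observations (made up for illustration).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

# Assumed coefficients of the fitted line for these data.
a, b = 2.2, 0.6

y_hat = [a + b * x for x in X]                   # predicted values
residuals = [y - yh for y, yh in zip(Y, y_hat)]  # e = y - y-hat

# Each observation is recovered as predicted value + residual.
rebuilt = [yh + e for yh, e in zip(y_hat, residuals)]
print(residuals)
```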
9 Regression Residuals: Unless the r² is 100%, there will be some amount of variation in y which remains unexplained by x. The unexplained variation is the error component of the regression equation. That error is the sum of the squared differences between each observed value and its value as predicted by the regression equation.
10 Key Points: We are trying to fit a line through a set of plotted points that minimizes the residuals (errors). This line is called the line of best fit. We fit this line in such a way that the sum of the squared residuals is minimized.
11 [Figure annotation: this distance is the residual.]
12 Determining which line is the best fit (there are an infinite number of potential lines) is easy. We need to define a line that passes through the point determined by the mean x value and the mean y value. The slope of this line needs to minimize the residual error.
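To illustrate the idea, the sketch below forces every candidate line through the point of means and compares the sum of squared residuals for a handful of slopes. The data and the candidate slopes are made up, and the coarse search is only a teaching device, not how the slope is computed in practice.

```python
# Hypothetical observations (made up for illustration).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
x_bar = sum(X) / len(X)
y_bar = sum(Y) / len(Y)

def sse_through_mean(slope):
    """Sum of squared residuals for the line through (x_bar, y_bar) with this slope."""
    intercept = y_bar - slope * x_bar
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(X, Y))

# Try a few candidate slopes and keep the one with the smallest error.
candidate_slopes = [0.0, 0.3, 0.6, 0.9, 1.2]
best_slope = min(candidate_slopes, key=sse_through_mean)
print(best_slope, sse_through_mean(best_slope))
```

Among these candidates the smallest error occurs at the slope that the least squares formulas also give for these data.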
13 Notice that not all of the points fall on the line.
14 Explained and Unexplained Variation: The variation in the dependent (y) variable can be partitioned. This is similar to the TSS, BSS, and WSS terms in AOV. The three parts are: the total variation in the dependent (y) variable; the variation in the dependent (y) variable explained by the independent (x) variable; and the variation in the dependent (y) variable NOT explained by the independent (x) variable (residual).
15 [Figure labels: A = total variation in y; B = explained variation in y; C = residual.]
16 In the form of an equation:

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Total sum of squares = regression (explained) sum of squares + residual (unexplained) sum of squares.
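The partition can be checked numerically. This is a sketch on made-up data, with the coefficients assumed to be the least squares estimates for those data; the identity only holds for the least squares line.

```python
# Hypothetical observations (made up for illustration).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6   # assumed least squares coefficients for these data

y_bar = sum(Y) / len(Y)
y_hat = [a + b * x for x in X]

total = sum((y - y_bar) ** 2 for y in Y)                     # total variation
explained = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by x
unexplained = sum((y - yh) ** 2 for y, yh in zip(Y, y_hat))  # residual

print(total, explained, unexplained)
```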
17 [Figure: line of best fit.]
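18 The proportion of the total variation in y that is explained is called the coefficient of determination, or r²:

$$r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

that is, the regression (explained) sum of squares divided by the total sum of squares.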
19 Key Points: The coefficient of determination (r²) is equal to the square of the correlation coefficient. The r² is equal to the explained sum of squares divided by the total sum of squares. r² is also equal to 1 minus the ratio of the residual sum of squares to the total sum of squares. What are the units of r²? What is the range of r²?
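The three ways of obtaining r² can be compared on a small example. The data are made up, and the variable names are illustrative choices rather than notation from the slides.

```python
import math

# Hypothetical observations (made up for illustration).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

# Sums of squared deviations and cross products.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
sxx = sum((x - x_bar) ** 2 for x in X)
syy = sum((y - y_bar) ** 2 for y in Y)   # total sum of squares

# Pearson correlation coefficient.
r = sxy / math.sqrt(sxx * syy)

# Least squares fit and its sums of squares.
b = sxy / sxx
a = y_bar - b * x_bar
explained = sum((a + b * x - y_bar) ** 2 for x in X)
residual = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))

r2_from_correlation = r ** 2
r2_from_explained = explained / syy
r2_from_residual = 1 - residual / syy
print(r2_from_correlation, r2_from_explained, r2_from_residual)
```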
20 Assumptions of Regression
1. The relationship between y (dependent) and x (independent) is linear.
2. The variance of the errors (residuals) does not vary with x.
3. The residuals are independent, meaning that the value of one residual does not influence the value of another.
4. The residuals are normally distributed.
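21 Machine calculation of a and b (intercept and slope):

$$b = \frac{\sum xy}{\sum x^2}$$

where

$$\sum xy = \sum XY - \frac{\sum X \sum Y}{n}, \qquad \sum x^2 = \sum X^2 - \frac{(\sum X)^2}{n}$$

$$a = \bar{Y} - b\bar{X}$$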
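A minimal sketch of the machine formulas, using raw sums only. The data are made up and the variable names are illustrative.

```python
# Hypothetical observations (made up for illustration).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

# Raw sums, as they would be accumulated on a calculator.
sum_X = sum(X)
sum_Y = sum(Y)
sum_XY = sum(x * y for x, y in zip(X, Y))
sum_X2 = sum(x * x for x in X)

# Corrected sums (deviations from the means).
sum_xy = sum_XY - sum_X * sum_Y / n
sum_x2 = sum_X2 - sum_X ** 2 / n

b = sum_xy / sum_x2                 # slope
a = sum_Y / n - b * (sum_X / n)     # intercept: mean of Y minus b times mean of X
print(a, b)
```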
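22 Machine calculation of r² (coefficient of determination):

$$TSS = \sum Y^2 - \frac{(\sum Y)^2}{n}$$

$$RSS = \frac{(\sum xy)^2}{\sum x^2}$$

$$ESS = TSS - RSS$$

$$r^2 = \frac{RSS}{TSS}$$

Here TSS is the total sum of squares, RSS is the regression (explained) sum of squares, and ESS is the error (residual) sum of squares.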
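The same raw-sum approach gives the sums of squares and r². This is a sketch on made-up data; RSS is taken here to mean the regression (explained) sum of squares and ESS the error sum of squares.

```python
# Hypothetical observations (made up for illustration).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

sum_X, sum_Y = sum(X), sum(Y)
sum_XY = sum(x * y for x, y in zip(X, Y))
sum_X2 = sum(x * x for x in X)
sum_Y2 = sum(y * y for y in Y)

sum_xy = sum_XY - sum_X * sum_Y / n
sum_x2 = sum_X2 - sum_X ** 2 / n

TSS = sum_Y2 - sum_Y ** 2 / n   # total sum of squares
RSS = sum_xy ** 2 / sum_x2      # regression (explained) sum of squares
ESS = TSS - RSS                 # error (residual) sum of squares
r2 = RSS / TSS                  # coefficient of determination
print(TSS, RSS, ESS, r2)
```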
23 Significance Testing in Regression
There are several hypotheses that are tested in regression:
1. That the variation explained by the model is not due to chance (F test).
2. That the slope of the regression line is significantly different from zero (t test of the β parameter).
3. That the y intercept is significantly different from zero (t test of the constant parameter). This test result can be ignored unless there is some reason to believe that the y intercept should be zero.
24 We can test the significance of the model using the F statistic in the following form:

$$F = \frac{RSS / (v - 1)}{ESS / (n - 2)}, \qquad df = (v - 1,\; n - 2)$$

where v = the number of parameters and n is the sample size. Since in bivariate linear regression we are estimating only 2 parameters (a and b), v will always be 2.
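As a rough sketch, the F statistic can be computed from the two sums of squares. The function name `f_statistic` and the input values are hypothetical; RSS is taken to be the regression sum of squares and ESS the error sum of squares. The sketch stops at the statistic itself, which would then be compared with a critical value from an F table.

```python
def f_statistic(rss, ess, n, v=2):
    """Return (F, df) for a bivariate regression.

    rss: regression (explained) sum of squares
    ess: error (residual) sum of squares
    n:   sample size
    v:   number of estimated parameters (2 for a and b)
    """
    f = (rss / (v - 1)) / (ess / (n - 2))
    return f, (v - 1, n - 2)

# Hypothetical sums of squares from a sample of 5 observations.
F, df = f_statistic(rss=3.6, ess=2.4, n=5)
print(F, df)
```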
25 We can test the null hypothesis that β = 0 using the t test in the following form:

$$t = \frac{b}{s_b}, \qquad df = n - 2$$

where $s_b$ is the standard deviation of the slope:

$$s_b = \sqrt{\frac{ESS / (n - 2)}{\sum x^2}}$$

When β = 0, for each unit change in x there is no change in y.
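A minimal sketch of the slope test follows. The function name `slope_t_statistic` and the input values are hypothetical, and ESS is taken to be the error sum of squares. Only the statistic is computed; judging significance still requires a t table with n - 2 degrees of freedom.

```python
import math

def slope_t_statistic(b, ess, sum_x2, n):
    """Return (t, s_b, df) for the test that the slope is zero.

    b:      estimated slope
    ess:    error (residual) sum of squares
    sum_x2: corrected sum of squares of X
    n:      sample size
    """
    s_b = math.sqrt((ess / (n - 2)) / sum_x2)  # standard deviation of the slope
    return b / s_b, s_b, n - 2

# Hypothetical values from a sample of 5 observations.
t, s_b, df = slope_t_statistic(b=0.6, ess=2.4, sum_x2=10.0, n=5)
print(t, s_b, df)
```

In the bivariate case the square of this t value equals the F statistic of the model, so the two tests agree.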
26 Key points regarding the F and t statistics: The F statistic tells you whether the association is due to chance. The t statistic tells you whether a specific independent variable is contributing to the model.
27 A few points concerning regression analysis: Be sure to specify the correct dependent variable, since the procedure makes an assumption concerning causality. Data may be transformed to meet the linearity requirement. Do not predict y values beyond the data range. If there is no causality, correlation is the correct method. For example, human leg and arm lengths are linearly associated, but one does not cause the other.
28 Sampling for regression should span the entire range of values whenever possible. Here the red dots mark a sample from a much larger hypothetical population. If only the red observations are used, there appears to be a causal association. If the entire range of data is used, the causal association disappears: Y does not change as X increases.
29 Residual plot examination: Normally distributed residuals appear scattered randomly about the mean residual line. Heteroscedastic residuals fan out from the residual mean line. If an important explanatory variable is missing, the predicted values increase as the observed values increase. Nonlinear association between the variables appears as an arc running through the mean residual line. The last three of the above (heteroscedasticity, missing variable, nonlinear relationship) point to data problems.
30 Residual Plots
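31 In-Class Example: Isle of Man TT Fatalities by Slope per Location

Columns: Location, Fatalities, Slope (Deg)
Alpine Cottage
Waterworks Corner  .1
Quarry Bends
Greeba Castle
Mountain Box
Ballaugh Bridge
Glentramman
Stonebreaker's Hut
Appledene
Handley's Corner
Glen Helen  8.1
Verandah

Worksheet formulas:

$$\sum xy = \sum XY - \frac{\sum X \sum Y}{n}, \qquad \sum x^2 = \sum X^2 - \frac{(\sum X)^2}{n}$$

$$b = \frac{\sum xy}{\sum x^2}, \qquad a = \bar{Y} - b\bar{X}$$

$$TSS = \sum Y^2 - \frac{(\sum Y)^2}{n}, \qquad RSS = \frac{(\sum xy)^2}{\sum x^2}, \qquad ESS = TSS - RSS, \qquad r^2 = \frac{RSS}{TSS}$$

$$F = \frac{RSS / (v - 1)}{ESS / (n - 2)}, \qquad df = (v - 1,\; n - 2)$$

$$t = \frac{b}{s_b}, \qquad s_b = \sqrt{\frac{ESS / (n - 2)}{\sum x^2}}, \qquad df = n - 2$$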
Mathematical Properties of the Least Squares Regression: The least squares regression line obeys certain mathematical properties which are useful to know in practice. The following properties can be established algebraically:
b) The mean of the fitted (predicted) values of Y is equal to the mean of the Y values: $\bar{\hat{Y}} = \bar{Y}$
c) The residuals of the regression line sum up to zero: $\sum e_i = 0$
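Two of these properties, that the fitted values have the same mean as Y and that the residuals sum to zero, can be checked numerically. This is a sketch on made-up data, with illustrative variable names.

```python
# Hypothetical observations (made up for illustration).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

# Least squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
a = y_bar - b * x_bar

fitted = [a + b * x for x in X]
residuals = [y - f for y, f in zip(Y, fitted)]

mean_fitted = sum(fitted) / n     # should match the mean of Y
residual_sum = sum(residuals)     # should be zero, up to rounding
print(mean_fitted, y_bar, residual_sum)
```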
