Simple Regression and Correlation


Simple Regression and Correlation

We are going to study the relation between two variables. Let us label the first as X and the second as Y. We will observe values for these two variables in the form of a two-column table of paired (X, Y) values.

We are interested in finding the relation between these two variables. To determine the relation between them we have to know the meaning of each of them. The following are some examples of X and Y, and the theoretical relation between them:

    X: Weight of a person                     Y: Height of a person            Positive relation
    X: Higher temperature                     Y: Electricity consumption       Positive relation
    X: KD exchange rate against the Dollar    Y: Temperature in Japan          No relation
    X: Price of a commodity                   Y: Quantity of items purchased   Negative relation

Simple Regression and Correlation

For these variables we could be interested in answers to the following questions:

a) Is there any relation between the two variables?
b) Does the relation (if any) take a linear form?
c) What is the direction of the relation (positive or negative)?
d) What is the magnitude of this relation (weak, medium or strong)?

These questions can be answered in two ways.

1) Partially, by plotting the scatter plot of X and Y. The following are the general forms of scatter plot one could obtain:

[Scatter plots: a complete positive linear relation (r = 1) and a complete negative linear relation (r = -1).]

[Scatter plots: a positive incomplete linear relation (0 < r < 1), a negative incomplete linear relation (-1 < r < 0), and no linear relation between X and Y (r = 0).]

Scatter Plot in MINITAB

Assuming that the variable X is in column C1 and Y is in column C2 of the MINITAB worksheet, we can get a high-resolution form of the scatter plot using the command:

Mtb> plot C2*C1

The general form of the plot command is:

Mtb> plot [Col. of Y] * [Col. of X]

Linear Correlation Coefficient (r)

The second and more precise method of answering the previous four questions is to use a numerical measure. Karl Pearson developed a measure of the strength of a linear relation between two variables. This measure is still widely used and is known as Pearson's linear correlation coefficient, or simply the linear correlation coefficient, denoted by "r".

The coefficient "r" is a scale that runs from -1 to +1. The value of "r" indicates the strength of the linear relation between the two variables X and Y, and its sign indicates the direction (negative or positive) of the relation: r = -1 is a perfect negative relation, r = 0 is no linear relation, and r = +1 is a perfect positive relation; values in between are read as strong, medium or weak according to how close they lie to -1 or +1.

Linear Correlation Coefficient

As an example, the following values of the correlation coefficient "r" are interpreted as follows:

    If r = -0.81   Strong negative linear relation
    If r =  0.34   Weak positive linear relation
    If r =  0.69   Medium positive linear relation
    If r =  1.00   Perfect positive linear relation
    If r = -1.00   Perfect negative linear relation
    If r = -0.71   Medium negative linear relation
    If r =  0      No linear relation

The formula for computing Pearson's coefficient is as follows:

    r = Σ(X - X̄)(Y - Ȳ) / sqrt[ Σ(X - X̄)² * Σ(Y - Ȳ)² ]    (definition form)

    r = SSxy / sqrt( SSxx * SSyy )                           (computation form)

where

    SSxy = ΣXY - (ΣX)(ΣY)/n
    SSxx = ΣX² - (ΣX)²/n
    SSyy = ΣY² - (ΣY)²/n

So, to compute "r" we need the values of the following sums: n, ΣX, ΣY, ΣX², ΣY² and ΣXY.
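As a cross-check on the hand computation, the computation form above can be sketched in a few lines of Python (Python is not part of the original MINITAB-based course; this is shown only to illustrate the formula):

```python
import math

def pearson_r(x, y):
    """Pearson's linear correlation coefficient via the computation form:
    r = SSxy / sqrt(SSxx * SSyy)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sx * sy / n
    ss_xx = sum(xi ** 2 for xi in x) - sx ** 2 / n
    ss_yy = sum(yi ** 2 for yi in y) - sy ** 2 / n
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# A perfectly linear decreasing relation gives r = -1.
print(pearson_r([1, 2, 3, 4], [10, 8, 6, 4]))  # -1.0
```

The function name `pearson_r` is of course an assumption, not a library call; the point is that only the six sums listed above are needed.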

Linear Correlation Coefficient Example

Example: for the two variables X and Y we have the following 5 observations. Determine the strength and direction of the linear relation (if any) between them.

[The table of the 5 observed (X, Y) pairs was lost in transcription.]

To compute Pearson's coefficient "r" we need to find the required sums, so we add columns for X², Y² and XY and total each column:

    n = 5
    ΣX  = 205
    ΣY  = 176
    ΣX² = 10425
    ΣY² = 7490
    ΣXY = 5655

    SSxx = 10425 - (205)²/5      = 2020
    SSyy = 7490  - (176)²/5      = 1294.8
    SSxy = 5655  - (205)(176)/5  = -1561

Linear Correlation Coefficient Example

With n = 5, ΣX = 205, ΣY = 176, ΣX² = 10425, ΣY² = 7490 and ΣXY = 5655:

    r = SSxy / sqrt( SSxx * SSyy )
      = -1561 / sqrt( 2020 * 1294.8 )
      = -0.965

which means that there is a strong negative linear relation between the two variables X and Y.

Computing the Correlation Coefficient in MINITAB: assuming that the data of X is in column C1 and the data of Y is in column C2, we can use the following MINITAB command to calculate "r":

Mtb> corr C1 C2
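The arithmetic above can be checked numerically; this short Python sketch (not part of the original MINITAB session) recomputes r from the column totals:

```python
import math

# Column totals from the example (n, ΣX, ΣY, ΣX², ΣY², ΣXY).
n, sx, sy, sxx, syy, sxy = 5, 205, 176, 10425, 7490, 5655

ss_xx = sxx - sx ** 2 / n    # 2020.0
ss_yy = syy - sy ** 2 / n    # 1294.8
ss_xy = sxy - sx * sy / n    # -1561.0

r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(round(r, 3))  # -0.965
```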

Linear Correlation Coefficient MINITAB Example

MTB > print c1 c2
  Row  X  Y
  [the five data rows were lost in transcription]
MTB > let c3=c1*c1
MTB > let c4=c2*c2
MTB > let c5=c1*c2
MTB > print c1-c5
  Row  X  Y  X^2  Y^2  XY
  [rows lost in transcription]
MTB > count c1
  Total number of observations in X = 5
MTB > sum c1
  Sum of X = 205
MTB > sum c2
  Sum of Y = 176
MTB > sum c3
  Sum of X^2 = 10425
MTB > sum c4
  Sum of Y^2 = 7490
MTB > sum c5
  Sum of XY = 5655

MTB > gstd
MTB > plot c2 c1
  [Character scatter plot of Y versus X, showing a clear decreasing pattern.]
MTB > corr c1 c2
  Correlations: X, Y
  Pearson correlation of X and Y = -0.965
  P-Value = [lost in transcription]

Linear Correlation Coefficient Hypothesis Testing

To test whether the population correlation coefficient differs from zero (H0: ρ = 0 versus H1: ρ ≠ 0), the test statistic is

    t = r * sqrt(n - 2) / sqrt(1 - r²)

which follows a t distribution with n - 2 degrees of freedom. For the example above:

    t = -6.39, compared against t with 3 df.

So for α = 0.05 and t with 3 df, the critical value is 3.18. Since |t| = 6.39 > 3.18, the decision is: reject H0 with 95% confidence.
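A sketch of this test in Python (illustrative only; the critical value 3.18 is the tabulated t(0.025, 3) used in the slide):

```python
import math

# Sums of squares from the correlation example.
n, ss_xx, ss_yy, ss_xy = 5, 2020, 1294.8, -1561

r = ss_xy / math.sqrt(ss_xx * ss_yy)              # about -0.965
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 2))                                # about -6.39

t_crit = 3.18                                     # tabulated t(0.025, df = n - 2 = 3)
print(abs(t) > t_crit)                            # True -> reject H0: rho = 0
```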

Regression

We have learned from the previous section how to examine the strength and direction of a linear relation that could link two variables X and Y. In linear regression, we are interested in forming or estimating the best linear function that ties the two variables together. The general form of a linear equation is:

    Y = a + b X      or      Y = b0 + b1 X

An example of such a linear equation is:

    Y = 4 + 3 X

Here a = 4 and b = 3. What is the interpretation of "a" and "b" in the general form of the linear equation?

To understand the interpretation, let us take the equation above as an example and find the values of Y for different values of X:

    X:  1   2   3   4
    Y:  7  10  13  16

    Changes in X:           Changes in Y:
    ΔX = 2 - 1 = 1          ΔY = 10 - 7  = 3
    ΔX = 3 - 2 = 1          ΔY = 13 - 10 = 3
    ΔX = 4 - 3 = 1          ΔY = 16 - 13 = 3

From the above example we see that whenever X changes by 1 unit, Y changes by 3 (the value of b), and when X equals 0 the value of Y equals 4 (the value of a in the general form of the linear equation).

Regression

    Y = a + b X

a = the starting or initial value of Y (i.e. the value of Y when X = 0).

b = (1) the rate of change in Y when X changes by 1 unit; or (2) the rate of change in Y divided by the rate of change in X (b = ΔY/ΔX); (3) b is also interpreted as the slope of the line Y = a + b X, i.e. the tangent (tan) of the angle between that line and the horizontal axis.

For this general form of linear equation, X is known as the independent variable, whereas Y is the dependent variable, as its value is determined by the values of X.
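The two interpretations of a and b can be illustrated with a tiny Python sketch (illustrative only, using the example line Y = 4 + 3X):

```python
def line(x, a=4, b=3):
    """The example line Y = a + b*X."""
    return a + b * x

# a is the value of Y at X = 0; b is the change in Y per unit change in X.
print(line(0))             # 4  (the intercept a)
print(line(5) - line(4))   # 3  (the slope b)
```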

Regression

    Y = a + b X

How do we estimate the best linear equation for Y on X? We usually start with values for the two variables X and Y. Here we should specify which of them we are going to consider the independent variable and which the dependent variable (the one we need to explain by the other). We start with a data layout similar to the one used in the correlation coefficient case: a two-column table of paired (X, Y) values.

Estimating the linear equation with Y as the dependent variable and X as the independent variable, from a set of data, is known as estimating the linear regression line of Y on X.

One way to help us estimate the linear regression of Y on X is to draw the scatter plot of X and Y and compute the correlation coefficient, as discussed before. If the scatter plot of Y and X takes exactly a linear form (with positive or negative slope), then estimating the best line that fits the data is easy and straightforward. But if the scatter plot does not perfectly follow a linear pattern, there are a number of ways to define and estimate a linear equation that would represent the data. The Least Squares Method is one of these methods, and it is widely used.

The idea of the least squares method is to fit a line for the linear relation Y = a + b X that passes through the middle of the data set. To achieve this, the method searches for the line that has the least squared error, that is, the estimated line with the lowest possible value of Σ(Y - Ŷ)², where:

    Y is the observed value of the dependent variable, and
    Ŷ is the estimated value of Y using the estimated regression line.

The quantity (Y - Ŷ) is known as the error term, residual, or deviation.

Regression

[Scatter plot of X and Y with the fitted line, showing an observed point Y, its fitted value Ŷ, and the deviation (Y - Ŷ).]

Since the data do not exactly follow a linear form, we can say the linear form fits the data with some error. That is:

    Y = a + b X + e

where "a" and "b" are unknown constants and "e" is an unobservable error term (the deviation from the line). We can then write the error term as (Y - Ŷ), or (Y - (a + b X)); its square, (Y - (a + b X))², is the squared deviation.

Regression

Summing the squared deviations over all values of Y gives the sum of squared deviations:

    Σ (Y - a - b X)²

To obtain the values of "a" and "b" that minimize this sum of squared errors, we partially differentiate the sum of squared deviations, once with respect to "a" and once with respect to "b", and then equate both resulting functions to zero, to obtain what are known as the normal equations:

    a n  + b ΣX  = ΣY
    a ΣX + b ΣX² = ΣXY

Solving these two equations for "a" and "b" we have:

    b̂ = [ ΣXY - (ΣX)(ΣY)/n ] / [ ΣX² - (ΣX)²/n ] = SSxy / SSxx

    â = Ȳ - b̂ X̄

so the estimated equation is then:

    Ŷ = â + b̂ X
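A minimal Python sketch of these least-squares formulas (illustrative, not the course's MINITAB workflow; the function name is an assumption):

```python
def least_squares(x, y):
    """Estimate a and b in Y = a + b*X by the least squares formulas
    b = SSxy / SSxx and a = Ybar - b * Xbar."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sx * sy / n
    ss_xx = sum(xi ** 2 for xi in x) - sx ** 2 / n
    b = ss_xy / ss_xx
    a = sy / n - b * sx / n
    return a, b

# Data generated from the exact line Y = 4 + 3X is recovered exactly.
print(least_squares([1, 2, 3, 4], [7, 10, 13, 16]))  # (4.0, 3.0)
```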

Regression Example

From the previous correlation example we have:

    n = 5, ΣX = 205, ΣY = 176, ΣX² = 10425, ΣY² = 7490, ΣXY = 5655
    SSxx = 2020, SSyy = 1294.8, SSxy = -1561

    b̂ = SSxy / SSxx = -1561 / 2020 = -0.773

    â = Ȳ - b̂ X̄ = 176/5 - (-0.773)(205/5) = 35.2 + (0.773)(41) = 66.88

The final least-squares estimated linear regression of Y on X is:

    Ŷ = 66.88 - 0.773 X

Note: "b" and "r" always have the same sign (+ or -), which is determined by the value of the shared numerator SSxy.

Regression Example

    Ŷ = 66.88 - 0.773 X

Usage of the estimated linear equation:

1) To further explain the relation between X and Y.
2) To predict values of Y (the dependent variable) for given values of X (the independent variable).

Explaining the relation in the previous example: the estimated slope is b̂ = ΔY/ΔX = -0.773, which indicates that when X increases by 1 unit, the value of Y decreases by 0.773. The initial value of Y (the value of Y when X = 0) is â = 66.88.

Prediction: what is the estimated (predicted) value of Y when X = 30?

    Ŷ(X=30) = 66.88 - 0.773 * 30 = 43.69

Similarly, the predicted value of Y when X = 80 is:

    Ŷ(X=80) = 66.88 - 0.773 * 80 = 5.04

The difference between an observed Y and its predicted value Ŷ is the error or deviation, Y - Ŷ.
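The example's estimates and predictions can be reproduced from the column totals with this short Python sketch (illustrative only):

```python
# Column totals from the regression example.
n, sx, sy, sxx, sxy_total = 5, 205, 176, 10425, 5655

ss_xy = sxy_total - sx * sy / n   # -1561.0
ss_xx = sxx - sx ** 2 / n         # 2020.0

b = ss_xy / ss_xx                 # about -0.773
a = sy / n - b * sx / n           # about 66.88

print(round(b, 3), round(a, 2))   # -0.773 66.88
print(round(a + b * 30, 1))       # predicted Y at X = 30, about 43.7
```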

Regression MINITAB Example

Assuming that the data of X is in column C1 and the data of Y is in column C2, we can use the following MINITAB command to estimate the values of "a" and "b":

Mtb> regr C2 1 C1

The general form of the Regression command is:

Mtb> regr [Col. of Y] 1 [Col. of X] [optional columns for the standard errors and Ŷ]

Examples:

Mtb> regr C2 1 C1
Mtb> regr C2 1 C1 C5 C6

MTB > print c1-c2
  [data rows lost in transcription]
MTB > gstd
MTB > plot c2 c1
  [Character scatter plot of Y versus X, showing a decreasing pattern.]
MTB > corr c1 c2
  Correlations: X, Y
  Pearson correlation of X and Y = -0.965

Regression MINITAB Example

MTB > regr c2 1 c1

Regression Analysis: Y versus X

The regression equation is
  Y = 66.9 - 0.773 X

Predictor  Coef     SE Coef  T      P
Constant   66.884   5.518    12.12  [lost]
X          -0.7728  0.1208   -6.39  [lost]

S = 5.431   R-Sq = 93.2%   R-Sq(adj) = 90.9%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression      1   1206.3  1206.3  40.89  [lost]
Residual Error  3     88.5    29.5
Total           4   1294.8

MTB > regr c2 1 c1 c5 c6
MTB > print c1 c2 c5 c6

Data Display
  Row  X  Y  st.error  Yhat
  [numeric rows lost in transcription]

Regression: Assumptions of the Regression Model

Consider the population regression model:

    Y = a + b X + ε

We make four assumptions when estimating this model from a sample:

Assumption 1: The random error term ε has a mean equal to zero for each x.
Assumption 2: The errors associated with different observations are independent.
Assumption 3: For any given x, the distribution of errors is Normal.
Assumption 4: The distribution of population errors for each x has the same constant standard deviation, denoted σ. This assumption indicates that the spread of points around the regression line is similar for all x values.

If we estimate a regression of Y on X using the least squares method, how do we know whether this estimate is good, and how do we know whether it is a reliable estimate? We need to test the validity of the estimated model for:

    Y = a + b X + ε

where:
    X is the independent variable;
    ε is a random variable that is distributed normally with mean 0 and standard deviation σ, i.e. ε ~ N(0, σ);
    Y is the dependent random variable, Y ~ N(a + b X, σ).

Regression: Steps for Estimated Model Validation

(1) Coefficient of Determination (r²):

r² is a numerical measure that takes a value between 0 and 1 (i.e. 0 ≤ r² ≤ 1) and represents the percentage of the total variation in the dependent variable (Y) that is explained by the estimated linear model Ŷ = â + b̂ X. It is computed by the formula:

    r² = Explained variation / Total variation in Y
       = Regression sum of squares / Total sum of squares
       = RSS / TSS

The larger the value of r², the stronger the estimated model. So r² is one indication or measure of the strength of the estimated model.

Example: if we compute r² and find that it equals 0.83, this is interpreted as: the estimated model (using the least squares method) has succeeded in explaining 83% of the total variation in the dependent variable Y. The remaining 17% is not explained by the estimated model; it could be pure error or some lack or deficiency in the estimated model.
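In Python, the computation behind r² can be sketched as follows (illustrative only; r² is taken here as the regression sum of squares over the total sum of squares, as in the formula above):

```python
def r_squared(y, y_hat):
    """Coefficient of determination: regression SS / total SS."""
    y_bar = sum(y) / len(y)
    rss = sum((yh - y_bar) ** 2 for yh in y_hat)   # regression sum of squares
    tss = sum((yi - y_bar) ** 2 for yi in y)       # total sum of squares
    return rss / tss

# Fitted values that reproduce the data exactly explain all variation.
y = [7, 10, 13, 16]
print(r_squared(y, y))  # 1.0
```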

Regression Notes:

1) TSS = Σ(Y - Ȳ)²  (Total Sum of Squares)
2) RSS = Σ(Ŷ - Ȳ)²  (Regression Sum of Squares)
3) ESS = Σ(Y - Ŷ)²  (Error Sum of Squares)
4) TSS = RSS + ESS, i.e. the total variation in Y splits into an explained part and an unexplained part.

Regression

(2) Test for the overall validity of the estimated model:

Given the assumptions of normality of the error term and the dependent variable Y, we can test the adequacy of the estimated model in representing the linear relation between Y and X. The procedure for this test follows the same steps we have used in all previous hypothesis-testing procedures. We start by formulating the two hypotheses, namely H0 and H1. In our case:

    H0: The estimated model does not fit the data (or: the model is not adequate).
    H1: The estimated model fits the data (or: the model is adequate).

Regression

To test the above hypotheses we compute the value of the test statistic through the analysis of variance (ANOVA) table:

    SOURCE      DF         SS          MS                       F (test stat.)
    Regression  k          Σ(Ŷ - Ȳ)²   Σ(Ŷ - Ȳ)² / k            Reg. MS / Error MS
    Error       n - k - 1  Σ(Y - Ŷ)²   Σ(Y - Ŷ)² / (n - k - 1)
    Total       n - 1      Σ(Y - Ȳ)²

The test statistic:

    Regression MS / Error MS  ~  F(k, n - k - 1)

where:
    n is the number of observations used in estimating the model,
    k is the number of independent variables (X) used in the model,
    Ŷ is the predicted value of Y for each value of X, and
    Ȳ is the mean of the dependent variable Y.

(3) Test for the model parameters:

The estimated model Ŷ = â + b̂ X has two parameters: â and b̂. We will test the hypothesis that b (the true value estimated by b̂) equals 0. The general form of the hypothesis in this case is:

    H0: b = b0   versus   H1: b ≠ b0

Similarly, we will test the hypothesis that a (the true value estimated by â) equals 0. The general form of the hypothesis in this case is:

    H0: a = a0   versus   H1: a ≠ a0

If b0 equals zero and the test shows that the true value of b could be zero, then X does not contribute to the equation and is not needed in the model; in that case, we can drop X from the model. Similarly, if a0 = 0 and the result of the test shows that this could be accepted, then a is dropped from the estimated model.
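The ANOVA table can be built directly from the observed and least-squares fitted values; a Python sketch (illustrative only, on a small made-up data set):

```python
def least_squares_fit(x, y):
    """Fitted values from the least squares line of Y on X."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    b = (sum(xi * yi for xi, yi in zip(x, y)) - sx * sy / n) / \
        (sum(xi * xi for xi in x) - sx * sx / n)
    a = sy / n - b * sx / n
    return [a + b * xi for xi in x]

def anova_table(y, y_hat, k=1):
    """Regression ANOVA sums of squares and F = Reg. MS / Error MS."""
    n = len(y)
    y_bar = sum(y) / n
    rss = sum((yh - y_bar) ** 2 for yh in y_hat)            # regression SS, df = k
    ess = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # error SS, df = n - k - 1
    tss = sum((yi - y_bar) ** 2 for yi in y)                # total SS, df = n - 1
    f = (rss / k) / (ess / (n - k - 1))
    return rss, ess, tss, f

x, y = [1, 2, 3, 4], [3, 5, 6, 10]    # small made-up data set
rss, ess, tss, f = anova_table(y, least_squares_fit(x, y))
print(round(rss, 1), round(ess, 1), round(tss, 1))  # 24.2 1.8 26.0
```

Note that RSS + ESS = TSS holds here because the fitted values come from the least squares line.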

Regression

The test statistics are:

    t = (b̂ - b0) / SE(b̂)       and       t = (â - a0) / SE(â)

Both statistics have a t distribution with (n - 2) df, the same DF as the error in the ANOVA. The standard errors of â and b̂ are not easy to compute by hand, and we are going to rely on the output of MINITAB to estimate them. For b̂:

    SE(b̂) = s / sqrt(SSxx),   where   s = sqrt(Error MS)

We can then estimate confidence intervals for a and b from these standard errors.
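For the earlier worked example these quantities can be recomputed with a Python sketch (illustrative only; s is the square root of the error mean square):

```python
import math

# Sums of squares from the worked example.
n, ss_xx, ss_yy, ss_xy = 5, 2020, 1294.8, -1561

b = ss_xy / ss_xx                  # about -0.773
sse = ss_yy - ss_xy ** 2 / ss_xx   # error sum of squares, about 88.5
s = math.sqrt(sse / (n - 2))       # sqrt of Error MS
se_b = s / math.sqrt(ss_xx)        # standard error of b-hat

t = b / se_b                       # test statistic for H0: b = 0
print(round(t, 2))                 # about -6.39
```

Reassuringly, this t equals the t obtained earlier from the correlation test, as it should in simple regression.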

Regression

Example: suppose we have the following values for the variables X and Y:

    n = 10
    Σx = 498, Σy = 226, Σxy = 11782, Σx² = 26290, Σy² = 5370

    SSxx = 26290 - (498)²/10      = 1489.6
    SSyy = 5370  - (226)²/10      = 262.4
    SSxy = 11782 - (498)(226)/10  = 527.2

[The table of the 10 observed (X, Y) pairs was lost in transcription.]

If we input the values of these variables in columns 1 and 2 of the MINITAB worksheet and use the commands print and desc, we have:

MTB > print c1 c2
  [data rows lost in transcription]
MTB > desc c1 c2

Descriptive Statistics: X, Y

Variable  N   Mean   StDev  SE Mean
X         10  49.80  12.87   4.07
Y         10  22.60   5.40   1.71

[The Median, TrMean, Minimum, Maximum, Q1 and Q3 columns were lost in transcription.]

To see the relation between the two variables X and Y we plot the scatter plot between them using the MINITAB Plot command:

MTB > GStd.
MTB > Plot 'Y' 'X';
SUBC> Symbol '*'.

[Character scatter plot of Y versus X showing an increasing, roughly linear pattern.]

The scatter plot indicates that there is a non-perfect but positive correlation between the two variables. This can be further confirmed by computing Pearson's linear correlation coefficient (r) using the MINITAB corr command.

MTB > corr c1 c2
  Correlations: X, Y
  Pearson correlation of X and Y = 0.843
  P-Value = 0.002

The value of the correlation coefficient confirms that there is a strong positive linear correlation between X and Y. To estimate the linear regression line of Y on X (in which Y is the dependent variable and X is the independent one) we will use the MINITAB command Regress c2 1 c1.

To further store the values of the standardized errors in column C3 and the predicted values of Y in column C4, we use the command Regress c2 1 c1 c3 c4. The following is the output of that command:

MTB > regr c2 1 c1

Regression Analysis: Y versus X

The regression equation is
  Y = 4.97 + 0.354 X

Predictor  Coef    SE Coef  T     P
Constant   4.975   4.090    1.22  0.26
X          0.3539  0.0798   4.44  0.002

S = 3.078   R-Sq = 71.1%   R-Sq(adj) = 67.5%

Analysis of Variance
Source          DF  SS     MS     F      P
Regression      1   186.6  186.6  19.69  0.002
Residual Error  8    75.8    9.5
Total           9   262.4

Regression

MTB > print c1 c2 c3 c4
  Row  X  Y  SE  yhat
  [numeric rows lost in transcription]

From the above results, the best linear estimated regression line using the least squares method is:

    Ŷ = 4.97 + 0.354 X

To test the validity of the model we will use the previous output from MINITAB.

(1) Coefficient of Determination (r²): the value of r² is presented in the above result: r² = 0.711, or 71.1%. To show how this value is computed, we can use either of the formulas:

    r² = Σ(Ŷ - Ȳ)² / Σ(Y - Ȳ)² = RSS / TSS

or, from the sums of squares:

    r² = b̂ * SSxy / SSyy = (0.354)(527.2) / 262.4 = 0.711

or

    r² = SSxy² / (SSxx * SSyy) = (527.2)² / (1489.6 * 262.4) = 0.711

Regression

MTB > let c5=c4-mean(c2)
MTB > name c5 'YhYB'
MTB > let c6=c5**2
MTB > name c6 '(YhYB)2'
MTB > let c7 = (c2-mean(c2))**2
MTB > name c7 '(YYB)2'
MTB > print c1 c2 c4 c5-c7
  [numeric rows lost in transcription]
MTB > let k1=sum(c6)
MTB > let k2=sum(c7)
MTB > let k3=k1/k2
MTB > print k1-k3
  K1   186.6    (regression sum of squares)
  K2   262.4    (total sum of squares)
  K3   0.711    (r²)

r² = 0.711, or 71.1%, means that the estimated regression model of Y on X (i.e. Ŷ = 4.97 + 0.354 X) managed to explain 71.1% of the total variation in the dependent variable Y. The remaining 28.9% is not explained by the model, either because the model used is not adequate, or because that part is a pure error term that cannot be explained.

We can also compute the same value directly from the results of the ANOVA table presented above, by dividing the regression sum of squares by the total sum of squares:

    r² = 186.6 / 262.4 = 0.711

Regression

The same value is presented in the above output line:

    S = 3.078   R-Sq = 71.1%   R-Sq(adj) = 67.5%

It is worth mentioning here that for the simple regression model the coefficient of determination (r²) equals the square of Pearson's coefficient (r):

    (r)² = (0.843)² = 0.711

Conclusion: 71.1% is a good and acceptable ratio, so the model is considered adequate enough.

(2) Testing the validity of the overall estimated model: we will test the following two hypotheses using the results from the ANOVA presented above:

    H0: The estimated model does not fit the data (the estimated model is not good or accepted).
    H1: The estimated model fits the data (the estimated model is good and accepted).

The number of independent variables in the model is k = 1. The ANOVA results presented in the above MINITAB output are:

    Analysis of Variance
    Source          DF  SS     MS     F      P
    Regression      1   186.6  186.6  19.69  0.002
    Residual Error  8    75.8    9.5
    Total           9   262.4

Regression

So, the test statistic for testing the above hypotheses equals:

    F(calculated) = 19.69, which follows the F distribution with 1 and 8 DF. The P-value = 0.002.

The tabulated F value for α = 0.05 with 1 and 8 DF equals 5.32. Since F(calculated) = 19.69 > 5.32, the conclusion is: reject H0 (that the estimated model does not fit the data) with 95% confidence. This means that we have statistical evidence that the estimated model is accepted and good enough to fit and represent the linear relation between the two variables.

(3) Testing each component of the estimated model: if the validity of the model as a whole is accepted in the second step above, we can go ahead and test the importance, and the necessity of keeping, each of the elements that form the estimated model. If the true model we are estimating is:

    Y = a + b X + e

and the estimated model is:

    Ŷ = â + b̂ X

then we will test whether a = 0 (i.e. whether we can eliminate a from the estimated model, leaving Ŷ = b̂ X).

Regression

Similarly, we will test whether b = 0 (i.e. whether we can eliminate b and the contribution of X, and estimate the model with the contribution of a alone; a would then be the mean of the dependent variable Y, and the estimated model becomes Ŷ = â, or simply Ŷ = Ȳ).

For that, we will test the hypotheses:

    H0: b = 0          H0: a = 0
    H1: b ≠ 0   and    H1: a ≠ 0

using the sums of squares SSxx = 1489.6, SSyy = 262.4, SSxy = 527.2 and the MINITAB parameter table.

The regression equation is
  Y = 4.97 + 0.354 X

Predictor  Coef    SE Coef  T     P
Constant   4.975   4.090    1.22  0.26
X          0.3539  0.0798   4.44  0.002

Regression

Based on the above results, we can conclude the following:

- For b: since the P-value 0.002 is below 0.05, we reject H0 that b = 0. This means that the contribution of X is not negligible and it should be included in the model.
- For a: the P-value is above 0.05, so the effect of a is not important and a can be eliminated from the model. Even if we leave it in the model it will not harm the model, as its role and contribution are not of significant importance.

Regression: Confidence Intervals for Model Parameters

As we have shown before, one can estimate the confidence interval for b using the equation:

    P[ b̂ - t(α/2, n-k-1) * SE(b̂)  ≤  b  ≤  b̂ + t(α/2, n-k-1) * SE(b̂) ] = 1 - α

and the confidence interval for a:

    P[ â - t(α/2, n-k-1) * SE(â)  ≤  a  ≤  â + t(α/2, n-k-1) * SE(â) ] = 1 - α

To estimate these two confidence intervals, we will use the MINITAB output, mainly the estimated values of the two parameters and their standard errors shown in the output below:

    Predictor  Coef    SE Coef
    Constant   4.975   4.090
    X          0.3539  0.0798

For α = 0.05 the tabulated value of t(0.05/2, 8) = 2.306, and the estimated confidence interval for b is:

    P[ 0.3539 - 2.306 * 0.0798  ≤  b  ≤  0.3539 + 2.306 * 0.0798 ] = 0.95
    P[ 0.170  ≤  b  ≤  0.538 ] = 0.95

and the estimated confidence interval for a is:

    P[ 4.975 - 2.306 * 4.090  ≤  a  ≤  4.975 + 2.306 * 4.090 ] = 0.95
    P[ -4.457  ≤  a  ≤  14.407 ] = 0.95
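These intervals can be recomputed with a short Python sketch (illustrative; the critical value 2.306 is the tabulated t(0.025, 8), and the estimates and standard errors are the ones reported by MINITAB):

```python
def conf_int(estimate, se, t_crit):
    """Two-sided confidence interval: estimate +/- t_crit * SE."""
    half = t_crit * se
    return estimate - half, estimate + half

t_crit = 2.306                                  # tabulated t(0.025, df = 8)

lo_b, hi_b = conf_int(0.3539, 0.0798, t_crit)   # interval for the slope b
lo_a, hi_a = conf_int(4.975, 4.090, t_crit)     # interval for the intercept a

print(round(lo_b, 3), round(hi_b, 3))   # about 0.170 and 0.538
print(lo_b > 0)                         # True: the interval for b excludes 0
print(lo_a < 0 < hi_a)                  # True: the interval for a contains 0
```

The two prints at the end restate the earlier conclusions: b is significantly different from 0, while a is not.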


More information

One-Way Analysis of Variance: A Guide to Testing Differences Between Multiple Groups

One-Way Analysis of Variance: A Guide to Testing Differences Between Multiple Groups One-Way Analysis of Variance: A Guide to Testing Differences Between Multiple Groups In analysis of variance, the main research question is whether the sample means are from different populations. The

More information

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

More information

Module 3: Correlation and Covariance

Module 3: Correlation and Covariance Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

More information

Regression step-by-step using Microsoft Excel

Regression step-by-step using Microsoft Excel Step 1: Regression step-by-step using Microsoft Excel Notes prepared by Pamela Peterson Drake, James Madison University Type the data into the spreadsheet The example used throughout this How to is a regression

More information

The correlation coefficient

The correlation coefficient The correlation coefficient Clinical Biostatistics The correlation coefficient Martin Bland Correlation coefficients are used to measure the of the relationship or association between two quantitative

More information

How To Run Statistical Tests in Excel

How To Run Statistical Tests in Excel How To Run Statistical Tests in Excel Microsoft Excel is your best tool for storing and manipulating data, calculating basic descriptive statistics such as means and standard deviations, and conducting

More information

Answer: C. The strength of a correlation does not change if units change by a linear transformation such as: Fahrenheit = 32 + (5/9) * Centigrade

Answer: C. The strength of a correlation does not change if units change by a linear transformation such as: Fahrenheit = 32 + (5/9) * Centigrade Statistics Quiz Correlation and Regression -- ANSWERS 1. Temperature and air pollution are known to be correlated. We collect data from two laboratories, in Boston and Montreal. Boston makes their measurements

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression A regression with two or more explanatory variables is called a multiple regression. Rather than modeling the mean response as a straight line, as in simple regression, it is

More information

Regression III: Advanced Methods

Regression III: Advanced Methods Lecture 16: Generalized Additive Models Regression III: Advanced Methods Bill Jacoby Michigan State University http://polisci.msu.edu/jacoby/icpsr/regress3 Goals of the Lecture Introduce Additive Models

More information

5. Linear Regression

5. Linear Regression 5. Linear Regression Outline.................................................................... 2 Simple linear regression 3 Linear model............................................................. 4

More information

Notes on Applied Linear Regression

Notes on Applied Linear Regression Notes on Applied Linear Regression Jamie DeCoster Department of Social Psychology Free University Amsterdam Van der Boechorststraat 1 1081 BT Amsterdam The Netherlands phone: +31 (0)20 444-8935 email:

More information

INTRODUCTION TO MULTIPLE CORRELATION

INTRODUCTION TO MULTIPLE CORRELATION CHAPTER 13 INTRODUCTION TO MULTIPLE CORRELATION Chapter 12 introduced you to the concept of partialling and how partialling could assist you in better interpreting the relationship between two primary

More information

General Regression Formulae ) (N-2) (1 - r 2 YX

General Regression Formulae ) (N-2) (1 - r 2 YX General Regression Formulae Single Predictor Standardized Parameter Model: Z Yi = β Z Xi + ε i Single Predictor Standardized Statistical Model: Z Yi = β Z Xi Estimate of Beta (Beta-hat: β = r YX (1 Standard

More information

Module 5: Multiple Regression Analysis

Module 5: Multiple Regression Analysis Using Statistical Data Using to Make Statistical Decisions: Data Multiple to Make Regression Decisions Analysis Page 1 Module 5: Multiple Regression Analysis Tom Ilvento, University of Delaware, College

More information

Solution Let us regress percentage of games versus total payroll.

Solution Let us regress percentage of games versus total payroll. Assignment 3, MATH 2560, Due November 16th Question 1: all graphs and calculations have to be done using the computer The following table gives the 1999 payroll (rounded to the nearest million dolars)

More information

The importance of graphing the data: Anscombe s regression examples

The importance of graphing the data: Anscombe s regression examples The importance of graphing the data: Anscombe s regression examples Bruce Weaver Northern Health Research Conference Nipissing University, North Bay May 30-31, 2008 B. Weaver, NHRC 2008 1 The Objective

More information

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

 Y. Notation and Equations for Regression Lecture 11/4. Notation: Notation: Notation and Equations for Regression Lecture 11/4 m: The number of predictor variables in a regression Xi: One of multiple predictor variables. The subscript i represents any number from 1 through

More information

Title: Modeling for Prediction Linear Regression with Excel, Minitab, Fathom and the TI-83

Title: Modeling for Prediction Linear Regression with Excel, Minitab, Fathom and the TI-83 Title: Modeling for Prediction Linear Regression with Excel, Minitab, Fathom and the TI-83 Brief Overview: In this lesson section, the class is going to be exploring data through linear regression while

More information

Part 2: Analysis of Relationship Between Two Variables

Part 2: Analysis of Relationship Between Two Variables Part 2: Analysis of Relationship Between Two Variables Linear Regression Linear correlation Significance Tests Multiple regression Linear Regression Y = a X + b Dependent Variable Independent Variable

More information

MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL. by Michael L. Orlov Chemistry Department, Oregon State University (1996)

MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL. by Michael L. Orlov Chemistry Department, Oregon State University (1996) MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL by Michael L. Orlov Chemistry Department, Oregon State University (1996) INTRODUCTION In modern science, regression analysis is a necessary part

More information

Elementary Statistics Sample Exam #3

Elementary Statistics Sample Exam #3 Elementary Statistics Sample Exam #3 Instructions. No books or telephones. Only the supplied calculators are allowed. The exam is worth 100 points. 1. A chi square goodness of fit test is considered to

More information

1 Simple Linear Regression I Least Squares Estimation

1 Simple Linear Regression I Least Squares Estimation Simple Linear Regression I Least Squares Estimation Textbook Sections: 8. 8.3 Previously, we have worked with a random variable x that comes from a population that is normally distributed with mean µ and

More information

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING In this lab you will explore the concept of a confidence interval and hypothesis testing through a simulation problem in engineering setting.

More information

TRINITY COLLEGE. Faculty of Engineering, Mathematics and Science. School of Computer Science & Statistics

TRINITY COLLEGE. Faculty of Engineering, Mathematics and Science. School of Computer Science & Statistics UNIVERSITY OF DUBLIN TRINITY COLLEGE Faculty of Engineering, Mathematics and Science School of Computer Science & Statistics BA (Mod) Enter Course Title Trinity Term 2013 Junior/Senior Sophister ST7002

More information

Section 3 Part 1. Relationships between two numerical variables

Section 3 Part 1. Relationships between two numerical variables Section 3 Part 1 Relationships between two numerical variables 1 Relationship between two variables The summary statistics covered in the previous lessons are appropriate for describing a single variable.

More information

Week TSX Index 1 8480 2 8470 3 8475 4 8510 5 8500 6 8480

Week TSX Index 1 8480 2 8470 3 8475 4 8510 5 8500 6 8480 1) The S & P/TSX Composite Index is based on common stock prices of a group of Canadian stocks. The weekly close level of the TSX for 6 weeks are shown: Week TSX Index 1 8480 2 8470 3 8475 4 8510 5 8500

More information

SPSS Guide: Regression Analysis

SPSS Guide: Regression Analysis SPSS Guide: Regression Analysis I put this together to give you a step-by-step guide for replicating what we did in the computer lab. It should help you run the tests we covered. The best way to get familiar

More information

Using Excel for Statistical Analysis

Using Excel for Statistical Analysis Using Excel for Statistical Analysis You don t have to have a fancy pants statistics package to do many statistical functions. Excel can perform several statistical tests and analyses. First, make sure

More information

Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares

Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares Topic 4 - Analysis of Variance Approach to Regression Outline Partitioning sums of squares Degrees of freedom Expected mean squares General linear test - Fall 2013 R 2 and the coefficient of correlation

More information

II. DISTRIBUTIONS distribution normal distribution. standard scores

II. DISTRIBUTIONS distribution normal distribution. standard scores Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,

More information

c 2015, Jeffrey S. Simonoff 1

c 2015, Jeffrey S. Simonoff 1 Modeling Lowe s sales Forecasting sales is obviously of crucial importance to businesses. Revenue streams are random, of course, but in some industries general economic factors would be expected to have

More information

Statistical Models in R

Statistical Models in R Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Linear Models in R Regression Regression analysis is the appropriate

More information

Scatter Plot, Correlation, and Regression on the TI-83/84

Scatter Plot, Correlation, and Regression on the TI-83/84 Scatter Plot, Correlation, and Regression on the TI-83/84 Summary: When you have a set of (x,y) data points and want to find the best equation to describe them, you are performing a regression. This page

More information

WEB APPENDIX. Calculating Beta Coefficients. b Beta Rise Run Y 7.1 1 8.92 X 10.0 0.0 16.0 10.0 1.6

WEB APPENDIX. Calculating Beta Coefficients. b Beta Rise Run Y 7.1 1 8.92 X 10.0 0.0 16.0 10.0 1.6 WEB APPENDIX 8A Calculating Beta Coefficients The CAPM is an ex ante model, which means that all of the variables represent before-thefact, expected values. In particular, the beta coefficient used in

More information

Example: Boats and Manatees

Example: Boats and Manatees Figure 9-6 Example: Boats and Manatees Slide 1 Given the sample data in Table 9-1, find the value of the linear correlation coefficient r, then refer to Table A-6 to determine whether there is a significant

More information

August 2012 EXAMINATIONS Solution Part I

August 2012 EXAMINATIONS Solution Part I August 01 EXAMINATIONS Solution Part I (1) In a random sample of 600 eligible voters, the probability that less than 38% will be in favour of this policy is closest to (B) () In a large random sample,

More information

12: Analysis of Variance. Introduction

12: Analysis of Variance. Introduction 1: Analysis of Variance Introduction EDA Hypothesis Test Introduction In Chapter 8 and again in Chapter 11 we compared means from two independent groups. In this chapter we extend the procedure to consider

More information

AP Physics 1 and 2 Lab Investigations

AP Physics 1 and 2 Lab Investigations AP Physics 1 and 2 Lab Investigations Student Guide to Data Analysis New York, NY. College Board, Advanced Placement, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks

More information

Introduction to Quantitative Methods

Introduction to Quantitative Methods Introduction to Quantitative Methods October 15, 2009 Contents 1 Definition of Key Terms 2 2 Descriptive Statistics 3 2.1 Frequency Tables......................... 4 2.2 Measures of Central Tendencies.................

More information

Session 7 Bivariate Data and Analysis

Session 7 Bivariate Data and Analysis Session 7 Bivariate Data and Analysis Key Terms for This Session Previously Introduced mean standard deviation New in This Session association bivariate analysis contingency table co-variation least squares

More information

Using Excel for inferential statistics

Using Excel for inferential statistics FACT SHEET Using Excel for inferential statistics Introduction When you collect data, you expect a certain amount of variation, just caused by chance. A wide variety of statistical tests can be applied

More information

The Volatility Index Stefan Iacono University System of Maryland Foundation

The Volatility Index Stefan Iacono University System of Maryland Foundation 1 The Volatility Index Stefan Iacono University System of Maryland Foundation 28 May, 2014 Mr. Joe Rinaldi 2 The Volatility Index Introduction The CBOE s VIX, often called the market fear gauge, measures

More information

Chapter 7. One-way ANOVA

Chapter 7. One-way ANOVA Chapter 7 One-way ANOVA One-way ANOVA examines equality of population means for a quantitative outcome and a single categorical explanatory variable with any number of levels. The t-test of Chapter 6 looks

More information

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel

More information

Using R for Linear Regression

Using R for Linear Regression Using R for Linear Regression In the following handout words and symbols in bold are R functions and words and symbols in italics are entries supplied by the user; underlined words and symbols are optional

More information

Lesson 1: Comparison of Population Means Part c: Comparison of Two- Means

Lesson 1: Comparison of Population Means Part c: Comparison of Two- Means Lesson : Comparison of Population Means Part c: Comparison of Two- Means Welcome to lesson c. This third lesson of lesson will discuss hypothesis testing for two independent means. Steps in Hypothesis

More information

Simple Regression Theory II 2010 Samuel L. Baker

Simple Regression Theory II 2010 Samuel L. Baker SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the

More information

Chapter 23. Inferences for Regression

Chapter 23. Inferences for Regression Chapter 23. Inferences for Regression Topics covered in this chapter: Simple Linear Regression Simple Linear Regression Example 23.1: Crying and IQ The Problem: Infants who cry easily may be more easily

More information

Correlation Coefficient The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables 2

Correlation Coefficient The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables 2 Lesson 4 Part 1 Relationships between two numerical variables 1 Correlation Coefficient The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables

More information

MGT 267 PROJECT. Forecasting the United States Retail Sales of the Pharmacies and Drug Stores. Done by: Shunwei Wang & Mohammad Zainal

MGT 267 PROJECT. Forecasting the United States Retail Sales of the Pharmacies and Drug Stores. Done by: Shunwei Wang & Mohammad Zainal MGT 267 PROJECT Forecasting the United States Retail Sales of the Pharmacies and Drug Stores Done by: Shunwei Wang & Mohammad Zainal Dec. 2002 The retail sale (Million) ABSTRACT The present study aims

More information

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September

More information

Data Analysis Tools. Tools for Summarizing Data

Data Analysis Tools. Tools for Summarizing Data Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool

More information

2. Linear regression with multiple regressors

2. Linear regression with multiple regressors 2. Linear regression with multiple regressors Aim of this section: Introduction of the multiple regression model OLS estimation in multiple regression Measures-of-fit in multiple regression Assumptions

More information

2 Sample t-test (unequal sample sizes and unequal variances)

2 Sample t-test (unequal sample sizes and unequal variances) Variations of the t-test: Sample tail Sample t-test (unequal sample sizes and unequal variances) Like the last example, below we have ceramic sherd thickness measurements (in cm) of two samples representing

More information

Lin s Concordance Correlation Coefficient

Lin s Concordance Correlation Coefficient NSS Statistical Software NSS.com hapter 30 Lin s oncordance orrelation oefficient Introduction This procedure calculates Lin s concordance correlation coefficient ( ) from a set of bivariate data. The

More information

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number 1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x - x) B. x 3 x C. 3x - x D. x - 3x 2) Write the following as an algebraic expression

More information

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE STATISTICS The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE VS. INFERENTIAL STATISTICS Descriptive To organize,

More information

MULTIPLE REGRESSION EXAMPLE

MULTIPLE REGRESSION EXAMPLE MULTIPLE REGRESSION EXAMPLE For a sample of n = 166 college students, the following variables were measured: Y = height X 1 = mother s height ( momheight ) X 2 = father s height ( dadheight ) X 3 = 1 if

More information

Notes on logarithms Ron Michener Revised January 2003

Notes on logarithms Ron Michener Revised January 2003 Notes on logarithms Ron Michener Revised January 2003 In applied work in economics, it is often the case that statistical work is done using the logarithms of variables, rather than the raw variables themselves.

More information

Simple Linear Regression, Scatterplots, and Bivariate Correlation

Simple Linear Regression, Scatterplots, and Bivariate Correlation 1 Simple Linear Regression, Scatterplots, and Bivariate Correlation This section covers procedures for testing the association between two continuous variables using the SPSS Regression and Correlate analyses.

More information

Doing Multiple Regression with SPSS. In this case, we are interested in the Analyze options so we choose that menu. If gives us a number of choices:

Doing Multiple Regression with SPSS. In this case, we are interested in the Analyze options so we choose that menu. If gives us a number of choices: Doing Multiple Regression with SPSS Multiple Regression for Data Already in Data Editor Next we want to specify a multiple regression analysis for these data. The menu bar for SPSS offers several options:

More information

Homework 11. Part 1. Name: Score: / null

Homework 11. Part 1. Name: Score: / null Name: Score: / Homework 11 Part 1 null 1 For which of the following correlations would the data points be clustered most closely around a straight line? A. r = 0.50 B. r = -0.80 C. r = 0.10 D. There is

More information

Recall this chart that showed how most of our course would be organized:

Recall this chart that showed how most of our course would be organized: Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical

More information

(More Practice With Trend Forecasts)

(More Practice With Trend Forecasts) Stats for Strategy HOMEWORK 11 (Topic 11 Part 2) (revised Jan. 2016) DIRECTIONS/SUGGESTIONS You may conveniently write answers to Problems A and B within these directions. Some exercises include special

More information

An analysis appropriate for a quantitative outcome and a single quantitative explanatory. 9.1 The model behind linear regression

An analysis appropriate for a quantitative outcome and a single quantitative explanatory. 9.1 The model behind linear regression Chapter 9 Simple Linear Regression An analysis appropriate for a quantitative outcome and a single quantitative explanatory variable. 9.1 The model behind linear regression When we are examining the relationship

More information

MULTIPLE REGRESSION WITH CATEGORICAL DATA

MULTIPLE REGRESSION WITH CATEGORICAL DATA DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 86 MULTIPLE REGRESSION WITH CATEGORICAL DATA I. AGENDA: A. Multiple regression with categorical variables. Coding schemes. Interpreting

More information

Module 5: Statistical Analysis

Module 5: Statistical Analysis Module 5: Statistical Analysis To answer more complex questions using your data, or in statistical terms, to test your hypothesis, you need to use more advanced statistical tests. This module reviews the

More information

Correlation and Simple Linear Regression

Correlation and Simple Linear Regression Correlation and Simple Linear Regression We are often interested in studying the relationship among variables to determine whether they are associated with one another. When we think that changes in a

More information

GLM I An Introduction to Generalized Linear Models

GLM I An Introduction to Generalized Linear Models GLM I An Introduction to Generalized Linear Models CAS Ratemaking and Product Management Seminar March 2009 Presented by: Tanya D. Havlicek, Actuarial Assistant 0 ANTITRUST Notice The Casualty Actuarial

More information

Example G Cost of construction of nuclear power plants

Example G Cost of construction of nuclear power plants 1 Example G Cost of construction of nuclear power plants Description of data Table G.1 gives data, reproduced by permission of the Rand Corporation, from a report (Mooz, 1978) on 32 light water reactor

More information