Chapter 12. Simple Linear Regression and Correlation


Contents
12.1 The Simple Linear Regression Model
12.2 Fitting the Regression Line
12.3 Inferences on the Slope Parameter β1
12.4 Inferences on the Regression Line
12.5 Prediction Intervals for Future Response Values
12.6 The Analysis of Variance Table
12.7 Residual Analysis
12.8 Variable Transformations
12.9 Correlation Analysis
Supplementary Problems

12.1 The Simple Linear Regression Model
Model Definition and Assumptions (1/5)

With the simple linear regression model
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
the observed value of the dependent variable $y_i$ is composed of a linear function $\beta_0 + \beta_1 x_i$ of the explanatory variable $x_i$, together with an error term $\varepsilon_i$. The error terms $\varepsilon_1, \ldots, \varepsilon_n$ are generally taken to be independent observations from a $N(0, \sigma^2)$ distribution, for some error variance $\sigma^2$. This implies that the values $y_1, \ldots, y_n$ are observations from the independent random variables
$$Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2),$$
as illustrated in Figure 12.1.

Model Definition and Assumptions (2/5)
[Figure]

Model Definition and Assumptions (3/5)

The parameter $\beta_0$ is known as the intercept parameter, and the parameter $\beta_1$ is known as the slope parameter. A third unknown parameter, the error variance $\sigma^2$, can also be estimated from the data set. As illustrated in Figure 12.2, the data values $(x_i, y_i)$ lie closer to the line $y = \beta_0 + \beta_1 x$ as the error variance $\sigma^2$ decreases.

Model Definition and Assumptions (4/5)

The slope parameter $\beta_1$ is of particular interest since it indicates how the expected value of the dependent variable depends upon the explanatory variable $x$, as shown in Figure 12.3. The data set shown in Figure 12.4 exhibits a quadratic (or at least nonlinear) relationship between the two variables, and it would make no sense to fit a straight line to the data set.

Model Definition and Assumptions (5/5)

Simple Linear Regression Model: The simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ fits a straight line through a set of paired data observations $(x_1, y_1), \ldots, (x_n, y_n)$. The error terms $\varepsilon_1, \ldots, \varepsilon_n$ are taken to be independent observations from a $N(0, \sigma^2)$ distribution. The three unknown parameters, the intercept parameter $\beta_0$, the slope parameter $\beta_1$, and the error variance $\sigma^2$, are estimated from the data set.
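The model above can be made concrete with a short simulation. The following sketch (not part of the original notes; the parameter values, the range of $x$, and the sample size are illustrative assumptions) generates data from $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with independent $N(0, \sigma^2)$ errors.

```python
import numpy as np

# Illustrative (assumed) parameter values; they are not taken from the notes
beta0, beta1, sigma = 2.0, 0.5, 0.3
n = 12

rng = np.random.default_rng(0)
x = rng.uniform(3.0, 7.0, size=n)        # explanatory variable values
eps = rng.normal(0.0, sigma, size=n)     # independent N(0, sigma^2) error terms
y = beta0 + beta1 * x + eps              # observed responses y_i = beta0 + beta1*x_i + eps_i

print(np.column_stack([x, y]))           # the simulated paired observations (x_i, y_i)
```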

Examples (1/2)

Example 3: Car Plant Electricity Usage. The manager of a car plant wishes to investigate how the plant's electricity usage depends upon the plant's production. The linear model $y = \beta_0 + \beta_1 x$ will allow a month's electrical usage to be estimated as a function of the month's production.

Examples (2/2)
[Figure]

12.2 Fitting the Regression Line
Parameter Estimation (1/4)

The regression line $y = \beta_0 + \beta_1 x$ is fitted to the data points $(x_1, y_1), \ldots, (x_n, y_n)$ by finding the line that is "closest" to the data points in some sense. As the figure illustrates, the fitted line is chosen to be the line that minimizes the sum of the squares of the vertical deviations,
$$Q = \sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2,$$
and this is referred to as the least squares fit.

Parameter Estimation (2/4)

With normally distributed error terms, $\hat\beta_0$ and $\hat\beta_1$ are maximum likelihood estimates. The joint density of the error terms $\varepsilon_1, \ldots, \varepsilon_n$ is
$$\frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\varepsilon_i^2\right).$$
This likelihood is maximized by minimizing
$$\sum_{i=1}^{n}\varepsilon_i^2 = \sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2 = Q.$$
Setting the partial derivatives
$$\frac{\partial Q}{\partial \beta_0} = -2\sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right)
\quad\text{and}\quad
\frac{\partial Q}{\partial \beta_1} = -2\sum_{i=1}^{n}x_i\left(y_i - (\beta_0 + \beta_1 x_i)\right)$$
equal to zero gives the normal equations
$$\sum_{i=1}^{n} y_i = n\hat\beta_0 + \hat\beta_1\sum_{i=1}^{n} x_i
\quad\text{and}\quad
\sum_{i=1}^{n} x_i y_i = \hat\beta_0 \sum_{i=1}^{n} x_i + \hat\beta_1 \sum_{i=1}^{n} x_i^2.$$

Parameter Estimation (3/4)

Solving the normal equations gives
$$\hat\beta_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = \frac{S_{XY}}{S_{XX}}$$
and then
$$\hat\beta_0 = \frac{\sum_{i=1}^{n} y_i}{n} - \hat\beta_1\frac{\sum_{i=1}^{n} x_i}{n} = \bar{y} - \hat\beta_1\bar{x},$$
where
$$S_{XX} = \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$
and
$$S_{XY} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}.$$
For a specific value $x^*$ of the explanatory variable, this equation provides a fitted value $\hat{y}|_{x^*} = \hat\beta_0 + \hat\beta_1 x^*$ for the dependent variable $y$, as illustrated in the figure.

Parameter Estimation (4/4)

The error variance $\sigma^2$ can be estimated by considering the deviations between the observed data values $y_i$ and their fitted values $\hat{y}_i$. Specifically, the sum of squares for error SSE is defined to be the sum of the squares of these deviations,
$$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\left(y_i - (\hat\beta_0 + \hat\beta_1 x_i)\right)^2 = \sum_{i=1}^{n} y_i^2 - \hat\beta_0\sum_{i=1}^{n} y_i - \hat\beta_1\sum_{i=1}^{n} x_i y_i,$$
and the estimate of the error variance is
$$\hat\sigma^2 = \frac{SSE}{n-2}.$$
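The estimation formulas above translate directly into code. The sketch below (the function name and the data values are my own, chosen only for illustration) computes $\hat\beta_1 = S_{XY}/S_{XX}$, $\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x}$, and $\hat\sigma^2 = SSE/(n-2)$ from a set of paired observations.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Least squares estimates for the model y_i = beta0 + beta1*x_i + eps_i."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    Sxx = np.sum((x - xbar) ** 2)              # S_XX = sum (x_i - xbar)^2
    Sxy = np.sum((x - xbar) * (y - ybar))      # S_XY = sum (x_i - xbar)(y_i - ybar)
    b1 = Sxy / Sxx                             # slope estimate
    b0 = ybar - b1 * xbar                      # intercept estimate
    sse = np.sum((y - b0 - b1 * x) ** 2)       # sum of squares for error
    sigma2_hat = sse / (n - 2)                 # error variance estimate
    return b0, b1, sigma2_hat

# Usage with made-up data (12 paired observations, purely illustrative)
x = [4.5, 5.1, 3.8, 6.0, 4.9, 5.5, 4.2, 6.3, 3.9, 5.8, 4.7, 5.2]
y = [2.6, 2.9, 2.3, 3.4, 2.8, 3.1, 2.5, 3.5, 2.4, 3.2, 2.7, 3.0]
print(fit_simple_linear_regression(x, y))
```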

Examples (1/5)

Example 3: Car Plant Electricity Usage. For this example $n = 12$ and
$$\sum_{i=1}^{12} x_i = \cdots, \quad \sum_{i=1}^{12} y_i = \cdots, \quad \sum_{i=1}^{12} x_i^2 = \cdots, \quad \sum_{i=1}^{12} y_i^2 = \cdots, \quad \sum_{i=1}^{12} x_i y_i = (\cdots) + \cdots + (\cdots) = \cdots$$

Examples (2/5)
[Figure]

Examples (3/5)

The estimates of the slope parameter and the intercept parameter are
$$\hat\beta_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} = \cdots$$
and
$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x} = \cdots,$$
so that the fitted regression line is $\hat{y} = \hat\beta_0 + \hat\beta_1 x$.

Examples (4/5)

Using the model for production values $x$ outside this range is known as extrapolation and may give inaccurate results.

Examples (5/5)

$$\hat\sigma^2 = \frac{\sum_{i=1}^{n} y_i^2 - \hat\beta_0\sum_{i=1}^{n} y_i - \hat\beta_1\sum_{i=1}^{n} x_i y_i}{n-2} = \frac{\cdots}{10} = \cdots,
\qquad \hat\sigma = \cdots$$

12.3 Inferences on the Slope Parameter β1
Inference Procedures (1/4)

Inferences on the Slope Parameter $\beta_1$:
$$\hat\beta_1 \sim N\!\left(\beta_1, \frac{\sigma^2}{S_{XX}}\right)$$
A two-sided confidence interval with a confidence level $1-\alpha$ for the slope parameter in a simple linear regression model is
$$\beta_1 \in \left(\hat\beta_1 - t_{\alpha/2,\,n-2}\,\mathrm{s.e.}(\hat\beta_1),\ \hat\beta_1 + t_{\alpha/2,\,n-2}\,\mathrm{s.e.}(\hat\beta_1)\right),$$
which is
$$\beta_1 \in \left(\hat\beta_1 - \frac{\hat\sigma\,t_{\alpha/2,\,n-2}}{\sqrt{S_{XX}}},\ \hat\beta_1 + \frac{\hat\sigma\,t_{\alpha/2,\,n-2}}{\sqrt{S_{XX}}}\right).$$
One-sided $1-\alpha$ confidence level confidence intervals are
$$\beta_1 \in \left(-\infty,\ \hat\beta_1 + \frac{\hat\sigma\,t_{\alpha,\,n-2}}{\sqrt{S_{XX}}}\right)
\quad\text{and}\quad
\beta_1 \in \left(\hat\beta_1 - \frac{\hat\sigma\,t_{\alpha,\,n-2}}{\sqrt{S_{XX}}},\ \infty\right).$$

Inference Procedures (2/4)

The two-sided hypotheses
$$H_0: \beta_1 = b_1 \quad\text{versus}\quad H_A: \beta_1 \neq b_1$$
for a fixed value $b_1$ of interest are tested with the t-statistic
$$t = \frac{\hat\beta_1 - b_1}{\hat\sigma/\sqrt{S_{XX}}} = \frac{\hat\beta_1 - b_1}{\mathrm{s.e.}(\hat\beta_1)}.$$
The p-value is
$$p\text{-value} = 2\,P(X > |t|),$$
where the random variable $X$ has a t-distribution with $n-2$ degrees of freedom. A size $\alpha$ test rejects the null hypothesis if $|t| > t_{\alpha/2,\,n-2}$.

Inference Procedures (3/4)

The one-sided hypotheses
$$H_0: \beta_1 \geq b_1 \quad\text{versus}\quad H_A: \beta_1 < b_1$$
have a p-value
$$p\text{-value} = P(X < t),$$
and a size $\alpha$ test rejects the null hypothesis if $t < -t_{\alpha,\,n-2}$.
The one-sided hypotheses
$$H_0: \beta_1 \leq b_1 \quad\text{versus}\quad H_A: \beta_1 > b_1$$
have a p-value
$$p\text{-value} = P(X > t),$$
and a size $\alpha$ test rejects the null hypothesis if $t > t_{\alpha,\,n-2}$.
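The confidence interval and hypothesis tests for the slope can be computed as in the following sketch. It assumes NumPy and SciPy are available; the function name and its defaults are illustrative, not part of the notes.

```python
import numpy as np
from scipy import stats

def slope_inference(x, y, b1_null=0.0, alpha=0.05):
    """Two-sided CI and test for the slope parameter, following Section 12.3."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    Sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - ybar)) / Sxx
    b0 = ybar - b1 * xbar
    sigma_hat = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
    se_b1 = sigma_hat / np.sqrt(Sxx)                 # s.e.(beta1_hat) = sigma_hat / sqrt(S_XX)
    tcrit = stats.t.ppf(1 - alpha / 2, n - 2)        # critical point t_{alpha/2, n-2}
    ci = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)    # two-sided 1-alpha confidence interval
    tstat = (b1 - b1_null) / se_b1                   # t-statistic for H0: beta1 = b1_null
    pvalue = 2 * stats.t.sf(abs(tstat), n - 2)       # two-sided p-value 2*P(X > |t|)
    return ci, tstat, pvalue
```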

Inference Procedures (4/4)

An interesting point to notice is that for a fixed value of the error variance $\sigma^2$, the variance of the slope parameter estimate decreases as the value of $S_{XX}$ increases. This happens as the values of the explanatory variable $x_i$ become more spread out, as illustrated in the figure. This result is intuitively reasonable since a greater spread in the values $x_i$ provides a greater leverage for fitting the regression line, and therefore the slope parameter estimate $\hat\beta_1$ should be more accurate.

Examples (1/2)

Example 3: Car Plant Electricity Usage.
$$S_{XX} = \sum_{i=1}^{12} x_i^2 - \frac{\left(\sum_{i=1}^{12} x_i\right)^2}{12} = \cdots,
\qquad \mathrm{s.e.}(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{S_{XX}}} = \cdots$$
The t-statistic for testing $H_0: \beta_1 = 0$ is
$$t = \frac{\hat\beta_1}{\mathrm{s.e.}(\hat\beta_1)} = 6.37,$$
and the two-sided p-value is
$$p\text{-value} = 2\,P(X > 6.37) \approx 0.$$

Examples (2/2)

With $t_{0.005,10} = 3.169$, a 99% two-sided confidence interval for the slope parameter is
$$\beta_1 \in \left(\hat\beta_1 - t_{0.005,10}\,\mathrm{s.e.}(\hat\beta_1),\ \hat\beta_1 + t_{0.005,10}\,\mathrm{s.e.}(\hat\beta_1)\right) = (0.251,\ 0.747).$$

12.4 Inferences on the Regression Line
Inference Procedures (1/2)

Inferences on the Expected Value of the Dependent Variable: A $1-\alpha$ confidence level two-sided confidence interval for $\beta_0 + \beta_1 x^*$, the expected value of the dependent variable for a particular value $x^*$ of the explanatory variable, is
$$\beta_0 + \beta_1 x^* \in \left(\hat\beta_0 + \hat\beta_1 x^* - t_{\alpha/2,\,n-2}\,\mathrm{s.e.}(\hat\beta_0 + \hat\beta_1 x^*),\ \hat\beta_0 + \hat\beta_1 x^* + t_{\alpha/2,\,n-2}\,\mathrm{s.e.}(\hat\beta_0 + \hat\beta_1 x^*)\right),$$
where
$$\mathrm{s.e.}(\hat\beta_0 + \hat\beta_1 x^*) = \hat\sigma\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}}}.$$

Inference Procedures (2/2)

One-sided confidence intervals are
$$\beta_0 + \beta_1 x^* \in \left(-\infty,\ \hat\beta_0 + \hat\beta_1 x^* + t_{\alpha,\,n-2}\,\mathrm{s.e.}(\hat\beta_0 + \hat\beta_1 x^*)\right)$$
and
$$\beta_0 + \beta_1 x^* \in \left(\hat\beta_0 + \hat\beta_1 x^* - t_{\alpha,\,n-2}\,\mathrm{s.e.}(\hat\beta_0 + \hat\beta_1 x^*),\ \infty\right).$$
Hypothesis tests on $\beta_0 + \beta_1 x^*$ can be performed by comparing the t-statistic
$$t = \frac{(\hat\beta_0 + \hat\beta_1 x^*) - (\beta_0 + \beta_1 x^*)}{\mathrm{s.e.}(\hat\beta_0 + \hat\beta_1 x^*)}$$
with a t-distribution with $n-2$ degrees of freedom.
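A minimal sketch of the confidence interval for the regression line at a point $x^*$, using the standard error formula above; the function and variable names are hypothetical.

```python
import numpy as np
from scipy import stats

def mean_response_ci(x, y, xstar, alpha=0.05):
    """Two-sided 1-alpha confidence interval for beta0 + beta1*x*."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    Sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - ybar)) / Sxx
    b0 = ybar - b1 * xbar
    sigma_hat = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
    fit = b0 + b1 * xstar                                            # fitted value at x*
    se_fit = sigma_hat * np.sqrt(1 / n + (xstar - xbar) ** 2 / Sxx)  # s.e. of the fitted value
    tcrit = stats.t.ppf(1 - alpha / 2, n - 2)
    return fit - tcrit * se_fit, fit + tcrit * se_fit
```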

Examples (1/2)

Example 3: Car Plant Electricity Usage.
$$\mathrm{s.e.}(\hat\beta_0 + \hat\beta_1 x^*) = \hat\sigma\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}}} = \cdots\sqrt{\frac{1}{12} + \frac{(x^* - 4.885)^2}{\cdots}}$$
With $t_{0.025,10} = 2.228$, a 95% confidence interval for $\beta_0 + \beta_1 x^*$ is
$$\left(\hat\beta_0 + \hat\beta_1 x^* - 2.228\,\mathrm{s.e.}(\hat\beta_0 + \hat\beta_1 x^*),\ \hat\beta_0 + \hat\beta_1 x^* + 2.228\,\mathrm{s.e.}(\hat\beta_0 + \hat\beta_1 x^*)\right).$$
At $x^* = 5$,
$$\beta_0 + 5\beta_1 \in \left((\cdots) - 0.113,\ (\cdots) + 0.113\right) = (2.79,\ 3.02).$$

Examples (2/2)
[Figure]

12.5 Prediction Intervals for Future Response Values
Inference Procedures (1/2)

Prediction Intervals for Future Response Values: A $1-\alpha$ confidence level two-sided prediction interval for $y|_{x^*}$, a future value of the dependent variable for a particular value $x^*$ of the explanatory variable, is
$$y|_{x^*} \in \left(\hat\beta_0 + \hat\beta_1 x^* - t_{\alpha/2,\,n-2}\,\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}}},\ \hat\beta_0 + \hat\beta_1 x^* + t_{\alpha/2,\,n-2}\,\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}}}\right).$$

Inference Procedures (2/2)

One-sided confidence intervals are
$$y|_{x^*} \in \left(-\infty,\ \hat\beta_0 + \hat\beta_1 x^* + t_{\alpha,\,n-2}\,\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}}}\right)$$
and
$$y|_{x^*} \in \left(\hat\beta_0 + \hat\beta_1 x^* - t_{\alpha,\,n-2}\,\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}}},\ \infty\right).$$
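A prediction interval differs from the confidence interval for the regression line only by the extra 1 under the square root, which accounts for the variability of the new observation itself. A minimal sketch, with hypothetical names and assuming SciPy:

```python
import numpy as np
from scipy import stats

def prediction_interval(x, y, xstar, alpha=0.05):
    """Two-sided 1-alpha prediction interval for a future response at x*."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    Sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - ybar)) / Sxx
    b0 = ybar - b1 * xbar
    sigma_hat = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
    fit = b0 + b1 * xstar
    # The "1 +" inside the square root distinguishes this from the mean-response interval
    se_pred = sigma_hat * np.sqrt(1 + 1 / n + (xstar - xbar) ** 2 / Sxx)
    tcrit = stats.t.ppf(1 - alpha / 2, n - 2)
    return fit - tcrit * se_pred, fit + tcrit * se_pred
```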

Examples (1/2)

Example 3: Car Plant Electricity Usage. With $t_{0.025,10} = 2.228$, a 95% prediction interval for $y|_{x^*}$ is
$$y|_{x^*} \in \left(\hat\beta_0 + \hat\beta_1 x^* - 2.228\,\hat\sigma\sqrt{1 + \frac{1}{12} + \frac{(x^* - 4.885)^2}{\cdots}},\ \hat\beta_0 + \hat\beta_1 x^* + 2.228\,\hat\sigma\sqrt{1 + \frac{1}{12} + \frac{(x^* - 4.885)^2}{\cdots}}\right).$$
At $x^* = 5$,
$$y|_{x^*} \in \left((\cdots) - 0.401,\ (\cdots) + 0.401\right) = (2.50,\ 3.30).$$

Examples (2/2)
[Figure]

12.6 The Analysis of Variance Table
Sum of Squares Decomposition (1/5)
[Figure]

Sum of Squares Decomposition (2/5)
[Figure]

Sum of Squares Decomposition (3/5)

Source       Degrees of freedom   Sum of squares   Mean squares                     F-statistic   p-value
Regression   1                    SSR              MSR = SSR                        F = MSR/MSE   P(F_{1,n-2} > F)
Error        n-2                  SSE              MSE = SSE/(n-2) = $\hat\sigma^2$
Total        n-1                  SST

Figure: Analysis of variance table for simple linear regression analysis.

Sum of Squares Decomposition (4/5)
[Figure]

Sum of Squares Decomposition (5/5)

Coefficient of Determination $R^2$: The total variability in the dependent variable, the total sum of squares
$$SST = \sum_{i=1}^{n}(y_i - \bar{y})^2,$$
can be partitioned into the variability explained by the regression line, the regression sum of squares
$$SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2,$$
and the variability about the regression line, the error sum of squares
$$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2.$$
The proportion of the total variability accounted for by the regression line is the coefficient of determination
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \frac{1}{1 + SSE/SSR},$$
which takes a value between zero and one.
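The sum of squares decomposition and the entries of the analysis of variance table can be computed as in the sketch below (an illustrative helper, assuming SciPy for the F tail probability).

```python
import numpy as np
from scipy import stats

def anova_simple_linear_regression(x, y):
    """SST = SSR + SSE decomposition, F-statistic, and R^2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b0 = ybar - b1 * xbar
    fitted = b0 + b1 * x
    sst = np.sum((y - ybar) ** 2)          # total sum of squares
    ssr = np.sum((fitted - ybar) ** 2)     # regression sum of squares
    sse = np.sum((y - fitted) ** 2)        # error sum of squares
    msr = ssr / 1                          # mean square for regression
    mse = sse / (n - 2)                    # mean square error (estimates sigma^2)
    F = msr / mse
    pvalue = stats.f.sf(F, 1, n - 2)       # P(F_{1,n-2} > F)
    r_squared = ssr / sst                  # coefficient of determination
    return F, pvalue, r_squared
```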

Examples (1/1)

Example 3: Car Plant Electricity Usage.
$$F = \frac{MSR}{MSE} = \cdots, \qquad R^2 = \frac{SSR}{SST} = \cdots$$

12.7 Residual Analysis
Residual Analysis Methods (1/7)

The residuals are defined to be
$$e_i = y_i - \hat{y}_i, \qquad 1 \leq i \leq n,$$
so that they are the differences between the observed values of the dependent variable and the corresponding fitted values. A property of the residuals is
$$\sum_{i=1}^{n} e_i = 0.$$
Residual analysis can be used to
- identify data points that are outliers,
- check whether the fitted model is appropriate,
- check whether the error variance is constant, and
- check whether the error terms are normally distributed.

Residual Analysis Methods (2/7)

A nice random scatter plot, such as the one in the figure, indicates that there are no problems with the regression analysis. Any patterns in the residual plot or any residuals with a large absolute value alert the experimenter to possible problems with the fitted regression model.

Residual Analysis Methods (3/7)

A data point $(x_i, y_i)$ can be considered to be an outlier if it is not predicted well by the fitted model. Residuals of outliers have a large absolute value, as indicated in the figure. Note in the figure that $e_i/s$ is used instead of $e_i$.

[For your interest only]
$$\mathrm{Var}(e_i) = \left(1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{XX}}\right)\sigma^2.$$

Residual Analysis Methods (4/7)

If the residual plot shows positive and negative residuals grouped together as in Figure 12.47, then a linear model is not appropriate. As the figure indicates, a nonlinear model is needed for such a data set.

Residual Analysis Methods (5/7)

If the residual plot shows a funnel shape as in Figure 12.48, so that the size of the residuals depends upon the value of the explanatory variable $x$, then the assumption of a constant error variance $\sigma^2$ is not valid.

Residual Analysis Methods (6/7)

A normal probability plot (a normal score plot) of the residuals checks whether the error terms $\varepsilon_i$ appear to be normally distributed. The normal score of the $i$th smallest residual is
$$\Phi^{-1}\!\left(\frac{i - \tfrac{3}{8}}{n + \tfrac{1}{4}}\right).$$
If the main body of the points in the normal probability plot lies approximately on a straight line, the normality assumption is reasonable; a plot of a clearly different form indicates that the distribution is not normal.
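A sketch of how the residuals and their normal scores could be computed for a normal probability plot, using the $(i - 3/8)/(n + 1/4)$ plotting positions given above; the function name is hypothetical.

```python
import numpy as np
from scipy import stats

def residual_normal_scores(x, y):
    """Residuals e_i = y_i - yhat_i and normal scores for a probability plot."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b0 = ybar - b1 * xbar
    resid = y - (b0 + b1 * x)                                   # residuals; they sum to zero
    i = np.arange(1, n + 1)
    scores = stats.norm.ppf((i - 3.0 / 8.0) / (n + 1.0 / 4.0))  # Phi^{-1}((i - 3/8)/(n + 1/4))
    # Plotting (scores, sorted residuals): a roughly straight line supports normality
    return scores, np.sort(resid)
```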

Residual Analysis Methods (7/7)
[Figure]

Example: Nile River Flowrate
Examples (1/2)
[Figure]

Examples (2/2)

At $x = 3.88$: $\hat{y} = \cdots = 2.77$, $e_i = y_i - \hat{y}_i = \cdots = 1.24$, and $e_i/\hat\sigma = 1.24/\cdots = 3.75$.
At $x = 6.13$: $e_i = y_i - \hat{y}_i = 5.67 - (\cdots) = 1.02$, and $e_i/\hat\sigma = 1.02/\cdots = 3.07$.

12.8 Variable Transformations
Intrinsically Linear Models (1/4) to (4/4)
[Figures]

Example: Roadway Base Aggregates
Examples (1/5) to (5/5)
[Figures]

12.9 Correlation Analysis
The Sample Correlation Coefficient

Sample Correlation Coefficient: The sample correlation coefficient $r$ for a set of paired data observations $(x_i, y_i)$ is
$$r = \frac{S_{XY}}{\sqrt{S_{XX}S_{YY}}}
= \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
= \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}\,\sqrt{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}}.$$
It measures the strength of the linear association between two variables and can be thought of as an estimate of the correlation $\rho$ between the two associated random variables $X$ and $Y$.

Under the assumption that the $X$ and $Y$ random variables have a bivariate normal distribution, a test of the null hypothesis $H_0: \rho = 0$ can be performed by comparing the t-statistic
$$t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}}$$
with a t-distribution with $n-2$ degrees of freedom. In a regression framework, this test is equivalent to testing $H_0: \beta_1 = 0$.
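A minimal sketch of the sample correlation coefficient and the t-test of $H_0: \rho = 0$ described above, assuming SciPy for the t-distribution tail probability; the function name is illustrative.

```python
import numpy as np
from scipy import stats

def correlation_test(x, y):
    """Sample correlation r and the test of H0: rho = 0 (equivalent to H0: beta1 = 0)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    Sxx = np.sum((x - xbar) ** 2)
    Syy = np.sum((y - ybar) ** 2)
    Sxy = np.sum((x - xbar) * (y - ybar))
    r = Sxy / np.sqrt(Sxx * Syy)                     # sample correlation coefficient
    tstat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)   # t = r*sqrt(n-2)/sqrt(1-r^2)
    pvalue = 2 * stats.t.sf(abs(tstat), n - 2)       # two-sided p-value
    return r, tstat, pvalue
```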


Example: Nile River Flowrate
Examples (1/1)

$$r = \cdots, \qquad R^2 = r^2 = \cdots$$
