Chapter 14 Inference for Regression
Lesson 14-1, Part 1 Inference for Regression
Review Least-Squares Regression
A family doctor is interested in examining the relationship between a patient's age and total cholesterol. He randomly selects 14 of his female patients and obtains the data presented in Table 1. The data are based upon results obtained from the National Center for Health Statistics.

Table 1
Age   Total Cholesterol     Age   Total Cholesterol
25    180                   42    183
25    195                   48    204
28    186                   51    221
32    180                   51    243
32    210                   58    208
32    197                   62    228
38    239                   65    269
Review Least-Squares Regression
1. What is the least-squares regression line for predicting total cholesterol from age for women?
The least-squares regression equation is ŷ = 151.3537 + 1.3991x, where ŷ represents the predicted total cholesterol for a female whose age is x.
Review Least-Squares Regression
2. What is the correlation coefficient between age and cholesterol? Interpret the correlation coefficient in the context of the problem.
The linear correlation coefficient is r = 0.718. There is a moderate, positive linear relationship between female age and total cholesterol.
Review Least-Squares Regression
3. What is the predicted cholesterol level of a 67-year-old female?
ŷ = 151.3537 + 1.3991(age) = 151.3537 + 1.3991(67) ≈ 245
Review Least-Squares Regression
4. Interpret the slope of the regression line in the context of the problem.
For each increase in age of one year, the total cholesterol is predicted to increase by 1.3991.
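The four review answers can be checked numerically. The following is a minimal sketch in plain Python, not the calculator procedure the slides assume; the helper name `least_squares` is hypothetical, and the data are the 14 pairs from Table 1.

```python
from math import sqrt

# Table 1: (age, total cholesterol) for the 14 patients
data = [(25, 180), (25, 195), (28, 186), (32, 180), (32, 210), (32, 197), (38, 239),
        (42, 183), (48, 204), (51, 221), (51, 243), (58, 208), (62, 228), (65, 269)]

def least_squares(points):
    """Return intercept a, slope b, and correlation r for the regression of y on x."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    syy = sum((y - y_bar) ** 2 for _, y in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    b = sxy / sxx                 # slope
    a = y_bar - b * x_bar         # intercept
    r = sxy / sqrt(sxx * syy)     # correlation coefficient
    return a, b, r

a, b, r = least_squares(data)
predicted_67 = a + b * 67         # predicted cholesterol at age 67
print(f"y-hat = {a:.4f} + {b:.4f}x, r = {r:.3f}, prediction at 67: {predicted_67:.0f}")
```

The printed values should agree with the slide answers (151.3537, 1.3991, 0.718, and 245) up to rounding.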
Statistics and Parameters
When doing inference for regression, we use ŷ = a + bx to estimate the population regression line. The statistics a and b are estimators of the population parameters α and β, the intercept and slope of the population regression line.
Conditions
The conditions necessary for doing inference for regression are:
- For each given value of x, the values of the response variable y are independent and normally distributed.
- For each value of x, the standard deviation, σ, of the y-values is the same.
- The mean responses of the y-values for the fixed values of x are linearly related by the equation μ_y = α + βx.
Standard Error of the Regression Line
- Gives the variability of the vertical distances of the y-values from the regression line.
- Remember that a residual is the error involved when making a prediction from the regression equation.
- The spread around the line is measured with the standard deviation of the residuals, s:
$$s = \sqrt{\frac{\sum \text{residuals}^2}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}$$
Standard Error of the Slope of the Regression Line
Gives the variability of the estimates of the slope of the regression line:
$$SE_b = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}} = \frac{\sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n-2}}}{\sqrt{\sum (x_i - \bar{x})^2}}$$
Summary
- Inference for regression depends upon estimating μ_y = α + βx with ŷ = a + bx.
- For each x, the response values of y are independent and follow a normal distribution, each distribution having the same standard deviation.
- Inference for regression depends on the following statistics:
  a, the estimate of the y-intercept, α, of μ_y
  b, the estimate of the slope, β, of μ_y
  s, the standard error of the residuals
  SE_b, the standard error of the slope of the regression line
Computing Standard Error of the Residual
ŷ = 151.3537 + 1.3991x

Age, x   Total Cholesterol, y   ŷ        Residual (y - ŷ)   Residual² (y - ŷ)²
25       180                    186.33   -6.33              40.0689
25       195                    186.33   8.67               75.1689
28       186                    190.53   -4.53              20.5209
32       180                    196.12   -16.12             259.8544
32       210                    196.12   13.88              192.6544
32       197                    196.12   0.88               0.7744
38       239                    204.52   34.48              1188.8704
62       228                    238.10   -10.10             102.01
65       269                    242.30   26.70              712.89
(Rows for ages 42 through 58 are not shown; the sum covers all 14 patients.)
Σ residuals² = 4553.705
Computing Standard Error
$$s = \sqrt{\frac{\sum \text{residuals}^2}{n-2}} = \sqrt{\frac{4553.705}{14-2}} \approx 19.48$$
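The residual table and the value of s can be rebuilt in a few lines. This sketch assumes the fitted values are rounded to two decimals before the residuals are taken, as the table appears to do; the variable names are illustrative only.

```python
from math import sqrt

ages = [25, 25, 28, 32, 32, 32, 38, 42, 48, 51, 51, 58, 62, 65]
chol = [180, 195, 186, 180, 210, 197, 239, 183, 204, 221, 243, 208, 228, 269]

# Fitted values from the slide's regression line, rounded as in the table
fitted = [round(151.3537 + 1.3991 * x, 2) for x in ages]
# Residual = observed y minus fitted y
residuals = [round(y - y_hat, 2) for y, y_hat in zip(chol, fitted)]

sse = sum(e ** 2 for e in residuals)      # sum of squared residuals
s = sqrt(sse / (len(ages) - 2))           # divide by n - 2, then take the square root
print(f"sum of squared residuals = {sse:.3f}, s = {s:.2f}")
```

Working from unrounded fitted values changes the sum slightly in the first decimal place, but s still rounds to 19.48.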
Example Page 787, #14.2
Body weights and backpack weights were collected for eight students.

Weight (lbs)            120   187   109   103   131   165   158   116
Backpack weight (lbs)   26    30    26    24    29    35    31    28

These data were entered into a statistical package and least-squares regression of backpack weight on body weight was requested. Here are the results.
Example Page 787, #14.2

Predictor   Coef      Stdev     t-ratio   p
Constant    16.265    3.937     4.13      0.006
BodyWT      0.09080   0.02831   3.21      0.018
S = 2.270   R-sq = 63.2%   R-sq(adj) = 57.0%

A) What is the equation of the least-squares line?
Backpack weight = 16.265 + 0.09080(body weight)
Example Page 787, #14.2
B) The model for regression inference has three parameters, which we call α, β, and σ. Can you determine the estimates for α and β from the computer printout?
a = 16.265 estimates the true intercept α, and b = 0.09080 estimates the true slope β.
Example Page 787, #14.2
C) The computer output reports that s = 2.270. This is an estimate of the parameter σ. Use the formula for s to verify the computer's value of s. Use your TI to verify this.
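Part C can also be verified away from the calculator. The sketch below is one possible check in plain Python (the slides themselves use a TI); it recomputes s from the eight data pairs and, using the formula for SE_b given earlier, the "Stdev" entry for BodyWT.

```python
from math import sqrt

body = [120, 187, 109, 103, 131, 165, 158, 116]   # body weight (lbs)
pack = [26, 30, 26, 24, 29, 35, 31, 28]           # backpack weight (lbs)

n = len(body)
x_bar, y_bar = sum(body) / n, sum(pack) / n
sxx = sum((x - x_bar) ** 2 for x in body)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(body, pack)) / sxx
a = y_bar - b * x_bar

# s = sqrt( sum of squared residuals / (n - 2) )
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(body, pack))
s = sqrt(sse / (n - 2))
# SE_b = s / sqrt( sum of (x - x_bar)^2 )
se_b = s / sqrt(sxx)
print(f"a = {a:.3f}, b = {b:.5f}, s = {s:.3f}, SE_b = {se_b:.5f}")
```

The results should match the printout (16.265, 0.09080, 2.270, 0.02831) up to rounding.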
Example Page 788, #14.4
Exercise 3.71 on page 187 provided data on the speed of competitive runners and the number of steps they took per second. Good runners take more steps per second as they speed up. Here are the data again.

Speed (feet per second)   15.86   16.88   17.50   18.62   19.97   21.06   22.11
Steps per second          3.05    3.12    3.17    3.25    3.36    3.46    3.55

A) Enter the data into your calculator, perform least-squares regression, and plot the scatterplot with the least-squares line. What is the strength of the association between speed and steps per second?
Example Page 788, #14.4
Steps = 1.77 + 0.0803(speed). There is a very strong positive linear relationship between speed and steps per second; r = 0.999. Nearly all of the variation in steps per second (r² = 0.998, or 99.8%) is explained by the linear relationship.
Example Page 788, #14.4
[Scatterplot: steps per second versus speed (feet per second)]
Example Page 788, #14.4
C) The model for regression inference has three parameters, α, β, and σ. Estimate these parameters from the data.
a = 1.766 is the estimate of α
b = 0.0803 is the estimate of β
s = 0.0091 is the estimate of σ
Lesson 14-1, Part 2 Inference for Regression
Significance Test for the Slope of a Regression Line
We want to test whether the slope of the population regression line is zero. If the slope of the line is zero, then there is no linear relationship between the x and y variables. Remember that b = r(s_y / s_x), so if r = 0, then b = 0.
Hypotheses
Two-tailed: H0: β = 0 and Ha: β ≠ 0
Left-tailed: H0: β = 0 and Ha: β < 0
Right-tailed: H0: β = 0 and Ha: β > 0
Test Statistic and Confidence Interval
Test statistic: $t = \dfrac{b - \beta}{SE_b} = \dfrac{b}{SE_b}$ when H0: β = 0 is true
Confidence interval: $b \pm t^* SE_b$
Both use the t distribution with n - 2 degrees of freedom.
SE_b = standard error of the slope: $SE_b = \dfrac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$
Reading Computer Printouts
Example Page 794, #14.6
Exercise 14.1 (page 786) presents data on the lengths of two bones in five fossil specimens of the extinct beast Archaeopteryx. Here is part of the output from the S-PLUS statistical software when we regress the length y of the humerus on the length x of the femur.

Coefficients   Value     Std. Error   t value   Pr(>|t|)
(Intercept)    -3.6596   4.4590       -0.8207   0.4719
Femur          1.1969    0.0751
Example Page 794, #14.6
A) What is the equation of the least-squares regression line?
humerus = -3.6596 + 1.1969(femur)
Example Page 794, #14.6
B) We left out the t statistic for testing H0: β = 0 and its P-value. Use the output to find t.
$t = \dfrac{b}{SE_b} = \dfrac{1.1969}{0.0751} = 15.94$
Example Page 794, #14.6
C) How many degrees of freedom does t have? Use Table C to approximate the P-value of t against the one-sided alternative Ha: β > 0.
df = n - 2 = 3; since t > 12.92, we know P-value < 0.0005. On the TI, tcdf(15.9374, 1E99, 3) ≈ 2.685 × 10^-4.
Example Page 794, #14.6
D) Write a sentence to describe your conclusion about the slope of the true regression line.
There is very strong evidence that β > 0, that is, that the line is useful for predicting the length of the humerus given the length of the femur.
Example Page 794, #14.6
E) Determine a 99% confidence interval for the true slope of the regression line.
$b \pm t^* SE_b = 1.1969 \pm 5.841(0.0751) = (0.758, 1.636)$
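Parts B and E use only three numbers: the slope, its standard error, and a critical value. A minimal sketch of that arithmetic follows; the function names `slope_t` and `slope_ci` are hypothetical, and the critical value 5.841 is the Table C entry for 3 degrees of freedom at 99% confidence.

```python
def slope_t(b, se_b, beta0=0.0):
    """t statistic for testing H0: beta = beta0."""
    return (b - beta0) / se_b

def slope_ci(b, se_b, t_star):
    """Confidence interval b +/- t* SE_b for the true slope."""
    margin = t_star * se_b
    return b - margin, b + margin

# Values read from the S-PLUS output for the Femur row
t = slope_t(1.1969, 0.0751)
low, high = slope_ci(1.1969, 0.0751, t_star=5.841)   # Table C, df = 3, 99%
print(f"t = {t:.2f}, 99% CI = ({low:.3f}, {high:.3f})")
```

The critical value is passed in rather than computed so that the sketch stays within the standard library and mirrors the table lookup used in the lesson.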
Example Page 794, #14.8
There is some evidence that drinking moderate amounts of wine helps prevent heart attacks. Exercise 3.63 (page 183) gives data on yearly wine consumption (liters of alcohol from drinking wine, per person) and yearly deaths from heart disease (deaths per 100,000 people) in 19 developed nations.
A) Is there statistically significant evidence of a negative association between wine consumption and heart disease deaths? Carry out the appropriate test of significance and write a summary statement about your conclusions.
Example Page 794, #14.8
β = slope of the true regression line of heart disease deaths on wine consumption; a negative slope means a negative association.
H0: β = 0
Ha: β < 0
Example Page 794, #14.8
Linear Regression t Test: Conditions
1. The observations are independent.
2. The true relationship is linear (check the scatterplot to see that the overall pattern is linear, or plot the residuals against the predicted values).
3. The standard deviation of the response about the true line is the same everywhere (make sure the spread around the line is nearly constant).
4. The response varies normally about the true regression line (the normal probability plot of the residuals is quite straight).
Example Page 794, #14.8
$t = \dfrac{b}{SE_b} = \dfrac{-22.97}{3.557} \approx -6.46$, where $SE_b = \dfrac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$
P-value = 2.96 × 10^-6, so P-value < 0.0005
Reject H0, since the P-value is less than α = 0.05, and conclude that there is a negative linear relationship between wine consumption and heart disease deaths.
Example Page 795, #14.10
Exercise 14.4 (page 788) presents data on the relationship between the speed of runners (x, in feet per second) and the number of steps y that they take in a second. Here is part of the Data Desk regression output for these data:

R squared = 99.8%
s = 0.0091 with 7 - 2 = 5 degrees of freedom

Variable   Coefficient   s.e. of Coeff   t-ratio   prob
Constant   1.76608       0.0307          57.6      <0.0001
Speed      0.080284      0.0016          49.7      <0.0001
Example Page 795, #14.10
A) How can you tell from this output, even without the scatterplot, that there is a very strong straight-line relationship between running speed and steps per second?
Example Page 795, #14.10
r² is very close to 1, which means that nearly all the variation in steps per second is accounted for by foot speed. Also, the P-value for the test of β = 0 is very small.
Example Page 795, #14.10
B) What parameter in the regression model gives the rate at which steps per second increase as running speed increases? Give a 99% confidence interval for this rate.
Example Page 795, #14.10
β (the slope) is this rate; the estimate is listed as the coefficient of Speed, 0.080284.
$b \pm t^* SE_b = 0.080284 \pm 4.032(0.0016) = (0.074, 0.087)$
Lesson 14-2, Part 1 Predictions and Conditions
Confidence Intervals
Write the given value of the explanatory variable x as x*. The distinction between predicting a single outcome and predicting the mean of all outcomes when x = x* determines which margin of error is correct.
- To estimate the mean response, μ_y = α + βx*, we use a confidence interval.
- To estimate an individual response y, we use a prediction interval.
Confidence Intervals for Regression Response
A level C confidence interval for the mean response μ_y when x takes the value x* is
$$\hat{y} \pm t^* SE_{\hat{\mu}}$$
The standard error is
$$SE_{\hat{\mu}} = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
Prediction Intervals for Regression Response
A level C prediction interval for a single observation on y when x takes the value x* is
$$\hat{y} \pm t^* SE_{\hat{y}}$$
The standard error is
$$SE_{\hat{y}} = s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
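The two standard errors differ only by the extra 1 under the square root, which is why a prediction interval is always wider than the confidence interval at the same x*. The sketch below illustrates this on the cholesterol data from Table 1. The function name `response_intervals` and the choice x* = 40 are hypothetical and not part of the lesson; t* = 2.179 is the Table C value for 12 degrees of freedom at 95% confidence.

```python
from math import sqrt

ages = [25, 25, 28, 32, 32, 32, 38, 42, 48, 51, 51, 58, 62, 65]
chol = [180, 195, 186, 180, 210, 197, 239, 183, 204, 221, 243, 208, 228, 269]

def response_intervals(xs, ys, x_star, t_star):
    """Return y-hat, the CI for the mean response, and the PI for one response at x_star."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    s = sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
    y_hat = a + b * x_star
    # Term shared by both standard errors: 1/n + (x* - x_bar)^2 / sum (x - x_bar)^2
    shared = 1 / n + (x_star - x_bar) ** 2 / sxx
    se_mean = s * sqrt(shared)        # standard error for the mean response
    se_pred = s * sqrt(1 + shared)    # standard error for a single response
    ci = (y_hat - t_star * se_mean, y_hat + t_star * se_mean)
    pi = (y_hat - t_star * se_pred, y_hat + t_star * se_pred)
    return y_hat, ci, pi

y_hat, ci, pi = response_intervals(ages, chol, x_star=40, t_star=2.179)
print(f"y-hat = {y_hat:.2f}")
print(f"95% CI for mean response: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI for one response:  ({pi[0]:.2f}, {pi[1]:.2f})")
```

Both intervals are centered at ŷ, and both widen as x* moves away from the mean of the x-values.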
Conditions for Regression Inference
- The observations are independent.
- The true relationship is linear.
- The standard deviation of the response about the true line is the same everywhere.
- The response varies normally about the true regression line.
Check conditions using the residuals.
Examine the residual plot to check that the relationship is roughly linear and that the scatter about the line is the same from end to end.
Violations of the regression conditions: The variation of the residuals is not constant.
Violations of the regression conditions: There is a curved relationship between the response variable and the explanatory variable.
Example Page 802, #14.12
A) The residuals for the crying and IQ data appear in Example 14.3 (page 785). Make a stemplot to display the distribution of the residuals. Are there outliers or signs of strong departures from normality?

-19.20  -31.13  -22.65  -15.18  -12.18  -15.15  -16.63   -6.18   -1.70  -22.60
 -6.68   -6.17   -9.15  -23.58   -9.14    2.80   -9.14   -1.66   -6.14  -12.60
  0.34   -8.62    2.85   14.30    9.82   10.82    0.37    8.85   10.87   19.34
 10.89   -2.55   20.85   24.35   18.94   32.89   18.47   51.32
Example Page 802, #14.12
Stemplot of the residuals (rounded to the nearest whole number):

-3 | 1
-2 | 4 3 3
-1 | 9 7 5 5 3 2
-0 | 9 9 9 9 7 6 6 6 3 2 2
 0 | 0 0 3 3 9
 1 | 0 1 1 1 4 8 9 9
 2 | 1 4
 3 | 3
 4 |
 5 | 1

One residual (51.32) may be a high outlier, but the stemplot does not show any other deviations from normality.
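A stemplot like this one can be built by rounding each residual, using the tens digit as the stem and the ones digit as the leaf, and keeping separate stems for -0 and 0. The sketch below is one way to do that; the function name `stemplot` is hypothetical, and the residuals are entered with the negative signs that the stemplot's negative stems imply. It assumes the data contain at least one negative and one positive value.

```python
residuals = [
    -19.20, -31.13, -22.65, -15.18, -12.18, -15.15, -16.63, -6.18, -1.70, -22.60,
    -6.68, -6.17, -9.15, -23.58, -9.14, 2.80, -9.14, -1.66, -6.14, -12.60,
    0.34, -8.62, 2.85, 14.30, 9.82, 10.82, 0.37, 8.85, 10.87, 19.34,
    10.89, -2.55, 20.85, 24.35, 18.94, 32.89, 18.47, 51.32,
]

def stemplot(values):
    """Return stemplot lines: stem = tens digit, leaf = ones digit of the rounded value."""
    neg, pos = {}, {}
    for v in values:
        stem, leaf = divmod(abs(round(v)), 10)
        (neg if v < 0 else pos).setdefault(stem, []).append(leaf)
    lines = []
    # Negative stems, most negative first; leaves run from large to small
    for stem in range(max(neg), -1, -1):
        leaves = sorted(neg.get(stem, []), reverse=True)
        lines.append(f"-{stem} | " + " ".join(map(str, leaves)))
    # Positive stems, including empty ones so gaps stay visible
    for stem in range(max(pos) + 1):
        leaves = sorted(pos.get(stem, []))
        lines.append(f" {stem} | " + " ".join(map(str, leaves)))
    return [line.rstrip() for line in lines]

lines = stemplot(residuals)
print("\n".join(lines))
```

Keeping the empty stem for the 40s is what makes the residual 51.32 stand apart from the rest of the distribution.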
Example Page 802, #14.12
B) What other assumptions or conditions are required for using inference for regression on these data? Check that those conditions are satisfied and then describe your findings.
Example Page 802, #14.12
The scatter of the data points about the regression line varies to some extent as we move along the line, but the variation is not serious, as a residual plot shows. The other conditions can be assumed to be satisfied.
Example Page 802, #14.12
C) Would a 95% prediction interval for x = 25 be narrower, the same size, or wider than a 95% confidence interval? Explain your reasoning.
A prediction interval would be wider. For a fixed confidence level, the margin of error is always larger when we are predicting a single observation than when we are estimating the mean response.
Example Page 802, #14.12
D) A computer package reports that the 95% prediction interval for x = 25 is (91.85, 165.33). Explain what this interval means in simple language.
We are 95% confident that when x (crying intensity) = 25, the corresponding value of y (IQ) will be between 91.85 and 165.33.
Example Page 802, #14.14
In Exercise 14.11 (page 795) we regressed the lean of the Leaning Tower of Pisa on year to estimate the rate at which the tower is tilting. Here are the residuals from that regression, in order by years across the rows:

4.220  -3.099  -0.418  1.264  -2.055  3.626  2.308  -5.011  0.670  -4.648  -5.967  1.714  7.396

Use the residuals to check the regression conditions, and describe your findings. Is the regression in Exercise 14.11 trustworthy?
Example Page 802, #14.14
[Figures: residual plot against year; normal probability plot of the residuals]
The scatterplot of the residuals versus year does not suggest any problems. The regression in Exercise 14.11 should be fairly reliable.
Example Page 809, #14.24
Here are data on the time (in minutes) Professor Moore takes to swim 2000 yards and his pulse rate (beats per minute) after swimming:

Time:   34.12  35.72  34.72  34.05  34.13  35.72  36.17  35.57
Pulse:  152    124    140    152    146    128    136    144
Time:   35.37  35.57  35.43  36.05  34.85  34.70  34.75  33.93
Pulse:  148    144    136    124    148    144    140    156
Time:   34.60  34.00  34.35  35.62  35.68  35.28  35.97
Pulse:  136    148    148    132    124    132    139
Example Page 809, #14.24
A scatterplot shows a negative linear relationship: a faster time (fewer minutes) is associated with a higher heart rate. Here is part of the output from the regression function in the Excel spreadsheet.

             Coefficients    Standard Error   t Stat         P-value
Intercept    479.9341457     66.22779275      7.246718119    3.87075E-07
X variable   -9.694903394    1.888664503      -5.1332057     4.37908E-05

Give a 90% confidence interval for the slope of the true regression line. Explain what your result tells us about the relationship between the professor's swimming time and heart rate.
Example Page 809, #14.24
With n - 2 = 21 degrees of freedom, t* = 1.721:
$b \pm t^* SE_b = -9.6949 \pm 1.721(1.8887) = -12.9454$ to $-6.4444$ bpm per minute
With 90% confidence, we can say that for each 1-minute increase in swimming time, pulse rate drops by 6 to 13 bpm.
Example Page 809, #14.24 Using the TI
Example Page 809, #14.25
Exercise 14.24 gives data on a swimmer's time and heart rate. One day the swimmer completes his laps in 34.3 minutes but forgets to take his pulse. Minitab gives this prediction for heart rate when x* = 34.3:

Fit      StDev Fit   90.0% CI           90.0% PI
147.40   1.97        (144.02, 150.78)   (135.79, 159.01)

A) Verify that Fit is the predicted heart rate from the least-squares line found in Exercise 14.24. Then choose one of the intervals from the output to estimate the swimmer's heart rate that day and explain why you chose this interval.
Example Page 809, #14.25
ŷ (pulse) = 479.9 - 9.6949x (time)
When x = 34.3 minutes, ŷ (pulse) = 479.9 - 9.6949(34.3) = 147.37, which agrees with the output.
Example Page 809, #14.25
The prediction interval is appropriate for estimating one value (as opposed to the mean of many values): 135.79 to 159.01 bpm.
Example Page 809, #14.25
B) Minitab gives only one of the two standard errors used in prediction. It is SE_μ̂, the standard error for estimating the mean response. Use this fact and a critical value from Table C to verify Minitab's 90% confidence interval for the mean heart rate on days when the swimming time is 34.3 minutes.
Example Page 809, #14.25
With 21 degrees of freedom, t* = 1.721:
$\hat{y} \pm t^* SE_{\hat{\mu}} = 147.40 \pm 1.721(1.97) = 144.01$ to $150.79$, which agrees with the computer output up to rounding.
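The check in part B is a single line of arithmetic, and the same output also lets us recover the standard error that Minitab does not print. The sketch below goes one step beyond the exercise: because the two standard error formulas differ only by the extra 1 under the square root, SE_ŷ² = s² + SE_μ̂², so the half-width of the prediction interval implies a value for s. The variable names are illustrative, and the implied s is an approximation limited by the rounding in the printed output.

```python
from math import sqrt

fit, se_mean = 147.40, 1.97      # Minitab "Fit" and "StDev Fit"
t_star = 1.721                   # Table C, df = 21, 90% confidence

# Part B: confidence interval for the mean response at x* = 34.3
ci = (fit - t_star * se_mean, fit + t_star * se_mean)
print(f"90% CI: ({ci[0]:.2f}, {ci[1]:.2f})")

# Extra step (not in the exercise): back out the prediction standard error
pi_low, pi_high = 135.79, 159.01             # Minitab 90% PI
se_pred = (pi_high - pi_low) / 2 / t_star    # half-width divided by t*
s_implied = sqrt(se_pred ** 2 - se_mean ** 2)
print(f"SE for a single response = {se_pred:.2f}, implied s = {s_implied:.2f}")
```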