CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression


Opening Example

SIMPLE LINEAR REGRESSION
Simple Regression
Linear Regression

Simple Regression
Definition: A regression model is a mathematical equation that describes the relationship between two or more variables. A simple regression model includes only two variables: one independent and one dependent. The dependent variable is the one being explained, and the independent variable is the one used to explain the variation in the dependent variable.

Linear Regression
Definition: A (simple) regression model that gives a straight-line relationship between two variables is called a linear regression model.

Figure 13.1 Relationship between food expenditure and income. (a) Linear relationship. (b) Nonlinear relationship.

Figure 13.2 Plotting a linear equation.
Figure 13.3 y-intercept and slope of a line.

SIMPLE LINEAR REGRESSION ANALYSIS
Definition: In the regression model y = A + Bx + ε, A is called the y-intercept or constant term, B is the slope, and ε is the random error term. The dependent and independent variables are y and x, respectively.

Definition: In the model ŷ = a + bx, a and b, which are calculated using sample data, are called the estimates of A and B, respectively.

Table 13.1 Incomes (in hundreds of dollars) and Food Expenditures of Seven Households

Scatter Diagram
Definition: A plot of paired observations is called a scatter diagram.

Figure 13.4 Scatter diagram.
Figure 13.5 Scatter diagram and straight lines.
Figure 13.6 Regression line and random errors.

Error Sum of Squares (SSE)
The error sum of squares, denoted SSE, is
\[ SSE = \sum e^2 = \sum (y - \hat{y})^2 \]
The values of a and b that give the minimum SSE are called the least squares estimates of A and B, and the regression line obtained with these estimates is called the least squares line.

The Least Squares Line
For the least squares regression line ŷ = a + bx,
\[ b = \frac{SS_{xy}}{SS_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x} \]
where
\[ SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} \quad \text{and} \quad SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} \]
and SS stands for "sum of squares." The least squares regression line ŷ = a + bx is also called the regression of y on x.

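Example 13-1
Find the least squares regression line for the data on incomes and food expenditures of the seven households given in Table 13.1. Use income as the independent variable and food expenditure as the dependent variable.

Table 13.2

Example 13-1: Solution
\[ \sum x = 386, \qquad \sum y = 108 \]
\[ \bar{x} = \frac{\sum x}{n} = \frac{386}{7} = 55.1429, \qquad \bar{y} = \frac{\sum y}{n} = \frac{108}{7} = 15.4286 \]
\[ SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 6403 - \frac{(386)(108)}{7} = 447.5714 \]
\[ SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 23{,}058 - \frac{(386)^2}{7} = 1772.8571 \]
\[ b = \frac{SS_{xy}}{SS_{xx}} = \frac{447.5714}{1772.8571} = .2525 \]
\[ a = \bar{y} - b\bar{x} = 15.4286 - (.2525)(55.1429) = 1.5050 \]
Thus, our estimated regression model is ŷ = 1.5050 + .2525x.

Figure 13.7 Error of prediction.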
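The arithmetic of Example 13-1 can be checked with a short Python sketch. It works only from the summary sums quoted in the solution (n = 7, Σx = 386, Σy = 108, Σxy = 6403) and assumes Σx² = 23,058, the only reading of the printed value that yields a positive SSxx. The function name `least_squares` is a hypothetical helper, not something from the slides.

```python
# Summary sums for the seven households in Example 13-1.
# sum_x2 = 23058 is an assumed reading of the printed value.
n = 7
sum_x, sum_y = 386, 108
sum_xy, sum_x2 = 6403, 23058


def least_squares(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b, ss_xy, ss_xx) for the line y-hat = a + b*x."""
    ss_xy = sum_xy - sum_x * sum_y / n      # SSxy
    ss_xx = sum_x2 - sum_x ** 2 / n         # SSxx
    b = ss_xy / ss_xx                       # slope estimate
    a = sum_y / n - b * (sum_x / n)         # intercept: y-bar - b * x-bar
    return a, b, ss_xy, ss_xx


a, b, ss_xy, ss_xx = least_squares(n, sum_x, sum_y, sum_xy, sum_x2)
print(f"SSxy = {ss_xy:.4f}, SSxx = {ss_xx:.4f}")
print(f"b = {b:.4f}, a = {a:.4f}")
```

The slope agrees with the slides to four decimals. The intercept comes out near 1.507 rather than 1.5050 because the slides round b to .2525 before computing a, while the sketch carries full precision.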

Interpretation of a and b

Interpretation of a
Consider a household with zero income. Using the estimated regression line obtained in Example 13-1,
ŷ = 1.5050 + .2525(0) = $1.5050 hundred.
Thus, we can state that a household with no income is expected to spend $150.50 per month on food.
The regression line is valid only for the values of x between 33 and 83.

Interpretation of b
The value of b in the regression model gives the change in y (dependent variable) due to a change of one unit in x (independent variable).
We can state that, on average, a $100 (or $1) increase in income of a household will increase the food expenditure by $25.25 (or $.2525).

Figure 13.8 Positive and negative linear relationships between x and y.

Case Study 13-1 Regression of Weights on Heights for NFL Players

Assumptions of the Regression Model
Assumption 1: The random error term ε has a mean equal to zero for each x.
Assumption 2: The errors associated with different observations are independent.
Assumption 3: For any given x, the distribution of errors is normal.
Assumption 4: The distribution of population errors for each x has the same (constant) standard deviation, which is denoted σ_ε.

Figure (a) Errors for households with an income of $4000 per month. (b) Errors for households with an income of $7500 per month.
Figure 13.12 Distribution of errors around the population regression line.
Figure Nonlinear relations between x and y.

STANDARD DEVIATION OF ERRORS AND COEFFICIENT OF DETERMINATION

Degrees of Freedom for a Simple Linear Regression Model
The degrees of freedom for a simple linear regression model are
df = n - 2

Figure Spread of errors for x = 40 and x = 75.

The standard deviation of errors is calculated as
\[ s_e = \sqrt{\frac{SS_{yy} - b\,SS_{xy}}{n - 2}} \quad \text{where} \quad SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} \]

Example 13-2
Compute the standard deviation of errors s_e for the data on monthly incomes and food expenditures of the seven households given in Table 13.1.

Table 13.3

Example 13-2: Solution
\[ SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 1792 - \frac{(108)^2}{7} = 125.7143 \]
\[ s_e = \sqrt{\frac{SS_{yy} - b\,SS_{xy}}{n - 2}} = \sqrt{\frac{125.7143 - .2525(447.5714)}{7 - 2}} = 1.5939 \]

COEFFICIENT OF DETERMINATION

Total Sum of Squares (SST)
The total sum of squares, denoted by SST, is calculated as
\[ SST = \sum y^2 - \frac{(\sum y)^2}{n} \]
Note that this is the same formula that we used to calculate SSyy.

Figure Total errors.
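As a rough check on Example 13-2, the sketch below recomputes SSyy and s_e in Python. It assumes Σy² = 1792 and reuses SSxy = 447.5714 and the rounded slope b = .2525 from Example 13-1, since the slides appear to work with the rounded slope. The variable names are illustrative.

```python
import math

# Inputs taken from Examples 13-1 and 13-2 (sum_y2 = 1792 is an assumed reading).
n = 7
sum_y, sum_y2 = 108, 1792
ss_xy = 447.5714
b = 0.2525                         # slope rounded to four decimals, as in the slides

ss_yy = sum_y2 - sum_y ** 2 / n    # SSyy, which is also SST
sse = ss_yy - b * ss_xy            # error sum of squares
s_e = math.sqrt(sse / (n - 2))     # standard deviation of errors, df = n - 2

print(f"SSyy = {ss_yy:.4f}, SSE = {sse:.4f}, s_e = {s_e:.4f}")
```

Carrying the unrounded slope through the same formula gives a slightly larger value (roughly 1.595), so small differences in the last digits are a rounding effect, not a different method.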

Table 13.4
Figure Errors of prediction when regression model is used.

Regression Sum of Squares (SSR)
The regression sum of squares, denoted by SSR, is
\[ SSR = SST - SSE \]

Coefficient of Determination
The coefficient of determination, denoted by r², represents the proportion of SST that is explained by the use of the regression model. The computational formula for r² is
\[ r^2 = \frac{b\,SS_{xy}}{SS_{yy}} \quad \text{and} \quad 0 \le r^2 \le 1 \]

Example 13-3
For the data of Table 13.1 on monthly incomes and food expenditures of seven households, calculate the coefficient of determination.

Example 13-3: Solution
From earlier calculations made in Examples 13-1 and 13-2,
b = .2525, SSxy = 447.5714, SSyy = 125.7143
\[ r^2 = \frac{b\,SS_{xy}}{SS_{yy}} = \frac{(.2525)(447.5714)}{125.7143} = .90 \]
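The decomposition SST = SSR + SSE behind r² can be sketched numerically. The inputs below are the rounded values b = .2525, SSxy = 447.5714 and SSyy = 125.7143 from the earlier examples. Treating SSR as b·SSxy follows from the computational formula for r² and is stated here as an assumption about how the slides intend it.

```python
# Rounded inputs from Examples 13-1 and 13-2.
b = 0.2525
ss_xy = 447.5714
ss_yy = 125.7143

sst = ss_yy                # total sum of squares
ssr = b * ss_xy            # regression (explained) sum of squares
sse = sst - ssr            # error (unexplained) sum of squares
r_squared = ssr / sst      # proportion of SST explained by the model

print(f"SSR = {ssr:.4f}, SSE = {sse:.4f}, r^2 = {r_squared:.2f}")
```

Read this way, about 90% of the variation in food expenditure is explained by income and about 10% is left in the errors.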

INFERENCES ABOUT B
Sampling Distribution of b
Estimation of B
Hypothesis Testing About B

Sampling Distribution of b
Mean, Standard Deviation, and Sampling Distribution of b
Because of the assumption of normally distributed random errors, the sampling distribution of b is normal. The mean and standard deviation of b, denoted by µ_b and σ_b, respectively, are
\[ \mu_b = B \quad \text{and} \quad \sigma_b = \frac{\sigma_\epsilon}{\sqrt{SS_{xx}}} \]

Estimation of B
Confidence Interval for B
The (1 - α)100% confidence interval for B is given by
\[ b \pm t\,s_b \quad \text{where} \quad s_b = \frac{s_e}{\sqrt{SS_{xx}}} \]
and the value of t is obtained from the t distribution table for α/2 area in the right tail of the t distribution and n - 2 degrees of freedom.

Example 13-4
Construct a 95% confidence interval for B for the data on incomes and food expenditures of seven households given in Table 13.1.

Example 13-4: Solution
\[ s_b = \frac{s_e}{\sqrt{SS_{xx}}} = \frac{1.5939}{\sqrt{1772.8571}} = .0379 \]
df = n - 2 = 7 - 2 = 5
α/2 = (1 - .95)/2 = .025
t = 2.571
\[ b \pm t\,s_b = .2525 \pm 2.571(.0379) = .2525 \pm .0974 = .155 \text{ to } .350 \]

Hypothesis Testing About B
Test Statistic for b
The value of the test statistic t for b is calculated as
\[ t = \frac{b - B}{s_b} \]
The value of B is substituted from the null hypothesis.
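The interval in Example 13-4 can be reproduced with a few lines of Python. This is a minimal sketch that assumes the rounded inputs s_e = 1.5939, SSxx = 1772.8571 and b = .2525 from the earlier examples, and takes the critical value t = 2.571 (df = 5, .025 in the right tail) as a constant from the t table instead of computing it.

```python
import math

# Rounded inputs from Examples 13-1 and 13-2.
s_e = 1.5939
ss_xx = 1772.8571
b = 0.2525
t_crit = 2.571                      # t-table value for df = 5, alpha/2 = .025

s_b = s_e / math.sqrt(ss_xx)        # estimated standard deviation of b
margin = t_crit * s_b               # half-width of the interval
lower, upper = b - margin, b + margin

print(f"s_b = {s_b:.4f}")
print(f"95% CI for B: {lower:.3f} to {upper:.3f}")
```

Because the whole interval lies above zero, it already suggests a positive slope at this confidence level.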

Example 13-5
Test at the 1% significance level whether the slope of the regression line for the example on incomes and food expenditures of seven households is positive.

Example 13-5: Solution
Step 1:
H0: B = 0 (The slope is zero)
H1: B > 0 (The slope is positive)

Step 2: σ_ε is not known. Hence, we will use the t distribution to make the test about B.

Step 3: α = .01
Area in the right tail = α = .01
df = n - 2 = 7 - 2 = 5
The critical value of t is 3.365.

Step 4: With B = 0 from H0,
\[ t = \frac{b - B}{s_b} = \frac{.2525 - 0}{.0379} = 6.662 \]

Step 5: The value of the test statistic is t = 6.662. It is greater than the critical value of t = 3.365, so it falls in the rejection region. Hence, we reject the null hypothesis. We conclude that x (income) determines y (food expenditure) positively.
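The decision in Example 13-5 reduces to one division and one comparison, sketched below. The inputs b = .2525 and s_b = .0379 are the rounded values from Example 13-4, and the critical value 3.365 (df = 5, .01 in the right tail) is taken from the t table as a constant.

```python
# Rounded inputs from Example 13-4; critical value from the t table.
b = 0.2525
s_b = 0.0379
B_null = 0.0            # value of B under H0
t_crit = 3.365          # df = 5, right-tail area .01

t_stat = (b - B_null) / s_b
reject_h0 = t_stat > t_crit     # right-tailed test: reject when t exceeds the critical value

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```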

LINEAR CORRELATION
Linear Correlation Coefficient
Hypothesis Testing About the Linear Correlation Coefficient

Linear Correlation Coefficient
Value of the Correlation Coefficient
The value of the correlation coefficient always lies in the range of -1 to 1; that is,
-1 ≤ ρ ≤ 1 and -1 ≤ r ≤ 1

Figure Linear correlation between two variables. (a) Perfect positive linear correlation, r = 1. (b) Perfect negative linear correlation, r = -1. (c) No linear correlation, r ≈ 0.

Copyright 2013 John Wiley & Sons. All rights reserved.

Figure Linear correlation between variables.

Linear Correlation Coefficient
The simple linear correlation coefficient, denoted by r, measures the strength of the linear relationship between two variables for a sample and is calculated as
\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} \]

Example 13-6
Calculate the correlation coefficient for the example on incomes and food expenditures of seven households.

Example 13-6: Solution
\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{447.5714}{\sqrt{(1772.8571)(125.7143)}} = .95 \]
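Example 13-6 can be checked directly from the three sums of squares. The sketch below assumes the rounded values SSxy = 447.5714, SSxx = 1772.8571 and SSyy = 125.7143 computed in the earlier examples.

```python
import math

# Rounded sums of squares from Examples 13-1 and 13-2.
ss_xy = 447.5714
ss_xx = 1772.8571
ss_yy = 125.7143

r = ss_xy / math.sqrt(ss_xx * ss_yy)    # sample linear correlation coefficient

print(f"r = {r:.4f}")
```

The sign of r always matches the sign of SSxy, and therefore the sign of the slope b, since both share the same numerator.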

Hypothesis Testing About the Linear Correlation Coefficient
Test Statistic for r
If both variables are normally distributed and the null hypothesis is H0: ρ = 0, then the value of the test statistic t is calculated as
\[ t = r\sqrt{\frac{n - 2}{1 - r^2}} \]
Here n - 2 are the degrees of freedom.

Example 13-7
Using the 1% level of significance and the data from Example 13-1, test whether the linear correlation coefficient between incomes and food expenditures is positive. Assume that the populations of both variables are normally distributed.

Example 13-7: Solution
Step 1:
H0: ρ = 0 (The linear correlation coefficient is zero)
H1: ρ > 0 (The linear correlation coefficient is positive)

Step 2: The population distributions for both variables are normally distributed. Hence, we can use the t distribution to perform this test about the linear correlation coefficient.

Step 3: Area in the right tail = .01
df = n - 2 = 7 - 2 = 5
The critical value of t = 3.365

Figure 13.20

Step 4:
\[ t = r\sqrt{\frac{n - 2}{1 - r^2}} = .9481\sqrt{\frac{7 - 2}{1 - (.9481)^2}} = 6.667 \]
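The Step 4 test statistic can be verified with the sketch below. It assumes r = .9481 (the four-decimal value of the correlation from Example 13-6) and uses the table value 3.365 for df = 5 and a right-tail area of .01.

```python
import math

n = 7
r = 0.9481              # sample correlation, four decimals (assumed from Example 13-6)
t_crit = 3.365          # df = n - 2 = 5, right-tail area .01

t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))
in_rejection_region = t_stat > t_crit

print(f"t = {t_stat:.3f}, in rejection region: {in_rejection_region}")
```

The value is close to the t statistic for the slope in Example 13-5. In simple linear regression the two tests are equivalent, and the small gap comes from rounding.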

Example 13-7: Solution
Step 5: The value of the test statistic is t = 6.667. It is greater than the critical value of t = 3.365, so it falls in the rejection region. Hence, we reject the null hypothesis. We conclude that there is a positive relationship between incomes and food expenditures.

REGRESSION ANALYSIS: A COMPLETE EXAMPLE

Example 13-8
A random sample of eight drivers insured with a company and having similar minimum required auto insurance policies was selected from a small city. The following table lists their driving experiences (in years) and monthly auto insurance premiums (in dollars).

(a) Does the insurance premium depend on the driving experience or does the driving experience depend on the insurance premium? Do you expect a positive or a negative relationship between these two variables?
(b) Compute SSxx, SSyy, and SSxy.
(c) Find the least squares regression line by choosing appropriate dependent and independent variables based on your answer in part a.
(d) Interpret the meaning of the values of a and b calculated in part c.
(e) Plot the scatter diagram and the regression line.
(f) Calculate r and r² and explain what they mean.
(g) Predict the monthly auto insurance premium for a driver with 10 years of driving experience.
(h) Compute the standard deviation of errors.
(i) Construct a 90% confidence interval for B.
(j) Test at the 5% significance level whether B is negative.
(k) Using α = .05, test whether ρ is different from zero.

Example 13-8: Solution
(a) Based on theory and intuition, we expect the insurance premium to depend on driving experience. The insurance premium is the dependent variable, and the driving experience is the independent variable.

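Table 13.5

(b)
\[ \bar{x} = \frac{\sum x}{n} = \frac{90}{8} = 11.25, \qquad \bar{y} = \frac{\sum y}{n} = \frac{474}{8} = 59.25 \]
\[ SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 4739 - \frac{(90)(474)}{8} = -593.5000 \]
\[ SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 1396 - \frac{(90)^2}{8} = 383.5000 \]
\[ SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 29{,}642 - \frac{(474)^2}{8} = 1557.5000 \]

(c)
\[ b = \frac{SS_{xy}}{SS_{xx}} = \frac{-593.5000}{383.5000} = -1.5476 \]
\[ a = \bar{y} - b\bar{x} = 59.25 - (-1.5476)(11.25) = 76.6605 \]
ŷ = 76.6605 - 1.5476x

(d) The value of a = 76.6605 gives the value of ŷ for x = 0; that is, it gives the monthly auto insurance premium for a driver with no driving experience. The value of b = -1.5476 indicates that, on average, for every extra year of driving experience, the monthly auto insurance premium decreases by $1.55.

(e) Figure 13.21 Scatter diagram and the regression line.
The regression line slopes downward from left to right.

(f)
\[ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{-593.5000}{\sqrt{(383.5000)(1557.5000)}} = -.77 \]
\[ r^2 = \frac{b\,SS_{xy}}{SS_{yy}} = \frac{(-1.5476)(-593.5000)}{1557.5000} = .59 \]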
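Parts (b), (c) and (f) of Example 13-8 can be worked from the summary sums alone, as in the sketch below. It uses n = 8, Σx = 90, Σy = 474, Σxy = 4739 and Σx² = 1396 from the solution, and assumes Σy² = 29,642, a reading that is consistent with r² = .59 in part (f). The raw table is not reproduced here, so the sketch does not rely on it.

```python
import math

# Summary sums for the eight drivers (sum_y2 = 29642 is an assumed reading).
n = 8
sum_x, sum_y = 90, 474
sum_xy, sum_x2, sum_y2 = 4739, 1396, 29642

# Part (b): sums of squares
ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n

# Part (c): least squares line y-hat = a + b*x
b = ss_xy / ss_xx
a = sum_y / n - b * (sum_x / n)

# Part (f): correlation and coefficient of determination
r = ss_xy / math.sqrt(ss_xx * ss_yy)
r_squared = r ** 2

print(f"SSxy = {ss_xy}, SSxx = {ss_xx}, SSyy = {ss_yy}")
print(f"b = {b:.4f}, a = {a:.4f}, r = {r:.2f}, r^2 = {r_squared:.2f}")
```

The negative SSxy forces both b and r to be negative, which matches the expectation in part (a) that premiums fall as experience rises.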

(f) The value of r = -.77 indicates that the driving experience and the monthly auto insurance premium are negatively related. The (linear) relationship is strong but not very strong. The value of r² = .59 states that 59% of the total variation in insurance premiums is explained by years of driving experience and 41% is not.

(g) Using the estimated regression line, we find that the predicted value of y for x = 10 is
ŷ = 76.6605 - 1.5476(10) = $61.18
Thus, we expect the monthly auto insurance premium of a driver with 10 years of driving experience to be $61.18.

(h)
\[ s_e = \sqrt{\frac{SS_{yy} - b\,SS_{xy}}{n - 2}} = \sqrt{\frac{1557.5000 - (-1.5476)(-593.5000)}{8 - 2}} = 10.3199 \]

(i)
\[ s_b = \frac{s_e}{\sqrt{SS_{xx}}} = \frac{10.3199}{\sqrt{383.5000}} = .5270 \]
α/2 = .5 - (.90/2) = .05
df = n - 2 = 8 - 2 = 6
t = 1.943
\[ b \pm t\,s_b = -1.5476 \pm 1.943(.5270) = -1.5476 \pm 1.0240 = -2.57 \text{ to } -.52 \]

(j)
Step 1:
H0: B = 0 (B is not negative)
H1: B < 0 (B is negative)

Step 2: Because the standard deviation of the error is not known, we use the t distribution to make the hypothesis test.

Step 3: Area in the left tail = α = .05
df = n - 2 = 8 - 2 = 6
The critical value of t is -1.943.

Figure 13.22

Step 4: With B = 0 from H0,
\[ t = \frac{b - B}{s_b} = \frac{-1.5476 - 0}{.5270} = -2.937 \]

Step 5: The value of the test statistic is t = -2.937. It falls in the rejection region. Hence, we reject the null hypothesis and conclude that B is negative. The monthly auto insurance premium decreases with an increase in years of driving experience.

(k)
Step 1:
H0: ρ = 0 (The linear correlation coefficient is zero)
H1: ρ ≠ 0 (The linear correlation coefficient is different from zero)

Step 2: Assuming that variables x and y are normally distributed, we will use the t distribution to perform this test about the linear correlation coefficient.

Figure 13.23

Step 3: Area in each tail = .05/2 = .025
df = n - 2 = 8 - 2 = 6
The critical values of t are -2.447 and 2.447.

Step 4:
\[ t = r\sqrt{\frac{n - 2}{1 - r^2}} = -.7679\sqrt{\frac{8 - 2}{1 - (-.7679)^2}} = -2.936 \]

Step 5: The value of the test statistic is t = -2.936. It falls in the rejection region. Hence, we reject the null hypothesis. We conclude that the linear correlation coefficient between driving experience and auto insurance premium is different from zero.

USING THE REGRESSION MODEL
Using the Regression Model for Estimating the Mean Value of y
Using the Regression Model for Predicting a Particular Value of y

Figure 13.24 Population and sample regression lines.

Using the Regression Model for Estimating the Mean Value of y
Confidence Interval for µ_{y|x}
The (1 - α)100% confidence interval for µ_{y|x} for x = x0 is
\[ \hat{y} \pm t\,s_{\hat{y}_m} \]
where the value of t is obtained from the t distribution table for α/2 area in the right tail of the t distribution curve and df = n - 2. The value of s_ŷm is calculated as follows:
\[ s_{\hat{y}_m} = s_e\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}} \]

Example 13-9
Refer to Example 13-1 on incomes and food expenditures. Find a 99% confidence interval for the mean food expenditure for all households with a monthly income of $5500.

Example 13-9: Solution
Using the regression line estimated in Example 13-1, we find the point estimate of the mean food expenditure for x = 55:
ŷ = 1.5050 + .2525(55) = $15.3925 hundred
Area in each tail = α/2 = (1 - .99)/2 = .005
df = n - 2 = 7 - 2 = 5
t = 4.032

s_e = 1.5939, x̄ = 55.1429, and SSxx = 1772.8571
\[ s_{\hat{y}_m} = s_e\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}} = (1.5939)\sqrt{\frac{1}{7} + \frac{(55 - 55.1429)^2}{1772.8571}} = .6025 \]
Hence, the 99% confidence interval for µ_{y|55} is
\[ \hat{y} \pm t\,s_{\hat{y}_m} = 15.3925 \pm 4.032(.6025) = 15.3925 \pm 2.4293 = 12.9632 \text{ to } 17.8218 \]

Using the Regression Model for Predicting a Particular Value of y
Prediction Interval for y_p
The (1 - α)100% prediction interval for the predicted value of y, denoted by y_p, for x = x0 is
\[ \hat{y} \pm t\,s_{\hat{y}_p} \]
where the value of t is obtained from the t distribution table for α/2 area in the right tail of the t distribution curve and df = n - 2. The value of s_ŷp is calculated as follows:
\[ s_{\hat{y}_p} = s_e\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}} \]
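The confidence interval for the mean response in Example 13-9 can be sketched as a small function. The helper name `mean_response_interval` is hypothetical. The inputs are the rounded values a = 1.5050, b = .2525, s_e = 1.5939, x̄ = 55.1429 and SSxx = 1772.8571 from the earlier examples, and t = 4.032 is the t-table value assumed for df = 5 with .005 in each tail.

```python
import math


def mean_response_interval(a, b, x0, s_e, n, x_bar, ss_xx, t_crit):
    """Confidence interval for the mean of y at x = x0 (hypothetical helper)."""
    y_hat = a + b * x0                                          # point estimate
    s_ym = s_e * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ss_xx)   # std. dev. of the estimate
    margin = t_crit * s_ym
    return y_hat, s_ym, y_hat - margin, y_hat + margin


y_hat, s_ym, lower, upper = mean_response_interval(
    a=1.5050, b=0.2525, x0=55, s_e=1.5939,
    n=7, x_bar=55.1429, ss_xx=1772.8571, t_crit=4.032,
)
print(f"y-hat = {y_hat:.4f}, s_ym = {s_ym:.4f}")
print(f"99% CI for the mean: {lower:.4f} to {upper:.4f}")
```

Since x0 = 55 is almost equal to x̄, the second term under the square root is nearly zero and the interval is about as narrow as this sample allows.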

Example 13-10
Refer to Example 13-1 on incomes and food expenditures. Find a 99% prediction interval for the predicted food expenditure for a randomly selected household with a monthly income of $5500.

Example 13-10: Solution
Using the regression line estimated in Example 13-1, we find the point estimate of the predicted food expenditure for x = 55:
ŷ = 1.5050 + .2525(55) = $15.3925 hundred
Area in each tail = α/2 = (1 - .99)/2 = .005
df = n - 2 = 7 - 2 = 5
t = 4.032

s_e = 1.5939, x̄ = 55.1429, and SSxx = 1772.8571
\[ s_{\hat{y}_p} = s_e\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS_{xx}}} = (1.5939)\sqrt{1 + \frac{1}{7} + \frac{(55 - 55.1429)^2}{1772.8571}} = 1.7040 \]
Hence, the 99% prediction interval for y_p for x = 55 is
\[ \hat{y} \pm t\,s_{\hat{y}_p} = 15.3925 \pm 4.032(1.7040) = 15.3925 \pm 6.8705 = 8.5220 \text{ to } 22.2630 \]

TI-84
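Example 13-10 differs from the mean-value interval only by the extra 1 under the square root. The sketch below uses a hypothetical helper `prediction_interval` with the same rounded inputs (a = 1.5050, b = .2525, s_e = 1.5939, x̄ = 55.1429, SSxx = 1772.8571) and the assumed t-table value 4.032 for df = 5.

```python
import math


def prediction_interval(a, b, x0, s_e, n, x_bar, ss_xx, t_crit):
    """Prediction interval for a single y at x = x0 (hypothetical helper)."""
    y_hat = a + b * x0
    # The leading 1 accounts for the scatter of an individual y around the line.
    s_yp = s_e * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / ss_xx)
    margin = t_crit * s_yp
    return y_hat, s_yp, y_hat - margin, y_hat + margin


y_hat, s_yp, lower, upper = prediction_interval(
    a=1.5050, b=0.2525, x0=55, s_e=1.5939,
    n=7, x_bar=55.1429, ss_xx=1772.8571, t_crit=4.032,
)
print(f"s_yp = {s_yp:.4f}")
print(f"99% prediction interval: {lower:.4f} to {upper:.4f}")
```

The prediction interval is much wider than the confidence interval for the mean at the same x0, because a single household's spending varies around the line in addition to the uncertainty in the line itself.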

Minitab
Excel