Chapter 10 Correlation and Regression
10-1 Review and Preview
10-2 Correlation
10-3 Regression
10-4 Prediction Intervals and Variation
10-5 Multiple Regression
10-6 Nonlinear Regression
Review In Chapter 9 we presented methods for making inferences from two samples. In this chapter we again consider paired sample data, but the objective is fundamentally different from that of Section 9-4.
Preview In this chapter we introduce methods for determining whether a correlation, or association, between two variables exists and whether the correlation is linear. For linear correlations, we can identify an equation that best fits the data and we can use that equation to predict the value of one variable given the value of the other variable.
10-2 Correlation
Key Concept In Part 1 of this section we introduce the linear correlation coefficient, r, which is a number that measures how well paired sample data fit a straight-line pattern when graphed. In Part 2, we discuss methods of hypothesis testing for correlation.
Part 1: Basic Concepts of Correlation Definition A correlation exists between two variables when the values of one variable are somehow associated with the values of the other. A linear correlation exists between two variables when there is a correlation and the plotted points of paired data result in a pattern that can be approximated by a straight line.
Scatterplots of Paired Data We can often see a relationship between two variables by constructing a scatterplot.
Requirements for Linear Correlation 1. The sample of paired (x, y) data is a simple random sample of quantitative data. 2. Visual examination of the scatterplot must confirm that the points approximate a straight-line pattern. 3. Any outliers must be removed if they are known to be errors.
Notation for the Linear Correlation Coefficient
n = number of pairs of sample data
Σx = sum of all x-values
Σx² indicates that each x-value should be squared and then those squares added
(Σx)² indicates that the x-values should be added and then the total squared
Σxy indicates that each x-value should be multiplied by its corresponding y-value, and then those products added
r = linear correlation coefficient for sample data
ρ = linear correlation coefficient for a population of paired data
Formula The linear correlation coefficient r. Here are two formulas:
r = Σ(z_x · z_y) / (n - 1)
r = (nΣxy - (Σx)(Σy)) / (√(n(Σx²) - (Σx)²) · √(n(Σy²) - (Σy)²))
Comment: use the first one.
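The two formulas can be checked against each other numerically. The sketch below computes r both ways on hypothetical paired data (the textbook's five shoe print / height pairs are not reproduced here):

```python
import math

def corr_z(xs, ys):
    """z-score form: r = sum(z_x * z_y) / (n - 1), using sample standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

def corr_sums(xs, ys):
    """Computational form, built from the sums in the notation table."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

xs = [29.0, 32.0, 27.0, 34.0, 30.0]       # hypothetical shoe print lengths (cm)
ys = [170.0, 180.0, 166.0, 184.0, 175.0]  # hypothetical heights (cm)
print(round(corr_z(xs, ys), 6), round(corr_sums(xs, ys), 6))
```

Both functions return the same value, which is why the slide can recommend either formula.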
Properties of the Linear Correlation Coefficient r 1. -1 ≤ r ≤ 1. 2. If all x- and y-values are converted to a different scale, the value of r does not change, i.e., r(kx, ky) = r(x, y). 3. Interchange all x- and y-values and the value of r will not change, i.e., r(x, y) = r(y, x). 4. r measures the strength of a linear relationship. 5. r is very sensitive to outliers.
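Properties 2 and 3 are easy to verify numerically. A minimal sketch, using hypothetical paired data:

```python
import math

def pearson(xs, ys):
    """Linear correlation coefficient r for paired sample data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

xs = [29.0, 32.0, 27.0, 34.0, 30.0]       # hypothetical data
ys = [170.0, 180.0, 166.0, 184.0, 175.0]

r = pearson(xs, ys)
r_rescaled = pearson([x / 2.54 for x in xs], ys)  # property 2: convert cm to inches
r_swapped = pearson(ys, xs)                       # property 3: interchange x and y
print(round(r, 6), round(r_rescaled, 6), round(r_swapped, 6))
```

All three printed values are identical, illustrating that r is unaffected by changes of scale and by swapping the roles of x and y.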
Example The paired shoe print length / height data from five males are listed below. Use a computer or a calculator to find the value of the correlation coefficient r.
Example - Continued Requirement Check: The data are a simple random sample of quantitative data, the plotted points appear to roughly approximate a straight-line pattern, and there are no outliers.
Example - Continued Results from several technologies used to calculate the value of r are displayed below.
Part 2: Formal Hypothesis Test We wish to determine whether there is a significant linear correlation between two variables. Notation: n = number of pairs of sample data; r = linear correlation coefficient for a sample of paired data; ρ = linear correlation coefficient for a population of paired data.
Hypothesis Test for Correlation Hypotheses:
H0: ρ = 0 (There is no linear correlation.)
H1: ρ ≠ 0 (There is a linear correlation.)
Test Statistic: r
Critical Values: Refer to Table A-6 (cannot be done on the TI-84).
P-values: Refer to technology.
Hypothesis Test for Correlation If |r| > critical value from Table A-6, reject the null hypothesis and conclude that there is sufficient evidence to support the claim of a linear correlation. If |r| ≤ critical value from Table A-6, fail to reject the null hypothesis and conclude that there is not sufficient evidence to support the claim of a linear correlation.
Example - Continued We found previously for the shoe print and height example that r = 0.591. Next, we conduct a formal hypothesis test of the claim that there is a linear correlation between the two variables. Use a 0.05 significance level.
Example - Continued We test the claim:
H0: ρ = 0 (There is no linear correlation.)
H1: ρ ≠ 0 (There is a linear correlation.)
The test statistic is r = 0.591. The critical values of r = ±0.878 are found in Table A-6 with n = 5 and α = 0.05. Because |r| = 0.591 does not exceed 0.878, we fail to reject the null hypothesis and conclude there is not sufficient evidence to support the claim of a linear correlation.
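The critical-value decision rule is mechanical enough to express as a small helper. The values below are the ones from this example (r = 0.591, Table A-6 critical value 0.878 for n = 5 and α = 0.05):

```python
def correlation_decision(r, critical_value):
    """Critical-value method: reject H0 (rho = 0) when |r| exceeds the table value."""
    if abs(r) > critical_value:
        return "reject H0: sufficient evidence of a linear correlation"
    return "fail to reject H0: not sufficient evidence of a linear correlation"

# r = 0.591 from the example; 0.878 from Table A-6 (n = 5, alpha = 0.05).
print(correlation_decision(0.591, 0.878))
```

Here |0.591| < 0.878, so the helper reports failure to reject H0, matching the slide's conclusion.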
P-Value Method for a Hypothesis Test for Linear Correlation Method 2: Using the t statistic. The test statistic is below; use n - 2 degrees of freedom.
t = r / √((1 - r²) / (n - 2))
P-values can be found using software or Table A-3.
Example Continuing the same example, we calculate the test statistic:
t = r / √((1 - r²) / (n - 2)) = 0.591 / √((1 - 0.591²) / (5 - 2)) = 1.269
P-value = 0.2937. Fail to reject H0. We conclude there is not sufficient evidence to support the claim of a linear correlation between shoe print length and height.
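The t statistic is straightforward to compute directly; the sketch below reproduces the arithmetic of this example (the P-value itself still comes from technology or Table A-3):

```python
import math

def t_statistic(r, n):
    """t = r / sqrt((1 - r^2) / (n - 2)), with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r * r) / (n - 2))

print(round(t_statistic(0.591, 5), 3))  # 1.269, matching the example
```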
One-Tailed Tests For these one-tailed tests, the P-value method can be used as in earlier chapters.
Interpreting r: Explained Variation The value of r² is the proportion of the variation in y that is explained by the linear relationship between x and y.
Example We found previously for the shoe print and height example that r = 0.591. With r = 0.591, we get r² = 0.349. We conclude that about 34.9% of the variation in height can be explained by the linear relationship between lengths of shoe prints and heights.
Common Errors Involving Correlation 1. Causation: It is wrong to conclude that correlation implies causality. 2. Averages: Averages suppress individual variation and may inflate the correlation coefficient. 3. Linearity: There may be some relationship between x and y even when there is no linear correlation.
10-3 Regression
Key Concept Part 1: Find the equation (regression equation) of the straight line (regression line) that best fits the paired sample data. Part 2: Discuss marginal change, influential points, and residual plots as tools for analyzing correlation and regression results.
Definitions Given a collection of paired sample data, the regression line is the straight line that best fits the scatterplot of the data. The regression equation ŷ = b0 + b1x algebraically describes the regression line. x is called the explanatory variable, predictor variable, or independent variable; ŷ is called the response variable or dependent variable.
Notation for Regression Equation
                              Population Parameter    Sample Statistic
y-intercept of the equation:  β0                      b0
Slope of the equation:        β1                      b1
Equation of the line:         y = β0 + β1x            ŷ = b0 + b1x
Requirements 1. The sample of paired (x, y) data is a random sample of quantitative data. 2. Visual examination of the scatterplot shows that the points approximate a straight-line pattern. 3. Any outliers must be removed if they are known to be errors. Consider the effects of any outliers that are not known errors.
Formulas for b1 and b0
Slope: b1 = r(s_y / s_x)
y-intercept: b0 = ȳ - b1x̄
Technology will compute these values.
Example Continuing the example from Section 10-2, let x, the shoe print length, be the predictor used to predict the response variable y, the height. The data are listed below:
Example - Continued Requirement Check: 1. The data are assumed to be a simple random sample. 2. The scatterplot showed a roughly straight-line pattern. 3. There are no outliers. The use of technology is recommended for finding the equation of a regression line.
Example - Continued All of these technologies show that the regression equation can be expressed as ŷ = 125 + 1.73x. Next we use the formulas to determine the regression equation by hand (technology is recommended).
Example Recall that r = 0.591269.
b1 = r(s_y / s_x) = 0.591269 × (4.87391 / 1.66823) = 1.72745
b0 = ȳ - b1x̄ = 177.3 - (1.72745)(30.04) = 125.40740
(the same coefficients found using technology)
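These two calculations can be replayed directly. The numbers below are the sample statistics quoted in the example:

```python
# Sample statistics quoted in the example.
r = 0.591269
s_x, s_y = 1.66823, 4.87391   # standard deviations of shoe print lengths, heights
x_bar, y_bar = 30.04, 177.3   # sample means

b1 = r * (s_y / s_x)          # slope
b0 = y_bar - b1 * x_bar       # y-intercept
print(round(b1, 5), round(b0, 5))
```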
Example Graph the regression equation ŷ = 125 + 1.73x on a scatterplot:
Strategy for Predicting Values of y
Example Use the 5 pairs of shoe print lengths and heights to predict the height of a person with a shoe print length of 29 cm. The regression line does not fit the points well. The correlation is r = 0.591, which suggests there is not a linear correlation (the P-value was 0.294). The best predicted height is simply the mean of the sample heights: ȳ = 177.3 cm.
Example Use the 40 pairs of shoe print lengths from Data Set 2 in Appendix B to predict the height of a person with a shoe print length of 29 cm. Now, the regression line does fit the points well, and the correlation of r = 0.813 suggests that there is a linear correlation (the P-value is 0.000).
Example - Continued Using technology we obtain the regression equation and scatterplot:
Example - Continued The given shoe print length of 29 cm is not beyond the scope of the available data, so we substitute 29 cm into the regression equation:
ŷ = 80.9 + 3.22x = 80.9 + 3.22(29) = 174.3 cm
A person with a shoe print length of 29 cm is predicted to be 174.3 cm tall.
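Because the correlation is significant here, prediction is a single substitution into the regression equation from this example:

```python
def predict_height(shoe_print_cm):
    """Regression equation from the 40-pair example: y-hat = 80.9 + 3.22x."""
    return 80.9 + 3.22 * shoe_print_cm

print(round(predict_height(29), 1))  # 174.3 cm, as on the slide
```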
Part 2: Beyond the Basics of Regression
Definition In working with two variables related by a regression equation, the marginal change in a variable is the amount that it changes when the other variable changes by exactly one unit. The slope b1 in the regression equation represents the marginal change in y that occurs when x changes by one unit.
Example For the 40 pairs of shoe print lengths and heights, the regression equation was ŷ = 80.9 + 3.22x. The slope of 3.22 tells us that if we increase shoe print length by 1 cm, the predicted height of a person increases by 3.22 cm.
Definition In a scatterplot, an outlier is a point lying far away from the other data points. Paired sample data may include one or more influential points, which are points that strongly affect the graph of the regression line.
Example For the 40 pairs of shoe prints and heights, observe what happens if we include this additional data point: x = 35 cm and y = 25 cm.
Example - Continued The additional point is an influential point because the graph of the regression line did change considerably. The additional point is also an outlier because it is far from the other points.
Definition For a pair of sample x and y values, the residual is the difference between the observed sample value of y and the y-value that is predicted by using the regression equation. That is: residual = observed y - predicted y = y - ŷ
Residuals
Definition A straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible.
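The least-squares property can be checked numerically: perturbing either coefficient of the fitted line can only increase the sum of squared residuals. A sketch with hypothetical data:

```python
def ssr(xs, ys, b0, b1):
    """Sum of squared residuals for the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def least_squares(xs, ys):
    """Fit b0, b1 by the usual least-squares formulas."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
b0, b1 = least_squares(xs, ys)
best = ssr(xs, ys, b0, b1)
# Any perturbed line has a larger sum of squared residuals.
print(best < ssr(xs, ys, b0 + 0.5, b1), best < ssr(xs, ys, b0, b1 + 0.1))  # True True
```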
Definition A residual plot is a scatterplot of the (x, y) values after each of the y-coordinate values has been replaced by the residual value y - ŷ (where ŷ denotes the predicted value of y). That is, a residual plot is a graph of the points (x, y - ŷ).
Residual Plot Analysis When analyzing a residual plot, look for a pattern in the way the points are configured, and use these criteria: The residual plot should not have any obvious patterns (not even a straight-line pattern); this confirms that the scatterplot of the sample data is a straight-line pattern. The residual plot should not become thicker (or thinner) when viewed from left to right; this confirms the requirement that, for different fixed values of x, the distributions of the corresponding y-values all have the same standard deviation.
Example The shoe print and height data are used to generate the following residual plot:
Example - Continued The residual plot becomes thicker, which suggests that the requirement of equal standard deviations is violated.
Example - Continued On the following slides are three residual plots. Observe what is good or bad about the individual regression models.
Example - Continued Regression model is a good model:
Example - Continued Distinct pattern: sample data may not follow a straight-line pattern.
Example - Continued Residual plot becoming thicker: equal standard deviations violated.
Complete Regression Analysis 1. Construct a scatterplot and verify that the pattern of the points is approximately a straight-line pattern without outliers. (If there are outliers, consider their effects by comparing results that include the outliers to results that exclude the outliers.) 2. Construct a residual plot and verify that there is no pattern (other than a straight-line pattern) and also verify that the residual plot does not become thicker (or thinner).
Complete Regression Analysis 3. Use a histogram and/or normal quantile plot to confirm that the values of the residuals have a distribution that is approximately normal. 4. Consider any effects of a pattern over time.
10-4 Prediction Intervals and Variation
Key Concept In this section we present a method for constructing a prediction interval, which is an interval estimate of a predicted value of y. (Interval estimates of parameters are confidence intervals, but interval estimates of variables are called prediction intervals.)
Requirements For each fixed value of x, the corresponding sample values of y are normally distributed about the regression line, with the same variance.
Formulas For a fixed and known x0, the prediction interval for an individual y value is:
ŷ - E < y < ŷ + E
with margin of error:
E = t_(α/2) · s_e · √(1 + 1/n + n(x0 - x̄)² / (n(Σx²) - (Σx)²))
Formulas The standard error of estimate is:
s_e = √(Σ(y - ŷ)² / (n - 2))
(It is suggested to use technology to get prediction intervals.)
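Putting the two formulas together, a prediction interval can be computed from raw data. The sketch below uses hypothetical data, and the critical value t_crit is supplied by the caller (in practice it comes from Table A-3 with n - 2 degrees of freedom):

```python
import math

def prediction_interval(xs, ys, x0, t_crit):
    """Prediction interval for an individual y at x0, per the formulas above.
    t_crit is the caller-supplied critical value (Table A-3, n - 2 df)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    # Standard error of estimate: s_e = sqrt(sum (y - y-hat)^2 / (n - 2)).
    s_e = math.sqrt(sum((y - (b0 + b1 * x)) ** 2
                        for x, y in zip(xs, ys)) / (n - 2))
    sum_x, sum_x2 = sum(xs), sum(x * x for x in xs)
    E = t_crit * s_e * math.sqrt(1 + 1 / n + n * (x0 - x_bar) ** 2
                                 / (n * sum_x2 - sum_x ** 2))
    y_hat = b0 + b1 * x0
    return y_hat - E, y_hat + E

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
lo, hi = prediction_interval(xs, ys, 3.0, 3.182)  # t for df = 3, 95% two-tailed
print(round(lo, 2), round(hi, 2))
```

Note that the margin of error grows as x0 moves away from x̄, so intervals are narrowest near the center of the data.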
Example Using the 40 pairs of shoe print lengths and heights, construct a 95% prediction interval for the height, given that the shoe print length is 29.0 cm. Recall (found using technology, data in Appendix B):
ŷ = 80.9 + 3.22x
x0 = 29.0, so ŷ = 174.3
s_e = 5.943761
t_(α/2) = 2.024 (Table A-3, df = 38, 0.05 area in two tails)
Example - Continued The 95% prediction interval is 162 cm < y < 186 cm. This is a large range of values, so the single shoe print doesn't give us very good information about someone's height.
Explained and Unexplained Variation Assume the following: There is sufficient evidence of a linear correlation. The equation of the line is ŷ = 3 + 2x. The mean of the y-values is 9. One of the pairs of sample data is x = 5 and y = 19. The point (5, 13) is on the regression line.
Explained and Unexplained Variation The figure shows that (5, 13) lies on the regression line, but (5, 19) does not. We arrive at:
Total deviation (from ȳ = 9) of the point (5, 19): y - ȳ = 19 - 9 = 10
Explained deviation (from ȳ = 9) of the point (5, 19): ŷ - ȳ = 13 - 9 = 4
Unexplained deviation (from ȳ = 9) of the point (5, 19): y - ŷ = 19 - 13 = 6
Relationships
(total deviation) = (explained deviation) + (unexplained deviation)
(y - ȳ) = (ŷ - ȳ) + (y - ŷ)
(total variation) = (explained variation) + (unexplained variation)
Σ(y - ȳ)² = Σ(ŷ - ȳ)² + Σ(y - ŷ)²
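The deviation identity can be confirmed with the numbers from this example:

```python
# Values from the example: regression line y-hat = 3 + 2x, y-bar = 9, point (5, 19).
y, y_bar = 19, 9
y_hat = 3 + 2 * 5            # predicted value at x = 5, i.e. 13

total = y - y_bar            # total deviation
explained = y_hat - y_bar    # explained deviation
unexplained = y - y_hat      # unexplained deviation
print(total, explained, unexplained)  # 10 4 6
```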
Definition The coefficient of determination is the amount of the variation in y that is explained by the regression line:
r² = (explained variation) / (total variation)
The value of r² is the proportion of the variation in y that is explained by the linear relationship between x and y.
Example Using the 40 paired shoe print lengths and heights from the data in Appendix B, we find the linear correlation coefficient to be r = 0.813. Then, the coefficient of determination is r² = 0.813² = 0.661. We conclude that 66.1% of the total variation in height can be explained by shoe print length, and the other 33.9% cannot be explained by shoe print length.
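Both the arithmetic above and the identity r² = explained variation / total variation can be verified; the paired data below are hypothetical:

```python
import math

# Arithmetic from the example: 0.813^2 rounds to 0.661.
assert round(0.813 ** 2, 3) == 0.661

# Identity check on a least-squares fit (hypothetical data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      / sum((x - mx) ** 2 for x in xs))
b0 = my - b1 * mx
preds = [b0 + b1 * x for x in xs]

explained = sum((p - my) ** 2 for p in preds)   # explained variation
total = sum((y - my) ** 2 for y in ys)          # total variation
r = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / math.sqrt(sum((x - mx) ** 2 for x in xs) *
                 sum((y - my) ** 2 for y in ys)))
print(round(explained / total, 6), round(r ** 2, 6))  # the two values agree
```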