Chapter 10 Correlation and Regression


Chapter 10 Correlation and Regression
10-1 Review and Preview
10-2 Correlation
10-3 Regression
10-4 Prediction Intervals and Variation
10-5 Multiple Regression
10-6 Nonlinear Regression

Review In Chapter 9 we presented methods for making inferences from two samples. In this chapter we again consider paired sample data, but the objective is fundamentally different from that of Section 9-4.

Preview In this chapter we introduce methods for determining whether a correlation, or association, between two variables exists and whether the correlation is linear. For linear correlations, we can identify an equation that best fits the data and we can use that equation to predict the value of one variable given the value of the other variable.

10-2 Correlation

Key Concept In Part 1 of this section we introduce the linear correlation coefficient, r, which is a number that measures how well paired sample data fit a straight-line pattern when graphed. In Part 2, we discuss methods of hypothesis testing for correlation.

Part 1: Basic Concepts of Correlation Definition A correlation exists between two variables when the values of one variable are associated in some way with the values of the other. A linear correlation exists between two variables when there is a correlation and the plotted points of paired data result in a pattern that can be approximated by a straight line.

Scatterplots of Paired Data We can often see a relationship between two variables by constructing a scatterplot.

Scatterplots of Paired Data [Example scatterplots not reproduced in this transcription.]

Requirements for Linear Correlation 1. The sample of paired (x, y) data is a simple random sample of quantitative data. 2. Visual examination of the scatterplot must confirm that the points approximate a straight-line pattern. 3. Any outliers must be removed if they are known to be errors.

Notation for the Linear Correlation Coefficient
n = number of pairs of sample data
Σx = the sum of all x-values
Σx² = indicates that each x-value should be squared and then those squares added
(Σx)² = indicates that the x-values should be added and then the total squared
Σxy = indicates that each x-value should be multiplied by its corresponding y-value, and then those products added
r = linear correlation coefficient for sample data
ρ = linear correlation coefficient for a population of paired data

Formula The linear correlation coefficient r: Here are two formulas:
r = Σ(z_x · z_y) / (n − 1), where z_x and z_y are the z-scores of the sample x- and y-values
r = [ n(Σxy) − (Σx)(Σy) ] / ( √[ n(Σx²) − (Σx)² ] · √[ n(Σy²) − (Σy)² ] )
Comment: use the first one.

Properties of the Linear Correlation Coefficient r 1. −1 ≤ r ≤ 1. 2. If all x- and y-values are converted to a different scale, the value of r does not change, i.e., r(kx, ky) = r(x, y). 3. Interchanging all x- and y-values does not change the value of r, i.e., r(x, y) = r(y, x). 4. r measures the strength of a linear relationship. 5. r is very sensitive to outliers.
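As a quick illustration of properties 1 through 3 (not part of the original slides), the following Python sketch checks that rescaling either variable or interchanging x and y leaves r unchanged; the five data pairs here are made up for the demonstration.

import numpy as np

# Made-up paired data, used only to illustrate the properties of r.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

r_xy = np.corrcoef(x, y)[0, 1]
r_rescaled = np.corrcoef(2.54 * x, 10.0 * y)[0, 1]  # property 2: change of scale
r_swapped = np.corrcoef(y, x)[0, 1]                 # property 3: interchange x and y

print(r_xy, r_rescaled, r_swapped)  # all three print the same value, 0.8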

Example The paired shoe print / height data from five males are listed below. Use a computer or a calculator to find the value of the correlation coefficient r.
Shoe print length (cm): 29.7, 29.7, 31.4, 31.8, 27.6
Height (cm): 175.3, 177.8, 185.4, 175.3, 172.7

Example - Continued Requirement Check: The data are a simple random sample of quantitative data, the plotted points appear to roughly approximate a straight-line pattern, and there are no outliers.

Example - Continued A few technology displays used to calculate the value of r are shown below; each gives r = 0.591. [Technology screenshots not reproduced in this transcription.]
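As a cross-check on the technology results, here is a minimal Python sketch (not part of the original slides) that applies both formulas from the Formula slide to the five pairs above; NumPy is assumed to be available.

import numpy as np

x = np.array([29.7, 29.7, 31.4, 31.8, 27.6])       # shoe print length (cm)
y = np.array([175.3, 177.8, 185.4, 175.3, 172.7])  # height (cm)
n = len(x)

# Formula 1: r = sum(z_x * z_y) / (n - 1), with sample standard deviations
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r1 = np.sum(zx * zy) / (n - 1)

# Formula 2: the computational formula
r2 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (
    np.sqrt(n * np.sum(x**2) - np.sum(x)**2) *
    np.sqrt(n * np.sum(y**2) - np.sum(y)**2))

print(round(r1, 6), round(r2, 6))  # both print 0.591269, matching the slides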

Part 2: Formal Hypothesis Test We wish to determine whether there is a significant linear correlation between two variables. Notation:
n = number of pairs of sample data
r = linear correlation coefficient for a sample of paired data
ρ = linear correlation coefficient for a population of paired data

Hypothesis Test for Correlation Hypotheses:
H0: ρ = 0 (There is no linear correlation.)
H1: ρ ≠ 0 (There is a linear correlation.)
Test statistic: r
Critical values: Refer to Table A-6 (not available on the TI-84).
P-values: Refer to technology.

Hypothesis Test for Correlation If |r| > critical value from Table A-6, reject the null hypothesis and conclude that there is sufficient evidence to support the claim of a linear correlation. If |r| ≤ critical value from Table A-6, fail to reject the null hypothesis and conclude that there is not sufficient evidence to support the claim of a linear correlation.

Example - Continued We found previously for the shoe print and height example that r = 0.591. Next, we conduct a formal hypothesis test of the claim that there is a linear correlation between the two variables. Use a 0.05 significance level.

Example - Continued We test the claim:
H0: ρ = 0 (There is no linear correlation.)
H1: ρ ≠ 0 (There is a linear correlation.)
The test statistic is r = 0.591. The critical values r = ±0.878 are found in Table A-6 with n = 5 and α = 0.05. Since |0.591| ≤ 0.878, we fail to reject the null hypothesis and conclude there is not sufficient evidence to support the claim of a linear correlation.

P-Value Method for a Hypothesis Test for Linear Correlation Method 2: using the t statistic. The test statistic is below; use n − 2 degrees of freedom:
t = r / √[ (1 − r²) / (n − 2) ]
P-values can be found using software or Table A-3.

Example Continuing the same example, we calculate the test statistic:
t = r / √[ (1 − r²) / (n − 2) ] = 0.591 / √[ (1 − 0.591²) / (5 − 2) ] = 1.269
The P-value is 0.2937, which is greater than 0.05, so we fail to reject H0. We conclude there is not sufficient evidence to support the claim of a linear correlation between shoe print lengths and heights.
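The same test can be reproduced with a short Python sketch (not part of the original slides); scipy.stats.pearsonr reports the two-sided P-value from the t distribution with n − 2 degrees of freedom, and the manual calculation below mirrors the slide's formula.

import numpy as np
from scipy import stats

x = np.array([29.7, 29.7, 31.4, 31.8, 27.6])
y = np.array([175.3, 177.8, 185.4, 175.3, 172.7])
n = len(x)

r, p = stats.pearsonr(x, y)
t = r / np.sqrt((1 - r**2) / (n - 2))        # test statistic from the slide
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)  # two-tailed P-value

print(round(t, 3), round(p, 4), round(p_manual, 4))
# t is about 1.27, P-value = 0.2937 > 0.05, so fail to reject H0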

One-Tailed Tests A claim of a positive linear correlation (H1: ρ > 0) or of a negative linear correlation (H1: ρ < 0) leads to a one-tailed test. For these one-tailed tests, the P-value method can be used as in earlier chapters.

Interpreting r: Explained Variation The value of r² is the proportion of the variation in y that is explained by the linear relationship between x and y.

Example We found previously for the shoe print and height example that r = 0.591, so r² = 0.349. We conclude that about 34.9% of the variation in height can be explained by the linear relationship between lengths of shoe prints and heights.

Common Errors Involving Correlation 1. Causation: It is wrong to conclude that correlation implies causality. 2. Averages: Averages suppress individual variation and may inflate the correlation coefficient. 3. Linearity: There may be some relationship between x and y even when there is no linear correlation.

10-3 Regression

Key Concept Part 1: find the equation (regression equation) of the straight line (regression line) that best fits the paired sample data. Part 2: discuss marginal change, influential points, and residual plots as tools for analyzing correlation and regression results.

Definitions Given a collection of paired sample data, the regression line is the straight line that best fits the scatterplot of the data. The regression equation ŷ = b0 + b1x algebraically describes the regression line. x is called the explanatory variable, predictor variable, or independent variable; ŷ is called the response variable or dependent variable.

Notation for Regression Equation

                                      Population Parameter   Sample Statistic
y-intercept of regression equation    β0                     b0
Slope of regression equation          β1                     b1
Equation of the regression line       y = β0 + β1x           ŷ = b0 + b1x

Requirements 1. The sample of paired (x, y) data is a random sample of quantitative data. 2. Visual examination of the scatterplot shows that the points approximate a straight-line pattern. 3. Any outliers must be removed if they are known to be errors. Consider the effects of any outliers that are not known errors.

Formulas for b1 and b0
Slope: b1 = r(sy / sx)
y-intercept: b0 = ȳ − b1x̄
Technology will compute these values.

Example Continuing the example from Section 10-2, let x (shoe print length) be used to predict the response variable y (height). The data are the same five pairs listed earlier.

Example - Continued Requirement Check: 1. The data are assumed to be a simple random sample. 2. The scatterplot showed a roughly straight-line pattern. 3. There are no outliers. The use of technology is recommended for finding the equation of a regression line.

Example - Continued The technology results (screenshots not reproduced in this transcription) all show that the regression equation can be expressed as: ŷ = 125 + 1.73x. Now we use the formulas to determine the regression equation (technology is recommended).

Example Recall that r = 0.591269. Then:
b1 = r(sy / sx) = 0.591269 · (4.87391 / 1.66823) = 1.72745
b0 = ȳ − b1x̄ = 177.3 − (1.72745)(30.04) = 125.40740
These are the same coefficients found using technology.
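The hand calculation can be verified with a few lines of Python (not part of the original slides); scipy.stats.linregress is one of many tools that returns the same slope and intercept.

import numpy as np
from scipy import stats

x = np.array([29.7, 29.7, 31.4, 31.8, 27.6])
y = np.array([175.3, 177.8, 185.4, 175.3, 172.7])

r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)  # b1 = r(sy / sx)
b0 = y.mean() - b1 * x.mean()           # b0 = ybar - b1*xbar

res = stats.linregress(x, y)
print(b1, b0)                    # approximately 1.72745 and 125.407
print(res.slope, res.intercept)  # the same values, as from technology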

Example Graph the regression equation ŷ = 125 + 1.73x on a scatterplot. [Scatterplot with the fitted regression line not reproduced.]

Strategy for Predicting Values of y Use the regression equation for predictions only if the regression line fits the data well: there must be a significant linear correlation, and the value of x must not be beyond the scope of the available sample data. Otherwise, the best predicted y-value is ȳ, the mean of the sample y-values.

Example Use the 5 pairs of shoe print lengths and heights to predict the height of a person with a shoe print length of 29 cm. The regression line does not fit the points well: the correlation is r = 0.591, which is not significant (the P-value was 0.294). The best predicted height is therefore simply the mean of the sample heights: ȳ = 177.3 cm.

Example Use the 40 pairs of shoe print lengths from Data Set 2 in Appendix B to predict the height of a person with a shoe print length of 29 cm. Now, the regression line does fit the points well, and the correlation of r = 0.813 suggests that there is a linear correlation (the P-value is 0.000).

Example - Continued Using technology we obtain the regression equation and scatterplot: ŷ = 80.9 + 3.22x. [Scatterplot not reproduced.]

Example - Continued The given shoe print length of 29 cm is not beyond the scope of the available data, so substitute x = 29 into the regression equation: ŷ = 80.9 + 3.22x = 80.9 + 3.22(29) = 174.3 cm. A person with a shoe print length of 29 cm is predicted to be 174.3 cm tall.
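The full prediction strategy can be captured in a small hypothetical helper function like the predict_y sketch below (Python with SciPy; not part of the original slides). It falls back to the sample mean of y when the correlation is not significant, as in the five-pair example, and otherwise uses the regression equation.

import numpy as np
from scipy import stats

def predict_y(x, y, x0, alpha=0.05):
    # Use the regression equation only when there is a significant linear
    # correlation; otherwise the best predicted value is ybar. (The slides'
    # strategy also requires x0 to be within the scope of the data; that
    # check is omitted here for brevity.)
    res = stats.linregress(x, y)
    if res.pvalue < alpha:
        return res.intercept + res.slope * x0
    return float(np.mean(y))

x = np.array([29.7, 29.7, 31.4, 31.8, 27.6])
y = np.array([175.3, 177.8, 185.4, 175.3, 172.7])
print(predict_y(x, y, 29.0))  # 177.3, the mean height (P-value = 0.294)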

Part 2: Beyond the Basics of Regression

Definition In working with two variables related by a regression equation, the marginal change in a variable is the amount that it changes when the other variable changes by exactly one unit. The slope b1 in the regression equation represents the marginal change in y that occurs when x changes by one unit.

Example For the 40 pairs of shoe print lengths and heights, the regression equation was: ŷ = 80.9 + 3.22x. The slope of 3.22 tells us that if we increase shoe print length by 1 cm, the predicted height of a person increases by 3.22 cm.

Definition In a scatterplot, an outlier is a point lying far away from the other data points. Paired sample data may include one or more influential points, which are points that strongly affect the graph of the regression line.

Example For the 40 pairs of shoe prints and heights, observe what happens if we include this additional data point: x = 35 cm and y = 25 cm

Example - Continued The additional point is an influential point because the graph of the regression line changes considerably when the point is included. The additional point is also an outlier because it is far from the other points.
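Because the 40-pair data set is not reproduced in this transcription, the Python sketch below (not part of the original slides) illustrates the same idea with the five-pair sample: adding the point (35, 25) changes the fitted line dramatically.

import numpy as np
from scipy import stats

x = np.array([29.7, 29.7, 31.4, 31.8, 27.6])
y = np.array([175.3, 177.8, 185.4, 175.3, 172.7])

before = stats.linregress(x, y)
after = stats.linregress(np.append(x, 35.0), np.append(y, 25.0))

print(before.slope, before.intercept)  # positive slope, as found earlier
print(after.slope, after.intercept)    # the slope even changes sign
# The point (35, 25) strongly affects the regression line: it is influential.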

Definition For a pair of sample x and y values, the residual is the difference between the observed sample value of y and the y-value that is predicted by using the regression equation. That is: residual = (observed y) − (predicted y) = y − ŷ

Residuals [Figure illustrating residuals not reproduced in this transcription.]

Definition A straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible.
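A numeric way to see the least-squares property (not part of the original slides): in the Python sketch below, perturbing the fitted line in any direction can only increase the sum of squared residuals.

import numpy as np
from scipy import stats

x = np.array([29.7, 29.7, 31.4, 31.8, 27.6])
y = np.array([175.3, 177.8, 185.4, 175.3, 172.7])
res = stats.linregress(x, y)

def sse(b0, b1):
    # Sum of squared residuals for the line y = b0 + b1*x
    return np.sum((y - (b0 + b1 * x)) ** 2)

print(sse(res.intercept, res.slope))        # smallest possible value
print(sse(res.intercept, res.slope + 0.5))  # larger
print(sse(res.intercept - 1.0, res.slope))  # also larger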

Definition A residual plot is a scatterplot of the (x, y) values after each of the y-coordinate values has been replaced by the residual value y − ŷ (where ŷ denotes the predicted value of y). That is, a residual plot is a graph of the points (x, y − ŷ).

Residual Plot Analysis When analyzing a residual plot, look for a pattern in the way the points are configured, and use these criteria:
The residual plot should not have any obvious pattern (not even a straight-line pattern). This confirms that the points in the scatterplot of the sample data approximate a straight-line pattern.
The residual plot should not become thicker (or thinner) when viewed from left to right. This confirms the requirement that for different fixed values of x, the distributions of the corresponding y-values all have the same standard deviation.

Example The shoe print and height data are used to generate the following residual plot: [Residual plot not reproduced.]
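A residual plot like the one described can be generated with Matplotlib (a sketch, not part of the original slides; the five-pair sample stands in for the full data set).

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([29.7, 29.7, 31.4, 31.8, 27.6])
y = np.array([175.3, 177.8, 185.4, 175.3, 172.7])

res = stats.linregress(x, y)
residuals = y - (res.intercept + res.slope * x)  # y - yhat for each point

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")  # reference line at residual = 0
plt.xlabel("shoe print length (cm)")
plt.ylabel("residual (cm)")
plt.show()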

Example - Continued The residual plot becomes thicker, which suggests that the requirement of equal standard deviations is violated.

Example - Continued On the following slides are three residual plots (the plots themselves are not reproduced in this transcription). Observe what is good or bad about the individual regression models.

Example - Continued Regression model is a good model.

Example - Continued Distinct pattern: the sample data may not follow a straight-line pattern.

Example - Continued Residual plot becoming thicker: equal standard deviations violated.

Complete Regression Analysis 1. Construct a scatterplot and verify that the pattern of the points is approximately a straight-line pattern without outliers. (If there are outliers, consider their effects by comparing results that include the outliers to results that exclude the outliers.) 2. Construct a residual plot and verify that there is no pattern (other than a straight-line pattern) and also verify that the residual plot does not become thicker (or thinner).

Complete Regression Analysis 3. Use a histogram and/or normal quantile plot to confirm that the values of the residuals have a distribution that is approximately normal. 4. Consider any effects of a pattern over time.
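Step 3 can be carried out with a normal quantile plot of the residuals, for example via scipy.stats.probplot (a sketch, not part of the original slides).

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([29.7, 29.7, 31.4, 31.8, 27.6])
y = np.array([175.3, 177.8, 185.4, 175.3, 172.7])

res = stats.linregress(x, y)
residuals = y - (res.intercept + res.slope * x)

# Points lying close to a straight line suggest approximately normal residuals.
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()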

10-4 Prediction Intervals and Variation

Key Concept In this section we present a method for constructing a prediction interval, which is an interval estimate of a predicted value of y. (Interval estimates of parameters are confidence intervals, but interval estimates of variables are called prediction intervals.)

Requirements For each fixed value of x, the corresponding sample values of y are normally distributed about the regression line, with the same variance.

Formulas For a fixed and known x0, the prediction interval for an individual y-value is:
ŷ − E < y < ŷ + E
with margin of error:
E = t_{α/2} · s_e · √[ 1 + 1/n + n(x0 − x̄)² / (n(Σx²) − (Σx)²) ]

Formulas The standard error estimate is:
s_e = √[ Σ(y − ŷ)² / (n − 2) ]
(It is suggested to use technology to get prediction intervals.)
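The two formulas above can be combined into a small hypothetical helper, prediction_interval, sketched below in Python (not part of the original slides). Applied to the 40-pair Appendix B data (not reproduced here), it should reproduce the slide's interval of roughly 162 cm to 186 cm.

import numpy as np
from scipy import stats

def prediction_interval(x, y, x0, conf=0.95):
    # Prediction interval for an individual y at a fixed, known x0.
    n = len(x)
    res = stats.linregress(x, y)
    y_hat = res.intercept + res.slope * x0
    resid = y - (res.intercept + res.slope * x)
    s_e = np.sqrt(np.sum(resid**2) / (n - 2))           # standard error estimate
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 2)  # t_{alpha/2}
    E = t_crit * s_e * np.sqrt(
        1 + 1/n + n * (x0 - x.mean())**2 / (n * np.sum(x**2) - np.sum(x)**2))
    return y_hat - E, y_hat + E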

Example If we use the 40 pairs of shoe print lengths and heights, construct a 95% prediction interval for the height, given that the shoe print length is 29.0 cm. Recall (found using technology, data in Appendix B):
ŷ = 80.9 + 3.22x
x0 = 29.0, so ŷ = 174.3
s_e = 5.943761
t_{α/2} = 2.024 (Table A-3, df = 38, 0.05 area in two tails)

Example - Continued The 95% prediction interval is 162 cm < y < 186 cm. This is a large range of values, so the single shoe print doesn't give us very good information about someone's height.

Explained and Unexplained Variation Assume the following: there is sufficient evidence of a linear correlation; the equation of the line is ŷ = 3 + 2x; the mean of the y-values is 9; and one of the pairs of sample data is x = 5 and y = 19. The point (5, 13) is on the regression line.

Explained and Unexplained Variation The figure shows that (5, 13) lies on the regression line but (5, 19) does not. We arrive at:
Total deviation (from ȳ = 9) of the point (5, 19): y − ȳ = 19 − 9 = 10
Explained deviation: ŷ − ȳ = 13 − 9 = 4
Unexplained deviation: y − ŷ = 19 − 13 = 6

Relationships
(total deviation) = (explained deviation) + (unexplained deviation)
(y − ȳ) = (ŷ − ȳ) + (y − ŷ)
(total variation) = (explained variation) + (unexplained variation)
Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)²
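The identity is easy to verify numerically; the Python lines below (not part of the original slides) check it for the single point used above.

# Toy point from the preceding slides: y = 19, yhat = 13, ybar = 9
y, y_hat, ybar = 19, 13, 9

total = y - ybar          # 10
explained = y_hat - ybar  # 4
unexplained = y - y_hat   # 6
assert total == explained + unexplained  # 10 = 4 + 6
print(total, explained, unexplained)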

Definition The coefficient of determination is the proportion of the variation in y that is explained by the regression line:
r² = (explained variation) / (total variation)
The value of r² is the proportion of the variation in y that is explained by the linear relationship between x and y.

Example If we use the 40 paired shoe print lengths and heights from the data in Appendix B, we find the linear correlation coefficient to be r = 0.813. Then the coefficient of determination is r² = 0.813² = 0.661. We conclude that 66.1% of the total variation in height can be explained by shoe print length, and the other 33.9% cannot be explained by shoe print length.