STT200 Chapter 7-9 KM


Chapter 7 Scatterplots, Association, and Correlation

Correlation, association, and relationships between two sets of numerical data are often discussed. It is believed that there is a relationship between smoking and lung cancer; between the number of cold days in winter and the number of babies born the next fall; even between the values of the Dow Jones Industrial Average and the length of fashionable skirts.

Questions to ask about paired data:
1. Is there a relationship?
2. Can I find an equation that describes it?
3. How good is the fit? Can I use it to make predictions?

A way to observe such relationships is to construct a scatter plot. A scatter diagram (scatter plot) is a graph that displays a relationship between two quantitative variables. Each point of the graph is plotted with a pair of two related data values: x and y.

Example: (scatterplot)

In a scatter plot, the variable assigned to the x-axis is called the explanatory (or predictor) variable, and the variable assigned to the y-axis the response variable. Often the response variable is the variable that we want to predict.

Things to look at:
- Direction (negative or positive)
- Strength (none, moderate, strong)
- Form (linear or not)
- Clusters, subgroups, and outliers

Correlation:

The Correlation Coefficient r is a measure of the strength of the linear association between two quantitative variables.

Properties:
1. The sign gives the direction.
2. r is always between -1 and 1.
3. r has no units.
4. Correlation is not affected by shifting or re-scaling either variable.
5. The correlation of x and y is the same as the correlation of y and x.
6. r = 0 indicates a lack of linear association (but there could be a strong non-linear association).
7. The existence of a strong correlation does not mean that the association is causal, that is, that a change in one variable is caused by a change in the other (a third factor may cause both variables to change in the same direction).

Correlation measures the strength of the linear association between two quantitative variables. Before you use correlation, you must check several conditions:
- Quantitative Variables Condition
- Straight Enough Condition
- Outlier Condition: if you notice an outlier, it is a good idea to report the correlations with and without that point.

Computations:

r = Σ(x - x̄)(y - ȳ) / [(n - 1)·s_x·s_y]   or   r = Σ(z_x·z_y) / (n - 1)

Example: Computing the Correlation Coefficient. Find the scatter plot and correlation coefficient for:

  x     y     (x - x̄)   (y - ȳ)   Product
  6     5     -8        -2        16
  10    3     -4        -4        16
  14    7      0         0         0
  19    8      5         1         5
  21    12     7         5        35
Sum   70    35     0         0        72
Mean  14    7
St. dev.  6.20  3.39

r = 72 / [(5 - 1)·6.20·3.39] ≈ 0.856
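The worked example above can be cross-checked in a few lines of Python (a supplementary sketch, not part of the course notes, which use the TI-83 for this computation):

```python
import math

x = [6, 10, 14, 19, 21]
y = [5, 3, 7, 8, 12]
n = len(x)

x_bar = sum(x) / n   # 14
y_bar = sum(y) / n   # 7

# Sample standard deviations (divide by n - 1)
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# r = sum of (x - x_bar)(y - y_bar), divided by (n - 1) * s_x * s_y
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

print(round(r, 3))  # 0.855
```

Using the unrounded standard deviations gives r ≈ 0.855; the 0.856 in the notes comes from rounding s_x and s_y to two decimal places before dividing.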

The Correlation Coefficient on the TI-83. The correlation coefficient is not displayed automatically; we have to set the diagnostic display mode. First go to the CATALOG: press 2nd, 0, press D (the key with x^-1), scroll down the screen to point to DiagnosticOn, then press ENTER twice. The calculator responds with Done. You need to do this only once.
1. Enter the x-values in L1 and the y-values in L2.
2. Press STAT, select CALC, select LinReg(ax+b), ENTER.
3. The display says LinReg(ax+b). Enter L1,L2 and then press ENTER. The display will look like this:
   y = ax+b
   a =
   b =
   r^2 =
   r =    <-- this is the correlation coefficient

Common Errors Involving Correlation:
1. Causation: it is wrong to conclude that correlation implies causality.
2. Averages: averages suppress individual variation and may affect the correlation coefficient.
3. Linearity: there may be some nonlinear relationship between x and y even when there is no significant linear correlation.

Class Exercises: Ch 7: 2, 6, 10, 12, 26


Chapter 8 Linear Regression

The linear regression model is a model of a relationship between two variables, x and y:

Response = linear function of the predictor + Error
y = b0 + b1·x + Error

b0 and b1 are parameters of the model.

Goal: estimate b0 and b1 and the regression line ŷ = b0 + b1·x.

Method: the least squares regression line is the line that minimizes the sum of squared vertical distances between the points on the scatterplot and the line (also called the regression line, best fit line, or prediction line).

Review from algebra: the formula for a straight line is y = b + mx, where b is the y-intercept and m is the slope of the line. We'll denote the regression line as ŷ = b0 + b1·x.

The coefficient b1 is the slope, which tells us how fast y changes with respect to x. The coefficient b0 is the intercept, which tells us where the line crosses (intercepts) the y-axis.

Example: The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu.

Predictor: x = protein in grams
Response: y = fat in grams
The model: y = b0 + b1·x, that is, fat = b0 + b1·protein

Using software to find b0 and b1, the line is ŷ = 6.8 + 0.97·protein.

Interpret the slope: each additional gram of protein adds 0.97 g of fat. Interpret the intercept: if there were no protein, the amount of fat would be 6.8 g.

This value does not always make sense, but it is useful as a starting point for the graph.

With the existing model we can make a prediction for the amount of fat given the amount of protein. For instance, for 10 grams of protein the predicted amount of fat is ŷ = 6.8 + 0.97(10) = 16.5 grams.

If one of the observed values for x = 10 was y = 12, we can compute the difference between the observed and predicted value of fat:

y - ŷ = 12 - 16.5 = -4.5

This is called the residual: e = observed - predicted = y - ŷ.

Residuals:
- if e is positive, the model underestimates the actual data value
- if e is negative, the model overestimates the actual data value
Some residuals are positive, others are negative, and, on average, they cancel each other out.

When is a linear model reasonable?
1. A linear relationship is shown on the scatterplot.
2. Large R^2. For the scatterplot above, R^2 = r^2 = (correlation coefficient)^2 = 0.69 = 69%. R^2 tells the percentage of the total variation of y explained by x.
3. The residuals should be evenly distributed around zero.

Residual plot: fat versus protein for 30 items on the Burger King menu.

Interpretation of R^2: R^2 = 0.69 means that 69% of the variation in fat is accounted for by variation in the protein content.

Regression Assumptions and Conditions:
- Quantitative Variables Condition: regression can only be done on two quantitative variables.
- Straight Enough Condition: the linear model assumes that the relationship between the variables is linear. A scatterplot will let you check that the assumption is reasonable.
- Residuals: their scatterplot should not show any pattern, and the points should be evenly spread around zero. Simply put, we want to see "nothing" in a plot of the residuals.
- Outliers: you should also check for outliers, which could change the regression.

Even a single outlier can dominate the correlation value and change the regression line.

Example:
1. Use the calculator or computer to make a scatter plot and to find the correlation coefficient and coefficient of determination for the following data:
   x: 0  0  1  2  0  1  1  2  2
   y: 0  1  0  0  2  1  2  2  1
2. Now add one more point, (10, 10), and compute the correlation coefficient again. Any change?

What can go wrong?
- Don't fit a straight line to a nonlinear relationship.
- Beware of extraordinary points (y-values that stand off from the linear pattern, or extreme x-values).
- Don't invert the regression: to swap the predictor-response roles of the variables, we must fit a new regression equation.
- Don't extrapolate beyond the data: the linear model may no longer hold outside the range of the data.
- Don't infer that x causes y just because there is a good linear model for their relationship: association is not causation.
- Don't choose a model based on R^2 alone, the scatterplot alone, or the residual plot alone. Use all three.

Exercises: Ch 8: 2, 4, 6, 8, 56
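The class exercise above can be sketched in Python instead of on the calculator (a supplementary check, not part of the notes): compute r for the nine points, then again after adding the single point (10, 10).

```python
import math

def corr(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [0, 0, 1, 2, 0, 1, 1, 2, 2]
y = [0, 1, 0, 0, 2, 1, 2, 2, 1]
print(round(corr(x, y), 3))                # 0.0  -- no linear association at all

# One far-out point dominates the computation:
print(round(corr(x + [10], y + [10]), 3))  # 0.924
```

A single point turns r from exactly 0 into about 0.92, which is exactly the warning the example is making.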

Example: (scatterplot of birth rate by year, 1965-2000 - add the units!)

Year (x)   Rate (y)   ŷ   Residual
1965       19.4
1970       18.4
1975       14.8
1980       15.9
1985       15.6
1990       16.4
1995       14.8
2000       14.4
2005       14.0

x̄ =      s_x =      ȳ =      s_y =      r = -0.821     R^2 = 0.674

a) Scatterplot:

b) Regression line:
b1 = r·(s_y / s_x) = -0.110333
b0 = ȳ - b1·x̄ = 234.978

Regression line: ŷ = b0 + b1·x = 234.978 - 0.110333·x

TI-83:

c) Residuals: e = y - ŷ
(residual plot, 1960-2005)

d) A decline of 0.11 births per 1000 women per year.
e-f) Residual: 15 - 16.74 = -1.74. The model predicted higher.
g)
h) Should not be trusted at all.
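The coefficients in the worksheet above can be reproduced in Python (a supplementary sketch; the notes obtain them on the TI-83, and the data are the year/rate pairs from the table):

```python
year = [1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005]
rate = [19.4, 18.4, 14.8, 15.9, 15.6, 16.4, 14.8, 14.4, 14.0]
n = len(year)

x_bar = sum(year) / n
y_bar = sum(rate) / n

# Least-squares slope: b1 = sum (x - x_bar)(y - y_bar) / sum (x - x_bar)^2
# (algebraically the same as b1 = r * s_y / s_x)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(year, rate)) / \
     sum((x - x_bar) ** 2 for x in year)

# Intercept: b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

print(round(b1, 6), round(b0, 3))  # -0.110333 234.978
```

This matches the b1 = -0.110333 and b0 = 234.978 computed in the worksheet.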

Chapter 9 Regression Wisdom (Skip Chapter 10)

Residuals should be "evenly" distributed around zero; their scatterplot should have no visible pattern, and their histogram should be bell-shaped and centered at zero.

Check the scatterplot of the residuals for bends that you might have overlooked in the original scatterplot. Residuals often show patterns that were not clear from a scatterplot of the original data.

Scatterplot: duration vs. heart rate
Scatterplot of the residuals: (the residual scatterplot shows a pattern, which indicates that a linear model is not appropriate here)

Predictions: linear models let us predict the value of y for each value of x in the data. BUT we cannot assume that a linear relationship in the data exists beyond the range of the data. Such a prediction is called an extrapolation.

Above is a time-plot of the Energy Information Administration (EIA) predictions and actual prices.

Prediction: reasonable only in the range of the x-values, or slightly beyond.
Extrapolation: outside the data range - not reliable.
Interpolation: inside the data range.

Lurking variables: a strong association does not mean that one variable causes the change in the other. The fact that both variables x and y change simultaneously could be due to another, so-called lurking, variable.

Outliers, Leverage, and Influence
- Outliers: points that stand away from the others. If there is an outlier, you should build two models, with and without the outlier, and compare them.
- Points of high leverage: points with an x-coordinate far from the mean of all x-values. A point with high leverage has the potential to change the regression line (but does not always do so).
- Influential points: points that affect the regression model very much (there is a big difference in the model when they are removed). They may have large residuals, or, by dragging the line toward themselves, surprisingly small ones.

The following scatterplot shows that something was surprisingly different in Palm Beach County, Florida, during the 2000 presidential election. The red line shows the effect that one unusual point can have on a regression.

We say that a point is influential if omitting it from the analysis gives a very different model. The extraordinarily large shoe size gives the data point high leverage. Wherever the IQ is, the line will follow!

You cannot simply delete unusual points from the data. You can fit a model with and without these points, and examine the two regression models to understand how they differ.

Lurking Variables and Causation: no matter how strong the association, no matter how large the R^2 value, no matter how straight the line, there is no way to conclude from a regression alone on an observational study that one variable causes the other.

What (else) can go wrong?
- Extrapolation far from the mean can lead to silly and useless predictions.
- An R^2 value near 100% doesn't indicate that there is a causal relationship between x and y.
- Watch out for lurking variables.
- Watch out for regressions based on summaries of the data sets. These regressions tend to look stronger than the regression on the original data.

Exercises: Ch 9: 2, 9 d, e, 12, 18 and class quiz
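The leverage idea above can be illustrated with a tiny made-up data set (this example is not from the notes): five points lying exactly on y = x, then the same five plus one far-out point at x = 20 whose y-value does not follow the pattern.

```python
def slope(x, y):
    """Least-squares slope b1 = sum (x - x_bar)(y - y_bar) / sum (x - x_bar)^2."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx

x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5]
print(slope(x, y))                          # 1.0 -- a perfect line

# One high-leverage point (x = 20, far from the mean of x) that doesn't
# follow the pattern drags the fitted line nearly flat:
print(round(slope(x + [20], y + [5]), 3))   # 0.153
```

One influential point moves the slope from 1.0 to about 0.15, which is why the notes say to fit the model with and without such points and compare.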



What Can Go Wrong?
- Don't say "correlation" when you mean "association."
  o The word correlation should be reserved for measuring the strength and direction of the linear relationship between two quantitative variables.
- Don't confuse correlation with causation.
  o Scatterplots and correlations never demonstrate causation.
- There may be a strong association between two variables that have a nonlinear association.

What have we learned?
- We examine scatterplots for direction, form, strength, and unusual features.
- Although not every relationship is linear, when the scatterplot is straight enough, the correlation coefficient is a useful numerical summary.
  o The sign of the correlation tells us the direction of the association.
  o The magnitude of the correlation tells us the strength of the linear association.
  o Correlation has no units, so shifting or scaling the data, standardizing, or swapping the variables has no effect on its numerical value.
- The residuals also reveal how well the model works.
  o If a plot of the residuals against the predicted values shows a pattern, we should re-examine the data to see why.
  o The standard deviation of the residuals, s_e, quantifies the amount of scatter around the line: the more scatter, the larger s_e.