The Big Picture. Correlation. Scatter Plots. Data



Similar documents
Chapter 13 Introduction to Linear Regression and Correlation Analysis

Pearson s Correlation Coefficient

Univariate Regression

Exercise 1.12 (Pg )

Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation

To Be or Not To Be a Linear Equation: That Is the Question

Find the Relationship: An Exercise in Graphing Analysis

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

So, using the new notation, P X,Y (0,1) =.08 This is the value which the joint probability function for X and Y takes when X=0 and Y=1.

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

PROPERTIES OF ELLIPTIC CURVES AND THEIR USE IN FACTORING LARGE NUMBERS

1.6. Piecewise Functions. LEARN ABOUT the Math. Representing the problem using a graphical model

Simple Linear Regression Inference

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Simple Regression Theory II 2010 Samuel L. Baker

5.1. A Formula for Slope. Investigation: Points and Slope CONDENSED

Exponential and Logarithmic Functions

INVESTIGATIONS AND FUNCTIONS Example 1

You buy a TV for $1000 and pay it off with $100 every week. The table below shows the amount of money you sll owe every week. Week

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

Diagrams and Graphs of Statistical Data

Chapter 7: Simple linear regression Learning Objectives

Regression Analysis: A Complete Example

Module 3: Correlation and Covariance

Chapter 9 Descriptive Statistics for Bivariate Data

17. SIMPLE LINEAR REGRESSION II

2. Simple Linear Regression

4. Simple regression. QBUS6840 Predictive Analytics.

EQUATIONS OF LINES IN SLOPE- INTERCEPT AND STANDARD FORM

Scatter Plot, Correlation, and Regression on the TI-83/84

Introduction to Regression and Data Analysis

CHAPTER TEN. Key Concepts

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

The correlation coefficient

SAMPLE. Polynomial functions

In this this review we turn our attention to the square root function, the function defined by the equation. f(x) = x. (5.1)

Graphing Linear Equations

Correlation key concepts:

5.2 Inverse Functions

1 Maximizing pro ts when marginal costs are increasing

Simple linear regression

Homework 11. Part 1. Name: Score: / null

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

Higher. Polynomials and Quadratics 64

7.7 Solving Rational Equations

Part Three. Cost Behavior Analysis

Some Tools for Teaching Mathematical Literacy

Getting Correct Results from PROC REG

MATH REVIEW SHEETS BEGINNING ALGEBRA MATH 60

Common Core Unit Summary Grades 6 to 8

Example: Boats and Manatees

Section 1.5 Linear Models

Linear Regression. Chapter 5. Prediction via Regression Line Number of new birds and Percent returning. Least Squares

Polynomials. Jackie Nicholas Jacquie Hargreaves Janet Hunter

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

Relationships Between Two Variables: Scatterplots and Correlation

C3: Functions. Learning objectives

Downloaded from equations. 2.4 The reciprocal function x 1 x

MBA 611 STATISTICS AND QUANTITATIVE METHODS

Slope-Intercept Form and Point-Slope Form

Why should we learn this? One real-world connection is to find the rate of change in an airplane s altitude. The Slope of a Line VOCABULARY

Directions for using SPSS

Mathematics Placement Packet Colorado College Department of Mathematics and Computer Science

x y The matrix form, the vector form, and the augmented matrix form, respectively, for the system of equations are

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

LINEAR FUNCTIONS OF 2 VARIABLES

Name Date. Break-Even Analysis

Graphing Quadratic Equations

Chicago Booth BUSINESS STATISTICS Final Exam Fall 2011

LESSON EIII.E EXPONENTS AND LOGARITHMS

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Solving Quadratic Equations by Graphing. Consider an equation of the form. y ax 2 bx c a 0. In an equation of the form

Answer: C. The strength of a correlation does not change if units change by a linear transformation such as: Fahrenheit = 32 + (5/9) * Centigrade

Lesson Using Lines to Make Predictions

Basic Data Analysis Tutorial

11. Analysis of Case-control Studies Logistic Regression

COMP6053 lecture: Relationship between two variables: correlation, covariance and r-squared.

Session 7 Bivariate Data and Analysis

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

DATA INTERPRETATION AND STATISTICS

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

CHAPTER 7: OPTIMAL RISKY PORTFOLIOS

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

Chapter 2 Portfolio Management and the Capital Asset Pricing Model

Econometrics Simple Linear Regression

Solving Systems of Equations

When I was 3.1 POLYNOMIAL FUNCTIONS

Simple Predictive Analytics Curtis Seare

Lean Six Sigma Analyze Phase Introduction. TECH QUALITY and PRODUCTIVITY in INDUSTRY and TECHNOLOGY

2013 MBA Jump Start Program. Statistics Module Part 3

430 Statistics and Financial Mathematics for Business

ax 2 by 2 cxy dx ey f 0 The Distance Formula The distance d between two points (x 1, y 1 ) and (x 2, y 2 ) is given by d (x 2 x 1 )

More Equations and Inequalities

SYSTEMS OF LINEAR EQUATIONS

How To Run Statistical Tests in Excel

Linear Equations in Two Variables

Section 3 Part 1. Relationships between two numerical variables

SECTION 2.2. Distance and Midpoint Formulas; Circles

POLYNOMIAL AND MULTIPLE REGRESSION. Polynomial regression used to fit nonlinear (e.g. curvilinear) data into a least squares linear regression model.

Transcription:

The Big Picture Correlation Bret Hanlon and Bret Larget Department of Statistics Universit of Wisconsin Madison December 6, We have just completed a length series of lectures on ANOVA where we considered models with a normall distributed response variable where the mean was determined b one or more categorical eplanator variables. We net consider simple linear regression models where there is again a normall distributed response variable, but where the means are determined b a linear function of a single quantitative eplanator variable. Before developing ideas about regression, we need to eplore scatter plots to displa two quantitative variables and correlation to numericall quantif the linear relationship between two quantitative variables. Correlation / 5 Correlation Correlation The Big Picture / 5 Data We will consider data sets of two quantitative random variables such as in this eample. age is the age of a male lion in ears; proportion.black is the proportion of a lion s nose that is black. age proportion.black...5.4.9...3.6. 3..3 3..... Scatter Plots A scatter plot is a graph that displas two quantitative variables. Each observation is a single point. One variable we will call X is plotted on the horizontal ais. A second variable we call Y is plotted on the vertical ais. If we plan to use a model where one variable is a response variable that depends on an eplanator variable, X should be the eplanator variable and Y should be the response. Correlation Correlation The Big Picture 3 / 5 Correlation Correlation Scatter Plots 4 / 5

Lion Eample Lion Data Scatter Plot Case Stud The noses of male lions get blacker as the age. This is potentiall useful as a means to estimate the age of a lion with unknown age. (How one measures the blackness of a lion s nose in the wild without getting eaten is an ecellent question!) We displa a scatter plot of age versus blackness of 3 lions of known age (Eample 7. from the tetbook). The choice of X and Y is for the desired purpose of estimating age from observable blackness. It is also reasonable to consider blackness as a response to age. Age (ears) 5...4.6.8. Proportion Black in Nose Correlation Correlation Scatter Plots 5 / 5 Correlation Correlation Scatter Plots 6 / 5 Observations We see that age and blackness in the nose are positivel associated. As one variable increases, the other also tends to increase. However, there is not a perfect relationship between the two variables, as the points do not fall eactl on a straight line or simple curve. We can imagine other scatter plots where points ma be more or less tightl clustered around a line (or other curve). Statisticians have invented a statistic called the correlation coefficient to quantif the strength of a linear relationship. Before developing simple linear regression models, we will develop our understanding of correlation. Correlation Definition The correlation coefficient r is measure of the strength of the linear relationship between two variables. r = = n n ( )( ) Xi X Yi Ȳ i= s X s Y n i= (X i X )(Y i Ȳ ) n i= (X i X ) n i= (Y i Ȳ ) Notice that the correlation is not affected b linear transformations of the data (such as changing the scale of measurement) as each variable is standardized b substracting the mean and dividing b the standard deviation. Correlation Correlation Scatter Plots 7 / 5 Correlation Correlation Scatter Plots 8 / 5

Observations about the Formula r = n n ( )( ) Xi X Yi Ȳ i= s X Notice that observations where X i and Y i are either both greater or both less than their respective means will contribute positive values to the sum, but observations where one is positive and the other is negative will contribute negative values to the sum. Hence, the correlation coeficient can be positive or negative. Its value will be positive if there is a stronger trend for X and Y to var together in the same direction (high with high, low with low) than vice versa. Correlation Correlation Scatter Plots 9 / 5 s Y Additional Observations about the Formula r = n i= (X i X )(Y i Ȳ )/ (n ) s X s Y If X = Y, the numerator would be the sample variance. When X and Y are different variables, the epression in the numerator is called the sample covariance. The covariance has units of the product of the units of X and Y. Dividing the covariance b the two standard deviations makes the correlation coeficient unitless. Also note that if X = Y, then the numerator and the denominator would both be equal to the variance of X, and r would be. This suggests that r = is the largest possible value (which is true, but requires more advanced mathematics than we assume to prove). If Y = X, then r =, and this is the smallest that r can be. Correlation Correlation Scatter Plots / 5 Scolding the Tetbook Authors Scatter plots Also note that throughout Chapters 6 and 7, the tetbook authors fail to include subscripts with their summation equations, which is inecusable. For eample, (X X ) should be n (X i X ) i= so that ou, the reader, knows that values of X i change (potentiall) from term to term, but X is constant. The net several graphs will show scatter plots of sample data with different correlation coefficients so that ou can begin to develop an intuition for the meaning of the numerical values. In addition, note that ver different graphs can have the same numerical correlation. It is important to look at graphs and not onl the value of r! Correlation Correlation Scatter Plots / 5 Correlation Correlation Scatter Plots / 5

r =.97 r =. 5 5 Correlation Correlation Scatter Plots 3 / 5 Correlation Correlation Scatter Plots 4 / 5 r =. r = 5 5 Correlation Correlation Scatter Plots 5 / 5 Correlation Correlation Scatter Plots 6 / 5

r = r = 4 3 Correlation Correlation Scatter Plots 7 / 5 Correlation Correlation Scatter Plots 8 / 5 Comment r =.97 Notice that when r =, this means that there is no strong linear relationship between X and Y, but there could be a strong nonlinear relationship between X and Y. You need to plot the data and not just calculate r! r will be close to zero whenever the sum of squared residuals around the best fitting straight line is about the same as the sum of squared residuals around the best fitting horizontal line. 4 3..5..5. Correlation Correlation Scatter Plots 9 / 5 Correlation Correlation Scatter Plots / 5

Comment r =.97 Notice that when r is close to (but not eactl one), there is a strong linear relationship between X and Y with a positive slope. However, there could be a stronger nonlinear relationship between X and Y. You need to plot the data and not just calculate r! r will be close to one whenever the sum of squared residuals around the best fitting straight line with a positive slope is much smaller than sum of squared residuals around the best fitting horizontal line. 4 3..5..5. Correlation Correlation Scatter Plots / 5 Correlation Correlation Scatter Plots / 5 Comment Summar of Correlation Notice that when r is close to (but not eactl), there is a strong linear relationship between X and Y with a negative slope. However, there could be a stronger nonlinear relationship between X and Y. You need to plot the data and not just calculate r! r will be close to whenever the sum of squared residuals around the best fitting straight line with a negtive slope is much smaller than sum of squared residuals around the best fitting horizontal line. The correlation coefficient r measures the strength of the linear relationship between two quantitative variables, on a scale from to. The correlation coefficient is or onl when the data lies perfectl on a line with negative or positive slope, respectivel. If the correlation coefficient is near one, this means that the data is tightl clustered around a line with a positive slope. Correlation coefficients near indicate weak linear relationships. However, r does not measure the strength of nonlinear relationships. If r =, rather than X and Y being unrelated, it can be the case that the have a strong nonlinear relationshsip. If r is close to, it ma still be the case that a nonlinear relationship is a better description of the data than a linear relationship. Correlation Correlation Scatter Plots 3 / 5 Correlation Correlation Scatter Plots 4 / 5

A final thought A final thought In case ou missed a major theme of this lecture... Alwas plot our data! In case ou missed a major theme of this lecture... Alwas plot our data! Correlation Correlation Scatter Plots 5 / 5 Correlation Correlation Scatter Plots 5 / 5