A SIMPLER METHOD FOR TEACHING THE MEANING OF THE CORRELATION COEFFICIENT by John N. Zorich, Jr.


Copyright 01, by John N. Zorich Jr., Zorich Consulting & Training, www.johnzorich.com

For over 100 years, the concept of the correlation coefficient (CC) has been taught to beginning students of statistical science, but without serious reference to the equation on which it is based. Instead, the meaning of the CC has been explained using wordy generalities and textbook scatter plots: the less scattered the plot looks, the larger the CC. Unfortunately, such generalities leave most students with subtle misconceptions. Figs. 1, 2, and 3 demonstrate some of the difficulties that cannot be explained using classic teaching methods.

In Fig. 1, each of the four Data Sets (labeled A, B, C, and D) has a least squares linear regression (LSLR) straight line drawn through the raw data points. Each Data Set has different Y values at each even whole-number X value up through 18 (in Set D, the different Y values at each X value are so close together that they appear as a single dot). In spite of the obvious differences between these data sets, they all yield the same high CC.

In Fig. 2, Data Sets B and C are, in effect, subsets of Set A. Set B is composed of the first six data points from Set A after subtracting 3.0 from each Y value. Likewise, Set C is the first three data points from Set A after subtracting 6.0 from each Y value. Notice that all three regression lines have the identical slope and have data points that lie at exactly the same distance from their regression line. It seems that a coefficient that purports to indicate correlation should indicate that these three data sets are, correlatively speaking, the same. But, as indicated in the figure, the larger the data set, the larger the CC.

In Fig. 3, we seem to have a contradiction to the conclusion reached regarding Fig. 2; that is, in Fig. 3, the more data points, the lower the CC, despite the fact that all three regression lines have the identical slope and have data points that lie at exactly the same distance from their regression line (as in Fig. 2, Sets B and C are subsets of Set A, offset by a value of 3.0 or 6.0, respectively).

The separation of CC meaning from the CC equation stems historically from the fact that the equations that appeared most often in textbooks were difficult to teach or understand. The first equations were developed in the late 1800s [1]; the most commonly cited one has been some version of the following:

\[ \mathrm{CC} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} \]
Equation 1 (the traditional equation)

The "Short Method" (Equation 2) was popularized in the early 1900s [2], and became more common after it was revised slightly and renamed the "computational form" [3] for use with sophisticated mechanical calculators and simple electronic ones:

\[ \mathrm{CC} = \frac{N \sum XY - \sum X \sum Y}{\sqrt{\left[ N \sum X^2 - \left( \sum X \right)^2 \right] \left[ N \sum Y^2 - \left( \sum Y \right)^2 \right]}} \]
Equation 2

A simpler equation (Equation 3) was developed [4], but it didn't have much pedagogic value. In this equation, Sx and Sy are the standard deviations of the X and Y data, respectively, and slope is the slope of the linear regression line calculated by the method of least squares:

\[ \mathrm{CC} = \frac{(\mathrm{slope}) \, S_x}{S_y} \]
Equation 3

The form of this equation is helpful in explaining why the CC does not depend on slope alone: a large CC can result from a large slope, a large Sx, a small Sy, or any combination of these. This equation is unable to explain what the CC is, but is wonderful for explaining what the CC is not.

About 1985, I thought I'd developed a new, more instructive equation. Alas, a few years later, I found my "new" equation at the end of an appendix to a 1961 introductory statistics book written by someone else [5]. I discovered my new equation within the equation for the Coefficient of Determination (CD). In the CD equation (Equation 4), Ye represents the Y values calculated for each X value by using the least squares linear regression equation (Ye = a + bX), and Yi represents the raw Y data. One Ye value is calculated for each Yi value. As always in least squares linear regression, the mean of the Ye data is the same as that of the Yi data, and so the mean is not subscripted in the equation below:

\[ \mathrm{CD} = \mathrm{CC}^2 = \frac{\sum (Y_e - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} \]
Equation 4

Dividing top and bottom of the fraction by N-1 (where N is the number of Yi data points), I discovered an equation that is the ratio of two sample variances:

\[ \mathrm{CC}^2 = \frac{\mathrm{Variance}(Y_e)}{\mathrm{Variance}(Y_i)} \]
Equation 5

After taking the square root of both sides, I found an equation containing the absolute value of the CC on one side, and the ratio of two sample standard deviations on the other:

\[ |\mathrm{CC}| = \frac{\mathrm{StdDeviation}(Y_e)}{\mathrm{StdDeviation}(Y_i)} \]
Equation 6 (the new equation)
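As a rough numerical check on the equations above, the following sketch computes the CC four ways (Equations 1, 2, 3, and 6) for one small data set. It assumes that the ten (X, Yi) pairs are the raw data of Table 1 and that Sx, Sy, and the standard deviations in Equation 6 are sample standard deviations (N-1 denominator). The variable names are illustrative only.

```python
import math
import statistics

# Assumed raw data: the ten (X, Yi) pairs taken from Table 1.
X = [6, 7, 7, 8, 9, 10, 10, 12, 13, 15]
Y = [10, 11, 10, 11, 12, 12, 12, 13, 14, 14]
N = len(X)
mean_x = statistics.mean(X)
mean_y = statistics.mean(Y)

# Sums of squares and cross-products about the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
sxx = sum((x - mean_x) ** 2 for x in X)
syy = sum((y - mean_y) ** 2 for y in Y)

# Least squares line Ye = intercept + slope * X.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
Ye = [intercept + slope * x for x in X]

# Equation 1 (the traditional equation).
cc_eq1 = sxy / math.sqrt(sxx * syy)

# Equation 2 (the computational form), built from raw sums only.
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)
cc_eq2 = (N * sum_xy - sum_x * sum_y) / math.sqrt(
    (N * sum_x2 - sum_x ** 2) * (N * sum_y2 - sum_y ** 2)
)

# Equation 3: slope times Sx, divided by Sy.
cc_eq3 = slope * statistics.stdev(X) / statistics.stdev(Y)

# Equation 6 (the new equation): ratio of two sample standard deviations.
sd_ye = statistics.stdev(Ye)
sd_yi = statistics.stdev(Y)
cc_eq6 = sd_ye / sd_yi

for label, value in [("Eq. 1", cc_eq1), ("Eq. 2", cc_eq2),
                     ("Eq. 3", cc_eq3), ("Eq. 6", cc_eq6)]:
    print(f"{label}: CC = {value:.4f}")
```

Because the slope is positive here, Equation 6 gives the same value as the signed equations; for a negative slope it would return only the absolute value of the CC.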

Although this equation does not calculate the sign of the CC, this is not a limitation. In LSLR, the sign of the CC is always the same as that of the regression coefficient ("b" in the LSLR equation Y = a + bX); that is, if the slope is negative, so is the CC, and vice versa, as is easily seen from Equation 3, above. Thus, the sign of the CC has no meaning independent of the regression slope, and so the only unique aspect of the CC is its absolute value, which the new equation calculates. Calculation of the CC using the new equation is shown by example in Table 1.

Not only can this new equation be used to easily explain the surprising CC results shown in Figs. 1, 2, and 3, but it can also be used to explain other interesting facts, such as:

1. The correlation coefficient can never equal exactly 1.000, unless all the Yi's form a perfectly straight line, which is the only case in which the standard deviations of Ye and Yi are identical.

2. The CC can never equal exactly 0.000, unless the standard deviation of Ye is also zero, which would occur only if the calculated linear regression line were perfectly horizontal.

3. The CC represents the fraction of the total variation in Yi, as measured in units of standard deviation, that can be explained by a linear relationship between Yi and X. The larger the CC, the larger the fraction of the Yi variation that can be explained this way. The remaining variation can't be explained, at least not by the CC.
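The way Equation 6 accounts for behavior like that in Fig. 2 can be sketched numerically. The data below are hypothetical and are not the data behind the figures: at each X there are two Y values lying a fixed distance d above and below an assumed line Y = 1 + 0.5X, so every subset has the same least squares slope and the same scatter about its line. The function names and the values of a, b, and d are illustrative assumptions.

```python
import math
import statistics

def slope_and_abs_cc(xs, ys):
    """Return the LSLR slope and |CC| = StdDeviation(Ye) / StdDeviation(Yi)."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ye = [intercept + slope * x for x in xs]
    return slope, statistics.stdev(ye) / statistics.stdev(ys)

def make_set(x_values, a=1.0, b=0.5, d=0.5):
    """Hypothetical data: two points per X, at distance d above and below a + b*X."""
    xs, ys = [], []
    for x in x_values:
        for offset in (-d, d):
            xs.append(x)
            ys.append(a + b * x + offset)
    return xs, ys

# Full set (nine X values) and two subsets (first six and first three X values).
slope_a, cc_a = slope_and_abs_cc(*make_set(range(2, 19, 2)))
slope_b, cc_b = slope_and_abs_cc(*make_set(range(2, 13, 2)))
slope_c, cc_c = slope_and_abs_cc(*make_set(range(2, 7, 2)))

# A perfectly straight line (d = 0) is the only case with |CC| = 1 (fact 1).
_, cc_line = slope_and_abs_cc(*make_set(range(2, 19, 2), d=0.0))

print(f"Set A: slope = {slope_a:.3f}, |CC| = {cc_a:.3f}")
print(f"Set B: slope = {slope_b:.3f}, |CC| = {cc_b:.3f}")
print(f"Set C: slope = {slope_c:.3f}, |CC| = {cc_c:.3f}")
print(f"Straight line: |CC| = {cc_line:.3f}")
```

In this sketch the scatter about the line is identical in all three sets, but the narrower the range of X, the smaller the standard deviation of Ye, and so the smaller the ratio in Equation 6.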

Fig. 1. Linear Regression by Least Squares (Data Sets A, B, C, and D). Correlation Coefficient = 0.955 for Each Data Set.

Fig. 2. Linear Regression by Least Squares (Data Sets A, B, and C). Correlation Coefficients: Data Set A = 0.955, Data Set B = 0.906, Data Set C = 0.71.

Fig. 3. Linear Regression by Least Squares (Data Sets A, B, and C). Correlation Coefficients: Data Set A = 0.955, Data Set B = 0.96, Data Set C = 0.971.

Table 1. Calculation of the Correlation Coefficient using the New Equation
(Ye is calculated using the regression equation at the bottom of this table)

X raw data    Yi raw data    Ye, calculated
     6            10             9.7609
     7            11            10.2065
     7            10            11.5435
     8            11            11.5435
     9            12            11.5435
    10            12            11.9891
    10            12            12.4348
    12            13            12.8804
    13            14            13.3261
    15            14            13.7717

Standard Deviation of Yi = 1.4491
Standard Deviation of Ye = 1.4023
Least Squares Linear Regression Equation is Ye = 7.2221 + 0.4823X
Correlation Coefficient (CC) using the Traditional Equation = 0.9677
CC using the New Equation, (StdDev Ye) / (StdDev Yi) = 0.9677

[1] S. M. Stigler, The History of Statistics, 1986 (Belknap Press, Cambridge MA), chapter 9.
[2] J. G. Smith, Elementary Statistics, 193 (Henry Holt & Co., New York) p. 37.
[3] H. L. Alder, E. B. Roessler, Introduction to Probability and Statistics, 6th ed., 1977 (W. H. Freeman & Co., San Francisco) p. 30.
[4] Ibid, p. 31. Alder & Roessler use different symbols than are used here.
[5] W. J. Reichmann, Use and Abuse of Statistics, 1961 (Oxford University Press, New York) p. 306.