Lesson 4 Part 1: Relationships between two numerical variables

Correlation Coefficient
The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables.

Relationship between two variables
The summary statistics covered in the previous lessons are appropriate for describing a single variable. In many investigations, there is interest in evaluating the relationship between two variables. Depending on the type of data, different statistics are used to measure the relationship between two variables.

Measuring relationship between two variables
Nominal binary data: Odds Ratio or Risk Ratio (covered in Lesson 4 Part 2)
Numerical data: Pearson Correlation Coefficient
Ordinal data or skewed numerical data: Spearman Rank Correlation Coefficient

Correlation Coefficients
Correlation coefficients measure the linear relationship between two numerical variables.
Most often the Pearson Correlation Coefficient is used to describe the linear relationship.
Note: if just the term "correlation coefficient" is used, it refers to the Pearson Correlation Coefficient.
If one variable is numerical and the other variable is ordinal, or if both variables are ordinal, use the Spearman rank correlation coefficient instead.

Correlation Analysis Examples
For correlation analysis, you have pairs of values. The goal is to measure the linear relationship between these values. For example, what is the linear relationship between:
  Weight and height
  Blood pressure and age
  Daily fat intake and cholesterol
  Weight and cholesterol
  Age and bone density
The pairs of values are labeled (x,y) and can be plotted on a scatter plot.

Types of Relationships between 2 variables
There are 2 types of relationships between numerical variables:
  Deterministic relationship: the values of the 2 variables are related through an exact mathematical formula.
  Statistical relationship: the relationship is not exact, so the points do not fall perfectly on a line. This is what we will explore through correlation analysis.

Example of Deterministic Linear relationship
[Figure: "Deterministic linear relationship", Weight in kg plotted against Weight in pounds]
The relationship between weight measured in pounds and weight measured in kg is deterministic because it can be described with a mathematical formula: 1 pound = 0.4536 kg, so weight in kg = 0.4536 * weight in pounds.
In a deterministic relationship, all the points in the plot fall on a straight line.

Statistical Linear relationship
[Figure: scatter plot of Var Y against Var X]
In a statistical linear relationship the plotted (x,y) values do not fall exactly on a straight line. You can see, however, that there is a general linear trend in the plot of the (x,y) values.
The correlation coefficient provides information about the direction and strength of the linear relationship between two numerical variables.

Nonlinear Relationship
[Figure: "Nonlinear relationship", Var. Y plotted against Var. X]
A nonlinear relationship between two variables results in a scatter plot of (x,y) values that does not follow a straight-line pattern. The relationship between these two variables is a curvilinear relationship.
The correlation coefficient does not describe the strength and direction of nonlinear relationships.

Random scatter: no pattern
[Figure: scatter plot of IQ against Height]
This scatter plot of IQ and height for 18 individuals illustrates the situation where there is no relationship between the two variables. The (x,y) values do not have either a linear or curvilinear pattern. These points have a random scatter, indicating no relationship.

Correlation Analysis Procedure
1. Plot the data using a scatter plot to get a visual idea of the relationship.
   If there is a linear pattern, continue.
   If the pattern is curved, stop.
2. Calculate the correlation coefficient.
3. Interpret the correlation coefficient.

Scatter Plots
Plot the 2 variables in a scatter plot (one of the types of charts in Excel). The pattern of the dots in the plot indicates the direction of the statistical relationship between the variables.
  Positive correlation: the pattern goes from lower left to upper right. An increase in X is associated with an increase in Y.
  Negative correlation: the pattern goes from upper left to lower right. An increase in X is associated with a decrease in Y.

Height and weight: a positive linear relationship

Ht (in)   Wt (lbs)
64        150
67        170
69        180
60        135
68        156
69        168
70        210
71        215
65        145
66        167

[Figure: scatter plot of Weight (lbs) against Height (in)]

Age and Bone Density (mg/cm²): a negative linear relationship

Age   Bone Density
50    940
52    980
56    950
59    945
60    880
61    895
61    835
63    846
64    865
67    840
68    825
70    810
73    790
74    750
77    710
78    760
78    720
80    720

[Figure: "Bone Density and Age for Women", Bone density plotted against Age]

Strength of the Association
The correlation coefficient provides a measure of the strength and direction of the linear relationship.
The sign of the correlation coefficient indicates the direction:
  Positive coefficient = positive direction
  Negative coefficient = negative direction
Strength of association:
  Correlation coefficients closer to 1 or -1 indicate a stronger relationship.
  Correlation coefficients equal to 0 (or close to 0) indicate a non-existent or very weak linear relationship.

Correlation Coefficient

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} $$

The statistic r is called the Correlation Coefficient.
r is always between -1 and 1.
The closer r is to 1 or -1, the stronger the linear relationship.
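The formula can be checked by hand on a small data set. The following is a minimal Python sketch, not part of the original lesson; the function name `pearson_r` is an illustrative choice, and the data are the height and weight pairs from the lesson's example table.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of the paired values xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of the products of the deviations from the two means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the two sums of squared deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Height (in) and weight (lbs) pairs from the lesson's example table.
heights = [64, 67, 69, 60, 68, 69, 70, 71, 65, 66]
weights = [150, 170, 180, 135, 156, 168, 210, 215, 145, 167]

r = pearson_r(heights, weights)
print(round(r, 4))  # 0.8522
```

The printed value agrees with the r = 0.85 reported for the height and weight data in this lesson.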

Strong positive relationship
[Figure: scatter plot of Weight (lbs) against Height (in), r = 0.85]
The correlation coefficient for the (height, weight) data is 0.85, indicating a strong (close to 1.0), positive linear relationship between height and weight.

Strong negative relationship
[Figure: "Bone Density and Age for Women", Bone density plotted against Age, r = -0.96]
The correlation coefficient for the (age, bone density) data is -0.96, indicating a strong (close to -1.0), negative linear relationship between age and bone density for women.

Correlation Coefficient Interpretation Guidelines*
These guidelines can be used to interpret the correlation coefficient:

0 to 0.25 (or 0 to -0.25)           weak or no linear relationship
0.25 to 0.50 (or -0.25 to -0.50)    fair degree of linear relationship
0.50 to 0.75 (or -0.50 to -0.75)    moderately strong linear relationship
> 0.75 (or < -0.75)                 very strong linear relationship
1 or -1                             perfect linear relationship (deterministic relationship)

* Colton (1974)
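As an illustration only, the guidelines can be written as a small lookup function. This is a hypothetical Python sketch: the name `describe_strength` is invented here, and because the published ranges share their endpoints, the sketch assumes that a boundary value (0.25, 0.50, 0.75) falls in the lower category.

```python
def describe_strength(r):
    """Map a correlation coefficient to the Colton (1974) guideline label."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    size = abs(r)  # the sign gives direction only; strength depends on magnitude
    if size == 1.0:
        return "perfect linear relationship"
    if size > 0.75:
        return "very strong linear relationship"
    if size > 0.50:  # assumption: exactly 0.75 is treated as moderately strong
        return "moderately strong linear relationship"
    if size > 0.25:  # assumption: exactly 0.50 is treated as fair
        return "fair degree of linear relationship"
    return "weak or no linear relationship"

for value in (0.85, -0.96, 0.44, 0.002):
    print(value, "->", describe_strength(value))
```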

Calculating Correlation Coefficient in Excel
If the original data are in B2:B11 and C2:C11, use the CORREL function with the two data ranges to find the correlation coefficient:
=CORREL(B2:B11,C2:C11)
Result: 0.8522
Use the CORREL function for assignments and exams.

Interpretation of Correlation Coefficient = 0
A correlation coefficient of 0 does not necessarily mean there is NO relationship between the variables.
If r = 0, this means there is no linear relationship.
There may be a strong nonlinear relationship, but r will not identify this.
Plot the data first to determine if there is a nonlinear relationship.

Nonlinear Relationship
[Figure: "Nonlinear relationship", Var. Y plotted against Var. X, r = 0]
This plot illustrates that the correlation coefficient only measures the strength of linear relationships. Here there is a nonlinear (curvilinear) relationship between X and Y, but the correlation coefficient is 0.
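A small numerical sketch makes the same point. The data below are hypothetical (they are not the values behind the lesson's plot): Y is an exact quadratic function of X, symmetric about X = 10, so the two variables are perfectly related, yet the Pearson coefficient comes out as zero up to floating-point rounding.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of the paired values xs and ys."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical curvilinear data: Y is determined exactly by X,
# but the pattern is a symmetric curve rather than a line.
xs = list(range(0, 21))
ys = [(x - 10) ** 2 for x in xs]

r = pearson_r(xs, ys)
print(f"r = {r:.3f}")  # effectively zero despite the exact relationship
```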

Random scatter: no pattern
[Figure: scatter plot of IQ against Height, r = 0.002]
When there is a random scatter of points, the correlation coefficient will be near 0, indicating no linear relationship between the variables.

Notes about Correlation Coefficient
The correlation coefficient is independent of the units used in measuring the variables.
The correlation coefficient is independent of which variable is designated the X variable and which is designated the Y variable.
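Both notes can be checked numerically. The sketch below is illustrative Python, not part of the lesson: it reuses the lesson's height and weight data, converts inches to centimetres (2.54 cm per inch) and pounds to kilograms (0.4536 kg per pound, the factor used earlier in the lesson), and also swaps the roles of X and Y.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of the paired values xs and ys."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

heights_in = [64, 67, 69, 60, 68, 69, 70, 71, 65, 66]
weights_lb = [150, 170, 180, 135, 156, 168, 210, 215, 145, 167]

# Same measurements expressed in different units.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.4536 for w in weights_lb]

r_original = pearson_r(heights_in, weights_lb)
r_metric = pearson_r(heights_cm, weights_kg)   # units changed
r_swapped = pearson_r(weights_lb, heights_in)  # X and Y exchanged

print(round(r_original, 4), round(r_metric, 4), round(r_swapped, 4))
```

All three printed values are the same, because rescaling a variable rescales the numerator and denominator of r by the same factor, and the formula treats x and y symmetrically.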

Correlation vs. Causation
Strong correlation does not imply causation (i.e., that changes in one variable cause changes in the other variable).
Example*: There is a strong positive correlation between the number of television sets per person and the average life expectancy for the world's nations. Does having more TV sets cause longer life expectancy?
* Moore & McCabe

Coefficient of Determination
The coefficient of determination is the correlation coefficient squared (R²).
R² = the percent of variation in the Y variable that is explained by the linear association between the two variables.
From the example of age and bone density:
  Correlation coefficient for age and bone density = -0.96
  Coefficient of determination = (-0.96)² = 0.92
  92% of the variation in bone density is explained by the linear association between age and bone density.
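The arithmetic can be reproduced from the age and bone density table given earlier in the lesson. This is a minimal Python sketch under that assumption; the helper `pearson_r` is an illustrative name, not something the lesson defines.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of the paired values xs and ys."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Age and bone density (mg/cm^2) pairs from the lesson's table.
ages = [50, 52, 56, 59, 60, 61, 61, 63, 64, 67, 68, 70, 73, 74, 77, 78, 78, 80]
density = [940, 980, 950, 945, 880, 895, 835, 846, 865, 840, 825, 810,
           790, 750, 710, 760, 720, 720]

r = pearson_r(ages, density)
r_squared = r ** 2  # coefficient of determination

print(f"r = {r:.2f}, R^2 = {r_squared:.2f}")
```

Squaring removes the sign, so R² describes only how much of the variation is explained, not the direction of the relationship.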

Effect of outlier on r
One outlier has been added to the random pattern plot of height and IQ: a 74-inch-tall person with an IQ of 165.
Addition of this one point results in r changing from 0.002 to 0.44.
[Figure: "Random Pattern with outlier", IQ plotted against Height, r = 0.44]

Correlation Coefficient and Outliers
Pearson's correlation coefficient can be influenced by extreme values (outliers).
Outliers are most easily identified from a scatter plot.
If an outlier is suspected, calculate r with and without the outlier.
If the results are very different, consider calculating the Spearman rank correlation coefficient instead.
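The with-and-without check can be sketched in a few lines of Python. The height and IQ values are not listed in this lesson, so the sketch instead uses the seven (X, Y) pairs from the lesson's skewed-data example, where the point (17, 9) is the outlier; the helper name `pearson_r` is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of the paired values xs and ys."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# (X, Y) pairs from the lesson's skewed-data example.
pairs = [(8, 7), (4, 10), (3, 4), (17, 9), (2, 6), (6, 3), (5, 2)]
outlier = (17, 9)

# r on all points, then r again with the suspected outlier dropped.
r_with = pearson_r([x for x, _ in pairs], [y for _, y in pairs])
kept = [p for p in pairs if p != outlier]
r_without = pearson_r([x for x, _ in kept], [y for _, y in kept])

print(f"with outlier:    r = {r_with:.2f}")     # 0.41
print(f"without outlier: r = {r_without:.2f}")  # -0.04
```

A single point moves r from about -0.04 to about 0.41, which is the kind of large difference that suggests using the Spearman rank correlation instead.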

Correlation measures the strength of any relationship between two continuous variables.
1. True
2. False

What is the direction of the association pictured in the scatter-plot below?
1. Positive
2. Negative
3. Cannot be determined

Which of the choices best describes the strength of the linear relationship pictured in the scatter-plot below?
1. Weak
2. Moderate
3. Strong

Suppose more data are added to the original plot. What happens to the strength of the linear relationship?
1. Weaker
2. Stronger
3. Cannot be determined

Which of the choices best describes the strength of the linear relationship pictured in the scatter-plot below?
1. Weak
2. Moderate
3. Strong

Suppose another piece of data is added to the original plot. What happens to the strength of the linear relationship?
1. Weaker
2. Stronger
3. Cannot be determined

Scatter-plot: girls' education vs. fertility rate. Source: Population Reference Bureau website: http://www.prb.org/publications/datasheets/2007/populationeconomicdevelopment2007.aspx

What is the general trend in this relationship?
1. Positive
2. Negative

How strong is this linear relationship?
1. Weak
2. Moderate
3. Moderately Strong
4. Extremely Strong

Which location appears to not follow the linear relationship?
1. Palestinian Territory
2. Czech Republic
3. Burkina Faso
4. Laos

With what type of data can you create a scatter-plot?
1. Two binary variables for each unit
2. A continuous x and y for each unit
3. An ordinal and a nominal variable for each unit
4. Univariate data

Spearman Rank Correlation (Spearman's Rho)
Non-parametric: analysis methods that make no assumptions about the data distribution.
Spearman rank correlation is the appropriate correlation analysis to use when:
  Both variables are ordinal data
  One variable is continuous and one is ordinal
  Continuous data is skewed due to outlier(s)

Spearman rank correlation Procedure
1. Order the data values for each variable from lowest to highest.
2. Assign ranks for each variable (1 up to n).
3. Use the ranks instead of the original data values in the formula for the correlation coefficient.

Skewed Data due to outlier
[Figure: scatter plot of Y against X, Pearson Correlation = 0.41]
With the outlier point (17,9), r = 0.41. Without the outlier, r = -0.04.

Original values and Rank values

X    Y    X-rank   Y-rank
8    7    6        5
4    10   3        7
3    4    2        3
17   9    7        6
2    6    1        4
6    3    5        2
5    2    4        1

The smallest X value is given rank 1; the largest is given rank 7.
Similarly, the smallest Y value is given rank 1; the largest is given rank 7.
Other values are ranked according to their order.

Spearman's Rho: the Spearman rank correlation coefficient
[Figure: Ranked values of Y plotted against Ranked values of X, Spearman's Rho = 0.18]
Ranking the values of X and Y removes the influence of the outlier.
Spearman's Rho (0.18) is not as large as Pearson's correlation coefficient for the original data (r = 0.41), and the outlier is not excluded from the analysis.

Spearman rank correlation in Excel
Excel does not have a function for Spearman rank correlation.
Use the RANK function to find the ordered rank for each value (separately for each variable).
Use the CORREL function on the ranks. This will return the Spearman rank correlation coefficient (Spearman's Rho).
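The same rank-then-correlate procedure can be sketched outside Excel. The Python below is illustrative only: `average_ranks` and `pearson_r` are hypothetical helper names, and giving tied values the average of their ranks is an assumption that goes beyond the lesson (the example data contain no ties).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of the paired values xs and ys."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank from smallest (rank 1) to largest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

# Original X and Y values from the lesson's skewed-data example.
x = [8, 4, 3, 17, 2, 6, 5]
y = [7, 10, 4, 9, 6, 3, 2]

x_ranks = average_ranks(x)  # step 1: rank each variable separately
y_ranks = average_ranks(y)
rho = pearson_r(x_ranks, y_ranks)  # step 2: correlate the ranks

print(x_ranks, y_ranks)
print(f"Spearman's rho = {rho:.2f}")  # 0.18
```

The ranks match the X-rank and Y-rank columns in the lesson's table, and the result matches the Spearman's Rho of 0.18 shown on the ranked plot.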

Reading and Assignments
Reading: Chapter 3 pgs. 48-50
Lesson 4 Part 1 Practice Exercises
Lesson 4 Excel Module
Begin Homework 2: Problems 1 and 2