Linear Correlation Analysis


Linear Correlation Analysis Spring 2005

Superstitions: walking under a ladder; opening an umbrella indoors. Empirical evidence: consumption of ice cream and drownings are generally positively correlated. Can we reduce the number of drownings if we prohibit ice cream sales in the summer?

Three kinds of relationships between variables:
Association (correlation, covariation): Both variables tend to be high or low together (positive relationship), or one tends to be high when the other is low (negative relationship). The variables do not have independent and dependent roles.
Prediction: The variables are assigned independent and dependent roles, and both are observed. There is a weak causal implication that the independent (predictor) variable is the cause and the dependent variable is the effect.
Causal: The variables are assigned independent and dependent roles; the independent variable is manipulated and the dependent variable is observed. Strong causal statements are allowed.

General Overview of Correlational Analysis. The purpose is to measure the strength of a linear relationship between two variables. A correlation coefficient does not establish causation (i.e., that a change in X causes a change in Y). X is typically the input, measured, or independent variable; Y is typically the output, predicted, or dependent variable. If, as X increases, there is a predictable shift in the values of Y, a correlation exists.

General Properties of Correlation Coefficients. Values can range between +1 and -1. The value of the correlation coefficient reflects the scatter of points on a scatterplot: you should be able to look at a scatterplot and estimate what the correlation would be, and to look at a correlation coefficient and visualize the scatterplot.

Perfect Linear Correlation Occurs when all the points in a scatterplot fall exactly along a straight line.

Positive Correlation (Direct Relationship). As the value of X increases, the value of Y also increases. Larger values of X tend to be paired with larger values of Y (and, consequently, smaller values of X with smaller values of Y).

Negative Correlation (Inverse Relationship). As the value of X increases, the value of Y decreases. Small values of X tend to be paired with large values of Y (and vice versa).

Non-Linear Correlation As the value of X increases, the value of Y changes in a non-linear manner

No Correlation As the value of X changes, Y does not change in a predictable manner. Large values of X seem just as likely to be paired with small values of Y as with large values of Y

Interpretation. What counts as a meaningful correlation depends on the purpose of the study, but as a general guideline: the magnitude of the value indicates the strength of the relationship, and the sign indicates its direction.

Some of the many types of correlation coefficients (there are lots more):

Name            X variable              Y variable
Pearson r       Interval/Ratio          Interval/Ratio
Spearman rho    Ordinal                 Ordinal
Kendall's Tau   Ordinal                 Ordinal
Phi             Dichotomous             Dichotomous
Intraclass R    Interval/Ratio (test)   Interval/Ratio (retest)

Of these, Pearson r, Spearman rho, and Kendall's Tau are included in SPSS's Bivariate Correlation procedure; they are the ones we will focus on this semester.

The Pearson Product-Moment Correlation (r) Named after Karl Pearson (1857-1936) Both X and Y measured at the Interval/Ratio level Most widely used coefficient in the literature

The Pearson Product-Moment Correlation (r): a measure of the extent to which paired scores occupy the same or opposite positions within their own distributions. From: Pagano (1994).

Computing Pearson r Hand Calculation
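The original slide presents the hand-calculation worksheet as an image; for reference, the standard computational formula for Pearson r (a well-known result, not transcribed from the slide) is:

$$ r \;=\; \frac{n\sum XY - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left[n\sum X^{2}-\left(\sum X\right)^{2}\right]\left[n\sum Y^{2}-\left(\sum Y\right)^{2}\right]}} $$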

Computing Pearson r in Excel
Step #1: Enter the X and Y data (below)
Step #2: Insert Function (PEARSON)
Step #3: Select the X and Y data
Step #4: Format the output

Subject   X   Y
A         1   2
B         3   5
C         4   3
D         6   7
E         7   5

Pearson r = 0.73
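For readers who prefer code to a spreadsheet, here is a minimal Python sketch (an addition, not part of the original slides) that applies the computational formula above to the five-subject example; it reproduces the value reported on the slide, r ≈ 0.73.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(v * v for v in x)
    sum_yy = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator

# Example data from the slide (subjects A-E)
x = [1, 3, 4, 6, 7]
y = [2, 5, 3, 7, 5]
print(round(pearson_r(x, y), 2))  # 0.73
```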

Computing Pearson r in SPSS
Step #1: Enter the X and Y data
Step #2: Analyze > Correlate > Bivariate
Step #3: Select the X and Y variables
Step #4: Request means and standard deviations (SDs)

Computing Pearson r in SPSS

Output #1: Descriptive Statistics
        Mean   Std. Deviation   N
VARX    4.20   2.387            5
VARY    4.40   1.949            5

Output #2: Correlations
                               VARX    VARY
VARX   Pearson Correlation     1       .731
       Sig. (2-tailed)         .       .161
       N                       5       5
VARY   Pearson Correlation     .731    1
       Sig. (2-tailed)         .161    .
       N                       5       5

Interpretation: r = 0.73, p = .161. The researchers found a moderate, but not significant, relationship between X and Y.

SAMPLE SIZE: one of the many issues involved in the interpretation of correlation coefficients.

Descriptive Statistics
        Mean   Std. Deviation   N
VARX    4.20   2.179            25
VARY    4.40   1.780            25

Correlations
                               VARX     VARY
VARX   Pearson Correlation     1        .731**
       Sig. (2-tailed)         .        .000
       N                       25       25
VARY   Pearson Correlation     .731**   1
       Sig. (2-tailed)         .000     .
       N                       25       25
**. Correlation is significant at the 0.01 level

Interpretation: r = 0.73, p < .001 (reported by SPSS as .000). The researchers found a significant, moderate relationship between X and Y.
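The point of these two slides is that the same r can be non-significant with n = 5 yet highly significant with n = 25. A small sketch (an addition; it assumes SciPy is available) makes this explicit by converting r to a t statistic, t = r√(n−2)/√(1−r²), and computing the two-tailed p value:

```python
from math import sqrt
from scipy import stats

def p_value_for_r(r, n):
    """Two-tailed p value for testing H0: rho = 0, given a sample Pearson r."""
    t = r * sqrt((n - 2) / (1 - r ** 2))
    return 2 * stats.t.sf(abs(t), n - 2)

print(p_value_for_r(0.731, 5))   # ~0.16    (matches the SPSS output for n = 5)
print(p_value_for_r(0.731, 25))  # ~0.00003 (reported by SPSS as .000)
```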

How can this be? The sampling distribution of Pearson r is not symmetric as r approaches ±1 (see http://davidmlane.com/hyperstat/a98696.html for more information). Examining the 95% confidence interval for r makes the role of sample size clear.
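One way to examine the 95% confidence interval for r is the Fisher z transformation. The sketch below (an addition, assuming SciPy is available) shows that with n = 5 the interval for r = .731 spans zero, while with n = 25 it does not:

```python
import math
from scipy import stats

def fisher_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a correlation via Fisher's z transform."""
    z = math.atanh(r)                          # Fisher z transformation of r
    se = 1 / math.sqrt(n - 3)                  # standard error of z
    crit = stats.norm.ppf((1 + confidence) / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

print(fisher_ci(0.731, 5))   # roughly (-0.43, 0.98): includes 0, so not significant
print(fisher_ci(0.731, 25))  # roughly ( 0.47, 0.87): excludes 0, so significant
```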

An additional way to interpret Pearson r: the coefficient of determination, r². This is the proportion of the variability of Y accounted for by X, usually expressed as a percentage. [Figure: two overlapping circles representing the variability of X and the variability of Y; the area of overlap represents the proportion of the variability of Y accounted for by X.]
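As a worked example (an addition, using the correlation from the earlier slides):

$$ r = 0.731 \quad\Longrightarrow\quad r^{2} = 0.731^{2} \approx 0.53, $$

so roughly 53% of the variability in Y is accounted for by X.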

Correlation Identification Practice. Let's see if you can identify the value of the correlation coefficient from a scatterplot. Click to begin.

Outliers: observations that clearly appear to be out of range of the other observations. [Two scatterplots of Variable Y against Variable X illustrating the effect of outliers: one yields r = 0.72, the other r = 0.97.]

What to do with outliers: you are stuck with them unless... Check whether there has been a data-entry error; if so, fix the data. Check whether the values are plausible: is the score within the minimum and maximum possible? If values are impossible, delete the data and report how many scores were deleted. Examine other variables for these subjects to see whether you can find an explanation for the scores being so different from the rest; you may be able to delete them if your reasoning is sound.

Correlation and Attenuation. Restricting the range of scores can have a large impact on a correlation coefficient. [Scatterplot of Variable Y against Variable X for the full data set (r = 0.72), with the X range divided into LOW, MEDIUM, and HIGH groups.]

[Scatterplot for the LOW group only: r = 0.55.]

[Scatterplot for the MEDIUM group only: r = 0.86.]

[Scatterplot for the HIGH group only: r = 0.67.]

Using all of the data, r = 0.72; within the restricted LOW, MEDIUM, and HIGH ranges, r = 0.55, 0.86, and 0.67 respectively.

Here's another problem with interpreting correlation coefficients that you should watch out for. [Scatterplot of the Y variable against the X variable for men and women: all data combined, r = +0.89; men only, r = -0.21; women only, r = +0.22.]

Reporting a set of correlation coefficients in a table: a complete correlation matrix (notice the redundancy) versus a lower triangular correlation matrix, in which values are not repeated. (There is also an upper triangular matrix.)

Spearman Rho (r_s). Named after Charles E. Spearman (1863-1945). Assumptions: the data consist of a random sample of n pairs of numeric or non-numeric observations that can be ranked, and each pair of observations represents two measurements taken on the same object or individual. Photo from: http://www.york.ac.uk/depts/maths/histstat/people/sources.htm

Why choose Spearman rho instead of Pearson r? Because both X and Y are measured at the ordinal level; the sample size is small; X and Y are measured at the interval/ratio level but are not normally distributed (e.g., severely skewed); or X and Y do not follow a bivariate normal distribution.

What is a Bivariate Normal Distribution?


Sample Problem. Pincherle and Robinson (1974) noted a marked inter-observer variation in blood pressure readings: doctors who read high on systolic tended to read high on diastolic. Table 1 shows the mean systolic and diastolic blood pressure readings by 14 doctors. Research question: What is the strength of the relationship between the two variables? Pincherle, G. & Robinson, D. (1974). Mean blood pressure and its relation to other factors determined at a routine executive health examination. J. Chronic Dis., 27, 245-260.

Table 1. Mean blood pressure readings, millimeters mercury, by doctor.

Doctor ID   Systolic   Diastolic
1           141.8      89.7
2           140.2      74.4
3           131.8      83.5
4           132.5      77.8
5           135.7      85.8
6           141.2      86.5
7           143.9      89.4
8           140.2      89.3
9           140.8      88.0
10          131.7      82.2
11          130.8      84.6
12          135.6      84.4
13          143.6      86.3
14          133.2      85.9

Research question: What is the strength of the relationship between the two variables?
Option #1: Compute a Pearson r.
Option #2: If you do not feel these data meet the assumptions of Pearson r, convert the data to ranks and then compute a Spearman rho.
We will go over how to check the assumptions on Wednesday, when we talk about regression.

Computation of Spearman Rho. Step #1: Rank each X relative to all other observed values of X, from smallest to largest in order of magnitude. The rank of the i-th value of X is denoted by R(X_i), and R(X_i) = 1 if X_i is the smallest observed value of X. Follow the same procedure for the Y variable.

Table 1 (continued). Mean blood pressure readings, millimeters mercury, by doctor, with ranks added.

Doctor ID   Systolic   Diastolic   R(systolic)   R(diastolic)
1           141.8      89.7        12            14
2           140.2      74.4        8.5           1
3           131.8      83.5        3             4
4           132.5      77.8        4             2
5           135.7      85.8        7             7
6           141.2      86.5        11            10
7           143.9      89.4        14            13
8           140.2      89.3        8.5           12
9           140.8      88.0        10            11
10          131.7      82.2        2             3
11          130.8      84.6        1             6
12          135.6      84.4        6             5
13          143.6      86.3        13            9
14          133.2      85.9        5             8

Table 1 (continued). Mean blood pressure readings, millimeters mercury, by doctor, with ranks and rank differences.

Doctor ID   Systolic   Diastolic   R(systolic)   R(diastolic)   d_i    d_i^2
1           141.8      89.7        12            14             -2     4
2           140.2      74.4        8.5           1              7.5    56.25
3           131.8      83.5        3             4              -1     1
4           132.5      77.8        4             2              2      4
5           135.7      85.8        7             7              0      0
6           141.2      86.5        11            10             1      1
7           143.9      89.4        14            13             1      1
8           140.2      89.3        8.5           12             -3.5   12.25
9           140.8      88.0        10            11             -1     1
10          131.7      82.2        2             3              -1     1
11          130.8      84.6        1             6              -5     25
12          135.6      84.4        6             5              1      1
13          143.6      86.3        13            9              4      16
14          133.2      85.9        5             8              -3     9

Σd_i^2 = 132.50
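The final hand-calculation step is not transcribed; using the usual rank-difference formula (which ignores the correction for the single tie on systolic), the sum above gives a value consistent with the SPSS output that follows:

$$ r_s \;=\; 1 - \frac{6\sum d_i^{2}}{n\left(n^{2}-1\right)} \;=\; 1 - \frac{6(132.5)}{14\,(14^{2}-1)} \;=\; 1 - \frac{795}{2730} \;\approx\; 0.708 $$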

Computing Spearman Rho in SPSS: Analyze > Correlate > Bivariate

Correlations (Spearman's rho)
                                     SYSTOLIC   DIASTOLI
SYSTOLIC   Correlation Coefficient   1.000      .708**
           Sig. (2-tailed)           .          .005
           N                         14         14
DIASTOLI   Correlation Coefficient   .708**     1.000
           Sig. (2-tailed)           .005       .
           N                         14         14
**. Correlation is significant at the .01 level (2-tailed).

Kendall's Tau (τ, T, or t). Named after Sir Maurice G. Kendall (1907-1983). Based on the ranks of the observations. Values range between -1 and +1. Computation is more tedious than r_s. Defined as the probability of concordance minus the probability of discordance. Typically yields a different value than r_s. To find out more about this statistic, see http://www2.chass.ncsu.edu/garson/pa765/assocordinal.htm. Photo from: http://www.york.ac.uk/depts/maths/histstat/people/sources.htm
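As a worked check (an addition, not in the original slides): counting pairs of doctors in the blood-pressure data gives C = 67 concordant pairs, D = 23 discordant pairs, and one pair tied on systolic, so with n_0 = n(n-1)/2 = 91 the tie-corrected tau-b is

$$ \tau_b \;=\; \frac{C - D}{\sqrt{\left(n_0 - T_x\right)\left(n_0 - T_y\right)}} \;=\; \frac{67 - 23}{\sqrt{(91-1)(91-0)}} \;\approx\; 0.486, $$

which matches the SPSS value on the next slide.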

Comparison of values for the blood pressure data:

Correlations (Pearson)
                                 SYSTOLIC   DIASTOLI
SYSTOLIC   Pearson Correlation   1          .418
           Sig. (2-tailed)       .          .136
           N                     14         14
DIASTOLI   Pearson Correlation   .418       1
           Sig. (2-tailed)       .136       .
           N                     14         14

Correlations (Spearman's rho)
                                     SYSTOLIC   DIASTOLI
SYSTOLIC   Correlation Coefficient   1.000      .708**
           Sig. (2-tailed)           .          .005
           N                         14         14
DIASTOLI   Correlation Coefficient   .708**     1.000
           Sig. (2-tailed)           .005       .
           N                         14         14
**. Correlation is significant at the .01 level (2-tailed).

Correlations (Kendall's tau_b)
                                     SYSTOLIC   DIASTOLI
SYSTOLIC   Correlation Coefficient   1.000      .486*
           Sig. (2-tailed)           .          .016
           N                         14         14
DIASTOLI   Correlation Coefficient   .486*      1.000
           Sig. (2-tailed)           .016       .
           N                         14         14
*. Correlation is significant at the .05 level (2-tailed).
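For completeness, a short Python sketch (an addition; it assumes SciPy is installed) reproduces all three coefficients for the blood-pressure data:

```python
from scipy import stats

systolic  = [141.8, 140.2, 131.8, 132.5, 135.7, 141.2, 143.9,
             140.2, 140.8, 131.7, 130.8, 135.6, 143.6, 133.2]
diastolic = [89.7, 74.4, 83.5, 77.8, 85.8, 86.5, 89.4,
             89.3, 88.0, 82.2, 84.6, 84.4, 86.3, 85.9]

# Coefficients should match the SPSS output above (.418, .708, .486);
# p values may differ slightly from SPSS due to different approximations.
print(stats.pearsonr(systolic, diastolic))    # Pearson r
print(stats.spearmanr(systolic, diastolic))   # Spearman rho
print(stats.kendalltau(systolic, diastolic))  # Kendall tau-b
```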

Types of Correlation Coefficients

The Pearson "family"
Name                     Symbol   X variable           Y variable
Pearson Product-Moment   r        Interval/Ratio       Interval/Ratio
Spearman rho             r_s      Ordinal              Ordinal
Phi                      Φ        True dichotomous     True dichotomous
Point Biserial           r_pb     True dichotomous     Interval/Ratio
Rank-Biserial            r_rb     True dichotomous     Ordinal

The non-Pearson "family"
Name                     Symbol   X variable           Y variable
Kendall's Tau            τ        Ordinal              Ordinal
Biserial                 r_b      Forced dichotomous   Interval/Ratio
Tetrachoric              r_t      Forced dichotomous   Forced dichotomous

Definitions
True dichotomous: a variable that is nominal and has only two levels.
Forced dichotomous: a variable that is assumed to have an underlying normal distribution but is forced to be dichotomous (e.g., Rich/Poor, Happy/Sad, Smart/Not Smart, etc.).

From: http://www.oandp.org/jpo/library/1996_03_105.asp Nonparametric tests should not be substituted for parametric tests when parametric tests are more appropriate. Nonparametric tests should be used when the assumptions of parametric tests cannot be met, when very small numbers of data are used, and when no basis exists for assuming certain types or shapes of distributions (9). Nonparametric tests are used if data can only be classified, counted or ordered; for example, rating staff on performance or comparing results from manual muscle tests. These tests should not be used in determining precision or accuracy of instruments because the tests are lacking in both areas.

From: http://www.unesco.org/webworld/idams/advguide/chapt4_2.htm Pearson correlation is unduly influenced by outliers, unequal variances, non-normality, and nonlinearity. An important competitor of the Pearson correlation coefficient is Spearman's rank correlation coefficient. This latter correlation is calculated by applying the Pearson correlation formula to the ranks of the data rather than to the actual data values themselves. In so doing, many of the distortions that plague the Pearson correlation are reduced considerably.

For more information about the effect of ties on Spearman rho, see Conover, W. J., & Iman, R. L. (1978). Approximations of the critical region for Spearman's rho with and without ties present. Communications in Statistics, B7(3), 269-282.