Correlation


Correlation: A statistical technique that describes the relationship between two or more variables

- Variables are usually observed in a natural environment, with no manipulation by the researcher.
- Example: does a relationship exist between a person's age and the number of hours they exercise per day?
- Note: This type of research cannot establish pathways of cause and effect.

1) Correlations require at least 2 scores for each person

Subject   Height (inches)   Weight (pounds)
1         77                185
2         65                110
3         60                100
4         72                200
5         69                135

2) Each person's pair of scores can be plotted on a graph called a scatterplot

[Figure: Scatterplot of Height and Weight; x-axis: Height (inches), y-axis: Weight (pounds)]

3) It is useful to draw an envelope, or line, around the data in order to see an overall trend in the data

[Figure: Scatterplot of Height and Weight with an envelope drawn around the data; x-axis: Height (inches), y-axis: Weight (pounds)]

What's the strength of relationship? (Chap. 6)

[Figure: Scatterplot of Height and Weight; x-axis: Height (inches), y-axis: Weight (pounds)]

- Generally, thin envelopes are associated with strong correlations, while fat envelopes are associated with weaker correlations.
- Round (circular) envelopes indicate weak or no correlation between the two variables.

What is the relationship? (Chap. 7)

- I.e., what equation can one use to predict Y from X (or vice versa)?
- Note: different equations are needed to predict in each direction.

[Figure: How much does the boy weigh?]

Characteristics of a Correlation:

Direction: Correlations can be either positive or negative.
- In a positive correlation, the values of the two variables move in the same direction: high scores on x go with high scores on y.
- In a negative correlation, the values of the two variables move in opposite directions: high scores on x go with low scores on y.

[Figures: Positive Linear Correlation; Negative Linear Correlation; both follow Y = a + bX]

Characteristics of a Correlation (cont.):

Form: Linear vs. Non-linear
- If the slope of the scatterplot changes direction, the relationship is non-linear.
- Slopes that flatten out at the top are common in the social sciences; these are called monotonic and are non-linear.

[Figures: Non-Linear Correlation, Y = a + bX^c; Non-Linear Correlation (monotonic), Y = a + b ln(X)]

We will only consider linear relationships.

Correlation Uses:
- Prediction: When two variables are correlated, we can use scores on one variable to predict scores on another. SAT scores are often used as a predictor for college GPA.
- Validity: Used to show that new psychological tests actually measure what they are supposed to be measuring. If you develop a new IQ test, it should have a high correlation with other, well-established IQ tests.
- Theory Validation: Theories make predictions about relationships between two variables, and correlational methods can provide evidence to validate (or invalidate) a theory. The theory that smoking causes cancer may (or may not) be supported by correlational evidence.

Statistically, the strength and direction of the relationship are expressed by the correlation coefficient (a.k.a. Pearson's product-moment correlation, or Pearson's r).
- Correlation coefficients can range in value from -1 to +1.
- A value of 0 (or near 0) indicates no correlation.
- Values of +1.00 or -1.00 are perfect correlations.

PEARSON'S CORRELATION COEFFICIENT:

Measures the degree of linear relationship between two variables.

$$r = \frac{\text{degree to which } x \text{ and } y \text{ vary together}}{\text{degree to which } x \text{ and } y \text{ vary separately}}$$

Another way to phrase the strength-of-relationship question is: How well does the standard score (z-score) of one variable predict the standard score (z-score) of the other?
- Does knowing how many s.d.'s above or below its mean one variable is tell us how many s.d.'s above or below its mean the other variable is?
- This phrasing makes it meaningful to relate measures on different scales (e.g., height and weight) or of different values on the same scale (e.g., heights of children and parents).
- We want a scale that runs from +1 to -1: perfect positive to perfect negative linear relation.

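Example

            X      Y      z_X     z_Y     z_X z_Y
            1      2     -1.49   -1.49    2.21
            2      4     -1.16   -1.16    1.34
            3      6     -0.83   -0.83    0.68
            4      8     -0.50   -0.50    0.25
            5     10     -0.17   -0.17    0.03
            6     12      0.17    0.17    0.03
            7     14      0.50    0.50    0.25
            8     16      0.83    0.83    0.68
            9     18      1.16    1.16    1.34
           10     20      1.49    1.49    2.21
MEAN        5.5   11.0
S.D.        3.03   6.06
SUM                       0.00    0.00    9.00
SS/(N-1)                                  1.00

[Figure: PERFECT LINEAR CORRELATION; Y plotted against X; r = 1]

r = 1: perfect positive linear correlation

$$r = \frac{\sum z_X z_Y}{N - 1}, \quad \text{where } z_A = \frac{A_i - \bar{A}}{s_A} \text{ for } A = X \text{ and } A = Y$$

Note that the product $z_X z_Y$ is positive if X and Y are on the same side of their means and negative if they are on opposite sides.

From the previous equation we obtain:

$$r = \frac{\sum z_X z_Y}{N - 1} = \frac{1}{N - 1} \sum \left(\frac{X_i - \bar{X}}{s_X}\right)\left(\frac{Y_i - \bar{Y}}{s_Y}\right)$$

This last equation can be solved (with algebra) to obtain an equation that is easier to use.

Computational Formula:

$$r = \frac{\sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{N}}{\sqrt{\left[\sum X_i^2 - \frac{(\sum X_i)^2}{N}\right]\left[\sum Y_i^2 - \frac{(\sum Y_i)^2}{N}\right]}}$$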
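
The z-score definition of r can be checked numerically. The following is a minimal sketch in Python, assuming the sample standard deviation (N - 1 in the denominator), which is what the S.D. values of 3.03 and 6.06 in the table imply. The function names `z_scores` and `pearson_r_from_z` are illustrative, not part of the original slides.

```python
import math

def z_scores(values):
    """Return z-scores using the sample standard deviation (N - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def pearson_r_from_z(xs, ys):
    """Pearson's r as the sum of z-score products divided by N - 1."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

xs = list(range(1, 11))    # X = 1, 2, ..., 10
ys = [2 * x for x in xs]   # Y = 2, 4, ..., 20

z_x = z_scores(xs)
r = pearson_r_from_z(xs, ys)

print([round(z, 2) for z in z_x])  # z-scores for X, rounded as in the table
print(round(r, 4))                 # correlation coefficient
```

Because Y is an exact linear function of X here, each z_Y equals the matching z_X, so the products sum to N - 1 and r comes out as 1.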

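Let's see how much easier.

Same data as in the previous example (MEAN 5.5 and 11.0; S.D. 3.03 and 6.06):

         X      Y     X^2     Y^2     XY
         1      2       1       4      2
         2      4       4      16      8
         3      6       9      36     18
         4      8      16      64     32
         5     10      25     100     50
         6     12      36     144     72
         7     14      49     196     98
         8     16      64     256    128
         9     18      81     324    162
        10     20     100     400    200
SUM     55    110     385    1540    770

$$r = \frac{\sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{N}}{\sqrt{\left[\sum X_i^2 - \frac{(\sum X_i)^2}{N}\right]\left[\sum Y_i^2 - \frac{(\sum Y_i)^2}{N}\right]}} = \frac{770 - \frac{(55)(110)}{10}}{\sqrt{\left[385 - \frac{55^2}{10}\right]\left[1540 - \frac{110^2}{10}\right]}} = \frac{165}{\sqrt{(82.5)(330)}} = 1$$

In the absence of any knowledge of X, one can do no better than use $\bar{Y}$ as the best guess for Y.

When X and Y are linearly related, then $Y' = a + bX$ gives a better predicted value.

Note the following:

$$Y - \bar{Y} = (Y' - \bar{Y}) + (Y - Y')$$

[Figure: for a single data point, the deviation of Y from its mean split into the two parts $Y' - \bar{Y}$ and $Y - Y'$]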
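
The computational formula uses only five running sums, which makes it easy to code. The sketch below is one possible Python version applied to the same X and Y values; the function name `pearson_r` and the `sums` dictionary are illustrative choices, not something taken from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from raw sums (the computational formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator

xs = list(range(1, 11))    # X = 1, 2, ..., 10
ys = [2 * x for x in xs]   # Y = 2, 4, ..., 20

# The column totals from the table, recomputed for checking.
sums = {
    "X": sum(xs),
    "Y": sum(ys),
    "X^2": sum(x * x for x in xs),
    "Y^2": sum(y * y for y in ys),
    "XY": sum(x * y for x, y in zip(xs, ys)),
}
r = pearson_r(xs, ys)

print(sums)
print(round(r, 4))
```

No z-scores, means, or standard deviations are needed, which is why this form is the easier one to use by hand.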

$$Y - \bar{Y} = (Y' - \bar{Y}) + (Y - Y')$$

We solve this equation using more algebra:

$$\sum (Y - \bar{Y})^2 = \sum (Y' - \bar{Y})^2 + \sum (Y - Y')^2$$

- $\sum (Y - \bar{Y})^2$: total variability of Y (sum of squared deviations of Y about its mean). Stays fixed.
- $\sum (Y' - \bar{Y})^2$: total variability of Y accounted for by X (sum of squared deviations of predicted Y about the mean of Y). Gets larger as prediction improves.
- $\sum (Y - Y')^2$: sum of squared deviations of predicted from actual Y scores. Gets smaller as prediction improves.

COEFFICIENT OF DETERMINATION:

$$r^2 = \frac{\sum (Y' - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$$

The squared correlation value ($r^2$)
- Measures the proportion of variability in one variable that can be predicted or explained by the other variable: how much common, or shared, variability the 2 variables have.
- Example: The correlation between height and weight is r = .89; thus, $r^2$ = .79. 79% of the variability in the weight scores can be explained or predicted from the height scores; 79% of the total variability is shared by height and weight.
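
The split of total variability into an explained part and a leftover part can be verified on the five height and weight pairs from the subject table. This is a sketch only: the helper `least_squares_line` is a hypothetical name, and it uses the standard least-squares slope and intercept formulas.

```python
heights = [77, 65, 60, 72, 69]       # x, inches
weights = [185, 110, 100, 200, 135]  # y, pounds

def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line predicting y from x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope

a, b = least_squares_line(heights, weights)
predicted = [a + b * x for x in heights]
mean_y = sum(weights) / len(weights)

ss_total = sum((y - mean_y) ** 2 for y in weights)                    # stays fixed
ss_explained = sum((p - mean_y) ** 2 for p in predicted)              # grows with better prediction
ss_residual = sum((y - p) ** 2 for y, p in zip(weights, predicted))   # shrinks with better prediction

r_squared = ss_explained / ss_total

print(round(ss_total, 2), round(ss_explained, 2), round(ss_residual, 2))
print(round(r_squared, 2))
```

The explained and residual sums of squares add up to the total, and their ratio to the total reproduces the 79% figure quoted for height and weight.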

Factors affecting Correlations

Restriction of Range: occurs when Pearson's r is calculated from a set of scores that does not represent the variable's full range of values.
- Suppose only the highest height scores were used to compute the correlation. Notice how the envelope around those scores is very circular, indicating a low correlation.

[Figure: Restriction of Range Example; x-axis: Height, y-axis: Weight]

Outliers: The presence of a few extreme scores (outliers) can change the value of the correlation coefficient. In this example, as indicated by the dashed line, the envelope is more circular, or wider, indicating a lower correlation.

[Figure: Outlier Example; x-axis: Height, y-axis: Weight]
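
As a rough illustration of restriction of range, the sketch below reuses the five height and weight pairs from the subject table (the data behind the slide's own figure are not available) and recomputes r using only the three tallest subjects. With so few points the result is unstable, so treat it as a demonstration of the direction of the effect rather than of its size. The names `pearson_r`, `r_full`, and `r_restricted` are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from raw sums (the computational formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    numerator = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    denominator = math.sqrt(
        (sum(x * x for x in xs) - sum_x ** 2 / n)
        * (sum(y * y for y in ys) - sum_y ** 2 / n)
    )
    return numerator / denominator

heights = [77, 65, 60, 72, 69]       # inches
weights = [185, 110, 100, 200, 135]  # pounds

# Full range of heights.
r_full = pearson_r(heights, weights)

# Restricted range: keep only the three tallest subjects.
tallest = sorted(zip(heights, weights), reverse=True)[:3]
r_restricted = pearson_r([h for h, _ in tallest], [w for _, w in tallest])

print(round(r_full, 2), round(r_restricted, 2))
```

In this small data set the correlation computed from the restricted range is lower than the one computed from all five subjects, which matches the pattern described above.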

Interpreting Pearson's r:

Correlation describes the degree of linear relationship between 2 variables.
- It does not explain why the variables are related.
- It does not indicate a cause-and-effect relationship.
- It does not indicate the percentage of relationship (it's just an index, not a measurement scale): r = .4 is not twice the relationship of r = .2.

REGRESSION

REGRESSION ANALYSIS: use scores on one variable (x) to predict scores on another variable (y), based on the fact that there is a linear relationship (correlation) between the two variables.
- The x variable is termed the predictor variable; the y variable is called the criterion variable.
- Example: SAT scores (x) are used to predict college GPA (y).
- The goal of regression is to generate a single line (equation) that describes the relationship between the variables: the one line that best describes the entire set of scores, or the line of best fit.

[Figure: Scatterplot of Height and Weight with the regression line; x-axis: Height (inches), y-axis: Weight (pounds)]

- If we removed all of the dots on the graph, we would be left with a single line (expressed by an equation) that would describe the linear relationship between the two variables.
- Using this equation, we can begin to predict scores on variable y from the scores on variable x.
- Because the dots on the scatterplot do not all fall exactly on the regression line, we can tell that the relationship between x and y is not a perfect one.
- Since most relationships aren't perfect, we can learn how to calculate the regression line in order to achieve the best prediction possible.

LINEAR EQUATION: y = a + bx describes the relationship between two variables.
- a is the intercept (the value of y when x = 0).
- b is the slope of the line and measures the change in the y variable for every one-unit change in the x variable (i.e., as x increases by 1 unit, y will increase by b units).

Example: y = 5 + 3x
- When x is 0, y is equal to 5.
- For every 1 unit increase in x, y will increase by 3 units.

Example: y = -3x
- When x = 0, y is also zero.
- For every 1 unit increase in x, y will decrease by 3 units.

How does one find the best straight line when the data are not perfect?

We define the best line as the one that minimizes the sum of squared deviations between observed and predicted values of Y.

To distinguish observed from predicted, write the equations as

$$Y = a + bX \quad \text{(observed)} \qquad Y' = a_Y + b_Y X \quad \text{(predicted)}$$

We want to find constants $a_Y$ and $b_Y$ that minimize $\sum (Y - Y')^2$.

[Figure: predicted Y and observed Y plotted against X]

Consider the expression $\sum (Y - Y')^2$. Substitute for $Y'$ the equation of a straight line, $Y' = a_Y + b_Y X$, to obtain

$$\sum (Y_i - a_Y - b_Y X_i)^2$$

When calculus is used to find the optimal a and b, the results are

$$b_Y = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}, \qquad a_Y = \bar{y} - b_Y \bar{x}$$

Note that different equations are used to predict X from Y than to predict Y from X. Thus, the subscripts in $a_Y$ and $b_Y$ are used to clarify that these are the constants in the equation used to predict Y.
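
The calculus step can be spelled out as follows. This is the standard least-squares derivation, written with the symbols used above; the shorthand $Q$ for the sum of squared deviations is introduced here only for convenience.

$$
\begin{aligned}
Q(a_Y, b_Y) &= \sum (Y_i - a_Y - b_Y X_i)^2 \\
\frac{\partial Q}{\partial a_Y} &= -2 \sum (Y_i - a_Y - b_Y X_i) = 0
  \;\Rightarrow\; a_Y = \bar{Y} - b_Y \bar{X} \\
\frac{\partial Q}{\partial b_Y} &= -2 \sum X_i (Y_i - a_Y - b_Y X_i) = 0
  \;\Rightarrow\; \sum X_i Y_i - a_Y \sum X_i - b_Y \sum X_i^2 = 0 \\
\text{Substituting } a_Y: \quad
  \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n} &= b_Y \left[ \sum X_i^2 - \frac{(\sum X_i)^2}{n} \right] \\
b_Y &= \frac{\sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}}{\sum X_i^2 - \frac{(\sum X_i)^2}{n}}
\end{aligned}
$$

Setting both partial derivatives to zero gives two equations in the two unknowns, and solving them yields the slope and intercept formulas stated above.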

Equation to predict Y from X is $Y' = a_Y + b_Y X$

To estimate the constants, we use

$$b_Y = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}, \qquad a_Y = \bar{y} - b_Y \bar{x}$$

Hints:
- If you compare the slope formula to that for Pearson's r, you will notice that both the numerator and denominator of this formula also appear in the Pearson's r formula (which will probably already have been calculated by the time you reach this step). Don't recalculate these values; just pull them from the Pearson's r calculations.
- You must first calculate the slope (b) before calculating the y-intercept (a).
- It is risky to predict for values of X far beyond those observed.

Example

Subject   Height (x)   Weight (y)     x^2      y^2       xy
1             77          185        5929    34225    14245
2             65          110        4225    12100     7150
3             60          100        3600    10000     6000
4             72          200        5184    40000    14400
5             69          135        4761    18225     9315
SUM          343          730       23699   114550    51110

$\bar{x} = 68.6$, $\bar{y} = 146$

$$b_Y = \frac{51110 - \frac{(343)(730)}{5}}{23699 - \frac{(343)^2}{5}} = \frac{51110 - 50078}{23699 - 23529.8} = \frac{1032}{169.2} = 6.10$$

$$a_Y = 146 - 6.10(68.6) = 146 - 418.46 = -272.46$$

The regression equation is as follows:

y = -272.46 + 6.10(x)

Thus, for every 1 inch increase in height, we predict a 6.1 pound increase in weight.
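
The worked example can be reproduced with a few lines of Python. This is a sketch under one stated assumption: the slides round the slope to 6.10 before computing the intercept, so the unrounded intercept computed here (about -272.41) differs slightly from the -272.46 shown above. The variable names are illustrative.

```python
heights = [77, 65, 60, 72, 69]       # x, inches
weights = [185, 110, 100, 200, 135]  # y, pounds

n = len(heights)
sum_x = sum(heights)
sum_y = sum(weights)
sum_xy = sum(x * y for x, y in zip(heights, weights))
sum_x2 = sum(x * x for x in heights)

# Slope first, then intercept, as the hints recommend.
slope = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
intercept_unrounded = sum_y / n - slope * sum_x / n

# Intercept computed the way the slides do it: round the slope to 6.10 first.
intercept_slides = sum_y / n - round(slope, 2) * sum_x / n

print(sum_x, sum_y, sum_x2, sum_xy)
print(round(slope, 2))
print(round(intercept_unrounded, 2), round(intercept_slides, 2))
```

Carrying full precision through to the end and rounding only the final answer is usually the safer habit, but either intercept gives nearly identical predictions within the observed range of heights.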

Regression Line Properties:
- For every x value, we can generate a predicted y value on the regression line: simply plug a given x value into the regression equation.
- These predicted y values are designated $\hat{y}$ (y hat).
- Example: The predicted weight for someone who is 60 inches tall is $\hat{y} = -272.46 + 6.10(60) = 93.54$

Prediction Error:
- For every value of x, we will predict a new y value based on the regression equation. There will be some distance between the actual y value and the new predicted y value.
- To determine how well a regression line fits the data, we need to measure the difference between each actual y value and the corresponding predicted value $\hat{y}$.
- Example: Our predicted $\hat{y}$ value is 93.54. The actual weight (y value) for someone who is 60 inches tall is 100 pounds. The difference, $y - \hat{y} = 6.46$, is the prediction error.
- Prediction error can be visually represented by measuring the vertical distance between the actual y data point and the predicted point on the line.
- Conceptually, if we added up all of these error values, we would get a measure of total prediction error.

[Figure: Actual Score, Predicted Score, and Prediction Error shown as the vertical distance between a data point and the regression line]

The regression line generated by the equation is the one line that minimizes the total prediction error; thus the term line of best fit.
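
The predictions and prediction errors for all five subjects can be tabulated with a short sketch like the one below, which uses the rounded equation from the slides, y = -272.46 + 6.10x. The function name `predict_weight` is illustrative. One detail worth seeing in the output: the signed errors cancel out when added, which is why the errors are squared before they are summed into a measure of total prediction error.

```python
INTERCEPT = -272.46  # a, from the worked example
SLOPE = 6.10         # b, from the worked example

def predict_weight(height_inches):
    """Predicted weight (pounds) from height (inches) using the fitted line."""
    return INTERCEPT + SLOPE * height_inches

heights = [77, 65, 60, 72, 69]
weights = [185, 110, 100, 200, 135]

predictions = [predict_weight(h) for h in heights]
errors = [w - p for w, p in zip(weights, predictions)]  # y - y_hat

for h, w, p, e in zip(heights, weights, predictions, errors):
    print(h, w, round(p, 2), round(e, 2))

signed_total = sum(errors)                   # positive and negative errors cancel
squared_total = sum(e ** 2 for e in errors)  # the quantity least squares minimizes

print(round(signed_total, 2), round(squared_total, 2))
```

The row for the 60-inch subject reproduces the example above: a prediction of 93.54 pounds and an error of 6.46 pounds.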

STANDARD ERROR OF ESTIMATE:
- Provides a measure of the average distance between a regression line (predicted scores) and the actual data points; provides information about the accuracy of the predictions.
- Conceptually similar to standard deviation in that it provides an average measure of distance.
- Found by averaging the squared distances $(y - \hat{y})$ between actual scores and the predicted scores, then taking the square root:

$$SEE = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$$

Standard Error Properties:
- As the correlation increases (gets closer to 1.00 or -1.00), the data points are more tightly clustered around the regression line, resulting in better prediction (less prediction error).
- As the correlation gets smaller (approaches 0), the data points are spread further out from the regression line, resulting in poorer prediction (greater prediction error).

Predicting X from Y
- The constants in the equation to predict Y from X derive from minimizing the sum of squared deviations (SSD) between observed and predicted Y.
- The constants in the equation to predict X from Y derive from minimizing the SSD between observed and predicted X.

$$X' = a_X + b_X Y$$

$$b_X = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{N}}{\sum Y^2 - \frac{(\sum Y)^2}{N}}, \qquad a_X = \bar{X} - b_X \bar{Y}$$
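
Both ideas on this slide can be tried on the five height and weight pairs from the subject table. The following is a minimal sketch with illustrative variable names: it computes the standard error of estimate for predicting weight from height, then fits the separate line for predicting height from weight. It also checks a known identity, that the product of the two slopes equals r squared.

```python
import math

heights = [77, 65, 60, 72, 69]       # X, inches
weights = [185, 110, 100, 200, 135]  # Y, pounds

n = len(heights)
mean_x = sum(heights) / n
mean_y = sum(weights) / n

# Shared building blocks: the same terms that appear in Pearson's r.
s_xy = sum(x * y for x, y in zip(heights, weights)) - sum(heights) * sum(weights) / n
s_xx = sum(x * x for x in heights) - sum(heights) ** 2 / n
s_yy = sum(y * y for y in weights) - sum(weights) ** 2 / n

# Predicting Y (weight) from X (height).
b_y = s_xy / s_xx
a_y = mean_y - b_y * mean_x
ss_residual = sum((y - (a_y + b_y * x)) ** 2 for x, y in zip(heights, weights))
see = math.sqrt(ss_residual / (n - 2))

# Predicting X (height) from Y (weight): a different line.
b_x = s_xy / s_yy
a_x = mean_x - b_x * mean_y

r_squared = b_y * b_x  # product of the two slopes equals r squared

print(round(see, 2))
print(round(b_x, 4), round(a_x, 2))
print(round(r_squared, 2))
```

The two slopes share a numerator and differ only in the denominator, which is why the X-from-Y line is not simply the Y-from-X line solved for X unless the correlation is perfect.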

Interpreting Regression Analysis:

Regression should not be used to extrapolate to values that fall outside the range of our actual data points; we cannot reliably predict extreme scores that fall beyond the range of our actual scores.

Example: Using our previously determined regression equation, y = 6.10x - 272.46, predict the weight for someone who is 40 inches tall.

$$\hat{y} = 6.10(40) - 272.46 = 244 - 272.46 = -28.46$$

This leaves us with a negative weight.
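
One simple way to guard against this in practice is to check whether a requested x value lies inside the observed range before trusting the prediction. The sketch below is one possible approach, using the rounded equation from the slides; the function names `predict_weight` and `is_within_observed_range` are hypothetical.

```python
INTERCEPT = -272.46  # a, from the worked example
SLOPE = 6.10         # b, from the worked example

observed_heights = [77, 65, 60, 72, 69]  # heights used to fit the line

def predict_weight(height_inches):
    """Predicted weight (pounds) from height (inches) using the fitted line."""
    return INTERCEPT + SLOPE * height_inches

def is_within_observed_range(height_inches, observed):
    """True if the value lies between the smallest and largest observed values."""
    return min(observed) <= height_inches <= max(observed)

for height in (40, 70):
    prediction = predict_weight(height)
    trusted = is_within_observed_range(height, observed_heights)
    note = "inside observed range" if trusted else "extrapolation, do not trust"
    print(height, round(prediction, 2), note)
```

For 40 inches the check flags the prediction as an extrapolation, which is the case that produces the negative weight above.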