Correlation

A statistical technique that describes the relationship between two or more variables.
Variables are usually observed in a natural environment, with no manipulation by the researcher.
Example: does a relationship exist between a person's age and the number of hours they exercise per day?
Note: This type of research cannot establish pathways of cause and effect.

1) Correlations require at least 2 scores for each person.

Subject   Height (inches)   Weight (pounds)
1         77                185
2         65                110
3         60                100
4         72                200
5         69                135

2) Each person's pair of scores can be plotted on a graph called a scatterplot.

[Figure: Scatterplot of Height and Weight. x-axis: Height (inches), 55 to 80; y-axis: Weight (pounds), 0 to 250]

3) It is useful to draw an envelope, or line, around the data in order to see an overall trend in the data.

[Figure: Scatterplot of Height and Weight with an envelope drawn around the points. x-axis: Height (inches); y-axis: Weight (pounds)]
What's the strength of relationship? (Chap. 6)

[Figure: Scatterplot of Height and Weight. x-axis: Height (inches); y-axis: Weight (pounds)]

Generally, thin envelopes are associated with strong correlations, while fat envelopes are associated with weaker correlations.
Round (circular) envelopes indicate weak or no correlation between the two variables.

What is the relationship? (Chap. 7)

I.e., what equation can one use to predict Y from X (or vice versa)?
Note: different equations are needed to predict in each direction.

[Figure: How much does the boy weigh?]

Characteristics of a Correlation:

Direction: Correlations can be either positive or negative.
In a positive correlation, the values of the two variables move in the same direction: high scores on x go with high scores on y.
In a negative correlation, the values of the two variables move in opposite directions: high scores on x go with low scores on y.

[Figure: two panels, "Positive Linear Correlation" and "Negative Linear Correlation", each showing a line of the form Y = a + bX]
Characteristics of a Correlation (cont.):

Form: Linear vs. Non-linear
If the slope of the scatterplot changes direction, the relationship is non-linear.
Slopes that flatten out at the top are common in the social sciences; these are called monotonic and are non-linear.

[Figure: two panels. "Non Linear Correlation": $Y = a + bX^c$. "Non Linear Correlation - monotonic": $Y = a + b\ln(X)$]

We will only consider linear relationships.

Correlation Uses:

Prediction: When two variables are correlated, we can use scores on one variable to predict scores on another.
SAT scores are often used as a predictor for college GPA.

Validity: Used to show that new psychological tests actually measure what they are supposed to be measuring.
If you develop a new IQ test, it should have a high correlation with other, well-established IQ tests.

Theory Validation: Theories make predictions about relationships between two variables, and correlational methods can provide evidence to validate (or invalidate) a theory.
The theory that smoking causes cancer may (or may not) be supported by correlational evidence.
Statistically, the strength and direction of the relationship are expressed by the correlation coefficient (a.k.a. Pearson's product-moment correlation, or Pearson's r).
Correlation coefficients can range in value from -1 to +1.
A value of 0 (or near 0) indicates no correlation.
Values of +1.0 or -1.0 are perfect correlations.

PEARSON'S CORRELATION COEFFICIENT: Measures the degree of linear relationship between two variables.

$$ r = \frac{\text{degree to which } x \text{ and } y \text{ vary together}}{\text{degree to which } x \text{ and } y \text{ vary separately}} $$

Another way to phrase the strength-of-relationship question is: How well does the standard score (z-score) of one variable predict the standard score (z-score) of the other?
Does knowing how many s.d.'s above or below the mean one variable is tell us how many s.d.'s above or below the mean the other variable is?
This phrasing makes it meaningful to relate measures on different scales (e.g., height and weight) or of different values on the same scale (e.g., heights of children and parents).
We want a scale that runs from +1 to -1: perfect positive to perfect negative linear relation.
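Example

X       Y       z_X      z_Y      z_X z_Y
1       2       -1.49    -1.49    2.21
2       4       -1.16    -1.16    1.34
3       6       -0.83    -0.83    0.68
4       8       -0.50    -0.50    0.25
5       10      -0.17    -0.17    0.03
6       12       0.17     0.17    0.03
7       14       0.50     0.50    0.25
8       16       0.83     0.83    0.68
9       18       1.16     1.16    1.34
10      20       1.49     1.49    2.21

MEAN    5.5     11.0
S.D.    3.03    6.06
SUM                      0.00     0.00    9.00
SUM/(N-1)                                 1.00

[Figure: "Perfect Linear Correlation", Y plotted against X for X = 1 to 10; r = 1, perfect positive linear correlation]

$$ r = \frac{\sum z_X z_Y}{N - 1}, \quad \text{where } z_A = \frac{A_i - \bar{A}}{s_A} \text{ for } A = X \text{ and } A = Y $$

Note that the product $z_X z_Y$ is positive if X and Y are on the same side of their means and negative if they are on opposite sides.

From the previous equation we obtain:

$$ r = \frac{\sum z_X z_Y}{N - 1} = \frac{\sum \left( \frac{X_i - \bar{X}}{s_X} \right) \left( \frac{Y_i - \bar{Y}}{s_Y} \right)}{N - 1} $$

This last equation can be rearranged (with algebra) to obtain an equation that is easier to use.

Computational Formula:

$$ r = \frac{\sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{N}}{\sqrt{\left( \sum X_i^2 - \frac{(\sum X_i)^2}{N} \right) \left( \sum Y_i^2 - \frac{(\sum Y_i)^2}{N} \right)}} $$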
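As a rough numerical check of the z-score definition of r, the following Python sketch recomputes the example. The function name `pearson_r_from_z` is made up for illustration. It assumes the sample standard deviation (dividing by N-1), which is what the S.D. values 3.03 and 6.06 in the table imply.

```python
from statistics import mean, stdev


def pearson_r_from_z(xs, ys):
    """Pearson's r as the sum of z-score products divided by N - 1.

    Hypothetical helper for illustration; uses the sample standard
    deviation (N - 1 in the denominator), an assumption inferred from
    the S.D. row of the example table.
    """
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    sd_x, sd_y = stdev(xs), stdev(ys)
    z_x = [(x - mean_x) / sd_x for x in xs]
    z_y = [(y - mean_y) / sd_y for y in ys]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)


# Example data from the slide: Y is exactly 2 * X.
xs = list(range(1, 11))
ys = [2 * x for x in xs]

r = pearson_r_from_z(xs, ys)
print(f"r = {r:.2f}")
```

Because every point lies exactly on a rising line, each z_X equals the matching z_Y and the products sum to N - 1, so r comes out as 1 up to floating-point rounding.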
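Let's see how much easier.

X       Y       X^2     Y^2     XY
1       2       1       4       2
2       4       4       16      8
3       6       9       36      18
4       8       16      64      32
5       10      25      100     50
6       12      36      144     72
7       14      49      196     98
8       16      64      256     128
9       18      81      324     162
10      20      100     400     200

SUM     55      110     385     1540    770
MEAN    5.5     11.0
S.D.    3.03    6.06

(In the SUM row the entries are, in order, the sums of X, Y, X^2, Y^2, and XY.)

$$ r = \frac{\sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{N}}{\sqrt{\left( \sum X_i^2 - \frac{(\sum X_i)^2}{N} \right) \left( \sum Y_i^2 - \frac{(\sum Y_i)^2}{N} \right)}} = \frac{770 - \frac{(55)(110)}{10}}{\sqrt{\left( 385 - \frac{55^2}{10} \right) \left( 1540 - \frac{110^2}{10} \right)}} = \frac{165}{\sqrt{(82.5)(330)}} = 1 $$

In the absence of any knowledge of X, one can do no better than use $\bar{Y}$ as the best guess for Y.
When X and Y are linearly related, then $Y' = a + bX$ gives a better predicted value.
Note the following:

$$ (Y - \bar{Y}) = (Y' - \bar{Y}) + (Y - Y') $$

[Figure: for a single data point, the deviation $Y - \bar{Y}$ split into the parts $Y' - \bar{Y}$ and $Y - Y'$]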
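The computational formula translates directly into a few running sums. The sketch below is one possible Python rendering, with the function name `pearson_r` chosen here for illustration. It is applied to the X/Y example and, as an extra check, to the five height/weight pairs from the opening table.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r via the computational (raw-sums) formula.

    Hypothetical helper for illustration only.
    """
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    numerator = sum_xy - sum_x * sum_y / n          # 770 - 605 = 165
    ss_x = sum_x2 - sum_x ** 2 / n                  # 385 - 302.5 = 82.5
    ss_y = sum_y2 - sum_y ** 2 / n                  # 1540 - 1210 = 330
    return numerator / sqrt(ss_x * ss_y)


# Perfect linear example from the slide.
xs = list(range(1, 11))
ys = [2 * x for x in xs]
r_perfect = pearson_r(xs, ys)

# Height (inches) and weight (pounds) for the five subjects.
heights = [77, 65, 60, 72, 69]
weights = [185, 110, 100, 200, 135]
r_hw = pearson_r(heights, weights)

print(f"perfect example: r = {r_perfect:.2f}")
print(f"height/weight:   r = {r_hw:.2f}")
```

The raw-sums form needs only one pass over the data and no z-scores, which is why it is easier to use by hand.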
$$ (Y - \bar{Y}) = (Y' - \bar{Y}) + (Y - Y') $$

We solve this equation using more algebra: squaring both sides and summing over all scores gives

$$ \sum (Y - \bar{Y})^2 = \sum (Y' - \bar{Y})^2 + \sum (Y - Y')^2 $$

$\sum (Y - \bar{Y})^2$: total variability of Y (sum of squared deviations of Y about its mean). Stays fixed.
$\sum (Y' - \bar{Y})^2$: variability of Y accounted for by X (sum of squared deviations of predicted Y about the mean of Y). Gets larger as prediction improves.
$\sum (Y - Y')^2$: sum of squared deviations of predicted from actual Y scores. Gets smaller as prediction improves.

COEFFICIENT OF DETERMINATION:

$$ r^2 = \frac{\sum (Y' - \bar{Y})^2}{\sum (Y - \bar{Y})^2} $$

The squared correlation value ($r^2$) measures the proportion of variability in one variable that can be predicted or explained by the other variable, i.e., how much common, or shared, variability the 2 variables have.
Example: The correlation between height and weight is r = .89; thus, $r^2$ = .79. 79% of the variability in the weight scores can be explained or predicted from the height scores; 79% of the total variability is shared by height and weight.
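The partition of variability can be verified on the five height/weight pairs. This is a minimal sketch, assuming the least-squares line is fitted with the usual slope and intercept formulas; the variable names are illustrative.

```python
# Height (inches) and weight (pounds) for the five subjects.
heights = [77, 65, 60, 72, 69]
weights = [185, 110, 100, 200, 135]

n = len(heights)
mean_x = sum(heights) / n
mean_y = sum(weights) / n

# Least-squares slope and intercept for predicting weight from height.
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, weights))
ss_x = sum((x - mean_x) ** 2 for x in heights)
slope = sp / ss_x
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in heights]

# The three sums of squares from the partition.
ss_total = sum((y - mean_y) ** 2 for y in weights)
ss_explained = sum((yp - mean_y) ** 2 for yp in predicted)
ss_error = sum((y - yp) ** 2 for y, yp in zip(weights, predicted))

r_squared = ss_explained / ss_total

print(f"total     = {ss_total:.2f}")
print(f"explained = {ss_explained:.2f}")
print(f"error     = {ss_error:.2f}")
print(f"r^2       = {r_squared:.2f}")
```

The explained and error sums add up to the total, and their ratio reproduces the $r^2$ of about .79 quoted for height and weight.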
Factors affecting Correlations

Restriction of Range: occurs when Pearson's r is calculated from a set of scores that does not represent the variable's full range of values.
Example: suppose only the highest height scores were used to compute the correlation. Notice how the envelope around those scores is nearly circular, indicating a low correlation.

[Figure: "Restriction of Range Example". x-axis: Height; y-axis: Weight]

Outliers: The presence of a few extreme scores (outliers) can have an effect on the value of the correlation coefficient. As indicated by the dashed line, the envelope is more circular, or wider, indicating a lower correlation.

[Figure: "Outlier Example". x-axis: Height; y-axis: Weight; dashed envelope includes the outliers]
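Restriction of range can be illustrated with the five height/weight pairs from the opening table. In the sketch below, the cutoff of 69 inches for "the highest height scores" is an assumption made here for illustration, and with only three remaining points the result is a rough demonstration rather than a stable estimate.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from deviation scores (hypothetical helper)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / sqrt(ss_x * ss_y)


# Full data set: (height in inches, weight in pounds).
pairs = [(77, 185), (65, 110), (60, 100), (72, 200), (69, 135)]

# Assumed cutoff: keep only subjects at least 69 inches tall.
CUTOFF = 69
restricted = [(h, w) for h, w in pairs if h >= CUTOFF]

r_full = pearson_r([h for h, _ in pairs], [w for _, w in pairs])
r_restricted = pearson_r([h for h, _ in restricted], [w for _, w in restricted])

print(f"full range:       r = {r_full:.2f}")
print(f"restricted range: r = {r_restricted:.2f}")
```

On these data the restricted correlation is noticeably lower than the full-range value, which matches the circular envelope described above.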
Interpreting Pearson's r:

Correlation describes the degree of linear relationship between 2 variables.
It does not explain why the variables are related.
It does not indicate a cause-and-effect relationship.
It does not indicate the percentage of relationship (it's just an index, not a measurement scale): r = .40 is not twice the relationship of r = .20.

REGRESSION

REGRESSION ANALYSIS: Use scores on one variable (x) to predict scores on another variable (y), based on the fact that we know that there is a linear relationship (correlation) between the two variables.
The x variable is termed the predictor variable; the y variable is called the criterion variable.
Example: SAT scores (x) are used to predict college GPA (y).
The goal of regression is to generate a single line (equation) that describes the relationship between the variables: the one line that best describes the entire set of scores, or the "line of best fit".
[Figure: Scatterplot of Height and Weight with the regression line. x-axis: Height (inches); y-axis: Weight (pounds)]

If we removed all of the dots on the graph, we would be left with a single line (expressed by an equation) that would describe the linear relationship between the two variables.
Using this equation, we can begin to predict scores on variable y from the scores on variable x.
Because the dots on the scatterplot do not all fall exactly on the regression line, it can be determined that the relationship between x and y is not a perfect one.
Since most relationships aren't perfect, we can learn how to calculate the regression line in order to achieve the best prediction possible.

LINEAR EQUATION: y = a + bx describes the relationship between two variables.
a is the intercept (the value of y when x = 0).
b is the slope of the line and measures the change in the y variable for every one-unit change in the x variable (i.e., as x increases by 1 unit, y will increase by b units).

Example: y = 5 + 3x
When x is 0, y is equal to 5.
For every 1 unit increase in x, y will increase by 3 units.

Example: y = -3x
When x = 0, y is also zero.
For every 1 unit increase in x, y will decrease by 3 units.
How does one find the best straight line when the data are not perfect?

We define the best line as the one that minimizes the sum of squared deviations between observed and predicted values of Y.
To distinguish observed from predicted values, write Y for an observed value and Y' for the value predicted by the line:

$$ Y' = a_Y + b_Y X \quad \text{(predicted)} $$

We want to find constants $a_Y$ and $b_Y$ that minimize $\sum (Y - Y')^2$.

[Figure: observed Y and predicted Y' for one data point, with the vertical distance between them marked]

Consider the expression $\sum (Y - Y')^2$. Substitute for Y' the equation of a straight line, $Y' = a_Y + b_Y X$, to obtain

$$ \sum (Y_i - a_Y - b_Y X_i)^2 $$

When calculus is used to find the optimal a and b, the results are

$$ b_Y = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}, \qquad a_Y = \bar{y} - b_Y \bar{x} $$

Note that different equations are used to predict X from Y than to predict Y from X. Thus, the subscripts in $a_Y$ and $b_Y$ are used to clarify that these are the constants in the equation used to predict Y.
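The calculus step can be sketched as follows. This is the standard least-squares derivation, written out here as a supplement; the symbol $S$ for the sum of squared deviations is introduced only for this sketch.

```latex
\begin{align*}
S(a_Y, b_Y) &= \sum_i (Y_i - a_Y - b_Y X_i)^2 \\[4pt]
\frac{\partial S}{\partial a_Y} &= -2 \sum_i (Y_i - a_Y - b_Y X_i) = 0
  \;\Longrightarrow\; a_Y = \bar{Y} - b_Y \bar{X} \\[4pt]
\frac{\partial S}{\partial b_Y} &= -2 \sum_i X_i (Y_i - a_Y - b_Y X_i) = 0
  \;\Longrightarrow\; \sum_i X_i Y_i - a_Y \sum_i X_i - b_Y \sum_i X_i^2 = 0 \\[4pt]
\text{Substituting } a_Y:\quad
  &\sum_i X_i Y_i - \frac{(\sum_i X_i)(\sum_i Y_i)}{n}
   = b_Y \left( \sum_i X_i^2 - \frac{(\sum_i X_i)^2}{n} \right) \\[4pt]
b_Y &= \frac{\sum_i X_i Y_i - \frac{(\sum_i X_i)(\sum_i Y_i)}{n}}
            {\sum_i X_i^2 - \frac{(\sum_i X_i)^2}{n}}
\end{align*}
```

Setting both partial derivatives to zero gives two linear equations in $a_Y$ and $b_Y$; solving the first for $a_Y$ and substituting into the second yields the slope formula, and the intercept then follows from the slope.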
The equation to predict Y from X is $Y' = a_Y + b_Y X$.

To estimate the constants, we use

$$ b_Y = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}, \qquad a_Y = \bar{y} - b_Y \bar{x} $$

Hints:
If you compare the slope formula to that for Pearson's r, you will notice that both the numerator and denominator of this formula also appear in the Pearson's r formula (which will probably already have been calculated by the time you reach this step). Don't recalculate these values; just pull them from the Pearson's r calculations.
You must calculate the slope (b) before calculating the y-intercept (a).
It is risky to predict for values of X far beyond those observed.

Example

Subject   Height (x)   Weight (y)   x^2      y^2       xy
1         77           185          5929     34225     14245
2         65           110          4225     12100     7150
3         60           100          3600     10000     6000
4         72           200          5184     40000     14400
5         69           135          4761     18225     9315
SUM       343          730          23699    114550    51110

$\bar{x} = 68.6$, $\bar{y} = 146$

$$ b = \frac{51110 - \frac{(343)(730)}{5}}{23699 - \frac{(343)^2}{5}} = \frac{51110 - 50078}{23699 - 23529.8} = \frac{1032}{169.2} = 6.10 $$

$$ a = 146 - 6.10(68.6) = 146 - 418.46 = -272.46 $$

The regression equation is as follows:

$$ y = -272.46 + 6.10x $$

Thus, for every 1 inch increase in height, we predict a 6.10 pound increase in weight.
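The worked example can be reproduced with a few lines of Python. This sketch follows the same order as the hand calculation (slope first, then intercept) and, like the slides, rounds the slope to two decimals before computing the intercept. The variable names are illustrative.

```python
# Height (inches) and weight (pounds) for the five subjects.
heights = [77, 65, 60, 72, 69]
weights = [185, 110, 100, 200, 135]

n = len(heights)
sum_x = sum(heights)                                     # 343
sum_y = sum(weights)                                     # 730
sum_x2 = sum(x * x for x in heights)                     # 23699
sum_xy = sum(x * y for x, y in zip(heights, weights))    # 51110

mean_x = sum_x / n                                       # 68.6
mean_y = sum_y / n                                       # 146.0

# Slope first: (51110 - 50078) / (23699 - 23529.8) = 1032 / 169.2
slope = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)

# The slides round the slope to 6.10 before computing the intercept.
slope_rounded = round(slope, 2)
intercept = mean_y - slope_rounded * mean_x

# Intercept from the unrounded slope, for comparison.
intercept_exact = mean_y - slope * mean_x

print(f"slope     = {slope_rounded:.2f}")
print(f"intercept = {intercept:.2f} (using the rounded slope)")
print(f"intercept = {intercept_exact:.2f} (using the unrounded slope)")
```

Carrying the unrounded slope through gives an intercept of about -272.41 rather than -272.46. The difference comes only from rounding and does not change the interpretation.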
Regression Line Properties:

For every x value, we can generate a predicted y value on the regression line: simply plug a given x value into the regression equation.
These predicted y values are designated $\hat{y}$ (y hat).
Example: The predicted weight for someone who is 60 inches tall is:

$$ \hat{y} = -272.46 + 6.10(60) = 93.54 $$

Prediction Error:

For every value of x, we will predict a new y value based on the regression equation. There will be some distance between the actual y value and the new predicted y value.
To determine how well a regression line fits the data, we need to measure the difference between each actual y value and the corresponding predicted value $\hat{y}$.
Example:
Our predicted $\hat{y}$ value is 93.54.
The actual weight (y value) for someone who is 60 inches tall is 100 pounds.
The difference (6.46) is the prediction error, $y - \hat{y}$.

Prediction error can be visually represented by measuring the vertical distance between the actual y data point and the predicted point on the line.

[Figure: regression line with one data point; labels "Actual Score", "Predicted Score", and "Prediction Error"]

Conceptually, if we added up all of these error values, we would get a measure of total prediction error. (Signed errors cancel, so the errors are squared before they are summed.)
The regression line generated by the equation is the one line that minimizes the total prediction error (the sum of squared errors); thus the term "line of best fit".
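To see the prediction errors for all five subjects at once, the regression equation from the example (intercept -272.46, slope 6.10) can be applied to each height. The function name `predict_weight` below is a made-up helper, not something defined in the slides.

```python
# Regression constants taken from the worked example.
INTERCEPT = -272.46
SLOPE = 6.10

# Height (inches) and weight (pounds) for the five subjects.
heights = [77, 65, 60, 72, 69]
weights = [185, 110, 100, 200, 135]


def predict_weight(height):
    """Predicted weight (y hat) for a given height; illustrative helper."""
    return INTERCEPT + SLOPE * height


predicted = [predict_weight(h) for h in heights]
errors = [y - y_hat for y, y_hat in zip(weights, predicted)]

print("height  actual  predicted   error")
for h, y, y_hat, e in zip(heights, weights, predicted, errors):
    print(f"{h:6d}  {y:6d}  {y_hat:9.2f}  {e:6.2f}")

# Signed errors cancel out, so they are squared before summing.
sum_errors = sum(errors)
sum_squared_errors = sum(e ** 2 for e in errors)
print(f"sum of errors         = {sum_errors:.2f}")
print(f"sum of squared errors = {sum_squared_errors:.2f}")
```

The row for the 60-inch subject reproduces the predicted value 93.54 and the error 6.46 from the example. The signed errors sum to essentially zero, which is why the squared errors are the quantity the regression line minimizes.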
STANDARD ERROR OF ESTIMATE: Provides a measure of the average distance between a regression line (predicted scores) and the actual data points; provides information about the accuracy of the predictions.
Conceptually similar to the standard deviation, in that it provides an average measure of distance.
Found by averaging the squared distances $(y - \hat{y})$ between actual scores and predicted scores, then taking the square root:

$$ SEE = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}} $$

Standard Error Properties:
As the correlation increases (gets closer to 1.0 or -1.0), the data points are more tightly clustered around the regression line, resulting in better prediction (less prediction error).
As the correlation gets smaller (approaches 0), the data points are spread further out from the regression line, resulting in poorer prediction (greater prediction error).

Predicting X from Y

The constants in the equation to predict Y from X derive from minimizing the sum of squared deviations (SSD) between observed and predicted Y.
The constants in the equation to predict X from Y derive from minimizing the SSD between observed and predicted X.

[Figure: observed and predicted X for one data point, with the horizontal distance between them marked]

$$ X' = a_X + b_X Y $$

$$ b_X = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sum Y^2 - \frac{(\sum Y)^2}{n}}, \qquad a_X = \bar{X} - b_X \bar{Y} $$
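Both ideas on this page can be tried on the five height/weight pairs. The sketch below computes the standard error of estimate for predicting weight from height, then the separate constants for predicting height from weight. Variable names are illustrative, and the slopes are left unrounded.

```python
from math import sqrt

# Height (inches, X) and weight (pounds, Y) for the five subjects.
heights = [77, 65, 60, 72, 69]
weights = [185, 110, 100, 200, 135]

n = len(heights)
mean_x = sum(heights) / n
mean_y = sum(weights) / n

sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, weights))
ss_x = sum((x - mean_x) ** 2 for x in heights)
ss_y = sum((y - mean_y) ** 2 for y in weights)

# Predict Y (weight) from X (height).
b_y = sp / ss_x
a_y = mean_y - b_y * mean_x

# Standard error of estimate: sqrt(sum of squared errors / (n - 2)).
squared_errors = [(y - (a_y + b_y * x)) ** 2 for x, y in zip(heights, weights)]
see = sqrt(sum(squared_errors) / (n - 2))

# Predict X (height) from Y (weight): same numerator, different denominator.
b_x = sp / ss_y
a_x = mean_x - b_x * mean_y

r_squared = sp ** 2 / (ss_x * ss_y)

print(f"weight' = {a_y:.2f} + {b_y:.2f} * height, SEE = {see:.2f} pounds")
print(f"height' = {a_x:.2f} + {b_x:.4f} * weight")
print(f"b_y * b_x = {b_y * b_x:.4f}, r^2 = {r_squared:.4f}")
```

The two lines are not algebraic inverses of each other. Their slopes multiply to $r^2$, so they coincide only when the correlation is perfect.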
Interpreting Regression Analysis:

Regression cannot be used to extrapolate values that fall outside the range of our actual data values; we cannot predict extreme scores that fall beyond the range of our actual scores.
Example: Using our previously determined regression equation, y = 6.10x - 272.46, predict the weight for someone who is 40 inches tall:

$$ \hat{y} = 6.10(40) - 272.46 = 244 - 272.46 = -28.46 $$

This leaves us with a negative weight.
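One simple way to guard against this mistake in practice is to check whether a requested x value lies inside the observed range before trusting the prediction. The following is a minimal sketch under that assumption; `predict_weight` and its return convention are invented here for illustration.

```python
# Regression constants and observed heights from the worked example.
INTERCEPT = -272.46
SLOPE = 6.10
OBSERVED_HEIGHTS = [77, 65, 60, 72, 69]


def predict_weight(height):
    """Return (predicted weight, in_range flag) for a height in inches.

    Illustrative helper: in_range is False when the height lies outside
    the heights the regression line was fitted to.
    """
    in_range = min(OBSERVED_HEIGHTS) <= height <= max(OBSERVED_HEIGHTS)
    return INTERCEPT + SLOPE * height, in_range


for height in (70, 40):
    predicted, in_range = predict_weight(height)
    note = "" if in_range else "  (outside observed heights; not trustworthy)"
    print(f"height {height}: predicted weight {predicted:.2f}{note}")
```

For a 40-inch height the function still returns the arithmetic result of -28.46, but flags it as an extrapolation, since the observed heights run only from 60 to 77 inches.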