Chapter 10
Section 10-2: Correlation
Section 10-3: Regression
Section 10-4: Variation and Prediction Intervals

The relationship between TWO variables
So far we have dealt with data obtained from one variable (either categorical or quantitative). In this chapter we will explore the relationship between two quantitative variables.

Response and Explanatory Variables
In most studies involving two variables, each variable has a role. We distinguish between:
- the response variable - the outcome of the study
- the explanatory variable - the variable that claims to explain, predict, or affect the response.

Scatterplots
In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph. Typically, the explanatory (independent) variable is plotted on the x axis and the response (dependent) variable is plotted on the y axis.

Example 1: Highway Signs
A Pennsylvania research firm conducted a study in which 30 drivers (ages 18 to 82) were sampled, and for each one the maximum distance at which he/she could read a newly designed sign was determined. The goal of this study was to explore the relationship between a driver's age and the maximum distance at which signs were legible, and then use the study's findings to improve safety for older drivers. Since the purpose of this study is to explore the effect of age on maximum legibility distance, the explanatory variable is Age and the response variable is Distance.
Scatterplot Example 2
Here we have two quantitative variables for each of 16 students:
- How many beers they drank
- Their blood alcohol level (BAC)
We are interested in the relationship between the two variables: how is one affected by changes in the other one?

Student  Beers  BAC
1        5      0.10
2        2      0.03
3        9      0.19
4        8      0.12
5        3      0.04
6        7      0.095
7        3      0.07
8        5      0.06
9        3      0.02
10       5      0.05
11       4      0.07
12       6      0.10
13       5      0.085
14       7      0.09
15       1      0.01
16       4      0.05

Scatterplot example
The response (dependent) variable is plotted against the explanatory (independent) variable. Some plots don't have clear explanatory and response variables. Example: do calories explain sodium amounts in hot dogs, or is it the other way around?

Describing a Scatterplot
Form: the general shape - linear, clusters, nonlinear, or no relationship.
Direction: A positive (or increasing) relationship means that an increase in one of the variables is associated with an increase in the other. A negative (or decreasing) relationship means that an increase in one of the variables is associated with a decrease in the other.
Strength: The strength of the relationship is determined by how closely the data follow the form of the relationship.
Outliers: deviations from the pattern.

Back to Example 1
Form: linear. Direction: negative. Strength: moderately strong. Outliers: there do not appear to be any.

Back to Example 2
Form: linear. Direction: positive. Strength: strong. Outliers: there do not appear to be any.

A weak relationship: for a particular state median household income, you can't predict the state per capita income very well.
A very strong relationship: the daily amount of gas consumed can be predicted quite accurately for a given temperature value.
How to scale a scatterplot
Same data for all four plots: using an inappropriate scale for a scatterplot can give an incorrect impression. The straight-line pattern in the lower plot appears stronger because of the surrounding space. Both variables should be given a similar amount of space:
- Plot roughly square
- Points should occupy all the plot space (no blank space)

Example 3
Form: linear. Direction: positive. Strength: weak. Outliers: 3.

Example 4
Form: linear. Direction: negative. Strength: medium-strong. Outliers: none.

Adding categorical variables to scatterplots
Use a different plotting symbol for each category, e.g. "+" for northeastern states and a different symbol for midwestern states.

The correlation coefficient, r
The correlation coefficient is a measure of the direction and strength of a linear relationship. It is calculated using the mean and the standard deviation of both the x and y variables. The formal name for r is the Pearson product moment correlation coefficient. It is named after the English statistician Karl Pearson (1857-1936).
Correlation: Back to Ex.1
Calculation: r is calculated using the following formula:

r = (1/(n-1)) Σ [(x - x̄)/s_x] · [(y - ȳ)/s_y] = (1/(n-1)) Σ z_x z_y

It looks scary, I know, but here's the basic idea: convert x and y to standardized values (z-scores), and find their average product (well, almost - divide by n-1 rather than n).

r ranges from -1 to +1. r quantifies the strength and direction of a linear relationship between two quantitative variables. Strength: how closely the points follow a straight line. Direction: positive when individuals with higher x values tend to have higher values of y.

Caution using correlation
Use correlation only for linear relationships.

Influential points
Correlations are calculated using means and standard deviations and thus are NOT resistant to outliers. Just moving one point away from the general trend here decreases the correlation from 0.91 to 0.75. The correlation is heavily influenced by outliers.

Properties of r
- Correlation requires that both variables be quantitative.
- r has no units.
- r only measures the strength of a linear relationship.
- r ranges from -1 to 1.
- r is negative if the form of the relationship is negative; r is positive if the form of the relationship is positive.
- r is closer to ±1 when the relationship is strong.
- r is unchanged if you interchange x and y.
- r is unchanged if you make a linear change of scale (e.g. from feet to inches).

http://www.seeingstatistics.com/seeing1999/reg/corr.html
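The z-score formula and the last two properties can be checked directly. Below is a minimal sketch in Python (an assumption: the slides themselves use a TI calculator, not Python), using the 16-student beer/BAC data from Example 2. The function name `correlation` is ours, not the slides'.

```python
# Sketch of the z-score formula for r, on the Example 2 beer/BAC data.
from statistics import mean, stdev

beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac   = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
         0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

def correlation(x, y):
    """Pearson r: sum of products of z-scores, divided by n - 1."""
    n = len(x)
    zx = [(v - mean(x)) / stdev(x) for v in x]   # stdev uses n - 1
    zy = [(v - mean(y)) / stdev(y) for v in y]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

r = correlation(beers, bac)                      # roughly 0.89: strong and positive
r_swapped = correlation(bac, beers)              # unchanged if x and y trade places
r_inches = correlation([12 * b for b in beers], bac)  # unchanged by a linear rescale
```

Swapping the lists or rescaling one of them leaves r untouched, exactly as the property list above claims.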
How to find r using the calculator
1st step: enter your two lists (explanatory and response variables).
STAT → EDIT → 1: Edit. L1: enter your values of the explanatory variable; L2: enter your values of the response variable.
2nd step: find the correlation coefficient.
STAT → CALC → 8: LinReg(a+bx), then LinReg(a+bx) L1,L2.
r is the correlation coefficient.

BUT association does not imply causation!
Even if two variables have a high correlation coefficient, it does not mean that the explanatory variable CAUSED the changes in the response variable.

Association does not imply causation!
Example 1: During the months of March and April of a certain year, the weekly weight increases of a puppy in New York were collected. For the same time frame, the retail price increases of snowshoes in Alaska were collected. The data were examined and found to have a very strong linear correlation.

Puppy weight in New York (pounds)   Snowshoe price in Alaska (dollars)
8                                   32.45
8.5                                 32.95
9                                   33.45
9.6                                 34.00
10.1                                34.50
10.7                                35.10
11.5                                35.63

So, this must mean that the weight increase of a puppy in New York is causing snowshoe prices in Alaska to increase, or the price increases of snowshoes are causing the puppy's weight to increase. Of course this is not true!

The moral of this example is: be careful what you infer from your statistical analyses. Unfortunately, usually the situation is not as obvious as this one. Be sure your relationship makes sense. Also keep in mind that other factors may be involved in a potential cause-and-effect relationship.

Association does not imply causation!
Example 2: In the early 1930s the relationship between the human population (response variable) of Oldenburg, Germany, and the number of storks nesting in the town (explanatory variable) was investigated. The correlation coefficient turned out to be 0.97. Does this mean that storks bring babies? Can you give a possible explanation for this strong association?
The thymus example (shocking)
The thymus, a gland in your neck, unlike other organs of the body, doesn't get larger as you grow; it actually gets smaller. Imagine the situation: many infants are dying of what seem to be respiratory obstructions, so doctors begin to do autopsies on infants who die with respiratory symptoms. They have done many autopsies in the past on adults who died of various causes, so they decide to rely on those autopsy results for comparison. What stands out most when they do autopsies on the infants is that they all have thymus glands that look too big in comparison to their body size. So they conclude that the respiratory problems are caused by an enlarged thymus. It became quite common in the early 1900s for surgeons to treat respiratory problems in children by removing the thymus. In particular, in 1912, Dr. Charles Mayo published an article recommending removal of the thymus. He made this recommendation even though a third of the children who were operated on died. What's the lurking variable in this shocking example?

What could be a lurking variable in these examples?
- There is a strong positive correlation between the foot length of K-12 students and reading scores.
- Students who use tutors have lower test scores than students who don't.
- A survey shows a strong positive correlation between the percentage of a country's inhabitants that use cell phones and the life expectancy in that country.

Important: Association does not imply causation!
One of the most common mistakes people make is to observe a high correlation between two variables and conclude that one must be causing the other. Scatterplots and correlation do NOT demonstrate causation. It's hard to establish the nature and direction of causation, and there is always the risk of overlooking lurking variables.
Simpson's Paradox
A relationship between two variables that holds for each individual value of a third variable can be changed or even reversed when the data for all values of the third variable are combined. This is Simpson's paradox. Simpson's paradox is an example of the effect of lurking variables on an observed association.

Simpson's paradox
Simpson's paradox is a severe form of confounding in which there is a reversal in the direction of an association, caused by a lurking variable. Overall direction of association: positive. But when we color different habitats in different colors, the data are separated by a lurking variable (different habitats) into a series of negative linear associations.
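The reversal can be checked numerically. Here is a hedged sketch in Python (the language is an assumption; the slides use a TI calculator), using the college-acceptance counts from the example that follows. Each cell holds (successes, failures).

```python
# Simpson's paradox: within-group rates vs. pooled rates.
data = {
    "Business": {"Male": (18, 102), "Female": (24, 96)},
    "Art":      {"Male": (180, 60), "Female": (64, 16)},
}

def rate(successes, failures):
    """Acceptance rate = successes / total."""
    return successes / (successes + failures)

# Within EVERY major, females are accepted at the higher rate...
for major, counts in data.items():
    assert rate(*counts["Female"]) > rate(*counts["Male"])

# ...yet pooling the majors reverses the direction of the association.
male_s = sum(c["Male"][0] for c in data.values())     # 198
male_f = sum(c["Male"][1] for c in data.values())     # 162
fem_s  = sum(c["Female"][0] for c in data.values())   # 88
fem_f  = sum(c["Female"][1] for c in data.values())   # 112
assert rate(male_s, male_f) > rate(fem_s, fem_f)      # 0.55 vs 0.44
```

The lurking variable (major) drives the reversal: males mostly applied to the easier-to-enter major, so their pooled rate looks better even though females do better within each major.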
Simpson's Paradox Example
Is acceptance into a college (response variable) predicted by gender (explanatory variable)? Consider these data:

          Success  Failure  Total
Male        198      162     360
Female       88      112     200

Proportions accepted by gender:
Male success rate = 198 / 360 = 0.55
Female success rate = 88 / 200 = 0.44
Conclude: males were accepted at a higher rate than females.

Broken down according to the lurking variable "major":

Business  Success  Failure  Total
Male         18      102     120
Female       24       96     120
Male proportion = 18 / 120 = 0.15
Female proportion = 24 / 120 = 0.20
Therefore: males were accepted at a lower rate than females.

Art       Success  Failure  Total
Male        180       60     240
Female       64       16      80
Male proportion = 180 / 240 = 0.75
Female proportion = 64 / 80 = 0.80
Therefore: males were accepted at a lower rate than females.

Summary of causation
Association does not imply causation!
Association does not imply causation!
Association does not imply causation!
The issue of lurking variables and Simpson's paradox occurs equally in both quantitative and categorical situations. So, in either case, be careful with your conclusion, and remember: association does not imply causation!

Explanatory variables
A researcher wants to know if taking increasing amounts of ginkgo biloba will result in increased memory ability for different students. He administers it to the students in doses of 250 milligrams, 500 milligrams, and 1000 milligrams. What is the explanatory variable in this study?
a) Amount of ginkgo biloba given to each student.
b) Change in memory ability.
c) Size of the student's brain.
d) Whether the student takes the ginkgo biloba.

Numeric bivariate data
The first step in analyzing numeric bivariate data is to
a) Measure strength of linear relationship.
b) Create a scatterplot.
c) Model linear relationship with regression line.

Scatterplots
Look at the following scatterplot. Choose which description BEST fits the plot.
a) Direction: positive, form: linear, strength: strong
b) Direction: negative, form: linear, strength: strong
c) Direction: positive, form: non-linear, strength: weak
d) Direction: negative, form: non-linear, strength: weak
e) No relationship
Scatterplots
Look at the following scatterplot. Choose which description BEST fits the plot.
a) Direction: positive, form: non-linear, strength: strong
b) Direction: negative, form: linear, strength: strong
c) Direction: positive, form: linear, strength: weak
d) Direction: positive, form: non-linear, strength: weak
e) No relationship

Scatterplots
Look at the following scatterplot. Choose which description BEST fits the plot.
a) Direction: positive, form: non-linear, strength: strong
b) Direction: negative, form: linear, strength: strong
c) Direction: positive, form: linear, strength: weak
d) Direction: positive, form: non-linear, strength: weak
e) No relationship

Scatterplots
Which of the following scatterplots displays the stronger linear relationship?
a) Plot A
b) Plot B
c) Same for both

Correlation
For which of the following situations would it be appropriate to calculate r, the correlation coefficient?
a) Time spent studying for statistics exam and score on the exam.
b) Income for county employees and their respective counties.
c) Eye color and hair color of selected participants.
d) Party affiliation of senators and their vote on presidential impeachment.

Correlation
What is a FALSE statement about r, the correlation coefficient?
a) It is a product of z-scores of X and Y.
b) It can range in value from -1 to 1.
c) It measures the strength and direction of the linear relationship between X and Y.
d) It is measured in units of the X variable.

Correlation
Which scatterplot would give a larger value for r?
a) Plot A
b) Plot B
c) It would be the same for both plots.
Correlation
True or False? Computing r as a measure of the strength of the relationship between X and Y is appropriate for the data in the following scatterplot:
a) True
b) False

Correlation tells us about the strength (scatter) and direction of the linear relationship between two quantitative variables. In addition, we would like to have a numerical description of how both variables vary together. For instance, is one variable increasing faster than the other one? And we would like to make predictions based on that numerical description. But which line best describes our data?

A regression line
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x. (Examples 1 and 2 revisited: scatterplots with candidate lines drawn through the points.)

Which line to use?
In most cases, no line will pass exactly through all the points in a scatterplot. Different people will draw different lines by eye. We need a way to draw a regression line that doesn't depend on our guess as to where the line should go. We will call this best line the least-squares regression line.
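"Best" here means smallest sum of squared vertical errors. The following Python sketch (an assumption: the slides use a TI calculator, and the closed-form fit used below is only introduced on the next slides) checks, on the Example 2 beer/BAC data, that the least-squares line beats nearby hand-drawn alternatives.

```python
# The least-squares line minimizes the sum of squared errors (SSE).
def sse(x, y, a, b):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

x = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]           # beers
y = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
     0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]          # BAC

# Closed-form least-squares fit.
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar

best = sse(x, y, a, b)
# Perturbing the intercept or the slope strictly increases the SSE:
assert best < sse(x, y, a + 0.01, b)
assert best < sse(x, y, a, b * 1.1)
```

Any line you draw by eye corresponds to some other (a, b), and its SSE can only be larger, which is exactly why the least-squares line does not depend on anyone's guess.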
Least-squares Regression Line
For a set of data points (x, y), the least-squares regression line is the line for which the sum of squared errors is as small as possible.

Equation of the Least-squares Regression Line
Book's notation:       ŷ = b₀ + b₁x
Calculator's notation: ŷ = a + bx
Here ŷ is the predicted value. All we need to do is calculate the intercept a and the slope b.
http://hadm.sph.sc.edu/courses/j716/demos/leastsquares/leastsquaresdemo.html

How to find a and b using the calculator
1st step: enter your two lists (explanatory and response variables).
STAT → EDIT → 1: Edit. L1: enter your values of the explanatory variable; L2: enter your values of the response variable.
2nd step: find the regression coefficients.
STAT → CALC → 8: LinReg(a+bx), then LinReg(a+bx) L1,L2.
a is the intercept, b is the slope.

Another way to find a and b
First we calculate the slope of the line from statistics we already know:
b = r · (s_y / s_x)
where r is the correlation, s_y is the standard deviation of the response variable y, and s_x is the standard deviation of the explanatory variable x.
Once we know b, the slope, we can calculate a, the y-intercept:
a = ȳ - b·x̄
where x̄ and ȳ are the sample means of the x and y variables.

Facts about least-squares regression
The distinction between explanatory and response variables is essential in regression. The least-squares regression line always passes through the point (x̄, ȳ).

Ex.1 AGAIN
ŷ = a + bx with a = 576 and b = -3:
Distance = 576 feet - 3 · Age
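The "another way" formulas can be tried out directly. A minimal Python sketch (the language is an assumption; the slides use a TI calculator), again on the Example 2 beer/BAC data, computes b and a from r, s_x, s_y and then confirms the stated fact that the line passes through (x̄, ȳ).

```python
# Slope and intercept from b = r * sy/sx and a = ybar - b * xbar.
from statistics import mean, stdev

x = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]           # beers
y = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
     0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]          # BAC

n = len(x)
# r as the (n - 1)-averaged product of z-scores, as defined earlier.
r = sum((xi - mean(x)) / stdev(x) * (yi - mean(y)) / stdev(y)
        for xi, yi in zip(x, y)) / (n - 1)

b = r * stdev(y) / stdev(x)    # slope
a = mean(y) - b * mean(x)      # intercept

# Fact check: the least-squares line always passes through (xbar, ybar).
assert abs((a + b * mean(x)) - mean(y)) < 1e-12
```

Because a is defined as ȳ - b·x̄, plugging x = x̄ into ŷ = a + bx returns exactly ȳ, so the fact holds for any data set, not just this one.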
Prediction: Interpolation
The equation of the least-squares regression line allows you to predict y for any x within the range studied. This is called interpolating.
Predict the maximum distance at which a sign is legible for a 60-year-old:
Distance = 576 feet - 3 · Age
Predicted distance = 576 - 3 · 60 = 396
396 feet is our best prediction for the maximum distance at which a sign is legible for a 60-year-old.

Prediction: Ex.1
Predict the maximum distance at which a sign is legible for a 90-year-old:
Predicted distance = 576 - 3 · 90 = 306
306 feet is our best prediction for the maximum distance at which a sign is legible for a 90-year-old. BUT this prediction is NOT RELIABLE. It is called EXTRAPOLATION.

Extrapolation
Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line. This can be a very silly thing to do, as seen here.

Example 2 AGAIN
ŷ = 0.0008 + 0.0144x
Nobody in the study drank 6.5 beers, but by finding the value of ŷ from the regression line for x = 6.5, we would expect a blood alcohol content of 0.094 mg/ml.
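The interpolation/extrapolation distinction for Example 1 can be sketched as a small Python function (the language and the function name `predict_distance` are assumptions, not from the slides). The range 18 to 82 comes from the ages actually sampled in Example 1.

```python
# Prediction with the Example 1 line, Distance = 576 - 3 * Age.
def predict_distance(age, lo=18, hi=82):
    """Return the predicted legibility distance; warn on extrapolation."""
    if not (lo <= age <= hi):
        print(f"Warning: age {age} is outside [{lo}, {hi}]: extrapolation, not reliable.")
    return 576 - 3 * age

d60 = predict_distance(60)   # 396 feet: interpolation, within the studied range
d90 = predict_distance(90)   # 306 feet: extrapolation, NOT reliable
```

The arithmetic is identical in both calls; only the trustworthiness differs, which is why a range check belongs next to any regression prediction.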
Residuals
The vertical distances from each point to the least-squares regression line are called residuals. The sum of these residuals is always 0. Points above the line have a positive residual; points below the line have a negative residual.
residual = observed y - predicted y = y - ŷ

Ex.1 AGAIN
For a 32-year-old driver with an observed distance of 410 feet:
Predicted: ŷ = 576 - 3 · 32 = 480
Residual: y - ŷ = 410 - 480 = -70

Sum of squared errors
Which least-squares regression line would have a smaller sum of squared errors?
a) The line in Plot A.
b) The line in Plot B.
c) It would be the same for both plots.

Slope
Look at the following scatterplot. What would be a correct interpretation of the slope?
a) As we increase our CO content by 1 mg, we increase the tar content by 1.01 mg.
b) As we increase our CO content by 0.66 mg, we increase the tar content by 1.01 mg.
c) As we increase our CO content by 0.66 mg, we increase the tar content by 0.66 mg.
d) As we increase our CO content by 1 mg, we increase the tar content by 0.66 mg.

Residuals
Look at the following least-squares regression line. Compare the residuals from the two points A and B.
a) Point A's would be greater than Point B's.
b) Point A's would be less than Point B's.
c) Point A's would be equal to Point B's.
d) There is not enough information.

Residuals
Residual equals
a) b) c) d) (the answer choices are formulas shown on the slide)
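Returning to the Ex.1 residual worked above, the computation is one line of Python (a sketch; the language and the helper name `residual` are assumptions, not from the slides).

```python
# Residual for Ex.1: observed distance 410 feet for a 32-year-old,
# predicted by the line Distance = 576 - 3 * Age.
def residual(x, y, a=576, b=-3):
    """residual = observed y - predicted y, where predicted y = a + b*x."""
    return y - (a + b * x)

res = residual(32, 410)   # predicted = 576 - 96 = 480, so residual = 410 - 480 = -70
```

The negative sign matches the picture: this point sits 70 feet below the regression line.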
Correlation or regression
Which of the following measures the direction and strength of the linear association between X and Y?
a) Correlation
b) Regression

Correlation or regression
Which of the following makes no distinction between explanatory and response variables?
a) Correlation
b) Regression

Correlation or regression
Which of the following is used for prediction?
a) Correlation
b) Regression

Regression line
A regression line always passes through the point
a) b) c) d) (the answer choices are shown on the slide)

Linear regression
The following graph shows the linear relationship between diamond size and price for diamonds of size 0.35 carats or less. Using this relationship to predict the price of a diamond that is 1 carat is considered
a) Extrapolation.
b) An influential observation.
c) Prediction.

Don't forget, the first test is next Wednesday, 3/4. It will cover Chapters 1, 2, 3, and 10.