The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL STAT 155 Introductory Statistics Lecture 10: Cautions about Regression and Correlation, Causation 10/03/06 Lecture 10 1
Review. Least-squares regression lines: equation and interpretation of the line; prediction using the line. Correlation and regression. Coefficient of determination.
Regression Diagnostics. Look at residuals (errors): a residual is the difference between an observed value of the response variable and the value predicted by the regression line, i.e., residual = $y - \hat{y}$. The sum of the least-squares residuals is always zero. Why?
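One way to see why the residuals sum to zero: the least-squares intercept is a = ȳ − b·x̄, so the fitted line passes through the point of means (x̄, ȳ), and the positive and negative residuals cancel. The sketch below checks this numerically. The data values and the helper name `fit_line` are made up for illustration and are not from the lecture.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar  # forces the line through (x_bar, y_bar)
    return a, b


# Made-up illustrative data (not from the lecture).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]

a, b = fit_line(xs, ys)

# residual = observed y minus predicted y-hat
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
residual_sum = sum(residuals)

print(f"intercept = {a:.3f}, slope = {b:.3f}")
print(f"sum of residuals = {residual_sum:.2e}")  # zero up to rounding error
```

The printed sum is not exactly 0 because of floating-point rounding, but it should be many orders of magnitude smaller than any single residual.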
Residual Plots. A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess the fit of a regression line.
Age vs. Height
Residual Plot. If the regression line captures the overall pattern of the data, the residuals should show no pattern; they should look totally random.
Residual plot examples: nonlinear pattern; nonconstant variation.
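A small numerical sketch of the nonlinear case: if the data follow a curve and we fit a straight line anyway, the residuals are not random but change sign in a systematic way. The data below (y = x² on x = 0, ..., 8) and the helper `fit_line` are assumptions chosen for illustration only.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Made-up data that follow a curve (y = x^2), not a straight line.
xs = [float(x) for x in range(9)]
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# A crude text version of a residual plot: one row per x with the sign of the residual.
for x, r in zip(xs, residuals):
    sign = "+" if r > 0 else "-"
    print(f"x = {x:.0f}   residual = {r:+8.3f}   {sign}")
```

The signs run positive, then negative, then positive again. That U-shaped pattern is what a "nonlinear" residual plot looks like, even though the residuals still sum to zero.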
Diabetes Patient: FPG vs. HbA. FPG: fasting plasma glucose. HbA: percent of red blood cells that have a glucose molecule attached. Both measure blood glucose, so we expect a positive association. 18 subjects, r = 0.4819. See the scatterplot on the next slide.
Diabetes Patient: FPG vs. HbA
Outliers and Influential Observations. An outlier is a point that lies outside the overall pattern of the other points. Outliers in the y direction have large residuals, but other outliers may not. An influential observation is a point whose removal would markedly change the regression line. Outliers in the x direction are often, but not always, influential.
Diabetes Patient: FPG vs. HbA
Outliers & Influential Obs. Outliers in the y direction can be spotted from the residual plot. Influential points can be identified by fitting regression lines with and without those points. Influential points are the more serious problem: they cannot be identified from the residual plot, although the scatterplot gives some hint.
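The with/without comparison can be sketched as follows. The numbers are invented (they are not the diabetes data), and `fit_line` is a hypothetical helper: six points lie close to a line with slope about 2, and one extra point sits far out in the x direction.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Made-up data: six points near a line, plus one point far out in x.
base_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
base_y = [2.2, 3.9, 6.1, 8.0, 9.8, 12.1]
extra_x, extra_y = 20.0, 5.0

a_without, b_without = fit_line(base_x, base_y)
a_with, b_with = fit_line(base_x + [extra_x], base_y + [extra_y])

print(f"slope without the extra point: {b_without:.3f}")
print(f"slope with the extra point:    {b_with:.3f}")

# Residuals from the fit that includes the extra point.
all_x = base_x + [extra_x]
all_y = base_y + [extra_y]
residuals = [y - (a_with + b_with * x) for x, y in zip(all_x, all_y)]
extra_residual = residuals[-1]
largest_other = max(abs(r) for r in residuals[:-1])

print(f"residual of the extra point:       {extra_residual:+.3f}")
print(f"largest residual among the others: {largest_other:.3f}")
```

In this example the extra point drags the line toward itself, so its own residual is smaller than that of some ordinary points. This is why the residual plot alone can miss an influential observation, while refitting without the point exposes it.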
Cautions about correlation and regression: linear only; DO NOT extrapolate; not resistant; beware lurking variables; beware correlations based on averaged data; the restricted-range problem.
Lurking Variable. A lurking (hidden) variable is a variable that has an important effect on the relationship among the variables in a study but is not included among the variables being studied. Example: SAT scores and college grades; lurking variable: IQ.
Lurking variables can create nonsense correlations. For the world's nations, let x be the number of TVs per person and y be the average life expectancy. There is a high positive correlation: nations with more TV sets have higher life expectancies. Could we lengthen the lives of people in Rwanda by shipping them more TVs? Lurking variable: the wealth of the nation. Rich nations have more TV sets. Rich nations also have longer life expectancies because of better nutrition, clean water, and better health care. There is no cause-and-effect tie between TV sets and length of life. Association vs. causation.
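The TV example can be mimicked with a toy data set. Everything below is invented for illustration: three hypothetical wealth levels play the role of the lurking variable z, TV counts are expressed per 1000 people, and the helper `corr` computes the ordinary correlation r. Both TVs and life expectancy rise with wealth, but within a wealth level they are constructed to be unrelated.

```python
from math import sqrt


def corr(xs, ys):
    """Pearson correlation r between two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Within-level deviations, chosen so TVs and life expectancy are unrelated
# once wealth is held fixed (their products sum to zero).
tv_dev = [-1, -1, 1, 1]
life_dev = [-1, 1, -1, 1]

# Each row: (wealth level z, TVs per 1000 people, life expectancy in years).
nations = []
for z in (1, 2, 3):  # hypothetical levels: 1 = poor, 2 = middle, 3 = rich
    for dt, dl in zip(tv_dev, life_dev):
        nations.append((z, 100 * z + 10 * dt, 50 + 10 * z + 2 * dl))

r_all = corr([row[1] for row in nations], [row[2] for row in nations])
print(f"r across all nations: {r_all:.3f}")

r_within = {}
for z in (1, 2, 3):
    group = [row for row in nations if row[0] == z]
    r_within[z] = corr([row[1] for row in group], [row[2] for row in group])
    print(f"r within wealth level {z}: {r_within[z]:.3f}")
```

Across all nations r is high, yet within each wealth level it is zero. In this toy setup the entire association comes from wealth, which is the sense in which the TV correlation is "nonsense".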
Misleading correlation (two clusters)
Beware correlations based on averaged data. A correlation based on averages over many individuals is usually higher than the correlation between the same variables based on data for individuals. Examples: age vs. height; (basketball) score % vs. practice time.
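A toy version of the age vs. height example shows the effect. The ages, heights (in cm), and the helper `corr` are all made up for this sketch: five children at each age, with a wide spread of heights around an average that grows 5 cm per year.

```python
from math import sqrt


def corr(xs, ys):
    """Pearson correlation r between two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


ages = [10, 11, 12, 13]
devs = [-12, -6, 0, 6, 12]  # made-up spread of individual heights (cm)

# Individual-level data: five children per age.
ind_age, ind_height = [], []
for age in ages:
    for d in devs:
        ind_age.append(age)
        ind_height.append(130 + 5 * (age - 10) + d)

# Averaged data: one mean height per age.
avg_height = []
for age in ages:
    heights = [h for a, h in zip(ind_age, ind_height) if a == age]
    avg_height.append(sum(heights) / len(heights))

r_individuals = corr(ind_age, ind_height)
r_averages = corr(ages, avg_height)

print(f"r for individual children: {r_individuals:.3f}")
print(f"r for age-group averages:  {r_averages:.3f}")
```

Averaging removes the child-to-child variation within each age, so the averages fall on a much tighter line than the individuals do. The perfectly linear averages here are an artifact of the symmetric made-up deviations; real averages would be close to, not exactly on, a line.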
The restricted-range problem. A restricted-range problem occurs when one does not get to observe the full range of the variables. When data suffer from restricted range, $r$ and $r^2$ are lower than they would be if the full range could be observed. Example: SAT scores vs. college GPA, Princeton vs. Generic State College (Ex 2.22).
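The restricted-range effect can be sketched with invented SAT and GPA values (these are not the data from Ex 2.22). The same linear trend and the same scatter are used throughout; the only change is that the second correlation is computed on the high-SAT students alone. The cutoff of 1300 and the helper `corr` are assumptions made for illustration.

```python
from math import sqrt


def corr(xs, ys):
    """Pearson correlation r between two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up data: two students at each SAT score, one 0.4 GPA points below
# and one 0.4 above a linear trend.
sat, gpa = [], []
for x in range(400, 1700, 100):
    for d in (-0.4, 0.4):
        sat.append(x)
        gpa.append(1.0 + 0.0015 * x + d)

r_full = corr(sat, gpa)

# Restrict to a selective school that only admits SAT >= 1300 (hypothetical cutoff).
high = [(x, y) for x, y in zip(sat, gpa) if x >= 1300]
r_restricted = corr([x for x, _ in high], [y for _, y in high])

print(f"r over the full SAT range: {r_full:.3f}, r^2 = {r_full ** 2:.3f}")
print(f"r for SAT >= 1300 only:    {r_restricted:.3f}, r^2 = {r_restricted ** 2:.3f}")
```

The scatter around the trend is identical in both cases. With a narrow range of x, the trend explains less of the variation in y relative to that scatter, so both r and r² drop.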
Causation vs. Association. Some studies aim to establish causation. Examples of causation: increased drinking of alcohol causes a decrease in coordination; smoking and lung cancer. Examples of association: the two examples above; SAT scores and freshman-year GPA.
Association does not imply causation. An association between two variables x and y can reflect many types of relationship among x, y, and one or more lurking variables. An association between a predictor x and a response y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.
Explaining Association
Explaining Association: Causation. Cause-and-effect examples: amount of fertilizer and yield of corn; weight of a car and its MPG; dosage of a drug and the survival rate of the mice.
Explaining Association: Common Response. Lurking variables: both x and y change in response to changes in z, the lurking variable. There may be no direct causal link between x and y. Examples: SAT scores vs. college GPA (IQ, attitude); monthly flow of money into stock mutual funds vs. rate of return for the stock market (market condition, investor attitude).
Explaining Association: Confounding. Two variables are confounded when their effects on a response variable are mixed together. One explanatory variable may be confounded with other explanatory variables or with lurking variables. Examples: more education leads to higher income (confounded with family background); religious people live longer (confounded with lifestyle).
Establishing causation. The only compelling method: a designed experiment (more in Chapter 3). Hot disputes: Does gun control reduce violent crime? Does meat consumption in your diet cause heart disease? Does smoking cause lung cancer?
Does smoking CAUSE lung cancer? Causation: smoking causes lung cancer. Common response: people who have a genetic predisposition to lung cancer also have a genetic predisposition to smoking. Confounding: people who drink too much, don't exercise, eat unhealthy foods, etc. are more likely to get lung cancer as a result of their lifestyle. Such people may be more likely to be smokers as well.
Some guidelines when a designed experiment is impossible: strong association; association consistent across various studies; higher dose associated with stronger responses; the cause precedes the effect in time; plausibility.
Take Home Message: residual plots; outliers and influential observations; lurking variables; cautions about correlation and regression; explaining associations (causation, common response, confounding); how to establish causation.