STAT 155 Introductory Statistics. Lecture 10: Cautions about Regression and Correlation, Causation


The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL STAT 155 Introductory Statistics Lecture 10: Cautions about Regression and Correlation, Causation 10/03/06 Lecture 10 1

Review. Least-squares regression lines: equation and interpretation of the line; prediction using the line. Correlation and regression. Coefficient of determination.

Regression Diagnostics. Look at the residuals (errors). A residual is the difference between an observed value of the response variable and the value predicted by the regression line, i.e., residual = y − ŷ. The sum of the least-squares residuals is always zero. Why?
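The zero-sum property can be checked numerically. The sketch below fits a least-squares line to a small made-up data set (the numbers and the helper names `fit_line` and `residuals` are illustrative assumptions, not from the lecture) and sums the residuals. The reason is visible in the intercept formula: a = ȳ − b·x̄ forces the line through the point (x̄, ȳ), so the positive and negative residuals cancel.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # line passes through (x_bar, y_bar)
    return intercept, slope


def residuals(xs, ys):
    """Observed y minus predicted y-hat for each observation."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.3]

res = residuals(xs, ys)
print(sum(res))  # zero up to floating-point rounding
```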

Residual Plots. A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess the fit of a regression line.

Age vs. Height (figure).

Residual Plot. If the regression line captures the overall pattern of the data, the residuals should show no pattern: the points in the residual plot should look totally random.

Residual plots that signal a problem (figure): a nonlinear pattern; nonconstant variation.
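As a rough illustration of the nonlinear case, the sketch below fits a straight line to data that actually follow a curve (y = x², chosen here purely as an assumption for the example) and prints the residuals. They are positive at both ends and negative in the middle, which is the curved pattern a residual plot would reveal.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical curved data: the true relationship is quadratic, not linear.
xs = list(range(9))          # 0, 1, ..., 8
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
res = [y - (a + b * x) for x, y in zip(xs, ys)]

# Residuals against x: a U shape instead of random scatter.
for x, r in zip(xs, res):
    print(x, round(r, 2))
```

The residuals still sum to zero, so the sum alone cannot detect the problem; the pattern against x is what matters.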

Diabetes Patients: FPG vs. HbA. FPG: fasting plasma glucose. HbA: percent of red blood cells that have a glucose molecule attached. Both measure blood glucose, so we expect a positive association. There are 18 subjects, and r = 0.4819. See the scatterplot on the next slide.

Diabetes Patients: FPG vs. HbA (scatterplot).

Outliers and Influential Observations. An outlier is a point that lies outside the overall pattern of the other points. Outliers in the y direction have large residuals, but other outliers may not. An observation is influential if removing it would markedly change the regression line. Outliers in the x direction are often, but not always, influential.

Diabetes Patients: FPG vs. HbA (figure).

Outliers and Influential Observations. Outliers in the y direction can be spotted from the residual plot. Influential points are more serious and cannot always be identified from the residual plot, because they pull the line toward themselves. They can be identified by fitting regression lines with and without those points. The scatterplot gives us some hint.
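The with/without comparison can be sketched as follows. The data are made up for illustration (five points on a rising trend plus one point far out in the x direction); they are not the diabetes data from the lecture. Refitting without the extreme point changes the slope from roughly zero to roughly one, and that point's own residual is not the largest in the full fit, which is why a residual plot can miss it.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data: the last point is an outlier in the x direction.
xs = [1, 2, 3, 4, 5, 15]
ys = [1.1, 1.9, 3.2, 3.8, 5.0, 2.0]

a_with, b_with = fit_line(xs, ys)
a_without, b_without = fit_line(xs[:-1], ys[:-1])

# Residuals from the fit that includes the extreme point.
res_with = [y - (a_with + b_with * x) for x, y in zip(xs, ys)]

print("slope with point:   ", round(b_with, 3))
print("slope without point:", round(b_without, 3))
print("residual of the extreme point:", round(res_with[-1], 3))
print("largest |residual| among the others:",
      round(max(abs(r) for r in res_with[:-1]), 3))
```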

Cautions about correlation and regression. They describe linear relationships only. Do not extrapolate. They are not resistant to outliers and influential observations. Beware lurking variables. Beware correlations based on averaged data. Beware the restricted-range problem.

Lurking Variable. A lurking (hidden) variable is a variable that has an important effect on the relationship among the variables in a study but is not included among the variables being studied. Example: SAT scores and college grades (lurking variable: IQ).

Lurking variables can create nonsense correlations. For the world's nations, let x be the number of TVs per person and y be the average life expectancy. There is a high positive correlation: nations with more TV sets have higher life expectancies. Could we lengthen the lives of people in Rwanda by shipping them more TVs? The lurking variable is the wealth of the nation. Rich nations have more TV sets, and rich nations have longer life expectancies because of better nutrition, clean water, and better health care. There is no cause-and-effect tie between TV sets and length of life. This is association, not causation.

Misleading correlation (figure: two clusters).
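A small made-up example shows how two clusters can mislead. In the sketch below (the points and the helper `correlation` are assumptions for illustration), the relationship inside each cluster is perfectly negative, yet the correlation computed over both clusters together is strongly positive, because the gap between the clusters dominates.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_of(points):
    """Correlation for a list of (x, y) pairs."""
    return correlation([p[0] for p in points], [p[1] for p in points])


# Hypothetical clusters: within each one, y falls as x rises.
cluster_a = [(0, 2), (1, 1), (2, 0)]
cluster_b = [(10, 12), (11, 11), (12, 10)]

r_a = r_of(cluster_a)
r_b = r_of(cluster_b)
r_all = r_of(cluster_a + cluster_b)

print("r within cluster A:", r_a)                   # -1
print("r within cluster B:", r_b)                   # -1
print("r over both clusters:", round(r_all, 3))     # about 0.948
```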

Beware correlations based on averaged data. A correlation based on averages over many individuals is usually higher than the correlation between the same variables based on data for individuals. Examples: age vs. height; (basketball) scoring percentage vs. practice time.
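The effect of averaging can be seen in a small simulation. Everything in the sketch below is an assumption made for illustration: ten hypothetical ages, 50 simulated individuals per age, and a response equal to age plus random noise in arbitrary units. Averaging within each age removes most of the individual-to-individual variation, so the correlation of the averages is much closer to 1.

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)       # fixed seed so the run is repeatable
per_age = 50         # simulated individuals at each age

ind_x, ind_y = [], []   # individual-level data
avg_x, avg_y = [], []   # one averaged point per age

for age in range(1, 11):
    # Simulated response: age plus individual noise (arbitrary units).
    values = [age + random.gauss(0, 3) for _ in range(per_age)]
    ind_x.extend([age] * per_age)
    ind_y.extend(values)
    avg_x.append(age)
    avg_y.append(sum(values) / per_age)

r_individual = correlation(ind_x, ind_y)
r_averaged = correlation(avg_x, avg_y)

print("r for individuals:", round(r_individual, 3))
print("r for averages:   ", round(r_averaged, 3))
```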

The restricted-range problem. A restricted-range problem occurs when one does not get to observe the full range of the variables. When data suffer from restricted range, r and r² are lower than they would be if the full range could be observed. Examples: SAT scores vs. college GPA; Princeton vs. Generic State College (Ex 2.22).
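A simulation sketch of the restricted-range effect follows. The setup is entirely hypothetical: x takes the values 0 through 99, y is x plus random noise, and the "restricted" sample keeps only observations with x of 80 or more (loosely analogous to observing only high scorers). The underlying relationship is the same in both samples, but r is much lower in the restricted one.

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(1)   # fixed seed so the run is repeatable

# Full range: 20 simulated observations at each x from 0 to 99.
xs, ys = [], []
for x in range(100):
    for _ in range(20):
        xs.append(x)
        ys.append(x + random.gauss(0, 15))

r_full = correlation(xs, ys)

# Restricted range: keep only the observations with x >= 80.
top = [(x, y) for x, y in zip(xs, ys) if x >= 80]
r_restricted = correlation([p[0] for p in top], [p[1] for p in top])

print("r over the full range:      ", round(r_full, 3))
print("r over the restricted range:", round(r_restricted, 3))
```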

Causation vs. Association. Some studies aim to establish causation. Examples of causation: increased drinking of alcohol causes a decrease in coordination; smoking and lung cancer. Examples of association: the two examples above (a causal relationship is also an association); SAT scores and freshman-year GPA.

Association does not imply causation. An association between two variables x and y can reflect many types of relationships among x, y, and one or more lurking variables. An association between a predictor x and a response y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.

Explaining Association (figure).

Explaining Association: Causation. Cause-and-effect examples: amount of fertilizer and yield of corn; weight of a car and its MPG; dosage of a drug and the survival rate of the mice.

Explaining Association: Common Response. Both x and y change in response to changes in z, the lurking variable. There may be no direct causal link between x and y. Examples: SAT scores vs. college GPA (lurking variables: IQ, attitude); monthly flow of money into stock mutual funds vs. rate of return for the stock market (lurking variables: market conditions, investor attitude).
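Common response can be imitated in a simulation. In the sketch below, which is a made-up model and not data from the lecture, x and y are each generated as z plus independent noise, so neither has any effect on the other. They are still clearly correlated overall, and the correlation nearly vanishes among observations that share almost the same value of z.

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(2)   # fixed seed so the run is repeatable
n = 5000

# z is the lurking variable; x and y each respond to z, not to each other.
zs = [random.gauss(0, 1) for _ in range(n)]
xs = [z + random.gauss(0, 1) for z in zs]
ys = [z + random.gauss(0, 1) for z in zs]

r_all = correlation(xs, ys)

# Look only at observations whose z values are nearly identical.
similar = [(x, y) for x, y, z in zip(xs, ys, zs) if abs(z) < 0.1]
r_similar = correlation([p[0] for p in similar], [p[1] for p in similar])

print("r between x and y, all observations:", round(r_all, 3))
print("r between x and y, similar z only:  ", round(r_similar, 3))
```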

Explaining Association: Confounding. Two variables are confounded when their effects on a response variable are mixed together. One explanatory variable may be confounded with other explanatory variables or with lurking variables. Examples: more education leads to higher income (confounded with family background); religious people live longer (confounded with lifestyle).

Establishing causation. The only compelling method is a designed experiment (more in Chapter 3). Hot disputes: Does gun control reduce violent crime? Does meat consumption cause heart disease? Does smoking cause lung cancer?

Does smoking CAUSE lung cancer? Causation: smoking causes lung cancer. Common response: people who have a genetic predisposition to lung cancer also have a genetic predisposition to smoking. Confounding: people who drink too much, don't exercise, eat unhealthy foods, etc. are more likely to get lung cancer as a result of their lifestyle. Such people may be more likely to be smokers as well.

Some guidelines when a designed experiment is impossible: the association is strong; the association is consistent across various studies; higher doses are associated with stronger responses; the cause precedes the effect in time; the cause is plausible.

Take Home Message. Residual plots. Outliers and influential observations. Lurking variables. Cautions about correlation and regression. Explaining associations: causation, common response, confounding. How to establish causation?