The importance of graphing the data: Anscombe's regression examples


The importance of graphing the data: Anscombe's regression examples
Bruce Weaver
Northern Health Research Conference
Nipissing University, North Bay
May 30-31, 2008
B. Weaver, NHRC 2008

The Objective
To demonstrate that good graphs are an essential part of linear regression analysis.

Not this kind of regression analysis

This kind of regression analysis

A very brief primer on simple linear regression

Simple linear regression
A model in which X is used to predict Y.
Y is a continuous variable with interval-scale properties.
In the prototypical case, X is also a continuous variable with interval-scale properties.
Example: Y = distance in a 6-minute walk test; X = FEV1.

Back to high school
Equation for a straight line: Y = bX + a
b = slope of the line = the rise over the run
a = intercept = the value of Y when X = 0

Example of a straight line
Gym membership: annual fee = $100, fee per visit = $2.
Let X = the number of visits to the gym.
Let Y = the total cost.
Y = 2X + 100
Let X = 200 visits to the gym.
Total cost = 2(200) + 100 = $500

What if the relationship is imperfect?
Straight line for a perfect relationship: $Y = bX + a$
Straight line for an imperfect relationship: $\hat{Y} = bX + a$ or $Y' = bX + a$
($\hat{Y}$ and $Y'$ are two different symbols for the predicted value of Y.)

R-squared
R-squared = the proportion of variability in Y that is accounted for by the explanatory variables in the model.
For a simple linear regression model (i.e., one predictor variable), R-squared = the proportion of the variability in Y that can be accounted for by the linear relationship between X and Y.
The adjusted R-squared corrects for the upward bias in R-squared.
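As a rough illustration of the adjustment, the sketch below uses the usual textbook formula, adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1), where p is the number of predictors. The slides do not say which formula the software used, so treat the formula and the function name `adjusted_r_squared` as assumptions.

```python
def adjusted_r_squared(r_squared: float, n: int, n_predictors: int) -> float:
    """Textbook adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - n_predictors - 1)


# Illustrative inputs: R-squared = 0.67 from N = 11 cases and one predictor.
adj = adjusted_r_squared(0.67, n=11, n_predictors=1)
print(round(adj, 2))
```

With a small sample the adjustment is noticeable; as n grows relative to the number of predictors, the adjusted value approaches the unadjusted one.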

Anscombe's examples (1973)
Frank Anscombe devised 4 sets of X-Y pairs.
He performed simple linear regression for each data set.
Here are the results.

Means & Standard Deviations

Data Set   N    X Mean   X SD   Y Mean   Y SD
1          11   9.00     3.32   7.50     2.03
2          11   9.00     3.32   7.50     2.03
3          11   9.00     3.32   7.50     2.03
4          11   9.00     3.32   7.50     2.03

The means and SDs for the 4 data sets are identical to two decimals.

Correlations between X and Y

Data Set   Pearson r   R-squared   Adj. R-sq   SE
1          0.82        0.67        0.63        1.24
2          0.82        0.67        0.63        1.24
3          0.82        0.67        0.63        1.24
4          0.82        0.67        0.63        1.24

Correlations, R-squared, adjusted R-squared, and standard errors are all identical to two decimals.

ANOVA Summary Tables

Data Set   Source       SS       df   MS       F        p
1          Regression   27.490    1   27.490   18.003   0.002
           Residual     13.742    9    1.527
           Total        41.232   10
2          Regression   27.470    1   27.470   17.972   0.002
           Residual     13.756    9    1.528
           Total        41.226   10
3          Regression   27.500    1   27.500   17.966   0.002
           Residual     13.776    9    1.531
           Total        41.276   10
4          Regression   27.510    1   27.510   17.990   0.002
           Residual     13.763    9    1.529
           Total        41.273   10
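The ANOVA tables tie back to R-squared: it is the regression sum of squares divided by the total sum of squares, and F is the ratio of the two mean squares. The sketch below recomputes both from the SS values in the tables; the names `ANOVA`, `r_squared`, and `f_ratio` are illustrative only.

```python
# (SS regression, SS residual, F as reported) copied from the ANOVA summary tables.
ANOVA = {
    1: (27.490, 13.742, 18.003),
    2: (27.470, 13.756, 17.972),
    3: (27.500, 13.776, 17.966),
    4: (27.510, 13.763, 17.990),
}
DF_REGRESSION, DF_RESIDUAL = 1, 9  # one predictor, N = 11


def r_squared(ss_regression: float, ss_residual: float) -> float:
    """Proportion of the total sum of squares accounted for by the regression."""
    return ss_regression / (ss_regression + ss_residual)


def f_ratio(ss_regression: float, ss_residual: float) -> float:
    """Mean square for regression divided by mean square for residual."""
    return (ss_regression / DF_REGRESSION) / (ss_residual / DF_RESIDUAL)


for data_set, (ss_reg, ss_res, reported_f) in ANOVA.items():
    print(data_set,
          round(r_squared(ss_reg, ss_res), 2),
          round(f_ratio(ss_reg, ss_res), 3),
          reported_f)
```

Small differences in the last decimal of F are expected, because the tabled sums of squares are themselves rounded.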

The Regression Coefficients

Data Set   Term       B      SE      t      p       95% CI Lower   95% CI Upper
1          Constant   3.00   1.124   2.67   0.026   0.459          5.544
           X          0.50   0.118   4.24   0.002   0.233          0.766
2          Constant   3.00   1.124   2.67   0.026   0.459          5.546
           X          0.50   0.118   4.24   0.002   0.233          0.766
3          Constant   3.00   1.125   2.67   0.026   0.455          5.547
           X          0.50   0.118   4.24   0.002   0.233          0.767
4          Constant   3.00   1.125   2.67   0.026   0.456          5.544
           X          0.50   0.118   4.24   0.002   0.233          0.767

For all 4 models, Y = 0.5(X) + 3
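Readers who want to check these numbers can refit the four models themselves. The sketch below is not taken from the slides: it uses the data values published in Anscombe (1973), listed in Anscombe's own order I to IV, which may not correspond to the Data Set 1 to 4 numbering used in this talk. The helper `summarize` is a hypothetical name.

```python
from math import sqrt

# Anscombe's (1973) published data, in his original order I-IV.
# Assumption: this order need not match the "Data Set 1-4" numbering on the slides.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return the least-squares slope, intercept, Pearson r, and r squared."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r = sxy / sqrt(sxx * syy)
    return {
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "r": r,
        "r_squared": r * r,
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in ANSCOMBE.items()}
for name, stats in SUMMARIES.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Each set should give a slope near 0.5, an intercept near 3, and r near 0.82, which is the point of the example: the numbers alone cannot tell the four data sets apart.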

Which Model is Best?
Judging by everything we've just seen, it appears that the models are all equally good.
But if that were true, I wouldn't be doing this talk!
It is well known that good graphs are an essential part of data analysis (Tukey, 1977; Tufte, 1997).
Let's look at some graphs that show the relationship between X and Y.

Scatter-plot for Data Set 1
[Figure annotations: "10 data points"; "Influential point"]
Not a good model.

Scatter-plot for Data Set 2
Perfect linear relationship except for one outlier.
Better model than for Data Set 1, but still not great.

Scatter-plot for Data Set 3
Wrong model! The relationship between X and Y is curvilinear, not linear!
The model should include both $X$ and $X^2$ as predictors.

Scatter-plot for Data Set 4
This is a good-looking plot.
No influential points; a straight line provides a good fit.

Summary
The usual summary statistics for the 4 regression models were virtually identical.
Scatter-plots revealed that only one of the 4 data sets gave us a good model.
Appropriate graphs are an essential part of data analysis.

What about multivariable models?
Scatter-plots are useful for simple linear regression models (i.e., only one predictor variable).
But often we have multiple, or multivariable, regression models (i.e., 2 or more predictor variables).
In that case, it is more common to assess the fit of the model by looking at residual plots.

What is a residual?
In linear regression, a residual is an error in prediction.
Residual = $(Y - \hat{Y})$ = (actual score - predicted score)

Set 1: Scatter-plot vs. Residual Plot
[Figure: scatter-plot (Y vs. X) beside residual plot (residual vs. predicted value of Y)]

Set 2: Scatter-plot vs. Residual Plot
[Figure: scatter-plot beside residual plot (residual vs. predicted value of Y)]

Set 3: Scatter-plot vs. Residual Plot
[Figure: scatter-plot beside residual plot (residual vs. predicted value of Y); annotation: "Runs of same-sign residuals"]

Set 4: Scatter-plot vs. Residual Plot
[Figure: scatter-plot beside residual plot (residual vs. predicted value of Y)]

Summary
The usual summary statistics for the 4 regression models were virtually identical.
Scatter-plots revealed that only one of the 4 data sets gave us a good model.
Residual plots reveal the same thing, and have the advantage of being applicable to multivariable regression models.
Appropriate graphs are an essential part of data analysis.

Questions?
"I think you should be more explicit here in step 2."

References
Anscombe FJ. (1973). Graphs in statistical analysis. The American Statistician, 27, 17-21.
Tufte ER. (1997). Visual Explanations: Images and Quantities, Evidence and Narrative (3rd Ed.). Graphics Press: Cheshire.
Tukey JW. (1977). Exploratory data analysis. Addison-Wesley: Reading, Mass.

Extra Slides

Just as one would expect!
The experimentalist comes running excitedly into the theorist's office, waving a graph taken off his latest experiment. "Hmmm," says the theorist, "That's exactly where you'd expect to see that peak. Here's the reason (long logical explanation follows)." In the middle of it, the experimentalist says "Wait a minute", studies the chart for a second, and says, "Oops, this is upside down." He fixes it. "Hmmm," says the theorist, "you'd expect to see a dip in exactly that position. Here's the reason...".

Best-fitting line: Least squares criterion
Many lines could be placed on the scatter-plot, but only one of them is considered the best-fitting line.
The most common criterion for best-fitting is that the sum of the squared errors in prediction is minimized.
This is called the least-squares criterion.

Illustration of Least Squares
[Figure: scatter-plot with fitted line; annotation: "Error in prediction"]

Illustration of Least Squares
[Figure: the same plot with a square drawn on each error; annotations: "Squared error in prediction"; "Error = 0 for this point, so no square"]

Illustration of Least Squares
Sum of squared errors = the sum of the areas of all these squares.
For any other regression line, the sum of the squared errors would be greater.
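The claim that any other line gives a larger sum of squared errors can be checked numerically. The sketch below fits a line with the standard closed-form least-squares formulas and then nudges the slope and intercept; the five data points are made up for illustration, and the function names are hypothetical.

```python
def least_squares(xs, ys):
    """Closed-form slope and intercept that minimize the sum of squared errors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sum_squared_errors(xs, ys, slope, intercept):
    """Sum over all points of (actual Y - predicted Y) squared."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, not from the talk.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b, a = least_squares(xs, ys)
best = sum_squared_errors(xs, ys, b, a)

# Every nudged line should have a larger sum of squared errors.
others = [
    sum_squared_errors(xs, ys, b + d_slope, a + d_intercept)
    for d_slope in (-0.2, 0.0, 0.2)
    for d_intercept in (-0.5, 0.0, 0.5)
    if (d_slope, d_intercept) != (0.0, 0.0)
]
print(round(best, 3), round(min(others), 3))
```

Trying a handful of nearby lines is only a spot check, not a proof, but it matches what the squares in the figure show.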

What is a residual plot?
Scatter-plot with:
X = the fitted (or predicted) value of Y
Y = the residual (i.e., the error in prediction)
Residuals should be independent of the fitted value of Y.
There should be no serial correlation in the residuals (e.g., long runs of same-sign residuals).
Both of these problems (plus some others) can be detected via residual plots.
Advantage of residual plots: they can be used in multivariable (i.e., multi-predictor) regression models.
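The two ingredients of a residual plot, and one of the warning signs listed above, can be computed without drawing anything. The sketch below fits a straight line to deliberately curved, made-up data (Y = X squared) and then measures the longest run of same-sign residuals when the points are ordered by fitted value. The helper names are hypothetical, and the run length is only a crude screen, not a formal test.

```python
def fit_and_residuals(xs, ys):
    """Fit a least-squares line; return fitted values and residuals (Y - fitted)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    return fitted, residuals


def longest_same_sign_run(fitted, residuals):
    """Longest run of residuals sharing a sign, ordered by fitted value."""
    ordered = [res for _, res in sorted(zip(fitted, residuals))]
    longest = current = 0
    previous_sign = 0
    for res in ordered:
        sign = (res > 0) - (res < 0)
        if sign != 0 and sign == previous_sign:
            current += 1
        else:
            current = 1 if sign != 0 else 0
        previous_sign = sign
        longest = max(longest, current)
    return longest


# Made-up curvilinear data: a straight line is the wrong model here.
xs = list(range(11))
ys = [x ** 2 for x in xs]

fitted, residuals = fit_and_residuals(xs, ys)
LONGEST_RUN = longest_same_sign_run(fitted, residuals)
print(LONGEST_RUN, [round(res, 1) for res in residuals])
```

The residuals are positive at both ends and negative in the middle, the same U-shaped pattern a residual plot would show for a curvilinear relationship. Plotting `residuals` against `fitted` in any charting tool gives the plot described above.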

Examples of residual plots
[Figure: three residual plots (residual vs. predicted Y) showing a curvilinear relationship, an outlier, and heteroscedasticity]

Example of a good residual plot

Example of a zig-zag pattern
You do not want to see this kind of zig-zag pattern in the residual plot.

Simple linear regression & correlation
Pearson r = the correlation.
It measures the direction and strength of the linear association between X and Y.
It ranges from -1 to +1.

Direction of the linear relationship
Positive relationship: as X increases, Y increases.
Negative relationship: as X increases, Y decreases.

Perfect vs. Imperfect Relationship
[Figure: two scatter-plots, one showing a perfect relationship and one showing an imperfect relationship]

r-squared
The square of Pearson r is a measure of how well the regression model fits the observed data.
It gives the proportion of variability in Y that is accounted for by the linear relationship between X and Y.
E.g., let r = 0.6 (or -0.6). Then $r^2 = 0.36$.
So 36% of the variability in the Y-scores is accounted for by the linear relationship between X and Y.