The importance of graphing the data: Anscombe's regression examples
Bruce Weaver
Northern Health Research Conference
Nipissing University, North Bay
May 30-31, 2008
B. Weaver, NHRC 2008
The Objective
To demonstrate that good graphs are an essential part of linear regression analysis.
Not this kind of regression analysis
This kind of regression analysis
A very brief primer on simple linear regression
Simple linear regression
- A model in which X is used to predict Y.
- Y is a continuous variable with interval-scale properties.
- In the prototypical case, X is also a continuous variable with interval-scale properties.
- Example: Y = distance in a 6-minute walk test; X = FEV1
Back to high school
Equation for a straight line: Y = bX + a
- b = the slope of the line = the rise over the run
- a = the intercept = the value of Y when X = 0
Example of a straight line
Gym membership:
- Annual fee = $100
- Fee per visit = $2
- Let X = the number of visits to the gym
- Let Y = the total cost
- Y = 2X + 100
- Let X = 200 visits to the gym: total cost = 2(200) + 100 = $500
What if the relationship is imperfect?
- Straight line for a perfect relationship: Y = bX + a
- Straight line for an imperfect relationship: Y' = bX + a, or Ŷ = bX + a
- Y' and Ŷ are two different symbols for the predicted value of Y
R-squared
- R-squared = the proportion of variability in Y that is accounted for by the explanatory variables in the model.
- For a simple linear regression model (i.e., one predictor variable), R-squared = the proportion of the variability in Y that can be accounted for by the linear relationship between X and Y.
- The adjusted R-squared corrects for the upward bias in R-squared.
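As a rough illustration of the adjustment, the sketch below applies the standard textbook formula, adjusted R-squared = 1 - (1 - R²)(n - 1)/(n - p - 1), to the values reported later in the talk (R-squared = 0.67, N = 11, one predictor). The function name `adjusted_r_squared` is made up for this example; the slides do not say which formula the software used, so treat the match as a plausibility check only.

```python
def adjusted_r_squared(r_squared: float, n: int, n_predictors: int) -> float:
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - n_predictors - 1)


# Values reported in the talk for each of the four data sets.
adj = adjusted_r_squared(r_squared=0.67, n=11, n_predictors=1)
print(round(adj, 2))  # 0.63
```

With only 11 cases the penalty is visible (0.67 drops to 0.63); with a large N the two values would be nearly the same.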
Anscombe's examples (1973)
- Frank Anscombe devised 4 sets of X-Y pairs
- He performed simple linear regression for each data set
- Here are the results
Means & Standard Deviations
                      X               Y
Data Set    N     Mean    SD      Mean    SD
   1       11     9.00   3.32     7.50   2.03
   2       11     9.00   3.32     7.50   2.03
   3       11     9.00   3.32     7.50   2.03
   4       11     9.00   3.32     7.50   2.03
The means and SDs for the 4 data sets are identical to two decimals.
Correlations between X and Y
Data Set   Pearson r   R-squared   Adj. R-sq    SE
   1         0.82        0.67        0.63      1.24
   2         0.82        0.67        0.63      1.24
   3         0.82        0.67        0.63      1.24
   4         0.82        0.67        0.63      1.24
Correlations, R-squared, adjusted R-squared, and standard errors are all identical to two decimals.
ANOVA Summary Tables
Data Set   Source          SS      df     MS        F       p
   1       Regression    27.490     1   27.490   18.003   0.002
           Residual      13.742     9    1.527
           Total         41.232    10
   2       Regression    27.470     1   27.470   17.972   0.002
           Residual      13.756     9    1.528
           Total         41.226    10
   3       Regression    27.500     1   27.500   17.966   0.002
           Residual      13.776     9    1.531
           Total         41.276    10
   4       Regression    27.510     1   27.510   17.990   0.002
           Residual      13.763     9    1.529
           Total         41.273    10
The Regression Coefficients
                                                       95% CI
Data Set   Term         B      SE       t       p     Lower   Upper
   1       Constant    3.00   1.124    2.67   0.026   0.459   5.544
           X           0.50   0.118    4.24   0.002   0.233   0.766
   2       Constant    3.00   1.124    2.67   0.026   0.459   5.546
           X           0.50   0.118    4.24   0.002   0.233   0.766
   3       Constant    3.00   1.125    2.67   0.026   0.455   5.547
           X           0.50   0.118    4.24   0.002   0.233   0.767
   4       Constant    3.00   1.125    2.67   0.026   0.456   5.544
           X           0.50   0.118    4.24   0.002   0.233   0.767
For all 4 models, Y' = 0.5(X) + 3
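The near-identical coefficients can be reproduced with a few lines of code. The sketch below is an illustration only: the slides do not list the raw data, so the values are taken from Anscombe (1973) and labelled with his original numbering (I to IV), which may not match the numbering used in this talk. The helper name `fit_line` is invented for the example.

```python
from math import sqrt

# Anscombe's (1973) published data, in his original order I-IV.
# Assumption: these values come from the paper, not from the slides.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Least-squares slope, intercept, and Pearson r for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


results = {name: fit_line(x, y) for name, (x, y) in ANSCOMBE.items()}
for name, (b, a, r) in results.items():
    print(f"Set {name}: Y' = {b:.2f}X + {a:.2f}, r = {r:.2f}, r^2 = {r * r:.2f}")
```

Every set prints the same line to two decimals, which is exactly why the summary tables alone cannot distinguish the four models.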
Which Model is Best?
- Judging by everything we've just seen, it appears that the models are all equally good
- But if that were true, I wouldn't be doing this talk!
- It is well known that good graphs are an essential part of data analysis (Tukey, 1977; Tufte, 1997)
- Let's look at some graphs that show the relationship between X and Y
Scatter-plot for Data Set 1
[Scatter-plot with two annotations: "10 data points" and "Influential point"]
Not a good model.
Scatter-plot for Data Set 2
- Perfect linear relationship except for one outlier
- Better model than for Data Set 1, but still not great.
Scatter-plot for Data Set 3
- Wrong model! The relationship between X and Y is curvilinear, not linear!
- The model should include both X and X² as predictors.
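The effect of adding X² can be checked numerically. The sketch below fits a straight line and a quadratic to the curved set from Anscombe (1973) (his set II; the raw values are an assumption taken from the paper, since the slides do not list them). It solves the normal equations with a small hand-written Gaussian elimination, and the helper names are invented for this example; in practice a statistics package would do this.

```python
# Anscombe's (1973) curved data set (his set II); not listed in the slides.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def solve_linear_system(a, b):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (m[r][n] - tail) / m[r][r]
    return sol


def fit_polynomial(x, y, degree):
    """Least-squares coefficients [c0, c1, ...] for c0 + c1*x + c2*x^2 + ..."""
    size = degree + 1
    ata = [[sum(xi ** (i + j) for xi in x) for j in range(size)] for i in range(size)]
    aty = [sum((xi ** i) * yi for xi, yi in zip(x, y)) for i in range(size)]
    return solve_linear_system(ata, aty)


def r_squared(x, y, coefs):
    """Proportion of variability in y accounted for by the fitted polynomial."""
    mean_y = sum(y) / len(y)
    predicted = [sum(c * xi ** k for k, c in enumerate(coefs)) for xi in x]
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


linear_coefs = fit_polynomial(x, y, degree=1)
quad_coefs = fit_polynomial(x, y, degree=2)
r2_linear = r_squared(x, y, linear_coefs)
r2_quadratic = r_squared(x, y, quad_coefs)
print(f"Linear:    R-squared = {r2_linear:.3f}")
print(f"Quadratic: R-squared = {r2_quadratic:.3f}")
```

For these data the quadratic model accounts for essentially all of the variability in Y, while the straight line accounts for about two thirds of it.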
Scatter-plot for Data Set 4
- This is a good-looking plot.
- No influential points; a straight line provides a good fit.
Summary
- The usual summary statistics for the 4 regression models were virtually identical
- Scatter-plots revealed that only one of the 4 data sets gave us a good model
- Appropriate graphs are an essential part of data analysis
What about multivariable models?
- Scatter-plots are useful for simple linear regression models (i.e., only one predictor variable)
- But often we have multiple (or multivariable) regression models (i.e., 2 or more predictor variables)
- In that case, it is more common to assess the fit of the model by looking at residual plots
What is a residual?
- In linear regression, a residual is an error in prediction
- Residual = (Y - Y') = (actual score - predicted score)
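A minimal sketch of the residual calculation follows, using the first data set from Anscombe (1973); the raw values are an assumption taken from the paper, since the slides do not list them. Each residual is the actual Y minus the value predicted by the fitted line.

```python
# Anscombe's (1973) first data set, as published (not listed in the slides).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = actual score - predicted score.
predicted = [slope * xi + intercept for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]
ss_residual = sum(e ** 2 for e in residuals)

for xi, yi, pi, e in zip(x, y, predicted, residuals):
    print(f"X = {xi:2d}  Y = {yi:5.2f}  predicted = {pi:5.2f}  residual = {e:+.2f}")
print(f"Sum of residuals = {sum(residuals):.6f}")
print(f"Residual sum of squares = {ss_residual:.2f}")
```

The residuals from a least-squares line with an intercept sum to zero (up to rounding), and their squared sum is the Residual SS that appears in an ANOVA summary table.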
Set 1: Scatter-plot vs. Residual Plot
[Left panel: scatter-plot of Y against X. Right panel: residual plot of the residual against the predicted value of Y.]
Set 2: Scatter-plot vs. Residual Plot
[Left panel: scatter-plot. Right panel: residual plot of the residual against the predicted value of Y.]
Set 3: Scatter-plot vs. Residual Plot
[Left panel: scatter-plot. Right panel: residual plot of the residual against the predicted value of Y, annotated "Runs of same-sign residuals".]
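Runs of same-sign residuals can also be counted directly. The sketch below is a rough illustration that uses the curved data set from Anscombe (1973) (his set II; the raw values are an assumption, since the slides do not list them). It orders the residuals by X, which gives the same order as the predicted value of Y when the slope is positive, and counts how many times the sign changes. The function name `count_sign_runs` is invented for this example.

```python
# Anscombe's (1973) curved data set (his set II); not listed in the slides.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals ordered by X (same order as the predicted values, since slope > 0).
ordered = sorted(zip(x, y))
residuals = [yi - (slope * xi + intercept) for xi, yi in ordered]


def count_sign_runs(values):
    """Count maximal runs of consecutive values that share the same sign."""
    signs = [v > 0 for v in values]
    return 1 + sum(1 for prev, cur in zip(signs, signs[1:]) if prev != cur)


runs = count_sign_runs(residuals)
print("".join("+" if e > 0 else "-" for e in residuals))
print(f"Number of runs: {runs}")
```

Eleven residuals falling into only a few long runs (negative, then positive, then negative) is the pattern a straight line leaves behind when the true relationship is curved.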
Set 4: Scatter-plot vs. Residual Plot
[Left panel: scatter-plot. Right panel: residual plot of the residual against the predicted value of Y.]
Summary
- The usual summary statistics for the 4 regression models were virtually identical
- Scatter-plots revealed that only one of the 4 data sets gave us a good model
- Residual plots reveal the same thing, and have the advantage of being applicable to multivariable regression models
- Appropriate graphs are an essential part of data analysis
Questions?
[Cartoon caption: "I think you should be more explicit here in step 2."]
References
- Anscombe FJ. (1973). Graphs in statistical analysis. The American Statistician, 27, 17-21.
- Tufte ER. (1997). Visual Explanations: Images and Quantities, Evidence and Narrative (3rd Ed.). Graphics Press: Cheshire.
- Tukey JW. (1977). Exploratory data analysis. Addison-Wesley: Reading, Mass.
Extra Slides
Just as one would expect!
The experimentalist comes running excitedly into the theorist's office, waving a graph taken off his latest experiment. "Hmmm," says the theorist, "That's exactly where you'd expect to see that peak. Here's the reason (long logical explanation follows)." In the middle of it, the experimentalist says "Wait a minute", studies the chart for a second, and says, "Oops, this is upside down." He fixes it. "Hmmm," says the theorist, "you'd expect to see a dip in exactly that position. Here's the reason...".
Best-fitting line: Least squares criterion
- Many lines could be placed on the scatter-plot, but only one of them is considered the best-fitting line.
- The most common criterion for best-fitting is that the sum of the squared errors in prediction is minimized.
- This is called the least-squares criterion.
Illustration of Least Squares
[Figure annotation: error in prediction]
Illustration of Least Squares
[Figure annotations: squared error in prediction; error = 0 for this point, so no square]
Illustration of Least Squares
- Sum of squared errors = the sum of the areas of all these squares
- For any other regression line, the sum of the squared errors would be greater.
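The claim that any other line gives a larger sum of squared errors can be checked by trial. The sketch below uses the first data set from Anscombe (1973) (an assumption; the slides do not list the raw values) and compares the least-squares line with a few arbitrarily chosen alternatives. The helper `sum_squared_errors` and the alternative lines are invented for this example.

```python
# Anscombe's (1973) first data set, as published (not listed in the slides).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def sum_squared_errors(slope, intercept):
    """Sum of squared errors in prediction for the line Y' = slope*X + intercept."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))


# Least-squares slope and intercept.
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
best_slope = sxy / sxx
best_intercept = mean_y - best_slope * mean_x
sse_best = sum_squared_errors(best_slope, best_intercept)

# A few arbitrary alternative lines (hypothetical choices for illustration).
alternatives = [(0.6, 2.0), (0.4, 4.0), (0.5, 3.5), (0.55, 2.5)]
sse_alternatives = [sum_squared_errors(b, a) for b, a in alternatives]

print(f"Least-squares line: Y' = {best_slope:.2f}X + {best_intercept:.2f}, SSE = {sse_best:.3f}")
for (b, a), sse in zip(alternatives, sse_alternatives):
    print(f"Alternative line:   Y' = {b:.2f}X + {a:.2f}, SSE = {sse:.3f}")
```

Trying a handful of lines is not a proof, but it shows the idea: moving the slope or the intercept away from the least-squares values makes the total area of the squares grow.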
What is a residual plot?
- Scatter-plot with: X = the fitted (or predicted) value of Y; Y = the residual (i.e., the error in prediction)
- Residuals should be independent of the fitted value of Y
- There should be no serial correlation in the residuals (e.g., long runs of same-sign residuals)
- Both of these problems (plus some others) can be detected via residual plots
- Advantage of residual plots: they can be used in multivariable (i.e., multi-predictor) regression models
Examples of residual plots
[Three residual plots (residual against predicted Y): curvilinear relationship; outlier; heteroscedasticity]
Example of a good residual plot
Example of a zig-zag pattern
You do not want to see this kind of zig-zag pattern in the residual plot.
Simple linear regression & correlation
- Pearson r = the correlation
- It measures the direction and strength of the linear association between X and Y
- It ranges from -1 to +1
Direction of the linear relationship
- Positive relationship: as X increases, Y increases
- Negative relationship: as X increases, Y decreases
Perfect vs. Imperfect Relationship
[Two scatter-plot panels: perfect relationship; imperfect relationship]
r-squared
- The square of Pearson r is a measure of how well the regression model fits the observed data
- It gives the proportion of variability in Y that is accounted for by the linear relationship between X and Y.
- E.g., let r = 0.6 (or -0.6); then r² = 0.36
- So 36% of the variability in the Y-scores is accounted for by the linear relationship between X and Y