A short primer on residual plots

Chapter 24

Contents
24.1 Linear Regression
24.2 ANOVA residual plots
24.3 Logistic Regression residual plots - Part I
24.4 Logistic Regression residual plots - Part II
24.5 Poisson Regression residual plots - Part I
24.6 Poisson Regression residual plots - Part II

The suggested citation for this chapter of notes is: Schwarz, C. J. (2015). A short primer on residual plots. In Course Notes for Beginning and Intermediate Statistics. Available at http://www.stat.sfu.ca/~cschwarz/coursenotes. Retrieved 2015-08-20.

Residual plots are one of the most important diagnostic tools available for model checking. However, residual plots can take a variety of forms, depending upon the type of model fitted, that can appear confusing at first glance.

At its simplest, the residual is defined as:

$\text{residual}_i = \text{observed}_i - \text{predicted}_i$

where the i-th residual is the difference between the observed and predicted values for the i-th observation.

These residuals are often standardized or studentized. Standardization occurs when all of the residuals are divided by a common, average standard deviation of the residuals. Studentization occurs when each individual residual is divided by its own standard deviation, which may vary among the residuals.

For example, in simple linear regression, the standardized residuals are divided by $\sqrt{MSE}$, which is an estimate of the common standard deviation about the regression line. However, residuals near the middle of the regression line (i.e. near $\bar{X}$) are less variable than residuals near the extremities of the line. The studentized residual is divided by $s\sqrt{1 - h_{ii}}$, where $s = \sqrt{MSE}$ and $h_{ii}$ is the leverage value for the i-th observation.
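The two scalings described above can be compared in a small numerical sketch. The data below are synthetic and purely illustrative (they are not the Fitness data), and the variable names are hypothetical. The sketch assumes a simple linear regression fit by least squares, with the leverages taken from the diagonal of the hat matrix.

```python
import numpy as np

# Synthetic data for illustration only (not the Fitness data set).
rng = np.random.default_rng(0)
x = np.linspace(50.0, 100.0, 25)
y = 80.0 - 0.4 * x + rng.normal(0.0, 4.0, size=x.size)

# Least-squares fit of y = b0 + b1 * x.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ beta
residual = y - predicted  # observed - predicted

# s = sqrt(MSE); simple regression leaves n - 2 degrees of freedom.
n, p = X.shape
s = np.sqrt(np.sum(residual**2) / (n - p))

# Leverages h_ii are the diagonal of the hat matrix X (X'X)^{-1} X'.
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

standardized = residual / s                      # common divisor
studentized = residual / (s * np.sqrt(1.0 - h))  # divisor varies with leverage
```

Because every leverage lies strictly between 0 and 1 here, each studentized residual is at least as large in absolute value as the corresponding standardized residual, and the difference is greatest for observations far from the mean of x, where the leverage is largest.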

Regardless of whether standardized or studentized residuals are used, these are plotted against the predicted values. A good model will have the residuals centered around zero, with a high proportion (about 95%) within ±2, and no pattern to the residuals.

24.1 Linear Regression

For example, consider the Fitness data set available in the JMP sample data library. This consists of measurements of the weight, age, pulse rates, and oxygen consumption of males and females as they completed a standardized fitness test. Consider the model:

$Oxy_i = \beta_0 + \beta_1 Weight_i + \epsilon_i$

or, in a simplified notation,

Oxy = Weight

This model was fit, and the resulting residual plot [1] is:

[Figure: studentized residuals plotted against predicted oxygen consumption, with reference lines at 0, 2, and -2.]

This shows a random scatter around zero with only a few points outside the ±2 limits.

Notice that in simple regression, the Y variable is continuous, as is the X variable. Consequently, predictions are also continuous, and so the plot of the residuals will show this random scatter (assuming the model fits well). Similar plots are obtained in multiple regression or ANCOVA models.

[1] This was constructed by (a) using the Analyze->Fit Model platform, (b) using the red triangle to save columns to the data table for the predicted oxygen consumption and the studentized residual, (c) using Graph Overlay to get the base plot, and (d) clicking on the Y axis and adding reference lines at 0, 2, and -2.

© 2015 Carl James Schwarz, 2015-08-20

24.2 ANOVA residual plots

Consider now comparisons of Y values among different treatment groups. For example, is there a difference in the mean oxygen consumption between males and females as sampled in the Fitness data set? The model is now:

Oxy = Sex

The model was fit, and the resulting residual plot [2] is:

[Figure: studentized residuals plotted against predicted oxygen consumption, with reference lines at 0, 2, and -2.]

At first glance, this plot does not show a random scatter, as there is a definite pattern with two vertical lines. However, on sober second thought, this is not surprising. There are only two levels of Sex, and so there are at most two distinct predicted values, one for males and one for females. All females will have the same predicted value, and all males will have the same predicted value. These correspond to the two vertical lines on the plot. The scatter within each vertical line represents the variability of individuals in their oxygen consumption within their respective group.

Points of concern would be those individuals whose studentized residual is outside the ±2 lines.

If the X variable had k treatment groups, there would be k vertical lines.

[2] This was constructed by (a) using the Analyze->Fit Model platform with Sex as the X variable, (b) using the red triangle to save columns to the data table for the predicted oxygen consumption and the studentized residual, (c) using Graph Overlay to get the base plot, and (d) clicking on the Y axis and adding reference lines at 0, 2, and -2.
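The reason for the vertical lines can be checked with a minimal sketch. The values below are hypothetical (they are not taken from the Fitness data), and the sketch assumes a one-factor model, in which the predicted value for each observation is simply its group mean.

```python
import numpy as np

# Hypothetical oxygen-consumption values for two groups (illustration only).
sex = np.array(["F", "F", "F", "F", "M", "M", "M", "M", "M"])
oxy = np.array([44.0, 47.5, 41.0, 45.5, 50.0, 48.5, 53.0, 46.0, 52.5])

# In a one-factor model the predicted value is the group mean.
levels = np.unique(sex)
group_mean = {g: oxy[sex == g].mean() for g in levels}
predicted = np.array([group_mean[g] for g in sex])
residual = oxy - predicted

# One distinct predicted value per group, hence one vertical line per group.
n_lines = np.unique(predicted).size
```

Here `n_lines` equals the number of groups, so a factor with k levels would give k distinct positions along the horizontal axis of the residual plot.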

24.3 Logistic Regression residual plots - Part I

Suppose we wish to predict membership in a category as a function of a continuous covariate. For example, can we predict the sex of an individual based on their weight? This is known as logistic regression and is discussed in another chapter in this series of notes.

Again refer to the Fitness data set. The (Generalized Linear) model is:

$Y_i \sim \text{Binomial}(p_i)$
$\phi_i = \text{logit}(p_i)$
$\phi_i = Weight$

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like [3]:

[Figure: studentized residuals plotted against the predicted probability of being female, with reference lines at 0, 2, and -2.]

This plot looks a bit strange!

Along the bottom of the plot is the predicted probability of being female [4]. This is found by substituting the weight of each person into the estimated linear part, and then back-transforming from the logit scale to the ordinary probability scale. The first point on the plot, identified by a square box, is from a male who weighs over 90 kg. The predicted probability of being female is very small, about 5%.

The first question is exactly how a residual is defined when the Y variable is a category. For example, how would the residual for this point be computed? It makes no sense to simply take the observed value (male) minus the predicted probability (0.05).

[3] I added reference lines at 0, 2, and -2 by clicking on the Y axis of the plot.
[4] The first part of the output from the platform states that the probability of being female is being modeled.

Many computer packages redefine the categories using 0 and 1 labels. Because JMP was modeling the probability of being female, all males are assigned the value of 0, and all females are assigned the value of 1. Hence the residual for this point is 0 - 0.05 = -0.05, which, after studentization, is plotted as shown.

The bottom line in the residual plot corresponds to the male subjects. The top line corresponds to the female subjects.

Where are the areas of concern? You would be concerned about females who have a very small predicted probability of being female, and males who have a large predicted probability of being female. These are located in the circled areas of the plot.

The residual plot's strange appearance is an artifact of the modeling process.

24.4 Logistic Regression residual plots - Part II

What happens if the predictors in a logistic regression are also categorical? Based on what was seen for the ordinary regression case with a categorical predictor (Section 24.2), you can expect to see a set of vertical lines. But there are only two possible responses, so the plot reduces to a (non-informative) set of lattice points.

For example, consider predicting survival rates of Titanic passengers as a function of their sex. This model is:

$Y_i \sim \text{Binomial}(p_i)$
$\phi_i = \text{logit}(p_i)$
$\phi_i = Sex$

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like [5]:

[Figure: studentized residuals plotted against the predicted probability of survival, with reference lines at 0, 2, and -2.]

[5] I added reference lines at 0, 2, and -2 by clicking on the Y axis of the plot.
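The 0/1 coding from Section 24.3 and the lattice structure described here can be illustrated with a short sketch. It uses the Pearson scaling, (observed - predicted) / sqrt(p(1 - p)), which is one common way to scale a binary residual; this is an assumption made for illustration and is not necessarily the studentization that JMP applies. The function name `pearson_residual` and the probabilities in `p_hat` are hypothetical and are not estimates from the Titanic data.

```python
import numpy as np

def pearson_residual(y, p):
    """(observed - predicted) / sqrt(p * (1 - p)) for a 0/1 response."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return (y - p) / np.sqrt(p * (1.0 - p))

# The male (coded 0) whose predicted probability of being female is 0.05.
raw = 0 - 0.05
scaled = float(pearson_residual(0, 0.05))

# Categorical predictor: hypothetical predicted survival probabilities by sex.
p_hat = {"female": 0.7, "male": 0.2}

# Two predicted values times two possible outcomes gives four lattice points.
lattice = {
    (group, outcome): float(pearson_residual(outcome, p))
    for group, p in p_hat.items()
    for outcome in (0, 1)
}
```

Whatever the number of observations, `lattice` contains only four points, which is why each dot in such a plot can stand for many data values.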

The same logic applies as in the previous sections. Because Sex is a discrete predictor with two possible values, there are only two possible predicted probabilities of survival, corresponding to the two vertical lines in the plot. Because the response variable is categorical, it is converted to 0 or 1 values, and the residuals are then computed; these correspond to the two dots in each vertical line. Note that each dot represents several hundred data values!

This residual plot is rarely informative. After all, if there are only two outcomes and only two categories for the predictor, some people must fall into each of the two outcomes for each of the two categories of the predictor.

24.5 Poisson Regression residual plots - Part I

Poisson regression is similar to the case of multiple regression, but also has some features of the logistic regression case. For example, the responses are counts, which can only take discrete values (like the logistic case), but there can be a wide range of counts (like the multiple regression case).

For example, consider predicting the number of satellite males around female horseshoe crabs as a function of the body mass of the female. The model fit is:

$Y_i \sim \text{Poisson}(\mu_i)$
$\phi_i = \log(\mu_i)$
$\phi_i = Weight$

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like [6]:

[Figure: residuals plotted against the predicted number of satellite males, with reference lines at 0, 2, and -2.]

The plot now has a series of lines. These correspond to the distinct values of Y (as in the logistic regression case), with the lowest line corresponding to crabs with Y = 0, the next line corresponding to Y = 1, then Y = 2, and so on.

Again, the areas of concern are those points outside of ±2. In this plot, there are several females with a large number of satellite males that were predicted to have only 2 or 3 satellite males.

24.6 Poisson Regression residual plots - Part II

Finally, consider the case where the X variable is also discrete. For example, consider trying to predict the number of satellite males as a function of the color of the female crab. The model fit is:

$Y_i \sim \text{Poisson}(\mu_i)$
$\phi_i = \log(\mu_i)$
$\phi_i = Color$

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like [7]:

[Figure: residuals plotted against the predicted number of satellite males, with reference lines at 0, 2, and -2.]

[6] I added reference lines at 0, 2, and -2 by clicking on the Y axis of the plot.
[7] I added reference lines at 0, 2, and -2 by clicking on the Y axis of the plot.

Because the X variable is nominally scaled with 4 levels, there are four vertical lines on the plot (note that two of the predicted values are very close together, around 2.25, and can barely be distinguished). Because the Y values are restricted to non-negative integer values, there is again a series of lines corresponding to the discrete values of Y.

Again, points outside the ±2 reference lines may be of concern and may require further investigation.
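The structure of this last plot can be reproduced in a small sketch. The counts and color labels below are hypothetical (they are not the horseshoe crab data). The sketch relies on the fact that, for a Poisson model with a log link and a single categorical predictor, the fitted mean for each level equals that level's sample mean. It uses the Pearson scaling, (y - mu) / sqrt(mu), as an assumed stand-in for whatever scaling the software applies.

```python
import numpy as np

# Hypothetical counts of satellite males by female color (illustration only).
color = np.array(["light", "light",
                  "medium", "medium", "medium",
                  "dark", "dark",
                  "darker", "darker", "darker", "darker"])
count = np.array([4, 2, 0, 2, 4, 1, 1, 0, 1, 0, 1])

# With one categorical predictor, the fitted Poisson mean for each level
# is the sample mean of the counts at that level.
levels = np.unique(color)
mu_hat = {g: count[color == g].mean() for g in levels}
predicted = np.array([mu_hat[g] for g in color])

# Pearson residual for a Poisson response: (y - mu) / sqrt(mu).
pearson = (count - predicted) / np.sqrt(predicted)

# Four color levels give four distinct predicted values (vertical lines).
n_lines = np.unique(predicted).size
```

Within each vertical line, observations with the same count share the same residual, so the points fall on a lattice indexed by color level and by the value of Y.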