A short primer on residual plots

Chapter 24
A short primer on residual plots

Contents
24.1 Linear Regression
24.2 ANOVA residual plots
24.3 Logistic Regression residual plots - Part I
24.4 Logistic Regression residual plots - Part II
24.5 Poisson Regression residual plots - Part I
24.6 Poisson Regression residual plots - Part II

The suggested citation for this chapter of notes is:
Schwarz, C. J. (2015). A short primer on residual plots. In Course Notes for Beginning and Intermediate Statistics. Available at http://www.stat.sfu.ca/~cschwarz/coursenotes. Retrieved 2015-08-20.

Residual plots are one of the most important diagnostic tools available for model checking. However, residual plots can take a variety of forms depending upon the type of model fitted, and these forms can appear confusing at first glance.

At its simplest, the residual is defined as:

$$\text{residual}_i = \text{observed}_i - \text{predicted}_i$$

where the i-th residual is the difference between the observed and predicted values for the i-th observation.

These residuals are often standardized or studentized. Standardization occurs when all of the residuals are divided by a common, average standard deviation of the residuals. Studentization occurs when each individual residual is divided by its own standard deviation, which may vary among the residuals.

For example, in simple linear regression, the standardized residuals are divided by $s = \sqrt{MSE}$, which is an estimate of the common standard deviation about the regression line. However, the residuals do not all have the same variance: residuals near the middle of the regression line (i.e. near $\bar{X}$) are more variable than residuals near the extremities of the line, where the leverage is high and the fitted line is pulled toward the observation. The studentized residual is therefore divided by $s\sqrt{1 - h_{ii}}$, where $h_{ii}$ is the leverage value for the i-th observation.
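The two scalings described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the six data points are invented (they are not the Fitness data), the leverage uses the standard simple-regression formula $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$, and "studentized" is taken to mean division by $s\sqrt{1 - h_{ii}}$ as defined in the text.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

def residual_diagnostics(x, y):
    """Return raw, standardized, and studentized residuals plus leverages."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    raw = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # MSE uses n - 2 degrees of freedom (intercept and slope are estimated).
    s = math.sqrt(sum(e ** 2 for e in raw) / (n - 2))
    leverage = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    standardized = [e / s for e in raw]
    studentized = [e / (s * math.sqrt(1 - h)) for e, h in zip(raw, leverage)]
    return raw, standardized, studentized, leverage

# Invented illustrative data, not taken from the Fitness data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
raw, standardized, studentized, leverage = residual_diagnostics(x, y)
```

Because $1 - h_{ii}$ is less than one, each studentized residual is at least as large in magnitude as its standardized counterpart, and the gap is largest for high-leverage points.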

Regardless of whether standardized or studentized residuals are used, they are plotted against the predicted values. A good model will have the residuals centered around zero, with a high proportion (about 95%) within ±2, and no pattern to the residuals.

24.1 Linear Regression

For example, consider the Fitness data set available in the JMP sample data library. This consists of measurements of the weight, age, pulse rates, and oxygen consumption of males and females as they completed a standardized fitness test. Consider the model:

$$Oxy_i = \beta_0 + \beta_1 Weight_i + \epsilon_i$$

or in a simplified notation

Oxy = Weight

This model was fit, and the resulting residual plot (footnote 1) is:

[Figure: studentized residuals plotted against predicted oxygen consumption, with reference lines at 0, 2, and -2.]

This shows a random scatter around zero with only a few points outside the ±2 limits.

Notice that in simple regression, the Y variable is continuous, as is the X variable. Consequently, predictions are also continuous and so the plot of the residuals will show this random scatter (assuming the model fits well). Similar plots are obtained in multiple regression or ANCOVA models.

Footnote 1: This was constructed by (a) using the Analyze->Fit Model platform, (b) using the red triangle to save columns to the data table for the predicted oxygen consumption and the studentized residual, (c) using Graph Overlay to get the base plot, and (d) clicking on the Y axis and adding reference lines at 0, 2, and -2.

© 2015 Carl James Schwarz 2015-08-20
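The ±2 rule of thumb can be checked numerically as well as by eye. The sketch below is a hypothetical helper (the name `flag_large_residuals` and the example values are invented for illustration); it reports the fraction of residuals inside the limits and the positions of the points that fall outside.

```python
def flag_large_residuals(studentized, limit=2.0):
    """Return (fraction within +/-limit, indices of points outside the limits)."""
    outside = [i for i, r in enumerate(studentized) if abs(r) > limit]
    inside_fraction = 1 - len(outside) / len(studentized)
    return inside_fraction, outside

# Hypothetical studentized residuals, for illustration only.
example = [0.3, -1.1, 2.4, 0.8, -0.2, -2.6, 1.5, 0.1, -0.7, 0.9]
fraction, flagged = flag_large_residuals(example)
```

With only ten points, two values outside the limits is already more than the roughly 5% expected, so in practice the flagged observations would be inspected individually.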

24.2 ANOVA residual plots

Consider now comparisons of Y values among different treatment groups. For example, is there a difference in the mean oxygen consumption between males and females as sampled in the Fitness data set? The model is now:

Oxy = Sex

The model was fit, and the resulting residual plot (footnote 2) is:

[Figure: studentized residuals plotted against predicted oxygen consumption for the model with Sex as the X variable, with reference lines at 0, 2, and -2.]

At first glance, this plot does not show a random scatter, as there is a definite pattern with two vertical lines. However, on sober second thought, this is not surprising. There are only two levels of Sex and so there are at most two distinct predicted values, one for males and one for females. All females will have the same predicted value, and all males will have the same predicted value. These correspond to the two vertical lines on the plot.

The scatter within each vertical line represents the variability of individuals in their oxygen consumption within their respective group. Points of concern would be those individuals whose studentized residual is outside the ±2 lines.

If the X variable had k treatment groups, there would be k vertical lines.

Footnote 2: This was constructed by (a) using the Analyze->Fit Model platform with Sex as the X variable, (b) using the red triangle to save columns to the data table for the predicted oxygen consumption and the studentized residual, (c) using Graph Overlay to get the base plot, and (d) clicking on the Y axis and adding reference lines at 0, 2, and -2.
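The reason for the vertical lines is that, in a one-factor model, every observation is predicted by its own group mean. A minimal Python sketch, using invented oxygen values rather than the Fitness data and a hypothetical helper name, makes this explicit:

```python
from collections import defaultdict

def group_mean_predictions(groups, y):
    """Predict each observation by the mean of its group (one-factor ANOVA)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for g, yi in zip(groups, y):
        totals[g] += yi
        counts[g] += 1
    means = {g: totals[g] / counts[g] for g in totals}
    predicted = [means[g] for g in groups]
    residuals = [yi - p for yi, p in zip(y, predicted)]
    return means, predicted, residuals

# Invented values for illustration only.
sex = ["F", "F", "F", "M", "M", "M"]
oxy = [44.0, 46.0, 48.0, 49.0, 51.0, 53.0]
means, predicted, residuals = group_mean_predictions(sex, oxy)
distinct_predictions = sorted(set(predicted))  # one value per group
```

The number of distinct predicted values equals the number of groups, which is why a factor with k levels gives k vertical lines.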

24.3 Logistic Regression residual plots - Part I

Suppose we wish to predict membership in a category as a function of a continuous covariate. For example, can we predict the sex of an individual based on their weight? This is known as logistic regression and is discussed in another chapter in this series of notes.

Again refer to the Fitness data set. The (Generalized Linear) model is:

$$Y_i \sim \text{Binomial}(p_i)$$
$$\phi_i = \text{logit}(p_i)$$
$$\phi_i = Weight$$

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like (footnote 3):

[Figure: studentized residuals plotted against the predicted probability of being female, with reference lines at 0, 2, and -2.]

This plot looks a bit strange!

Along the bottom of the plot is the predicted probability of being female (footnote 4). This is found by substituting the weight of each person into the estimated linear part, and then back-transforming from the logit scale to the ordinary probability scale. The first point on the plot, identified by a square box, is from a male who weighs over 90 kg. The predicted probability of being female is very small, about 5%.

The first question is exactly how a residual is defined when the Y variable is a category. For example, how would the residual for this point be computed? It makes no sense to simply take the observed category (male) minus the predicted probability (0.05).

Footnote 3: I added reference lines at 0, 2, and -2 by clicking on the Y axis of the plot.

Footnote 4: The first part of the output from the platform states that the probability of being female is being modeled.
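The back-transformation from the logit scale to the probability scale can be sketched as follows. The intercept and slope here are hypothetical values chosen only so that the arithmetic is easy to follow; they are not the coefficients estimated from the Fitness data.

```python
import math

def inverse_logit(phi):
    """Back-transform a value on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-phi))

def predicted_probability(weight, b0, b1):
    """Substitute weight into the linear part, then back-transform."""
    return inverse_logit(b0 + b1 * weight)

# Hypothetical coefficients for illustration only (not fitted values).
b0, b1 = 15.0, -0.2
p_heavy = predicted_probability(90.0, b0, b1)  # linear part is -3
p_light = predicted_probability(60.0, b0, b1)  # linear part is +3
```

Under these assumed coefficients a 90 kg person has a linear predictor of -3, which back-transforms to a probability a little under 5%, the same order of magnitude as the point discussed in the text.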

Many computer packages redefine the categories using 0 and 1 labels. Because JMP was modeling the probability of being female, all males are assigned the value of 0, and all females are assigned the value of 1. Hence the residual for this point is 0 - 0.05 = -0.05, which, after studentization, is plotted as shown.

The bottom line in the residual plot corresponds to the male subjects. The top line corresponds to the female subjects.

Where are the areas of concern? You would be concerned about females who have a very small predicted probability of being female, and males who have a large predicted probability of being female. These are located in the circled areas of the plot.

The residual plot's strange appearance is an artifact of the modeling process.

24.4 Logistic Regression residual plots - Part II

What happens if the predictors in a logistic regression are also categorical? Based on what was seen for the ordinary regression case, you can expect to see a set of vertical lines. But there are only two possible responses, so the plot reduces to a (non-informative) set of lattice points.

For example, consider predicting survival rates of Titanic passengers as a function of their sex. This model is:

$$Y_i \sim \text{Binomial}(p_i)$$
$$\phi_i = \text{logit}(p_i)$$
$$\phi_i = Sex$$

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like (footnote 5):

[Figure: studentized residuals plotted against the predicted probability of survival, with reference lines at 0, 2, and -2.]

Footnote 5: I added reference lines at 0, 2, and -2 by clicking on the Y axis of the plot.
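The lattice structure can be reproduced with a small sketch. It assumes the 0/1 coding described above and relies on the fact that, with a single two-level predictor, the fitted probability for each level is simply the observed proportion in that level. The eight records are invented for illustration and are not the Titanic data; the function name is hypothetical, and raw (unscaled) residuals are used.

```python
def lattice_points(sex, survived):
    """Return fitted proportions per level and the distinct (predicted, residual) pairs."""
    counts = {}
    for s, y in zip(sex, survived):
        n, k = counts.get(s, (0, 0))
        counts[s] = (n + 1, k + y)
    # Fitted probability for each level is the observed proportion of 1s.
    p_hat = {s: k / n for s, (n, k) in counts.items()}
    # Raw residual is the 0/1 response minus the fitted probability.
    points = sorted({(p_hat[s], y - p_hat[s]) for s, y in zip(sex, survived)})
    return p_hat, points

# Invented records for illustration only.
sex = ["F"] * 4 + ["M"] * 4
survived = [1, 1, 1, 0, 1, 0, 0, 0]
p_hat, points = lattice_points(sex, survived)
```

However many records are supplied, at most four distinct points can appear: two predicted probabilities, each paired with a residual for the 0 response and a residual for the 1 response.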

The same logic applies as in the previous sections. Because Sex is a discrete predictor with two possible values, there are only two possible predicted probabilities of survival, corresponding to the two vertical lines in the plot. Because the response variable is categorical, it is converted to 0 or 1 values, and the residuals are computed; these correspond to the two dots in each vertical line. Note that each dot represents several hundred data values!

This residual plot is rarely informative. After all, if there are only two outcomes and only two categories for the predictor, some people have to fall in each of the two outcomes for each of the two categories of the predictor.

24.5 Poisson Regression residual plots - Part I

Poisson regression is similar to the case of multiple regression, but also has some features of the logistic regression case. For example, the responses are counts which can only take discrete values (like the logistic case), but there can be a wide range of counts (like the multiple regression case).

For example, consider predicting the number of satellite males around female horseshoe crabs as a function of the body mass of the female. The model fit is:

$$Y_i \sim \text{Poisson}(\mu_i)$$
$$\phi_i = \log(\mu_i)$$
$$\phi_i = Weight$$
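To see how a log-link model of this kind turns a weight into a predicted count, and why residuals for counts line up in bands, here is a minimal Python sketch. Everything numerical in it is an assumption: the coefficients and weights are invented, and the Pearson residual $(y - \mu)/\sqrt{\mu}$ is used as one common scaling for Poisson models, which may differ from the studentized residual that JMP plots.

```python
import math

def poisson_mean(weight, b0, b1):
    """Back-transform the linear part from the log scale to a mean count."""
    return math.exp(b0 + b1 * weight)

def pearson_residual(y, mu):
    """Scale the raw residual by the Poisson standard deviation sqrt(mu)."""
    return (y - mu) / math.sqrt(mu)

# Hypothetical coefficients and weights, for illustration only.
b0, b1 = -0.4, 0.6
weights = [1.5, 2.0, 2.5, 3.0, 3.5]
means = [poisson_mean(w, b0, b1) for w in weights]

# One curve of residuals for each possible count y = 0, 1, 2, 3.
curves = {y: [pearson_residual(y, mu) for mu in means] for y in range(4)}
```

For a fixed count y, the residual depends only on the predicted mean, so all observations sharing that count fall on a single curve, with the curve for y = 0 lowest, then y = 1, and so on.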

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like (footnote 6):

[Figure: studentized residuals plotted against the predicted number of satellite males, with reference lines at 0, 2, and -2.]

The plot now has a series of lines. These correspond to the distinct values of Y (as in the logistic regression case), with the lowest line corresponding to crabs with Y = 0, the next line corresponding to Y = 1, then Y = 2, and so on.

Again the areas of concern are those points outside of ±2. In this plot, there are several females with a large number of satellite males that were predicted to have only 2 or 3 satellite males.

24.6 Poisson Regression residual plots - Part II

Finally, consider the case where the X variable is also discrete. For example, consider trying to predict the number of satellite males as a function of the color of the female crab. The model fit is:

$$Y_i \sim \text{Poisson}(\mu_i)$$
$$\phi_i = \log(\mu_i)$$
$$\phi_i = Color$$

The residual plot is produced automatically from the Generalized Linear Model option of the Analyze->Fit Model platform and looks like (footnote 7):

[Figure: studentized residuals plotted against the predicted number of satellite males for the model with Color as the X variable, with reference lines at 0, 2, and -2.]

Footnote 6: I added reference lines at 0, 2, and -2 by clicking on the Y axis of the plot.

Footnote 7: I added reference lines at 0, 2, and -2 by clicking on the Y axis of the plot.

Because the X variable is nominally scaled with 4 levels, there are four vertical lines on the plot (note that two of the predicted values are very close together, around 2.25, and can barely be distinguished). Because the Y values are restricted to non-negative integer values, there is again a series of lines corresponding to the discrete values of Y.

Again, points outside the ±2 reference lines may be of concern and may require further investigation.