Model Diagnostics for Regression



After fitting a regression model it is important to determine whether all the necessary model assumptions are valid before performing inference. If there are any violations, subsequent inferential procedures may be invalid, resulting in faulty conclusions. Therefore, it is crucial to perform appropriate model diagnostics.

In constructing our regression models we assumed that the relationship between the response $y$ and the explanatory variables is linear in the parameters, and that the errors are independent and identically distributed (i.i.d.) normal random variables with mean 0 and constant variance $\sigma^2$. Model diagnostic procedures involve both graphical methods and formal statistical tests. These procedures allow us to explore whether the assumptions of the regression model are valid and to decide whether we can trust subsequent inference results.

Ex. Data was collected on 100 houses recently sold in a city. It consisted of the sales price (in $), house size (in square feet), the number of bedrooms, the number of bathrooms, the lot size (in square feet) and the annual real estate tax (in $). To read in the data and fit a multiple linear regression model with Price as the response variable and Size and Lot as the explanatory variables, use the commands:

> Housing = read.table("C:/Users/Martin/Documents/W2024/Housing.txt", header=TRUE)
> results = lm(Price ~ Size + Lot, data=Housing)

All information about the fitted regression model is now contained in results.

A. Studying the Variables

There are several graphical methods appropriate for studying the behavior of the explanatory variables. We are primarily concerned with determining the range and concentration of the values of the X variables, and with whether any outliers exist. Graphical procedures include histograms, boxplots and sequence plots. We can use the functions hist() and boxplot() to make histograms and boxplots, respectively; a short example follows the scatter plot matrix command below. In addition, scatter plots of all pair-wise combinations of variables contained in the model can be summarized in a scatter plot matrix. A scatter plot matrix can be created by typing:

> pairs(~ Price + Size + Lot, data=Housing)
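For the individual variables, hist() and boxplot() can be applied directly (a minimal sketch, assuming the Housing data frame read in above):

> hist(Housing$Price)      # distribution of sale prices
> boxplot(Housing$Size)    # spread and potential outliers in house size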

From the scatter plot matrix we can study the relationships between Price and Size, Price and Lot, and Size and Lot.

B. Residuals and Leverage

The leverage of an observation measures its ability to move the regression model all by itself, simply by moving in the y-direction: it is the amount by which the predicted value would change if the observation were shifted one unit in the y-direction. Leverage always takes values between 0 and 1. A point with zero leverage has no effect on the regression model; if a point has leverage equal to 1, the fitted line must follow the point perfectly. We can compute and plot the leverage of each point using the following commands:

> results = lm(Price ~ Size + Lot, data=Housing)
> lev = hat(model.matrix(results))
> plot(lev)
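Equivalently, the accessor hatvalues() returns the same leverage values directly from the fitted model, without constructing the model matrix by hand:

> lev = hatvalues(results)
> plot(lev, ylab="Leverage")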

Note there is one point that has a higher leverage than all the other points (roughly 0.25). To identify this point, type:

> Housing[lev > 0.2, ]
   Taxes Bedrooms Baths  Price Size   Lot
64  4350        3     3 225000 4050 35000

This command returns all observations in the Housing data set with leverage values above 0.2. To study the influence of this particular point we can make scatter plots and specifically mark this point using the points() function. For example:

> plot(Housing$Size, Housing$Price)
> points(Housing[64,]$Size, Housing[64,]$Price, col='red')
> plot(Housing$Lot, Housing$Price)
> points(Housing[64,]$Lot, Housing[64,]$Price, col='red')
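If no obvious threshold suggests itself from the plot, the index of the largest leverage can also be found directly with base R's which.max(); for this data set it should return 64:

> which.max(lev)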

Note the 64th observation is plotted in red in both graphs for easy identification.

The residuals can be accessed using the command results$res, where results is the variable in which the output from the function lm() is stored, i.e.

> results = lm(Price ~ Size + Lot, data=Housing)

Prior to studying the residuals it is common to standardize them to compensate for differences in leverage. The studentized residuals are given by

$$r_i = \frac{e_i}{s \sqrt{1 - h_{ii}}}$$

where $e_i$ is the residual, $h_{ii}$ the leverage of the $i$th observation, and $s$ the residual standard error. Studentized residuals can be computed in R using the command:

> r = rstudent(results)

An influential point is one that, if removed from the data, would significantly change the fit. An influential point may either be an outlier or have large leverage, or both, but it will tend to have at least one of those properties. Cook's distance is a commonly used influence measure that combines these two properties. It can be expressed as

$$C_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$$

where $p$ is the number of regression parameters. Typically, points with $C_i$ greater than 1 are classified as being influential.
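The displayed formula can be checked by hand (a minimal sketch; note that rstandard() reproduces the formula exactly, whereas rstudent() replaces $s$ with the leave-one-out estimate $s_{(i)}$, so its values differ slightly):

> e = residuals(results)
> h = hatvalues(results)
> s = summary(results)$sigma
> r.manual = e / (s * sqrt(1 - h))
> all.equal(unname(r.manual), unname(rstandard(results)))   # TRUE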

We can compute the Cook's distances using the following commands:

> cook = cooks.distance(results)
> plot(cook, ylab="Cook's distances")
> points(64, cook[64], col='red')

Note the Cook's distance for the 64th observation is roughly 0.90. If we ultimately decide we need to remove an observation from the data set and refit the model, we can use the commands:

> Housing2 = Housing[-64, ]
> results2 = lm(Price ~ Size + Lot, data=Housing2)

The first command creates a new data frame, Housing2, which is a copy of the data frame Housing except that it excludes the 64th observation.

C. Residual Plots

We can use residuals to study whether:
- The regression function is nonlinear.
- The error terms have nonconstant variance.
- The error terms are not independent.
- There are outliers.
- The error terms are not normally distributed.

We can check for violations of these assumptions by making plots of the residuals, including:
- Plots of the residuals against the explanatory variables or fitted values.
- Histograms or boxplots of the residuals.
- Normal probability plots of the residuals.

We plot the residuals against both the predicted values and the explanatory variables. For the constant variance assumption to hold, the residuals should be randomly scattered about 0 and the width of the band should be roughly equal throughout.

> par(mfrow=c(1,3))
> plot(Housing$Size, results$res)
> plot(Housing$Lot, results$res)
> plot(results$fitted, results$res)

To make the same residual plots using the studentized residuals, type:

> r = rstudent(results)
> plot(Housing$Size, r)
> plot(Housing$Lot, r)
> plot(results$fitted, r)
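To make departures from 0 easier to judge, a horizontal reference line can be added after any of these plots (abline() is base R; shown here for the fitted-value plot):

> plot(results$fitted, results$res)
> abline(h = 0, lty = 2)   # dashed reference line at zero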

A normal probability plot of the residuals can be used to check the normality assumption. Here each residual is plotted against its expected value under normality. To make a normal probability plot, as well as a histogram, type:

> qqnorm(results$res)
> qqline(results$res)
> hist(results$res)
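These graphical checks can be supplemented with a formal test; for example, a Shapiro-Wilk test of normality applied to the residuals (shapiro.test() is in base R's stats package; a small p-value indicates evidence against normality):

> shapiro.test(results$res)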