31. SIMPLE LINEAR REGRESSION VI: LEVERAGE AND INFLUENCE


These topics are not covered in the text, but they are important.

Leverage

If the data set contains outliers, these can affect the least-squares fit. To study the impact on the fitted line of moving a single data point, see the website at: http://www.stat.sc.edu/~west/javahtml/regression.html

If a given data point (say, the i-th one) is moved up or down, the corresponding fitted value ŷ_i will move proportionally to the change in y_i. The proportionality constant is called the leverage, denoted in Minitab by h_i, and we get a value of h_i for each data point. The leverage of a given data point measures the impact that y_i has on ŷ_i. The further x_i is from x̄, the larger h_i, and therefore the more sensitive ŷ_i is to changes in y_i. So points with very large and very small x values have more leverage than points with intermediate x values.
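In simple linear regression the leverage has a closed form, h_i = 1/n + (x_i - x̄)² / Σ_j (x_j - x̄)², which makes the claim above explicit: h_i grows with the squared distance of x_i from x̄. A minimal Python sketch (the helper name and toy data are illustrative, not from the notes):

    import numpy as np

    def leverage(x):
        """Closed-form leverage for simple linear regression:
        h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2."""
        x = np.asarray(x, dtype=float)
        dev = x - x.mean()
        return 1.0 / len(x) + dev ** 2 / np.sum(dev ** 2)

    # Toy data: the point at x = 10, farthest from the mean, gets the largest h_i.
    print(leverage([1.0, 2.0, 3.0, 4.0, 10.0]))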

If for some reason a point with high leverage also happens to be far from the least-squares line that would be fitted to the remaining data points (i.e., if the point is an outlier), then we may need to take some action, e.g., delete the point, reconsider whether the model is reasonable, see if there was a recording error, etc.

It can be shown that the h_i are all between 0 and 1. In practice, h_i is considered large if it exceeds 4/n.

Influence Diagnostics

An observation is influential if the estimates change substantially when the point is omitted. Leverage depends only on the x's, not on the y's. A point with high leverage may or may not be influential, and likewise for a point with low leverage. Looking at residuals may not reveal influential points, since an outlier, particularly if it occurs at a point of high leverage, will tend to drag the fitted line along with it, and therefore may have a small residual. This phenomenon is called masking.
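Masking is easy to reproduce: place an outlier at a high-leverage x value and its residual stays modest, because the point drags the fitted line toward itself. A toy sketch (all data invented for illustration):

    import numpy as np

    # A clean linear trend plus one outlier at a high-leverage x value.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 12.0])
    y = np.array([1.1, 2.0, 2.9, 4.1, 5.0, 4.0])   # last point falls far below the trend

    b1, b0 = np.polyfit(x, y, 1)                   # slope, intercept of the dragged fit
    resid = y - (b0 + b1 * x)
    print(resid)   # the outlier at x = 12 ends up with one of the *smaller* residuals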

A more direct measure of the influence of the i-th data point is given by Cook's D statistic, which is based on the sum of squared differences between the fitted ŷ values from the full data set and the hypothetical ŷ values we would get if we deleted the i-th data point. Observations with D_i > 1 should be examined carefully.

E.g.: Consider the Team Batting Average (x) and Team Winning Percentage (y) for the 14 teams in the American League in 1986. The data file is Baseball86.MTP.

The scatterplot shows some indication of a positive linear association, although some of the teams with high batting averages have surprisingly low winning percentages. These teams are Cleveland, Milwaukee, Toronto, and Minnesota (the most extreme case). The residual plot confirms that the linear model is far from perfect.
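Before examining the baseball data, here is a brute-force sketch of Cook's D as defined above. The usual definition scales the sum of squared differences by p·s², where p = 2 is the number of regression parameters and s² is the full-model mean squared error; this is an illustrative implementation, not Minitab's code:

    import numpy as np

    def cooks_d(x, y):
        """Cook's D by leave-one-out refitting:
        D_i = sum_j (yhat_j - yhat_j(i))^2 / (p * s^2), with p = 2 parameters."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        n, p = len(x), 2
        b1, b0 = np.polyfit(x, y, 1)               # full-data slope and intercept
        yhat = b0 + b1 * x
        s2 = np.sum((y - yhat) ** 2) / (n - p)     # full-model MSE
        d = np.empty(n)
        for i in range(n):
            keep = np.arange(n) != i
            c1, c0 = np.polyfit(x[keep], y[keep], 1)   # refit without point i
            d[i] = np.sum((yhat - (c0 + c1 * x)) ** 2) / (p * s2)
        return d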

Team          Team Batting   Team Winning
              Average (x)    Percentage (y)
Baltimore     .266           .574
Boston        .269           .661
California    .256           .508
Chicago       .246           .410
Cleveland     .271           .500
Detroit       .259           .467
Kansas City   .250           .508
Milwaukee     .271           .525
Minnesota     .274           .403
New York      .268           .587
Oakland       .252           .422
Seattle       .246           .391
Texas         .263           .548
Toronto       .270           .500

[Scatterplot of Winning vs. Batting]

[Residual plot: residuals versus the fitted values (response is Winning)]

The points which were "surprisingly low" in the scatterplot now show up as strongly negative residuals, indicating that for these teams, their winning percentages fall short of what would be predicted by a linear regression model. Another problem is that the residuals indicate an overall upward trend. This is a sign that the outliers have "dragged down" the fitted line. The fitted model is ŷ = -0.5245 + 3.919x. The p-value for β_1 is 0.070, and R² is 0.248, indicating a weak to moderate linear association.

Regression Analysis

The regression equation is
Winning = -0.524 + 3.92 Batting

Predictor    Coef      SE Coef   T        P
Constant    -0.5245    0.5154    -1.02    0.329
Batting      3.919     1.969      1.99    0.070

S = 0.07017   R-Sq = 24.8%   R-Sq(adj) = 18.5%

Analysis of Variance

Source           DF   SS         MS         F      P
Regression        1   0.019496   0.019496   3.96   0.070
Residual Error   12   0.059089   0.004924
Total            13   0.078585

Predicted Values

Fit      StDev Fit   95.0% CI           95.0% PI
0.4944   0.0190      (0.4530, 0.5358)   (0.3360, 0.6528)

Incidentally, if we delete the outlier teams, the p-value goes down to 0.000 and R² goes up to 0.821. So the linear relationship is strong for the remaining 10 teams.

We next examine the Minitab "Fitted Line Plot". This gives a scatterplot, together with the fitted line, and (an option for) 95% confidence and prediction intervals. Note that the confidence intervals are wider at the ends.
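The coefficients and R² in the output above can be reproduced from the data table; a minimal sketch, assuming numpy is available:

    import numpy as np

    # Data from the table above: AL 1986 batting averages (x) and winning percentages (y).
    batting = np.array([.266, .269, .256, .246, .271, .259, .250,
                        .271, .274, .268, .252, .246, .263, .270])
    winning = np.array([.574, .661, .508, .410, .500, .467, .508,
                        .525, .403, .587, .422, .391, .548, .500])

    slope, intercept = np.polyfit(batting, winning, 1)
    yhat = intercept + slope * batting
    r2 = 1 - np.sum((winning - yhat) ** 2) / np.sum((winning - winning.mean()) ** 2)
    print(intercept, slope, r2)   # should match -0.5245, 3.919, and 0.248 up to rounding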

[Regression plot: Y = -5.2E-01 + 3.91887X, R-Sq = 24.8%, with 95% CI and 95% PI bands]

Next, we compute the leverage and Cook's D statistics. In Minitab, use Stat > Regression > Regression > Storage, and click the boxes for Hi (leverage) and Cook's Distance. The point for Minnesota (Case 9) has a leverage of 0.1945, which does not exceed 4/n = 0.29, and therefore would not be considered extremely high. It has a Cook's D of 0.65, which does not exceed 1, and so would not be considered an outlier by this criterion. But the unusualness of Minnesota is partially masked by Cleveland, Milwaukee, and Toronto. If we leave out all four teams, the results change drastically. In general, Cook's D can be "fooled" by multiple outliers.
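The leverage and Cook's D columns can also be obtained outside Minitab; a sketch assuming statsmodels is installed, reusing the batting and winning arrays defined earlier:

    import statsmodels.api as sm

    fit = sm.OLS(winning, sm.add_constant(batting)).fit()
    infl = fit.get_influence()

    h = infl.hat_matrix_diag        # leverages, Minitab's HI1 column
    d = infl.cooks_distance[0]      # Cook's D, Minitab's COOK1 column
    print(h[8], d[8])               # Minnesota is case 9: about 0.1945 and 0.651
    print(h > 4 / len(batting))     # no team exceeds the 4/n leverage cutoff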

American League Baseball, 1986

Team          Batting   Winning   HI1        COOK1
Baltimore     0.266     0.574     0.087380   0.033503
Boston        0.269     0.661     0.115737   0.259203
California    0.256     0.508     0.095257   0.010122
Chicago       0.246     0.410     0.260676   0.042267
Cleveland     0.271     0.500     0.142520   0.027700
Detroit       0.259     0.467     0.076352   0.005014
Kansas City   0.250     0.508     0.175603   0.073092
Milwaukee     0.271     0.525     0.142520   0.003083
Minnesota     0.274     0.403     0.194509   0.651305
New York      0.268     0.587     0.104709   0.049751
Oakland       0.252     0.422     0.142520   0.033177
Seattle       0.246     0.391     0.260676   0.114114
Texas         0.263     0.548     0.073201   0.015146
Toronto       0.270     0.500     0.128341   0.019360

Regression Analysis: Baseball data, without Minnesota, Cleveland, Milwaukee, Toronto

The regression equation is
Winning = -1.79 + 8.93 Batting

Predictor    Coef      SE Coef   T        P
Constant    -1.7913    0.3792    -4.72    0.001
Batting      8.928     1.472      6.07    0.000

S = 0.03895   R-Sq = 82.1%   R-Sq(adj) = 79.9%

Analysis of Variance

Source           DF   SS         MS         F       P
Regression        1   0.055835   0.055835   36.80   0.000
Residual Error    8   0.012139   0.001517
Total             9   0.067974

[Fitted line plot, four cases omitted: Y = -1.79134 + 8.92791X, R-Sq = 82.1%, with 95% CI and 95% PI bands]
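The refit can be checked with the arrays defined earlier; in those arrays Cleveland, Milwaukee, Minnesota, and Toronto sit at indices 4, 7, 8, and 13:

    import numpy as np

    keep = np.ones(14, dtype=bool)
    keep[[4, 7, 8, 13]] = False      # drop the four masked/masking teams

    slope, intercept = np.polyfit(batting[keep], winning[keep], 1)
    print(intercept, slope)          # should match -1.791 and 8.928 up to rounding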