Prediction and Confidence Intervals in Regression


Statistics 621, Fall Semester 2001
Lecture 3, Robert Stine

Preliminaries

- Teaching assistants: see them in Room 3009 SH-DH; hours are detailed in the syllabus.
- Assignment #1 is due Friday. There is a substantial penalty if it is not turned in until Monday.
- Feedback to me: the in-class feedback form, e-mail from the class web page, cohort academic reps, the quality circle, or the feedback questions on the class web page.

Review of Key Points

The regression model adds assumptions to the equation:

0. The equation, stated either in terms of the average of the response,
       ave(Yᵢ | Xᵢ) = β₀ + β₁Xᵢ,
   or by adding an error term to the average, obtaining for each of the individual values of the response
       Yᵢ = ave(Yᵢ | Xᵢ) + εᵢ = β₀ + β₁Xᵢ + εᵢ.
1. Independent observations (independent errors).
2. Constant variance observations (equal error variance).
3. Normal distribution around the regression line, written as an assumption about the unobserved errors: εᵢ ~ N(0, σ²).

Least squares gives estimates of the intercept, slope, and errors. The importance of the assumptions becomes evident in the context of prediction.

Checking the assumptions (summary on p. 46; a numerical sketch follows below):
- Linearity (order IS important): scatterplots of the data and residuals (nonlinearity).
- Independence: plots highlighting trend (autocorrelation).
- Constant variance: residual plots (heteroscedasticity).
- Normality: quantile plots (outliers, skewness).

One outlier can exert much influence on a regression. A JMP-IN script illustrating the effect of an outlier on the fitted least squares regression line is available from my web page for the class. Another version of this script is packaged with the JMP software and installed on your machine (as part of the standard installation).
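The checks above are done with plots in JMP-IN; as a rough numerical companion, here is a minimal Python/NumPy sketch (not part of the course software) that fits a least squares line and computes crude versions of the diagnostics. The data, parameter values, and the particular checks are illustrative assumptions of mine, not taken from the lecture datasets.

    import numpy as np

    # Simulate data satisfying the model Y_i = beta0 + beta1*X_i + eps_i,
    # with independent, constant-variance, normal errors (values illustrative).
    rng = np.random.default_rng(0)
    n = 100
    x = rng.uniform(0, 10, n)
    y = 3.0 + 0.8 * x + rng.normal(0.0, 1.0, n)

    # Least squares estimates of slope and intercept.
    b1, b0 = np.polyfit(x, y, 1)
    resid = y - (b0 + b1 * x)        # estimated errors

    # Crude numerical companions to the diagnostic plots. Note that OLS
    # residuals are exactly uncorrelated with x, so the linearity check
    # looks for curvature via the squared, centered predictor instead.
    curv = np.corrcoef(resid, (x - x.mean()) ** 2)[0, 1]
    lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
    within = np.mean(np.abs(resid) < 2 * resid.std())
    print(f"linearity   : corr(resid, centered x^2) = {curv:.3f}")
    print(f"independence: lag-1 autocorrelation     = {lag1:.3f}")
    print(f"normality   : share within 2 SD         = {within:.2f}")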

Review Questions from Feedback

Can I use the RMSE to compare models? Yes, but you need to ensure that the comparison is honest. As long as you have the same response in both models, the comparison of two RMSE values makes sense. If you change the response, such as through a log transformation of the response as in the second question of Assignment #1, you need to be careful. The following figure compares a linear fit to a log-log fit for MPG versus weight:

[Figure: scatterplot of MPG City (15 to 40) versus Weight (1500 to 4000 lb), with the linear and log-log fits overlaid.]

If you compare the nominal RMSEs shown for the two models, you get

    RMSE linear    2.2
    RMSE log-log   0.085

While the linear fit looks inferior, it's not that much worse! The problem is that we have ignored the units; these two reported RMSE values are on different scales. The linear RMSE is 2.2 MPG, but the log-log RMSE is 0.085 log MPG, a very different scale. JMP-IN helps out if you ask it: the summary on the next page describes the accuracy of the fitted log-log model, but in the original units. That summary gives an RMSE of 1.9 MPG, close to the linear value.
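To make the units issue concrete, here is a minimal Python/NumPy sketch of the honest comparison, using simulated stand-ins for the MPG and weight data (all constants are made up, not the lecture's car data): fit both models, then back-transform the log-log predictions and recompute the RMSE in MPG, in the spirit of JMP-IN's "Fit Measured on Original Scale" summary.

    import numpy as np

    rng = np.random.default_rng(0)
    weight = rng.uniform(1500, 4000, 100)          # predictor, in pounds
    mpg = 10 ** (4.28 - 0.85 * np.log10(weight)) + rng.normal(0, 1.5, 100)

    # Linear fit: MPG on weight. RMSE is in MPG.
    b1, b0 = np.polyfit(weight, mpg, 1)
    resid_lin = mpg - (b0 + b1 * weight)
    rmse_lin = np.sqrt(np.mean(resid_lin ** 2))

    # Log-log fit: log MPG on log weight. RMSE is in log MPG -- another scale.
    c1, c0 = np.polyfit(np.log10(weight), np.log10(mpg), 1)
    rmse_log = np.sqrt(np.mean((np.log10(mpg) - (c0 + c1 * np.log10(weight))) ** 2))

    # Honest comparison: back-transform the log-log predictions to MPG
    # and measure the errors on the original scale.
    rmse_back = np.sqrt(np.mean((mpg - 10 ** (c0 + c1 * np.log10(weight))) ** 2))

    print(f"linear RMSE           : {rmse_lin:.2f} MPG")
    print(f"log-log RMSE (nominal): {rmse_log:.3f} log MPG")
    print(f"log-log RMSE (in MPG) : {rmse_back:.2f} MPG")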

What does RMSE tell you?

    Fit Measured on Original Scale
    Sum of Squared Error      394.518
    Root Mean Square Error      1.894
    RSquare                     0.803
    Sum of Residuals            9.242

The RMSE gives the SD of the residuals. The RMSE thus estimates the concentration of the data around the fitted equation. Here is the plot of the residuals from the linear equation:

[Figure: residuals from the linear fit versus Weight (lb); residuals range from about -6 to 10.]

Visual inspection of the normal quantile plot of the residuals suggests the RMSE is around 2-3. If the data are roughly normal, then most of the residuals lie within about ±2 RMSE of their mean (at zero):

[Figure: normal quantile plot of the residuals.]
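Continuing the simulated arrays from the sketch above (this reuses resid_lin, so it is not standalone), two lines verify the empirical-rule reading of the RMSE:

    # RMSE is essentially the SD of the residuals; for roughly normal errors,
    # about 95% of residuals should fall within +/- 2 RMSE of zero.
    rmse = np.sqrt(np.mean(resid_lin ** 2))
    within = np.mean(np.abs(resid_lin) <= 2 * rmse)
    print(f"RMSE = {rmse:.2f}; share within +/- 2 RMSE = {within:.0%}")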

Key Applications

Providing a margin for error:
- For a feature of the population. How does spending on advertising affect the level of sales? Form confidence intervals as (estimate) ± 2 SE(estimate).
- For the prediction of another observation. Will sales exceed $10,000,000 next month? Form prediction intervals as (prediction) ± 2 RMSE.

Concepts and Terminology

Standard error. The same concept as in Stat 603/604, but now in regression models. We want a standard error for both the estimated slope and the intercept, with emphasis on the slope since it measures the expected impact of changes in the predictor upon values of the response. The expression is similar to that for SE(sample average), but with one important adjustment (page 91):

    SE(estimated slope) = SE(β̂₁) = (SD(error) / SD(predictor)) × (1 / √(sample size)) = σ / (√n · SD(X))

Confidence intervals and t-ratios for parameters. As in Stat 603/604, the empirical rule plus the CLT give intervals of the form

    (fitted intercept) ± 2 SE(fitted intercept)
    (fitted slope) ± 2 SE(fitted slope)

The t-ratio answers the question "How many SEs is the estimate away from zero?" (A computational sketch follows below.)
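Here is a minimal Python/NumPy sketch of these quantities. The function name and the division by n - 2 (the degrees of freedom left after estimating intercept and slope) are my own conventions, not from the lecture notes.

    import numpy as np

    def slope_inference(x, y):
        """Least squares slope, its SE, a +/- 2 SE interval, and the t-ratio."""
        n = len(x)
        b1, b0 = np.polyfit(x, y, 1)
        resid = y - (b0 + b1 * x)
        s = np.sqrt(np.sum(resid ** 2) / (n - 2))        # RMSE, estimates SD(error)
        # SE(slope) = SD(error) / (sqrt(n) * SD(X)), i.e. s / sqrt(sum (x - xbar)^2).
        se_b1 = s / np.sqrt(np.sum((x - np.mean(x)) ** 2))
        ci = (b1 - 2 * se_b1, b1 + 2 * se_b1)            # empirical-rule interval
        t_ratio = b1 / se_b1                             # how many SEs from zero?
        return b1, se_b1, ci, t_ratio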

In-sample prediction versus extrapolation
- In-sample prediction: we are able to check the model properties.
- Out-of-sample prediction: we hope the form of the model continues.
- The statistical extrapolation penalty assumes the model form continues. See the example of the display space transformation (p. 101-103).
- Another approach to the extrapolation penalty is to compare models, e.g., hurricane path uncertainty formed by comparing the predictions of different models.

Confidence intervals for the regression line

[Figure: fitted regression line with confidence bands that widen away from the mean of X.]

- Answers "Where do I think the population regression line lies?" Formed as (fitted line) ± 2 SE(fitted line).
- The regression line is the average value of the response for the chosen values of X.
- Statistical extrapolation penalty: the CI for the regression line grows wider as you move farther from the mean of the predictor (a sketch of the calculation follows below). Is this penalty reasonable, or optimistic (i.e., too narrow)?
- JMP-IN: the "Confid Curves Fit" option from the fit pop-up.
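A sketch of the band calculation, with the extrapolation penalty visible in the (x0 - x̄)² term; the ±2 multiplier stands in for the exact t quantile, and the function name is mine.

    import numpy as np

    def mean_response_ci(x, y, x0):
        """Approximate 95% CI for the regression line (average response) at x0."""
        n = len(x)
        b1, b0 = np.polyfit(x, y, 1)
        s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # RMSE
        sxx = np.sum((x - np.mean(x)) ** 2)
        # The SE of the fitted line grows as x0 moves away from the mean of X.
        se_line = s * np.sqrt(1.0 / n + (x0 - np.mean(x)) ** 2 / sxx)
        fit = b0 + b1 * x0
        return fit - 2 * se_line, fit + 2 * se_line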

Prediction intervals for individual observations

[Figure: fitted regression line with prediction bands wide enough to hold about 95% of the observations.]

- Answers "Where do I think a single new observation will fall?"
- The interval captures a single new random observation rather than an average, so it must accommodate the random variation about the fitted model.
- Holds about 95% of the data surrounding the fitted regression line.
- Approximate in-sample form: (fitted line) ± 2 RMSE.
- Typically more useful than CIs for the regression line: more often we are trying to predict a new observation than wondering where the average of a collection of future values lies.
- JMP-IN: the "Confid Curves Indiv" option from the fit pop-up.

Goodness-of-fit and R²
- R² is the proportion of variation in the response captured by the fitted model, and is thus a relative measure of the goodness of fit; the RMSE is an absolute measure of accuracy in the units of Y.
- Computed from the variability in the response and the residuals (p. 92), as a ratio of sums of squared deviations.
- Related to the RMSE: RMSE² approximately equals (1 - R²) Var(Y).
- R² is the square of the correlation between the predictor X and the response Y: corr(X, Y)² = R². Another way to answer "What's a big correlation?" (Both identities appear in the sketch below.)
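A companion sketch: the prediction interval adds the full error variance (the leading 1 under the square root) to the SE of the line, which is why it is wider and is roughly (fitted line) ± 2 RMSE in sample; the second function checks the two R² identities above. Function names are mine, not the notes'.

    import numpy as np

    def prediction_interval(x, y, x0):
        """Approximate 95% interval for a single new observation at x0."""
        n = len(x)
        b1, b0 = np.polyfit(x, y, 1)
        s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # RMSE
        sxx = np.sum((x - np.mean(x)) ** 2)
        # "1 +" carries the error variance of the new observation itself.
        se_pred = s * np.sqrt(1.0 + 1.0 / n + (x0 - np.mean(x)) ** 2 / sxx)
        fit = b0 + b1 * x0
        return fit - 2 * se_pred, fit + 2 * se_pred

    def r_squared_checks(x, y):
        """R^2 = 1 - SS(resid)/SS(total); equals corr(X, Y)^2, and
        RMSE^2 is approximately (1 - R^2) * Var(Y)."""
        b1, b0 = np.polyfit(x, y, 1)
        resid = y - (b0 + b1 * x)
        r2 = 1 - np.sum(resid ** 2) / np.sum((y - np.mean(y)) ** 2)
        corr2 = np.corrcoef(x, y)[0, 1] ** 2      # same value as r2
        rmse2_approx = (1 - r2) * np.var(y)       # ~ mean squared residual
        return r2, corr2, rmse2_approx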

Examples for Today

The ideal regression model (Utopia.jmp, page 47)
- Which features of a regression change when the sample size grows?
- Simulate data from a normal population, then add more observations (use the formula to generate more values).
- Some features stay about the same: intercept, slope, RMSE, R². (Each of these estimates a population feature.)
- Standard errors shrink, and confidence intervals become narrower. (A simulation sketch follows below.)

Philadelphia housing prices (Phila.jmp, page 62)
- How do crime rates impact the average selling price of houses?
- The initial plot shows that Center City is leveraged (unusual in X).
- The initial analysis with all the data finds a $577 impact per crime (p. 64); the residuals show a lack of normality (p. 65).
- Without Center City, the regression has a much steeper decay, $2289 per crime (p. 66); the residuals remain non-normal (p. 67).
- Why is Center City an outlier? What do we learn from this point? An alternative analysis with a transformation suggests it may be not so unusual (see pages 68-70).
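The Utopia demonstration runs in JMP-IN; the following Python sketch mimics it under made-up parameter values, showing the slope and RMSE holding steady while SE(slope) shrinks like 1/√n:

    import numpy as np

    rng = np.random.default_rng(1)
    beta0, beta1, sigma = 2.0, 0.5, 1.0      # illustrative population values

    for n in (25, 100, 400, 1600):
        x = rng.normal(0.0, 3.0, n)
        y = beta0 + beta1 * x + rng.normal(0.0, sigma, n)
        b1, b0 = np.polyfit(x, y, 1)
        s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
        se_b1 = s / np.sqrt(np.sum((x - np.mean(x)) ** 2))
        print(f"n={n:5d}  slope={b1:.3f}  RMSE={s:.3f}  SE(slope)={se_b1:.4f}  "
              f"CI=({b1 - 2 * se_b1:.3f}, {b1 + 2 * se_b1:.3f})")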

Housing construction (Cottages.jmp, page 89)
- How much can a builder expect to profit from building larger homes?
- A highly leveraged observation: the "special cottage" (p. 89).
- Contrast confidence intervals with prediction intervals; note the role of the assumptions of constant variance and normality.
- Model with the special cottage: R² ≈ 0.8, RMSE ≈ 3500 (p. 90). Predictions suggest larger homes are profitable.
- Model without the special cottage: R² ≈ 0.08, RMSE ≈ 3500 (p. 94-95). Predictions are useless.
- Recall the demonstration with the JMP-IN rubber-band regression line. Do we keep the outlier, or do we exclude it?

Liquor sales and display space (Display.jmp, page 99)
- How precise is our estimate of the optimal number of display feet? Can this model be used to predict sales for a promotion with 20 feet?
- The optimal display footage is not very sensitive to the transformation: the log model gives 2.77 feet (p. 19-20), whereas the reciprocal model gives 2.57 feet (p. 22).
- 95% confidence interval for the optimal footage (p. 100): under the log model, [2.38, 3.17].
- Predictions out to 20 feet are very sensitive to the transformation. The prediction interval at 20 feet is far from the range of the data, and the log interval does not include the reciprocal prediction (p. 111).
- Management has to decide which model to use. Have we captured the true uncertainty?

Key Take-Away Points

Standard error is again used in inference.
- The confidence interval for the slope is (fitted slope) ± 2 SE(slope).
- As the sample size grows, the confidence interval shrinks.
- If 0 is outside the interval, the predictor is significant.
- The formula for the standard error is different from that for an average: it depends on the variation in the predictor. More variation in the predictor gives a more precise slope.

R² as a measure of goodness of fit
- The square of the usual correlation between predictor and response.
- Popular as the percentage of explained variation.

Prediction intervals are used to measure the accuracy of predictions.
- Formed as the predicted value (from the equation) with a margin for error set as plus or minus twice the SD of the residuals: (predicted value) ± 2 RMSE (in-sample only!).
- The RMSE provides the baseline measure of predictive accuracy.

In-sample prediction vs. out-of-sample extrapolation
- Models are more reliable in-sample.
- The statistical extrapolation penalty is often too small because it assumes the model equation extends beyond the range of observation.

Role of outliers
- A single leveraged observation can skew the fit to the rest of the data.
- Outliers may be very informative; it is not a simple choice to delete them.

Next Time

Multiple regression: combining the information from several predictors to model more variation, the only way to get more accurate predictions.