Week 5: Multiple Linear Regression




BUS41100 Applied Regression Analysis
Week 5: Multiple Linear Regression
Parameter estimation and inference, forecasting, diagnostics, dummy variables

Robert B. Gramacy
The University of Chicago Booth School of Business
faculty.chicagobooth.edu/robert.gramacy/teaching

Beyond SLR

Many problems involve more than one independent variable or factor which affects the dependent or response variable:
- multi-factor asset pricing models (beyond CAPM)
- demand for a product given prices of competing brands, advertising, household attributes, etc.
- more than just size to predict house price!

In SLR, the conditional mean of Y depends on a single X. The multiple linear regression (MLR) model extends this idea to include more than one independent variable.

Categorical effects/dummy variables

Week 1 already introduced one type of multiple regression: regression of Y onto a categorical X,

E[Y | group = r] = β_0 + β_r,  for r = 1, ..., R-1,

i.e., ANOVA for grouped data. To represent these qualitative factors in multiple regression, we use dummy, binary, or indicator variables.

Dummy variables allow the mean (intercept) to shift by taking on the value 0 or 1.

Examples:
- temporal effects (1 if holiday season, 0 if not)
- spatial effects (1 if in Midwest, 0 if not)

If a factor X takes R possible levels, we can represent X through R-1 dummy variables:

E[Y | X] = β_0 + β_1 1_[X=2] + β_2 1_[X=3] + ... + β_{R-1} 1_[X=R]

(1_[X=r] = 1 if X = r, and 0 if X ≠ r.) What is E[Y | X = 1]?
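As a quick illustration of how these dummies arise in R, here is a minimal sketch using a made-up three-level factor (not from the course data); model.matrix() performs the same expansion that lm() applies internally:

## made-up spatial factor with R = 3 levels, for illustration only
region <- factor(c("Midwest", "South", "West", "South", "West"))

## intercept column plus R-1 = 2 dummies (regionSouth, regionWest);
## the first level, Midwest, is the baseline absorbed by the intercept
model.matrix(~ region)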

Recall the pick-up truck data example:

> pickup <- read.csv("pickup.csv")
> t1 <- lm(log(price) ~ make, data=pickup)
> summary(t1)  ## abbreviated output

Call:
lm(formula = log(price) ~ make, data = pickup)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.71499    0.23109  37.713   <2e-16 ***
makeFord     0.17725    0.31289   0.566    0.574
makeGMC     -0.05124    0.27505  -0.186    0.853

The coefficient values correspond to our dummy variables.

What if you also want to include mileage? No problem.

> t2 <- lm(log(price) ~ make + log(miles), data=pickup)
> summary(t2)  ## abbreviated output

Call:
lm(formula = log(price) ~ make + log(miles), data = pickup)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  13.83117    1.08608  12.735 5.17e-16 ***
makeFord      0.32419    0.25658   1.264    0.213
makeGMC       0.13265    0.22720   0.584    0.562
log(miles)   -0.46618    0.09747  -4.783 2.15e-05 ***

The MLR model

The MLR model is the same as always, but with more covariates:

Y | X_1, ..., X_d  ~ ind  N(β_0 + β_1 X_1 + ... + β_d X_d, σ^2)

Recall the key assumptions of our linear regression model:
(i) The conditional mean of Y is linear in the X_j variables.
(ii) The additive errors (deviations from the line) are
    - normally distributed
    - independent of each other
    - identically distributed (i.e., they have constant variance)

Our interpretation of regression coefficients can be extended from the simple single-covariate regression case:

β_j = ∂E[Y | X_1, ..., X_d] / ∂X_j

Holding all other variables constant, β_j is the average change in Y per unit change in X_j. (The ∂ symbol is from calculus and means "change in".)

If d = 2, we can plot the regression surface in 3D. Consider sales of a product as predicted by the price of this product (P1) and the price of a competing product (P2). Everything is measured on the log scale, of course.

The data and least squares

The data in multiple regression are a set of points with values for the output Y and for each input variable.

Data: Y_i and x_i = [X_1i, X_2i, ..., X_di], for i = 1, ..., n.

Or, as a data array (i.e., a data.frame):

Data = [ Y_1  X_11  X_21 ... X_d1 ]
       [  .     .     .        .  ]
       [ Y_n  X_1n  X_2n ... X_dn ]

So the model is

Y_i = β_0 + β_1 X_i1 + ... + β_d X_id + ε_i,   ε_i ~ iid N(0, σ^2).

How do we estimate the MLR model parameters?

The principle of least squares is unchanged; define:
- fitted values: Ŷ_i = b_0 + b_1 X_1i + b_2 X_2i + ... + b_d X_di
- residuals: e_i = Y_i - Ŷ_i
- standard error: s = sqrt( Σ_{i=1}^n e_i^2 / (n - p) ), where p = d + 1.

Then find the best-fitting plane, i.e., the coefficients b_0, b_1, b_2, ..., b_d, by minimizing the sum of squared residuals, Σ e_i^2 (equivalently, s^2).

What are the LS coefficient values?

Say that

Y = [ Y_1 ]      X̂ = [ 1  X_11  X_12 ... X_1d ]
    [  .  ]          [ .    .     .        .  ]
    [ Y_n ]          [ 1  X_n1  X_n2 ... X_nd ]

Then the estimates are

[b_0 ... b_d]' = b = (X̂'X̂)^{-1} X̂'Y.

Same intuition as for SLR: b captures the covariance between the X_j's and Y (X̂'Y), normalized by the input sum of squares (X̂'X̂).
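A small sketch verifying this formula numerically on simulated data (the inputs, coefficients, and object names below are made up for illustration):

## simulate a toy data set with d = 2 inputs
set.seed(1)
n <- 100
X1 <- rnorm(n); X2 <- rnorm(n)
Y <- 1 + 2*X1 - 3*X2 + rnorm(n, sd=0.5)

## design matrix with a leading column of ones, then b = (X'X)^{-1} X'Y
Xdesign <- cbind(1, X1, X2)
b <- solve(t(Xdesign) %*% Xdesign) %*% t(Xdesign) %*% Y
drop(b)                    ## least squares coefficients by hand
coef(lm(Y ~ X1 + X2))      ## should agree with lm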

Obtaining these estimates in R is very easy:

> salesdata <- read.csv("sales.csv")
> attach(salesdata)
> salesmlr <- lm(Sales ~ P1 + P2)
> salesmlr

Call:
lm(formula = Sales ~ P1 + P2)

Coefficients:
(Intercept)           P1           P2
      1.003       -1.006        1.098

Multiple vs simple linear regression

Basic concepts and techniques translate directly from SLR:
- Individual parameter inference and estimation is the same, conditional on the rest of the model.
- ANOVA is exactly the same, and the F-test still applies.
- Our diagnostics and transformations apply directly.
- We still use lm, summary, rstudent, predict, etc.
- We have been plotting high-dimensional (HD) data since Week 1.

The hardest part would be moving to matrix algebra to translate all of our equations. Luckily, R does all that for you.

Residual standard error

First off, the calculation for s^2 = var(e) is exactly the same:

s^2 = Σ_{i=1}^n (Y_i - Ŷ_i)^2 / (n - p),

where Ŷ_i = b_0 + Σ_j b_j X_ji and p = d + 1. The residual standard error is σ̂ = s = sqrt(s^2).
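A quick check of this formula, assuming the salesmlr fit and attached sales data from the earlier slide are still in the workspace:

## residual standard error computed from the definition
e <- residuals(salesmlr)                       ## e_i = Y_i - Yhat_i
n <- length(e); p <- length(coef(salesmlr))    ## p = d + 1 = 3
sqrt(sum(e^2) / (n - p))                       ## s by hand
summary(salesmlr)$sigma                        ## should match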

Residuals in MLR

As in the SLR model, the residuals in multiple regression are purged of any relationship to the independent variables. We decompose Y into the part predicted by the X's and the part due to idiosyncratic error:

Y = Ŷ + e,   with corr(X_j, e) = 0 for each j, and corr(Ŷ, e) = 0.
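These zero correlations can be verified numerically; a sketch assuming salesmlr and the attached P1, P2 variables from earlier:

e <- residuals(salesmlr)
cor(P1, e)                  ## essentially zero (up to rounding error)
cor(P2, e)
cor(fitted(salesmlr), e)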

Residual diagnostics for MLR

Consider the residuals from the sales data:

[Figure: scatterplots of the residuals against the fitted values, P1, and P2]

We use the same residual diagnostics (scatterplots, QQ plots, etc.).
- Plot residuals (raw or studentized) against Ŷ to see overall fit.
- Compare e or r against each X to identify problems.
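A sketch of code that produces plots like these (the plotting choices are mine; it assumes the salesmlr fit and attached sales data from earlier):

e <- residuals(salesmlr)
par(mfrow=c(1,3))           ## three panels side by side
plot(fitted(salesmlr), e, xlab="fitted", ylab="residuals"); abline(h=0, lty=2)
plot(P1, e, xlab="P1", ylab="residuals"); abline(h=0, lty=2)
plot(P2, e, xlab="P2", ylab="residuals"); abline(h=0, lty=2)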

Another great plot for MLR problems is to look at Y (true values) against Ŷ (fitted values).

> plot(salesmlr$fitted, Sales,
+      main="Fitted vs True Response for Sales data",
+      pch=20, col=4, ylab="Y", xlab="Y.hat")
> abline(0, 1, lty=2, col=8)

[Figure: fitted values Y.hat vs true response Y for the sales data, with the 45-degree line]

If things are working, these values should form a nice straight line.

Transforming variables in MLR

The transformation techniques are also the same as in SLR.
- For Y nonlinear in X_j, consider adding X_j^2 and other polynomial terms (see the sketch below).
- For nonconstant variance, use log(Y), √Y, or another transformation, such as Y/X_j, that moves the data to a linear scale.
- Use the log-log model for price/sales data and other multiplicative relationships.

Again, diagnosing the problem and finding a solution involves looking at lots of residual plots (against the different X_j's).
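For instance, a quadratic term can be added with I() inside the lm formula; a generic sketch on simulated data (names and values made up for illustration):

set.seed(2)
x <- runif(100, 1, 10)
y <- 2 + 0.5*x^2 + rnorm(100)
quadfit <- lm(y ~ x + I(x^2))   ## I() protects ^ from formula syntax
summary(quadfit)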

For example, the Sales, P1, and P2 variables were pre-transformed from raw values to a log scale. On the original scale, things don't look so good:

> expsalesmlr <- lm(exp(Sales) ~ exp(P1) + exp(P2))

[Figure: residuals from the untransformed fit plotted against the fitted values, exp(P1), and exp(P2), showing clear problems]

In particular, the studentized residuals are heavily right skewed. ("Studentizing" is the same, but leverage is now distance in d dimensions.)

> hist(rstudent(expsalesmlr), col=7,
+      xlab="Studentized Residuals", main="")

[Figure: histogram of the studentized residuals from expsalesmlr, heavily right skewed]

Our log-log transform fixes this problem.

Inference for coefficients

As before in SLR, the LS coefficients are random (different for each sample) and correlated with each other.

The LS estimators are unbiased: E[b_j] = β_j for j = 0, ..., d.

In particular, the sampling distribution for b is multivariate normal, with mean β = [β_0 ... β_d]' and covariance matrix S_b:

b ~ N_p(β, S_b)

Coefficient covariance matrix

var(b): the p x p covariance matrix for the random vector b is

S_b = [ var(b_0)        cov(b_0, b_1)   ...   cov(b_0, b_d) ]
      [ cov(b_1, b_0)   var(b_1)        ...        .        ]
      [      .               .           .         .        ]
      [ cov(b_d, b_0)       ...   cov(b_d, b_{d-1})  var(b_d) ]

Standard errors are the square roots of the diagonal of S_b.

Calculating the covariance is easy: S_b = s^2 (X̂'X̂)^{-1}

> X <- cbind(1, P1, P2)
> cov.b <- summary(salesmlr)$sigma^2 * solve(t(X)%*%X)
> print(cov.b)
                               P1            P2
    5.542812e-05 -5.393036e-05 -3.451517e-05
P1 -5.393036e-05  8.807396e-05  9.718699e-06
P2 -3.451517e-05  9.718699e-06  4.127672e-05
> se.b <- sqrt(diag(cov.b))
> se.b
                     P1          P2
0.007445007 0.009384773 0.006424696

The variance decreases with n and var(X); it increases with s^2.

Standard errors

Conveniently, R's summary gives you all the standard errors.

> summary(salesmlr)  ## abbreviated output

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.002688   0.007445   134.7   <2e-16 ***
P1          -1.005900   0.009385  -107.2   <2e-16 ***
P2           1.097872   0.006425   170.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01453 on 97 degrees of freedom
Multiple R-squared: 0.998,  Adjusted R-squared: 0.9979
F-statistic: 2.392e+04 on 2 and 97 DF,  p-value: < 2.2e-16

Inference for individual coefficients

Intervals and t-statistics are exactly the same as in SLR.
- A (1 - α)100% C.I. for β_j is b_j ± t_{α/2, n-p} s_bj.
- z_bj = (b_j - β_j^0)/s_bj ~ t_{n-p}(0, 1) is the number of standard errors between the LS estimate and the null value.

Intervals/testing via b_j and s_bj are one-at-a-time procedures: you are evaluating the j-th coefficient conditional on the other X's being in the model, but regardless of the values you've estimated for the other b's.
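A sketch of the interval for one coefficient, both from the formula and via confint(), assuming the salesmlr fit from earlier:

## 95% C.I. for the P1 coefficient: b_j +/- t_{alpha/2, n-p} * s_bj
b  <- coef(summary(salesmlr))["P1", "Estimate"]
sb <- coef(summary(salesmlr))["P1", "Std. Error"]
n  <- length(residuals(salesmlr)); p <- length(coef(salesmlr))
b + c(-1, 1) * qt(0.975, df = n - p) * sb
confint(salesmlr, "P1", level = 0.95)          ## should agree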

R^2 for multiple regression

Recall that we already view R^2 as a measure of fit:

R^2 = SSR/SST = Σ_{i=1}^n (Ŷ_i - Ȳ)^2 / Σ_{i=1}^n (Y_i - Ȳ)^2.

And the correlation interpretation is very similar:

R^2 = corr^2(Ŷ, Y) = r^2_{ŷy}.

(r_{ŷy} = r_{xy} in SLR, since corr(X, Ŷ) = 1.)

Consider our marketing model:

> summary(salesmlr)$r.square
[1] 0.9979764

P1 and P2 explain most of the variability in log volume.

Consider the pickup regressions (t1: make, t2: make + log(miles)):

> summary(t1)$r.square
[1] 0.01806738
> summary(t2)$r.square
[1] 0.3643183

Make is pretty useless, but miles gets us up to R^2 = 36%.
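Both formulas for R^2 can be checked against that number directly; a sketch assuming salesmlr and the attached Sales variable:

Yhat <- fitted(salesmlr)
sum((Yhat - mean(Sales))^2) / sum((Sales - mean(Sales))^2)   ## SSR/SST
cor(Yhat, Sales)^2                                           ## corr^2(Yhat, Y)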

Forecasting in MLR

Prediction follows exactly the same methodology as in SLR. For new data x_f = [X_1,f ... X_d,f],

E[Y_f | x_f] = Ŷ_f = b_0 + b_1 X_1,f + ... + b_d X_d,f
var[Y_f | x_f] = var(Ŷ_f) + var(e_f) = s_fit^2 + s^2 = s_pred^2.

With X̂ the design matrix defined earlier and x̂_f = [1, X_1,f, ..., X_d,f]',

s_fit^2 = s^2 x̂_f'(X̂'X̂)^{-1} x̂_f.

A (1 - α) level prediction interval is still Ŷ_f ± t_{α/2, n-p} s_pred.

The syntax in R is also exactly the same as before:

> predict(salesmlr, data.frame(P1=1, P2=1),
+         interval="prediction", level=0.95)
       fit      lwr      upr
1 1.094661 1.064015 1.125306

And we can get s_fit using our equation, or from R:

> xf <- matrix(c(1,1,1), ncol=1)
> X <- cbind(1, P1, P2)
> s <- summary(salesmlr)$sigma
> sqrt(drop(t(xf)%*%solve(t(X)%*%X)%*%xf))*s
[1] 0.005227347
> predict(salesmlr, data.frame(P1=1, P2=1),
+         se.fit=TRUE)$se.fit
[1] 0.005227347
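The full prediction interval can also be assembled by hand from s_fit and s; a sketch continuing the code above:

## s_pred^2 = s_fit^2 + s^2, then Yhat_f +/- t_{alpha/2, n-p} * s_pred
s.fit  <- predict(salesmlr, data.frame(P1=1, P2=1), se.fit=TRUE)$se.fit
s.pred <- sqrt(s.fit^2 + s^2)
Yhat.f <- predict(salesmlr, data.frame(P1=1, P2=1))
n <- length(residuals(salesmlr)); p <- length(coef(salesmlr))
Yhat.f + c(-1, 1) * qt(0.975, df = n - p) * s.pred   ## should reproduce lwr and upr above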

Glossary and equations

MLR updates to the LS equations:

b = (X̂'X̂)^{-1} X̂'Y
var(b) = S_b = s^2 (X̂'X̂)^{-1}
s_fit^2 = s^2 x̂_f'(X̂'X̂)^{-1} x̂_f
R^2 = SSR/SST = corr^2(Ŷ, Y) = r^2_{ŷy}