Multiple Linear Regression


A regression with two or more explanatory variables is called a multiple regression. Rather than modeling the mean response as a straight line, as in simple regression, it is now modeled as a function of several explanatory variables.

The function lm can be used to perform multiple linear regression in R, and much of the syntax is the same as that used for fitting simple linear regression models. To perform multiple linear regression with p explanatory variables, use the command:

lm(response ~ explanatory_1 + explanatory_2 + ... + explanatory_p)

Here the terms response and explanatory_i in the function should be replaced by the names of the response and explanatory variables, respectively, used in the analysis.

Ex. Data was collected on 100 houses recently sold in a city. It consisted of the sales price (in $), house size (in square feet), the number of bedrooms, the number of bathrooms, the lot size (in square feet) and the annual real estate tax (in $). The following program reads in the data.

> Housing = read.table("C:/Users/Martin/Documents/W2024/housing.txt", header=TRUE)
> Housing
    Taxes Bedrooms Baths  Price Size   Lot
1    1360        3   2.0 145000 1240 18000
2    1050        1   1.0  68000  370 25000
..
99   1770        3   2.0  88400 1560 12000
100  1430        3   2.0 127200 1340 18000

Suppose we are only interested in working with a subset of the variables (e.g., Price, Size and Lot). It is possible (but not necessary) to construct a new data frame consisting solely of these values using the commands:

> myvars = c("Price", "Size", "Lot")
> Housing2 = Housing[myvars]
> Housing2
     Price Size   Lot
1   145000 1240 18000
2    68000  370 25000
..
99   88400 1560 12000
100 127200 1340 18000

Before fitting our regression model we want to investigate how the variables are related to one another. We can do this graphically by constructing scatter plots of all pair-wise combinations of variables in the data frame. This can be done by typing:

> plot(Housing2)

To fit a multiple linear regression model with Price as the response variable and Size and Lot as the explanatory variables, use the command:

> results = lm(Price ~ Size + Lot, data=Housing)
> results

Call:
lm(formula = Price ~ Size + Lot, data = Housing)

Coefficients:
(Intercept)         Size          Lot
 -10535.951       53.779        2.840

This output indicates that the fitted equation is given by

ŷ = -10536 + 53.8x1 + 2.8x2
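The fitted equation can also be reproduced programmatically by extracting the coefficients; the snippet below is a sketch that assumes the results object from the lm() call above and the data for house 1 (Size = 1240, Lot = 18000).

```r
# Sketch: recompute the fitted value for house 1 by hand
b <- coef(results)                        # (Intercept), Size, Lot
yhat1 <- b[1] + b[2]*1240 + b[3]*18000    # Size and Lot of house 1

# The same value is returned directly by the extractor function:
fitted(results)[1]
```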

A. Testing the significance of the model

Inference in the multiple regression setting is typically performed in a number of steps. We begin by testing whether the explanatory variables collectively have an effect on the response variable, i.e.,

H0: β1 = β2 = ... = βp = 0

If we can reject this hypothesis, we continue by testing whether the individual regression coefficients are significant while controlling for the other variables in the model. We can access the results of each test by typing:

> summary(results)

Call:
lm(formula = Price ~ Size + Lot, data = Housing)

Residuals:
   Min     1Q Median     3Q    Max
-81681 -19926   2530  17972  84978

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.054e+04  9.436e+03  -1.117    0.267
Size         5.378e+01  6.529e+00   8.237 8.39e-13 ***
Lot          2.840e+00  4.267e-01   6.656 1.68e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30590 on 97 degrees of freedom
Multiple R-squared: 0.7114,	Adjusted R-squared: 0.7054
F-statistic: 119.5 on 2 and 97 DF,  p-value: < 2.2e-16

The output shows that F = 119.5 (p < 2.2e-16), indicating that we should clearly reject the null hypothesis that the variables Size and Lot collectively have no effect on Price. The results also show that the variable Size is significant controlling for the variable Lot (p = 8.39e-13), as is Lot controlling for the variable Size (p = 1.68e-09). In addition, the output shows that R² = 0.7114 and adjusted R² = 0.7054.

B. Testing a subset of variables using a partial F-test

Sometimes we are interested in simultaneously testing whether a certain subset of the coefficients are equal to 0 (e.g., β3 = β4 = 0). We can do this using a partial F-test. This test involves comparing the SSE from a reduced model (excluding the parameters we hypothesize are equal to zero) with the SSE from the full model (including all of the parameters).
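The comparison just described reduces to a small calculation; the helper below is a sketch of that arithmetic (the function name and argument names are illustrative, not part of R's API).

```r
# Sketch: partial F-statistic from the reduced- and full-model
# error sums of squares.
#   q       : number of coefficients set to zero under H0
#   df_full : residual degrees of freedom of the full model
partial_F <- function(sse_reduced, sse_full, q, df_full) {
  ((sse_reduced - sse_full) / q) / (sse_full / df_full)
}

# Plugging in the two RSS values that anova() reports below
# reproduces its F statistic:
# partial_F(9.0756e10, 8.5672e10, q = 2, df_full = 95)  # about 2.82
```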

In R we can perform partial F-tests by fitting both the reduced and full models separately and thereafter comparing them using the anova function.

Ex. Suppose we include the variables Bedrooms, Baths, Size and Lot in our model and are interested in testing whether the number of bedrooms and bathrooms are significant after taking Size and Lot into consideration. The following code performs the partial F-test:

> reduced = lm(Price ~ Size + Lot, data=Housing)                   # Reduced model
> full = lm(Price ~ Size + Lot + Bedrooms + Baths, data=Housing)   # Full model
> anova(reduced, full)                                             # Compare the models

Analysis of Variance Table

Model 1: Price ~ Size + Lot
Model 2: Price ~ Size + Lot + Bedrooms + Baths
  Res.Df        RSS Df  Sum of Sq      F  Pr(>F)
1     97 9.0756e+10
2     95 8.5672e+10  2 5083798629 2.8186 0.06469 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output shows the results of the partial F-test. Since F = 2.82 (p-value = 0.0647), we cannot reject the null hypothesis (β3 = β4 = 0) at the 5% level of significance. It appears that the variables Bedrooms and Baths do not contribute significant information to the sales price once the variables Size and Lot have been taken into consideration.

C. Confidence and Prediction Intervals

We often use our regression models to estimate the mean response or predict future values of the response variable for certain values of the explanatory variables. The function predict() can be used to make both confidence intervals for the mean response and prediction intervals. To make confidence intervals for the mean response, use the option interval="confidence". To make a prediction interval, use the option interval="prediction". By default these are 95% confidence and prediction intervals; if you instead want a 99% confidence or prediction interval, use the option level=0.99.

Ex. Obtain a 95% confidence interval for the mean sales price of houses whose size is 1,000 square feet and lot size is 20,000 square feet.
> results = lm(Price ~ Size + Lot, data=Housing)

> predict(results, data.frame(Size=1000, Lot=20000), interval="confidence")
          fit      lwr      upr
[1,] 100050.1 90711.45 109388.7

A 95% confidence interval is given by (90711, 109389).

Ex. Obtain a 95% prediction interval for the sales price of a particular house whose size is 1,000 square feet and lot size is 20,000 square feet.

> predict(results, data.frame(Size=1000, Lot=20000), interval="prediction")
          fit      lwr      upr
[1,] 100050.1 38627.08 161473.0

A 95% prediction interval is given by (38627, 161473). Note that this is quite a bit wider than the confidence interval, indicating that the variation about the mean is fairly large.
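As noted above, the level= option changes the coverage. For example, a 99% prediction interval for the same hypothetical house could be obtained as follows (assuming the results object fitted earlier); it will be wider still than the 95% interval.

```r
# 99% prediction interval for a house with Size = 1000 and Lot = 20000
predict(results, data.frame(Size=1000, Lot=20000),
        interval="prediction", level=0.99)
```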