Comparing Nested Models




Comparing Nested Models (ST 430/514)

Two models are nested if one model contains all the terms of the other, plus at least one additional term. The larger model is the complete (or full) model, and the smaller is the reduced (or restricted) model. Example: with two independent variables x1 and x2, possible terms are x1, x2, x1*x2, x1^2, and so on.

1 / 20 Multiple Linear Regression Comparing Nested Models

Consider three models:

First order:        E(Y) = b0 + b1 x1 + b2 x2
Interaction:        E(Y) = b0 + b1 x1 + b2 x2 + b3 x1 x2
Full second order:  E(Y) = b0 + b1 x1 + b2 x2 + b3 x1 x2 + b4 x1^2 + b5 x2^2

The first order model is nested within both the interaction model and the full second order model. The interaction model is nested within the full second order model. We usually want to use the simplest (most parsimonious) model that adequately fits the observed data. One way to decide between a full model and a reduced model is to test

H0: the reduced model is adequate
Ha: the full model is better.

When the full model has exactly one more term than the reduced model, we can use a t-test. E.g., testing the interaction model

E(Y) = b0 + b1 x1 + b2 x2 + b3 x1 x2

against the first order model

E(Y) = b0 + b1 x1 + b2 x2.

Here H0: the reduced model is adequate is the same as H0: b3 = 0, so the usual t-statistic is the relevant test statistic.

When the full model has more than one additional term, we use an F-test, which generalizes the t-test. Basic idea: fit both models, and test whether the full model fits significantly better than the reduced model:

F = (drop in SSE / number of extra terms) / (s^2 for the full model),

where SSE is the sum of squared residuals and s^2 is the residual mean square of the full model. When H0 is true, F follows the F-distribution, which we use to find the P-value.

E.g., testing the (full) second order model

E(Y) = b0 + b1 x1 + b2 x2 + b3 x1 x2 + b4 x1^2 + b5 x2^2

against the (reduced) interaction model

E(Y) = b0 + b1 x1 + b2 x2 + b3 x1 x2.

Here H0 is b4 = b5 = 0, and Ha is that at least one of b4, b5 is nonzero. In R, the output of lm() is not convenient for carrying out this test; aov() is better.
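As an aside, two nested lm() fits can also be compared directly with anova(), which computes the same partial F-test. A minimal sketch on simulated data (x1, x2, and y here are made up for illustration; they are not the express data used below):

```r
# Compare nested models with anova(): F-test of H0 that the extra
# coefficients (b4, b5 here) are zero.
set.seed(42)
n  <- 30
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n, sd = 0.5)

reduced <- lm(y ~ x1 + x2 + I(x1 * x2))                     # interaction model
full    <- lm(y ~ x1 + x2 + I(x1 * x2) + I(x1^2) + I(x2^2)) # full second order

anova(reduced, full)   # partial F-test: H0 is b4 = b5 = 0
```

The F value printed by anova() is exactly (drop in RSS / 2) divided by the full model's residual mean square.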

summary(aov(cost ~ Weight + Distance + I(Weight * Distance)
            + I(Weight^2) + I(Distance^2), express))

                     Df Sum Sq Mean Sq  F value   Pr(>F)
Weight                1 270.55  270.55 1380.001 2.17e-15 ***
Distance              1 143.63  143.63  732.616 1.72e-13 ***
I(Weight * Distance)  1  31.27   31.27  159.487 4.84e-09 ***
I(Weight^2)           1   3.80    3.80   19.383 0.000602 ***
I(Distance^2)         1   0.09    0.09    0.451 0.512657
Residuals            14   2.74    0.20
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(aov(cost ~ Weight + Distance + I(Weight * Distance), express))

                     Df Sum Sq Mean Sq F value   Pr(>F)
Weight                1 270.55  270.55  652.59 2.14e-14 ***
Distance              1 143.63  143.63  346.45 2.89e-12 ***
I(Weight * Distance)  1  31.27   31.27   75.42 1.88e-07 ***
Residuals            16   6.63    0.41
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These are sequential sums of squares, adding each term to the model in order. See the Residuals line in each set of results: SSE(full second order) = 2.74 and SSE(interaction) = 6.63, so

F = (6.63 - 2.74)/2 / 0.20 = 9.725,  P < .01.
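The arithmetic and the P-value can be reproduced directly in R (the numbers below are copied from the two aov() tables above):

```r
# Drop-in-SSE F-test computed from the printed ANOVA tables
sse_reduced <- 6.63    # Residuals Sum Sq, interaction model (16 df)
sse_full    <- 2.74    # Residuals Sum Sq, full second order model (14 df)
s2_full     <- 0.20    # Residuals Mean Sq, full second order model

F_stat <- ((sse_reduced - sse_full) / 2) / s2_full   # 2 extra terms
F_stat                                  # 9.725
pf(F_stat, 2, 14, lower.tail = FALSE)   # ~ 0.0022, so P < .01
```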

You can, in fact, calculate F from the output for the full model alone. Because the terms are added sequentially, the sums of squares for the common terms (Weight, Distance, and Weight * Distance) are the same in both models. In the reduced model, the extra terms (Weight^2 and Distance^2) are absent, and their combined sum of squares, 3.80 + 0.09 = 3.89, is exactly the increase in SSE: 6.63 - 2.74 = 3.89.

So we can also calculate

F = (Sum Sq for Weight^2 + Sum Sq for Distance^2)/2 / (Mean Sq for Residuals)

using only the output for the full model. Note that F was calculated imprecisely, because of rounding. We can get more digits using print(summary(...), digits = 8), for example, or calculate F to full precision:

s <- summary(aov(cost ~ Weight + Distance + I(Weight * Distance)
                 + I(Weight^2) + I(Distance^2), express))[[1]]
sum(s[c("I(Weight^2)", "I(Distance^2)"), "Sum Sq"]) / 2 / s["Residuals", "Mean Sq"]

We could use the same F-test when there is only one additional term in the full model, based on just one line in the ANOVA table, provided it is the last term in the formula. It appears very different from the t-test described earlier, but some matrix algebra shows that it is, in fact, exactly the same test: the F-statistic is exactly the square of the t-statistic, the F critical values are exactly the squares of the (two-sided) t critical values, and so the P-value is exactly the same.
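The t^2 = F identity is easy to check numerically. A sketch on simulated data (variable names invented for illustration):

```r
# For the LAST term in the formula, the sequential-ANOVA F equals the
# square of that coefficient's t-statistic.
set.seed(1)
x1 <- runif(25)
x2 <- runif(25)
y  <- 1 + x1 + x2 + rnorm(25)

fit   <- lm(y ~ x1 + x2 + I(x1 * x2))
t_val <- summary(fit)$coefficients["I(x1 * x2)", "t value"]
F_val <- anova(fit)["I(x1 * x2)", "F value"]

all.equal(t_val^2, F_val)   # TRUE
```

For earlier terms in the formula this need not hold, because sequential sums of squares depend on the order of entry.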

Complete Example: Road Construction Cost

Data from the Florida Attorney General's office:

y  = successful bid;
x1 = DOT engineer's estimate of cost;
x2 = indicator of fixed bidding: x2 = 1 if fixed, 0 if competitive.

12 / 20 Multiple Linear Regression A Complete Example

Get the data and plot them:

flag <- read.table("text/exercises&examples/flag.txt", header = TRUE)
pairs(flag[, -1])

Section 4.14 suggests beginning with the full second order model and simplifying it as far as possible (but no further!). We'll take the opposite approach: begin with the first order model, and complicate it as far as necessary. Because x2 is an indicator variable, the first order model is a pair of parallel straight lines:

summary(lm(COST ~ DOTEST + STATUS, flag))

First order model

Call:
lm(formula = COST ~ DOTEST + STATUS, data = flag)

Residuals:
     Min       1Q   Median       3Q      Max
-2199.94   -73.83     7.76    53.68  1722.42

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -20.537724  26.817718  -0.766 0.444558
DOTEST        0.930781   0.009744  95.519  < 2e-16 ***
STATUS      166.357224  49.287822   3.375 0.000864 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 306.3 on 232 degrees of freedom
Multiple R-squared: 0.9755,  Adjusted R-squared: 0.9752
F-statistic: 4610 on 2 and 232 DF,  p-value: < 2.2e-16

Both variables look important. The DOTEST coefficient is close to 1, so the winning bids roughly track the estimated cost. The positive STATUS coefficient means the line for STATUS = 1 is higher than the line for STATUS = 0. Are the slopes different? Try the interaction model:

summary(lm(COST ~ DOTEST * STATUS, flag))

Interaction model

Call:
lm(formula = COST ~ DOTEST * STATUS, data = flag)

Residuals:
     Min       1Q   Median       3Q      Max
-2143.12   -43.21     1.39    40.17  1765.99

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -6.428025  26.208287  -0.245    0.806
DOTEST         0.921338   0.009723  94.755  < 2e-16 ***
STATUS        28.673189  58.661711   0.489    0.625
DOTEST:STATUS  0.163282   0.040431   4.039 7.32e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 296.7 on 231 degrees of freedom
Multiple R-squared: 0.9771,  Adjusted R-squared: 0.9768
F-statistic: 3281 on 3 and 231 DF,  p-value: < 2.2e-16

The interaction term is highly significant: reject the first order model in favor of the interaction model. The slopes are 0.921338 for STATUS = 0, and 0.921338 + 0.163282 = 1.08462 for STATUS = 1. So the competitive auctions are won with bids that fall slightly below the estimated cost, while the fixed winning bids fall slightly above the estimated cost.
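The slope arithmetic can be checked directly from the coefficients in the output above:

```r
# Slopes implied by the interaction model (values copied from the fit)
b_dotest  <- 0.921338   # DOTEST coefficient: slope when STATUS = 0
b_interac <- 0.163282   # DOTEST:STATUS coefficient: change in slope
b_dotest + b_interac    # 1.08462: slope when STATUS = 1
```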

To validate the interaction model, we compare it with (finally!) the full second order model.

Note: when some variables are qualitative, the full second order model consists of the full second order (i.e., quadratic) model in the quantitative variables, plus the interactions of those terms with the qualitative variables:

summary(lm(COST ~ DOTEST + STATUS + I(DOTEST * STATUS)
           + I(DOTEST^2) + I(DOTEST^2 * STATUS), flag))

Full second order model

Call:
lm(formula = COST ~ DOTEST + STATUS + I(DOTEST * STATUS) + I(DOTEST^2) +
    I(DOTEST^2 * STATUS), data = flag)

Residuals:
     Min       1Q   Median       3Q      Max
-2143.50   -35.38     1.27    46.58  1771.19

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)          -2.972e+00  3.089e+01  -0.096  0.92344
DOTEST                9.155e-01  2.917e-02  31.385  < 2e-16 ***
STATUS               -3.673e+01  7.477e+01  -0.491  0.62375
I(DOTEST * STATUS)    3.242e-01  1.192e-01   2.721  0.00702 **
I(DOTEST^2)           7.191e-07  3.404e-06   0.211  0.83288
I(DOTEST^2 * STATUS) -3.576e-05  2.478e-05  -1.443  0.15041
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 296.6 on 229 degrees of freedom
Multiple R-squared: 0.9773,  Adjusted R-squared: 0.9768
F-statistic: 1970 on 5 and 229 DF,  p-value: < 2.2e-16

Full second order model, ANOVA

summary(aov(COST ~ DOTEST + STATUS + I(DOTEST * STATUS)
            + I(DOTEST^2) + I(DOTEST^2 * STATUS), flag))

                      Df    Sum Sq   Mean Sq  F value   Pr(>F)
DOTEST                 1 864038187 864038187 9818.947  < 2e-16 ***
STATUS                 1   1069006   1069006   12.148 0.000589 ***
I(DOTEST * STATUS)     1   1435733   1435733   16.316 7.32e-05 ***
I(DOTEST^2)            1        15        15    0.000 0.989487
I(DOTEST^2 * STATUS)   1    183210    183210    2.082 0.150411
Residuals            229  20151321     87997
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The significance of the two added terms is tested using an F-statistic with 2 degrees of freedom in the numerator; its value is only slightly greater than 1, which is completely consistent with the null hypothesis that the interaction model is adequate.
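That final F-statistic and its P-value can be computed from the printed table (numbers copied from the aov() output above):

```r
# F-test for the two quadratic terms, from the sequential ANOVA table
ss_extra <- 15 + 183210   # combined Sum Sq of I(DOTEST^2) and I(DOTEST^2 * STATUS)
ms_resid <- 87997         # Residuals Mean Sq on 229 df

F_stat <- (ss_extra / 2) / ms_resid
F_stat                                    # ~ 1.04, barely above 1
pf(F_stat, 2, 229, lower.tail = FALSE)    # ~ 0.35: keep the interaction model
```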