Chapter 11: Linear Regression - Inference in Regression Analysis - Part 2

Note: Whether we calculate confidence intervals or perform hypothesis tests, we need the distribution of the statistic we will use. Below is a quick review of hypothesis testing. Please also see Chapters 5 and 6.

Terminology

statistical hypothesis - a conjecture about a population parameter.

null hypothesis - symbolized by $H_0$. The null hypothesis states that a parameter is equal to a specific value (often that value will be zero), or perhaps $\le$ or $\ge$ that value.

alternative hypothesis - symbolized by $H_A$. For our purposes, it states that a parameter is not equal to (or possibly $>$ or $<$) a specific value.

statistical test - uses sample data to make a decision about $H_0$.

test statistic (a.k.a. test value) - a value computed from the sample data (for example, $\bar{x}$ or $s^2$).

level of significance - the maximum probability of committing a Type I error (usually denoted by $\alpha$).

critical value - separates the critical region (the range of values that would indicate a significant difference, and hence rejection of $H_0$) from the noncritical region.

Possible outcomes of a hypothesis test:

                      H_0 is true        H_0 is false
  Reject H_0          Type I error       correct decision
  Do not reject H_0   correct decision   Type II error

Example - Hypothesis test for the mean using a t-test. A researcher is interested in testing the hypothesis that male freshmen gain more than 5 lbs in their first academic year. 25 freshmen participated in the study. Their beginning and ending weights were obtained, and the difference between ending and beginning weight was computed for each student. The following statistics were calculated: $\bar{x} = 7.5$, $s^2 = 56.25$, $n = 25$.

$$H_0: \mu \le 5 \qquad H_A: \mu > 5$$

test statistic: $t = \dfrac{\bar{x} - \mu_0}{\sqrt{s^2/n}} \sim$ t-dist with d.f. $= n - 1$

For this example, $t = \dfrac{7.5 - 5}{\sqrt{56.25/25}} = 1.67$. The critical value for $\alpha = .05$ with 24 degrees of freedom is 1.711. Since $t$ is less than the critical value, we cannot reject $H_0$ (see Fig. 1).

Conclusion: The evidence does not suggest a mean weight gain of more than 5 lbs.
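For readers who want to reproduce the arithmetic, here is a minimal Python sketch of the weight-gain test using the summary statistics from the example; Python and scipy are illustrative choices here, not tools from the original notes:

```python
# A minimal sketch of the weight-gain t-test above, using the example's
# summary statistics; scipy is assumed to be available.
from scipy import stats

xbar, s2, n = 7.5, 56.25, 25        # sample mean, variance, and size
mu0 = 5                             # boundary value under H0: mu <= 5

t = (xbar - mu0) / (s2 / n) ** 0.5  # t = 2.5 / 1.5 = 1.67
cv = stats.t.ppf(0.95, df=n - 1)    # upper .05 critical value, df = 24

print(f"t = {t:.3f}, critical value = {cv:.3f}")
# t (1.667) < cv (1.711), so we fail to reject H0
```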

Inference about $\beta_1$

The distribution of the estimate of $\beta_1$ is given by

$$\hat{\beta}_1 \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}\right)$$

We will use an estimate of the variance of $\hat{\beta}_1$. An estimate of $\sigma^2$ is given by the mean square error (MSE). Thus,

$$s^2(\hat{\beta}_1) = \frac{MSE}{\sum_i (x_i - \bar{x})^2}.$$

Using the distribution of our test statistic,

$$\frac{\hat{\beta}_1 - \beta_1}{\sqrt{s^2(\hat{\beta}_1)}} \sim t_{(n-2)},$$

it follows that a $(1-\alpha)100\%$ confidence interval for $\beta_1$ is given by

$$\hat{\beta}_1 \pm t_{(1-\alpha/2;\ d.f. = n-2)} \sqrt{s^2(\hat{\beta}_1)}$$

where $s^2(\hat{\beta}_1)$ is as defined above.

Example - We will use the SAS program reg example1.sas. The data represent 20 college entrance exam scores (our predictor variable) and the respective GPA (response) at the end of the freshman year. From the output we note the following:

$\hat{\beta}_1 = .83991$ and $s(\hat{\beta}_1) = \sqrt{s^2(\hat{\beta}_1)} = .14405$.

We can construct a 95% confidence interval for $\beta_1$. For $\alpha = .05$, the correct percentiles come from a t-dist with $20 - 2 = 18$ degrees of freedom. Because of the symmetry of the t-dist, the 2.5th and 97.5th percentiles are $-2.101$ and $+2.101$, respectively. Our 95% confidence interval is calculated as

$$.83991 \pm (2.101)(.14405),$$

which yields the interval $(.537261,\ 1.142559)$.

Interpretation: With 95% confidence, we estimate the mean increase in GPA to be between .54 and 1.14 (per unit increase in the entrance exam score).

Does the 95% confidence interval for $\beta_1$ include 0? What does this mean?

Testing for a linear relationship: The statistical hypothesis to test for a linear relationship (in the simple linear regression case) is given by

$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_A: \beta_1 \ne 0$$

The test statistic is given by

$$t = \frac{\hat{\beta}_1}{\sqrt{s^2(\hat{\beta}_1)}} \sim t_{(n-2)}$$

Recall from the SAS program reg example1.sas that $\hat{\beta}_1 = .83991$ and $s(\hat{\beta}_1) = .14405$. Our test statistic is $t = .83991/.14405 = 5.8307$. Our critical value for the test (based on $\alpha = .05$), from a t-dist with 18 degrees of freedom, is 2.101 (see Fig. 2).

Conclusion: Reject $H_0$. There appears to be a linear relationship between entrance exam scores and grade point average.
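Both the interval and the test can be reproduced from the reported output values alone. A minimal sketch (Python and scipy assumed; the course's own output comes from SAS):

```python
# Sketch reproducing the interval and test for beta_1 from the reported
# output (beta1_hat = .83991, s(beta1_hat) = .14405, n = 20).
from scipy import stats

b1, se_b1, n = 0.83991, 0.14405, 20
t_mult = stats.t.ppf(0.975, df=n - 2)   # 97.5th percentile, 18 df: 2.101

lo, hi = b1 - t_mult * se_b1, b1 + t_mult * se_b1
print(f"95% CI for beta_1: ({lo:.6f}, {hi:.6f})")  # approx (.537, 1.143)

t_stat = b1 / se_b1                     # test of H0: beta_1 = 0
print(f"t = {t_stat:.4f}")              # 5.8307, which exceeds 2.101
```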

Inference about $\beta_0$

If the scope of the model includes $x = 0$, we may want to make inference about $\beta_0$. We will use the following to construct a confidence interval for $\beta_0$:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$

An estimate of the variance of $\hat{\beta}_0$ is given by

$$s^2(\hat{\beta}_0) = MSE\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}\right)$$

where $MSE = \frac{SSE}{n-2}$. Recall that SSE is given by

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_i (Y_i - \hat{Y}_i)^2.$$

Using the above we have the following:

$$\frac{\hat{\beta}_0 - \beta_0}{\sqrt{s^2(\hat{\beta}_0)}} \sim t_{(n-2)}$$

A $(1-\alpha)100$ percent confidence interval for $\beta_0$ is given by

$$\hat{\beta}_0 \pm t_{(1-\alpha/2;\ df = n-2)} \sqrt{s^2(\hat{\beta}_0)}$$
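As an illustration of these formulas, here is a hedged sketch computed from raw data; the x and y arrays are hypothetical stand-ins, not the exam-score/GPA data used in reg example1.sas:

```python
# Hedged sketch of the beta_0 interval computed from (hypothetical) raw data.
import numpy as np
from scipy import stats

x = np.array([3.0, 4.5, 5.1, 6.2, 7.0, 8.3])   # hypothetical predictor
y = np.array([1.8, 2.1, 2.6, 2.9, 3.2, 3.6])   # hypothetical response
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()                  # beta0_hat = ybar - beta1_hat * xbar

mse = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)   # SSE / (n - 2)
s2_b0 = mse * (1 / n + x.mean() ** 2 / sxx)        # variance estimate for beta0_hat

t_mult = stats.t.ppf(0.975, df=n - 2)
half = t_mult * s2_b0 ** 0.5
print(f"beta0_hat = {b0:.4f}, 95% CI: ({b0 - half:.4f}, {b0 + half:.4f})")
```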

Notes on inference regarding $\beta_0$ and $\beta_1$:

The sampling distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$ will still be approximately normal for minor departures from normality. If departures from normality are serious, large sample sizes will still provide estimates of $\beta_0$ and $\beta_1$ that are asymptotically normal.

Inference about $\beta_0$ and $\beta_1$ is made under the assumption of repeated sampling from the same scope of X.

Interval Estimation of $E[Y_h]$

Let $x_h$ be a value in the sample or within the scope of the model, and let $\hat{Y}_h = \hat{\beta}_0 + \hat{\beta}_1 x_h$. An estimate of the variance of $\hat{Y}_h$ is given by

$$s^2(\hat{Y}_h) = MSE\left(\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}\right)$$

where $MSE = \frac{SSE}{n-2}$. Using the above we have the following:

$$\frac{\hat{Y}_h - E[Y_h]}{\sqrt{s^2(\hat{Y}_h)}} \sim t_{(n-2)}$$

A $(1-\alpha)100$ percent confidence interval for $E[Y_h]$ is given by

$$\hat{Y}_h \pm t_{(1-\alpha/2;\ df = n-2)} \sqrt{s^2(\hat{Y}_h)}$$

Comments:

Recall that the fitted line $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ passes through $(\bar{x}, \bar{y})$.

The variance of $\hat{Y}_h$ increases as the distance between $x_h$ and $\bar{x}$ increases.

The confidence interval formula for $E[Y_h]$ applies to a single mean response. We will use an alternate formula when we want simultaneous intervals for several mean responses. A sketch of the single-response computation follows.
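This sketch uses the same hypothetical data as above; $x_h = 5.0$ is an arbitrary illustrative value within the range of x:

```python
# Sketch of the confidence interval for E[Y_h] on hypothetical data.
import numpy as np
from scipy import stats

x = np.array([3.0, 4.5, 5.1, 6.2, 7.0, 8.3])
y = np.array([1.8, 2.1, 2.6, 2.9, 3.2, 3.6])
n, xh = len(x), 5.0                 # x_h must lie within the scope of the model

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)

yh = b0 + b1 * xh                                   # point estimate of E[Y_h]
s2_yh = mse * (1 / n + (xh - x.mean()) ** 2 / sxx)  # grows as x_h moves from xbar

t_mult = stats.t.ppf(0.975, df=n - 2)
half = t_mult * s2_yh ** 0.5
print(f"95% CI for E[Y_h]: ({yh - half:.4f}, {yh + half:.4f})")
```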

Prediction of a New Observation for a Given Level of X

Recall that $E[Y_h]$ represents the mean of Y at a given level of X, and our interest above was in estimating that mean. $Y_{h(\text{new})}$ represents a prediction of a draw from the distribution of Y for a specific $x_h$. Why can we not just use the confidence interval for $E[Y_h]$? We need to account for two sources of variation (see Fig. 3):

1. The location of the distribution of Y at $x_h$ is unknown. (Think in terms of estimating the model parameters: we must incorporate the variation that comes from having to estimate $\beta_0$ and $\beta_1$.)

2. Once the location of the distribution is fixed, we must account for the variation within the distribution of Y. (Think in terms of the model: the source of this variation is the model error variance.)

Let $s^2(\text{pred})$ denote an estimate of the variance of predicting $Y_{h(\text{new})}$. We can calculate $s^2(\text{pred})$ using the following:

$$s^2(\text{pred}) = MSE\left(1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}\right)$$

A $(1-\alpha)100\%$ prediction interval for a future value of Y at $x_h$ is given by

$$\hat{Y}_{h(\text{new})} \pm t_{(1-\alpha/2;\ df = n-2)} \sqrt{s^2(\text{pred})}$$

Note the two sources of variation accounted for in $s^2(\text{pred})$:

$$s^2(\text{pred}) = \underbrace{MSE}_{\text{w/i dist}} + \underbrace{MSE\left(\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}\right)}_{\text{location}} = MSE + s^2(\hat{Y}_h)$$
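The decomposition can be seen numerically: $s^2(\text{pred})$ is just MSE plus $s^2(\hat{Y}_h)$. A sketch, again with the hypothetical data:

```python
# Sketch contrasting s2(Yh_hat) with s2(pred): the prediction variance adds
# MSE for the within-distribution variation. Same hypothetical data as above.
import numpy as np
from scipy import stats

x = np.array([3.0, 4.5, 5.1, 6.2, 7.0, 8.3])
y = np.array([1.8, 2.1, 2.6, 2.9, 3.2, 3.6])
n, xh = len(x), 5.0

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)

s2_yh = mse * (1 / n + (xh - x.mean()) ** 2 / sxx)  # location of the distribution
s2_pred = mse + s2_yh                               # + variation within the distribution

t_mult = stats.t.ppf(0.975, df=n - 2)
yh = b0 + b1 * xh
half = t_mult * s2_pred ** 0.5
print(f"95% prediction interval: ({yh - half:.4f}, {yh + half:.4f})")
```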

Confidence Band for the Regression Line

A confidence band encompasses the entire regression line, whereas the confidence limits for $E[Y_h]$ apply to a single value $x_h$. The formula is similar to that of a confidence interval for $E[Y_h]$, except the multiplier is W rather than a percentile from a t-dist, where

$$W^2 = 2\, F_{(1-\alpha;\ 2,\ n-2)}$$

The boundary limits at $x_h$ for a given level of $\alpha$ are given by

$$\hat{Y}_h \pm W \sqrt{s^2(\hat{Y}_h)}$$

Using W rather than the percentile from a t-distribution results in a wider interval (we must encompass the entire regression line rather than a single point).

Analysis of Variance (ANOVA) Approach to Regression

We are only looking at ANOVA for a different perspective. This will make more sense in multiple regression.

Breakdown of the sums of squares:

$$\underbrace{Y_i - \bar{Y}}_{\text{total deviation}} = \underbrace{(\hat{Y}_i - \bar{Y})}_{\text{deviation of fit vs. mean}} + \underbrace{(Y_i - \hat{Y}_i)}_{\text{deviation of obs. vs. fit}}$$

After some algebra we have

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

Total Sum of Squares (SST) = SS Regression (SSR) + SS Error (SSE)

The breakdown of degrees of freedom is as follows:

                      SST     =   SSR     +   SSE
  General case        n - 1   =   p - 1   +   n - p
  Simple lin. reg.    n - 1   =   1       +   n - 2
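A quick numerical check of the decomposition SST = SSR + SSE, again on hypothetical data:

```python
# Numerical check of SST = SSR + SSE on hypothetical data.
import numpy as np

x = np.array([3.0, 4.5, 5.1, 6.2, 7.0, 8.3])
y = np.array([1.8, 2.1, 2.6, 2.9, 3.2, 3.6])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
ssr = np.sum((yhat - y.mean()) ** 2)  # regression sum of squares
sse = np.sum((y - yhat) ** 2)         # error sum of squares

print(sst, ssr + sse)                 # equal up to floating-point error
```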

ANOVA Table

  Source       Sums of Squares   df      Mean Square (MS)    F
  Regression   SSR               p - 1   MSR = SSR/(p - 1)   MSR/MSE
  Error        SSE               n - p   MSE = SSE/(n - p)
  Total        SST               n - 1

Comments: For the simple linear regression model we can use the F-statistic to test

$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_A: \beta_1 \ne 0$$

The test statistic

$$F^* = \frac{MSR}{MSE}$$

follows an F-distribution with $df_1 = 1$ and $df_2 = n - 2$. See Table 8 on p. 1102 of the text for tables of the F-distribution. We will reject $H_0$ if the following condition holds:

$$F^* > F_{(1-\alpha;\ df_1 = 1,\ df_2 = n-2)}$$

Descriptive Measures of X and Y in the Regression Model

Coefficient of Determination, denoted by $r^2$:

$$r^2 = \frac{SSR}{SST}$$

$r^2$ measures the proportion of variation explained by the model (recall what SSR and SST represent).

The range of $r^2$: $0 \le r^2 \le 1$. The closer $r^2$ is to 1, the stronger the linear relationship between X and Y.

Coefficient of Correlation (a.k.a. correlation coefficient):

$$r = \pm\sqrt{r^2}$$

Use a $+$ if the sign of $\hat{\beta}_1$ is $+$; use a $-$ if the sign of $\hat{\beta}_1$ is $-$. The range of r: $-1 \le r \le 1$.
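A sketch tying the ANOVA table to the descriptive measures: it computes the F statistic (MSR/MSE) and $r^2$/$r$ from the same hypothetical data used earlier:

```python
# Sketch computing the ANOVA F statistic and r^2 / r from hypothetical data.
import numpy as np
from scipy import stats

x = np.array([3.0, 4.5, 5.1, 6.2, 7.0, 8.3])
y = np.array([1.8, 2.1, 2.6, 2.9, 3.2, 3.6])
n, p = len(x), 2                    # p = 2 parameters in simple linear regression

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

ssr = np.sum((yhat - y.mean()) ** 2)
sse = np.sum((y - yhat) ** 2)
sst = ssr + sse

f_stat = (ssr / (p - 1)) / (sse / (n - p))        # MSR / MSE
f_crit = stats.f.ppf(0.95, dfn=p - 1, dfd=n - p)  # F(.95; 1, n - 2)

r2 = ssr / sst
r = np.sign(b1) * np.sqrt(r2)                     # sign taken from beta1_hat

print(f"F = {f_stat:.2f} (crit {f_crit:.2f}), r^2 = {r2:.3f}, r = {r:.3f}")
```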

$r$ is just a measure of association with no clear-cut interpretation, but one can use it to make relative comparisons. Example - Suppose we have $\text{corr}(Y, X_1) = .7$ and $\text{corr}(Y, X_2) = .2$. Then it seems that $X_1$ is more strongly correlated with Y than $X_2$ is.

Limitations of $r^2$ and $r$:

A high $r$ does not imply the model is useful for predictions. Why? Due to variability, the confidence intervals could be too wide to be useful.

A high $r$ does not imply the estimated regression line is a good fit. For example, one may calculate a high $r$ value even though a curve, rather than a line, explains the relationship between X and Y.

$r = 0$ does not imply X and Y are not related. For example, $r = 0$ when X and Y are quadratically related (a short numerical illustration follows the notes below).

Notes on applying regression analysis:

Rejecting $H_0: \beta_1 = 0$ does not imply a cause-and-effect relationship.

Predictions using the estimated regression line are only valid within the scope of the data used in estimation.
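To make the $r = 0$ limitation concrete, here is a minimal illustration with hypothetical data symmetric about zero:

```python
# Illustration: r is (essentially) 0 even though Y is an exact quadratic
# function of X.
import numpy as np

x = np.arange(-5, 6, dtype=float)   # -5, -4, ..., 5 (symmetric about 0)
y = x ** 2                          # perfect quadratic dependence

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.4f}")               # about 0, yet Y is completely determined by X
```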