Math 143: Inference on Regression

Review of Linear Regression

In Chapter 2, we used linear regression to describe linear relationships. The setting for this is a bivariate data set (i.e., a list of cases/subjects for which two variables have been measured for each case) with both variables quantitative. We use graphs (scatterplots and residual plots, with the horizontal axis for the explanatory variable x and the vertical axis for the response y) to help us discern whether there is, in fact, a linear relationship y = b₀ + b₁x between the variables. The correlation r provides a quantitative (objective) measure of whether there is such a relationship, with r close to ±1 indicating a strong linear relationship and r ≈ 0 indicating little to no linear relationship. While we can draw scatterplots and calculate the sample statistics b₀, b₁, and r by hand, one rarely does so, relying instead on a software package to take care of such tedium.

Now we want to think of the regression line we computed from our sample (ŷ = b₀ + b₁x) as an estimate of the true regression line in the population (y = β₀ + β₁x), just as x̄ in univariate quantitative data has served as an estimate of the true (population) mean µ. So we will learn how to compute confidence intervals and significance tests for β₀ and β₁. We will also learn to compute confidence intervals for the predictions we make.

Example: A tax consultant studied the current relation between the selling price and assessed valuation of one-family dwellings in a large tax district. He obtained a random sample of recent sales of one-family dwellings and, after plotting the data, found a regression line of

    (selling price in $1000s) = 4.98 + 2.40 (assessed valuation in $1000s).

At the same time, a second tax consultant obtained a second random sample of recent sales of one-family dwellings and, after plotting the data, found a regression line of

    (selling price in $1000s) = 1.89 + 2.63 (assessed valuation in $1000s).

Both consultants attempted to model the true linear relationship they believed to exist between price and valuation. The regression lines they found were sample estimates of the true (but unknown) population relationship between selling price (y) and assessed valuation (x).
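For concreteness, here is a minimal Python sketch of the computation the software performs for us: the least-squares slope and intercept and the sample correlation. The (x, y) data are invented for illustration; they are not either consultant's sample.

    import numpy as np

    # Hypothetical data: assessed valuation (x) and selling price (y), in $1000s
    x = np.array([25.0, 31.0, 34.0, 42.0, 48.0, 55.0])
    y = np.array([68.0, 79.0, 88.0, 104.0, 121.0, 138.0])

    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # least-squares slope
    b0 = ybar - b1 * xbar                                           # intercept
    r = np.corrcoef(x, y)[0, 1]                                     # sample correlation

    print(f"yhat = {b0:.2f} + {b1:.2f} x,   r = {r:.3f}")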

The Simple Linear Model and Related Assumptions

We write the true linear relationship between the sampled variables x and y as

    y = β₀ + β₁x + ε,

where ε is a random error component. Assumptions for the simple linear model:

1. At each fixed explanatory (x) value, the (population) distribution of the responses (y-values) is normal, with mean µ = β₀ + β₁x and standard deviation σ. Note: the mean generally changes for different x-values, while the standard deviation is the same at every x.

2. If we denote by (xᵢ, yᵢ) the individual observed data pairs and write ŷᵢ for the predicted value of the response at x = xᵢ, then the residuals eᵢ = yᵢ − ŷᵢ = yᵢ − (b₀ + b₁xᵢ) are independent; that is, the size and sign of a particular residual are not influenced by the size and sign of the preceding residuals.

Software output for linear regression includes the values of b₀ and b₁, which are used to estimate β₀ and β₁, respectively. Since b₀ and b₁ are statistics (i.e., computed from the sampled coordinate pairs (xᵢ, yᵢ)), they exhibit the kind of random fluctuation from one sample to the next that is to be expected of all random variables. In the real-estate example above, the first tax consultant would estimate β₀ and β₁ to be 4.98 and 2.40, respectively, while the second would estimate them to be 1.89 and 2.63.

Other quantities provided by software output for linear regression include the value of r² (or R²), which is the square of the correlation and tells us how much of the observed variation in the y-values is explained by the regression line, and s (called the root mean square error), which serves as an estimate for σ. From the value of r² we may determine the sample correlation r by taking the square root (choosing the negative/positive sign according to whether the line falls or rises from left to right); r is itself a sample statistic, estimating the value of the true correlation ρ.
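To make this sample-to-sample fluctuation of b₀ and b₁ concrete, here is a small Python sketch that repeatedly generates data satisfying the two assumptions above (with invented "true" parameter values) and refits the line each time:

    import numpy as np

    rng = np.random.default_rng(1)
    beta0, beta1, sigma = 5.0, 2.5, 4.0   # hypothetical true parameter values
    x = np.linspace(20, 60, 25)           # fixed x-values

    for _ in range(4):
        eps = rng.normal(0.0, sigma, size=x.size)  # normal errors, same sigma at every x
        y = beta0 + beta1 * x + eps                # the simple linear model
        b1, b0 = np.polyfit(x, y, deg=1)           # least-squares estimates
        print(f"b0 = {b0:7.2f},   b1 = {b1:.3f}")

Each pass prints different estimates, all fluctuating around the true values 5.0 and 2.5.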

Testing Model Utility

The reason for doing regression is generally to make predictions. So one should ask in which situations the resulting estimated regression line ŷ = b₀ + b₁x is useful. Put another way: in what situations would it be valid to say that knowledge of the x-value is not helpful in predicting the corresponding y? Choose one:

[Two scatterplots, each with x- and y-axes running from 0 to 20.]

Answer: When the value of β₁ (or, equivalently, of ρ) is zero. It is entirely possible, however, that β₁ = 0 while b₁ (our estimate of β₁) is nonzero, due simply to random fluctuation between samples. Thus, we require an inference procedure. The hypotheses for the model utility test (which is a test of significance) are

    H₀: β₁ = 0,    Hₐ: β₁ ≠ 0.

To carry out this test of significance, one only needs to know the sampling distribution of b₁. Under the assumptions for simple linear regression stated above,

    b₁ ~ N(β₁, σ_b₁),   where σ_b₁ := σ / √(Σ(xᵢ − x̄)²).

As before, our inability to know the value of σ requires us to use SE_b₁ in its place,

    SE_b₁ := s / √(Σ(xᵢ − x̄)²),

and this leads to a test statistic

    t = b₁ / SE_b₁,

having (n − 2) degrees of freedom (since two parameters, β₀ and β₁, have been estimated to get this far). One may look up a P-value in Table D, much as with other t-tests, comparing it to a specified significance level α. Alternatively, one might construct a 100(1 − α)% confidence interval for β₁,

    b₁ ± (critical t-value)(SE_b₁),

and draw the same conclusion as from the test of significance by observing whether zero is in this confidence interval. Here the critical t-value is determined using (n − 2) degrees of freedom.

If you're interested in the P-value associated with model utility, or if you want the 95% confidence interval for either β₀ or β₁, these are included in Stata output for linear regression. And even if you want a CI at a different level, you don't have to compute SE_b₁ yourself, because once again it is included next to b₁ in the column marked "Std. Err.".

Example: Is the winning time (in seconds) for the men's 1500-meter run in the Olympics decreasing significantly over time? Below is the result from regressing winning time on year for those years between 1900 and 1972 when there were summer Olympics.

[Scatterplot: winning_time (about 210 to 250 seconds) vs. olympic_year (1900 to 1980).]

          Source |        SS      df         MS        Number of obs =     16
           Model |  1654.64697     1  1654.64697       F(1, 14)      = 235.86
        Residual |  98.2170282    14  7.01550201       Prob > F      = 0.0000
           Total |    1752.864    15   116.8576        R-squared     = 0.9441
                                                       Adj R-squared = 0.9401
                                                       Root MSE      = 2.6481

    winning_time |     Coef.  Std. Err.       t   P>|t|    [95% Conf. Interval]
    olympic_year |   -.43772   .0285019  -15.36   0.000    -.4988504  -.3765896
           _cons |  1077.663   55.19781   19.52   0.000     959.2759   1196.051

What is the result of the linear regression on these data? We get the line ŷ = 1077.663 − 0.438x.

Is this a useful model? (Hypothesis test.) Yes: the P-value for the model utility test is essentially 0.

Construct a 95% confidence interval for β₀. There is no need; it is provided in the output as (959.276, 1196.051).

Construct a 99% CI for β₁. It is −0.438 ± (2.977)(0.0285), or (−0.523, −0.353).
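As an arithmetic check on the test statistic and that 99% interval, a short Python sketch using scipy's t distribution:

    from scipy import stats

    b1, se_b1, n = -0.43772, 0.0285019, 16
    t = b1 / se_b1                             # model utility test statistic
    p = 2 * stats.t.sf(abs(t), df=n - 2)       # two-sided P-value
    tcrit = stats.t.ppf(0.995, df=n - 2)       # critical value for a 99% CI, 14 df

    print(f"t = {t:.2f},  P = {p:.2g}")        # t = -15.36, P is essentially 0
    print(f"({b1 - tcrit * se_b1:.3f}, {b1 + tcrit * se_b1:.3f})")  # (-0.523, -0.353)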

Using the Model to Estimate and Predict

While the regression line is generally used for estimation and prediction, always remember that extrapolation (prediction at x-values beyond the scope of your data) is more tenuous than interpolation (prediction at x-values within the range of your data). What winning time would our linear model predict for the 1500 m in the year 300 B.C.? (See the sketch after this example.)

Example: A tax consultant studied the current relation between the selling price and assessed valuation of single-family dwellings in a large tax district. He obtained a random sample of recent sales of such homes and, after plotting the data, found a regression line of

    (selling price in $1000s) = 2.58 + 2.64 (assessed valuation in $1000s),   or   ŷ = 2.58 + 2.64x.

(a) Estimate the average selling price of single-family residential homes in this district which have an assessed valuation of $30,000. $81,780

(b) Predict the selling price of the house at 1993 Calvin Ave., a single-family residential home in this district, which has an assessed valuation of $30,000. $81,780

(c) Which number, if either, do you think may have the greater error (greater variability, wider confidence interval)? The prediction of the selling price (y) of a single home at x = 30 will have more variability than the average selling price at x = 30.
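A quick Python sketch checking (a) and (b), and answering the 300 B.C. question above; coding 300 B.C. as year −300 is our assumption:

    def price(valuation):        # tax example: yhat = 2.58 + 2.64 x, in $1000s
        return 2.58 + 2.64 * valuation

    def winning_time(year):      # Olympic example: yhat = 1077.663 - 0.43772 x
        return 1077.663 - 0.43772 * year

    print(price(30))             # 81.78, i.e., $81,780, as in (a) and (b)
    print(winning_time(-300))    # about 1209 seconds (20 minutes): an absurd extrapolation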

Level C Confidence Interval for µ_x₀, the mean response at x = x₀:

    ŷ ± (t crit. val.) · s · √( 1/n + (x₀ − x̄)² / Σ(xᵢ − x̄)² )

Level C Prediction Interval for an individual response y at x = x₀:

    ŷ ± (t crit. val.) · s · √( 1 + 1/n + (x₀ − x̄)² / Σ(xᵢ − x̄)² )

Confidence intervals are at their narrowest when the chosen x₀ is the same as x̄.

[Scatterplot: Selling price ($1000s, 50 to 150) vs. Assessed valuation ($1000s, 20 to 60) for the sampled homes.]

Example: A scatterplot of the daily attendance and the number of hotdog sales for a sample of 9 games of a minor-league baseball team suggests that the relationship between attendance and sales may be linear.

[Scatterplot: Hotdog sales (4000 to 7000) vs. Attendance (5000 to 10000).]

The regression output (from Minitab, a different piece of statistical software) is as follows:

    The regression equation is
    hotdog sales = 386 + 0.671 Attendance

    Predictor       Coef   SE Coef      T      P
    Constant       386.1     728.7   0.53  0.613
    Attendance   0.67095   0.09667   6.94  0.000

    S = 467.4   R-Sq = 87.3%   R-Sq(adj) = 85.5%

    Analysis of Variance

    Source          DF        SS        MS      F      P
    Regression       1  10523992  10523992  48.18  0.000
    Residual Error   7   1529106    218444
    Total            8  12053098

    Predicted Values for New Observations

    New Obs   Fit  SE Fit      95.0% CI      95.0% PI
          1  5083     160  (4705, 5461)  (3914, 6251)

    Values of Predictors for New Observations

    New Obs  Attendance
          1        7000
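As a check, the 95% CI and PI in the "Predicted Values" block can be reproduced in Python from Fit, SE Fit, S, and n, using the standard identity that the standard error for predicting a new observation is √(S² + (SE Fit)²):

    import math
    from scipy import stats

    fit, se_fit, s, n = 5083.0, 160.0, 467.4, 9
    tcrit = stats.t.ppf(0.975, df=n - 2)     # 95% intervals, n - 2 = 7 df

    ci = (fit - tcrit * se_fit, fit + tcrit * se_fit)
    se_pred = math.sqrt(s**2 + se_fit**2)    # prediction SE includes the error variance s^2
    pi = (fit - tcrit * se_pred, fit + tcrit * se_pred)

    print([round(v) for v in ci])            # [4705, 5461], matching the 95% CI
    print([round(v) for v in pi])            # [3915, 6251], matching the 95% PI up to rounding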