17. SIMPLE LINEAR REGRESSION II



The Model

In linear regression analysis, we assume that the relationship between X and Y is linear. This does not mean, however, that Y can be perfectly predicted from X. In real applications there will almost always be some random variability that blurs any underlying systematic relationship that might exist. We start by assuming that for each value of X, the corresponding value of Y is random and has a normal distribution.

[Figure: the true regression line E(Y|X) = α + βx, with the error probability distribution of Y drawn at each of several values x1, x2, x3 of x.]

We formalize these ideas by writing down a model which specifies exactly how the systematic and random components come together to produce our data. Note that a model is simply a set of assumptions about how the world works. Of course, all models are wrong, but they can help us in understanding our data, and (if judiciously selected) may serve as useful approximations to the truth.

Eg: Consider the heights in inches (X) of male MBA graduates from the University of Pittsburgh and their monthly salary (Y, dollars) for their first post-MBA job.

[Figure: scatter plot of Monthly Salary (Dollars) vs. Height (Inches), with Salary from 5500 to 6750 on the vertical axis and Height from 65.0 to 77.5 on the horizontal axis.]

Even if we just consider graduates who are all of the same height (say 6 feet), their salaries will fluctuate, and therefore are not perfectly predictable. Perhaps the mean salary for 6-foot-tall graduates is $6300/month. Clearly, the mean salary for 5½-foot-tall graduates is less than it is for 6-foot-tall graduates.

Based on our data, we are going to try to estimate the mean value of Y for a given value of x, denoted by E(Y|X). (Can you suggest a simple way to do this?)
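To make the randomness concrete, here is a minimal Python sketch of this idea. The line E(Y|X) = -900 + 100x matches the salary example developed below, while σ = 200 is simply an assumed error standard deviation for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical true regression line E(Y | X = x) = -900 + 100*x
    alpha, beta = -900.0, 100.0
    sigma = 200.0  # assumed error standard deviation (illustrative only)

    x = 72.0                        # a 6-foot-tall (72-inch) graduate
    mean_salary = alpha + beta * x  # = 6300, the mean salary for this height

    # Five such graduates: same mean salary, but random normal scatter
    salaries = rng.normal(mean_salary, sigma, size=5)
    print(mean_salary, salaries)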

In general, E(Y|X) will depend on X. Viewed as a function of X, E(Y|X) is called the true regression of Y on X. In linear regression analysis, we assume that the true regression function E(Y|X) is a linear function of X:

E(Y|x) = α + βx.

The parameter β is the slope of the true regression line, and can be interpreted as the population mean change in Y for a unit change in x. In an advertising application, for example, β might represent the mean increase in sales of Pepsi for each additional 30-second ad. In the Salary vs. Height example, β is the mean increase in monthly salary for each additional inch of height.

The parameter α is the intercept of the true regression line, and can be interpreted as the mean value of Y when X is zero. For this interpretation to apply in practice, however, it is necessary to have data for X near zero. In the advertising example, α would represent the mean baseline sales level without any advertisements. Unfortunately, the notation α is traditionally used both for the significance level of a hypothesis test and for the intercept of the true regression line.

In practice, α and β will be unknown parameters, which must be estimated from our data. Our observed data will not lie exactly on the true regression line. (Why?) Instead, we assume that the true regression line is observed with error, i.e., that yᵢ is given by

yᵢ = α + βxᵢ + εᵢ,    (1)

for i = 1, …, n. Here, εᵢ is a normally distributed random variable (called the error) with mean 0 and variance σ², which does not depend on x. We also assume that the values of ε associated with any two values of y are independent. Thus, the error at one point does not affect the error at any other point.
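A short simulation of model (1) may help fix ideas. This sketch assumes the same hypothetical values of α, β, and σ as above and generates a full sample:

    import numpy as np

    rng = np.random.default_rng(1)

    # Model (1): y_i = alpha + beta*x_i + eps_i, independent N(0, sigma^2) errors
    alpha, beta, sigma = -900.0, 100.0, 200.0  # hypothetical values
    n = 30

    x = rng.uniform(65.0, 77.5, size=n)   # heights (inches)
    eps = rng.normal(0.0, sigma, size=n)  # errors: mean 0, variance sigma^2
    y = alpha + beta * x + eps            # salaries scatter about the true line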

If (1) holds, then the values of y will be randomly scattered about the true regression line, and the mean value of y for a given x will be this true regression line, E(y|x) = α + βx. The error term ε accounts for all of the variables, measurable and un-measurable, that are not part of the model.

For the heights of graduates and their salaries, suppose that E(Y|X = x) = -900 + 100x. Then the salary of a 6-foot-tall (72-inch) graduate will differ from the mean value of $6300/month by an amount equal to the error term ε, which lumps together a variety of effects such as the graduate's skills, desire to work hard, etc.

The variance of the error, σ², measures how close the points are to the true line, in terms of expected squared vertical deviation. Under the model (1), which has normal errors, there is a 95% probability that a y value will fall within ±2σ of the true line (measured vertically). So σ measures the "thickness" of the band of points as they scatter about the true line. Furthermore, if the model (1) holds, then the salaries for graduates who are 5½ feet tall will fluctuate about their own mean ($5700/month) with the same level of fluctuation as before.

The Least Squares Estimators

Given our data on x and y, we want to estimate the unknown intercept α and slope β of the true regression line. One approach is to find the line which best fits the data, in some sense. In principle, we could try all possible lines a + bx and pick the line which comes "closest" to the data. This procedure is called "fitting a line to data". A website which allows you to do this interactively is at http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html.

To objectively fit a line to data, we need a way to measure the "distance" between a line a + bx and our data. We use the sum of squared vertical prediction errors,

f(a, b) = Σᵢ₌₁ⁿ [yᵢ - (a + bxᵢ)]².

The smaller f(a, b) is, the closer the line a + bx is to our data. Thus the best-fitting line is obtained by minimizing f(a, b). The solution is denoted by α̂, β̂, and may be found by calculus methods.
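Setting the partial derivatives of f(a, b) equal to zero yields the familiar closed-form solution. A minimal Python sketch, where x and y are assumed to be NumPy arrays holding the data:

    import numpy as np

    def least_squares(x, y):
        # Minimizes f(a, b) = sum of squared vertical prediction errors.
        # Setting df/da = df/db = 0 gives these closed-form estimates.
        x_bar, y_bar = x.mean(), y.mean()
        beta_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
        alpha_hat = y_bar - beta_hat * x_bar
        return alpha_hat, beta_hat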

The resulting line ŷ = α̂ + β̂x is called the least squares line, since it makes the sum of squared vertical prediction errors as small as possible. The least squares line is also referred to as the fitted line, and the regression line, but do not confuse the regression line ŷ = α̂ + β̂x with the true regression line E(y|x) = α + βx.

It can be shown that the sample statistics α̂ and β̂ are unbiased estimators of the population parameters α and β. That is, E(α̂) = α and E(β̂) = β.

Give interpretations of α̂ and β̂ in the Salary vs. Height example, and in the Beer example (Calories per 12 oz serving vs. % Alcohol).

Instead of using complicated formulas, you can obtain the least squares estimates α̂ and β̂ from Minitab.

Eg: For the y = Salary, x = Height data, Minitab gives α̂ = -902 and β̂ = 100.4. Thus, the fitted model is ŷ = -902 + 100.4x. Give an interpretation of this fitted model.

Regression Analysis: Salary versus Height

Coefficients

Term       Coef   SE Coef   T-Value   P-Value
Constant   -902       837     -1.08     0.290
Height    100.4      12.0      8.35     0.000

Regression Equation

Salary = -902 + 100.4 Height
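If Minitab is not available, the same estimates and coefficient table can be produced in Python with the statsmodels package. A sketch, assuming x and y are arrays holding the height and salary data:

    import statsmodels.api as sm

    X = sm.add_constant(x)    # add the intercept column
    fit = sm.OLS(y, X).fit()  # ordinary least squares
    print(fit.params)         # alpha-hat (const) and beta-hat (slope)
    print(fit.summary())      # coefficient table similar to Minitab's output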

[Figure: Fitted Line Plot for Salary vs. Height, showing the data and the least squares line Salary = -902.2 + 100.4 Height. Legend: S = 192.702, R-Sq = 71.4%, R-Sq(adj) = 70.3%.]
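A similar fitted line plot can be drawn in Python with matplotlib; this sketch again assumes x and y hold the data and reuses the fitted coefficients:

    import numpy as np
    import matplotlib.pyplot as plt

    alpha_hat, beta_hat = -902.2, 100.4  # fitted coefficients from above

    grid = np.linspace(x.min(), x.max(), 100)
    plt.scatter(x, y, label="data")
    plt.plot(grid, alpha_hat + beta_hat * grid, label="least squares line")
    plt.xlabel("Height (inches)")
    plt.ylabel("Monthly Salary (dollars)")
    plt.title("Fitted Line Plot for Salary vs. Height")
    plt.legend()
    plt.show()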