Simple Linear Regression Models


Simple Linear Regression Models 14-1

Overview 1. Definition of a Good Model 2. Estimation of Model Parameters 3. Allocation of Variation 4. Standard Deviation of Errors 5. Confidence Intervals for Regression Parameters 6. Confidence Intervals for Predictions 7. Visual Tests for Verifying Regression Assumptions 14-2

Simple Linear Regression Models Regression Model: Predicts a response for a given set of predictor variables. Response Variable: The variable being estimated. Predictor Variables: Variables used to predict the response; also called predictors or factors. Linear Regression Models: The response is a linear function of the predictors. Simple Linear Regression Models: Only one predictor. 14-3

Definition of a Good Model [Three scatter plots of y versus x with candidate regression lines: the first two fits are good, the third is bad.] 14-4

Good Model (Cont) Regression models attempt to minimize the distance, measured vertically, between the observation point and the model line (or curve). The length of the line segment is called the residual, modeling error, or simply error. The negative and positive errors should cancel out, giving zero overall error. However, many lines will satisfy this criterion. 14-5

Good Model (Cont) Choose the line that minimizes the sum of squares of the errors: ŷ = b0 + b1·x, where ŷ is the predicted response when the predictor variable is x. The parameters b0 and b1 are fixed regression parameters to be determined from the data. Given n observation pairs {(x1, y1), …, (xn, yn)}, the estimated response for the ith observation is: ŷi = b0 + b1·xi. The error is: ei = yi - ŷi. 14-6

Good Model (Cont) The best linear model minimizes the sum of squared errors (SSE): SSE = Σ ei², subject to the constraint that the mean error is zero: (1/n) Σ ei = 0. This is equivalent to minimizing the variance of errors (see Exercise). 14-7

Estimation of Model Parameters Regression parameters that give minimum error variance are: b1 = (Σxy - n·x̄·ȳ)/(Σx² - n·x̄²) and b0 = ȳ - b1·x̄, where x̄ = (1/n) Σx and ȳ = (1/n) Σy. 14-8

Example 14.1 The number of disk I/O's and processor times of seven programs were measured as: (14, 2), (16, 5), (27, 7), (42, 9), (39, 10), (50, 13), (83, 20) For this data: n=7, Σ xy=3375, Σ x=271, Σ x 2 =13,855, Σ y=66, Σ y 2 =828, x̄ = 38.71, ȳ = 9.43. Therefore, b1 = (3375 - 7×38.71×9.43)/(13,855 - 7×38.71²) = 0.2438 and b0 = 9.43 - 0.2438×38.71 = -0.0083. The desired linear model is: CPU time = -0.0083 + 0.2438 × (number of disk I/O's). 14-9
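The estimation formulas above can be sketched in Python as a quick check; this is a minimal sketch using the data of Example 14.1, and the variable names are mine.

```python
# Least-squares fit for Example 14.1 (disk I/O's vs. CPU time).
x = [14, 16, 27, 42, 39, 50, 83]   # number of disk I/O's
y = [2, 5, 7, 9, 10, 13, 20]       # CPU time in ms
n = len(x)
x_bar = sum(x) / n                              # 38.71
y_bar = sum(y) / n                              # 9.43
sum_xy = sum(a * b for a, b in zip(x, y))       # 3375
sum_x2 = sum(a * a for a in x)                  # 13,855

# Slide formulas: b1 = (Sxy - n*x_bar*y_bar)/(Sx2 - n*x_bar^2), b0 = y_bar - b1*x_bar
b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar
print(f"CPU time = {b0:.4f} + {b1:.4f} * (number of disk I/O's)")
```

Running this reproduces the coefficients quoted on the slide (b1 about 0.2438, b0 about -0.0083).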

Example 14.1 (Cont) [Scatter plot of the measured CPU times versus the number of disk I/O's with the fitted line.] 14-10

Example 14.1 (Cont): Error Computation 14-11

Derivation of Regression Parameters The error in the ith observation is: ei = yi - (b0 + b1·xi). For a sample of n observations, the mean error is: ē = (1/n) Σ ei = ȳ - b0 - b1·x̄. Setting the mean error to zero, we obtain: b0 = ȳ - b1·x̄. Substituting b0 in the error expression, we get: ei = (yi - ȳ) - b1·(xi - x̄). 14-12

Derivation of Regression Parameters (Cont) The sum of squared errors SSE is: SSE = Σ [(yi - ȳ) - b1·(xi - x̄)]² = Σ (yi - ȳ)² - 2·b1·Σ (yi - ȳ)(xi - x̄) + b1²·Σ (xi - x̄)². 14-13

Derivation (Cont) Differentiating this equation with respect to b1 and equating the result to zero: -2·Σ (yi - ȳ)(xi - x̄) + 2·b1·Σ (xi - x̄)² = 0. That is, b1 = Σ (yi - ȳ)(xi - x̄) / Σ (xi - x̄)² = (Σxy - n·x̄·ȳ)/(Σx² - n·x̄²). 14-14

Allocation of Variation If no regression is used, the best prediction of y is its mean ȳ, and the error variance without regression is simply the variance of the response: sy² = Σ (yi - ȳ)²/(n-1) = (Σy² - n·ȳ²)/(n-1). 14-15

Allocation of Variation (Cont) The sum of squared errors without regression would be: Σ (yi - ȳ)². This is called the total sum of squares (SST). It is a measure of y's variability and is called the variation of y. SST can be computed as follows: SST = Σ (yi - ȳ)² = Σy² - n·ȳ² = SSY - SS0, where SSY is the sum of squares of y (or Σ y 2 ) and SS0 is the sum of squares of ȳ and is equal to n·ȳ². 14-16

Allocation of Variation (Cont) The difference between SST and SSE is the sum of squares explained by the regression. It is called SSR: SSR = SST - SSE, or SST = SSR + SSE. The fraction of the variation that is explained determines the goodness of the regression and is called the coefficient of determination, R²: R² = SSR/SST = (SST - SSE)/SST. 14-17

Allocation of Variation (Cont) The higher the value of R², the better the regression: R² = 1 means a perfect fit; R² = 0 means no fit. Coefficient of determination = {correlation coefficient(x, y)}². Shortcut formula for SSE: SSE = Σy² - b0·Σy - b1·Σxy. 14-18

Example 14.2 For the disk I/O-CPU time data of Example 14.1: SST = SSY - SS0 = 828 - 66²/7 = 205.71, SSE = Σy² - b0·Σy - b1·Σxy = 5.87, and R² = (SST - SSE)/SST = 199.84/205.71 = 0.97. The regression explains 97% of CPU time's variation. 14-19
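The allocation of variation can be sketched the same way; this recomputes the fit from Example 14.1's data (variable names are mine) and applies the SST/SSE/SSR definitions from the slides.

```python
# Allocation of variation for the disk I/O-CPU time data (Example 14.2).
x = [14, 16, 27, 42, 39, 50, 83]
y = [2, 5, 7, 9, 10, 13, 20]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sum_xy = sum(a * b for a, b in zip(x, y))
b1 = (sum_xy - n * x_bar * y_bar) / (sum(a * a for a in x) - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar

ssy = sum(v * v for v in y)               # SSY = sum of y^2
ss0 = n * y_bar ** 2                      # SS0 = n * y_bar^2
sst = ssy - ss0                           # total variation of y
sse = ssy - b0 * sum(y) - b1 * sum_xy     # shortcut formula for SSE
ssr = sst - sse                           # variation explained by the regression
r2 = ssr / sst                            # coefficient of determination
print(f"SST = {sst:.2f}, SSE = {sse:.2f}, R^2 = {r2:.4f}")
```

With unrounded coefficients this gives R² of roughly 0.97, matching the slide.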

Standard Deviation of Errors Since errors are obtained after calculating two regression parameters from the data, the errors have n-2 degrees of freedom. SSE/(n-2) is called the mean squared error (MSE). The standard deviation of errors is the square root of MSE. SSY has n degrees of freedom since it is obtained from n independent observations without estimating any parameters. SS0 has just one degree of freedom since it can be computed simply from ȳ. SST has n-1 degrees of freedom, since one parameter (ȳ) must be calculated from the data before SST can be computed. 14-20

Standard Deviation of Errors (Cont) SSR, which is the difference between SST and SSE, has the remaining one degree of freedom. Overall, SST = SSR + SSE, with degrees of freedom n-1 = 1 + (n-2). Notice that the degrees of freedom add just the way the sums of squares do. 14-21

Example 14.3 For the disk I/O-CPU data of Example 14.1, the degrees of freedom of the sums are: SSY has 7, SS0 has 1, SST has 6, SSR has 1, and SSE has 5. The mean squared error is: MSE = SSE/(n-2) = 5.87/5 = 1.17. The standard deviation of errors is: se = √1.17 = 1.08. 14-22
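The MSE and standard deviation of errors can be sketched in code as well, again recomputing the Example 14.1 fit so the block is self-contained (names are mine).

```python
# Mean squared error and standard deviation of errors (Example 14.3).
x = [14, 16, 27, 42, 39, 50, 83]
y = [2, 5, 7, 9, 10, 13, 20]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sum_xy = sum(a * b for a, b in zip(x, y))
b1 = (sum_xy - n * x_bar * y_bar) / (sum(a * a for a in x) - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar
sse = sum(v * v for v in y) - b0 * sum(y) - b1 * sum_xy

mse = sse / (n - 2)     # SSE has n-2 degrees of freedom
se = mse ** 0.5         # standard deviation of errors
print(f"MSE = {mse:.4f}, se = {se:.4f}")
```

The computed se of about 1.0834 is the value quoted in Example 14.4 below.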

Confidence Intervals for Regression Params Regression coefficients b0 and b1 are estimates from a single sample of size n and are therefore random; using another sample, the estimates may be different. β0 and β1 are the true parameters of the population. That is, yi = β0 + β1·xi + ei. The computed coefficients b0 and b1 are estimates of β0 and β1, respectively. 14-23

Confidence Intervals (Cont) The 100(1-α)% confidence intervals for b0 and b1 can be computed using t[1-α/2; n-2], the 1-α/2 quantile of a t variate with n-2 degrees of freedom. The standard deviations of the coefficients are: sb0 = se·√(1/n + x̄²/(Σx² - n·x̄²)) and sb1 = se/√(Σx² - n·x̄²). The confidence intervals are: b0 ∓ t·sb0 and b1 ∓ t·sb1. If a confidence interval includes zero, then the regression parameter cannot be considered different from zero at the 100(1-α)% confidence level. 14-24

Example 14.4 For the disk I/O and CPU data of Example 14.1, we have n=7, x̄=38.71, Σx²=13,855, and s e =1.0834. The standard deviations of b0 and b1 are: sb0 = 1.0834·√(1/7 + 38.71²/3363.4) = 0.8311 and sb1 = 1.0834/√3363.4 = 0.0187. 14-25

Example 14.4 (Cont) From Appendix Table A.4, the 0.95-quantile of a t-variate with 5 degrees of freedom is 2.015. The 90% confidence interval for b0 is: -0.0083 ∓ 2.015×0.8311 = (-1.683, 1.666). Since the confidence interval includes zero, the hypothesis that this parameter is zero cannot be rejected at the 0.10 significance level; b0 is essentially zero. The 90% confidence interval for b1 is: 0.2438 ∓ 2.015×0.0187 = (0.206, 0.281). Since the confidence interval does not include zero, the slope b1 is significantly different from zero at this confidence level. 14-26
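The confidence-interval computation can be sketched as follows, assuming the 2.015 quantile quoted from Table A.4 (the fit is recomputed from Example 14.1's data; names are mine).

```python
# 90% confidence intervals for b0 and b1 (Example 14.4).
x = [14, 16, 27, 42, 39, 50, 83]
y = [2, 5, 7, 9, 10, 13, 20]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sum_xy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x) - n * x_bar ** 2    # Sx2 - n*x_bar^2
b1 = (sum_xy - n * x_bar * y_bar) / sxx
b0 = y_bar - b1 * x_bar
se = ((sum(v * v for v in y) - b0 * sum(y) - b1 * sum_xy) / (n - 2)) ** 0.5

sb0 = se * (1 / n + x_bar ** 2 / sxx) ** 0.5    # std dev of b0
sb1 = se / sxx ** 0.5                           # std dev of b1
t = 2.015                                       # t[0.95; 5] from Table A.4
ci_b0 = (b0 - t * sb0, b0 + t * sb0)            # includes zero -> b0 not significant
ci_b1 = (b1 - t * sb1, b1 + t * sb1)            # excludes zero -> b1 significant
print(f"b0 in {ci_b0}, b1 in {ci_b1}")
```

The interval for b0 straddles zero while the interval for b1 does not, which is exactly the significance conclusion drawn on the slide.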

Case Study 14.1: Remote Procedure Call 14-27

Case Study 14.1 (Cont) UNIX: 14-28

Case Study 14.1 (Cont) ARGUS: 14-29

Case Study 14.1 (Cont) Best linear models are: The regressions explain 81% and 75% of the variation, respectively. Does ARGUS take a larger time per byte, as well as a larger set-up time per call, than UNIX? 14-30

Case Study 14.1 (Cont) The intervals for the intercepts overlap, while those of the slopes do not. Set-up times are not significantly different in the two systems, while the per-byte times (slopes) are different. 14-31

Confidence Intervals for Predictions The predicted value ŷp = b0 + b1·xp is only the mean value of the predicted response. The standard deviation of the mean of a future sample of m observations at xp is: sŷp = se·√(1/m + 1/n + (xp - x̄)²/(Σx² - n·x̄²)). With m = 1, this gives the standard deviation of a single future observation: sŷp = se·√(1 + 1/n + (xp - x̄)²/(Σx² - n·x̄²)). 14-32

CI for Predictions (Cont) With m = ∞, the standard deviation of the mean of a large number of future observations at xp is: sŷp = se·√(1/n + (xp - x̄)²/(Σx² - n·x̄²)). A 100(1-α)% confidence interval for the mean can be constructed using a t quantile read at n-2 degrees of freedom. 14-33

CI for Predictions (Cont) Goodness of the prediction decreases as we move away from the center. 14-34

Example 14.5 Using the disk I/O and CPU time data of Example 14.1, let us estimate the CPU time for a program with 100 disk I/O's. For a program with 100 disk I/O's, the mean CPU time is: ŷp = -0.0083 + 0.2438×100 = 24.37 ms. 14-35

Example 14.5 (Cont) The standard deviation of the predicted mean of a large number of observations is: sŷp = 1.0834·√(1/7 + (100 - 38.71)²/3363.4) = 1.2159. From Table A.4, the 0.95-quantile of the t-variate with 5 degrees of freedom is 2.015. The 90% CI for the predicted mean is: 24.37 ∓ 2.015×1.2159 = (21.9, 26.8). 14-36

Example 14.5 (Cont) For the CPU time of a single future program with 100 disk I/O's: sŷp = 1.0834·√(1 + 1/7 + (100 - 38.71)²/3363.4) = 1.6286. The 90% CI for a single prediction is: 24.37 ∓ 2.015×1.6286 = (21.1, 27.6). 14-37
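Both prediction intervals can be sketched in code, again assuming t[0.95; 5] = 2.015 and recomputing the fit from Example 14.1's data (names are mine).

```python
# Prediction intervals at xp = 100 disk I/O's (Example 14.5).
x = [14, 16, 27, 42, 39, 50, 83]
y = [2, 5, 7, 9, 10, 13, 20]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sum_xy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x) - n * x_bar ** 2
b1 = (sum_xy - n * x_bar * y_bar) / sxx
b0 = y_bar - b1 * x_bar
se = ((sum(v * v for v in y) - b0 * sum(y) - b1 * sum_xy) / (n - 2)) ** 0.5

xp = 100
y_hat = b0 + b1 * xp                                      # predicted mean CPU time
# m -> infinity: mean of a large number of future observations
s_mean = se * (1 / n + (xp - x_bar) ** 2 / sxx) ** 0.5
# m = 1: a single future observation
s_one = se * (1 + 1 / n + (xp - x_bar) ** 2 / sxx) ** 0.5
t = 2.015
ci_mean = (y_hat - t * s_mean, y_hat + t * s_mean)
ci_one = (y_hat - t * s_one, y_hat + t * s_one)
print(f"mean: {ci_mean}, single: {ci_one}")
```

The single-observation interval is wider than the interval for the mean, reflecting the extra 1 under the square root.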

Visual Tests for Regression Assumptions Regression assumptions: 1. The true relationship between the response variable y and the predictor variable x is linear. 2. The predictor variable x is non-stochastic and is measured without any error. 3. The model errors are statistically independent. 4. The errors are normally distributed with zero mean and a constant standard deviation. 14-38

1. Linear Relationship: Visual Test A scatter plot of y versus x shows whether the relationship is linear or nonlinear. 14-39

2. Independent Errors: Visual Test 1. Prepare a scatter plot of the errors εi versus the predicted responses; any visible trend indicates dependence. All tests for independence simply try to find dependence. 14-40

Independent Errors (Cont) 2. Plot the residuals as a function of the experiment number; a trend here indicates dependence on time or on the order of the experiments. 14-41

3. Normally Distributed Errors: Test Prepare a normal quantile-quantile plot of the errors. If the plot is approximately linear, the normality assumption is satisfied. 14-42

4. Constant Standard Deviation of Errors This property is also known as homoscedasticity. If the spread of the residuals shows a trend, try curvilinear regression or a transformation of the response. 14-43

Example 14.6 For the disk I/O and CPU time data of Example 14.1: [Three plots: CPU time in ms versus number of disk I/O's; residuals versus predicted response; residual quantiles versus normal quantiles.] 1. The relationship is linear. 2. There is no trend in the residuals; they seem independent. 3. The normal quantile-quantile plot is linear; there are larger deviations at lower values, but all values are small. 14-44
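The residual checks behind these visual tests can be sketched numerically; this recomputes the Example 14.1 fit and produces the data one would plot, using Python's statistics.NormalDist for the normal quantiles (names are mine).

```python
# Residuals and normal quantiles for the visual tests (Example 14.6).
from statistics import NormalDist

x = [14, 16, 27, 42, 39, 50, 83]
y = [2, 5, 7, 9, 10, 13, 20]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sum_xy = sum(a * b for a, b in zip(x, y))
b1 = (sum_xy - n * x_bar * y_bar) / (sum(a * a for a in x) - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]                  # predicted responses
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

# The mean error is zero by construction (up to floating-point noise).
print(sum(residuals))

# Pair sorted residuals with normal quantiles; a roughly linear relation
# between the two columns supports the normality assumption.
qq = [(NormalDist().inv_cdf((i - 0.5) / n), r)
      for i, r in enumerate(sorted(residuals), start=1)]
```

Plotting residuals against y_hat (independence check) and the qq pairs (normality check) gives the two diagnostic plots described on the slide.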

Example 14.7: RPC Performance [Plots: residuals versus predicted response; residual quantiles versus normal quantiles.] 1. Larger errors occur at larger responses. 2. The normality of errors is questionable. 14-45

Summary Terminology: simple linear regression model, sums of squares, mean squares, degrees of freedom, percent of variation explained, coefficient of determination, correlation coefficient. Regression parameters as well as the predicted responses have confidence intervals. It is important to verify the assumptions of linearity, error independence, and error normality using visual tests. 14-46

Exercise 14.7 The time to encrypt a k byte record using an encryption technique is shown in the following table. Fit a linear regression model to this data. Use visual tests to verify the regression assumptions. 14-53

Exercise 2.1 From published literature, select an article or a report that presents results of a performance evaluation study. Make a list of good and bad points of the study. What would you do different, if you were asked to repeat the study? 14-54

Homework Read Chapter 14. Submit answers to Exercise 14.7 and Exercise 2.1. 14-55