CHAPTERS 2 AND 10: Least Squares Regression

In Chapters 2 and 10 we look at the relationship between two quantitative variables measured on the same individuals.

General Procedure:

1. Make a scatterplot and describe the form, direction and strength of the relationship. Note: fitting a line only makes sense if the overall pattern of the scatterplot is roughly linear.
2. Look for outliers and influential observations on the scatterplot. Note: inference is not safe if there are influential points, since the results depend strongly on those few points. It is often helpful to rework the data without the influential observations and compare the results.
3. Find the correlation r to get a numerical measure of the direction and strength of the linear relationship.
4. Find r^2, the fraction of variation in the values of y that is explained by the least squares regression of y on x.
5. If the data are reasonably linear, find the least squares regression line for the data. Note: the line can be used to predict y for a given x.
6. Make a residual plot and a normal probability plot to check the regression assumptions.
7. If and only if your data were collected using random sampling techniques, look at the hypothesis tests and confidence intervals for the correlation, slope and intercept.
8. If and only if your data were collected using random sampling techniques, look at the hypothesis tests and confidence intervals for the mean response, and at prediction intervals.

Steps 3-5 are illustrated in the Python sketch below; the rest of the handout carries them out in SPSS.
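
A minimal Python sketch of steps 3-5, offered as a cross-check for readers working outside SPSS (assuming numpy and scipy are available; it uses the tree data from Example 1 below):

```python
import numpy as np
from scipy import stats

# Tree data from Example 1: estimated volume (explanatory x) and
# actual volume (response y), in cubic feet.
x = np.array([12, 14, 8, 12, 17, 16, 14, 14, 15, 17])
y = np.array([13, 14, 9, 15, 19, 20, 16, 15, 17, 18])

fit = stats.linregress(x, y)                  # least squares regression of y on x
print("r   =", round(fit.rvalue, 3))          # step 3: correlation
print("r^2 =", round(fit.rvalue ** 2, 3))     # step 4: fraction of variation explained
print(f"y-hat = {fit.intercept:.3f} + {fit.slope:.3f} x")   # step 5: fitted line
```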

Association Between Variables: Two variables measured on the same individuals are associated if some values of one variable tend to occur more often with some values of the second variable than with other values of that variable. Just because two variables are associated doesn't mean that a change in one variable causes a change in the other (causation is covered in Section 2.6). Also, the relationship between two variables might not tell the whole story: other variables may affect the relationship. These other variables are called lurking variables.

Positive association: above-average values of one variable tend to accompany above-average values of the other, and below-average values also tend to occur together.

Negative association: above-average values of one variable tend to accompany below-average values of the other, and vice versa.

No association: it is hard to find a pattern showing a relationship between the variables.

Response variable: measures an outcome of a study (the dependent variable, Y).

Explanatory variable: explains or causes changes in the response variable (the independent variable, X).

Example 1: A forester has become adept at estimating the volume (in cubic feet) of trees on a particular site prior to a timber sale. Since his operation has now expanded, he would like to train another person to assist in estimating the cubic foot volume of trees. He decided to create a model that will allow him to obtain the actual tree volume based on his assistant's estimate. The forester selects a random sample of trees to be felled. For each tree, the assistant is to guess the cubic foot volume of the tree. The forester also obtains the actual cubic foot volume after the tree has been chopped down. Below are his data:

Tree   Estimated Volume   Actual Volume
 1            12                13
 2            14                14
 3             8                 9
 4            12                15
 5            17                19
 6            16                20
 7            14                16
 8            14                15
 9            15                17
10            17                18

STEP 1: Make a scatterplot; describe the form, direction and strength of the relationship.

Before making the scatterplot you need to decide which variable is the explanatory variable and which is the response variable. For Example 1, identify the explanatory and the response variables.

Explanatory Variable:

Response Variable:

A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The explanatory variable is plotted on the x axis; the response variable is plotted on the y axis. Look at the overall pattern, which can be described by form, direction and strength.

Form: is the scatterplot linear, quadratic, etc.?
Direction: is the association positive or negative?
Strength: how strong is the relationship?

Describe the scatterplot in Example 1.

Form:

Direction:

Strength:
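
For readers without SPSS, a hypothetical matplotlib version of the Step 1 scatterplot (the axis labels and title mirror the SPSS plot in the next section):

```python
import numpy as np
import matplotlib.pyplot as plt

est = np.array([12, 14, 8, 12, 17, 16, 14, 14, 15, 17])   # explanatory variable
act = np.array([13, 14, 9, 15, 19, 20, 16, 15, 17, 18])   # response variable

plt.scatter(est, act)
b1, b0 = np.polyfit(est, act, 1)              # slope, intercept of the fitted line
xs = np.linspace(est.min(), est.max(), 50)
plt.plot(xs, b0 + b1 * xs)                    # overlay the least squares line
plt.xlabel("Estimate (cubic feet)")
plt.ylabel("Actual (cubic feet)")
plt.title("Estimated Volume versus Actual Volume of Trees")
plt.show()
```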

Scatterplot Using SPSS: >Graphs >Scatter/Dot. Select Simple and click Define. Pull estimate into the X Axis box and actual into the Y Axis box, then click OK.

Note: To get the fitted line you need to double click on your graph to bring up the chart editor. You will then need to click on the button that looks like a scatterplot with a fitted line through it. Select Linear and then Close.

[Scatterplot: Estimated Volume versus Actual Volume of Trees, with fitted line; Actual on the y axis (10 to 20), Estimate on the x axis (8 to 18); R Sq Linear = 0.876]

STEP 2: Look for outliers and influential observations on the scatterplot.

Look for striking deviations from the overall pattern.

Outlier: an observation that lies outside of the overall pattern of the other observations. Points that are outliers in the y direction of a scatterplot have large regression residuals, but other outliers need not have large residuals.

Influential observation: an observation that, if removed, would markedly change the results of the regression calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least squares regression line.

Are there any outliers or influential observations in our data?

Note: To add a categorical variable to a scatterplot, use a different plot color or symbol for each category.

STEP 3: Find the correlation r.

The correlation measures the direction and strength of the linear relationship between two quantitative variables. Correlation is usually written as r:

r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)

Properties of correlation:

- It makes no difference which variable you call x and which you call y, since correlation does not make use of the distinction between the explanatory variable and the response variable.
- Both variables need to be quantitative to calculate correlation.
- The correlation r does not change if we change the units of measurement of x, y, or both.
- A positive r corresponds to a positive relationship between the variables; a negative r corresponds to a negative relationship (−1 ≤ r ≤ 1). Values near 0 indicate a weak relationship, and values close to −1 or 1 indicate a strong relationship.
- Correlation measures the strength of only a linear relationship.
- Like the mean and standard deviation, the correlation is not resistant: r is strongly affected by a few outlying observations. Use r with caution when outliers appear in the scatterplot.
- Correlation is not a complete description of two-variable data. You should give the means and standard deviations of both x and y along with the correlation.

For Example 1, find the correlation between estimated volume and actual volume. We will use the Pearson Correlation output from SPSS here.
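
As a sketch (not part of the original handout), r can be computed directly from the formula above by standardizing both variables; ddof=1 gives the sample standard deviation, matching the n − 1 divisor:

```python
import numpy as np

x = np.array([12, 14, 8, 12, 17, 16, 14, 14, 15, 17])
y = np.array([13, 14, 9, 15, 19, 20, 16, 15, 17, 18])

n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)   # standardized x values
zy = (y - y.mean()) / y.std(ddof=1)   # standardized y values
r = (zx * zy).sum() / (n - 1)
print(round(r, 3))                    # 0.936, matching the SPSS output below
```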

Note: The SPSS manual tells you where to find r using the least squares regression output, but that r is actually the ABSOLUTE VALUE of r, so you need to figure out the sign yourself by looking at the association (positive or negative) of your data.

Correlations
                                  Estimate    Actual
Estimate    Pearson Correlation     1         .936**
            Sig. (2-tailed)         .         .000
            N                       10        10
Actual      Pearson Correlation     .936**    1
            Sig. (2-tailed)         .000      .
            N                       10        10
**. Correlation is significant at the 0.01 level (2-tailed).

STEP 4: Find r^2.

r^2 is the percent of variation in y explained by the regression line (the closer to 100%, the better). We can get this from the regression output, or by squaring the correlation r.

For Example 1, find the percent of variation in actual volume of trees explained by the regression line.

STEP 5: Find the least squares regression line for the data.

The least squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible. We have data on an explanatory variable x and a response variable y for n individuals. The means and standard deviations of the data are \bar{x} and s_x for x, and \bar{y} and s_y for y; the correlation between x and y is r.

The regression model for the population is:

y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

The sample prediction equation of the least squares regression line is:

\hat{y} = b_0 + b_1 x

The slope is:

b_1 = r \frac{s_y}{s_x}

The slope measures the change in the predicted response when the explanatory variable is increased by one unit.

The intercept is:

b_0 = \bar{y} - b_1 \bar{x}
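
A quick numerical check of these formulas on the Example 1 data (a sketch assuming numpy is available):

```python
import numpy as np

x = np.array([12, 14, 8, 12, 17, 16, 14, 14, 15, 17])
y = np.array([13, 14, 9, 15, 19, 20, 16, 15, 17, 18])

r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope: b1 = r * sy / sx
b0 = y.mean() - b1 * x.mean()            # intercept: b0 = ybar - b1 * xbar
print(round(b1, 3), round(b0, 3))        # 1.100 and 0.308, matching SPSS
```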

Using SPSS: >Analyze >Regression >Linear. Put estimate into the independent box and actual into the dependent box. Click OK.

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .936a   .876       .861                1.19525
a. Predictors: (Constant), Estimate

ANOVA b
Model        Sum of Squares   df   Mean Square   F        Sig.
Regression   80.971           1    80.971        56.678   .000a
Residual     11.429           8    1.429
Total        92.400           9
a. Predictors: (Constant), Estimate
b. Dependent Variable: Actual

Coefficients a
Model        Unstandardized B   Std. Error   Standardized Beta   t       Sig.
(Constant)   .308               2.066                            .149    .885
Estimate     1.100              .146         .936                7.528   .000
a. Dependent Variable: Actual

For Example 1, find the least squares regression line. Based on the r^2 value you found previously, do you think this line will be useful for predicting actual tree volumes?

For Example 1, use the regression line to predict the actual volume of a tree with an estimated volume of 13 cubic feet.

We can use a regression line to make predictions as long as we follow these rules:

- Only use the least squares regression line to find y for a specific value of x. (Don't use it to find x for a specific value of y!)
- Extrapolation involves using the line to find y values corresponding to x values that are outside the range of our data's x values. Typically we want to avoid this, since the line may not be valid for wide ranges of x values.

A worked version of this prediction appears in the sketch below.
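
The prediction itself is plain arithmetic with the fitted line; 13 cubic feet lies inside the range of the data (8 to 17), so this is not extrapolation:

```python
b0, b1 = 0.308, 1.100   # intercept and slope from the SPSS output above
x_new = 13              # estimated volume in cubic feet
print(round(b0 + b1 * x_new, 3))   # predicted actual volume: about 14.6 cubic feet
```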

Example 2: (From Moore and McCabe, fourth edition.) During the period after birth, a male white rat gains 40 grams (g) per week until about 10 weeks of age. (This is unusually regular growth, but 40 g per week is a realistic rate.)

a. If the rat weighed 100 g at birth, give an equation for his weight after x weeks. What is the slope of this line?

b. Would you be willing to use this line to predict the rat's weight at age 2 years? Do the prediction and think about the reasonableness of the result. (There are 454 g in a pound. To help you assess the result, a large cat weighs about 10 pounds.)

Prediction Intervals

A prediction interval predicts a future observation under conditions similar to those used in the study. Since there is variability involved in using a model created from sample data, a prediction interval is better than a single prediction. Prediction intervals are related to confidence intervals; use SPSS to calculate them.

Residual: the vertical distance between the observed y value and the corresponding predicted y value:

e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i)

For Example 1, find the residuals for tree number 1 and tree number 7 (see the sketch below).

Assumptions for Regression Inference and the Regression Model:

1. Repeated responses y are independent of each other. This basically means the data come from a simple random sample. (To check this assumption, examine the way in which the units were selected.)
2. For any fixed value of x, the response y varies according to a normal distribution. (To check the normality assumption you can do a normal probability plot of the residuals in SPSS.)
3. The relationship is linear. (To check the linearity assumption, you can make a scatterplot or a residual plot of the data.)
4. The standard deviation of y (call it σ) is the same for all values of x. The value of σ is unknown. (To check for constant variability you can look at the residual plot of the data.)
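
A short sketch of the residual calculation for trees 1 and 7 (Python indexes from 0, hence positions 0 and 6):

```python
import numpy as np

x = np.array([12, 14, 8, 12, 17, 16, 14, 14, 15, 17])
y = np.array([13, 14, 9, 15, 19, 20, 16, 15, 17, 18])
b0, b1 = 0.308, 1.100

resid = y - (b0 + b1 * x)          # e_i = y_i - (b0 + b1 * x_i)
print(round(resid[0], 3))          # tree 1: 13 - 13.508 = -0.508
print(round(resid[6], 3))          # tree 7: 16 - 15.708 =  0.292
```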

STEP 6: Make a residual plot and normal probability plot to check the regression assumptions.

It is always important to check that the assumptions of the regression model have been met, to determine whether your results are valid. This is especially important before you proceed with inference.

Normal probability plots: if your points fall in a relatively straight line, then you can assume that your response is approximately normal and the second assumption has been met. To check the normality assumption we make a normal probability plot as follows.

Using SPSS: >Analyze >Regression >Linear. Put estimate into the independent box and actual into the dependent box. Then select Plots, click on the box for normal probability plot, and click Continue followed by OK.

[Normal P-P Plot of Regression Standardized Residual; Dependent Variable: Actual; Expected Cum Prob versus Observed Cum Prob]

For Example 1, has the normality assumption been met?

Residual plots: a residual plot is a scatterplot of the regression residuals against the explanatory variable. It is used to assess the fit of a regression line and to check for constant variability. The residual plot magnifies the deviations from the line to make the pattern easier to read.

- If the points are random with no pattern, with approximately the same number of points above and below the center line, you can feel confident that assumptions three and four have been met.
- If you see a funnel shape, the assumption of constant variance has not been met.
- If you see some other pattern, like a parabola, the linearity assumption has not been met.

Using SPSS: >Analyze >Regression >Linear. Put estimate into the independent box and actual into the dependent box. Click on the Save button, check Unstandardized residuals, and click Continue. The residuals will appear in the data editor. Make a scatterplot of the residuals on the y axis against the estimated volume on the x axis. To get the line at y = 0, in the chart editor right click on the graph and select Add Y Axis Reference Line, then enter zero for the y axis position.

[Residual plot: Unstandardized Residual versus Estimate, with a reference line at y = 0]

For Example 1, have the assumptions of linearity and constant variability been met?

Lastly, it is important to check for outliers and influential observations. Look for large residuals or points that are far from the other points. Often we will want to do the analysis both with and without the outliers, particularly if they are influential observations as well.

Are there any outliers or influential observations in Example 1?
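
For readers without SPSS, a sketch of both diagnostic plots using matplotlib and scipy (the figure layout is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([12, 14, 8, 12, 17, 16, 14, 14, 15, 17])
y = np.array([13, 14, 9, 15, 19, 20, 16, 15, 17, 18])

b1, b0, *_ = stats.linregress(x, y)     # slope and intercept
resid = y - (b0 + b1 * x)               # unstandardized residuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, resid)
ax1.axhline(0)                          # reference line at y = 0
ax1.set_xlabel("Estimate")
ax1.set_ylabel("Unstandardized Residual")
stats.probplot(resid, plot=ax2)         # normal probability plot of residuals
plt.show()
```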

Scatterplot Without the Influential Observation

[Scatterplot: Actual versus Estimate with the influential observation removed, with fitted line; R Sq Linear = 0.741]

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .861a   .741       .704                1.27602
a. Predictors: (Constant), Estimate

Coefficients a
Model        Unstandardized B   Std. Error   Standardized Beta   t       Sig.
(Constant)   .689               3.522                            .196    .850
Estimate     1.075              .240         .861                4.475   .003
a. Dependent Variable: Actual

STEP 7: Look at the hypothesis tests and confidence intervals.

Up until this point we have looked at regression-related concepts that can be used in an exploratory data analysis setting as well as in a more formal setting. We will now look at inference for regression. Before we do, however, it must be understood that the tests and confidence intervals that follow can only be used on data that were collected using a random sampling technique such as simple random sampling. If we did not collect our data using a random sample, or if we have conducted a census, these techniques are meaningless.

Test for a Zero Population Correlation:

State the null and alternative hypotheses: H_0: \rho = 0 versus H_a: \rho \neq 0, H_a: \rho > 0, or H_a: \rho < 0.

Find the test statistic

t = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}}

where n is the sample size and r is the sample correlation.

Calculate the P-value in terms of a random variable T having the t(n − 2) distribution:

- The P-value for a test of H_0 against H_a: \rho > 0 is P(T \geq t).
- The P-value for a test of H_0 against H_a: \rho < 0 is P(T \leq t).
- The P-value for a test of H_0 against H_a: \rho \neq 0 is 2P(T \geq |t|).

Compare the P-value to the α level: if the P-value ≤ α, reject H_0; if the P-value > α, fail to reject H_0.

State your conclusions in terms of the problem.

For Example 1, test H_0: \rho = 0 versus H_a: \rho \neq 0.
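
A sketch of this test for Example 1 (r = .936, n = 10), using scipy for the t(n − 2) tail probability:

```python
import numpy as np
from scipy import stats

r, n = 0.936, 10
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_two_sided = 2 * stats.t.sf(abs(t), df=n - 2)
print(round(t, 2))          # about 7.53, matching the slope t in the SPSS output
print(p_two_sided)          # well under 0.001, matching Sig. = .000
```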

Confidence Intervals for the Regression Slope and Intercept:

A level C confidence interval for the intercept \beta_0 is b_0 \pm t^* SE_{b_0}.

A level C confidence interval for the slope \beta_1 is b_1 \pm t^* SE_{b_1}.

SPSS will give you these confidence intervals at the 95% level, but you may have to use the estimates of the coefficients and their standard errors to find other confidence intervals. (Use the t table with n − 2 degrees of freedom to get t*.)

Using SPSS: >Analyze >Regression >Linear. Put estimate into the independent box and actual into the dependent box. Click on Statistics, select Confidence intervals, and click Continue followed by OK.

Coefficients a
Model        B       Std. Error   Beta   t       Sig.   95% CI for B: Lower   Upper
(Constant)   .308    2.066               .149    .885   -4.457                5.072
Estimate     1.100   .146         .936   7.528   .000   .763                  1.437
a. Dependent Variable: Actual

For Example 1, find the 95% and 99% confidence intervals for the slope and y intercept (a sketch of the calculation follows).
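
A sketch of the interval calculation for the slope, using b_1 = 1.100 and SE = .146 from the output above and t* from scipy (the t table gives the same values):

```python
from scipy import stats

b1, se_b1, df = 1.100, 0.146, 8      # slope, standard error, n - 2
for level in (0.95, 0.99):
    t_star = stats.t.ppf(1 - (1 - level) / 2, df)
    lo, hi = b1 - t_star * se_b1, b1 + t_star * se_b1
    print(level, round(lo, 3), round(hi, 3))
# 95% -> (0.763, 1.437), matching SPSS; 99% -> roughly (0.610, 1.590)
```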

Hypothesis test for the regression slope:

State the null and alternative hypotheses: H_0: \beta_1 = 0 versus H_a: \beta_1 \neq 0, H_a: \beta_1 > 0, or H_a: \beta_1 < 0.

Find the test statistic

t = \frac{b_1}{SE_{b_1}}, with df = n − 2.

(SPSS will give you the test statistic.)

Calculate the P-value. (SPSS will give you the two-sided P-value; if you have a one-sided test, you will have to divide the P-value by 2.)

Compare the P-value to the α level: if the P-value ≤ α, reject H_0; if the P-value > α, fail to reject H_0.

State your conclusions in terms of the problem.

The test statistic for the correlation is numerically identical to the test statistic used to test the slope, so you can read the test statistic and P-value off of the SPSS output for the slope when doing a test for the correlation.

For Example 1, perform a significance test to see whether the slope of the regression line is positive.
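
A sketch of this one-sided slope test for Example 1 (H_a: \beta_1 > 0), halving the two-sided P-value as described above:

```python
from scipy import stats

b1, se_b1, df = 1.100, 0.146, 8
t = b1 / se_b1                      # test statistic
p_one_sided = stats.t.sf(t, df)     # P(T >= t) for Ha: beta_1 > 0
print(round(t, 2), p_one_sided)     # about 7.53; P-value well under 0.001
```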

Example 4: This example uses data that are part of a data set from Dr. T.N.K. Raju, Department of Neonatology, University of Illinois at Chicago.

IMR = Infant Mortality Rate
PQLI = Physical Quality of Life Index (an indicator of average wealth)

Case   PQLI   IMR
 1      17    110
 2      24     78
 3      28     88
 4      29    135
 5      29     55
 6      33    138
 7      33     79
 8      35    133
 9      36    120
10      36     92
11      43    125

How does the physical quality of life index affect the infant mortality rate? Answer the questions below based on the output that follows.

a. Describe the form, direction and strength of the relationship.
b. What is the correlation?
c. What percent of the variation in infant mortality rate is explained by the regression line?
d. Give an estimate for the standard deviation of the model. (Find s.)
e. Do a hypothesis test of H_0: \beta_1 = 0 versus H_a: \beta_1 \neq 0.
f. What is the equation of the least squares regression line?
g. Use the regression line to predict the IMR for a PQLI of 25.

h. Is the prediction in part g a good one? Why?
i. Find the residual for case 1.
j. Find a 99% confidence interval for the slope.
k. What assumptions need to be met for the above to be of use?

[Scatterplot: How Physical Quality of Life Affects Infant Mortality Rate; IMR (50 to 130) versus PQLI (15 to 45)]

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .300a   .090       -.011               28.053
a. Predictors: (Constant), PQLI

Coefficients a
Model        B        Std. Error   Beta   t       Sig.   95% CI for B: Lower   Upper
(Constant)   68.143   39.719              1.716   .120   -21.708               157.994
PQLI         1.249    1.322        .300   .945    .369   -1.741                4.239
a. Dependent Variable: IMR

[Residual plot: Unstandardized Residual versus PQLI]

[Normal P-P Plot of Regression Standardized Residual; Dependent Variable: IMR; Expected Cum Prob versus Observed Cum Prob]

Does the relationship make sense? For example, does it make sense that the infant mortality rate goes up as the physical quality of life index gets better? What could be a potential lurking variable here?

Now let's look at what happens if we add a categorical variable to the picture.

Case   PQLI   IMR   Location
 1      17    110   rural
 2      24     78   rural
 3      28     88   rural
 4      29    135   urban
 5      29     55   rural
 6      33    138   urban
 7      33     79   rural
 8      35    133   urban
 9      36    120   urban
10      36     92   rural
11      43    125   urban

[Scatterplot: How Physical Quality of Life Affects Infant Mortality Rate; IMR versus PQLI, with separate symbols for urban and rural locations]