Chapter 13 Introduction to Linear Regression and Correlation Analysis

Size: px
Start display at page:

Download "Chapter 13 Introduction to Linear Regression and Correlation Analysis"

Transcription

1 Chapter 3 Student Lecture Notes 3- Chapter 3 Introduction to Linear Regression and Correlation Analsis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaing and describing relationship among variables Fall 2006 Fundamentals of Business Statistics 2

2 Chapter 3 Student Lecture Notes 3-2 Methods for Studing Relationships Graphical Scatterplots Line plots 3-D plots Models Linear regression Correlations Frequenc tables Fall 2006 Fundamentals of Business Statistics 3 Two Quantitative Variables The response variable, also called the dependent variable, is the variable we want to predict, and is usuall denoted b. The eplanator variable, also called the independent variable, is the variable that attempts to eplain the response, and is denoted b. Fall 2006 Fundamentals of Business Statistics 4 2

3 Chapter 3 Student Lecture Notes 3-3 YDI 7. Response ( ) Eplanator () Height of son Weight Fall 2006 Fundamentals of Business Statistics 5 Scatter Plots and Correlation A scatter plot (or scatter diagram) is used to show the relationship between two variables Correlation analsis is used to measure strength of the association (linear relationship) between two variables Onl concerned with strength of the relationship No causal effect is implied Fall 2006 Fundamentals of Business Statistics 6 3

4 Chapter 3 Student Lecture Notes 3-4 Eample The following graph shows the scatterplot of Eam score () and Eam 2 score () for 354 students in a class. Is there a relationship? Fall 2006 Fundamentals of Business Statistics 7 Scatter Plot Eamples Linear relationships Curvilinear relationships Fall 2006 Fundamentals of Business Statistics 8 4

5 Chapter 3 Student Lecture Notes 3-5 Scatter Plot Eamples No relationship (continued) Fall 2006 Fundamentals of Business Statistics 9 Correlation Coefficient The population correlation coefficient ρ (rho) measures the strength of the association between the variables (continued) The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations Fall 2006 Fundamentals of Business Statistics 0 5

6 Chapter 3 Student Lecture Notes 3-6 Features of ρ and r Unit free Range between - and The closer to -, the stronger the negative linear relationship The closer to, the stronger the positive linear relationship The closer to 0, the weaker the linear relationship Fall 2006 Fundamentals of Business Statistics Eamples of Approimate r Values Tag with appropriate value: -, -.6, 0, +.3, Fall 2006 Fundamentals of Business Statistics 2 6

7 Chapter 3 Student Lecture Notes 3-7 Earlier Eample Eam Eam2 Correlations Pearson Correlation Sig. (2-tailed) N Pearson Correlation Sig. (2-tailed) N Eam Eam2.400** ** **. Correlation is significant at the 0.0 level (2 il d) Fall 2006 Fundamentals of Business Statistics 3 YDI 7.3 What kind of relationship would ou epect in the following situations: age (in ears) of a car, and its price. number of calories consumed per da and weight. height and IQ of a person. Fall 2006 Fundamentals of Business Statistics 4 7

8 Chapter 3 Student Lecture Notes 3-8 YDI 7.4 Identif the two variables that var and decide which should be the independent variable and which should be the dependent variable. Sketch a graph that ou think best represents the relationship between the two variables.. The size of a persons vocabular over his or her lifetime. 2. The distance from the ceiling to the tip of the minute hand of a clock hung on the wall. Fall 2006 Fundamentals of Business Statistics 5 Introduction to Regression Analsis Regression analsis is used to: Predict the value of a dependent variable based on the value of at least one independent variable Eplain the impact of changes in an independent variable on the dependent variable Dependent variable: the variable we wish to eplain Independent variable: the variable used to eplain the dependent variable Fall 2006 Fundamentals of Business Statistics 6 8

9 Chapter 3 Student Lecture Notes 3-9 Simple Linear Regression Model Onl one independent variable, Relationship between and is described b a linear function Changes in are assumed to be caused b changes in Fall 2006 Fundamentals of Business Statistics 7 Tpes of Regression Models Positive Linear Relationship Relationship NOT Linear Negative Linear Relationship No Relationship Fall 2006 Fundamentals of Business Statistics 8 9

10 Chapter 3 Student Lecture Notes 3-0 Population Linear Regression The population regression model: Dependent Variable Population intercept Population Slope Coefficient = Independent Variable β0 + β + ε Random Error term, or residual Linear component Random Error component Fall 2006 Fundamentals of Business Statistics 9 Linear Regression Assumptions Error values (ε) are statisticall independent Error values are normall distributed for an given value of The probabilit distribution of the errors is normal The probabilit distribution of the errors has constant variance The underling relationship between the variable and the variable is linear Fall 2006 Fundamentals of Business Statistics 20 0

11 Chapter 3 Student Lecture Notes 3- Population Linear Regression = β0 + β + ε (continued) Observed Value of for i Predicted Value of for i ε i Random Error for this value Slope = β Intercept = β 0 i Fall 2006 Fundamentals of Business Statistics 2 Estimated Regression Model The sample regression line provides an estimate of the population regression line Estimated (or predicted) value Estimate of the regression intercept Estimate of the regression slope ŷi = b0 + b Independent variable The individual random error terms e i have a mean of zero Fall 2006 Fundamentals of Business Statistics 22

12 Chapter 3 Student Lecture Notes 3-2 Earlier Eample Fall 2006 Fundamentals of Business Statistics 23 Residual A residual is the difference between the observed response and the predicted response ŷ. Thus, for each pair of observations ( i, i ), the i th residual is e i = i ŷ i = i (b 0 + b ) Fall 2006 Fundamentals of Business Statistics 24 2

13 Chapter 3 Student Lecture Notes 3-3 Least Squares Criterion b 0 and b are obtained b finding the values of b 0 and b that minimize the sum of the squared residuals e 2 = = ( ŷ) 2 ( (b 0 + b )) 2 Fall 2006 Fundamentals of Business Statistics 25 Interpretation of the Slope and the Intercept b 0 is the estimated average value of when the value of is zero b is the estimated change in the average value of as a result of a one-unit change in Fall 2006 Fundamentals of Business Statistics 26 3

14 Chapter 3 Student Lecture Notes 3-4 The Least Squares Equation The formulas for b and b 0 are: b ( )( ) = 2 ( ) algebraic equivalent: = ( n ) b 2 2 n and b0 = b Fall 2006 Fundamentals of Business Statistics 27 Finding the Least Squares Equation The coefficients b 0 and b will usuall be found using computer software, such as Ecel, Minitab, or SPSS. Other regression measures will also be computed as part of computer-based regression analsis Fall 2006 Fundamentals of Business Statistics 28 4

15 Chapter 3 Student Lecture Notes 3-5 Simple Linear Regression Eample A real estate agent wishes to eamine the relationship between the selling price of a home and its size (measured in square feet) A random sample of 0 houses is selected Dependent variable () = house price in $000s Independent variable () = square feet Fall 2006 Fundamentals of Business Statistics 29 Sample Data for House Price Model House Price in $000s () Square Feet () Fall 2006 Fundamentals of Business Statistics 30 5

16 Chapter 3 Student Lecture Notes 3-6 SPSS Output The regression equation is: house price = (square feet) Model Model Summar Adjusted Std. Error of R R Square R Square the Estimate.762 a a. Predictors: (Constant), Square Feet Model (Constant) Square Feet Unstandardized Coefficients a. Dependent Variable: House Price Coefficients a Standardized Coefficients B Std. Error Beta t Sig Fall 2006 Fundamentals of Business Statistics 3 Graphical Presentation House price model: scatter plot and regression line Intercept = House Price ($000s) Square Feet Slope = 0.0 house price = (square feet) Fall 2006 Fundamentals of Business Statistics 32 6

17 Chapter 3 Student Lecture Notes 3-7 Interpretation of the Intercept, b 0 house price = (square feet) b 0 is the estimated average value of Y when the value of X is zero (if = 0 is in the range of observed values) Here, no houses had 0 square feet, so b 0 = just indicates that, for houses within the range of sizes observed, $98, is the portion of the house price not eplained b square feet Fall 2006 Fundamentals of Business Statistics 33 Interpretation of the Slope Coefficient, b house price = (square feet) b measures the estimated change in the average value of Y as a result of a one-unit change in X Here, b =.0977 tells us that the average value of a house increases b.0977($000) = $09.77, on average, for each additional one square foot of size Fall 2006 Fundamentals of Business Statistics 34 7

18 Chapter 3 Student Lecture Notes 3-8 Least Squares Regression Properties The sum of the residuals from the least squares regression line is 0 ( ( ˆ) = 0 ) The sum of the squared residuals is a minimum 2 (minimized ( ˆ) ) The simple regression line alwas passes through the mean of the variable and the mean of the variable The least squares coefficients are unbiased estimates of β 0 and β Fall 2006 Fundamentals of Business Statistics 35 YDI 7.6 The growth of children from earl childhood through adolescence generall follows a linear pattern. Data on the heights of female Americans during childhood, from four to nine ears old, were compiled and the least squares regression line was obtained as ŷ = where ŷ is the predicted height in inches, and is age in ears. Interpret the value of the estimated slope b = Would interpretation of the value of the estimated -intercept, b 0 = 32, make sense here? What would ou predict the height to be for a female American at 8 ears old? What would ou predict the height to be for a female American at 25 ears old? How does the qualit of this answer compare to the previous question? Fall 2006 Fundamentals of Business Statistics 36 8

19 Chapter 3 Student Lecture Notes 3-9 Coefficient of Determination, R 2 The coefficient of determination is the portion of the total variation in the dependent variable that is eplained b variation in the independent variable The coefficient of determination is also called R-squared and is denoted as R 2 0 R 2 Fall 2006 Fundamentals of Business Statistics 37 Coefficient of Determination, R 2 (continued) Note: In the single independent variable case, the coefficient of determination is where: 2 2 R = r R 2 = Coefficient of determination r = Simple correlation coefficient Fall 2006 Fundamentals of Business Statistics 38 9

20 Chapter 3 Student Lecture Notes 3-20 Eamples of Approimate R 2 Values Fall 2006 Fundamentals of Business Statistics 39 Eamples of Approimate R 2 Values R 2 = 0 No linear relationship between and : R 2 = 0 The value of Y does not depend on. (None of the variation in is eplained b variation in ) Fall 2006 Fundamentals of Business Statistics 40 20

21 Chapter 3 Student Lecture Notes 3-2 SPSS Output Model Model Summar Adjusted Std. Error of R R Square R Square the Estimate.762 a a. Predictors: (Constant), Square Feet Model Regression Residual Total ANOVA b Sum of Squares df Mean Square F Sig a a. Predictors: (Constant), Square Feet b. Dependent Variable: House Price Model (Constant) Square Feet Unstandardized Coefficients a. Dependent Variable: House Price Coefficients a Standardized Coefficients B Std. Error Beta t Sig Fall 2006 Fundamentals of Business Statistics 4 Standard Error of Estimate The standard deviation of the variation of observations around the regression line is called the standard error of estimate s ε The standard error of the regression slope coefficient (b ) is given b s b Fall 2006 Fundamentals of Business Statistics 42 2

22 Chapter 3 Student Lecture Notes 3-22 SPSS Output s ε = Model Model Summar Adjusted Std. Error of R R Square R Square the Estimate.762 a a. Predictors: (Constant), Square Feet s = b Model (Constant) Square Feet Unstandardized Coefficients a. Dependent Variable: House Price Coefficients a Standardized Coefficients B Std. Error Beta t Sig Fall 2006 Fundamentals of Business Statistics 43 Comparing Standard Errors Variation of observed values from the regression line Variation in the slope of regression lines from different possible samples small s ε small s b large s ε large s b Fall 2006 Fundamentals of Business Statistics 44 22

23 Chapter 3 Student Lecture Notes 3-23 Inference about the Slope: t Test t test for a population slope Is there a linear relationship between and? Null and alternative hpotheses H 0 : β = 0 H : β 0 Test statistic b β t = s (no linear relationship) (linear relationship does eist) b Fall 2006 Fundamentals of Business Statistics 45 d.f. = n 2 where: b = Sample regression slope coefficient β = Hpothesized slope s b = Estimator of the standard error of the slope Inference about the Slope: t Test (continued) House Price in $000s () Square Feet () Estimated Regression Equation: house price = (sq.ft.) The slope of this model is Does square footage of the house affect its sales price? Fall 2006 Fundamentals of Business Statistics 46 23

24 Chapter 3 Student Lecture Notes 3-24 Inferences about the Slope: ttest Eample H 0 : β = 0 H A : β 0 d.f. = 0-2 = 8 α/2=.025 Reject H 0 Test Statistic: t = From Ecel output: Intercept Square Feet α/2=.025 Reject H t 0 (-α/2) Do not reject H -t 0 α/ Coefficients Standard Error sb t Stat Decision: Reject H 0 Conclusion: There is sufficient evidence that square footage affects house price Fall 2006 Fundamentals of Business Statistics 47 b t P-value Regression Analsis for Description Confidence Interval Estimate of the Slope: b ± t α/2 s d.f. = n - 2 b ( ) Ecel Printout for House Prices: Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Intercept Square Feet At 95% level of confidence, the confidence interval for the slope is (0.0337, 0.858) Fall 2006 Fundamentals of Business Statistics 48 24

25 Chapter 3 Student Lecture Notes 3-25 Regression Analsis for Description Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Intercept Square Feet Since the units of the house price variable is $000s, we are 95% confident that the average impact on sales price is between $33.70 and $85.80 per square foot of house size This 95% confidence interval does not include 0. Conclusion: There is a significant relationship between house price and square feet at the.05 level of significance Fall 2006 Fundamentals of Business Statistics 49 Residual Analsis Purposes Eamine for linearit assumption Eamine for constant variance for all levels of Evaluate normal distribution assumption Graphical Analsis of Residuals Can plot residuals vs. Can create histogram of residuals to check for normalit Fall 2006 Fundamentals of Business Statistics 50 25

26 Chapter 3 Student Lecture Notes 3-26 Residual Analsis for Linearit residuals Not Linear Linear Fall 2006 Fundamentals of Business Statistics 5 residuals Residual Analsis for Constant Variance residuals Non-constant variance Constant variance Fall 2006 Fundamentals of Business Statistics 52 residuals 26

27 Chapter 3 Student Lecture Notes 3-27 Residual Output RESIDUAL OUTPUT Predicted House Price Residuals Residuals House Price Model Residual Plot Square Feet Fall 2006 Fundamentals of Business Statistics 53 27