Least Squares Regression
Alan T. Arnholt
Department of Mathematical Sciences, Appalachian State University
arnholt@math.appstate.edu
Spring 2006 R Notes
Copyright © 2006 Alan T. Arnholt
Least Squares Regression
Overview of Regression
The R Script
Least Squares Regression
When a linear pattern is evident from a scatter plot, the relationship between the two variables is often modeled with a straight line. When modeling a bivariate relationship, Y is called the response or dependent variable, and x is called the predictor or independent variable. The simple linear regression model is written

Yᵢ = β₀ + β₁xᵢ + εᵢ   (1)
OLS
The goal is to estimate the coefficients β₀ and β₁ in (1). The best-known method of estimating them is ordinary least squares (OLS), which chooses, among all possible lines, the one minimizing the sum of the squared vertical deviations of the Yᵢ from the line. Specifically, the sum of the squared residuals (ε̂ᵢ = eᵢ = Yᵢ − Ŷᵢ) is minimized when the OLS estimators of β₀ and β₁ are

b₀ = ȳ − b₁x̄   (2)

b₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²   (3)

respectively. Note that the estimated regression function is written as Ŷᵢ = b₀ + b₁xᵢ.
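As a quick check of Equations (2) and (3), the formulas can be applied to a small data set and compared against R's lm(). A minimal sketch; the x and y values below are invented purely for illustration:

```r
# Made-up data used only to verify the OLS formulas
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Equation (3): slope, then Equation (2): intercept
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)          # 0.14 1.96

# lm() reproduces the same estimates
coef(lm(y ~ x))
```

Agreement between the hand computation and coef(lm(y ~ x)) confirms the formulas were entered correctly.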
Figure: Graph depicting residuals. The vertical distances, shown with dotted lines between the Yᵢ (solid circles) and the Ŷᵢ (open squares), are the residuals; for example, ε̂₂ = Y₂ − Ŷ₂.
Example
Use the data frame Gpa from the BSDA package to:
1. Create a scatterplot of CollGPA versus HSGPA.
2. Find the least squares estimates of β₀ and β₁ using Equations (2) and (3), respectively.
3. Find the least squares estimates of β₀ and β₁ using the R function lm().
4. Add the least squares line to the scatterplot created in 1 using the R function abline().
R Code
Code for part 1.
> library(BSDA)
> attach(Gpa)
> Y <- CollGPA
> x <- HSGPA
> plot(x, Y, col = "blue",
+   main = "Scatterplot of College Versus High School GPA",
+   xlab = "High School GPA", ylab = "College GPA")
Figure: Scatterplot requested in part 1 (College GPA versus High School GPA).
Using Equations (2) and (3) to Find b₀ and b₁
Using Equations (2) and (3) to answer part 2.
> b1 <- sum( (x - mean(x)) * (Y - mean(Y)) ) /
+   sum( (x - mean(x))^2 )
> b0 <- mean(Y) - b1 * mean(x)
> c(b0, b1)
[1] -0.950366  1.346999
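Part 3 asks for the same estimates from lm(), and later slides use an object named model produced by that fit. A sketch of how model would be created; in the slides, x and Y come from attach(Gpa) in part 1, while the stand-in values below are invented so the snippet runs on its own:

```r
# Stand-in values for HSGPA (x) and CollGPA (Y); in the slides these
# come from attach(Gpa) in part 1
x <- c(2.0, 2.5, 3.0, 3.4)
Y <- c(1.8, 2.6, 3.0, 3.5)

model <- lm(Y ~ x)   # least squares fit of Y on x
coef(model)          # b0 (intercept) and b1 (slope)
```

With the Gpa data, coef(model) reproduces the values b0 and b1 obtained above from Equations (2) and (3).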
Using abline()
Using the R function abline() to add the least squares regression line to the scatterplot in Figure 2. abline() adds one or more straight lines to the current plot. Its first two arguments are the intercept a = b₀ and the slope b = b₁; it also accepts a fitted lm object directly, extracting the coefficients itself.
> abline(model, col = "blue", lwd = 2)
Note: the object model contains b₀ and b₁; it is the object returned by lm() in part 3.
Scatterplot of GPA with Superimposed Least Squares Regression Line
Figure: Scatterplot requested in part 4 (the scatterplot from part 1 with the least squares line added).
Residuals and Predicted (Fitted) Values
The iᵗʰ residual is defined to be eᵢ = Yᵢ − Ŷᵢ. The value Ŷᵢ obtained from the estimated regression function Ŷᵢ = b₀ + b₁xᵢ for a given xᵢ is referred to as the predicted value, as well as the fitted value. The R functions predict() and fitted() can be used on lm objects.
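One difference between the two functions: predict() can also produce fitted values at x values that were not in the data, via its newdata argument, while fitted() only returns values at the observed x's. A sketch with invented data; note that newdata must be a data frame whose column name matches the predictor used in the lm() call:

```r
# Invented data for illustration
hs   <- c(2.2, 2.8, 3.1, 3.6)
coll <- c(1.9, 2.6, 3.0, 3.3)
fit  <- lm(coll ~ hs)

predict(fit)                                   # fitted values at the observed hs
predict(fit, newdata = data.frame(hs = 3.0))   # prediction at a new x value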
22 Using fitted() and predict() > yhat <- b0+b1*x > yhatrp <- predict(model) > yhatrf <- fitted(model) > e <- Y - yhat > er <- resid(model) > COMPARE <- rbind(yhat,yhatrp,yhatrf,e,er) > COMPARE[,1:4] # all rows columns 1:4 1 2 3 4 yhat 2.68653 3.2253294 1.8783309 3.3600293 yhatrp 2.68653 3.2253294 1.8783309 3.3600293 yhatrf 2.68653 3.2253294 1.8783309 3.3600293 e -0.48653-0.4253294 0.5216691 0.4399707 er -0.48653-0.4253294 0.5216691 0.4399707
Sum of Squares Due to Error
The sum of squares due to error (also called the residual sum of squares) is defined as

SSE = Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)² = Σᵢ₌₁ⁿ eᵢ²   (4)

Use the definition in (4) and the R function anova() to compute the SSE for the regression of Y on x (Gpa).
R Code
> SSE <- sum(e^2)
> SSE
[1] 1.502284
> anova(model)
Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq F value   Pr(>F)
x          1 3.7177  3.7177  19.798 0.002141 **
Residuals  8 1.5023  0.1878
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> anova(model)[2, 2]
[1] 1.502284
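Besides the definition in (4) and anova(), the SSE of a fitted lm object can be read off with deviance(), which for a least squares fit returns the residual sum of squares. A sketch with invented data:

```r
# Invented data for illustration
x <- c(1, 2, 3, 4)
y <- c(1.2, 1.9, 3.2, 3.9)
fit <- lm(y ~ x)

sse <- sum(resid(fit)^2)   # Equation (4)
deviance(fit)              # same quantity, straight from the lm object
```

This avoids indexing into the anova() table (e.g. anova(fit)[2, 2]) when only the SSE is needed.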
Pretty ANOVA Table
          Df Sum Sq Mean Sq F value Pr(>F)
x          1  3.718   3.718  19.798  0.002
Residuals  8  1.502   0.188
Link to the R Script
Go to my web page: Script for Regression.
Homework: problems 2.35–2.40 and 2.42–2.46. See me if you need help!