Lecture 1: Simple Linear Regression

1 Lecture 1: Simple Linear Regression
Maochao Xu, Department of Mathematics, Illinois State University

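2 Review basic concepts in inference

Assume $X_1, \dots, X_n$ are i.i.d. normal random variables with mean $\mu$ and variance $\sigma^2$. That is, $X_i \sim N(\mu, \sigma^2)$, $i = 1, \dots, n$.

Sample mean:
$$\bar{X} = \sum_{i=1}^{n} X_i / n.$$

Sample variance:
$$S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n - 1).$$

Unbiased estimates:
$$E(\bar{X}) = \mu, \quad E(S^2) = \sigma^2.$$

Standard normal distribution:
$$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim N(0, 1).$$

t distribution:
$$\frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1}.$$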
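The sample quantities above can be computed directly. The following is a minimal Python sketch using only the standard library; the data values and the null mean `mu0` are hypothetical and chosen only for illustration.

```python
import math

def sample_mean(xs):
    """X-bar = sum(X_i) / n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """S^2 = sum((X_i - X-bar)^2) / (n - 1), the unbiased estimate of sigma^2."""
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

def t_statistic(xs, mu0):
    """(X-bar - mu0) / (S / sqrt(n)); follows t_{n-1} when the true mean is mu0."""
    n = len(xs)
    return (sample_mean(xs) - mu0) / math.sqrt(sample_variance(xs) / n)

data = [4.8, 5.1, 5.3, 4.9, 5.4]   # hypothetical measurements
xbar = sample_mean(data)
s2 = sample_variance(data)
t = t_statistic(data, mu0=5.0)      # hypothetical null value for the mean
```

The statistic `t` would then be compared with a quantile of the $t_{n-1}$ distribution, here with $n - 1 = 4$ degrees of freedom.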

3 $\chi^2$ distribution

Let $Z_1, \dots, Z_n$ be $n$ i.i.d. standard normal random variables. Then the chi-square random variable is defined as
$$\chi^2_n = Z_1^2 + \cdots + Z_n^2.$$

It is known that
$$E(\chi^2_n) = n, \quad \mathrm{Var}(\chi^2_n) = 2n.$$

Recall that
$$\frac{(n - 1) S^2}{\sigma^2} \sim \chi^2_{n-1}.$$

F distribution (how to use it?):
$$F_{m,n} = \frac{\chi^2_m / m}{\chi^2_n / n},$$
where $\chi^2_m$ and $\chi^2_n$ are independent.
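The moments $E(\chi^2_n) = n$ and $\mathrm{Var}(\chi^2_n) = 2n$ can be checked by simulation. The sketch below builds chi-square draws from squared standard normals exactly as in the definition; the seed, the choice $n = 5$, and the number of replications are arbitrary assumptions.

```python
import random
import statistics

def chi_square_draw(n, rng):
    """One draw of Z_1^2 + ... + Z_n^2 with Z_i i.i.d. N(0, 1)."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

rng = random.Random(0)          # fixed seed so the run is reproducible
n = 5                           # degrees of freedom (arbitrary choice)
draws = [chi_square_draw(n, rng) for _ in range(20000)]

sim_mean = statistics.fmean(draws)     # should be close to n
sim_var = statistics.variance(draws)   # should be close to 2 * n
```

With 20000 replications the simulated mean and variance should land near 5 and 10, up to Monte Carlo error.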

4 Estimation methods

Maximum likelihood estimator (MLE)

Suppose $Y_1, \dots, Y_n$ are random variables with density function $f(y; \theta)$, where $\theta$ is an unknown parameter. Given independent observations $y_1, \dots, y_n$, the likelihood function can be expressed as
$$L(\theta) = \prod_{i=1}^{n} f(y_i; \theta).$$
The MLE can be obtained by maximizing $L(\theta)$ or $\log L(\theta)$. That is,
$$\hat{\theta} = \arg\max_{\theta} L(\theta).$$

Least squares estimator (LSE)

The sample observations are assumed to be of the form
$$Y_i = f_i(\theta) + \epsilon_i, \quad i = 1, \dots, n,$$
where $f_i(\theta)$ is a known function of the parameter $\theta$ and the $\epsilon_i$ are random variables. The LSE can be obtained by minimizing the function
$$Q(\theta) = \sum_{i=1}^{n} [Y_i - f_i(\theta)]^2.$$
That is,
$$\hat{\theta} = \arg\min_{\theta} Q(\theta).$$
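As a small illustration of how the two criteria relate, consider the simplest case $f_i(\theta) = \theta$ with normal errors of known variance. This special case is an assumption made for the example, not something the slides work out. Both the least squares criterion and the normal log-likelihood are then optimized by the sample mean.

```python
import math

def Q(theta, ys):
    """Least squares criterion for the model Y_i = theta + eps_i."""
    return sum((y - theta) ** 2 for y in ys)

def log_likelihood(theta, ys, sigma=1.0):
    """Normal log-likelihood with known sigma (assumed for this example)."""
    n = len(ys)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - Q(theta, ys) / (2 * sigma ** 2)

ys = [2.3, 1.9, 2.8, 2.1, 2.4]      # hypothetical observations
theta_hat = sum(ys) / len(ys)       # closed-form minimizer of Q

# Moving away from theta_hat in either direction makes both criteria worse.
for delta in (-0.1, 0.1):
    assert Q(theta_hat, ys) < Q(theta_hat + delta, ys)
    assert log_likelihood(theta_hat, ys) > log_likelihood(theta_hat + delta, ys)
```

Because the log-likelihood equals a constant minus $Q(\theta) / (2\sigma^2)$, maximizing one is the same as minimizing the other in this setting.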

5 Simple linear regression

Example 1: Mothers' and daughters' heights

During the period , E. S. Pearson organized the collection of n = 1375 heights of mothers in the United Kingdom under the age of 65 and one of their adult daughters over the age of 18.

Questions:
1. How is a mother's height related to her daughter's height?
2. Can a daughter's height be predicted from her mother's height?

6 Example 2: Atmospheric pressure and the boiling point

A Scottish physicist named James D. Forbes discussed a series of experiments that he had done concerning the relationship between atmospheric pressure and the boiling point of water. He collected 17 data points from different locations.

Questions:
1. How are pressure and boiling point related?
2. Can pressure be predicted from boiling point, and how well?

7 Example 3: House values

What is the fair market value of a house?

Questions:
1. If a house has 8 rooms, what is the price?
2. What is the price range for a house?
3. What is the average price for a room?

8 Model and notations

1. Bivariate data $(X, Y)$: $(x_1, y_1), \dots, (x_n, y_n)$
2. $X$: independent variable
3. $Y$: dependent variable (response)
4. Regression equation: $Y = \beta_0 + \beta_1 X + \epsilon$, where $\beta_0$ is the intercept and $\beta_1$ is the slope, which represents the number of units by which $Y$ increases on average when $X$ increases by one unit. For the $i$th trial,
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i.$$
5. Assumptions on $\epsilon_i$:
   Independence: $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$, $i \neq j$.
   Homoscedasticity (constant variance): $\mathrm{Var}(\epsilon_i \mid X = x_i) = \sigma^2$.
   Zero mean: $E[\epsilon_i \mid X = x_i] = 0$.

9 Ordinary least squares estimation: OLS

Minimize the residual sum of squares.

1. OLS:
$$\mathrm{RSS}(\beta_0, \beta_1) = \sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i)]^2.$$
Find $\beta_0$ and $\beta_1$ to minimize $\mathrm{RSS}(\beta_0, \beta_1)$ (how?): $(b_0, b_1)$ (or $(\hat{\beta}_0, \hat{\beta}_1)$).
2. Residual sum of squares:
$$\mathrm{RSS} = \mathrm{RSS}(b_0, b_1) = \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_i)]^2 = \sum_{i=1}^{n} e_i^2.$$
3. Residual:
$$e_i = y_i - (b_0 + b_1 x_i), \quad i = 1, \dots, n.$$
4. Fitted value:
$$\hat{y}_i = b_0 + b_1 x_i.$$

10 Ordinary least squares estimation: OLS

$$b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}.$$
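The closed-form estimates translate directly into code. The following is a minimal sketch in plain Python; the function name `ols_fit` and the five data points are hypothetical, introduced only for illustration.

```python
def ols_fit(xs, ys):
    """Return (b0, b1) from the closed-form OLS formulas."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)                      # sum (X_i - X-bar)^2
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))  # sum (X_i - X-bar)(Y_i - Y-bar)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

xs = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical predictor values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]     # hypothetical responses
b0, b1 = ols_fit(xs, ys)
```

For these data $\bar{X} = 3$, $\bar{Y} = 6.02$, $\sum (X_i - \bar{X})^2 = 10$ and $\sum (X_i - \bar{X})(Y_i - \bar{Y}) = 19.9$, so $b_1 = 1.99$ and $b_0 = 6.02 - 1.99 \cdot 3 = 0.05$.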

11 Estimating $\sigma^2$

The regression equation gives the mean of the group. What is the standard deviation of this group? That is, we have to estimate $\sigma$. Recall the model
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad \mathrm{Var}(\epsilon_i) = \sigma^2.$$

A natural estimator for $\sigma^2$ is
$$\frac{1}{n} \sum_{i=1}^{n} (\epsilon_i - E[\epsilon])^2 = \frac{1}{n} \sum_{i=1}^{n} \epsilon_i^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2.$$

Since $\beta_0$ and $\beta_1$ are unknown, we use the estimates:
$$\frac{1}{n} \sum_{i=1}^{n} e_i^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 = \mathrm{SSE}/n.$$

Since $b_0$ and $b_1$ are estimated from the data, the $e_i^2$ are no longer independent, so we use the following estimator:
$$s^2 = \frac{\mathrm{SSE}}{n - 2} = \mathrm{MSE},$$
where $n - 2$ is the degrees of freedom (df) (why?), and MSE stands for error mean square or residual mean square. Generally, df = number of cases - number of parameters.
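A short sketch of the MSE computation follows, assuming the same closed-form OLS estimates as on the earlier slide. The helper name `ols_mse` and the data are hypothetical.

```python
def ols_mse(xs, ys):
    """Return (SSE, MSE) for the simple linear regression of ys on xs."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))  # sum of squared residuals
    return sse, sse / (n - 2)       # two estimated parameters, so df = n - 2

xs = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
sse, mse = ols_mse(xs, ys)
```

For these data the residuals are 0.06, -0.13, 0.18, -0.21, 0.10, so SSE = 0.107 and MSE = 0.107 / 3.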

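12 Properties of OLS

The estimates $b_0$ and $b_1$ can both be written as linear combinations of $Y_1, \dots, Y_n$ (prove!):
$$b_1 = \sum_{i=1}^{n} k_i Y_i, \quad \text{where } k_i = \frac{X_i - \bar{X}}{\sum_{j=1}^{n} (X_j - \bar{X})^2}.$$

Some interesting properties of $k_i$:
$$\sum k_i = 0; \quad \sum k_i X_i = 1; \quad \sum k_i^2 = \frac{1}{\sum (X_i - \bar{X})^2}.$$

Properties of $b_1$:
1. Mean (unbiasedness):
$$E(b_1) = \beta_1.$$
2. Variance:
$$\mathrm{Var}(b_1) = \sum k_i^2 \, \mathrm{Var}(Y_i) = \frac{\sigma^2}{\sum (X_i - \bar{X})^2}.$$
3. Estimated variance:
$$S^2(b_1) = \frac{\mathrm{MSE}}{\sum (X_i - \bar{X})^2},$$
which is an unbiased estimate of $\mathrm{Var}(b_1)$.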
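The three identities for the weights $k_i$ can be verified numerically. This is a sketch on hypothetical data, not a proof; the function name `slope_weights` is introduced here for illustration.

```python
def slope_weights(xs):
    """k_i = (X_i - X-bar) / sum_j (X_j - X-bar)^2."""
    xbar = sum(xs) / len(xs)
    sxx = sum((x - xbar) ** 2 for x in xs)
    return [(x - xbar) / sxx for x in xs]

xs = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
k = slope_weights(xs)

sum_k = sum(k)                                   # should be 0
sum_kx = sum(ki * x for ki, x in zip(k, xs))     # should be 1
sum_k2 = sum(ki ** 2 for ki in k)                # should be 1 / Sxx = 0.1 here
b1 = sum(ki * y for ki, y in zip(k, ys))         # slope as a linear combination of Y_i
```

Here the weights are -0.2, -0.1, 0, 0.1, 0.2, and the weighted sum of the responses reproduces the OLS slope.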

13 Properties of OLS

The fitted value at $x = \bar{x}$ is
$$\hat{E}(Y \mid X = \bar{x}) = b_0 + b_1 \bar{x} = \bar{y} - b_1 \bar{x} + b_1 \bar{x} = \bar{y},$$
so the fitted line must pass through the point $(\bar{x}, \bar{y})$, intuitively the center of the data. So, we have
$$b_0 = \bar{Y} - b_1 \bar{X}.$$

Under the linear model assumptions, the least squares estimates are unbiased:
$$E(b_0) = \beta_0.$$
Further,
$$\mathrm{Var}(b_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right).$$
Similarly, the estimated variance is
$$S^2(b_0) = \mathrm{MSE} \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right).$$
The two estimates are correlated, with covariance
$$\mathrm{Cov}(b_0, b_1) = -\sigma^2 \frac{\bar{X}}{\sum (X_i - \bar{X})^2}.$$

(Question: What happens if the data become more spread out?)
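One way to explore the question about spread is to evaluate the variance formulas for two designs with the same center but different spread. The sketch below assumes $\sigma^2 = 1$ and two hypothetical sets of predictor values.

```python
def coefficient_variances(xs, sigma2):
    """Return (Var(b0), Var(b1), Cov(b0, b1)) from the formulas on this slide."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    var_b1 = sigma2 / sxx
    var_b0 = sigma2 * (1.0 / n + xbar ** 2 / sxx)
    cov_b0_b1 = -sigma2 * xbar / sxx
    return var_b0, var_b1, cov_b0_b1

narrow = [2.0, 2.5, 3.0, 3.5, 4.0]   # hypothetical design, Sxx = 2.5
wide = [1.0, 2.0, 3.0, 4.0, 5.0]     # hypothetical design, Sxx = 10

v_narrow = coefficient_variances(narrow, sigma2=1.0)
v_wide = coefficient_variances(wide, sigma2=1.0)
```

With the wider design, $\sum (X_i - \bar{X})^2$ is four times larger, so $\mathrm{Var}(b_1)$ drops from 0.4 to 0.1, $\mathrm{Var}(b_0)$ drops from 3.8 to 1.1, and the covariance shrinks in magnitude from -1.2 to -0.3.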

14 Properties

The sum of the residuals is zero:
$$\sum_{i=1}^{n} e_i = 0.$$

The sum of the observed values $Y_i$ equals the sum of the fitted values $\hat{Y}_i$:
$$\sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} \hat{Y}_i.$$

The sum of the weighted residuals is zero:
$$\sum_{i=1}^{n} X_i e_i = 0.$$

Also,
$$\sum_{i=1}^{n} \hat{Y}_i e_i = 0.$$
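These identities hold for any data set fitted by OLS with an intercept, so they make a convenient sanity check on an implementation. A minimal sketch on hypothetical data:

```python
def fit_and_residuals(xs, ys):
    """Return (fitted values, residuals) for the OLS line."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    fitted = [b0 + b1 * x for x in xs]
    resid = [y - f for y, f in zip(ys, fitted)]
    return fitted, resid

xs = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted, resid = fit_and_residuals(xs, ys)

sum_e = sum(resid)                                    # sum of residuals
sum_xe = sum(x * e for x, e in zip(xs, resid))        # residuals weighted by X_i
sum_fe = sum(f * e for f, e in zip(fitted, resid))    # residuals weighted by fitted values
```

All three sums are zero up to floating-point rounding, and consequently `sum(ys)` equals `sum(fitted)`.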

15 Gauss-Markov theorem

The OLS estimates are the best linear unbiased estimators (BLUE). For example, if $b_1 = \sum k_i Y_i$ with $E(b_1) = \beta_1$, then for any other linear unbiased estimate $\tilde{b}_1$ (prove!),
$$\mathrm{Var}(b_1) \le \mathrm{Var}(\tilde{b}_1).$$

If we further assume that $\epsilon_i \sim N(0, \sigma^2)$, then the OLS estimates are also maximum likelihood estimates (MLE). Under the normal assumption,
$$\hat{\beta}_1 \sim N\left( \beta_1, \frac{\sigma^2}{\sum (X_i - \bar{X})^2} \right),$$
$$\hat{\beta}_0 \sim N\left( \beta_0, \sigma^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right) \right).$$

These results will be used to construct confidence intervals, tests, and other statistical inferences.

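16 Confidence intervals and tests

Linear model assumptions:
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2).$$

Under this model:
$$\frac{b_0 - \beta_0}{S(b_0)} \sim t_{n-2}, \qquad \frac{b_1 - \beta_1}{S(b_1)} \sim t_{n-2}.$$

Hence, the $100(1 - \alpha)\%$ confidence interval for $\beta_0$ is
$$b_0 \pm t_{n-2, \alpha/2} \, S(b_0);$$
the $100(1 - \alpha)\%$ confidence interval for $\beta_1$ is
$$b_1 \pm t_{n-2, \alpha/2} \, S(b_1).$$

A hypothesis test of
$$H_0: \beta_0 = \beta_0^* \quad \text{vs} \quad H_a: \beta_0 \neq \beta_0^*$$
is obtained by computing
$$t = \frac{b_0 - \beta_0^*}{S(b_0)} \sim t_{n-2} \quad \text{under } H_0.$$
Then, reject $H_0$ if $|t| > t_{n-2, \alpha/2}$.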
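The intervals can be sketched in code as follows. To stay within the standard library, the critical value $t_{n-2, \alpha/2}$ is passed in as an argument rather than computed; the value 3.182 used below is the tabulated upper 2.5% point of the t distribution with 3 degrees of freedom. The data and function name are hypothetical.

```python
import math

def coefficient_intervals(xs, ys, t_crit):
    """Return ((lo0, hi0), (lo1, hi1)): intervals b_j +/- t_crit * S(b_j)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    mse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    se_b1 = math.sqrt(mse / sxx)                          # S(b1)
    se_b0 = math.sqrt(mse * (1.0 / n + xbar ** 2 / sxx))  # S(b0)
    ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)
    ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return ci_b0, ci_b1

xs = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical data, n = 5, so df = 3
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
ci_b0, ci_b1 = coefficient_intervals(xs, ys, t_crit=3.182)   # 95% intervals
```

In this example the interval for $\beta_0$ contains 0 while the interval for $\beta_1$ does not, which corresponds to not rejecting $H_0: \beta_0 = 0$ and rejecting $H_0: \beta_1 = 0$ at the 5% level.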

17 Confidence intervals and tests

Similarly, a hypothesis test of
$$H_0: \beta_1 = \beta_1^* \quad \text{vs} \quad H_a: \beta_1 \neq \beta_1^*$$
is obtained by computing
$$t = \frac{b_1 - \beta_1^*}{S(b_1)} \sim t_{n-2} \quad \text{under } H_0.$$
Then, reject $H_0$ if $|t| > t_{n-2, \alpha/2}$. The p-value can be computed as
$$p = 2 P(T > |t|).$$

Considering the test problem
$$H_0: \beta_1 = 0 \quad \text{vs} \quad H_a: \beta_1 \neq 0,$$
we have
$$t^2 = \left( \frac{b_1}{S(b_1)} \right)^2 = \frac{b_1^2}{S^2(b_1)} \sim F_{1, n-2}.$$
So the square of a t statistic with $d$ df is equivalent to an F statistic with $(1, d)$ df.
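The equivalence can be checked numerically. In the sketch below the F statistic is computed as MSR / MSE, where the regression sum of squares is $\sum (\hat{Y}_i - \bar{Y})^2$ with 1 degree of freedom. That analysis-of-variance form is not defined in these slides and is an assumption of this example; the data are hypothetical.

```python
import math

def t_and_f(xs, ys):
    """Return (t, F) for testing H0: beta_1 = 0 in simple linear regression."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    fitted = [b0 + b1 * x for x in xs]
    mse = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (n - 2)
    t = b1 / math.sqrt(mse / sxx)                  # b1 / S(b1)
    msr = sum((f - ybar) ** 2 for f in fitted)     # regression SS, divided by 1 df
    return t, msr / mse

xs = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
t, f = t_and_f(xs, ys)
```

The two computations agree up to rounding because $\sum (\hat{Y}_i - \bar{Y})^2 = b_1^2 \sum (X_i - \bar{X})^2$.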

18 Interval estimation of $E(Y_h)$

A common objective in regression analysis is to estimate the mean of one or more probability distributions of $Y$. Let $X_h$ denote the level of $X$ for which we wish to estimate the mean response. Then, by the regression equation, we have
$$\hat{Y}_h = b_0 + b_1 X_h.$$

Sampling distribution of $\hat{Y}_h$: $\hat{Y}_h$ is a normal random variable. (Why?)

Mean:
$$E(\hat{Y}_h) = \beta_0 + \beta_1 X_h.$$

Variance:
$$\mathrm{Var}(\hat{Y}_h) = \sigma^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right].$$

Estimated variance:
$$S^2(\hat{Y}_h) = \mathrm{MSE} \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right].$$

t distribution:
$$\frac{\hat{Y}_h - E(\hat{Y}_h)}{S(\hat{Y}_h)} \sim t_{n-2}.$$

Hence, the $100(1 - \alpha)\%$ confidence interval is
$$\hat{Y}_h \pm t_{n-2, \alpha/2} \, S(\hat{Y}_h).$$
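A sketch of the mean-response interval follows. As before, the critical value is supplied by the caller (3.182 is the tabulated upper 2.5% point for 3 df); the function name, data, and the level $X_h = 4$ are hypothetical.

```python
import math

def mean_response_ci(xs, ys, xh, t_crit):
    """Return (lo, hi): Y-hat_h +/- t_crit * S(Y-hat_h) at predictor level xh."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    mse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    yh = b0 + b1 * xh
    se = math.sqrt(mse * (1.0 / n + (xh - xbar) ** 2 / sxx))   # S(Y-hat_h)
    return yh - t_crit * se, yh + t_crit * se

xs = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
lo, hi = mean_response_ci(xs, ys, xh=4.0, t_crit=3.182)
lo_c, hi_c = mean_response_ci(xs, ys, xh=3.0, t_crit=3.182)   # at X-bar
```

The interval is centered at $\hat{Y}_h = 0.05 + 1.99 \cdot 4 = 8.01$, and it is narrowest at $X_h = \bar{X}$ because the term $(X_h - \bar{X})^2$ vanishes there.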

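19 Prediction

In prediction we have a new case, possibly a future value, not one used to estimate the parameters, with observed value of the predictor $X^*$. We would like to know the value $Y^*$, the corresponding response, but it has not yet been observed. We can use the estimated mean function to predict it. The new case satisfies
$$Y^* = \beta_0 + \beta_1 X^* + \epsilon^*, \quad \mathrm{Var}(\epsilon^*) = \sigma^2.$$

A natural predictor is
$$\tilde{Y}^* = b_0 + b_1 X^*.$$

The variance of the prediction error is
$$\mathrm{Var}(\mathrm{pred}) = \mathrm{Var}(\tilde{Y}^* - Y^*) = \sigma^2 + \sigma^2 \left( \frac{1}{n} + \frac{(X^* - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right).$$

The estimated standard error of prediction at $X^*$ is
$$S(\mathrm{pred}) = \left[ \mathrm{MSE} \left( 1 + \frac{1}{n} + \frac{(X^* - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right) \right]^{1/2}.$$

20 Prediction

Hence,
$$\frac{Y^* - \tilde{Y}^*}{S(\mathrm{pred})} \sim t_{n-2}.$$
So, the prediction interval for $Y^*$ is
$$\tilde{Y}^* \pm t_{n-2, \alpha/2} \, S(\mathrm{pred}).$$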
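The prediction interval differs from the interval for the mean response only by the extra $\sigma^2$ term for the new error. A minimal sketch, with the critical value again supplied by the caller and with hypothetical data, function name, and new predictor value:

```python
import math

def prediction_interval(xs, ys, x_new, t_crit):
    """Return (lo, hi): Y-tilde +/- t_crit * S(pred) for a new case at x_new."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    mse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    y_pred = b0 + b1 * x_new
    # The leading 1 accounts for the variance of the new error term.
    s_pred = math.sqrt(mse * (1.0 + 1.0 / n + (x_new - xbar) ** 2 / sxx))
    return y_pred - t_crit * s_pred, y_pred + t_crit * s_pred

xs = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
lo, hi = prediction_interval(xs, ys, x_new=4.0, t_crit=3.182)   # 3.182: tabulated t, 3 df
```

At the same predictor value, this interval is always wider than the confidence interval for the mean response, since it must cover the variability of an individual observation as well as the uncertainty in the fitted line.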
