The Classical Linear Model and OLS Estimation


1 Econ 507. Econometric Analysis. Spring 2009. January 19, 2009.

2 The Classical Linear Model


6 Social sciences: non-exact relationships. Starting point: a model for the non-exact relationship between y (explained variable) and a set of variables x (the explanatory variables).

7 Assumption 1 (linearity):

$$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_K x_{Ki} + u_i, \qquad i = 1, \ldots, n$$

$y_i$: explained variable for observation $i$. Its realizations are observed.
$x_{ki}$, $k = 1, \ldots, K$: the $K$ explanatory variables. Observed realizations.
$u_i$ is a random variable with unobserved realizations. It represents the non-exact nature of the relationship.
$\beta_k$, $k = 1, \ldots, K$, are the regression coefficients.
Assumption 1: the underlying relationship is linear for all observations.

8 The model in matrix notation. Define the following vectors and matrices:

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}_{n \times 1} \qquad
\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_K \end{pmatrix}_{K \times 1} \qquad
X = \begin{pmatrix} x_{11} & x_{21} & \cdots & x_{K1} \\ x_{12} & x_{22} & \cdots & x_{K2} \\ \vdots & \vdots & & \vdots \\ x_{1n} & x_{2n} & \cdots & x_{Kn} \end{pmatrix}_{n \times K} \qquad
u = \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}_{n \times 1}$$

9 Then the linear model can be written as:

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix} x_{11} & x_{21} & \cdots & x_{K1} \\ \vdots & \vdots & & \vdots \\ x_{1n} & x_{2n} & \cdots & x_{Kn} \end{pmatrix}
\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_K \end{pmatrix} +
\begin{pmatrix} u_1 \\ \vdots \\ u_n \end{pmatrix}$$

$$Y = X\beta + u$$

This is the linear model in matrix form.
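As an aside (not from the original slides), here is a minimal NumPy sketch that simulates data from the model $Y = X\beta + u$; the sample size, coefficient values, and error scale are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 3

# X: n x K regressor matrix; a first column of ones gives the model an intercept.
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([1.0, 2.0, -0.5])   # illustrative "true" coefficients
u = rng.normal(size=n)              # error term, here spherical with sigma^2 = 1

Y = X @ beta + u                    # the linear model in matrix form
```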

10 Basic Results on Matrices and Random Vectors

Before we proceed, we need to establish some results involving matrices and vectors.

Let $A$ be an $m \times n$ matrix. $A$ can be viewed as $n$ column vectors, or as $m$ row vectors. The column rank of $A$ is defined as the maximum number of linearly independent columns. Similarly, the row rank is the maximum number of linearly independent rows. The row rank is equal to the column rank, so we will talk, in general, about the rank of a matrix $A$, denoted $\rho(A)$.

Let $A$ be a square ($m \times m$) matrix. $A$ is non-singular if $|A| \neq 0$. In such a case, there exists a unique non-singular matrix $A^{-1}$, called the inverse of $A$, such that $AA^{-1} = A^{-1}A = I_m$.

11 Let $A$ be a square $m \times m$ matrix. If $\rho(A) = m$, then $|A| \neq 0$; if $\rho(A) < m$, then $|A| = 0$.

Let $X$ be an $n \times K$ matrix with $\rho(X) = K$ (full column rank). Then $\rho(X) = \rho(X'X) = K$.

This result guarantees the existence of $(X'X)^{-1}$ based on the rank of $X$.
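A quick numerical illustration of this rank result, with an arbitrary full-column-rank $X$ (a sketch, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                    # n = 50, K = 4, generic entries

print(np.linalg.matrix_rank(X))                 # 4: full column rank
print(np.linalg.matrix_rank(X.T @ X))           # 4: rho(X) = rho(X'X), so X'X is invertible

# An exact linear combination among the columns breaks the result:
X_bad = np.column_stack([X, X[:, 0] + X[:, 1]])
print(np.linalg.matrix_rank(X_bad.T @ X_bad))   # 4 < K = 5: X'X is singular
```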

12 Let $b$ and $a$ be two $K \times 1$ vectors. Then:

$$\frac{\partial (b'a)}{\partial b} = a$$

Let $b$ be a $K \times 1$ vector and $A$ a symmetric $K \times K$ matrix. Then:

$$\frac{\partial (b'Ab)}{\partial b} = 2Ab$$
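A small numerical check of the rule $\partial(b'Ab)/\partial b = 2Ab$ via central finite differences; the matrix $A$ and vector $b$ below are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 4
M = rng.normal(size=(K, K))
A = (M + M.T) / 2                   # a symmetric K x K matrix
b = rng.normal(size=K)

analytic = 2 * A @ b                # the rule d(b'Ab)/db = 2Ab

eps = 1e-6
numeric = np.empty(K)
for k in range(K):
    e = np.zeros(K)
    e[k] = eps
    # central difference of the quadratic form in the k-th coordinate
    numeric[k] = ((b + e) @ A @ (b + e) - (b - e) @ A @ (b - e)) / (2 * eps)

print(np.allclose(analytic, numeric))   # True
```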

13 Let $Y$ be a vector of $K$ random variables:

$$Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_K \end{pmatrix} \qquad
E(Y) = \mu = \begin{pmatrix} E(Y_1) \\ E(Y_2) \\ \vdots \\ E(Y_K) \end{pmatrix}$$

14 The variance of a random vector:

$$V(Y) = E[(Y - \mu)(Y - \mu)'] =
\begin{pmatrix}
V(Y_1) & Cov(Y_1, Y_2) & \cdots & Cov(Y_1, Y_K) \\
Cov(Y_2, Y_1) & V(Y_2) & \cdots & Cov(Y_2, Y_K) \\
\vdots & \vdots & \ddots & \vdots \\
Cov(Y_K, Y_1) & Cov(Y_K, Y_2) & \cdots & V(Y_K)
\end{pmatrix}$$

with typical element $E[(Y_j - \mu_j)(Y_k - \mu_k)]$. The variance of a vector is called its variance-covariance matrix, a $K \times K$ matrix.

If $V(Y) = \Sigma$ and $c$ is a $K \times 1$ vector, then $V(c'Y) = c'V(Y)c = c'\Sigma c$.
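An illustrative simulation check of $V(c'Y) = c'\Sigma c$; the values of $\Sigma$ and $c$ are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])      # an illustrative 2 x 2 covariance matrix
c = np.array([1.0, -2.0])

draws = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=200_000)
print(np.var(draws @ c))            # sample variance of c'Y, close to...
print(c @ Sigma @ c)                # ...the exact value c' Sigma c = 4.0
```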

15 Conditional Expectations

$$E(Y \mid X = x) = \int y \, f_{Y \mid X}(y \mid x) \, dy$$

Idea: how the expected value of $Y$ changes when $X$ changes. It is a function that depends on $X$. If $X$ is a random variable, then $E(Y \mid X)$ is a random variable.

Properties:
$E(g(X) \mid X) = g(X)$.
If $Y = a + bX + U$, then $E(Y \mid X) = a + bX + E(U \mid X)$.
$E(Y) = E[E(Y \mid X)]$ (Law of Iterated Expectations).
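A simulation sketch of the Law of Iterated Expectations for the illustrative model $Y = a + bX + U$ with $U$ drawn independently of $X$ (so $E(U \mid X) = 0$):

```python
import numpy as np

rng = np.random.default_rng(4)
a, b = 1.0, 2.0
X = rng.normal(size=1_000_000)
U = rng.normal(size=1_000_000)      # independent of X, so E(U | X) = 0
Y = a + b * X + U

print(Y.mean())                     # E(Y), estimated directly
print((a + b * X).mean())           # E[E(Y | X)]: averaging the conditional mean
```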

16 Assumption 2: Strict Exogeneity

$$E(u_i \mid X) = 0, \qquad i = 1, 2, \ldots, n$$

In basic courses it is assumed that $E(u_i) = 0$. Which one is stronger?

17 Implications of strict exogeneity:

$E(u_i) = 0$, $i = 1, \ldots, n$. Proof: by the law of iterated expectations and strict exogeneity, $E(u) = E[E(u \mid X)] = E(0) = 0$. In words: on average, the model is exactly linear.

$E(x_{jk} u_i) = 0$, $j, i = 1, \ldots, n$; $k = 1, \ldots, K$. In words: the explanatory variables are uncorrelated with the error terms of all observations. Proof: left as an exercise.

18 Assumption 3: No Multicollinearity

Rank condition: $\rho(X) = K$, w.p.1.

All columns of the realizations of $X$ must be linearly independent. Careful: this prohibits exact linear relations between the columns of $X$. The model admits non-exact relations and/or non-linear relations. Examples.

19 Assumption 4: Spherical Error Variance

Homoskedasticity: $E(u_i^2 \mid X) = \sigma^2 > 0$, $i = 1, \ldots, n$.

No serial correlation: $E(u_i u_j \mid X) = 0$, $i, j = 1, \ldots, n$, $i \neq j$.

20 Homoskedasticity: by strict exogeneity,

$$V(u_i \mid X) = E(u_i^2 \mid X) - E(u_i \mid X)^2 = E(u_i^2 \mid X)$$

so the assumption implies a constant conditional variance for the error term.

No serial correlation: also by strict exogeneity,

$$Cov(u_i, u_j \mid X) = E(u_i u_j \mid X)$$

so the assumption implies that, given $X$, the error terms of all observations are uncorrelated.

21 Assumption 4 in matrix terms:

$$V(u \mid X) = E(uu' \mid X) = \sigma^2 I_n$$

Recall that for any random vector $Z$ of $n$ elements, $V(Z) \equiv E[(Z - E(Z))(Z - E(Z))']$, an $n \times n$ matrix with typical element $v_{ij} = Cov(Z_i, Z_j)$.

Homoskedasticity ($E(u_i^2 \mid X) = \sigma^2$) implies that all the diagonal elements of $V(u \mid X)$ are equal to $\sigma^2$. No serial correlation implies that all the off-diagonal elements of $V(u \mid X)$ are zero.

22 Summary: The Classical Linear Model

1. Linearity: $Y = X\beta + u$.
2. Strict exogeneity: $E(u \mid X) = 0$.
3. No multicollinearity: $\rho(X) = K$, w.p.1.
4. No heteroskedasticity / serial correlation: $V(u \mid X) = \sigma^2 I_n$.

23 Details and Interpretations

Fixed regressors: in basic treatments $X$ is taken as a fixed, non-random matrix. This is more compatible with the experimental sciences, and it simplifies some computations.

The intercept: consider the case $x_{1i} = 1$, $i = 1, \ldots, n$:

$$y_i = \beta_1 + \beta_2 x_{2i} + \cdots + \beta_K x_{Ki} + u_i, \qquad i = 1, \ldots, n$$

Then $\beta_1$ is the intercept of the model. Careful with interpretations.

24 Interpretations

$$E(y_i \mid X) = \beta_1 + \beta_2 x_{2i} + \cdots + \beta_K x_{Ki}$$

If $E(y_i \mid X)$ is differentiable with respect to $x_{ki}$, and $x_{ki}$ is functionally unrelated to all other variables:

$$\frac{\partial E(y_i \mid X)}{\partial x_{ki}} = \beta_k$$

Careful: this is a partial derivative, and it is a constant marginal effect.

25 Dummy explanatory variables

Suppose $x_{ki}$ is a binary variable taking two values, indicating that the $i$-th observation belongs (1) or does not belong (0) to a certain class (male/female, for example). We cannot use the previous result for an interpretation (why?). Compute the following magnitudes:

$$E(y_i \mid X, x_{ki} = 1) = \beta_1 + \beta_2 x_{2i} + \cdots + \beta_k + \cdots + \beta_K x_{Ki}$$
$$E(y_i \mid X, x_{ki} = 0) = \beta_1 + \beta_2 x_{2i} + \cdots + \beta_K x_{Ki}$$

Then:

$$\beta_k = E(y_i \mid X, x_{ki} = 1) - E(y_i \mid X, x_{ki} = 0)$$

Example: gender differences.

26 The linear model is not that linear

$$y_i = \beta_1 + \beta_2 x_{2i} + \cdots + \beta_K x_{Ki} + u_i$$

Linear? Linear in variables. Linear in parameters. For estimation purposes, what matters is linearity in parameters.

27 A small catalog of non-linear models that can be handled with the classical linear model:

Quadratic: $Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{2i}^2 + u_i$
Inverse: $Y_i = \beta_1 + \beta_2 X_{2i}^{-1} + u_i$
Interactive: $Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{2i} X_{3i} + u_i$
Logarithmic: $\ln Y_i = \beta_1 + \beta_2 \ln X_{2i} + u_i$
Semilogarithmic: $\ln Y_i = \beta_1 + \beta_2 X_{2i} + u_i$

We will explore interpretations and examples in the homework. A construction of the transformed regressors is sketched below.
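A hedged sketch of how the transformed regressor matrices could be built with NumPy; the variables and their ranges are invented for illustration (positive supports, so logs and inverses are defined):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x2 = rng.uniform(1.0, 10.0, size=n)     # strictly positive illustrative regressor
x3 = rng.uniform(1.0, 10.0, size=n)
ones = np.ones(n)

X_quadratic   = np.column_stack([ones, x2, x2**2])
X_inverse     = np.column_stack([ones, 1.0 / x2])
X_interactive = np.column_stack([ones, x2, x3, x2 * x3])
X_logarithmic = np.column_stack([ones, np.log(x2)])   # regress ln(y) on ln(x2)
X_semilog     = np.column_stack([ones, x2])           # regress ln(y) on x2
```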

28 Goal: recover $\beta$ based on a sample $(y_i, x_i)$, $i = 1, \ldots, n$.

Let $\tilde{\beta}$ be any estimator of $\beta$. Define $\tilde{Y} \equiv X\tilde{\beta}$ (our prediction of $Y$). Define $\tilde{e} = Y - \tilde{Y}$ (estimation errors). Note that if $n > K$ we cannot produce an estimator by forcing $\tilde{e} = 0$. Why? We need a criterion to derive a sensible and feasible estimator.

29 Consider the following penalty function:

$$SSR(\tilde{\beta}) \equiv \sum_{i=1}^n \tilde{e}_i^2 = \tilde{e}'\tilde{e} = (Y - X\tilde{\beta})'(Y - X\tilde{\beta})$$

$SSR(\tilde{\beta})$ is the aggregation of squared errors if we choose $\tilde{\beta}$ as an estimator. The least squares estimator $\hat{\beta}$ will be:

$$\hat{\beta} = \operatorname{argmin}_{\tilde{\beta}} \; SSR(\tilde{\beta})$$

30 Result: $\hat{\beta} = (X'X)^{-1} X'Y$

$$SSR(\tilde{\beta}) = \tilde{e}'\tilde{e} = (Y - X\tilde{\beta})'(Y - X\tilde{\beta}) = Y'Y - \tilde{\beta}'X'Y - Y'X\tilde{\beta} + \tilde{\beta}'X'X\tilde{\beta} = Y'Y - 2\tilde{\beta}'X'Y + \tilde{\beta}'X'X\tilde{\beta}$$

In the second-to-last expression, note that $\tilde{\beta}'X'Y$ is a scalar, and hence it is trivially equal to its transpose $Y'X\tilde{\beta}$; that is how we obtain the final expression. $SSR$ can easily be shown to be a strictly convex, differentiable function of $\tilde{\beta}$, so the first order conditions for a stationary point are sufficient for a global minimum.

31 The first order conditions are:

$$\frac{\partial \tilde{e}'\tilde{e}}{\partial \tilde{\beta}} = 0$$

Using the derivation rules introduced before:

$$\frac{\partial \tilde{e}'\tilde{e}}{\partial \tilde{\beta}} = -2X'Y + 2X'X\tilde{\beta} = 0$$

which is a system of $K$ linear equations in $K$ unknowns ($\tilde{\beta}$). Solving for $\tilde{\beta}$ gives the desired solution:

$$\hat{\beta} = (X'X)^{-1} X'Y$$
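A minimal OLS sketch under the assumptions above. Solving the normal equations $X'X\beta = X'Y$ directly is numerically preferable to forming $(X'X)^{-1}$ explicitly; the data-generating values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
n, K = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([1.0, 2.0, -0.5])       # illustrative "true" coefficients
Y = X @ beta + rng.normal(size=n)

# Solve X'X beta = X'Y rather than computing the inverse of X'X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)                         # close to (1.0, 2.0, -0.5)
```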

32 Some comments and details

Existence and uniqueness: guaranteed by the rank assumption $\rho(X) = K$.
Second order conditions: $X'X$ is positive definite, also by the rank assumption.
The role of the assumptions: which of the assumptions have been used to derive the OLS estimator and to guarantee its existence and uniqueness?
Notation: $\hat{Y} \equiv X\hat{\beta}$, $e = Y - \hat{Y}$ (the OLS residuals).

33 Recall the FOCs from the least squares problem:

$$-X'Y + X'X\hat{\beta} = 0 \;\Longleftrightarrow\; X'(Y - X\hat{\beta}) = 0 \;\Longleftrightarrow\; X'e = 0$$

These are the normal equations. The algebraic properties of OLS are those that can be derived from the normal equations.

34 Sum of errors: if the model has an intercept, one of the columns of $X$ is a vector of ones, so $X'e = 0$ implies:

$$\sum_{i=1}^n e_i = 0$$

Orthogonality: $X'e = 0$, implying that the OLS residuals are uncorrelated with all explanatory variables.

Linearity: $\hat{\beta}$ is a linear function of $Y$; that is, there exists a $K \times n$ matrix $A$ that depends solely on $X$, with $\rho(A) = K$, such that $\hat{\beta} = AY$. Proof: trivial. Set $A = (X'X)^{-1}X'$.
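These algebraic properties are easy to verify numerically; a sketch (with its own simulated data, so that it runs on its own):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat                    # OLS residuals

print(np.allclose(X.T @ e, 0.0))        # orthogonality: X'e = 0
print(np.isclose(e.sum(), 0.0))         # with an intercept, residuals sum to zero
```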

35 Goodness of Fit

First check some easy results, when there is an intercept in the model:

$\bar{Y} = \bar{\hat{Y}}$. Start with $Y_i = \hat{Y}_i + e_i$. Take averages on both sides; $\bar{e} = 0$ by the previous property.

$\sum_{i=1}^n (Y_i - \bar{Y})^2 = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^n e_i^2$. Start with $(Y_i - \bar{Y}) = (\hat{Y}_i - \bar{Y}) + e_i$. Take squares. Then show that $\sum_i e_i (\hat{Y}_i - \bar{Y}) = e'X\hat{\beta} - \bar{Y} \sum_i e_i = 0$ by the previous properties.

36

$$\sum_i (Y_i - \bar{Y})^2 = \sum_i (\hat{Y}_i - \bar{Y})^2 + \sum_i e_i^2$$

The total variation in $Y$ around its mean can be decomposed into two additive terms: one corresponding to the model and the other to the estimation errors. If all errors are zero, then all the variation is due to the model: the fitted linear model explains all the variation. This suggests the following measure of goodness of fit:

$$R^2 \equiv \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2} = 1 - \frac{\sum_{i=1}^n e_i^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}$$

This is the (centered) coefficient of determination: the proportion of the total variability explained by the fitted linear model.
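A sketch of the centered $R^2$, computed both ways shown above and asserting the variance decomposition; `Y_hat` is assumed to come from an OLS fit that includes an intercept:

```python
import numpy as np

def r_squared(Y, Y_hat):
    """Centered R^2, assuming the fitted model includes an intercept."""
    e = Y - Y_hat
    tss = np.sum((Y - Y.mean()) ** 2)        # total variation around the mean
    ess = np.sum((Y_hat - Y.mean()) ** 2)    # variation explained by the model
    rss = np.sum(e ** 2)                     # residual variation
    assert np.isclose(tss, ess + rss)        # the decomposition from the slide
    return 1.0 - rss / tss
```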

37 Comments and properties (as homework)

$0 \leq R^2 \leq 1$.
$\hat{\beta}$ maximizes $R^2$.
$R^2$ is non-decreasing in the number of explanatory variables, $K$.
Use and abuse of $R^2$.

38 In some cases we will use the uncentered $R^2$:

$$R_u^2 = \frac{\sum_i \hat{Y}_i^2}{\sum_i Y_i^2} = 1 - \frac{\sum_i e_i^2}{\sum_i Y_i^2}$$

The last equality holds since:

$$Y'Y = (\hat{Y} + e)'(\hat{Y} + e) = \hat{Y}'\hat{Y} + e'e + 2\hat{Y}'e = \hat{Y}'\hat{Y} + e'e + 2\hat{\beta}'X'e = \hat{Y}'\hat{Y} + e'e$$

by the orthogonality property.
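A matching sketch for the uncentered version, under the same assumptions on the inputs:

```python
import numpy as np

def r_squared_uncentered(Y, Y_hat):
    """Uncentered R^2: raw sums of squares instead of deviations from the mean."""
    e = Y - Y_hat
    return 1.0 - np.sum(e ** 2) / np.sum(Y ** 2)
```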

39 Estimation of $\sigma^2$

We will need an estimator for $\sigma^2$. We propose:

$$S^2 \equiv \frac{\sum_{i=1}^n e_i^2}{n - K} = \frac{e'e}{n - K}$$

Later on we will establish its properties in more detail.
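A one-line implementation sketch of the proposed estimator, assuming `e` holds the OLS residuals from a model with `K` regressors:

```python
import numpy as np

def s_squared(e, K):
    """Degrees-of-freedom corrected estimator S^2 = e'e / (n - K) for sigma^2."""
    n = e.shape[0]
    return (e @ e) / (n - K)
```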
