RNy, econ460, 06 Lecture note.

1. Regression equations and other model equations

Hendry and Nielsen (HN) start with the joint probability density function (pdf) of the observable random variables, for example in Ch. 5 in HN about the two-variable regression model. Extending the joint distribution of the data from the earlier lecture, we can think of

f(x_0, x_1, ..., x_k)   (1)

as the joint pdf of the k + 1 random variables (X_0i, X_1i, ..., X_ki). When we choose to use a regression model equation, we single out one random variable as the regressand and the others as regressors. Let Y ≡ X_0 define the regressand. We can always write the joint pdf as the product of a conditional pdf and a marginal pdf:

f(y, x_1, ..., x_k) = f(y | x_1, ..., x_k) f(x_1, ..., x_k)   (2)

Note that f(x_1, ..., x_k) is a joint pdf in its own right, but it is marginal relative to the full pdf on the left-hand side. From f(y | x_1, ..., x_k) we can always construct the conditional expectation function:

E(Y | X_1, ..., X_k)   (3)

and define the disturbance ε as:

ε = Y − E(Y | X_1, ..., X_k).   (4)

For each realization of the X-variables, the conditional expectation E(Y | x_1, ..., x_k) is deterministic. But we can consider the expectation over all realizations of the X's, and by the Law of Iterated Expectations we get:

E(ε) = E(E(ε | X_1, ..., X_k)) = E(E[Y − E(Y | X_1, ..., X_k) | X_1, ..., X_k]) = 0   (5)

Added together, E(Y | X_1, ..., X_k) and ε give the regression model in equation form:

Y = E(Y | X_1, ..., X_k) + ε   (6)

However, we keep in mind that E(Y | X_1, ..., X_k) need not be linear in the parameters, and that the specification of the functional form is often the main modelling task. An important reference case, however, is when the joint pdf is a multivariate normal distribution. In that case the conditional expectation function is implied to be linear, as is assumed directly in the model specification on page 68 (Ch. 5) in HN: assumption ii), conditional normality. This assumption would have been implied if HN had assumed that the joint pdf on page 66 was a normal distribution.
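As a numerical aside (simulated data with a hypothetical nonlinear conditional mean, not an example from HN), the construction of the disturbance as the deviation from the conditional expectation can be checked directly: even when the conditional expectation function is nonlinear, the disturbance has expectation zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulate one regressor and a regressand whose conditional expectation
# E(Y | X = x) = exp(-x**2) is deliberately nonlinear.
x = rng.normal(size=n)
cond_mean = np.exp(-x**2)           # E(Y | X = x)
y = cond_mean + rng.normal(size=n)  # Y = E(Y | X) + ε

# The disturbance is the deviation from the conditional expectation.
eps = y - cond_mean

# By the law of iterated expectations, E(ε) = E(E(ε | X)) = 0,
# so the sample mean of ε should be close to zero.
assert abs(eps.mean()) < 0.02
```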
In that case, the marginal pdf is also normal (a well-known property of the multivariate normal distribution). Conversely, when conditional normality is used in the model specification, nothing is implied about normality of the joint pdf. It can be anything, because the marginal pdf can also be anything. In that sense, the assumption of conditional normality is weaker than the assumption of joint normality. The same remarks apply to the model specification for multiple regression in Ch. 7 of HN as well. For k = 3, the equation form of the model becomes

Y_i = β_1 + β_2 X_2i + β_3 X_3i + ε_i,   i = 1, 2, ..., n   (7)
as a consequence of the assumed conditional normality, but we could also derive the same equation if we had reason to assume that the joint pdf f(y, x_1, x_2, x_3) was normal. Either way, ε_i is a normally distributed variable with expectation zero and constant variance. Note that in (7) we have added the usual subscript i for observation, and we have defined X_1 as a constant with value 1.

To be relevant, the parameters of the regression E(Y | X_1, ..., X_k), or parameters derived from the regression, should include our parameters of interest (focus parameters). Examples of parameters of interest from a regression model:

- The average response in Y to a change in one X_j.
- The best prediction of Y given x_1, ..., x_k.
- Short-run, medium-run and long-run multipliers (in dynamic regression models).
- Short-term and long-run forecasts (in dynamic regression models).

An alternative to regression modelling is instead to put the joint pdf f(y, x_1, ..., x_k) on equation form. We call this system modelling, and it will be a central theme later in the course. The distinction between regression model equations and other model equations can be subtle. We often focus on a single equation in the joint pdf, for example the demand function in a two-equation model for price and quantity in a product market. The model equation (for demand Y) will then look like a linear regression model:

Y_i = γ_1 + γ_2 X_2i + γ_3 X_3i + ɛ_i,   i = 1, 2, ..., n,   (8)

but because E(ɛ_i | X_2i, X_3i) ≠ 0, it is not a regression equation but a structural equation which is part of a system-of-equations representation of the joint pdf.

2. Projection matrices and TSS and RSS, and all that

The matrix

M = I − X(X′X)^{−1}X′   (9)

is often called the residual maker. That nickname is easy to understand, since:

My = (I − X(X′X)^{−1}X′)y = y − X(X′X)^{−1}X′y = y − Xβ̂ = ε̂

M plays a central role in many derivations.
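The residual-maker identity is easy to verify numerically. A minimal sketch with simulated data (the design matrix and sample size are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50

# A hypothetical design matrix with an intercept and two regressors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

# The residual maker M = I − X(X′X)^{−1}X′
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

# My equals the OLS residuals y − Xβ̂:
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(M @ y, y - X @ beta_hat)

# M is symmetric, idempotent, and annihilates X:
assert np.allclose(M, M.T)
assert np.allclose(M @ M, M)
assert np.allclose(M @ X, 0)
```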
The following properties are worth noting (and showing for yourself):

M′ = M,   symmetric matrix   (10)
MM = M,   idempotent matrix   (11)
MX = 0,   regression of X on X gives perfect fit   (12)

Another way of interpreting MX = 0 is that, since M produces the OLS residuals, M is orthogonal to X. M does not affect the residuals:

Mε̂ = M(y − Xβ̂) = My − MXβ̂ = ε̂   (13)

Since y = Xβ + ε, we also have:

Mε = M(y − Xβ) = ε̂.   (14)

Note that this gives:

ε̂′ε̂ = ε′Mε   (15)
which shows that RSS is a quadratic form in the theoretical disturbances.

A second important matrix in regression analysis is:

P = X(X′X)^{−1}X′   (16)

which is called the prediction matrix, since

ŷ = Xβ̂ = X(X′X)^{−1}X′y = Py   (17)

P is also symmetric and idempotent. In linear algebra, M and P are both known as projection matrices. M and P are orthogonal:

MP = PM = 0   (18)

M and P are also complementary projections:

M = I − P,   M + P = I   (19)

which gives

(M + P)y = Iy = y = Py + My = ŷ + ε̂   (20)

and therefore the scalar product y′y becomes:

y′y = (y′P′ + y′M′)(Py + My) = ŷ′ŷ + ε̂′ε̂

Written out, this is:

Σ Y_i² = Σ Ŷ_i² + Σ ε̂_i²
  TSS  =  ESS  +  RSS    (21)

You may be more used to writing this famous decomposition as:

Σ (Y_i − Ȳ)² = Σ (Ŷ_i − Ŷ̄)² + Σ ε̂_i²
     TSS     =     ESS      +  RSS    (22)

where the Y_i's and Ŷ_i's have been de-meaned, or centered. As long as there is a constant in the model there is no contradiction between (21) and (22), since X′ε̂ = X′(y − Xβ̂) = 0 as a result of OLS, and, when there is an intercept in the regression, X = [ι  X_2]. The first row of X′ε̂ = 0 then gives:

ι′ε̂ = Σ ε̂_i = 0   (23)

and therefore:

Ȳ = Ŷ̄.   (24)

Using this, you can show that (21) can be written as (22).
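Both the uncentered and the centered TSS = ESS + RSS decompositions are easy to confirm numerically. A sketch with simulated data and an intercept in the model (all numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80

# Model with an intercept, so both decompositions hold.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = 1.0 + X[:, 1] - 0.5 * X[:, 2] + rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T  # prediction matrix
M = np.eye(n) - P                     # residual maker

y_hat, e_hat = P @ y, M @ y

# Uncentered decomposition: y′y = ŷ′ŷ + ε̂′ε̂
assert np.isclose(y @ y, y_hat @ y_hat + e_hat @ e_hat)

# With an intercept, residuals sum to zero and Ȳ equals the mean of Ŷ:
assert np.isclose(e_hat.sum(), 0)
assert np.isclose(y.mean(), y_hat.mean())

# Centered decomposition: TSS = ESS + RSS
TSS = ((y - y.mean())**2).sum()
ESS = ((y_hat - y_hat.mean())**2).sum()
RSS = (e_hat**2).sum()
assert np.isclose(TSS, ESS + RSS)
```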
Alternative ways of writing RSS that sometimes come in handy:

ε̂′ε̂ = (My)′(My) = y′My = y′ε̂ = ε̂′y   (25)

ε̂′ε̂ = y′y − y′P′Py = y′y − y′[X(X′X)^{−1}X′]′[X(X′X)^{−1}X′]y   (26)
     = y′y − y′X(X′X)^{−1}X′y = y′y − y′Xβ̂ = y′y − β̂′X′y.

R² can also be defined with reference to (22):

R² = 1 − ε̂′ε̂ / Σ_{i=1}^{n} (Y_i − Ȳ)² = 1 − ε̂′ε̂ / ((M_ι y)′(M_ι y)),   M_ι = I − (1/n)ιι′   (27)

where we used the so-called centering matrix M_ι. It has the property that when we pre-multiply an n-dimensional vector by M_ι, we get a vector where the mean is subtracted, which is the same as the residual from regressing a variable on a constant.

3. Partitioned regression

We now let

β̂ = ( β̂_1 )
    ( β̂_2 )

where β_1 has k_1 parameters and β_2 has k_2 parameters, k_1 + k_2 = k. The corresponding partitioning of X is X = [ X_1  X_2 ], where X_1 is n × k_1 and X_2 is n × k_2. The OLS estimated model with this notation:

y = X_1β̂_1 + X_2β̂_2 + ε̂,   ε̂ = My.   (28)

We begin by writing out the normal equations (X′X)β̂ = X′y in terms of the partitioning:

( X_1′X_1  X_1′X_2 ) ( β̂_1 )   ( X_1′y )
( X_2′X_1  X_2′X_2 ) ( β̂_2 ) = ( X_2′y )   (29)

where we have used

[ X_1  X_2 ]′[ X_1  X_2 ] = ( X_1′X_1  X_1′X_2 )
                            ( X_2′X_1  X_2′X_2 )

see the treatment of partitioned matrices in DM. The vector of OLS estimators can therefore be expressed as:

( β̂_1 )   ( X_1′X_1  X_1′X_2 )^{−1} ( X_1′y )
( β̂_2 ) = ( X_2′X_1  X_2′X_2 )      ( X_2′y ).   (30)

We need a result for the inverse of a partitioned matrix. There are several versions, but probably the most useful for us is:

( A_11  A_12 )^{−1}   (  A^{11}               −A^{11}A_12A_22^{−1} )
( A_21  A_22 )      = ( −A^{22}A_21A_11^{−1}   A^{22}              )   (31)

where

A^{11} = (A_11 − A_12A_22^{−1}A_21)^{−1}   (32)
A^{22} = (A_22 − A_21A_11^{−1}A_12)^{−1}.   (33)
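The partitioned-inverse formula above can be checked numerically before we apply it. The following sketch builds an arbitrary symmetric positive definite matrix, assembles the inverse block by block, and compares with a direct inversion:

```python
import numpy as np

rng = np.random.default_rng(4)

# An arbitrary symmetric positive definite 4x4 matrix, partitioned 2 + 2.
Z = rng.normal(size=(10, 4))
A = Z.T @ Z
A11, A12 = A[:2, :2], A[:2, 2:]
A21, A22 = A[2:, :2], A[2:, 2:]

# The two Schur-complement blocks A^{11} and A^{22}:
A11_blk = np.linalg.inv(A11 - A12 @ np.linalg.inv(A22) @ A21)
A22_blk = np.linalg.inv(A22 - A21 @ np.linalg.inv(A11) @ A12)

# Assemble the inverse from the blocks:
top = np.hstack([A11_blk, -A11_blk @ A12 @ np.linalg.inv(A22)])
bot = np.hstack([-A22_blk @ A21 @ np.linalg.inv(A11), A22_blk])
A_inv = np.vstack([top, bot])

assert np.allclose(A_inv, np.linalg.inv(A))
```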
With the aid of (31)-(33) it can be shown that

( β̂_1 )   ( (X_1′M_2X_1)^{−1}X_1′M_2y )
( β̂_2 ) = ( (X_2′M_1X_2)^{−1}X_2′M_1y )   (34)

where M_1 and M_2 are the two residual-makers:

M_1 = I − X_1(X_1′X_1)^{−1}X_1′   (35)
M_2 = I − X_2(X_2′X_2)^{−1}X_2′.   (36)

As an application, start with

X = [ X_1  X_2 ] = [ ι  X_2 ],   ι′ = (1, 1, ..., 1)

so that

      ( X_12 ... X_1k )
X_2 = ( X_22 ... X_2k )
      (  ...      ... )
      ( X_n2 ... X_nk )

and

M_1 = I − ι(ι′ι)^{−1}ι′ = I − (1/n)ιι′ = M_ι

the centering matrix. Show that it is:

      ( 1 − 1/n    −1/n   ...    −1/n  )
M_ι = (  −1/n    1 − 1/n  ...    −1/n  )
      (   ...                     ...  )
      (  −1/n      −1/n   ...  1 − 1/n )

By using the general formula for partitioned regression above, we get

β̂_2 = (X_2′M_ιX_2)^{−1}X_2′M_ιy = ([M_ιX_2]′[M_ιX_2])^{−1}[M_ιX_2]′y = ((X_2 − X̄_2)′(X_2 − X̄_2))^{−1}(X_2 − X̄_2)′y   (37)

where X̄_2 is the n × (k − 1) matrix with the variable means in the k − 1 columns. In the seminar set you are invited to show the same result for β̂_2 more directly, without the use of the results from partitioned regression, by re-writing the regression model.

4. Frisch-Waugh-Lovell (FWL) theorem

The expression for β̂_1 in (34) suggests that there is another simple method for finding β̂_1 that involves M_2: we premultiply the model (28), first by M_2 and then by X_1′:

M_2y = M_2X_1β̂_1 + ε̂   (38)
X_1′M_2y = X_1′M_2X_1β̂_1   (39)
where we have used M_2X_2 = 0 and M_2ε̂ = ε̂ in (38), and X_1′ε̂ = 0 in (39), since the influence of the k_2 variables in the second group has already been regressed out. We can find β̂_1 from this normal equation:

β̂_1 = (X_1′M_2X_1)^{−1}X_1′M_2y   (40)

It can be shown that the covariance matrix is

Cov(β̂_1) = σ̂²(X_1′M_2X_1)^{−1}

where σ̂² denotes the unbiased estimator of the common variance of the disturbances:

σ̂² = ε̂′ε̂ / (n − k).

The corresponding expressions for β̂_2 are:

β̂_2 = (X_2′M_1X_2)^{−1}X_2′M_1y   (41)

Cov(β̂_2) = σ̂²(X_2′M_1X_2)^{−1}.

Interpretation of (40): use symmetry and idempotency of M_2:

β̂_1 = ((M_2X_1)′(M_2X_1))^{−1}(M_2X_1)′M_2y

Here, M_2y is the residual vector ε̂_{y·X_2}, and M_2X_1 is the matrix with the k_1 residual vectors obtained from regressing each variable in X_1 on X_2, ε̂_{X_1·X_2}. But this means that β̂_1 can be obtained by first running these two sets of auxiliary regressions, saving the residuals in ε̂_{y·X_2} and ε̂_{X_1·X_2}, and then running a second regression with ε̂_{y·X_2} as regressand and ε̂_{X_1·X_2} as the matrix of regressors:

β̂_1 = ((M_2X_1)′(M_2X_1))^{−1}(M_2X_1)′(M_2y) = (ε̂_{X_1·X_2}′ε̂_{X_1·X_2})^{−1}ε̂_{X_1·X_2}′ε̂_{y·X_2}   (42)

We have, in fact, shown the Frisch-Waugh(-Lovell) theorem! If we want to estimate the partial effects on Y of changes in the variables in X_1 (i.e., controlling for the influence of X_2), we can do that in two ways: either by estimating the full multivariate model y = X_1β̂_1 + X_2β̂_2 + ε̂, or by following the above two-step procedure of obtaining residuals with X_2 as regressor and then obtaining β̂_1 from (42).

Other applications:

- Trend correction
- Seasonal adjustment by seasonal dummy variables
- Leverage (or influential observations)
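The equivalence of the one-step and two-step (FWL) estimates of the first coefficient block can be confirmed numerically. A sketch with simulated data (the split into variables of interest and controls is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200

# Hypothetical data: X1 holds the variables of interest,
# X2 the controls (including the intercept).
X1 = rng.normal(size=(n, 2))
X2 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X1 @ np.array([1.0, -2.0]) + X2 @ np.array([0.5, 1.0, 0.0]) + rng.normal(size=n)

# One step: full OLS on [X1 X2]; keep the X1 coefficient block.
X = np.hstack([X1, X2])
b_full = np.linalg.solve(X.T @ X, X.T @ y)[:2]

# Two steps (FWL): residualise y and X1 on X2, then regress
# the y-residuals on the X1-residuals.
M2 = np.eye(n) - X2 @ np.linalg.inv(X2.T @ X2) @ X2.T
e_y, e_X1 = M2 @ y, M2 @ X1
b_fwl = np.linalg.solve(e_X1.T @ e_X1, e_X1.T @ e_y)

# The FWL theorem says the two estimates coincide exactly.
assert np.allclose(b_full, b_fwl)
```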