CITY AND REGIONAL PLANNING 870.3

Setting Up Models

Philip A. Viton

January 5, 2010

Contents

1 Introduction
2 What is conditional expectation?
3 The Modelling Setup
4 Understanding the Assumptions
5 Estimation
6 Proofs

Warning: This is likely to contain errors and inconsistencies. Please bring any you find to my attention.

1 Introduction

Here are some questions planners might want to investigate:

Theory tells us that a cost function depends on output and factor prices. How do total costs change as output increases? Or, more specifically, how do total costs change per unit increase in output, holding factor prices constant?
If the housing capitalization hypothesis is correct, and assuming that kids attend local schools, then housing prices should increase with the quality of local schools. What is that impact? That is, if school quality goes up by one unit, what will happen to housing prices, holding other determinants of house prices constant? Or, in the jargon that is often used, what is the marginal effect (on housing prices) of a change in school quality?

Demand theory suggests that increases in the price of gasoline should make it more likely that a commuter will choose to travel by public transit. How much more likely? That is, what is the impact on the choice probability of a unit increase in the price of gasoline, holding other characteristics of autos and transit constant? What is the cross-price elasticity of the demand for public transit?

What is common to all these questions is a focus on the impact of one variable on another, holding everything else constant. How are we to formalize the notion of holding everything else constant? For many researchers, that is captured by the conditional expectation function.

2 What is conditional expectation?

Consider two discrete random variables: x, whose possible values are (1, 2, 5, 7), and y, whose possible values are (-1, 0, 3). Assume that their joint probability density function p(y, x) is as follows:

                  x
             1      2      5      7
     -1    0.074  0.159  0.037  0.099
 y    0    0.067  0.071  0.019  0.008
      3    0.109  0.101  0.122  0.135

For example, we see that the probability of observing a value y = 0 and x = 5 is p(0, 5) = p(y = 0, x = 5) = 0.019. Note that the sum over all (x, y) pairs is Sum_{x,y} p(y, x) = 1.

The marginal distributions are obtained by summation: p(x) = Sum_y p(y, x) and p(y) = Sum_x p(y, x). Thus, for example, we have p(x = 1) = 0.074 + 0.067 + 0.109 = 0.25: this is the column sum of the p(y, x) array corresponding to the value x = 1. Similarly, p(x = 2) = 0.331, p(x = 5) = 0.178 and p(x = 7) = 0.242.
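The arithmetic in this example is easy to verify numerically. The sketch below (Python with numpy; the variable names are ours) computes the marginal distribution p(x) as column sums and, anticipating the next step in the text, the conditional densities p(y | x) and the conditional expectations E(y | x). Because the table entries are rounded to three decimals, the results agree with the figures in the text only up to small rounding differences.

```python
import numpy as np

# Joint density p(y, x): rows are y in (-1, 0, 3), columns x in (1, 2, 5, 7).
y_vals = np.array([-1.0, 0.0, 3.0])
p = np.array([
    [0.074, 0.159, 0.037, 0.099],   # y = -1
    [0.067, 0.071, 0.019, 0.008],   # y =  0
    [0.109, 0.101, 0.122, 0.135],   # y =  3
])
assert abs(p.sum() - 1.0) < 0.005    # entries are rounded to 3 decimals

# Marginals: p(x) is the column sums, p(y) the row sums.
p_x = p.sum(axis=0)                  # [0.25, 0.331, 0.178, 0.242]
p_y = p.sum(axis=1)

# Conditional density of y given x: each column divided by its column sum.
p_y_given_x = p / p_x
assert np.allclose(p_y_given_x.sum(axis=0), 1.0)   # columns sum to 1

# Conditional expectations E(y | x): y values weighted by p(y | x).
cef = y_vals @ p_y_given_x           # close to the text's (1.013, 0.437, 1.847, 1.258)
```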
For each possible value of x we have a distribution (density) of the y values: this is the conditional density of y given x. By definition (assuming that p(x) > 0) this is

p(y | x) = p(y, x) / p(x)

and for our example this turns out to be

                  x
             1      2      5      7
     -1    0.295  0.481  0.207  0.411
 y    0    0.269  0.214  0.108  0.033
      3    0.436  0.306  0.685  0.556

and we have, for example, p(y = 0 | x = 5) = 0.108. Note that Sum_y p(y | x) = 1 (the column sums of the p(y | x) array are all 1).

For each possible value of x we can compute the expected value of y given that x takes the value in question. This is the conditional expectation of y, given that x takes the value in question, and it is obtained in the usual way: we sum the possible values of y, each weighted by its conditional density value. For example, for x = 1, the conditional expectation of y is

E(y | x = 1) = (-1)(p(y = -1 | x = 1)) + (0)(p(y = 0 | x = 1)) + (3)(p(y = 3 | x = 1)) = 1.013

or, more generally,

E(y | x = x0) = Sum_y y p(y | x = x0)

For our example, the four possible values of E(y | x = x0) are given below the table:

                  x
             1      2      5      7
     -1    0.295  0.481  0.207  0.411
 y    0    0.269  0.214  0.108  0.033
      3    0.436  0.306  0.685  0.556

E(y | x = 1) = 1.013
E(y | x = 2) = 0.437
E(y | x = 5) = 1.847
E(y | x = 7) = 1.258

We now generalize this to consider the values of the conditional expectations as a group: the result is the conditional expectation function, E(y | x), telling us the
expected (= average) value of y given any value of x. Note that this is a function of x only, since its value depends on which x value we are talking about, and we average over the y values: E(y | x) = h(x).[1] In this respect, the notation E(y | x) can be a bit confusing: some people prefer to write E_y(y | x) to remind us of what is being summed (integrated) over.

It should now be apparent that the conditional expectation function formalizes our intuitive notion of the way the average value of y varies, given a particular value of x.

The conditional expectation function (CEF) has some useful properties, summarized in the following theorem (proofs in the final section).

Theorem 1 (Properties of the CEF). If z = h(x, y) then:

a. E(z) = E(E(z | x))
b. E(y) = E(E(y | x))
c. E(xy) = E(x E(y | x))
d. Cov(x, E(y | x)) = Cov(x, y)

The first two results are often referred to as the Law of Iterated Expectations. (Clearly, (b) follows from (a) by taking z = y.)

So far, so good. But what is the connection to statistical modelling?

3 The Modelling Setup

We are given a dependent random variable y, and we assume that theory (or intuition) tells us that it depends on (is partly explained by) a k x 1 vector of independent random variables x. As we've argued, we want to be able to investigate the part of y that is explained by x, and this leads us to focus on the conditional expectation E(y | x). We can always decompose y into the part that is explained by x and the rest, and we therefore write

y = E(y | x) + u

[1] Of course, a parallel development leads to another conditional expectation function, E(x | y): there is nothing sacrosanct about the E(y | x) we have been studying.
where, by definition, u = y - E(y | x). When will this be an acceptable statistical model? The key fact about the residual u is stated in the following result.

Theorem 2 (Decomposition Theorem). If we write

y = E(y | x) + u

then E(u | x) = 0.

Corollary 3. For any function h(x),

E(u h(x)) = Cov(u, h(x)) = 0
E(ux) = Cov(u, x) = 0

The Theorem says that, conditional on x, u has expectation zero. The Corollary says that u is uncorrelated with any function of x. In other words, to apply this setup, the error u must not involve either x or any function of x. If we take h(x) = 1 (a constant) then we see that this also implies E(u) = 0.

But now a question arises: why focus on this particular decomposition y = E(y | x) + u? One reason is that, as we've seen, the conditional expectation function is a way of getting at what we want to understand, namely the way that the average value of y varies with x. But there is an additional reason. Suppose we would like to predict or explain y by some function of x, say m(x). And suppose we agree to evaluate different m's according to a mean squared error criterion: the preferred m is the one that minimizes the mean (expected) squared error. Then we have the following result:

Theorem 4. The function m(x) that minimizes the mean squared error E(y - m(x))^2 is m(x) = E(y | x).
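Theorem 4 can be illustrated by simulation. In the sketch below (Python with numpy; the data-generating process y = x^2 + noise is our own choice, so that E(y | x) = x^2 by construction), the CEF attains a smaller sample mean squared error than competing predictors:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(-2.0, 2.0, n)
y = x**2 + rng.normal(0.0, 1.0, n)    # by construction, E(y | x) = x**2

def mse(pred):
    """Sample mean squared error of a predictor of y."""
    return np.mean((y - pred) ** 2)

# The CEF beats both a shifted version of itself and a constant predictor:
assert mse(x**2) < mse(x**2 + 0.3)
assert mse(x**2) < mse(np.full(n, 1.0))
```

With this design mse(x**2) is close to the noise variance (here 1), which is the irreducible part of the error that no predictor based on x can remove.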
In other words, the conditional expectation function E(y | x) is the best (in the sense of minimum mean squared error, MMSE) predictor/explainer of y in terms of x.

Of course, we don't know what the CEF is. So let's lower our sights a bit, and consider predicting y linearly, that is, by a function of the form x'b. If we retain the MMSE criterion, we have the following result:

Theorem 5. Suppose that [E(xx')]^{-1} exists,[2] and consider a linear-in-parameters approximation x'b to y. The optimal b (in the MMSE sense) is

b* = [E(xx')]^{-1} E(xy)

Corollary 6. If we write

v = y - x'b*

then

E(v) = 0
Cov(x, v) = 0

The quantity [E(xx')]^{-1} E(xy) is called the population regression coefficient vector of y on x, and x'b* is the population regression of y on x. So the theorem says that the population regression of y on x is the best linear approximation to y, also called the Best Linear Predictor (BLP) of y. The Corollary gives two properties of the BLP residual: it has mean zero and is uncorrelated with (all) the x's.

Note that b* is the coefficient vector in the population regression. Despite what you might think at first glance, it is not the least-squares coefficient vector that you may have seen before: note particularly the expectation operators involved.

Finally, we can tie all this together in terms of the object we're interested in, namely the CEF.

Theorem 7. Suppose that [E(xx')]^{-1} exists, and consider a linear-in-parameters approximation x'b to E(y | x). The optimal b (in the MMSE sense) is

b* = [E(xx')]^{-1} E(xy)

[2] Remember that x is a random variable, so strictly speaking we should require that [E(xx')]^{-1} exists with probability 1 (or perhaps, with probability approaching 1 in large samples).
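Theorem 5 and Corollary 6 can be checked numerically. The sketch below (Python with numpy; the simulated design is our own) approximates the population moments E(xx') and E(xy) by averages over a large sample. The CEF here is deliberately nonlinear, so the BLP only approximates y, yet its residual still has mean zero and is uncorrelated with x:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
x1 = rng.normal(0.0, 1.0, n)
y = x1**2 + rng.normal(0.0, 0.5, n)    # the CEF x1**2 is NOT linear in x1

X = np.column_stack([np.ones(n), x1])  # x-vector (1, x1), stacked as rows of X

# b* = [E(xx')]^{-1} E(xy), with expectations replaced by sample averages:
beta = np.linalg.solve(X.T @ X / n, X.T @ y / n)

v = y - X @ beta                       # BLP residual
assert abs(v.mean()) < 1e-8            # E(v) = 0 (an intercept is included)
assert abs(np.mean(x1 * v)) < 1e-8     # Cov(x1, v) = 0
```

Here beta comes out near (1, 0): with x1 standard normal, the best linear predictor of x1^2 plus noise is just the constant E(x1^2) = 1.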
In other words, with b* standing for the population regression coefficient vector, the quantity x'b* is the optimal linear approximation to the CEF (as well as the optimal linear approximation to y).

Note that it may be possible to get a better approximation if we allow non-linear functions of x. Still, it is conventional to restrict ourselves (at least initially) to linear approximations. One reason for this is that the linearity involved here is linearity-in-parameters: the x-vector could well contain squares, exponentials (etc.) of the individual terms, but note that it cannot contain an exact linear combination of the x's, or else [E(xx')]^{-1} will not exist.

4 Understanding the Assumptions

At this point we can try to clarify what is going on when we talk about linearity. Remember that we are interested in setting up a model of the conditional expectation function.

First, consider

y = x'b + u

As it stands, this is vacuous: we can always write u = y - x'b. In other words, just writing down y = x'b + u is a completely pointless exercise.

Next, consider

y = x'b + u,  E(u) = 0

This asserts only that E(y) = E(x)'b, so it is not an assertion about the conditional expectation function at all.

Third, consider

y = x'b + u,  E(u) = 0,  Cov(x, u) = 0

The assertions E(u) = 0 and Cov(x, u) = 0 are characteristics of the best linear predictor (BLP) of y, as we saw in Corollary 6, so this statement is just the assertion that x'b is the best linear predictor of y. Again, it is not an assertion about the object we really want to study, namely the conditional expectation function, though it may be useful if you are concerned with predicting y.

Finally, consider

y = x'b + u,  E(u | x) = 0
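The linearity-in-parameters point can be made concrete. In the sketch below (Python with numpy; the same kind of simulated design as before, our own choice), adding the square of a variable to the x-vector lets the "linear" projection recover a nonlinear CEF exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
x1 = rng.normal(0.0, 1.0, n)
y = x1**2 + rng.normal(0.0, 0.5, n)    # the CEF is x1**2

# Linear-in-parameters: the x-vector may contain transformed terms.
X = np.column_stack([np.ones(n), x1, x1**2])
beta = np.linalg.solve(X.T @ X / n, X.T @ y / n)

# With x1**2 included among the regressors, the projection recovers
# the CEF: beta is approximately (0, 0, 1).
assert abs(beta[2] - 1.0) < 0.05
```

By contrast, adding a regressor that is an exact linear combination of the others (say x1 + 1, with the intercept and x1 already present) would make X'X singular, the sample analogue of [E(xx')]^{-1} failing to exist.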
The assertion that E(u | x) = 0 is a characteristic of the conditional expectation function, so here we are making a relevant assumption, namely that the conditional expectation function is linear. From Corollary 3 we see that for this model to work we must have the residual u = y - E(y | x) uncorrelated with all the variables making up x. It is a vital part of any empirical research that wants to estimate E(y | x) to argue that this non-correlation condition really does hold for the problem of interest. Often it does not, and in that case we will need to consider carefully what to do.

5 Estimation

If we confine our attention to linear functions, then there is a strong case for being interested in the population regression coefficient vector b*. But to make progress we need to estimate it, given that we ordinarily have a sample of data. How can we do this?

The analogy principle suggests that we estimate a population parameter by the corresponding sample parameter. In particular, given that we are interested in (population) expectations (averages), we could try estimating using the (sample) averages (means). This is certainly plausible: if, for example, we had a random sample from a normal population with known variance and unknown mean m, then we are all used to estimating the unknown mean by the sample average.

Using the analogy principle, it is easy to show that an estimate b of the population regression coefficient vector b* is

b = (X'X)^{-1} X'y

where X is the matrix of sample data, with rows x'. But of course, the analogy principle is just that: an idea leading to an estimator. What can we say about b that would lead us to consider it an interesting or good estimator?

6 Proofs

Here we sketch proofs of the main results in the text.
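Here is a sketch of the analogy-principle estimator (Python with numpy; the simulated data are illustrative only). Replacing E(xx') and E(xy) by the sample moments X'X/n and X'y/n, and noting that the 1/n factors cancel, gives the estimator b:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, 3.0]) + rng.normal(size=n)   # true coefficients (2, 3)

# Sample analogues of E(xx') and E(xy) are X'X/n and X'y/n; the 1/n
# factors cancel, leaving the familiar least-squares formula.
b = np.linalg.solve(X.T @ X, X.T @ y)               # b = (X'X)^{-1} X'y

# The same estimate via numpy's least-squares routine:
b_check, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(b, b_check)
```

In this simulation b lands close to the true coefficient vector (2, 3); whether, and in what sense, that happens in general is exactly the question posed at the end of this section.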
Theorem 1 (Properties of the CEF). If z = h(x, y) then:

a. E(z) = E_x(E(z | x))
b. E(y) = E(E(y | x))
c. E(xy) = E(x E(y | x))
d. Cov(x, E(y | x)) = Cov(x, y)

Proof: We prove these results for the continuous case only, under the assumption that all random variables have finite second moments. (For the discrete case, replace integrals by sums.)

Notation: the random variables x and y have joint density f_{xy}(u, v), conditional density f_{y|x}(v | u), and marginal densities f_x(u) = Int f_{xy}(u, v) dv and f_y(v) = Int f_{xy}(u, v) du.

For (a), let z = h(x, y). We have:

E(z) = Int Int h(u, v) f_{xy}(u, v) du dv
     = Int Int h(u, v) f_{y|x}(v | u) f_x(u) du dv
     = Int [ Int h(u, v) f_{y|x}(v | u) dv ] f_x(u) du
     = Int E(z | x = u) f_x(u) du
     = E_x(E(z | x))

For (b), use (a) with z = h(x, y) = y.

For (c),

E(x E(y | x)) = Int u [ Int v f_{y|x}(v | u) dv ] f_x(u) du
              = Int Int u v f_{xy}(u, v) du dv
              = E(xy)

For (d):

Cov(x, E(y | x)) = E(x E(y | x)) - E(x) E(E(y | x))
By part (c), E(x E(y | x)) = E(xy), and by part (b), E(E(y | x)) = E(y), so

Cov(x, E(y | x)) = E(x E(y | x)) - E(x) E(E(y | x))
                 = E(xy) - E(x) E(y)
                 = Cov(x, y)

Theorem 2 (Decomposition Theorem). If we write

y = E(y | x) + u

then E(u | x) = 0.

Proof:

E(u | x) = E(y - E(y | x) | x) = E(y | x) - E(y | x) = 0

where the second equality holds because E(y | x) is a function of x, and so equals its own conditional expectation given x.

Corollary 3. For any function h(x),

E(u h(x)) = Cov(u, h(x)) = 0
E(ux) = Cov(u, x) = 0

Proof: For the first statement,

E(h(x) u) = E(h(x) E(u | x))

using part (c) of the Properties of the CEF (with h(x) in place of x). But E(u | x) = 0 by the Decomposition Theorem, so we have

E(h(x) u) = E(h(x) E(u | x)) = E(h(x) * 0) = 0

Since E(u) = 0 (take h(x) = 1), it follows that Cov(u, h(x)) = E(u h(x)) - E(u) E(h(x)) = 0 as well. For the second statement, just take h(x) = x.
Theorem 4. The function m(x) that minimizes the mean squared error E(y - m(x))^2 is m(x) = E(y | x).

Proof: Add and subtract E(y | x) and write

(y - m(x))^2 = ((y - E(y | x)) + (E(y | x) - m(x)))^2
             = (y - E(y | x))^2 + (E(y | x) - m(x))^2
               + 2 (y - E(y | x)) (E(y | x) - m(x))

Take expectations. The first term doesn't involve m(x) and can be ignored. In the third term we have y - E(y | x) = u, so the term is of the form u h(x), which has expectation zero by the Decomposition Theorem and its Corollary. That leaves the second term, E(E(y | x) - m(x))^2, which is clearly minimized when m(x) = E(y | x).

Theorem 5. Suppose that [E(xx')]^{-1} exists, and consider a linear-in-parameters approximation x'b to y. The optimal b (in the MMSE sense) is

b* = [E(xx')]^{-1} E(xy)

Proof: We want to choose b to solve

min_b E(y - x'b)^2

The first-order condition (FOC) is

0 = E(x (y - x'b)) = E(xy) - E(xx') b

so, solving under the stated assumptions,

b* = [E(xx')]^{-1} E(xy)

Corollary 6. If we write

v = y - x'b*

then

E(v) = 0
Cov(x, v) = 0
Proof of the Corollary: The first-order condition can be written

E(xv) = 0

This holds for all elements of the vector x. If x contains a 1 (for an intercept) then we see immediately that E(v) = 0. To show that Cov(x, v) = 0, write

Cov(x, v) = E(xv) - E(x) E(v) = 0

since E(xv) = 0 by the FOC and E(v) = 0.

Theorem 7. Suppose that [E(xx')]^{-1} exists, and consider a linear-in-parameters approximation x'b to E(y | x). The optimal b (in the MMSE sense) is

b* = [E(xx')]^{-1} E(xy)

Proof: We want to choose b to solve

min_b E(E(y | x) - x'b)^2

Consider again the problem

min_b E(y - x'b)^2

whose solution (b*) we have just found. Look at (y - x'b)^2 and write it as

(y - x'b)^2 = ((y - E(y | x)) + (E(y | x) - x'b))^2
            = (y - E(y | x))^2 + (E(y | x) - x'b)^2
              + 2 (y - E(y | x)) (E(y | x) - x'b)

Take expectations. Remembering that we are interested in b, the first term can be ignored, since it doesn't involve b. In the third term, y - E(y | x) = u, which is uncorrelated with any function of x by the Decomposition Theorem and its Corollary, so the term has expectation zero. Thus we have found that

E(y - x'b)^2 = E(E(y | x) - x'b)^2 + (a term not involving b)

so the two minimization problems have the same solution, namely b*.