CITY AND REGIONAL PLANNING 870.3
Setting Up Models

Philip A. Viton

January 5, 2010

Contents

1 Introduction
2 What is conditional expectation?
3 The Modelling Setup
4 Understanding the Assumptions
5 Estimation
6 Proofs

Warning: This is likely to contain errors and inconsistencies. Please bring any you find to my attention.

1 Introduction

Here are some questions planners might want to investigate:

- Theory tells us that a cost function depends on output and factor prices. How do total costs change as output increases? Or, more specifically, how do total costs change per unit increase in output, holding factor prices constant?

- If the housing capitalization hypothesis is correct, and assuming that kids attend local schools, then housing prices should increase with the quality of local schools. What is that impact? That is, if school quality goes up by one unit, what will happen to housing prices, holding the other determinants of house prices constant? Or, in the jargon that is often used, what is the marginal effect (on housing prices) of a change in school quality?

- Demand theory suggests that increases in the price of gasoline should make it more likely that a commuter will choose to travel by public transit. How much more likely? That is, what is the impact on the choice probability of a unit increase in the price of gasoline, holding the other characteristics of autos and transit constant? What is the cross-price elasticity of the demand for public transit?

What is common to all these questions is a focus on the impact of one variable on another, holding everything else constant. How are we to formalize the notion of "holding everything else constant"? For many researchers, it is captured by the conditional expectation function.

2 What is conditional expectation?

Consider two discrete random variables: $x$, whose possible values are $(1, 2, 5, 7)$, and $y$, whose possible values are $(-1, 0, 2)$, and assume that their joint probability density function $p(y, x)$ is as follows:

  p(y, x)      x = 1    x = 2    x = 5    x = 7
  y = -1       0.074    0.159    0.037    0.099
  y =  0       0.067    0.071    0.019    0.008
  y =  2       0.109    0.101    0.122    0.135

For example, we see that the probability of observing the values $y = 0$ and $x = 5$ is $p(0, 5) = p(y = 0, x = 5) = 0.019$. Note that $\sum_{x,y} p(y, x) = 1$ (up to rounding in the displayed entries).

The marginal distributions are obtained by summation: $p(x) = \sum_y p(y, x)$ and $p(y) = \sum_x p(y, x)$. Thus, for example, we have $p(x = 1) = 0.074 + 0.067 + 0.109 = 0.25$: this is the column sum of the $p(y, x)$ array corresponding to the value $x = 1$. Similarly, $p(x = 2) = 0.331$, $p(x = 5) = 0.178$ and $p(x = 7) = 0.242$.
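Since the marginals are just column and row sums, they are easy to check numerically. Here is a minimal sketch in Python/NumPy; the array p_yx is simply a transcription of the table above (so the printed totals inherit its rounding), and the variable names are my own.

```python
import numpy as np

# Joint density p(y, x) transcribed from the table above.
# Rows: y = -1, 0, 2.  Columns: x = 1, 2, 5, 7.
p_yx = np.array([
    [0.074, 0.159, 0.037, 0.099],   # y = -1
    [0.067, 0.071, 0.019, 0.008],   # y =  0
    [0.109, 0.101, 0.122, 0.135],   # y =  2
])
x_vals = np.array([1, 2, 5, 7])
y_vals = np.array([-1, 0, 2])

print(p_yx.sum())         # total probability: prints 1.001 because the transcribed entries are rounded
p_x = p_yx.sum(axis=0)    # marginal of x (column sums): [0.25, 0.331, 0.178, 0.242]
p_y = p_yx.sum(axis=1)    # marginal of y (row sums)
print(dict(zip(x_vals, p_x)))
print(dict(zip(y_vals, p_y)))
```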

For each possible value of $x$ we have a distribution (density) of the $y$ values: this is the conditional density of $y$ given $x$. By definition (assuming that $p(x) > 0$) this is

$$ p(y \mid x) = \frac{p(y, x)}{p(x)} $$

and for our example this turns out to be

  p(y | x)     x = 1    x = 2    x = 5    x = 7
  y = -1       0.295    0.481    0.207    0.411
  y =  0       0.269    0.214    0.108    0.033
  y =  2       0.436    0.306    0.685    0.556

so we have, for example, $p(y = 0 \mid x = 5) = 0.108$. Note that $\sum_y p(y \mid x) = 1$: the column sums of the $p(y \mid x)$ array are all 1.

For each possible value of $x$ we can compute the expected value of $y$ given that $x$ takes the value in question. This is the conditional expectation of $y$, given that $x$ takes that value, and it is obtained in the usual way: we sum the products of the possible values of $y$, each weighted by its conditional density value. For example, for $x = 1$, the conditional expectation of $y$ is

$$ E(y \mid x = 1) = (-1 \cdot p(y = -1 \mid x = 1)) + (0 \cdot p(y = 0 \mid x = 1)) + (2 \cdot p(y = 2 \mid x = 1)) = 0.577 $$

or, more generally,

$$ E(y \mid x = x_0) = \sum_y y \, p(y \mid x = x_0). $$

For our example, the four possible values of $E(y \mid x = x_0)$ are given in the last row of the table below:

               x = 1    x = 2    x = 5    x = 7
  y = -1       0.295    0.481    0.207    0.411
  y =  0       0.269    0.214    0.108    0.033
  y =  2       0.436    0.306    0.685    0.556

  E(y | x)     0.577    0.131    1.163    0.701
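The conditional densities and conditional expectations can be checked the same way: divide each column of the joint table by its column sum, then take probability-weighted averages of the $y$ values down each column. A minimal sketch, again working from the transcribed table, so the printed digits differ from the text only by rounding:

```python
import numpy as np

# Same joint table as before (rows: y = -1, 0, 2; columns: x = 1, 2, 5, 7).
p_yx = np.array([
    [0.074, 0.159, 0.037, 0.099],
    [0.067, 0.071, 0.019, 0.008],
    [0.109, 0.101, 0.122, 0.135],
])
y_vals = np.array([-1.0, 0.0, 2.0])
x_vals = np.array([1, 2, 5, 7])

p_x = p_yx.sum(axis=0)              # marginal of x (column sums)
p_y_given_x = p_yx / p_x            # conditional density: each column divided by its column sum
print(p_y_given_x.round(3))         # compare with the p(y | x) table in the text
print(p_y_given_x.sum(axis=0))      # every column sums to 1

# Conditional expectation E(y | x = x0) = sum_y y * p(y | x = x0), one value per column.
cef = y_vals @ p_y_given_x
for x0, e in zip(x_vals, cef):
    # close to the values in the text; small differences reflect rounding in the transcribed entries
    print(f"E(y | x = {x0}) = {e:.3f}")
```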

We now generalize this and consider the conditional expectations as a group: the result is the conditional expectation function, $E(y \mid x)$, telling us the expected (= average) value of $y$ given any value of $x$. Note that this is a function only of $x$, since its value depends on which $x$ value we are talking about, and we average over the $y$ values: $E(y \mid x) = h(x)$, say. (Of course, a parallel development leads to another conditional expectation function, $E(x \mid y)$: there is nothing sacrosanct about the $E(y \mid x)$ we have been studying.) In this respect, the notation $E(y \mid x)$ can be a bit confusing: some people prefer to write $E_y(y \mid x)$ to remind us of what is being summed (integrated) over.

It should now be apparent that the conditional expectation function formalizes our intuitive notion of the way the average value of $y$ varies, given a particular value of $x$. The conditional expectation function (CEF) has some useful properties, summarized in the following theorem (proofs are in the final section).

Theorem 1 (Properties of the CEF). Let $z = h(x, y)$. Then:

  a. $E(z) = E(E(z \mid x))$
  b. $E(y) = E(E(y \mid x))$
  c. $E(xy) = E(x \, E(y \mid x))$
  d. $\mathrm{Cov}(x, E(y \mid x)) = \mathrm{Cov}(x, y)$

The first two results are often referred to as the Law of Iterated Expectations. (Clearly, (b) follows from (a) by taking $z = y$.)
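Property (b), the Law of Iterated Expectations, is easy to verify on the discrete example above: averaging the conditional expectations $E(y \mid x)$ over the marginal distribution of $x$ reproduces $E(y)$. A small check along those lines, reusing the transcribed table:

```python
import numpy as np

p_yx = np.array([
    [0.074, 0.159, 0.037, 0.099],
    [0.067, 0.071, 0.019, 0.008],
    [0.109, 0.101, 0.122, 0.135],
])
y_vals = np.array([-1.0, 0.0, 2.0])

p_x = p_yx.sum(axis=0)
p_y = p_yx.sum(axis=1)
cef = y_vals @ (p_yx / p_x)    # E(y | x) for each value of x

lhs = y_vals @ p_y             # E(y) computed from the marginal of y
rhs = cef @ p_x                # E( E(y | x) ): the CEF averaged over p(x)
print(lhs, rhs)                # the two agree, as the Law of Iterated Expectations says
```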

So far, so good. But what is the connection to statistical modelling?

3 The Modelling Setup

We are given a dependent random variable $y$, and we assume that theory (or intuition) tells us that it depends on (is partly explained by) a $k \times 1$ vector of independent random variables $x$. As we have argued, we want to be able to investigate the part of $y$ that is explained by $x$, and this leads us to focus on the conditional expectation $E(y \mid x)$. We can always decompose $y$ into the part that is explained by $x$ and the rest, and we therefore write

$$ y = E(y \mid x) + u $$

where, by definition, $u = y - E(y \mid x)$. When will this be an acceptable statistical model? It turns out that we need to make an assumption about $u$ for the decomposition to go through; this is stated in the following result.

Theorem 2 (Decomposition Theorem). If we write $y = E(y \mid x) + u$, then $E(u \mid x) = 0$.

Corollary 3. For any function $h(x)$:

$$ E(u \, h(x)) = \mathrm{Cov}(u, h(x)) = 0 $$
$$ E(ux) = \mathrm{Cov}(u, x) = 0 $$

The Theorem says that, conditional on $x$, $u$ has expectation zero. The Corollary says that $u$ is uncorrelated with any function of $x$. In other words, to apply this setup, the error $u$ must not involve either $x$ or any function of $x$. If we take $h(x) \equiv 1$ then we see that this also implies $E(u) = 0$.

But now a question arises: why focus on this particular decomposition $y = E(y \mid x) + u$? One reason is that, as we have seen, the conditional expectation function is a way of getting at what we want to understand, namely the way that the average value of $y$ varies with a given $x$. But there is an additional reason. Suppose we would like to predict or explain $y$ by some function of $x$, say $m(x)$. And suppose we agree to evaluate different $m$'s according to a mean squared error criterion: the preferred $m$ is the one that minimizes the mean (expected) squared error. Then we have the following result:

Theorem 4. The function $m(x)$ that minimizes the mean squared error $E(y - m(x))^2$ is $m(x) = E(y \mid x)$.

In other words, the conditional expectation function $E(y \mid x)$ is the best (in the sense of minimum mean squared error, MMSE) predictor/explainer of $y$ in terms of $x$.
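Theorem 4 is easy to illustrate by simulation. The data-generating process below is entirely hypothetical (a quadratic CEF with standard normal noise, chosen only for illustration): the true CEF achieves a mean squared error close to the noise variance, while other candidate predictors $m(x)$ do worse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: the true CEF is E(y | x) = 1 + 2*x + x**2.
n = 200_000
x = rng.normal(size=n)
u = rng.normal(size=n)              # noise with E(u | x) = 0 and variance 1
y = 1 + 2 * x + x**2 + u

def mse(m_of_x):
    """Sample mean squared error of the predictor m(x)."""
    return np.mean((y - m_of_x) ** 2)

print(mse(1 + 2 * x + x**2))        # the CEF itself: MSE close to Var(u) = 1
print(mse(2 + 2 * x))               # a linear candidate m(x): strictly larger MSE
print(mse(np.full(n, y.mean())))    # the constant predictor: larger still
```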

Of course, we do not know what the CEF is. So let us lower our sights a bit, and consider predicting $y$ linearly, that is, by a function of the form $x'\beta$. If we retain the MMSE criterion, we have the following result:

Theorem 5. Suppose that $[E(xx')]^{-1}$ exists, and consider a linear-in-parameters approximation $x'\beta$ to $y$. The optimal $\beta$ (in the MMSE sense) is

$$ \hat{\beta} = [E(xx')]^{-1} E(xy). $$

(Remember that $x$ is a random variable, so strictly speaking we should require that $[E(xx')]^{-1}$ exists with probability 1, or perhaps with probability approaching 1 in large samples.)

Corollary 6. If we write $v = y - x'\hat{\beta}$, then

$$ E(v) = 0 \qquad \mathrm{Cov}(x, v) = 0. $$

The quantity $[E(xx')]^{-1} E(xy)$ is called the population regression coefficient vector of $y$ on $x$, and $x'\hat{\beta}$ is the population regression of $y$ on $x$. So the theorem says that the population regression of $y$ on $x$ is the best linear approximation to $y$, also called the Best Linear Predictor (BLP) of $y$. The Corollary gives two properties of the BLP residual: it has mean zero and is uncorrelated with (all) the $x$'s.

Note that $\hat{\beta}$ is the coefficient vector in the population regression. Despite what you might think at first glance, it is not the least-squares coefficient vector that you may have seen before: note particularly the expectation operators involved.

Finally, we can tie all this together in terms of the object we are interested in, namely the CEF.

Theorem 7. Suppose that $[E(xx')]^{-1}$ exists, and consider a linear-in-parameters approximation $x'\beta$ to $E(y \mid x)$. The optimal $\beta$ (in the MMSE sense) is again

$$ \hat{\beta} = [E(xx')]^{-1} E(xy). $$

In other words, with $\hat{\beta}$ standing for the population regression coefficient vector, the quantity $x'\hat{\beta}$ is the optimal linear approximation to the CEF (as well as the optimal linear approximation to $y$). Note that it may be possible to get a better approximation if we allow non-linear functions of $x$. Still, it is conventional to restrict ourselves (at least initially) to linear approximations. One reason for this is that the linearity involved here is linearity-in-parameters: the $x$-vector could well contain squares, exponentials (etc.) of the individual terms, though it cannot contain an exact linear combination of the $x$'s, or else $[E(xx')]^{-1}$ will not exist.
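To see that the population regression coefficient vector is a different object from the familiar least-squares estimate, the sketch below computes $\hat{\beta} = [E(xx')]^{-1}E(xy)$ exactly from the discrete table of Section 2, taking the regressor vector to be $(1, x)$ so that the first element is an intercept, and then compares it with least squares on a finite sample drawn from the same distribution. The sample size and random seed are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Joint table from Section 2 (transcribed; entries carry rounding).
p_yx = np.array([
    [0.074, 0.159, 0.037, 0.099],   # y = -1
    [0.067, 0.071, 0.019, 0.008],   # y =  0
    [0.109, 0.101, 0.122, 0.135],   # y =  2
])
y_vals = np.array([-1.0, 0.0, 2.0])
x_vals = np.array([1.0, 2.0, 5.0, 7.0])

# Population moments E(xx') and E(xy) for the regressor vector x = (1, x).
Exx = np.zeros((2, 2))
Exy = np.zeros(2)
for i, yv in enumerate(y_vals):
    for j, xv in enumerate(x_vals):
        xvec = np.array([1.0, xv])
        Exx += p_yx[i, j] * np.outer(xvec, xvec)
        Exy += p_yx[i, j] * xvec * yv

beta_pop = np.linalg.solve(Exx, Exy)    # population regression coefficient vector
print("population beta:", beta_pop)

# Least squares on a finite sample drawn from the same joint distribution:
# b estimates beta_pop but does not equal it exactly.
flat_p = p_yx.ravel() / p_yx.sum()      # renormalize away the rounding
idx = rng.choice(p_yx.size, size=5000, p=flat_p)
ys = y_vals[idx // 4]
xs = x_vals[idx % 4]
X = np.column_stack([np.ones_like(xs), xs])
b = np.linalg.lstsq(X, ys, rcond=None)[0]
print("sample least squares b:", b)
```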

4 Understanding the Assumptions

At this point we can try to clarify what is going on when we talk about linearity. Remember that we are interested in setting up a model of the conditional expectation function.

First, consider

$$ y = x'\beta + u. $$

As it stands, this is vacuous: we can always write $u = y - x'\beta$. In other words, just writing down $y = x'\beta + u$ is a completely pointless exercise.

Next, consider

$$ y = x'\beta + u, \qquad E(u) = 0. $$

This asserts only that $E(y) = E(x)'\beta$, so it is not an assertion about the conditional expectation function at all.

Third, consider

$$ y = x'\beta + u, \qquad E(u) = 0, \qquad \mathrm{Cov}(x, u) = 0. $$

The assertions $E(u) = 0$ and $\mathrm{Cov}(x, u) = 0$ are characteristics of the best linear predictor (BLP) of $y$, as we saw in Corollary 6, so this statement is just the assertion that $x'\beta$ is the best linear predictor of $y$. Again, it is not an assertion about the object we really want to study, namely the conditional expectation function, though it may be useful if you are concerned with predicting $y$.

Finally, consider

$$ y = x'\beta + u, \qquad E(u \mid x) = 0. $$

The assertion that $E(u \mid x) = 0$ is a characteristic of the conditional expectation function, so here we are making a relevant assumption, namely that the conditional expectation function is linear. From Corollary 3 we see that for this model to work we must have the residual $u = y - E(y \mid x)$ uncorrelated with all the variables making up $x$. It is a vital part of any empirical research that wants to estimate $E(y \mid x)$ to argue that this non-correlation condition really does hold for the problem of interest. Often it does not, and in that case we will need to consider carefully what to do. The sketch below illustrates how the third setup (the BLP conditions) can hold while the fourth (the linear-CEF assumption) fails.
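Here is one such illustration, built on assumptions of my own choosing (a quadratic CEF with a symmetrically distributed $x$): the best linear predictor's residual is uncorrelated with $x$, so the third set of assertions above holds, yet $E(u \mid x)$ is far from zero, so the linear-CEF assumption fails.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example where Cov(x, u) = 0 holds but E(u | x) = 0 fails:
# the true CEF is E(y | x) = x**2, which is not linear in x.
n = 500_000
x = rng.normal(size=n)
y = x**2 + rng.normal(size=n)

# Best linear predictor of y on (1, x): since Cov(x, x**2) = 0 for a symmetric x,
# the BLP is essentially the constant E(y), and its residual v is uncorrelated with x ...
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
v = y - X @ beta
print(np.cov(x, v)[0, 1])      # approximately 0: the BLP conditions hold

# ... yet E(v | x) is far from zero: for x near 2 it is about 2**2 - 1 = 3.
band = np.abs(x - 2.0) < 0.1
print(v[band].mean())          # roughly 3, so the linear-CEF assumption fails
```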

5 Estimation

If we confine our attention to linear functions, then there is a strong case for being interested in the population regression coefficient vector $\hat{\beta}$. But to make progress we need to estimate it, given that we ordinarily have only a sample of data. How can we do this?

The analogy principle suggests that we estimate a population parameter by the corresponding sample parameter. In particular, given that we are interested in (population) expectations (averages), we could try estimating them using the corresponding (sample) averages (means). This is certainly plausible: if, for example, we had a random sample from a normal population with known variance and unknown mean $\mu$, then we are all used to estimating the unknown mean by the sample average.

Using the analogy principle, it is easy to show that an estimate $b$ of the population regression coefficient vector $\hat{\beta}$ is

$$ b = (X'X)^{-1} X'y $$

where $X$ is the matrix of sample data, with rows $x'$, and $y$ is the vector of sample observations on the dependent variable. But of course, the analogy principle is just that: an idea leading to an estimator. What can we say about $b$ that would lead us to consider it an interesting or good estimator?
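As a quick illustration of the analogy-principle estimator $b = (X'X)^{-1}X'y$, the sketch below generates a purely hypothetical sample (the design, coefficients and noise are arbitrary choices) and computes $b$ directly, checking it against NumPy's built-in least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: n observations, k = 3 regressors (first column is an intercept).
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(size=n)])
beta_true = np.array([0.5, 2.0, -1.0])
y = X @ beta_true + rng.normal(size=n)

# Analogy-principle / least-squares estimate b = (X'X)^{-1} X'y.
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)                                   # close to beta_true in this well-behaved design

# Equivalent, numerically more stable computation:
b_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(b, b_lstsq))
```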

6 Proofs

Here we sketch proofs of the main results in the text.

Theorem 1 (Properties of the CEF). Let $z = h(x, y)$. Then:

  a. $E(z) = E_x(E(z \mid x))$
  b. $E(y) = E(E(y \mid x))$
  c. $E(xy) = E(x \, E(y \mid x))$
  d. $\mathrm{Cov}(x, E(y \mid x)) = \mathrm{Cov}(x, y)$

Proof: We prove these results for the continuous case only, under the assumption that all random variables have finite second moments. (For the discrete case, replace integrals by sums.)

Notation: the random variables $x$ and $y$ have joint density $f_{xy}(u, v)$, conditional density $f_{y|x}(v \mid u)$, and marginal densities $f_x(u) = \int f_{xy}(u, v)\,dv$ and $f_y(v) = \int f_{xy}(u, v)\,du$.

For (a), let $z = h(x, y)$. We have

$$ E(z) = \int\!\!\int h(u, v)\, f_{xy}(u, v)\,du\,dv = \int\!\!\int h(u, v)\, f_{y|x}(v \mid u)\, f_x(u)\,du\,dv $$
$$ = \int \left( \int h(u, v)\, f_{y|x}(v \mid u)\,dv \right) f_x(u)\,du = \int E(z \mid x = u)\, f_x(u)\,du = E_x(E(z \mid x)). $$

For (b), use (a) with $z = h(x, y) = y$.

For (c),

$$ E_x(x\, E(y \mid x)) = \int u \left( \int v\, f_{y|x}(v \mid u)\,dv \right) f_x(u)\,du = \int\!\!\int u v\, f_{xy}(u, v)\,du\,dv = E(xy). $$

For (d),

$$ \mathrm{Cov}(x, E(y \mid x)) = E(x\, E(y \mid x)) - E(x)\,E(E(y \mid x)). $$

By part (c), $E(x\, E(y \mid x)) = E(xy)$, and by part (b), $E(E(y \mid x)) = E(y)$, so

$$ \mathrm{Cov}(x, E(y \mid x)) = E(xy) - E(x)E(y) = \mathrm{Cov}(x, y). $$
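Property (d) can also be confirmed numerically on the discrete example of Section 2, by computing $\mathrm{Cov}(x, y)$ and $\mathrm{Cov}(x, E(y \mid x))$ from the transcribed (and here renormalized) table:

```python
import numpy as np

p_yx = np.array([
    [0.074, 0.159, 0.037, 0.099],
    [0.067, 0.071, 0.019, 0.008],
    [0.109, 0.101, 0.122, 0.135],
])
p_yx = p_yx / p_yx.sum()               # renormalize away the rounding in the table
y_vals = np.array([-1.0, 0.0, 2.0])
x_vals = np.array([1.0, 2.0, 5.0, 7.0])

p_x = p_yx.sum(axis=0)
p_y = p_yx.sum(axis=1)
cef = y_vals @ (p_yx / p_x)            # E(y | x) for each x value

Ex, Ey = x_vals @ p_x, y_vals @ p_y
cov_x_y = (y_vals[:, None] * x_vals[None, :] * p_yx).sum() - Ex * Ey
cov_x_cef = (cef * x_vals) @ p_x - Ex * (cef @ p_x)
print(cov_x_y, cov_x_cef)              # identical, as claimed in part (d)
```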

Theorem 2 (Decomposition Theorem). If we write $y = E(y \mid x) + u$, then $E(u \mid x) = 0$.

Proof: Since $E(y \mid x)$ is a function of $x$, we have $E(E(y \mid x) \mid x) = E(y \mid x)$, and so

$$ E(u \mid x) = E(y - E(y \mid x) \mid x) = E(y \mid x) - E(y \mid x) = 0. $$

(Taking expectations of both sides and using the Law of Iterated Expectations then also gives $E(u) = 0$.)

Corollary 3. For any function $h(x)$:

$$ E(u \, h(x)) = \mathrm{Cov}(u, h(x)) = 0 $$
$$ E(ux) = \mathrm{Cov}(u, x) = 0 $$

Proof: For the first statement,

$$ E(h(x)\, u) = E(h(x)\, E(u \mid x)) $$

using part (c) of the Properties of the CEF, with $h(x)$ playing the role of $x$ and $u$ the role of $y$. But $E(u \mid x) = 0$ by the Decomposition Theorem, so

$$ E(h(x)\, u) = E(h(x)\, E(u \mid x)) = E(h(x) \cdot 0) = 0. $$

Taking $h(x) \equiv 1$ gives $E(u) = 0$, so $\mathrm{Cov}(u, h(x)) = E(u\, h(x)) - E(u)E(h(x)) = 0$ as well. For the second statement, just take $h(x) = x$.
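The Corollary is likewise easy to confirm on the discrete example: with $u = y - E(y \mid x)$, the expectation $E(u\, h(x))$ is zero for any choice of $h$, tried below for $h(x) = 1$, $x$ and $x^2$.

```python
import numpy as np

# Check Corollary 3 on the discrete example: with u = y - E(y|x),
# E(u h(x)) = 0 for any h(x); here h(x) = 1, x and x**2.
p_yx = np.array([
    [0.074, 0.159, 0.037, 0.099],
    [0.067, 0.071, 0.019, 0.008],
    [0.109, 0.101, 0.122, 0.135],
])
p_yx = p_yx / p_yx.sum()
y_vals = np.array([-1.0, 0.0, 2.0])
x_vals = np.array([1.0, 2.0, 5.0, 7.0])

p_x = p_yx.sum(axis=0)
cef = y_vals @ (p_yx / p_x)                 # E(y | x) at each x value
u = y_vals[:, None] - cef[None, :]          # u = y - E(y|x), one entry per (y, x) cell

for h in (np.ones_like(x_vals), x_vals, x_vals**2):
    print((u * h[None, :] * p_yx).sum())    # E(u h(x)): zero up to floating point
```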

Theorem 4. The function $m(x)$ that minimizes the mean squared error $E(y - m(x))^2$ is $m(x) = E(y \mid x)$.

Proof: Add and subtract $E(y \mid x)$ and write

$$ (y - m(x))^2 = \big( (y - E(y \mid x)) + (E(y \mid x) - m(x)) \big)^2 $$
$$ = (y - E(y \mid x))^2 + (E(y \mid x) - m(x))^2 + 2\,(y - E(y \mid x))\,(E(y \mid x) - m(x)). $$

The first term does not involve $m(x)$ and can be ignored. In the third term we have $y - E(y \mid x) = u$, so this term is of the form $u\, h(x)$ with $h(x) = E(y \mid x) - m(x)$, and it has expectation zero by Corollary 3 to the Decomposition Theorem. That leaves the second term, $(E(y \mid x) - m(x))^2$, whose expectation is clearly minimized when $m(x) = E(y \mid x)$.

Theorem 5. Suppose that $[E(xx')]^{-1}$ exists, and consider a linear-in-parameters approximation $x'\beta$ to $y$. The optimal $\beta$ (in the MMSE sense) is $\hat{\beta} = [E(xx')]^{-1} E(xy)$.

Proof: We want to choose $\beta$ to solve

$$ \min_\beta \; E(y - x'\beta)^2. $$

The first-order condition (FOC) is

$$ 0 = E\big( x\,(y - x'\beta) \big) = E(xy) - E(xx')\beta $$

so, solving under the stated assumptions,

$$ \hat{\beta} = [E(xx')]^{-1} E(xy). $$
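As a sanity check on Theorem 5, the $\hat{\beta}$ computed from the discrete table (treated as the population, with regressor vector $(1, x)$) satisfies the first-order condition $E(x(y - x'\hat{\beta})) = 0$ exactly:

```python
import numpy as np

# Check the FOC E( x (y - x'beta_hat) ) = 0 for the population regression
# computed from the transcribed (renormalized) joint table of Section 2.
p_yx = np.array([
    [0.074, 0.159, 0.037, 0.099],
    [0.067, 0.071, 0.019, 0.008],
    [0.109, 0.101, 0.122, 0.135],
])
p_yx = p_yx / p_yx.sum()
y_vals = np.array([-1.0, 0.0, 2.0])
x_vals = np.array([1.0, 2.0, 5.0, 7.0])

Exx = np.zeros((2, 2))
Exy = np.zeros(2)
for i, yv in enumerate(y_vals):
    for j, xv in enumerate(x_vals):
        xvec = np.array([1.0, xv])
        Exx += p_yx[i, j] * np.outer(xvec, xvec)
        Exy += p_yx[i, j] * xvec * yv
beta_hat = np.linalg.solve(Exx, Exy)

# E( x (y - x'beta_hat) ) should be the zero vector (up to floating point).
foc = np.zeros(2)
for i, yv in enumerate(y_vals):
    for j, xv in enumerate(x_vals):
        xvec = np.array([1.0, xv])
        foc += p_yx[i, j] * xvec * (yv - xvec @ beta_hat)
print(foc)
```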

Corollary 6. If we write $v = y - x'\hat{\beta}$, then $E(v) = 0$ and $\mathrm{Cov}(x, v) = 0$.

Proof of the Corollary: The first-order condition can be written $E(xv) = 0$. This holds for every element of the vector $x$. If $x$ contains a 1 (for an intercept), then we see immediately that $E(v) = 0$. To show that $\mathrm{Cov}(x, v) = 0$, write

$$ \mathrm{Cov}(x, v) = E(xv) - E(x)E(v) = 0 $$

since $E(xv) = 0$ by the FOC and $E(v) = 0$ makes the last term zero.

Theorem 7. Suppose that $[E(xx')]^{-1}$ exists, and consider a linear-in-parameters approximation $x'\beta$ to $E(y \mid x)$. The optimal $\beta$ (in the MMSE sense) is $\hat{\beta} = [E(xx')]^{-1} E(xy)$.

Proof: We want to choose $\beta$ to solve

$$ \min_\beta \; E\big( E(y \mid x) - x'\beta \big)^2. $$

Consider again the problem

$$ \min_\beta \; E(y - x'\beta)^2 $$

whose solution $\hat{\beta}$ we have just found. Look at $(y - x'\beta)^2$ and write it as

$$ (y - x'\beta)^2 = \big( (y - E(y \mid x)) + (E(y \mid x) - x'\beta) \big)^2 $$
$$ = (y - E(y \mid x))^2 + (E(y \mid x) - x'\beta)^2 + 2\,(y - E(y \mid x))\,(E(y \mid x) - x'\beta). $$

Take expectations. Remembering that we are interested in $\beta$, the first term can be ignored, since it does not involve $\beta$. In the third term, $y - E(y \mid x) = u$ multiplies a function of $x$ (and of $\beta$, which is not random), so its expectation is zero by Corollary 3 to the Decomposition Theorem. Thus we have found that, up to a term that does not involve $\beta$,

$$ E(y - x'\beta)^2 = E\big( E(y \mid x) - x'\beta \big)^2 $$

so the two minimization problems have the same solution, namely $\hat{\beta}$.
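A final numerical check of Theorem 7, again treating the transcribed table as the population: a regression of the CEF values $E(y \mid x)$ on $(1, x)$, weighting each $x$ value by its marginal probability $p(x)$, yields exactly the same coefficients as the population regression of $y$ on $(1, x)$.

```python
import numpy as np

# The same (renormalized) joint table, treated as the population.
p_yx = np.array([
    [0.074, 0.159, 0.037, 0.099],
    [0.067, 0.071, 0.019, 0.008],
    [0.109, 0.101, 0.122, 0.135],
])
p_yx = p_yx / p_yx.sum()
y_vals = np.array([-1.0, 0.0, 2.0])
x_vals = np.array([1.0, 2.0, 5.0, 7.0])

p_x = p_yx.sum(axis=0)
cef = y_vals @ (p_yx / p_x)                  # E(y | x) at each x value

# Population regression of y on (1, x).
Exx = np.zeros((2, 2))
Exy = np.zeros(2)
for i, yv in enumerate(y_vals):
    for j, xv in enumerate(x_vals):
        xvec = np.array([1.0, xv])
        Exx += p_yx[i, j] * np.outer(xvec, xvec)
        Exy += p_yx[i, j] * xvec * yv
beta_y = np.linalg.solve(Exx, Exy)

# MMSE linear approximation to the CEF: regression of E(y|x) on (1, x)
# with weights p(x). Theorem 7 says this gives the same coefficients.
Exx_c = np.zeros((2, 2))
Excef = np.zeros(2)
for j, xv in enumerate(x_vals):
    xvec = np.array([1.0, xv])
    Exx_c += p_x[j] * np.outer(xvec, xvec)
    Excef += p_x[j] * xvec * cef[j]
beta_cef = np.linalg.solve(Exx_c, Excef)

print(beta_y, beta_cef)                      # identical, up to floating point
```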