1 Another method of estimation: least squares

erm: -estim.tex, Dec 8, 2009, 6 p.m. (draft - typos/writos likely exist). Corrections, comments, suggestions welcome.

1.1 Least squares in general

Assume $Y_i$ is some rv with finite mean and variance $\sigma_y^2$, where we might, or might not, know the form of $f_{Y_i}(y : \theta)$.

If one has a sample[1] of $n$ observations from this population, the least-squares estimator(s) of $\theta$ are those, $\theta_{ls}$, that minimize

$$\sum_{i=1}^{n} \left(y_i - E[y_i : x_i; \theta]\right)^2$$

where $x_i$ is a vector of observed explanatory variables, not random variables (fixed in repeated samples). Finding the least-squares estimate of $\theta$ requires that we specify the form of $E[y_i : x_i; \theta]$ but does not require that we specify $f_{Y_i}(y_i; x_i; \theta)$. Note that maximum likelihood estimation typically requires that we specify $f_{Y_i}(y_i; x_i; \theta)$, which implies $E[y_i : x_i; \theta]$.

[1] Note that I did not say random sample. While a random sample would be nice, least-squares estimation is well defined even if the sample is not random. That said, the least-squares estimators might lack desirable properties if the sample is not random.

For example, consider the following common additive specification for $y_i$:

$$y_i = g(x_i : \theta) + \varepsilon_i, \quad i = 1, 2, \ldots, n$$

where $\varepsilon$ is a rv with zero mean ($E[\varepsilon] = 0$) and finite variance, $\sigma_\varepsilon^2 = \sigma_y^2$.[2][3] We have a data set that consists of $n$ $\{y_i, x_i\}$ pairs. Since $E[y_i : x_i] = g(x_i : \theta)$, the least-squares estimator(s) of $\theta$ are those that minimize

$$\sum_{i=1}^{n} \left(y_i - g(x_i : \theta)\right)^2 \equiv SSR$$

where SSR denotes the sum of squared residuals. Some books call it RSS (for example, Gujarati, page 171).

[2] For a given $x_i$, all the randomness in $y_i$ is invoked by the randomness in $\varepsilon_i$.
[3] Note that here I am not being completely general. I am assuming the random component is additive, which is not required for l.s. estimation.
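As a rough illustration of the definition above, the following sketch evaluates SSR for an arbitrary conditional-mean function and minimizes it by a crude grid search. The specification `g_proportional` (that is, $E[y_i : x_i; \theta] = \theta x_i$), the data values and the grid are all assumptions made up for this example; they do not come from the notes.

```python
def ssr(theta, y, x, g):
    """Sum of squared residuals for a conditional-mean function g(x_i, theta)."""
    return sum((y_i - g(x_i, theta)) ** 2 for y_i, x_i in zip(y, x))


def g_proportional(x_i, theta):
    # Hypothetical specification, for illustration only:
    # E[y_i : x_i; theta] = theta * x_i
    return theta * x_i


# Made-up data: three (y_i, x_i) pairs.
x = [1.0, 2.0, 3.0]
y = [2.1, 3.9, 6.0]

# Crude grid search over candidate values of theta in [0, 5], step 0.01.
grid = [k / 100 for k in range(501)]
theta_ls = min(grid, key=lambda theta: ssr(theta, y, x, g_proportional))

# For this particular g, calculus gives the minimizer sum(x*y) / sum(x^2).
theta_exact = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
```

Only the form of the conditional mean had to be specified; no density for $y_i$ was needed. The grid search is used just to make the "minimize SSR" step visible, and would be replaced by calculus or a proper optimizer in practice.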
Things to note about least-squares estimators if one is willing to assume

$$y_i = g(x_i : \theta) + \varepsilon_i, \quad i = 1, 2, \ldots, n$$

where $\varepsilon_i$ is a rv with zero mean ($E[\varepsilon] = 0$) and finite variance, $\sigma_\varepsilon^2 = \sigma_y^2$:

One does not need to assume a specific distribution for $\varepsilon$ (normal or otherwise), but one needs to put the above few restrictions on $\varepsilon$.

$g(x_i : \theta)$ does not have to be linear in the $\theta$, but that is the specification that you are most accustomed to.

Some of the properties of the estimators of the $\theta$ will depend on what one assumes about the distribution of $\varepsilon$ (normal or otherwise) and/or whether one assumes the $Y_i$ in the sample are independent of one another.

1.1.1 An aside:

Note that while we are not accustomed to thinking this way, all that one needs to do least squares is to assume the rv of interest, $Y$, has a density function such that $E[Y]$ exists. For example, one could assume $Y$ has the Poisson distribution, $f_Y(y) = \frac{e^{-\lambda}\lambda^{y}}{y!}$ for $y = 0, 1, 2, 3, \ldots$, and use least squares to estimate $\lambda$, the expected value of $Y$ (and also its variance).[4] The least-squares estimator of $\lambda$, $\lambda_{ls}$, is that $\lambda$ that minimizes $\sum_{i=1}^{n}(Y_i - \lambda)^2$.[5]

[4] Note that here the random term is not additive. While assuming an additive term ($y_i = g(x_i : \theta) + \varepsilon_i$, $i = 1, 2, \ldots, n$) is typical in least squares, it is, as I noted above, not necessary.
[5] We know from earlier that the maximum likelihood estimator of $\lambda$, $\lambda_{ml}$, is the sample average. Is the least-squares estimator of $\lambda$ also the sample average?

A fun and instructive exercise would be to find the least-squares estimator(s) for a rv $Y$ assuming a few different forms for $f_Y(y : \theta)$. For example, could one proceed with least squares assuming $Y$ has a Bernoulli distribution? Try it and see what happens.

1.2 Revert to the standard assumption that $y_i = g(x_i : \theta) + \varepsilon_i$, but now be more restrictive: assume linearity and $x_i$ a scalar

$$g(x_i : \alpha, \beta) = \alpha + \beta x_i$$

In which case

$$y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$$
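where $E[\varepsilon] = 0$ and $\varepsilon$ has finite variance. This model is called the linear regression model (MGB 485, 486). It has three parameters: $\alpha$, $\beta$ and $\sigma_Y^2$. Contrast the linear regression model with the classical linear regression model, which adds the assumption $\varepsilon \sim N(0, \sigma_\varepsilon^2)$.

The least-squares estimates of $\alpha$ and $\beta$ are those estimates, $\alpha_{ls}$ and $\beta_{ls}$, that minimize

$$\sum_{i=1}^{n}\left(y_i - E(y : x_i)\right)^2 = \sum_{i=1}^{n}\left(y_i - (\alpha + \beta x_i)\right)^2$$

1.2.1 Let's find these estimates.

Minimize (let me know if you find any typos in the following derivations)

$$SSR = \sum_{i=1}^{n}\left(y_i - (\alpha + \beta x_i)\right)^2$$

wrt $\alpha$ and $\beta$. Since we have put no restrictions on the ranges of $\alpha$ and $\beta$, we are looking for an interior solution in terms of these two variables.

$$\frac{\partial SSR}{\partial \alpha} = \sum_{i=1}^{n} 2(y_i - \alpha - \beta x_i)(-1) = -2\sum_{i=1}^{n}(y_i - \alpha - \beta x_i) = -2\left[\sum_{i=1}^{n} y_i - n\alpha - \beta\sum_{i=1}^{n} x_i\right] = -2\left[n\bar{y} - n\alpha - \beta n\bar{x}\right]$$

Set this equal to zero, and solve for $\alpha$ to obtain

$$\alpha_{ls} = \bar{y} - \beta_{ls}\bar{x}$$

Now consider

$$\frac{\partial SSR}{\partial \beta} = \sum_{i=1}^{n} 2(y_i - \alpha - \beta x_i)(-x_i) = -2\left[\sum_{i=1}^{n} y_i x_i - \alpha\sum_{i=1}^{n} x_i - \beta\sum_{i=1}^{n} x_i^2\right]$$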
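Set this equal to zero and solve for $\beta$,[6] which implies

$$0 = \sum_{i} y_i x_i - \alpha\sum_{i} x_i - \beta\sum_{i} x_i^2$$

$$\beta = \frac{\sum_{i} y_i x_i - \alpha\sum_{i} x_i}{\sum_{i} x_i^2} = \frac{\sum_{i} y_i x_i - \alpha n\bar{x}}{\sum_{i} x_i^2}$$

Substitute in $\bar{y} - \beta\bar{x}$ for $\alpha$ to obtain

$$\beta = \frac{\sum_{i} y_i x_i - n\bar{x}(\bar{y} - \beta\bar{x})}{\sum_{i} x_i^2} = \frac{\sum_{i} y_i x_i - n\bar{x}\bar{y} + n\bar{x}^2\beta}{\sum_{i} x_i^2}$$

which implies

$$\beta - \frac{n\bar{x}^2}{\sum_{i} x_i^2}\,\beta = \frac{\sum_{i} y_i x_i - n\bar{x}\bar{y}}{\sum_{i} x_i^2}$$

Note the following rearrangement of the lhs,

$$\beta - \frac{n\bar{x}^2}{\sum_{i} x_i^2}\,\beta = \left[\frac{\sum_{i} x_i^2}{\sum_{i} x_i^2} - \frac{n\bar{x}^2}{\sum_{i} x_i^2}\right]\beta = \left[\frac{\sum_{i} x_i^2 - n\bar{x}^2}{\sum_{i} x_i^2}\right]\beta$$

[6] Note that
$$\frac{\partial SSR}{\partial \beta} = \sum_{i} 2(y_i - \alpha - \beta x_i)(-x_i) = 0 \;\Rightarrow\; \sum_{i}(y_i - \alpha - \beta x_i)(x_i) = 0 \;\Rightarrow\; \sum_{i}(y_i - \alpha_{ls} - \beta_{ls} x_i)(x_i) = 0 \;\Rightarrow\; \sum_{i}(y_i - \hat{y}_i)(x_i) = 0 \;\Rightarrow\; \sum_{i}(\hat{\varepsilon}_i)(x_i) = 0$$
One could check that one's least-squares estimates imply this. It is a good check on your math.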
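so, replacing $\beta - \frac{n\bar{x}^2}{\sum_i x_i^2}\beta$ with $\left[\frac{\sum_i x_i^2 - n\bar{x}^2}{\sum_i x_i^2}\right]\beta$, one obtains

$$\left[\frac{\sum_{i} x_i^2 - n\bar{x}^2}{\sum_{i} x_i^2}\right]\beta = \frac{\sum_{i} y_i x_i - n\bar{x}\bar{y}}{\sum_{i} x_i^2}$$

Multiplying through by $\sum_i x_i^2$ one gets

$$\left[\sum_{i} x_i^2 - n\bar{x}^2\right]\beta = \sum_{i} y_i x_i - n\bar{x}\bar{y}$$

which implies that

$$\beta_{ls} = \frac{\sum_{i} y_i x_i - n\bar{x}\bar{y}}{\sum_{i} x_i^2 - n\bar{x}^2} = \frac{\sum_{i} y_i x_i - n\bar{x}\bar{y}}{\sum_{i}(x_i - \bar{x})^2}$$

This is the least-squares estimate for $\beta$ assuming $g(x_i : \alpha, \beta) = \alpha + \beta x_i$. By substitution, the least-squares estimate for $\alpha$, $\alpha_{ls}$, is

$$\alpha_{ls} = \bar{y} - \beta_{ls}\bar{x}$$

Note that, in this case, $\alpha_{ls} = \alpha_{ml}$ and $\beta_{ls} = \beta_{ml}$, where $\alpha_{ml}$ and $\beta_{ml}$ are the maximum likelihood estimates assuming the classical linear regression model. That is, if one assumes a classical linear regression model, the ml estimators exist and are equal to the ls estimators, but if one assumes only the linear regression model (don't add the assumption that $\varepsilon \sim N(0, \sigma_\varepsilon^2)$), the ls estimators exist, but not the ml estimators.

There are a number of different ways to write $\beta_{ls}$; they are all equal.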
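$$\beta_{ls} = \frac{\sum_{i} y_i x_i - n\bar{x}\bar{y}}{\sum_{i} x_i^2 - n\bar{x}^2} = \frac{\sum_{i} y_i x_i - n\bar{x}\bar{y}}{\sum_{i}(x_i - \bar{x})^2} = \frac{\sum_{i}\tilde{x}_i y_i}{\sum_{i}\tilde{x}_i^2} = \frac{\sum_{i}\tilde{x}_i\tilde{y}_i}{\sum_{i}\tilde{x}_i^2}$$

where $\tilde{x}_i \equiv x_i - \bar{x}$ and $\tilde{y}_i \equiv y_i - \bar{y}$. One uses different characterizations in different situations, depending on what one wants to demonstrate.

1.2.2 There is no least-squares estimate of $\sigma_y^2$

Note that since $\sum_i (y_i - \alpha - \beta x_i)^2$ is not a function of $\sigma_y^2$, there is not a least-squares estimator for $\sigma_y^2$. That is, what one minimizes to obtain the estimates is not a function of $\sigma_y^2$. However, given $\alpha_{ls}$ and $\beta_{ls}$, one can estimate $\sigma_y^2$ with

$$\hat{\sigma}_y^2 = \hat{\sigma}_\varepsilon^2 = \frac{\sum_{i}(y_i - \alpha_{ls} - \beta_{ls} x_i)^2}{n - 2}$$

The intuition for dividing by $n - 2$ is that one loses two degrees of freedom in the calculation of $\alpha_{ls}$ and $\beta_{ls}$. $\hat{\sigma}_\varepsilon^2$ is not a least-squares estimator, but is based on the least-squares estimators of $\alpha$ and $\beta$.

It is possible to show that $E[\hat{\sigma}_\varepsilon^2] = \sigma_\varepsilon^2$. See, for example, Gujarati, Basic Econometrics, the appendix to chapter 3.

1.3 Remember: least-squares estimators exist even if $g(x_i : \theta) \neq \alpha + \beta x_i$

That is, $g(x_i : \theta)$ can be nonlinear in $\theta$. For example, one could assume

$$g(x_i : \beta) = e^{\beta x_i}$$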
so

$$\frac{\partial e^{\beta x_i}}{\partial \beta} = x_i e^{\beta x_i}$$

which is highly nonlinear in $\beta$. In which case,

$$y_i = e^{\beta x_i} + \varepsilon_i, \quad i = 1, 2, \ldots, n$$

where $E[\varepsilon] = 0$ and $\varepsilon$ has finite variance, and the least-squares estimate of $\beta$ is that estimate, $\beta_{ls}$, that minimizes

$$SSR = \sum_{i}\left(y_i - e^{\beta x_i}\right)^2$$

This is an example of nonlinear least squares.[7]

[7] In contrast, note that if one assumed $y_i = \beta x_i^2 + \varepsilon_i$ it would still be linear least squares because the function is linear in $\beta$.

1.4 Some properties of least-squares estimators of the form $y_i = \alpha + \beta x_i + \varepsilon_i$, $i = 1, 2, \ldots, n$, where $E[\varepsilon] = 0$ and $\varepsilon$ has finite variance

Assume that the $Y_i$ in the sample are independent of one another (we have a random sample).

From above, and assuming the above linear form,

$$\beta_{ls} = \frac{\sum_{i}\tilde{x}_i y_i}{\sum_{i}\tilde{x}_i^2} = \frac{\sum_{i}\tilde{x}_i y_i}{k} = \sum_{i} w_i y_i$$

where $k \equiv \sum_i \tilde{x}_i^2$ and $w_i \equiv \tilde{x}_i / k$.

In words, $\beta_{ls}$ is a linear combination (weighted sum) of the $n$ random variables $y_1, y_2, \ldots, y_n$, where the weights are a function of the $x$'s. We call estimators with this property linear estimators.[8] Note that determining that it is a linear estimator did not require that $f(\varepsilon)$ have a particular form.

[8] Looking ahead, this is part of the famous Gauss-Markov theorem.
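For the nonlinear example $y_i = e^{\beta x_i} + \varepsilon_i$ there is no closed-form minimizer of SSR, so the estimate has to be found numerically. A minimal sketch follows, using Gauss-Newton iterations; the choice of Gauss-Newton, the starting value, the iteration count and the noiseless made-up data are all assumptions of this illustration, not something the notes prescribe.

```python
import math


def nls_exponential(x, y, beta_start=0.1, iterations=50):
    """Gauss-Newton iterations for minimizing SSR = sum (y_i - exp(beta * x_i))^2."""
    beta = beta_start
    for _ in range(iterations):
        fitted = [math.exp(beta * x_i) for x_i in x]
        resid = [y_i - f_i for y_i, f_i in zip(y, fitted)]
        # Derivative of exp(beta * x_i) with respect to beta is x_i * exp(beta * x_i).
        slope = [x_i * f_i for x_i, f_i in zip(x, fitted)]
        denom = sum(s * s for s in slope)
        if denom == 0.0:
            break
        # Gauss-Newton step: regress the current residuals on the slopes.
        beta += sum(r * s for r, s in zip(resid, slope)) / denom
    return beta


# Made-up, noiseless data generated with beta = 0.5, so the answer is known.
x = [0.0, 0.5, 1.0, 1.5, 2.0]
y = [math.exp(0.5 * x_i) for x_i in x]
beta_ls = nls_exponential(x, y)
```

Each step is itself a small linear least-squares problem, which is why the derivative $x_i e^{\beta x_i}$ appears. With noisy data or a poor starting value the iterations can fail to converge, so a safeguarded optimizer would be a better choice in real work.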
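Given that $\beta_{ls} = \sum_i w_i y_i$, and given that, conditional on the $x_i$, the $y_i$ are independent,

$$E[\beta_{ls}] = E\left[\sum_{i} w_i y_i\right] = \sum_{i} w_i E[y_i]$$

since the $w_i$ are constants: they vary with $x$ but the $x$ are assumed fixed in repeated samples.

$$= \sum_{i} w_i(\alpha + \beta x_i) \qquad \text{since } E[y_i] = \alpha + \beta x_i$$

$$= \alpha\sum_{i} w_i + \beta\sum_{i} w_i x_i = \alpha\frac{\sum_{i}\tilde{x}_i}{k} + \beta\frac{\sum_{i}\tilde{x}_i x_i}{k} \qquad \text{since } w_i = \frac{\tilde{x}_i}{k}$$

$$= \beta\frac{\sum_{i}\tilde{x}_i x_i}{k} \qquad \text{since } \sum_{i}\tilde{x}_i = \sum_{i}(x_i - \bar{x}) = 0$$

$$= \beta\frac{\sum_{i}\tilde{x}_i(\tilde{x}_i + \bar{x})}{k} \qquad \text{because } \tilde{x}_i = x_i - \bar{x} \Rightarrow x_i = \tilde{x}_i + \bar{x}$$

$$= \frac{\beta}{k}\left[\sum_{i}\tilde{x}_i^2 + \bar{x}\sum_{i}\tilde{x}_i\right] = \frac{\beta}{k}\sum_{i}\tilde{x}_i^2 \qquad \text{because } \sum_{i}\tilde{x}_i = 0$$

$$= \beta\frac{\sum_{i}\tilde{x}_i^2}{\sum_{i}\tilde{x}_i^2} = \beta \qquad \text{because } k = \sum_{i}\tilde{x}_i^2$$

That is,

$$E[\beta_{ls}] = \beta$$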
In words, $\beta_{ls}$ is an unbiased estimator of $\beta$. Note that this proof did not require that we assume a specific distribution for $\varepsilon$. We need only the assumptions of the linear regression model, and a random sample (independent $y_i$).

Note that at this point we have demonstrated that $\beta_{ls}$ is a linear unbiased estimator, and this result does not depend on a normality assumption.

It is also possible to show that

$$E[\alpha_{ls}] = \alpha$$

I leave that as an exercise for you.

In summary, the least-squares estimators of the parameters are linear and unbiased estimators.

It is possible to show that $E[\hat{\sigma}_\varepsilon^2] = \sigma_\varepsilon^2$, but remember that $\hat{\sigma}_\varepsilon^2 = \frac{\sum_i (y_i - \alpha_{ls} - \beta_{ls} x_i)^2}{n - 2}$ is not a least-squares estimate.
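The unbiasedness results can be checked informally by simulation. The sketch below holds the $x$'s fixed, draws repeated samples with deliberately non-normal errors, and averages the estimates. The parameter values, the uniform error distribution, the sample size and the number of replications are assumptions chosen only for illustration.

```python
import random


def ols_fit(x, y):
    """Closed-form least-squares estimates plus the (n - 2) variance estimate."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((x_i - x_bar) ** 2 for x_i in x)
    beta_ls = sum((x_i - x_bar) * y_i for x_i, y_i in zip(x, y)) / sxx
    alpha_ls = y_bar - beta_ls * x_bar
    ssr = sum((y_i - alpha_ls - beta_ls * x_i) ** 2 for x_i, y_i in zip(x, y))
    return alpha_ls, beta_ls, ssr / (n - 2)


random.seed(0)
alpha, beta = 1.0, 2.0                   # assumed "true" parameters
x = [float(i) for i in range(10)]        # fixed in repeated samples
reps = 5000
totals = [0.0, 0.0, 0.0]
for _ in range(reps):
    # Uniform(-1, 1) errors: mean zero, variance 1/3, clearly not normal.
    y = [alpha + beta * x_i + random.uniform(-1.0, 1.0) for x_i in x]
    for j, value in enumerate(ols_fit(x, y)):
        totals[j] += value

mean_alpha_ls, mean_beta_ls, mean_sigma2_hat = (t / reps for t in totals)
```

The three averages should land close to 1, 2 and 1/3 respectively, even though no normality was assumed. A simulation like this only suggests unbiasedness; the algebra above is what proves it.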
So, assuming the linear regression model and a random sample, $\alpha_{ls}$ and $\beta_{ls}$ are linear estimators and unbiased estimators. This is good.

It is possible to further show that, in the class of linear unbiased estimators, the least-squares estimators have minimum variance. This earns them the adjective "best".

So, assuming the linear regression model and a random sample, $\alpha_{ls}$ and $\beta_{ls}$ are BLUE (best linear unbiased estimators). This is the Gauss-Markov theorem.

If one assumes $\varepsilon \sim N(0, \sigma_\varepsilon^2)$, the estimators gain more desirable properties because they are also the ml estimators.
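1.4.1 One can use $\alpha_{ls}$ and $\beta_{ls}$ to predict values of $y_j$ conditional on $x_j$

$$\hat{y}_j = \alpha_{ls} + \beta_{ls} x_j$$

$\hat{y}_j$ is a random variable that, for fixed $x$'s, will vary from sample to sample.

Since $\alpha_{ls}$ and $\beta_{ls}$ are both unbiased estimates, $\hat{y}_j$ is an unbiased estimate of the expected value of $y_j$ given $x_j$; that is, $E[\hat{y}_j] = \alpha + \beta x_j = E[y_j | x_j]$.

Think about the sampling distribution of $\hat{y}_j$, which is conditioned on $x_j$.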
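A short sketch of the closed-form estimates, the prediction rule and the residual check suggested in the derivation (the residuals should be orthogonal to the $x$'s). The five data points and the helper names `ols_fit` and `predict` are made up for this illustration.

```python
def ols_fit(x, y):
    """Closed-form least-squares estimates for y_i = alpha + beta * x_i + e_i."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((x_i - x_bar) ** 2 for x_i in x)
    beta_ls = sum((x_i - x_bar) * y_i for x_i, y_i in zip(x, y)) / sxx
    alpha_ls = y_bar - beta_ls * x_bar
    return alpha_ls, beta_ls


# Made-up data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
alpha_ls, beta_ls = ols_fit(x, y)


def predict(x_j):
    # Predicted value of y conditional on x_j.
    return alpha_ls + beta_ls * x_j


residuals = [y_i - predict(x_i) for x_i, y_i in zip(x, y)]
# First-order conditions: both sums should be zero up to rounding error.
check_sum = sum(residuals)
check_cross = sum(r * x_i for r, x_i in zip(residuals, x))
```

If either check is far from zero, there is an arithmetic mistake somewhere in the estimates.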
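1.5 The variances of the least-squares estimators

1.5.1 The variance of $\beta_{ls}$

The least-squares estimate of $\beta$ is a statistic and will vary from sample to sample, so it has a sampling distribution, $f_{\beta_{ls}}(v)$. An issue at hand is determining the variance of this sampling distribution.[9]

[9] More generally, we would like to know the form of $f_{\beta_{ls}}(v)$.

An important issue is whether we proceed assuming a knowledge of $\sigma_\varepsilon^2$, or only knowledge of its estimate, $\hat{\sigma}_\varepsilon^2$. We will start assuming knowledge of $\sigma_\varepsilon^2$, and afterwards discuss how the variance of $\beta_{ls}$ differs when it is expressed as a function of $\hat{\sigma}_\varepsilon^2$ rather than $\sigma_\varepsilon^2$. Knowing $\sigma_\varepsilon^2$ is atypical, but easier, so we start there.

To emphasize that we are conditioning on $\sigma_\varepsilon^2$, in the short run denote $f_{\beta_{ls}}(v)$ more specifically as $f_{\beta_{ls}}(v \,|\, \sigma_\varepsilon^2)$ and write $var(\beta_{ls} \,|\, \sigma_\varepsilon^2) \equiv \sigma^2_{\beta_{ls}}$.[10]

[10] In contrast to $f_{\beta_{ls}}(v \,|\, \hat{\sigma}_\varepsilon^2)$ and $var(\beta_{ls} \,|\, \hat{\sigma}_\varepsilon^2) \equiv \hat{\sigma}^2_{\beta_{ls}}$.

Above we showed that $\beta_{ls}$ is a linear estimator. That is, it can be written $\beta_{ls} = \sum_i w_i y_i$ where the $w_i$ can be treated as constants. We also know that $var(ax) = a^2 var(x)$ if $a$ is a constant, and that the variance of a sum of independent random variables is the sum of their variances. Combining these pieces of information, along with knowledge of $\sigma_\varepsilon^2$:

$$var(\beta_{ls} \,|\, \sigma_\varepsilon^2) = \sum_{i} w_i^2\,\sigma_y^2 = \sigma_\varepsilon^2\sum_{i} w_i^2$$

Recollect that $y_i = \alpha + \beta x_i + \varepsilon_i$ where $E[\varepsilon] = 0$ and $\varepsilon$ has a finite variance $\sigma_\varepsilon^2$, so $\sigma_y^2 = \sigma_\varepsilon^2$.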
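Proceeding,

$$var(\beta_{ls} \,|\, \sigma_\varepsilon^2) = \sigma_\varepsilon^2\sum_{i} w_i^2 = \sigma_\varepsilon^2\sum_{i}\left(\frac{\tilde{x}_i}{k}\right)^2 = \sigma_\varepsilon^2\frac{\sum_{i}\tilde{x}_i^2}{k^2} = \sigma_\varepsilon^2\frac{\sum_{i}\tilde{x}_i^2}{\left(\sum_{i}\tilde{x}_i^2\right)^2} = \frac{\sigma_\varepsilon^2}{\sum_{i}\tilde{x}_i^2} = \frac{\sigma_\varepsilon^2}{\sum_{i}(x_i - \bar{x})^2}$$

since $k = \sum_i \tilde{x}_i^2$, and the standard error of $\beta_{ls}$ is

$$se(\beta_{ls}) = \left[var(\beta_{ls})\right]^{.5}$$

Notice that $var(\beta_{ls})$ decreases as $\sum_i (x_i - \bar{x})^2$ increases.

What did we assume to derive $var(\beta_{ls} \,|\, \sigma_\varepsilon^2) \equiv \sigma^2_{\beta_{ls}} = \frac{\sigma_\varepsilon^2}{\sum_i \tilde{x}_i^2}$? We assumed that $y_i = \alpha + \beta x_i + \varepsilon_i$ where $E[\varepsilon] = 0$ and $\varepsilon$ has a finite variance, and that the $Y_i$ are independent. We did not need to assume that $\varepsilon$ has a specific distribution, such as the normal.

It is also possible to derive $var(\alpha_{ls})$ as a function of $\sigma_\varepsilon^2$:

$$var(\alpha_{ls} \,|\, \sigma_\varepsilon^2) \equiv \sigma^2_{\alpha_{ls}} = \left(\frac{\sigma_\varepsilon^2}{n}\right)\left(\frac{\sum_{i} x_i^2}{\sum_{i}\tilde{x}_i^2}\right)$$

Note again that we cannot calculate $var(\beta_{ls})$ or $var(\alpha_{ls})$ unless we assume a specific value for $\sigma_\varepsilon^2$.[11]

[11] Note that one can also calculate $cov(\alpha_{ls}, \beta_{ls})$. It is not 0 because both $\alpha_{ls}$ and $\beta_{ls}$ are functions of the same $y_i$.
$$cov(\alpha_{ls}, \beta_{ls}) = E\left[(\alpha_{ls} - E[\alpha_{ls}])(\beta_{ls} - E[\beta_{ls}])\right] = E\left[(\alpha_{ls} - \alpha)(\beta_{ls} - \beta)\right]$$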
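Note that $var(\beta_{ls} \,|\, \sigma_\varepsilon^2) \equiv \sigma^2_{\beta_{ls}} = \frac{\sigma_\varepsilon^2}{\sum_i \tilde{x}_i^2}$ is a function of the $x$'s in the sample, but not the $y$'s. Therefore, if one makes the typical assumption that the $x$'s are fixed in repeated samples, $\sigma^2_{\beta_{ls}}$ is not a random variable. By the same argument, neither is $\sigma^2_{\alpha_{ls}}$. This is because we are assuming knowledge of $\sigma_\varepsilon^2$.

This is an important point. $\sigma^2_{\beta_{ls}}$ and $\sigma^2_{\alpha_{ls}}$ are not statistics and not things that are estimated; they are calculated given knowledge of $\sigma_\varepsilon^2$ and knowledge of the $x$ levels in the data. Said another way, while the least-squares estimates of $\alpha$ and $\beta$ will vary from sample to sample, $\sigma^2_{\beta_{ls}}$ and $\sigma^2_{\alpha_{ls}}$ do not vary from sample to sample (assuming the $x$'s are fixed in repeated samples).

Soon, we will consider the problem of estimating $var(\beta_{ls} \,|\, \hat{\sigma}_\varepsilon^2)$ and $var(\alpha_{ls} \,|\, \hat{\sigma}_\varepsilon^2)$. But first,

1.5.2 The variance of $\hat{y}_j$ as a function of $\sigma_\varepsilon^2$

One can also show that (for example, Gujarati, Essentials page 185)

$$var(\hat{y}_j \,|\, \sigma_\varepsilon^2) \equiv \sigma^2_{\hat{y}_j} = \sigma_\varepsilon^2\left[\frac{1}{n} + \frac{(x_j - \bar{x})^2}{\sum_{i}\tilde{x}_i^2}\right]$$

Note that this is a function of $\sigma_\varepsilon^2$ and the $x$'s but not the $y$'s, so it is not something that is estimated. It is not a random variable.

[11, continued] But $\alpha_{ls} = \bar{y} - \beta_{ls}\bar{x}$ and $\alpha = E[\bar{y}] - \beta\bar{x}$, and $\bar{y}$ is uncorrelated with $\beta_{ls}$ (because $\sum_i w_i = 0$), so the $\bar{y} - E[\bar{y}]$ term drops out of the expectation:
$$cov(\alpha_{ls}, \beta_{ls}) = E\left[\left(\bar{y} - \beta_{ls}\bar{x} - (E[\bar{y}] - \beta\bar{x})\right)(\beta_{ls} - \beta)\right] = E\left[(-\beta_{ls}\bar{x} + \beta\bar{x})(\beta_{ls} - \beta)\right] = -\bar{x}\,E\left[(\beta_{ls} - \beta)^2\right] = \frac{-\bar{x}\,\sigma_\varepsilon^2}{\sum_{i}(x_i - \bar{x})^2} = \frac{-\bar{x}\,\sigma_\varepsilon^2}{\sum_{i}\tilde{x}_i^2}$$
The covariance decreases in absolute value as the variation in the $x$'s increases.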
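The variance formulas above are plain calculations once a value of $\sigma_\varepsilon^2$ is assumed; nothing is estimated and no $y$'s are involved. A small sketch, with the function name `ls_variances`, the $x$ values and $\sigma_\varepsilon^2 = 4$ all being assumptions made for illustration:

```python
def ls_variances(x, sigma2, x_j):
    """Variances of alpha_ls, beta_ls and y_hat_j, and cov(alpha_ls, beta_ls),
    for a known error variance sigma2 and x's fixed in repeated samples."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((x_i - x_bar) ** 2 for x_i in x)      # sum of squared deviations
    var_beta = sigma2 / sxx
    var_alpha = (sigma2 / n) * sum(x_i ** 2 for x_i in x) / sxx
    cov_alpha_beta = -x_bar * sigma2 / sxx
    var_y_hat = sigma2 * (1.0 / n + (x_j - x_bar) ** 2 / sxx)
    return var_alpha, var_beta, cov_alpha_beta, var_y_hat


# Made-up x levels and an assumed known error variance.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
var_alpha, var_beta, cov_ab, var_y_hat = ls_variances(x, sigma2=4.0, x_j=4.0)

# Consistency check: y_hat_j = alpha_ls + beta_ls * x_j, so its variance must equal
# var(alpha_ls) + x_j^2 * var(beta_ls) + 2 * x_j * cov(alpha_ls, beta_ls).
var_y_hat_check = var_alpha + 4.0 ** 2 * var_beta + 2 * 4.0 * cov_ab
```

The last line ties the four formulas together: the variance of $\hat{y}_j$ computed directly agrees with the one built from the variances and covariance of $\alpha_{ls}$ and $\beta_{ls}$.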
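1.6 What is implied if one adds to the above the assumption that $\varepsilon \sim N(0, \sigma_\varepsilon^2)$

If $y_i = \alpha + \beta x_i + \varepsilon_i$ where $\varepsilon \sim N(0, \sigma_\varepsilon^2)$, then $y_i \sim N(\alpha + \beta x_i, \sigma_\varepsilon^2)$.

From earlier we know that $\beta_{ls} = \sum_i w_i y_i$, so $\beta_{ls}$ is a linear combination of normally distributed random variables, so it is normally distributed. Specifically,

$$\beta_{ls} \sim N\left(\beta, \frac{\sigma_\varepsilon^2}{\sum_{i}\tilde{x}_i^2}\right)$$

By the same logic,

$$\alpha_{ls} \sim N\left(\alpha, \left(\frac{\sigma_\varepsilon^2}{n}\right)\left(\frac{\sum_{i} x_i^2}{\sum_{i}\tilde{x}_i^2}\right)\right)$$

When $\sigma_\varepsilon^2$ is known, neither $\alpha_{ls}$ nor $\beta_{ls}$ has a t distribution; both are normally distributed. I say this here because some people incorrectly believe that $\alpha_{ls}$ and $\beta_{ls}$ always have a t distribution.

If $\beta_{ls} \sim N(\beta, \sigma^2_{\beta_{ls}})$ then

$$\frac{\beta_{ls} - \beta}{\sigma_\varepsilon / \left(\sum_{i}\tilde{x}_i^2\right)^{.5}} = \frac{\beta_{ls} - \beta}{\sigma_{\beta_{ls}}} \sim N(0, 1)$$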
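So, if we assumed a value for $\sigma_\varepsilon^2$, we could calculate $\sigma_{\beta_{ls}}$ (not estimate it) and then calculate a confidence interval for $\beta$ and test the null hypothesis that $\beta$ takes some specific value such as zero. For example, since $\frac{\beta_{ls} - \beta}{\sigma_{\beta_{ls}}} \sim N(0, 1)$,

$$\Pr\left(-1.96 \le \frac{\beta_{ls} - \beta}{\sigma_{\beta_{ls}}} \le 1.96\right) = .95$$
$$\Rightarrow \Pr\left(-1.96\,\sigma_{\beta_{ls}} \le \beta_{ls} - \beta \le 1.96\,\sigma_{\beta_{ls}}\right) = .95$$
$$\Rightarrow \Pr\left(-\beta_{ls} - 1.96\,\sigma_{\beta_{ls}} \le -\beta \le -\beta_{ls} + 1.96\,\sigma_{\beta_{ls}}\right) = .95$$
$$\Rightarrow \Pr\left(\beta_{ls} - 1.96\,\sigma_{\beta_{ls}} \le \beta \le \beta_{ls} + 1.96\,\sigma_{\beta_{ls}}\right) = .95$$

$\beta_{ls} - 1.96\,\sigma_{\beta_{ls}} \le \beta \le \beta_{ls} + 1.96\,\sigma_{\beta_{ls}}$ is the 95% confidence interval for $\beta$ based on $\sigma_\varepsilon^2$ and the assumption that $\varepsilon$ is normally distributed.

How do we interpret this interval? Note that this interval depends on $\beta_{ls}$, which is a random variable, so the confidence interval is a random variable. 95%[12][13] of these intervals will contain $\beta$.

[12] Note that one cannot say that there is a 95% chance that the true $\beta$ is between $(\beta_{ls} - 1.96\,\sigma_{\beta_{ls}})$ and $(\beta_{ls} + 1.96\,\sigma_{\beta_{ls}})$. Further note that since $\sigma_{\beta_{ls}}$ is not a random variable if the $x$'s are fixed in repeated samples, the position of this confidence interval is a random variable, but not its length.
[13] Note that none of the above has anything to do with the t distribution.

Assuming that $\varepsilon \sim N(0, \sigma_\varepsilon^2)$, it follows that $\hat{y}_j$ is also normally distributed (it is a linear function of two normally distributed random variables, $\alpha_{ls}$ and $\beta_{ls}$). Specifically,

$$\hat{y}_j \sim N\left(\alpha + \beta x_j,\; \sigma_\varepsilon^2\left[\frac{1}{n} + \frac{(x_j - \bar{x})^2}{\sum_{i}\tilde{x}_i^2}\right]\right)$$

So one can also get a confidence interval for $E[y_j | x_j]$ based on $\hat{y}_j$.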
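The known-variance interval for $\beta$ is a direct calculation. The sketch below uses made-up values for the estimate, the $x$ levels and the assumed error variance; the function name `ci_beta_known_sigma` is hypothetical.

```python
import math


def ci_beta_known_sigma(beta_ls, sigma2, x, z=1.96):
    """95% interval for beta when the error variance sigma2 is known (normal errors)."""
    x_bar = sum(x) / len(x)
    sxx = sum((x_i - x_bar) ** 2 for x_i in x)
    se_beta = math.sqrt(sigma2 / sxx)        # calculated, not estimated
    return beta_ls - z * se_beta, beta_ls + z * se_beta


x = [1.0, 2.0, 3.0, 4.0, 5.0]                # made-up x levels
lower, upper = ci_beta_known_sigma(beta_ls=1.97, sigma2=4.0, x=x)
length = upper - lower

# A different sample gives a different beta_ls, hence a different position,
# but the same length, because the length depends only on sigma2 and the x's.
lower_2, upper_2 = ci_beta_known_sigma(beta_ls=2.30, sigma2=4.0, x=x)
length_2 = upper_2 - lower_2
```

This mirrors the point in the footnote: with the $x$'s fixed and $\sigma_\varepsilon^2$ known, only the position of the interval is random.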
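1.7 However, we don't typically assume a value for $\sigma_\varepsilon^2$ but estimate it with $\hat{\sigma}_\varepsilon^2$

Continue, for now, to assume that $\varepsilon \sim N(0, \sigma_\varepsilon^2)$, so assume the CLR model, but now assume that we do not know $\sigma_\varepsilon^2$, so have to estimate it:

$$\hat{\sigma}_\varepsilon^2 = \frac{\sum_{i}(y_i - \alpha_{ls} - \beta_{ls} x_i)^2}{n - 2}$$

and note the important distinction between $\hat{\sigma}_\varepsilon^2$ and $\sigma_\varepsilon^2$: the first is a rv, the second is a constant.

The first thing to note, as we demonstrate below, is that even though $\varepsilon \sim N(0, \sigma_\varepsilon^2)$,

$$\frac{\beta_{ls} - \beta}{\hat{\sigma}_{\beta_{ls}}} \text{ is not distributed } N(0, 1)$$

where

$$\hat{\sigma}^2_{\beta_{ls}} = \frac{\hat{\sigma}_\varepsilon^2}{\sum_{i}\tilde{x}_i^2}$$

Toooooo bad.[14]

[14] Note that $\beta_{ls} \sim N(\beta, \sigma^2_{\beta_{ls}})$ because we are assuming $\varepsilon \sim N(0, \sigma_\varepsilon^2)$.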
Since it is not normal, what distribution does $\frac{\beta_{ls} - \beta}{\hat{\sigma}_{\beta_{ls}}}$ have? Let's try to demonstrate that it has a t distribution.

The following is a bit difficult - think of it as walking backwards from the end of the trail back to your car, forgetting where you started.

What I am doing is deriving the distribution of $\frac{\beta_{ls} - \beta}{\hat{\sigma}_{\beta_{ls}}}$. Remember that $\frac{\beta_{ls} - \beta}{\sigma_{\beta_{ls}}} \sim N(0, 1)$.
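Now define another random variable, $G$ (remember we are going backwards), such that

$$G \equiv \frac{(n - 2)\hat{\sigma}_\varepsilon^2}{\sigma_\varepsilon^2}$$

Note that I have defined a function that is a linear function of the ratio $\hat{\sigma}_\varepsilon^2 / \sigma_\varepsilon^2$.

$$G = \frac{(n - 2)}{\sigma_\varepsilon^2}\cdot\frac{\sum_{i}(Y_i - \alpha_{ls} - \beta_{ls} x_i)^2}{(n - 2)}$$

(the reason for $(n - 2)$ above was so it would cancel here)

$$= \frac{\sum_{i}(Y_i - \alpha_{ls} - \beta_{ls} x_i)^2}{\sigma_\varepsilon^2} = \frac{\sum_{i}(Y_i - \hat{y}_i | x_i)^2}{\sigma_\varepsilon^2} \approx \frac{\sum_{i}(Y_i - E[Y_i | x_i])^2}{\sigma_y^2} = \sum_{i}\left(\frac{Y_i - E[Y_i | x_i]}{\sigma_y}\right)^2$$

because $\hat{y}_i | x_i$ is an unbiased estimate of $E[Y_i | x_i]$. (This substitution is heuristic: the sum is actually formed with the fitted values rather than the true means, which is why the degrees of freedom below are $n - 2$ rather than $n$.)

Note that $\hat{\sigma}_\varepsilon^2$ does not explicitly appear in this last term - we started with it, but it disappeared.

Further note that $\frac{y_i - E[y_i | x_i]}{\sigma_y} \sim N(0, 1)$ because $y_i \sim N(E[y_i | x_i], \sigma_y^2)$.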
Note, this is critical: our created random variable, $G = \sum_i \left(\frac{Y_i - E[Y_i | x_i]}{\sigma_y}\right)^2$, is the sum of the squares of a bunch of standard normal random variables. That means it has a $\chi^2$ distribution.[15]

[15] See Gujarati page 114 and MGB pages 241-243. Theorem 7 (MGB page 242) states that "If random variables $X_i$, $i = 1, 2, \ldots, k$, are normally and independently distributed with means $\mu_i$ and variances $\sigma_i^2$, then $U = \sum_{i=1}^{k}\left(\frac{X_i - \mu_i}{\sigma_i}\right)^2$ has a chi-square distribution with parameter $k$ ($k$ degrees of freedom)." A corollary is that if $X_1, X_2, \ldots, X_n$ is a random sample from a normal distribution with mean $\mu$ and variance $\sigma^2$, then $\sum_{i=1}^{n}\left(\frac{X_i - \mu}{\sigma}\right)^2$ has a chi-square distribution with $n$ degrees of freedom. A special case is that $\left(\frac{X_i - \mu}{\sigma}\right)^2$ has a chi-square distribution with 1 degree of freedom.

The important thing to remember at this point is that we have created a random variable $G$ that is a linear function of the ratio $\hat{\sigma}_\varepsilon^2 / \sigma_\varepsilon^2$, and we know its density function.

You want to learn what you can about the Chi-squared distribution (keep in mind, saying $k$ is the degrees of freedom of the distribution is just another way of saying the density function has one parameter, $k$).

Specifically,

$$G = \sum_{i}\left(\frac{Y_i - E[Y_i | x_i]}{\sigma_y}\right)^2 = \frac{(n - 2)\hat{\sigma}_\varepsilon^2}{\sigma_\varepsilon^2} \sim \chi^2_{n-2}$$

It is $(n - 2)$ because the parameter (number of degrees of freedom) is not the number of terms in the sum, but the number of independent terms in the sum, which is $(n - 2)$ because we lose two degrees of freedom to get $\hat{E}[Y_i | x_i] = \alpha_{ls} + \beta_{ls} x_i$.

That is, $G = \frac{(n - 2)\hat{\sigma}_\varepsilon^2}{\sigma_\varepsilon^2}$ is a rv with a Chi-squared distribution with parameter $(n - 2)$.

(The bottom line is that someone worked backward and figured out a rv that was a function of $\hat{\sigma}_\varepsilon^2$ and $\sigma_\varepsilon^2$, and that had a Chi-square distribution.)

Note that neither $\hat{\sigma}_\varepsilon^2$ nor $\sigma_\varepsilon^2$ is a parameter in the Chi-square, which is important.
So what do we know at this point?

$$\frac{(n - 2)\hat{\sigma}_\varepsilon^2}{\sigma_\varepsilon^2} \sim \chi^2_{n-2}$$

and

$$\frac{\beta_{ls} - \beta}{\sigma_\varepsilon / \left(\sum_{i}\tilde{x}_i^2\right)^{.5}} \sim N(0, 1)$$

So, now let's mention the t distribution. MGB 249-251 tell us

$$\frac{N(0, 1)}{\left[\chi^2_{n-2} / (n - 2)\right]^{.5}} \sim t_{n-2}$$

That is, a standard normal rv, e.g. $\frac{\beta_{ls} - \beta}{\sigma_{\beta_{ls}}}$, divided by the square root of a $\chi^2$ rv (divided by its parameter) has a t distribution with that parameter.[16]

[16] Theorem 10 (MGB page 250) states that "If the rv $Z$ has a standard normal distribution, if the rv $U$ has a chi-squared distribution with $k$ degrees of freedom, and if $Z$ and $U$ are independent, $\frac{Z}{(U/k)^{.5}}$ has a Student t distribution with parameter $k$ (degrees of freedom)." A relevant Corollary is on page 250.
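So, let's divide and see what simplifies. Define the rv $W$

$$W \equiv \frac{N(0, 1)}{\left[\chi^2_{n-2} / (n - 2)\right]^{.5}} = \frac{\dfrac{\beta_{ls} - \beta}{\sigma_\varepsilon / \left(\sum_{i}\tilde{x}_i^2\right)^{.5}}}{\left[\dfrac{(n - 2)\hat{\sigma}_\varepsilon^2}{\sigma_\varepsilon^2}\Big/(n - 2)\right]^{.5}} = \frac{(\beta_{ls} - \beta)\left(\sum_{i}\tilde{x}_i^2\right)^{.5} / \sigma_\varepsilon}{\hat{\sigma}_\varepsilon / \sigma_\varepsilon} = \frac{(\beta_{ls} - \beta)\left(\sum_{i}\tilde{x}_i^2\right)^{.5}}{\hat{\sigma}_\varepsilon} = \frac{\beta_{ls} - \beta}{\hat{\sigma}_{\beta_{ls}}}$$

where $\hat{\sigma}_{\beta_{ls}} = \hat{\sigma}_\varepsilon / \left(\sum_i \tilde{x}_i^2\right)^{.5}$. Note that $\sigma_\varepsilon$ cancels out; this is critical since we don't know it. So

$$\frac{\beta_{ls} - \beta}{\hat{\sigma}_{\beta_{ls}}} = \frac{\beta_{ls} - \beta}{\hat{\sigma}_\varepsilon / \left(\sum_{i}\tilde{x}_i^2\right)^{.5}} \sim t_{n-2}$$

if $y_i = \alpha + \beta x_i + \varepsilon_i$ where $\varepsilon \sim N(0, \sigma_\varepsilon^2)$.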
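So, to say it explicitly, we have determined that $\frac{\beta_{ls} - \beta}{\hat{\sigma}_{\beta_{ls}}}$ has a t distribution with parameter $(n - 2)$.[17] It took a lot of what we have learned to derive this.

[17] This is close to but different from saying that $\beta_{ls}$ has a t distribution.

Consider an example. If[18] $n = 32$,

$$\frac{\beta_{ls} - \beta}{\hat{\sigma}_{\beta_{ls}}} \sim t_{30}$$

[18] If $\frac{\beta_{ls} - \beta}{\hat{\sigma}_{\beta_{ls}}} \sim t_{n-2}$, then $E\left[\frac{\beta_{ls} - \beta}{\hat{\sigma}_{\beta_{ls}}}\right] = 0$ and its variance is $\frac{(n-2)}{(n-4)}$. In explanation, all t distributions have a mean of zero, and a t distribution with parameter $k$ has variance $\frac{k}{k-2}$, which is $\frac{(n-2)}{(n-4)}$ when $k = n - 2$.

In which case $\Pr(t_{30} > 2.04) = .025$ and $\Pr(t_{30} < -2.04) = .025$ from the t table. So,

$$\Pr\left(-2.04 < \frac{\beta_{ls} - \beta}{\hat{\sigma}_{\beta_{ls}}} < 2.04\right) = .95 \iff \Pr\left(\beta_{ls} - 2.04\,\hat{\sigma}_{\beta_{ls}} < \beta < \beta_{ls} + 2.04\,\hat{\sigma}_{\beta_{ls}}\right) = .95$$

The interval $\beta_{ls} - 2.04\,\hat{\sigma}_{\beta_{ls}}$ to $\beta_{ls} + 2.04\,\hat{\sigma}_{\beta_{ls}}$ is the 95% confidence interval for $\beta$ based on $\hat{\sigma}_\varepsilon$ rather than $\sigma_\varepsilon$. This interval is a random variable; 95% of these intervals will include $\beta$. Contrast this confidence interval with

$$\Pr\left(\beta_{ls} - 1.96\,\sigma_{\beta_{ls}} < \beta < \beta_{ls} + 1.96\,\sigma_{\beta_{ls}}\right) = .95$$

which we derived earlier.

A hypothesis test

How would one determine whether one can reject the null hypothesis that $\beta = 4$? One can derive the confidence interval for $\beta$ and see if it includes 4. Alternatively, one can directly use

$$\frac{\beta_{ls} - \beta}{\hat{\sigma}_{\beta_{ls}}} \sim t_{n-2}$$

If $\beta = 4$ (the null is correct),

$$\frac{\beta_{ls} - 4}{\hat{\sigma}_{\beta_{ls}}} \sim t_{n-2}$$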
Note that since a value of $\beta$ is assumed, this is a calculable number. For example, if $n = 32$, $\beta_{ls} = 8$, and $\hat{\sigma}_{\beta_{ls}} = 2$, then

$$\frac{\beta_{ls} - 4}{\hat{\sigma}_{\beta_{ls}}} = \frac{8 - 4}{2} = 2.0$$

If one chooses a two-tailed test (.025 in each tail), the critical value of $t$ is 2.04. In this case, $\frac{8 - 4}{2} = 2.0 < 2.042$ and one fails to reject the null hypothesis that $\beta = 4$.[19]

[Figure: density of $(\beta_{ls} - \beta)/\hat{\sigma}_{\beta_{ls}}$, which has a t distribution.]

Most basic OLS regression packages print out the t values corresponding to the null hypothesis $\beta = 0$. Be aware that these t statistics don't mean much unless you are willing to assume that $\varepsilon$ is normally distributed.

[19] Note that these t values make no sense if one does not assume $\varepsilon \sim N(0, \sigma_\varepsilon^2)$. That is, if one does not adopt this assumption, the random variable $\frac{\beta_{ls} - \beta}{\hat{\sigma}_{\beta_{ls}}}$ does not have a t distribution. Said a different way, if you are unwilling to assume $\varepsilon \sim N(0, \sigma_\varepsilon^2)$ you better not be paying any attention to the t values your OLS package printed out.
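The arithmetic of the hypothesis test can be sketched as follows. The inputs ($n = 32$, an estimate of 8, a null value of 4) follow the example; the estimated standard error of 2 is an assumption, since that value is not fully legible in the source, and the critical value 2.042 is the usual table value for a t distribution with 30 degrees of freedom. The function name `t_statistic` is hypothetical.

```python
def t_statistic(beta_ls, beta_null, se_hat):
    """(beta_ls - beta_null) / estimated standard error of beta_ls."""
    return (beta_ls - beta_null) / se_hat


# Two-tailed test with .025 in each tail and n - 2 = 30 degrees of freedom.
T_CRIT_30 = 2.042          # from a t table, hard-coded rather than computed

t_stat = t_statistic(beta_ls=8.0, beta_null=4.0, se_hat=2.0)   # se_hat = 2 is assumed
reject_null = abs(t_stat) > T_CRIT_30
```

Here `t_stat` is 2.0, which is inside the critical values, so `reject_null` is `False`. The conclusion is only meaningful under the normality assumption discussed above.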
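Now derive the 95% confidence interval for $\sigma_\varepsilon^2$ assuming $n = 32$

We are still assuming the CLR model and no knowledge of $\sigma_\varepsilon^2$. Earlier we showed that

$$G \equiv \frac{(n - 2)\hat{\sigma}_\varepsilon^2}{\sigma_\varepsilon^2} \sim \chi^2_{n-2}$$

Using the $\chi^2$ table one can determine that $\Pr(\chi^2_{30} > 46.98) = .025$ and $\Pr(\chi^2_{30} < 16.79) = .025$.

Below is the density function for $\chi^2_{30}$; 2.5% of the area is to the left of 16.79 and 2.5% is to the right of 46.98.

[Figure: density $f(g)$ of $G$, which has a Chi-squared distribution with 30 degrees of freedom.]

So

$$\Pr\left(16.79 \le \frac{30\,\hat{\sigma}_\varepsilon^2}{\sigma_\varepsilon^2} \le 46.98\right) = .95$$
$$\Rightarrow \Pr\left(\frac{16.79}{30\,\hat{\sigma}_\varepsilon^2} \le \frac{1}{\sigma_\varepsilon^2} \le \frac{46.98}{30\,\hat{\sigma}_\varepsilon^2}\right) = .95$$
$$\Rightarrow \Pr\left(\frac{30\,\hat{\sigma}_\varepsilon^2}{16.79} \ge \sigma_\varepsilon^2 \ge \frac{30\,\hat{\sigma}_\varepsilon^2}{46.98}\right) = .95$$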
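$$\Rightarrow \Pr\left(\frac{30\,\hat{\sigma}_\varepsilon^2}{46.98} \le \sigma_\varepsilon^2 \le \frac{30\,\hat{\sigma}_\varepsilon^2}{16.79}\right) = .95$$
$$\Rightarrow \Pr\left(.638\,\hat{\sigma}_\varepsilon^2 \le \sigma_\varepsilon^2 \le 1.786\,\hat{\sigma}_\varepsilon^2\right) = .95$$

So, we have derived a confidence interval on the population parameter $\sigma_\varepsilon^2$ as a function of $\hat{\sigma}_\varepsilon^2$. Note that the confidence interval, $.638\,\hat{\sigma}_\varepsilon^2 \le \sigma_\varepsilon^2 \le 1.786\,\hat{\sigma}_\varepsilon^2$, is a random variable; 95% of these intervals will include $\sigma_\varepsilon^2$.

If one wanted to test the null hypothesis that $\sigma_\varepsilon^2$ takes some specific value, e.g. $\sigma_\varepsilon^2 = 4$, one can see if 4 is in the interval $.638\,\hat{\sigma}_\varepsilon^2 \le \sigma_\varepsilon^2 \le 1.786\,\hat{\sigma}_\varepsilon^2$. Or one can directly use the fact that

$$\frac{(n - 2)\hat{\sigma}_\varepsilon^2}{\sigma_\varepsilon^2} \sim \chi^2_{n-2}$$

Plugging in $\sigma_\varepsilon^2 = 4$ and $n = 32$,

$$\frac{(30)\hat{\sigma}_\varepsilon^2}{4} = 7.5\,\hat{\sigma}_\varepsilon^2 \sim \chi^2_{30}$$

From above, for a two-tailed test at the .05 significance level, the critical values of $\chi^2_{30}$ are 16.79 and 46.98. So if $\hat{\sigma}_\varepsilon^2 > \frac{46.98}{7.5} = 6.26$, one would reject the null hypothesis that $\sigma_\varepsilon^2 = 4$. One would also reject this null hypothesis if $\hat{\sigma}_\varepsilon^2 < \frac{16.79}{7.5} = 2.24$.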
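The interval and test for $\sigma_\varepsilon^2$ reduce to a few divisions once the two table values are in hand. In this sketch the critical values 16.79 and 46.98 are taken from the text (30 degrees of freedom); the function names and the trial values of $\hat{\sigma}_\varepsilon^2$ are made up.

```python
# Table values for a chi-squared distribution with 30 degrees of freedom:
# 2.5% of the area lies below 16.79 and 2.5% lies above 46.98.
CHI2_30_LOWER = 16.79
CHI2_30_UPPER = 46.98


def sigma2_interval(sigma2_hat, df=30, lower=CHI2_30_LOWER, upper=CHI2_30_UPPER):
    """95% confidence interval for the error variance given its estimate."""
    return df * sigma2_hat / upper, df * sigma2_hat / lower


def reject_sigma2_null(sigma2_hat, sigma2_null, df=30,
                       lower=CHI2_30_LOWER, upper=CHI2_30_UPPER):
    """Two-tailed test at the .05 level of the null that the variance is sigma2_null."""
    g = df * sigma2_hat / sigma2_null
    return g < lower or g > upper


# With sigma2_hat = 1 the interval endpoints are the two multipliers themselves.
multiplier_low, multiplier_high = sigma2_interval(1.0)

reject_high = reject_sigma2_null(sigma2_hat=7.0, sigma2_null=4.0)   # 52.5 > 46.98
reject_mid = reject_sigma2_null(sigma2_hat=4.0, sigma2_null=4.0)    # 30 is inside
reject_low = reject_sigma2_null(sigma2_hat=2.0, sigma2_null=4.0)    # 15 < 16.79
```

Note that the interval is not symmetric around $\hat{\sigma}_\varepsilon^2$, because the Chi-squared density is skewed.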
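How about a confidence interval for $E[y_j | x_j]$, assuming $\varepsilon \sim N(0, \sigma_\varepsilon^2)$ and no knowledge of $\sigma_\varepsilon^2$?

From above we know that

$$\hat{y}_j \sim N\left(\alpha + \beta x_j,\; \sigma_\varepsilon^2\left[\frac{1}{n} + \frac{(x_j - \bar{x})^2}{\sum_{i}\tilde{x}_i^2}\right]\right)$$

If we replace $\sigma_\varepsilon^2$ with $\hat{\sigma}_\varepsilon^2$, the standardized $\hat{y}_j$ no longer has a normal distribution. But, by the same argument as above,

$$\frac{\hat{y}_j - E[y_j | x_j]}{\hat{\sigma}_{\hat{y}_j}} \sim t_{n-2}$$

where $\hat{\sigma}^2_{\hat{y}_j} = \hat{\sigma}_\varepsilon^2\left[\frac{1}{n} + \frac{(x_j - \bar{x})^2}{\sum_i \tilde{x}_i^2}\right]$. This implies, still assuming $n = 32$,

$$\Pr\left(-2.04 < \frac{\hat{y}_j - E[y_j | x_j]}{\hat{\sigma}_{\hat{y}_j}} < 2.04\right) = .95$$
$$\Rightarrow \Pr\left(\hat{y}_j - 2.04\,\hat{\sigma}_{\hat{y}_j} < E[y_j | x_j] < \hat{y}_j + 2.04\,\hat{\sigma}_{\hat{y}_j}\right) = .95$$

So, 95% of the intervals $\hat{y}_j - 2.04\,\hat{\sigma}_{\hat{y}_j} < E[y_j | x_j] < \hat{y}_j + 2.04\,\hat{\sigma}_{\hat{y}_j}$ will contain the true expected value of $y_j$ conditional on $x_j$.

Note that the length of this interval takes its minimum value when $x_j = \bar{x}$, and decreases as $x_j \to \bar{x}$. How do I know this?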
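One way to see how the interval length depends on $x_j$ is to compute the half-width at several values of $x_j$. In this sketch the $x$ levels (32 equally spaced points), the value $\hat{\sigma}_\varepsilon^2 = 4$ and the function name are assumptions made for illustration; 2.04 is the t critical value used in the text for $n = 32$.

```python
import math


def mean_response_halfwidth(x, sigma2_hat, x_j, t_crit=2.04):
    """Half-width of the 95% interval for E[y_j | x_j] when sigma2 is estimated."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((x_i - x_bar) ** 2 for x_i in x)
    se_y_hat = math.sqrt(sigma2_hat * (1.0 / n + (x_j - x_bar) ** 2 / sxx))
    return t_crit * se_y_hat


x = [float(i) for i in range(32)]          # made-up x levels, mean 15.5
sigma2_hat = 4.0                           # assumed estimate of the error variance

width_at_mean = mean_response_halfwidth(x, sigma2_hat, x_j=15.5)
width_at_20 = mean_response_halfwidth(x, sigma2_hat, x_j=20.0)
width_at_30 = mean_response_halfwidth(x, sigma2_hat, x_j=30.0)
```

The only place $x_j$ enters is through $(x_j - \bar{x})^2$, so the half-width is smallest at $x_j = \bar{x}$ and grows as $x_j$ moves away from $\bar{x}$ in either direction.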
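1.7.1 What if I don't know the distribution of $\varepsilon$ but am willing to assume $E[\varepsilon] = 0$ and that $\varepsilon$ has a finite but unknown variance?

We are now assuming a LRM, but not a CLRM.

In this case we can still do OLS estimation and, as we saw, $\alpha_{ls}$, $\beta_{ls}$, and $\hat{y}_j$ are BLUE estimators. We can also calculate

$$\hat{\sigma}_\varepsilon^2 = \frac{\sum_{i}(y_i - \alpha_{ls} - \beta_{ls} x_i)^2}{n - 2}$$

and

$$\hat{\sigma}^2_{\beta_{ls}} = \frac{\hat{\sigma}_\varepsilon^2}{\sum_{i}\tilde{x}_i^2}$$

To do hypothesis tests or interval estimation on $\beta$, we need to determine the distribution of

$$\frac{\beta_{ls} - \beta}{\hat{\sigma}_{\beta_{ls}}}$$

Note that we cannot assume that $\frac{\beta_{ls} - \beta}{\sigma_{\beta_{ls}}} \sim N(0, 1)$. If it were normal, one could determine (above) that $\frac{\beta_{ls} - \beta}{\hat{\sigma}_{\beta_{ls}}} \sim t_{n-2}$, but now we can't determine the distribution. To do so we need to know the distribution of $\varepsilon$, which we do not.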
1.7.2 What if we know the distribution of $\varepsilon$ and it is not normal?

Now we are assuming a LRM and knowledge of $f(\varepsilon)$, which is not normal. So we are not assuming the CLR model.

Assume, for example, $\varepsilon \sim S(0, \sigma_\varepsilon^2)$ where $S$ denotes the Snerd distribution, where the Snerd is not the normal distribution - to start, assume you know $\sigma_\varepsilon^2$.

In this case, is $\frac{\beta_{ls} - \beta}{\sigma_{\beta_{ls}}} \sim S(0, 1)$? That is, does it have a standardized Snerd distribution? The answer is sometimes but not always.[20] If you could show that $\frac{\beta_{ls} - \beta}{\sigma_{\beta_{ls}}} \sim S(0, 1)$, one could do confidence intervals and hypothesis tests for assumed values of $\sigma_\varepsilon^2$.

[20] For example, if $\varepsilon$ had a t distribution, $\beta_{ls}$ would not have a t distribution. But we know that if $\varepsilon$ is normal then $\beta_{ls}$ is normal.

If one replaces $\sigma_\varepsilon$ with $\hat{\sigma}_\varepsilon$, $\frac{\beta_{ls} - \beta}{\hat{\sigma}_{\beta_{ls}}}$ will definitely not have a Snerd distribution or a Student t distribution. In theory, one could figure out the distribution of this rv (along the lines we did it assuming normality) and then do hypothesis tests and confidence intervals. This could be tough.

Now again assume you know $\sigma_\varepsilon^2$, continuing to assume $\varepsilon$ has a Snerd distribution.

To simulate estimated confidence intervals for $\alpha$ and $\beta$, one might proceed as follows:

- Assume the data-generating process for your real-world population of interest is the LRM with $\varepsilon \sim S(0, \sigma_\varepsilon^2)$, where the value of $\sigma_\varepsilon^2$ is known - $S(0, \sigma_\varepsilon^2)$ is completely specified.
- Estimate $\alpha$ and $\beta$ for this sample, obtaining $\alpha_{ls}$ and $\beta_{ls}$.
- Then assume $\alpha_{ls}$ and $\beta_{ls}$ are the population parameters; that is, your pseudo data-generating process is $y_i = \alpha_{ls} + \beta_{ls} x_i + \varepsilon_i$ where $\varepsilon \sim S(0, \sigma_\varepsilon^2)$.
- For the vector of $x$, $x_1, x_2, \ldots, x_i, \ldots, x_n$, generate $S$ different random samples of size $n$ based on the pseudo data-generating process; make $S$ a large number.
- For each sample $s$, estimate $\alpha_s$ and $\beta_s$.
- Plot the distribution of the $S$ $\alpha_s$ and the distribution of the $S$ $\beta_s$. The former is an estimated sampling distribution for $\alpha_{ls}$, centered on $\alpha_{ls}$; the latter is an estimated sampling distribution for $\beta_{ls}$, centered on $\beta_{ls}$.
- A 95% confidence interval for each can be estimated by lopping off the top and bottom 2.5% of each of these estimated distributions.
Note, these estimated confidence intervals are a function of the initial random sample from your population, the assumption that one has a LRM, the assumption that $\varepsilon \sim S(0, \sigma_\varepsilon^2)$, that $\sigma_\varepsilon^2$ is known, and $n$: they are definitely a function of the Snerd assumption and $\sigma_\varepsilon^2$. The larger $n$, the shorter the confidence interval.

Note, one does not need to theoretically derive either $f(\alpha_{ls})$ or $f(\beta_{ls})$: both were approximated by simulation.

Now continue to assume $\varepsilon$ has a Snerd distribution, but now assume the value of $\sigma_\varepsilon^2$ is unknown.

To simulate estimated confidence intervals for $\alpha$, $\beta$ and $\sigma_\varepsilon^2$, one might proceed as follows:

- Assume the data-generating process for your real-world population of interest is the LRM with $\varepsilon \sim S(0, \sigma_\varepsilon^2)$, where the value of $\sigma_\varepsilon^2$ is unknown.
- Estimate $\alpha$ and $\beta$ for this sample, and use these estimates to estimate $\sigma_\varepsilon^2$ with $\hat{\sigma}_\varepsilon^2$.
- Then assume $\alpha_{ls}$, $\beta_{ls}$ and $\hat{\sigma}_\varepsilon^2$ are the population parameters; that is, your pseudo data-generating process is $y_i = \alpha_{ls} + \beta_{ls} x_i + \varepsilon_i$ where $\varepsilon \sim S(0, \hat{\sigma}_\varepsilon^2)$.
- For the vector of $x$, $x_1, x_2, \ldots, x_i, \ldots, x_n$, generate $S$ different random samples of size $n$ based on the pseudo data-generating process; make $S$ a large number.
- For each sample $s$, estimate $\alpha_s$ and $\beta_s$, and then use them to estimate $\hat{\sigma}^2_{\varepsilon, s}$.
- Plot the distribution of the $S$ $\alpha_s$, the distribution of the $S$ $\beta_s$, and the distribution of the $S$ $\hat{\sigma}^2_{\varepsilon, s}$. The first is an estimated sampling distribution for $\alpha_{ls}$, centered on $\alpha_{ls}$; the second is an estimated sampling distribution for $\beta_{ls}$, centered on $\beta_{ls}$; and the third is an estimated sampling distribution of $\hat{\sigma}_\varepsilon^2$, centered on $\hat{\sigma}_\varepsilon^2$.
- A 95% confidence interval for each can be estimated by lopping off the top and bottom 2.5% of each of these estimated distributions.

Note, these estimated confidence intervals are a function of the initial random sample from your population, the assumption that one has a LRM, the assumption that $\varepsilon \sim S(0, \sigma_\varepsilon^2)$, and $n$: they are definitely a function of the Snerd assumption. They are not a function of the value of $\sigma_\varepsilon^2$. The larger $n$, the shorter these confidence intervals.

Note, one does not need to theoretically derive $f(\hat{\sigma}_\varepsilon^2)$, $f(\alpha_{ls})$ or $f(\beta_{ls})$: they were approximated by simulation.
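The simulation procedure for the unknown-variance case can be sketched as below (this kind of procedure is commonly called a parametric bootstrap). Since the Snerd distribution is fictional, a Laplace distribution scaled to the required variance stands in for it; that substitution, the made-up data, the number of simulated samples and all function names are assumptions of this illustration.

```python
import math
import random


def ols_fit(x, y):
    """Least-squares estimates of alpha and beta, plus SSR / (n - 2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((x_i - x_bar) ** 2 for x_i in x)
    beta_ls = sum((x_i - x_bar) * y_i for x_i, y_i in zip(x, y)) / sxx
    alpha_ls = y_bar - beta_ls * x_bar
    ssr = sum((y_i - alpha_ls - beta_ls * x_i) ** 2 for x_i, y_i in zip(x, y))
    return alpha_ls, beta_ls, ssr / (n - 2)


def draw_error(sigma2, rng):
    # Stand-in for the "Snerd": Laplace with mean 0 and variance sigma2,
    # built as the difference of two exponentials with mean b (variance 2 * b^2).
    b = math.sqrt(sigma2 / 2.0)
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)


def percentile_interval(values):
    """Lop off the bottom and top 2.5% of the simulated values."""
    ordered = sorted(values)
    return ordered[int(0.025 * len(ordered))], ordered[int(0.975 * len(ordered)) - 1]


def simulate_intervals(x, y, n_sims=2000, seed=0):
    rng = random.Random(seed)
    a0, b0, s0 = ols_fit(x, y)          # treated as the pseudo population parameters
    draws = []
    for _ in range(n_sims):
        y_s = [a0 + b0 * x_i + draw_error(s0, rng) for x_i in x]
        draws.append(ols_fit(x, y_s))
    names = ("alpha", "beta", "sigma2")
    return {name: percentile_interval([d[j] for d in draws])
            for j, name in enumerate(names)}


# Made-up "real-world" sample.
data_rng = random.Random(1)
x = [float(i) for i in range(20)]
y = [1.0 + 2.0 * x_i + draw_error(1.0, data_rng) for x_i in x]
alpha_ls, beta_ls, sigma2_hat = ols_fit(x, y)
intervals = simulate_intervals(x, y)
```

For the known-variance case one would pass the known $\sigma_\varepsilon^2$ to `draw_error` instead of the estimate. The intervals depend on the assumed error distribution: swapping the Laplace stand-in for another distribution would change them.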