Linear Models: The full rank model estimation


2 Linear models. We remind ourselves what a linear model is. We have n subjects, labelled 1 to n. We wish to analyse or predict the behaviour of a measurement or property of each subject (the y variable), denoted y_1, y_2, ..., y_n. Each subject has certain other properties that we know or have measured (the x variables); subject i has k of these properties, x_{i1}, x_{i2}, ..., x_{ik}.

3 The linear model is y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ... + β_k x_{ik} + ε_i for all i = 1, 2, ..., n.

4 In matrix form this reads
[y_1; y_2; ...; y_n] = [1 x_{11} x_{12} ... x_{1k}; 1 x_{21} x_{22} ... x_{2k}; ...; 1 x_{n1} x_{n2} ... x_{nk}] [β_0; β_1; β_2; ...; β_k] + [ε_1; ε_2; ...; ε_n],
or, more compactly, y = Xβ + ε.

5 Note that under the terminology we have developed, y and ε are random vectors. A common assumption is that ε is a normal random vector with mean 0 and variance σ²I. As mentioned before, X and β are NOT random vectors. Although it is common for X to be a measurement, technically treating it as random is wrong: the model treats X as deterministic, and only the error term is subject to random variation.

6 The full rank model. The full rank model is very simple: it arises when the design matrix X has full rank, i.e. r(X) = k + 1. This small condition has critical importance in the analysis of the model, so much so that we divide the cases into full rank and less than full rank. For this section, we assume that X is of full rank. This means that X^T X is invertible, i.e. (X^T X)^{-1} exists.
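
As a quick illustrative sketch (not part of the original notes), the full rank condition can be checked numerically; NumPy is assumed, and the small design matrix below is invented for illustration.

```python
import numpy as np

# Hypothetical 5 x 3 design matrix: an intercept column plus two x-variables.
X = np.array([
    [1.0,  3.0, 1.2],
    [1.0,  7.0, 1.5],
    [1.0, 12.0, 2.0],
    [1.0, 20.0, 2.4],
    [1.0, 25.0, 3.1],
])

print("r(X) =", np.linalg.matrix_rank(X))        # equals k + 1 = 3 when X has full rank
print("det(X^T X) =", np.linalg.det(X.T @ X))    # nonzero exactly when X^T X is invertible
```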

7 Example. We want to analyse the selling price of a house (y). We think that the price of a house depends on two variables, its age (x_1) and its area (x_2). Our linear model takes the form y = β_0 + β_1 x_1 + β_2 x_2 + ε. We sample 5 random houses and obtain data on Price (in units of $10k), Age (in years) and Area (in units of 100 m²).

8 The model generates the 5 linear equations
50 = β_0 + 1β_1 + 1β_2 + ε_1
40 = β_0 + 5β_1 + 1β_2 + ε_2
52 = β_0 + 5β_1 + 2β_2 + ε_3
47 = β_0 + x_{41}β_1 + 2β_2 + ε_4
65 = β_0 + x_{51}β_1 + 3β_2 + ε_5

9 The matrix form of the model is y = Xβ + ε, where y is the 5 × 1 vector of observed prices and X is the 5 × 3 matrix whose ith row is (1, x_{i1}, x_{i2}).

10 Here β = (β_0, β_1, β_2)^T and ε = (ε_1, ε_2, ε_3, ε_4, ε_5)^T. Direct calculation will show that X is of full rank. This is an example of multiple regression.

11 Example. Simple linear regression can be cast in the framework of a linear model, where the response variable y depends on only one variable x: y = β_0 + β_1 x + ε. If we have n responses, this gives the linear equations y_1 = β_0 + β_1 x_1 + ε_1, y_2 = β_0 + β_1 x_2 + ε_2, ..., y_n = β_0 + β_1 x_n + ε_n.

12 In the matrix formulation we have y = (y_1, y_2, ..., y_n)^T, X is the n × 2 matrix whose ith row is (1, x_i), β = (β_0, β_1)^T and ε = (ε_1, ε_2, ..., ε_n)^T. We will show later how the linear model framework can be used to derive the well-known regression formulas for the parameters β_0 and β_1.

13 Parameter estimation using least squares. The first thing we want to do with the linear model is to estimate the parameters β_0, β_1, ..., β_k. We do this using the method of least squares. Firstly, we assume that the error vector ε has mean 0 and variance σ²I; in other words, that the model is unbiased and that the errors have a common variance and are uncorrelated with each other. We do NOT necessarily assume that the errors are independent of each other.

14 Since the error term is the only random term in the model, this means that E[y] = Xβ and var y = σ²I. In particular, this means that the expected value of each response is a linear function of the parameters (as you would expect in a linear model).

15 How can we estimate the true values of the parameters? Consider the elements of the error vector: ε_i = y_i − (β_0 + β_1 x_{i1} + β_2 x_{i2} + ... + β_k x_{ik}) = y_i − E[y_i]. What happens to this value if we have the correct model (including correct parameter values)?

16 If we have the correct parameter values and model, the error is likely to be much smaller than it would be if we had the wrong parameter values. Therefore, it makes sense to estimate the parameters in such a way that the errors are minimised in some sense. However, we do not know the true expected values and as such cannot calculate the errors! Instead, we calculate the residuals.

17 Suppose that we have some estimates of the parameters, b_0, b_1, ..., b_k. Then we can estimate the expected value of y_i by Ê[y_i] = b_0 + b_1 x_{i1} + ... + b_k x_{ik}. The ith residual is defined to be the difference between the observed value and the estimated value: e_i = y_i − Ê[y_i]. If our estimates are good, the residuals should be very close to the errors.

18 Since we want to minimise the errors but cannot calculate them, we instead choose our estimates to minimise the residuals. But how do we minimise? The natural answer is to minimise the sum of the residuals, but on further inspection this is a bad idea. Residuals can be either positive or negative, and when summed they cancel each other out, making us think we have a good fit. However, we want to eliminate residuals which are large in either the positive or the negative direction. In fact, under some not too restrictive conditions, it can be shown that the residuals will always sum to 0!

19 What about summing the absolute values of the residuals? This is not incorrect, but it is a bit inconvenient mathematically because the absolute value function is not differentiable everywhere. To avoid this, we minimise the sum of the squares of the residuals: min Σ_{i=1}^n e_i² = min e^T e. This results in the least squares estimators of the parameters.

20 How do we calculate the least squares estimators? First we write our observed responses in terms of the (as yet unknown) estimated parameters and the residuals: y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + ... + b_k x_{ik} + e_i.

21 Then we define the vectors of estimated parameters and residuals: b = (b_0, b_1, ..., b_k)^T and e = (e_1, e_2, ..., e_n)^T. Then we can write our observed responses as y = Xb + e. This looks like the linear model, but it is slightly different: it uses the estimated parameters rather than the actual parameters, and therefore we have the residuals instead of the errors.

22 Now we want to minimise
e^T e = (y − Xb)^T (y − Xb)
      = y^T y − y^T Xb − b^T X^T y + b^T X^T Xb
      = y^T y − 2y^T Xb + b^T X^T Xb
      = y^T y − 2(X^T y)^T b + b^T (X^T X)b
with respect to b (since that is the only thing we can control).

23 For this expression to be at a minimum, we need ∂(e^T e)/∂b = 0. This is where we can use the vector differentiation that we developed earlier! First, ∂(y^T y)/∂b = 0, since y (our measurements) does not depend on b (our parameter estimates).

24 Next, ∂(−2(X^T y)^T b)/∂b = −2X^T y, and ∂(b^T (X^T X)b)/∂b = (X^T X)b + (X^T X)^T b = 2(X^T X)b. Therefore, to be at a minimum we need −2X^T y + 2(X^T X)b = 0.

25 Rearranging gives the normal equations: X^T Xb = X^T y. As we observed before, because X is of full rank, X^T X has an inverse. Therefore we can solve for b to find the least squares estimator b = (X^T X)^{-1} X^T y.

26 Theorem. Let y = Xβ + ε, where X is an n × (k + 1) matrix of full rank, β is a (k + 1) × 1 vector of unknown parameters, and ε is an n × 1 random vector with mean 0 and variance σ²I. Then the least squares estimator for β is given by b = (X^T X)^{-1} X^T y.
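
The theorem translates directly into a few lines of NumPy. This is a hedged sketch with invented data, not part of the notes; in practice np.linalg.lstsq is preferred over forming (X^T X)^{-1} explicitly, but the two agree for a full rank design.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # n x (k+1), full rank
beta = np.array([2.0, -1.0, 0.5])                            # "true" parameters (invented)
y = X @ beta + rng.normal(scale=0.3, size=n)                 # responses drawn from the model

b_formula = np.linalg.inv(X.T @ X) @ X.T @ y                 # b = (X^T X)^{-1} X^T y
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)              # numerically stabler solver
print(np.allclose(b_formula, b_lstsq))                       # True
```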

27 Example. We return to the house price example presented earlier. Our data are the house prices (the response vector y) and the design matrix X built from a column of 1s, the ages and the areas. Matrix calculations give X^T X and X^T y.

28 We can then find the inverse of X^T X, which gives the least squares estimator as b = (X^T X)^{-1} X^T y.

29 Therefore our estimated model is y = b_0 + b_1 x_1 + b_2 x_2 + e, with the numerical values of b_0, b_1 and b_2 computed above.

30 Example. Recall that the simple linear regression model can be written as a linear model with two parameters, y = β_0 + β_1 x + ε, which gives y = (y_1, y_2, ..., y_n)^T and X the n × 2 matrix whose ith row is (1, x_i).

31 Then (since X is n × 2), with all sums running over i = 1, ..., n,
X^T X = [ n, Σ x_i ; Σ x_i, Σ x_i² ]   and   X^T y = [ Σ y_i ; Σ x_i y_i ].

32 Finding the inverse of the 2 × 2 matrix X^T X gives us
(X^T X)^{-1} = (1 / (n Σ x_i² − (Σ x_i)²)) [ Σ x_i², −Σ x_i ; −Σ x_i, n ].
Therefore the least squares estimator for β is

33 b = (X^T X)^{-1} X^T y
  = (1 / (n Σ x_i² − (Σ x_i)²)) [ Σ x_i², −Σ x_i ; −Σ x_i, n ] [ Σ y_i ; Σ x_i y_i ]
  = (1 / (n Σ x_i² − (Σ x_i)²)) [ Σ x_i² Σ y_i − Σ x_i Σ x_i y_i ; n Σ x_i y_i − Σ x_i Σ y_i ].

34 Breaking it down into individual coefficients, the estimator for the slope of the regression line is
b_1 = (n Σ x_i y_i − Σ x_i Σ y_i) / (n Σ x_i² − (Σ x_i)²),
which may be familiar to you from standard linear regression courses. The estimator for the intercept of the regression line is
b_0 = (Σ x_i² Σ y_i − Σ x_i Σ x_i y_i) / (n Σ x_i² − (Σ x_i)²),
which may not look familiar. Usually the estimator is given as ȳ − b_1 x̄, but it is quite simple to show that they are the same.
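
A small sketch (invented data, NumPy assumed, not part of the notes) confirming that the scalar formulas above reproduce the familiar form b_0 = ȳ − b_1 x̄.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=20)                      # invented predictor values
y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=20)       # invented responses
n = len(x)

Sx, Sy, Sxx, Sxy = x.sum(), y.sum(), (x * x).sum(), (x * y).sum()
b1 = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)           # slope formula from the notes
b0 = (Sxx * Sy - Sx * Sxy) / (n * Sxx - Sx ** 2)         # intercept formula from the notes

print(np.isclose(b0, y.mean() - b1 * x.mean()))          # matches the usual ybar - b1*xbar
```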

35 How good is the least squares estimator, really? Here is where some random vector theory comes in handy! Theorem. In the above model, the least squares estimator b = (X^T X)^{-1} X^T y is an unbiased estimator for β; in other words, E[b] = β. Furthermore, var b = (X^T X)^{-1} σ².

36 Proof.
E[b] = E[(X^T X)^{-1} X^T y] = (X^T X)^{-1} X^T E[y] = (X^T X)^{-1} X^T (Xβ) = Iβ = β.
var b = var((X^T X)^{-1} X^T y) = (X^T X)^{-1} X^T (σ²I) ((X^T X)^{-1} X^T)^T = (X^T X)^{-1} X^T X ((X^T X)^T)^{-1} σ² = (X^T X)^{-1} σ².
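
The unbiasedness and variance claims can also be checked by simulation. The following Monte Carlo sketch (invented design and parameters, NumPy assumed) is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 30, 0.7
X = np.column_stack([np.ones(n), rng.uniform(0.0, 5.0, size=n)])   # a fixed, full rank design
beta = np.array([1.0, 2.0])

estimates = []
for _ in range(20000):                                   # repeatedly sample y from the model
    y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
estimates = np.array(estimates)

print(estimates.mean(axis=0))                            # close to beta       (E[b] = beta)
print(np.cov(estimates.T))                               # close to the matrix below
print(sigma ** 2 * np.linalg.inv(X.T @ X))               # (X^T X)^{-1} sigma^2
```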

39 We now know that the least squares estimator is good but really, how good? How about the best? But what do we mean by the best? If you think about it, our estimators are really just random variables which are functions of our observations. As shown before, they have an expectation and a variance. The expectation is already as good as it can be. The ideal situation would now be for the variance to be 0, because then we would know the actual parameters with certainty. Unfortunately, this is generally impossible, but in lieu of that, we want the variances to be as low as possible.

40 There are many kinds of estimators, but we concentrate on linear estimators. These are estimators which take the form Ly, where L is some matrix of constants. In particular, the least squares estimator is a linear estimator with L = (X^T X)^{-1} X^T. Definition. Suppose you have a model with parameters β and some linear estimator b for these parameters. If E[b] = β and the variances of b_0, b_1, ..., b_k are minimised over all linear unbiased estimators, then b is called a best linear unbiased estimator of β (or BLUE for short).

41 Gauss-Markov Theorem. Theorem (Gauss-Markov). Let y = Xβ + ε, where X is an n × (k + 1) matrix of full rank, β is a (k + 1) × 1 vector of unknown parameters, and ε is an n × 1 random vector with mean 0 and variance σ²I. Then the least squares estimator b is the best linear unbiased estimator for β.

42 Proof. Suppose that we have another unbiased linear estimator for β, called b*. We can express this in the form b* = [(X^T X)^{-1} X^T + B]y, where B is a (k + 1) × n matrix. We then take expectations of both sides: E[b*] = [(X^T X)^{-1} X^T + B]E[y] = [(X^T X)^{-1} X^T + B]Xβ = [I + BX]β.

43 Since b* is an unbiased estimator for β, we know that E[b*] = β. Therefore [I + BX]β = β, which means that BX = 0. Now look at the variance of the estimator:
var b* = var([(X^T X)^{-1} X^T + B]y)
       = [(X^T X)^{-1} X^T + B] σ²I [(X^T X)^{-1} X^T + B]^T
       = σ² [(X^T X)^{-1} X^T + B][X(X^T X)^{-1} + B^T]
       = σ² [(X^T X)^{-1} X^T X(X^T X)^{-1} + (X^T X)^{-1} X^T B^T + BX(X^T X)^{-1} + BB^T].

44 Since BX = 0, we also have X^T B^T = (BX)^T = 0. Then
var b* = σ² [(X^T X)^{-1} + BB^T] = (X^T X)^{-1} σ² + BB^T σ² = var b + BB^T σ².
Now we want to minimise the variances of b*_0, b*_1, ..., b*_k, with the covariances being relatively unimportant.

45 The variances are given by the diagonal elements of the covariance matrix:
var b*_i = [var b*]_{ii} = var b_i + σ² Σ_{j=1}^n B_{ij}².
Each term in the sum is non-negative, so the variance of b*_i can never go below var b_i. Moreover, the minimum is obtained if and only if B_{ij} = 0 for all i, j, in which case B = 0 and b* = b. This means that not only is the least squares estimator the best linear unbiased estimator for β, but it is the only BLUE for β.

46 Example. Consider the house price example. The variance of the least squares estimators is given by (X^T X)^{-1} σ². This means that there is no linear unbiased estimator of β_0 which has a smaller variance than 2.31σ², and no linear unbiased estimator of β_1 which has a smaller variance than 0.03σ², etc. This is true even though we don't (yet) know what σ² is (although we will get to that!).

48 This is all well and good, but what if we want to estimate something other than the parameters? In particular, we are often interested in estimating some linear function of the parameters, t^T β, where t is a (k + 1) × 1 vector of constants. How can we estimate these? It turns out that the correct answer is the obvious one: we simply take the identical linear function of the least squares estimator.

49 Theorem. Let y = Xβ + ε, where X is an n × (k + 1) matrix of full rank, β is a (k + 1) × 1 vector of unknown parameters, and ε is an n × 1 random vector with mean 0 and variance σ²I. Let t be a (k + 1) × 1 vector of constants. Then the best linear unbiased estimator for t^T β is t^T b, where b is the least squares estimator for β. The proof of this theorem is very similar to that of the Gauss-Markov theorem.

50 The most common use of the above theorem is to predict the value of the response variable, given certain values of the x's. Example. Consider the house price example we have seen before. The model is y = β_0 + β_1 x_1 + β_2 x_2 + ε, where y is the house price, x_1 is its age, and x_2 is its area. Suppose we are now given a specific house with age x_1* and area x_2*, and we wish to estimate what price it will fetch.

51 Then we want to estimate the linear function of the parameters E[y] = β_0 + β_1 x_1* + β_2 x_2* = t^T β, where t = [1, x_1*, x_2*]^T. Therefore an unbiased estimator for the house price is t^T b = [1, x_1*, x_2*] b = b_0 + b_1 x_1* + b_2 x_2*, where b is the least squares estimator for β.

52 For example, suppose we have a house which is 15 years old and has an area of 250 m². From previous examples we have the least squares estimator b, and therefore the estimated price of the house is Ê[y] = b_0 + 15 b_1 + 2.5 b_2 = 57.02. In other words, we expect the house to sell for $57,020.

53 Variance estimation. Another quantity we will want to estimate is the variance of the errors, which is also the variance of the response variables. Remember that we are assuming that y has the covariance matrix σ²I; in other words, that each observation has 0 correlation with the others, and identical variance to the others. We will want to estimate σ². One reason that we want to do this is so that we can create confidence intervals for the true values of the parameters.

54 How should we estimate σ²? If you think about it, σ² can be written as
σ² = E[(y − Xβ)^T (y − Xβ) / n],
and so a reasonable estimator for the variance is
σ̂² = (y − Xb)^T (y − Xb) / n.
The numerator of this estimator is the sum of the squares of the residuals. Since the residuals reflect variation that is not explained by the model, this seems reasonable.

55 Is the estimator unbiased? We take expectations:
E[σ̂²] = (1/n) E[(y − Xb)^T (y − Xb)]
       = (1/n) E[(y − X(X^T X)^{-1} X^T y)^T (y − X(X^T X)^{-1} X^T y)]
       = (1/n) E[y^T (I − X(X^T X)^{-1} X^T)(I − X(X^T X)^{-1} X^T) y].
It is a simple linear algebra exercise to show that I − X(X^T X)^{-1} X^T is idempotent, which gives
E[σ̂²] = (1/n) E[y^T (I − X(X^T X)^{-1} X^T) y].

56 Now recall from random vector theory that the expectation of a quadratic form y^T Ay is E[y^T Ay] = tr(AV) + μ^T Aμ, where μ is the mean of y and V is the variance of y. But in this case,
μ^T Aμ = (Xβ)^T (I − X(X^T X)^{-1} X^T) Xβ = β^T X^T Xβ − β^T X^T X(X^T X)^{-1} X^T Xβ = 0

57 and tr(av ) = tr((i n X (X T X ) 1 X T )σ 2 I n ) = σ 2 (tr(i n ) tr(x (X T X ) 1 X T )) = σ 2 (n tr((x T X ) 1 X T X )) = σ 2 (n tr(i k+1 )) = σ 2 (n (k + 1))

58 Therefore E[σ̂²] = σ² (n − (k + 1)) / n. This is not quite what we want! To ensure unbiasedness, we instead use the estimator
s² = (n / (n − (k + 1))) [(y − Xb)^T (y − Xb) / n] = (y − Xb)^T (y − Xb) / (n − (k + 1)).

59 If we write the sum of squares of the residuals as SS_Res = (y − Xb)^T (y − Xb) and denote the number of parameters by p = k + 1, then this gives us the standard form of the estimator: s² = SS_Res / (n − p). We call this quantity the sample variance.
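
As a sketch (not from the notes, NumPy assumed), the sample variance can be computed directly from the residuals; p here counts all columns of X, including the intercept column, matching p = k + 1 above.

```python
import numpy as np

def sample_variance(X, y):
    """Unbiased estimate s^2 = SS_Res / (n - p) for a full rank linear model."""
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares estimator
    residuals = y - X @ b
    ss_res = residuals @ residuals              # SS_Res = e^T e
    return ss_res / (n - p)
```

For regression through the origin (discussed later), one would simply pass X without the column of 1s, so that p automatically equals k.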

60 Example. Back to the house price example. We have estimated our parameters b above, and so our residuals are e = y − Xb.

61 Therefore SS_Res = (−2.83)² + (−1.55)² + (−5.6)² + ··· (summing all five squared residuals), and the sample variance is s² = SS_Res / (5 − 3).

62 Example. A study is designed to predict the extent of the cracking of latex paint in field conditions, based on the extent of the cracking in accelerated tests in the laboratory. A simple linear regression model with two parameters is assumed, and we collect data on the test cracking (x) and the actual cracking (y) for six specimens.

63 The data matrices y and X are formed as before (a column of 1s and the test cracking values). By direct calculation we obtain X^T X and X^T y.

64 This gives the least squares estimator of the parameters, b = (X^T X)^{-1} X^T y. Then the estimated expected cracking rate for each entry is Xb.

65 The residuals are e = y − Xb. Therefore we can estimate the common variance of the response variables to be s² = e^T e / (n − p) = Σ_{i=1}^6 e_i² / (6 − 2) = 0.27.

66 Regression through the origin. We note that so far we have always considered the linear model to include a parameter β_0, which is associated with a column of 1s in the design matrix X. In multiple regression terms, this parameter is called the intercept. Sometimes it is reasonable to assume (from prior knowledge of the data) that no intercept is needed; then we can cut it out. However, surprisingly little changes. The model becomes y = β_1 x_1 + β_2 x_2 + ... + β_k x_k + ε, but to analyse it, the design matrix simply loses the first column, the parameter vector loses the first row, and everything proceeds as normal.

67 Thus the least squares estimator is still b = (X^T X)^{-1} X^T y and the variance estimator is still s² = SS_Res / (n − p). The only adjustment that we need to make is that the number of parameters, p, is now k instead of k + 1.

68 Maximum likelihood estimation. You may have come across maximum likelihood estimation before: this is where we estimate parameters so that the probability of the observed values occurring is maximised. We can in fact apply this methodology to estimate the parameters of the linear model. Firstly, we need to assume a distribution for the errors. We assume that the errors are jointly normally distributed and, as before, have mean 0 and variance σ²I. In particular, this means that the errors are independent, not just uncorrelated.

69 Now, given observed values of the response variables y, the observed errors are y − Xβ (although since we have not yet estimated β, these are unknown). Since the errors are independent, the likelihood function is given by
L(ε; σ²) = Π_{i=1}^n (1/(σ√(2π))) e^{−ε_i²/(2σ²)}
         = (2πσ²)^{−n/2} e^{−(1/(2σ²)) Σ_{i=1}^n ε_i²}
         = (2πσ²)^{−n/2} e^{−(1/(2σ²)) (y − Xβ)^T (y − Xβ)}.

70 We want to maximise this quantity with respect to β to generate maximum likelihood estimators for β. To do this, we differentiate with respect to β and set the result to be 0. Actually, this turns out to be very hard, so we do the equivalent: differentiate ln L(ε; σ²) with respect to β and set the result to be 0. We have
ln L(ε; σ²) = −(n/2) ln(2πσ²) − (1/(2σ²)) (y − Xβ)^T (y − Xβ).

71 Note that the first term is constant in β, so using the derivative rules we worked out earlier,
∂/∂β ln L(ε; σ²) = −(1/(2σ²)) ∂/∂β (y^T y − 2(X^T y)^T β + β^T (X^T X)β) = −(1/(2σ²)) (−2X^T y + 2(X^T X)β) = 0,
which gives (X^T X)β = X^T y. These are exactly the normal equations!

72 Theorem. Let y = Xβ + ε, where X is an n × (k + 1) matrix of full rank, β is a (k + 1) × 1 vector of unknown parameters, and ε is an n × 1 normally distributed random vector with mean 0 and variance σ²I. Then the maximum likelihood estimator for β is also the least squares estimator, b = (X^T X)^{-1} X^T y.

73 Theorem. Let y = Xβ + ε, where X is an n × (k + 1) matrix of full rank, β is a (k + 1) × 1 vector of unknown parameters, and ε is an n × 1 normally distributed random vector with mean 0 and variance σ²I. Then the maximum likelihood estimator for σ² is given by σ̃² = SS_Res / n. Note that this is not the least squares estimator! So maximum likelihood and least squares do not necessarily agree; it just happens that they do for the parameter estimators (which means they must be good!).
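
A short sketch (invented data, NumPy assumed, not part of the notes) showing the two variance estimators side by side; they differ only in the divisor.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # invented full rank design
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_res = np.sum((y - X @ b) ** 2)

print("maximum likelihood:", ss_res / n)        # SS_Res / n, biased low by a factor (n - p)/n
print("sample variance   :", ss_res / (n - p))  # SS_Res / (n - p), the unbiased s^2
```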

74 Sufficiency. The Gauss-Markov theorem states that the least squares estimators are the best linear unbiased estimators for the parameters β. The above theorem also shows that if we assume that the errors are normally distributed, the least squares estimator is also the maximum likelihood estimator. In fact, given the assumption of normality, we can say something stronger: the least squares estimators use all of the relevant information about the parameters that is contained in the observed response variables. For a more technical statement, we define the notion of sufficiency.

75 Definition (Fisher-Neyman factorization). Let X denote a random variable whose density depends on a single parameter θ, and let X_1, X_2, ..., X_n be a random sample drawn from this distribution, with joint density f(x_1, x_2, ..., x_n; θ). Then the statistic Y = u(X_1, X_2, ..., X_n) is sufficient for θ if and only if f can be expressed as f(x_1, x_2, ..., x_n; θ) = g(y; θ) h(x_1, x_2, ..., x_n). In other words, we must be able to factorise the density into one part which depends only on Y and θ (and not on the x's directly), and another part which depends only on the x's (and not at all on θ).

76 Example. Suppose we have a random sample from a Poisson distribution with parameter λ. The density for a single one of these variables (X_1, say) is f(x_1; λ) = e^{−λ} λ^{x_1} / x_1!. Because the samples are independent, the joint density is the product of all the individual densities.

77 f(x_1, x_2, ..., x_n; λ) = Π_{i=1}^n f(x_i; λ) = Π_{i=1}^n e^{−λ} λ^{x_i} / x_i! = e^{−nλ} λ^{Σ x_i} (Π_{i=1}^n x_i!)^{-1}. It can now be seen that the statistic Σ_{i=1}^n x_i is sufficient for λ.

78 Theorem. Let y = Xβ + ε, where X is an n × (k + 1) matrix of full rank, β is a (k + 1) × 1 vector of unknown parameters, and ε is an n × 1 normally distributed random vector with mean 0 and variance σ²I. Then the estimators of the parameters and the variance, b = (X^T X)^{-1} X^T y and s² = SS_Res / (n − p), are jointly sufficient for β and σ².

79 Interval estimation. Having created very good point estimates for the parameters, we would now like to find interval estimates to get a truer picture of where the parameters might lie. To do this, we remind ourselves of the Student's t distribution in a way which may not be familiar. Definition. Let Z be a standard normal random variable and let X²_γ be an independent χ² random variable with γ degrees of freedom. Then Z / √(X²_γ / γ) has a t distribution with γ degrees of freedom.

80 Now, to find an interval estimate for the parameters, we first need to know what the distribution of our least squares estimator is. For us to know this, we have to make some assumption about the distribution of the response variables (or errors); remember that we do not need any such assumption to derive the least squares estimators. We go with the old standby: the errors are jointly normally distributed. This means that the response variables are normally distributed too. But b = (X^T X)^{-1} X^T y is a linear combination of the y's!

81 Theorem. Let y = Xβ + ε, where X is of full rank and ε is a normally distributed random vector with mean 0 and variance σ²I. Then b is normally distributed with mean β and variance (X^T X)^{-1} σ². We can use this theorem to give us confidence intervals for the parameters, except there is one small problem: we do not know the real variance σ². We can use s² instead, but that is itself a random variable, so we would like to know its distribution.

82 Theorem. Let y = Xβ + ε, where X is an n × p matrix of full rank and ε is a normally distributed random vector with mean 0 and variance σ²I. Then (n − p)s²/σ² = SS_Res/σ² has a χ² distribution with n − p degrees of freedom. Proof. We have shown earlier (but it is easy to work out) that the residual sum of squares can be expressed as the quadratic form SS_Res = (y − Xb)^T (y − Xb) = y^T [I − X(X^T X)^{-1} X^T] y.

83 It can also be shown that I − X(X^T X)^{-1} X^T is symmetric and idempotent and that it has rank n − p. By assumption, y is a normal random vector with mean Xβ and variance σ²I. By an earlier corollary on the distribution of a quadratic form, (1/σ²) y^T [I − X(X^T X)^{-1} X^T] y has a noncentral χ² distribution, with n − p degrees of freedom and noncentrality parameter λ = (1/(2σ²)) μ^T [I − X(X^T X)^{-1} X^T] μ.

84 But μ = Xβ, so λ = (1/(2σ²)) (Xβ)^T [I − X(X^T X)^{-1} X^T] Xβ = (1/(2σ²)) [β^T X^T Xβ − β^T X^T X(X^T X)^{-1} X^T Xβ] = 0. This means that SS_Res/σ² has a (central) χ² distribution with n − p degrees of freedom.

85 We now have a normal random vector, b, and a χ² variable, SS_Res/σ². We can potentially combine these two together to make a single t-distributed variable, but we need to know that they are independent. Theorem. Let y = Xβ + ε, where X is an n × p matrix of full rank and ε is a normally distributed random vector with mean 0 and variance σ²I. Then b and SS_Res/σ² are independent.

86 Proof. We use the result for the independence of a vector By and a quadratic form y^T Ay: we need BVA = 0, where V is the variance of y. We have b = (X^T X)^{-1} X^T y and SS_Res/σ² = y^T [I − X(X^T X)^{-1} X^T] y / σ², and so
BVA = (X^T X)^{-1} X^T σ²I [I − X(X^T X)^{-1} X^T] / σ² = (X^T X)^{-1} X^T − (X^T X)^{-1} X^T X(X^T X)^{-1} X^T = 0.

87 We are finally ready to use these results to create confidence intervals for the parameters. The first thing that we will try to do is to find a confidence interval for a single parameter, β_i. Consider the covariance matrix of b,
(X^T X)^{-1} σ² = [ c_00 c_01 ... c_0k ; c_10 c_11 ... c_1k ; ... ; c_k0 c_k1 ... c_kk ] σ².

88 The least squares estimator of β_i is b_i. The variance of b_i is the ith diagonal element of the covariance matrix, denoted c_ii σ². Since b_i has a normal distribution, this means that (b_i − β_i)/(σ √c_ii) has a standard normal distribution. Of course, we do not know what σ is.

89 But from all the above theory,
((b_i − β_i)/(σ √c_ii)) / √(SS_Res/(σ²(n − p)))
has a t distribution with n − p degrees of freedom. Simplifying gives
((b_i − β_i)/(σ √c_ii)) / √(s²/σ²) = (b_i − β_i)/(s √c_ii).

90 It is now easy to derive the confidence interval with confidence 100(1 − α)%:
P[−t_{α/2} ≤ (b_i − β_i)/(s √c_ii) ≤ t_{α/2}] = 1 − α
P[−t_{α/2} s √c_ii ≤ b_i − β_i ≤ t_{α/2} s √c_ii] = 1 − α
P[b_i − t_{α/2} s √c_ii ≤ β_i ≤ b_i + t_{α/2} s √c_ii] = 1 − α.
Therefore the confidence interval (using a t distribution with n − p d.f.) is b_i ± t_{α/2} s √c_ii.
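
These intervals are easy to compute in code. The sketch below assumes NumPy and SciPy (scipy.stats.t for the critical value) and is not part of the original notes.

```python
import numpy as np
from scipy import stats

def coef_confidence_intervals(X, y, alpha=0.05):
    """100(1 - alpha)% intervals b_i +/- t_{alpha/2} * s * sqrt(c_ii) for each parameter."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    s2 = np.sum((y - X @ b) ** 2) / (n - p)                    # sample variance
    half = stats.t.ppf(1 - alpha / 2, df=n - p) * np.sqrt(s2 * np.diag(XtX_inv))
    return np.column_stack([b - half, b + half])               # one row per parameter
```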

91 Example. We want to model the amount of a chemical that dissolves in a fixed volume of water. It is known that this depends in part on the water temperature. An experiment is run 6 times, and the temperature (x) and the amount dissolved (y) are measured each time.

92 We use a simple linear regression model with two parameters, y = β_0 + β_1 x + ε. Using the least squares formulae we find b_0 = 1.44, b_1 = 0.31 and the estimate s. First we find a confidence interval on β_0, the intercept. By direct calculation, the top left entry in (X^T X)^{-1} is found to be c_00 = 0.52, and the degrees of freedom are n − p = 6 − 2 = 4.

93 A 95% confidence interval has α/2 = 0.025 and t = 2.78, so the confidence interval is 1.43 ± 2.78 s √0.52 = [−0.30, 3.17]. In other words, we are 95% confident that the true amount of chemical dissolved at 0 temperature lies between −0.30 and 3.17. Notably, since 0 lies inside this interval, we cannot say with 95% confidence that it is untrue that no chemical dissolves at 0 temperature.

94 Next we find a confidence interval on β_1, the slope of the regression. The bottom right entry in (X^T X)^{-1} is calculated to be c_11 = 0.0001, and the degrees of freedom and t-value are the same, which gives the confidence interval 0.31 ± 2.78 s √0.0001 = [0.25, 0.36]. In other words, we are 95% confident that for each rise in temperature of 1 degree, the amount of chemical dissolved goes up by an amount between 0.25 and 0.36. In particular, we are (at least) 95% sure that there is a positive relationship between temperature and chemical dissolved.

95 It is good that we can find confidence intervals for the parameters, but sometimes we want to estimate things other than just the parameters. In particular, we often want to predict the value of the response variable for a given set of inputs. This is an example of the more general case of linear functions of the parameters.

96 Remember that if we want to estimate the function t^T β, the best linear unbiased estimator is t^T b, where b is the least squares estimator of the parameters. We look at the distribution of this estimator. Since b is normally distributed, any linear combination of the b's is normally distributed. We have E[t^T b] = t^T β, since b is an unbiased estimator for β.

97 Variance results give us var(t^T b) = t^T (X^T X)^{-1} σ² t = t^T (X^T X)^{-1} t σ². Therefore
(t^T b − t^T β) / √(t^T (X^T X)^{-1} t σ²)
has a standard normal distribution. But again, we do not know what σ is!

98 The solution should not be difficult to see: since SS_Res/σ² is independent of b, it is independent of t^T b. We divide a standard normal variable by the square root of a (scaled) χ² variable to get a t-distributed variable:
((t^T b − t^T β) / √(t^T (X^T X)^{-1} t σ²)) / √(SS_Res/(σ²(n − p))) = (t^T b − t^T β) / (s √(t^T (X^T X)^{-1} t))
has a t distribution with n − p degrees of freedom. Using similar steps to before, this gives the 100(1 − α)% confidence interval t^T b ± t_{α/2} s √(t^T (X^T X)^{-1} t).

99 In particular, if we want to find a confidence interval for the expected response to a particular set of x variables, we can observe that given the x values x_1*, x_2*, ..., x_k*, the expected response is E[y] = β_0 + β_1 x_1* + ... + β_k x_k* = (x*)^T β, where x* = [1, x_1*, x_2*, ..., x_k*]^T. This is a linear function of β, and therefore the 100(1 − α)% confidence interval for it is (x*)^T b ± t_{α/2} s √((x*)^T (X^T X)^{-1} x*).
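
A sketch of this interval in code (NumPy and SciPy assumed, invented helper name, not part of the notes); x_star must include the leading 1 for the intercept.

```python
import numpy as np
from scipy import stats

def mean_response_ci(X, y, x_star, alpha=0.05):
    """CI for E[y] at x_star: (x*)^T b +/- t_{alpha/2} * s * sqrt((x*)^T (X^T X)^{-1} x*)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    s = np.sqrt(np.sum((y - X @ b) ** 2) / (n - p))
    centre = x_star @ b
    half = stats.t.ppf(1 - alpha / 2, df=n - p) * s * np.sqrt(x_star @ XtX_inv @ x_star)
    return centre - half, centre + half
```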

100 Example. In the house price example, we previously estimated the average selling price of a 15-year-old house with an area of 250 m² to be 57.02. We also found the inverse of X^T X earlier, and it is a simple calculation to evaluate s = √((y − Xb)^T (y − Xb) / (n − p)).

101 Then for a 15-year-old house with an area of 2.5, we get x* = [1, 15, 2.5]^T, and we can compute (x*)^T (X^T X)^{-1} x*, which, using a t distribution with 5 − 3 = 2 degrees of freedom, gives a 95% confidence interval of 57.02 ± 19.38 = [37.64, 76.40].

102 Prediction intervals. We must differentiate between the confidence interval given in the above example and a prediction interval. Given a set of inputs, a 95% confidence interval (say) gives an interval in which we are 95% sure the expected or mean response for those inputs lies. In contrast, given a set of inputs, a 95% prediction interval produces an interval in which we are 95% sure a single new response with those inputs will lie, not the mean. Because we are predicting the value of one response, this interval is wider than the corresponding confidence interval.

103 To find such an interval, suppose we have the inputs x* = [1, x_1*, x_2*, ..., x_k*]^T. These inputs will generate a response y* = (x*)^T β + ε*, where var ε* = σ² by assumption. This will be (point) estimated by (x*)^T b, with an error of y* − (x*)^T b = (x*)^T β + ε* − (x*)^T b.

104 Now the first term is not random, ε* is an error associated with the future observation y*, and b depends only on the current observations y. So we can say that ε* and b are independent, which gives
var(y* − (x*)^T b) = var ε* + var((x*)^T b).
Therefore
var(y* − (x*)^T b) = σ² + (x*)^T (X^T X)^{-1} x* σ² = [1 + (x*)^T (X^T X)^{-1} x*] σ²,
and since the estimator is unbiased, the expectation of y* − (x*)^T b is 0.

105 Following exactly the previous arguments, we derive that
(y* − (x*)^T b) / (s √(1 + (x*)^T (X^T X)^{-1} x*))
has a t distribution with n − p degrees of freedom, and a prediction interval is
(x*)^T b ± t_{α/2} s √(1 + (x*)^T (X^T X)^{-1} x*).
The only difference from the confidence interval is the presence of the 1, which makes the interval wider, as expected.
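
The prediction interval differs from the confidence interval sketch above only by the extra 1 under the square root; a matching sketch (NumPy and SciPy assumed, invented helper name):

```python
import numpy as np
from scipy import stats

def prediction_interval(X, y, x_star, alpha=0.05):
    """PI for a single new response at x_star; note the extra '1 +' under the root."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    s = np.sqrt(np.sum((y - X @ b) ** 2) / (n - p))
    centre = x_star @ b
    half = stats.t.ppf(1 - alpha / 2, df=n - p) * s * np.sqrt(1.0 + x_star @ XtX_inv @ x_star)
    return centre - half, centre + half
```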

106 Example. In the previous example, we estimated the average selling price of a 15-year-old house with area 250 m² to be between 37.64 and 76.4 with 95% confidence. If we are given a single 15-year-old house with area 250 m², then with 95% confidence we can say that it will sell in the range of
(x*)^T b ± t_{α/2} s √(1 + (x*)^T (X^T X)^{-1} x*) = 57.02 ± 35.78 = [21.24, 92.80].
This is wider than the confidence interval for the mean.

107 Joint confidence intervals Sometimes we would like to have confidence intervals for more than one parameter, or linear combination of parameters. It is possible to simply find confidence intervals for each parameter as before, but this is misleading. If we find more than one 95% confidence interval, we do not have 95% confidence that all of them will be satisfied at once. If we find many such confidence intervals, the laws of probability imply that it is quite likely that at least one will be wrong! We need to find joint confidence intervals on a number of parameters at the same time.

108 However, joint confidence intervals give a rectangle in which we think the parameters are located. It is more accurate to simply find a confidence region, which may not be a rectangle, in which we think the parameters are located. To do this we recall the F distribution, defining it in terms of the χ² distribution.

109 Definition. Let X²_{γ_1} and X²_{γ_2} be independent χ² random variables with γ_1 and γ_2 degrees of freedom. Then (X²_{γ_1}/γ_1) / (X²_{γ_2}/γ_2) has an F distribution with γ_1 and γ_2 degrees of freedom. Remember that the first degree of freedom is associated with the numerator and the second degree of freedom is associated with the denominator.

110 Now we derive a confidence region for β. The least squares estimator is b = (X^T X)^{-1} X^T y. It can be shown using our random vector theory that the quadratic form (b − β)^T X^T X (b − β) / σ² has a χ² distribution with p degrees of freedom (where p is the number of parameters in the model). We also know that (n − p)s²/σ² has a χ² distribution with n − p degrees of freedom.

111 Since b and s² are independent, the two χ² variables above are independent, which means that
((b − β)^T X^T X (b − β) / (pσ²)) / ((n − p)s² / ((n − p)σ²)) = (b − β)^T X^T X (b − β) / (p s²)
has an F distribution with p and n − p degrees of freedom. Because this statistic is based on b − β, which we want to be small, we use the right-hand tail of the F distribution to create a confidence region.

112 If f_α is the critical value of the F distribution with p and n − p d.f. and right-tail probability α, then
P[(b − β)^T X^T X (b − β) / (p s²) ≤ f_α] = 1 − α,
which gives the confidence region
(b − β)^T X^T X (b − β) ≤ p s² f_α.
The region can be calculated from the estimated values b, X^T X, and s².
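
As a sketch (NumPy and SciPy assumed, invented helper name, not part of the notes), membership of a candidate parameter vector in this elliptical region can be tested directly:

```python
import numpy as np
from scipy import stats

def in_joint_confidence_region(X, y, beta_candidate, alpha=0.05):
    """True if (b - beta)^T X^T X (b - beta) <= p * s^2 * f_alpha for the candidate beta."""
    n, p = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    s2 = np.sum((y - X @ b) ** 2) / (n - p)
    lhs = (b - beta_candidate) @ (X.T @ X) @ (b - beta_candidate)
    f_alpha = stats.f.ppf(1 - alpha, p, n - p)          # right-tail critical value
    return lhs <= p * s2 * f_alpha
```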

113 Example. Consider the example used in practice classes 4 and 5: modelling income against years of formal education, with data on years of education (x) and income (y).

114 You would have found the least squares estimator b, the matrix X^T X and the sample variance s² by direct calculation.

115 So a joint 95% confidence region is given by
(b − β)^T X^T X (b − β) ≤ 2 s² f_{0.05},
which, when written out, simplifies to a quadratic inequality in β_0 and β_1. This region is an ellipse (which is true in general).

116 [Figure: the joint 95% confidence ellipse for (β_0, β_1).]

117 Generalized least squares. Throughout this section, we have made the assumption that the errors ε have mean 0 and variance σ²I, and sometimes that they are normally distributed. However, these assumptions are not always correct. If the errors do not have 0 mean, then the model is probably wrong! The normality assumption is not always satisfied, but (a) normal errors occur quite often in practice and (b) there isn't all that much (easily analysed) alternative. What if the variance of ε is not σ²I?

118 Suppose that the variance of ε is a positive definite matrix V, but ε is still (jointly) normally distributed. It can be shown that the maximum likelihood estimator minimizes the function e^T V^{-1} e = (y − Xb)^T V^{-1} (y − Xb) and satisfies the equivalent of the normal equations, X^T V^{-1} Xb = X^T V^{-1} y.

119 This gives the generalized least squares estimator b = (X^T V^{-1} X)^{-1} X^T V^{-1} y. If V = σ²I, this reduces to ordinary least squares. It can be shown that under these circumstances, the Gauss-Markov theorem still holds, i.e. the generalized least squares estimator is BLUE.
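
A minimal sketch of the GLS estimator (NumPy assumed, V taken as known, not part of the original notes):

```python
import numpy as np

def generalized_least_squares(X, y, V):
    """b = (X^T V^{-1} X)^{-1} X^T V^{-1} y for a known positive definite covariance V."""
    V_inv = np.linalg.inv(V)
    return np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
```

With V = σ²I the σ² cancels and this reduces to the ordinary least squares estimator, as noted above.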

120 Weighted least squares. In this situation, the errors are uncorrelated but do not have a common variance:
var ε = diag(σ_1², σ_2², ..., σ_n²).
The same formulae can be applied as above; here we are minimising
(y − Xb)^T V^{-1} (y − Xb) = Σ_{i=1}^n (e_i / σ_i)²,
so we weight each residual by the inverse of the corresponding standard deviation. So a point with high variance influences b less than a point with low variance.
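
For the weighted case, inverting the diagonal V is unnecessary: rescaling each row by 1/σ_i and applying ordinary least squares minimises exactly Σ (e_i/σ_i)². A sketch (NumPy assumed, invented helper name):

```python
import numpy as np

def weighted_least_squares(X, y, sigmas):
    """WLS with uncorrelated errors of standard deviations sigmas (one per observation)."""
    w = 1.0 / np.asarray(sigmas)          # weight each row by 1 / sigma_i
    Xw = X * w[:, None]
    yw = y * w
    b, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return b
```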


Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares Topic 4 - Analysis of Variance Approach to Regression Outline Partitioning sums of squares Degrees of freedom Expected mean squares General linear test - Fall 2013 R 2 and the coefficient of correlation

More information

Notes on Applied Linear Regression

Notes on Applied Linear Regression Notes on Applied Linear Regression Jamie DeCoster Department of Social Psychology Free University Amsterdam Van der Boechorststraat 1 1081 BT Amsterdam The Netherlands phone: +31 (0)20 444-8935 email:

More information

POLYNOMIAL FUNCTIONS

POLYNOMIAL FUNCTIONS POLYNOMIAL FUNCTIONS Polynomial Division.. 314 The Rational Zero Test.....317 Descarte s Rule of Signs... 319 The Remainder Theorem.....31 Finding all Zeros of a Polynomial Function.......33 Writing a

More information

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Module No. #01 Lecture No. #15 Special Distributions-VI Today, I am going to introduce

More information

Understanding and Applying Kalman Filtering

Understanding and Applying Kalman Filtering Understanding and Applying Kalman Filtering Lindsay Kleeman Department of Electrical and Computer Systems Engineering Monash University, Clayton 1 Introduction Objectives: 1. Provide a basic understanding

More information

Eigenvalues, Eigenvectors, Matrix Factoring, and Principal Components

Eigenvalues, Eigenvectors, Matrix Factoring, and Principal Components Eigenvalues, Eigenvectors, Matrix Factoring, and Principal Components The eigenvalues and eigenvectors of a square matrix play a key role in some important operations in statistics. In particular, they

More information

5 Homogeneous systems

5 Homogeneous systems 5 Homogeneous systems Definition: A homogeneous (ho-mo-jeen -i-us) system of linear algebraic equations is one in which all the numbers on the right hand side are equal to : a x +... + a n x n =.. a m

More information

1 Short Introduction to Time Series

1 Short Introduction to Time Series ECONOMICS 7344, Spring 202 Bent E. Sørensen January 24, 202 Short Introduction to Time Series A time series is a collection of stochastic variables x,.., x t,.., x T indexed by an integer value t. The

More information

Algebra 1 Course Title

Algebra 1 Course Title Algebra 1 Course Title Course- wide 1. What patterns and methods are being used? Course- wide 1. Students will be adept at solving and graphing linear and quadratic equations 2. Students will be adept

More information

Recall this chart that showed how most of our course would be organized:

Recall this chart that showed how most of our course would be organized: Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical

More information

3.2. Solving quadratic equations. Introduction. Prerequisites. Learning Outcomes. Learning Style

3.2. Solving quadratic equations. Introduction. Prerequisites. Learning Outcomes. Learning Style Solving quadratic equations 3.2 Introduction A quadratic equation is one which can be written in the form ax 2 + bx + c = 0 where a, b and c are numbers and x is the unknown whose value(s) we wish to find.

More information

Zeros of a Polynomial Function

Zeros of a Polynomial Function Zeros of a Polynomial Function An important consequence of the Factor Theorem is that finding the zeros of a polynomial is really the same thing as factoring it into linear factors. In this section we

More information

SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89. by Joseph Collison

SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89. by Joseph Collison SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89 by Joseph Collison Copyright 2000 by Joseph Collison All rights reserved Reproduction or translation of any part of this work beyond that permitted by Sections

More information

Multivariate Normal Distribution

Multivariate Normal Distribution Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

More information

Review Jeopardy. Blue vs. Orange. Review Jeopardy

Review Jeopardy. Blue vs. Orange. Review Jeopardy Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 0-3 Jeopardy Round $200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?

More information

Solving Quadratic Equations

Solving Quadratic Equations 9.3 Solving Quadratic Equations by Using the Quadratic Formula 9.3 OBJECTIVES 1. Solve a quadratic equation by using the quadratic formula 2. Determine the nature of the solutions of a quadratic equation

More information

5.4 Solving Percent Problems Using the Percent Equation

5.4 Solving Percent Problems Using the Percent Equation 5. Solving Percent Problems Using the Percent Equation In this section we will develop and use a more algebraic equation approach to solving percent equations. Recall the percent proportion from the last

More information

α = u v. In other words, Orthogonal Projection

α = u v. In other words, Orthogonal Projection Orthogonal Projection Given any nonzero vector v, it is possible to decompose an arbitrary vector u into a component that points in the direction of v and one that points in a direction orthogonal to v

More information

Polynomial Invariants

Polynomial Invariants Polynomial Invariants Dylan Wilson October 9, 2014 (1) Today we will be interested in the following Question 1.1. What are all the possible polynomials in two variables f(x, y) such that f(x, y) = f(y,

More information

University of Ljubljana Doctoral Programme in Statistics Methodology of Statistical Research Written examination February 14 th, 2014.

University of Ljubljana Doctoral Programme in Statistics Methodology of Statistical Research Written examination February 14 th, 2014. University of Ljubljana Doctoral Programme in Statistics ethodology of Statistical Research Written examination February 14 th, 2014 Name and surname: ID number: Instructions Read carefully the wording

More information

DERIVATIVES AS MATRICES; CHAIN RULE

DERIVATIVES AS MATRICES; CHAIN RULE DERIVATIVES AS MATRICES; CHAIN RULE 1. Derivatives of Real-valued Functions Let s first consider functions f : R 2 R. Recall that if the partial derivatives of f exist at the point (x 0, y 0 ), then we

More information

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2014 Timo Koski () Mathematisk statistik 24.09.2014 1 / 75 Learning outcomes Random vectors, mean vector, covariance

More information

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING In this lab you will explore the concept of a confidence interval and hypothesis testing through a simulation problem in engineering setting.

More information

MATHEMATICS FOR ENGINEERING BASIC ALGEBRA

MATHEMATICS FOR ENGINEERING BASIC ALGEBRA MATHEMATICS FOR ENGINEERING BASIC ALGEBRA TUTORIAL 3 EQUATIONS This is the one of a series of basic tutorials in mathematics aimed at beginners or anyone wanting to refresh themselves on fundamentals.

More information

Matrices 2. Solving Square Systems of Linear Equations; Inverse Matrices

Matrices 2. Solving Square Systems of Linear Equations; Inverse Matrices Matrices 2. Solving Square Systems of Linear Equations; Inverse Matrices Solving square systems of linear equations; inverse matrices. Linear algebra is essentially about solving systems of linear equations,

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Solving Systems of Linear Equations

Solving Systems of Linear Equations LECTURE 5 Solving Systems of Linear Equations Recall that we introduced the notion of matrices as a way of standardizing the expression of systems of linear equations In today s lecture I shall show how

More information

Methods for Finding Bases

Methods for Finding Bases Methods for Finding Bases Bases for the subspaces of a matrix Row-reduction methods can be used to find bases. Let us now look at an example illustrating how to obtain bases for the row space, null space,

More information