Linear Models: The less than full rank model estimation and estimability


2 The less than full rank model
In previous sections we used the linear model $y = X\beta + \varepsilon$ in the knowledge (or assumption) that $X$, of dimension $n \times p$, is of full rank, i.e. $r(X) = p$. This assumption allows for easier analysis, because a full rank $X$ implies that $X^T X$ is invertible, and therefore the normal equations $X^T X b = X^T y$ have a unique solution.

3 Unfortunately, not all linear models fall into this category. When $X$ is not of full rank, we must develop other techniques to analyse the model. Example. A common (and commonly known) example of a less than full rank model is the one-way classification model with fixed effects. In this model, the samples come from $k$ distinct populations with different characteristics, and we wish to determine the differences between these populations.

4 For example: a medical researcher might want to compare three different types of pain relievers for effectiveness in relieving arthritis; a biologist might study the effects of four experimental treatments used to enhance the growth of tomato plants; or an engineer might want to investigate the sulfur content in the five major coal seams in a particular geographic region. Often, the populations arise as the result of applying $k$ different treatments to groups of similar subjects.

5 For this model, we give each response variable two indices, to denote both the population from which it is taken and its position in the sample from that population. So $y_{ij}$ is the $j$th sample taken from the $i$th population. The model we use is $y_{ij} = \mu + \tau_i + \varepsilon_{ij}$, for $i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, n_i$, where $k$ is the number of populations (treatments) and $n_i$ is the number of samples from the $i$th population.

6 Although it might not look exactly like a linear model, it can be written in that form quite easily by taking $\beta = [\mu\ \ \tau_1\ \ \tau_2\ \ \cdots\ \ \tau_k]^T$ and letting $X$ have a first column of ones followed by one 0/1 indicator column for each population (written out on the next slide). Because the first column of $X$ is the sum of the remaining columns, the columns are not linearly independent, and therefore $X$ is not of full rank.

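7 Written out, $y = X\beta + \varepsilon$ is
$$\begin{bmatrix} y_{11} \\ y_{12} \\ \vdots \\ y_{21} \\ y_{22} \\ \vdots \\ y_{k,n_k} \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & 0 & 1 & \cdots & 0 \\ 1 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & 0 & 0 & \cdots & 1 \end{bmatrix} \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \\ \vdots \\ \tau_k \end{bmatrix} + \begin{bmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \vdots \\ \varepsilon_{21} \\ \varepsilon_{22} \\ \vdots \\ \varepsilon_{k,n_k} \end{bmatrix}.$$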
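The rank deficiency of this design matrix can be checked numerically. The following is a minimal sketch, assuming $k = 3$ populations with two observations each; the helper names `one_way_design` and `rank` are hypothetical and introduced only for this illustration.

```python
from fractions import Fraction

def one_way_design(group_sizes):
    """Rows are [1, indicator of group 1, ..., indicator of group k]."""
    k = len(group_sizes)
    rows = []
    for i, n_i in enumerate(group_sizes):
        for _ in range(n_i):
            rows.append([1] + [1 if j == i else 0 for j in range(k)])
    return rows

def rank(matrix):
    """Rank by Gauss-Jordan elimination over exact fractions."""
    m = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

# Assumed sizes: k = 3 groups, 2 observations per group.
X = one_way_design([2, 2, 2])
print("columns:", len(X[0]), "rank:", rank(X))
```

With four columns but rank three, $X^T X$ cannot be inverted, which is the difficulty the following slides address.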

8 Example. Three different treatment methods for removing organic carbon from tar sand wastewater are compared: airflotation (AF), foam separation (FS), and ferric-chloride coagulation (FCC). A study is conducted and the amounts of carbon removed are recorded for each method (table columns AF, FS, FCC; the data values are missing from this transcript).

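9 The linear model is $y = X\beta + \varepsilon$ with three observations per treatment:
$$\begin{bmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{21} \\ y_{22} \\ y_{23} \\ y_{31} \\ y_{32} \\ y_{33} \end{bmatrix} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \\ \tau_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{13} \\ \varepsilon_{21} \\ \varepsilon_{22} \\ \varepsilon_{23} \\ \varepsilon_{31} \\ \varepsilon_{32} \\ \varepsilon_{33} \end{bmatrix}.$$
(The observed values on the left-hand side are missing from this transcript, so they are written symbolically.)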

10 As noted before, the big difficulty with a less than full rank model is that $X^T X$ is now singular. This means that the normal equations do not have a unique solution. We will show later that, in fact, the normal equations now have an infinite number of solutions. However, the problem goes deeper than that: not only can we not estimate the parameters, but the parameters themselves do not have any fixed value! We show this in an example.

11 Example. Suppose that we have a one-way classification model with $k = 3$ populations. The response variable from each population is centred around $\mu + \tau_i$. Now suppose that from a study, it is found that $\mu + \tau_1 = 10$, $\mu + \tau_2 = 12$, $\mu + \tau_3 = 8$. Then our parameters might be $\mu = 10$, $\tau_1 = 0$, $\tau_2 = 2$, $\tau_3 = -2$. However, we can also have $\mu = 3$, $\tau_1 = 7$, $\tau_2 = 9$, $\tau_3 = 5$! In fact we can choose $\mu$ to be any real number, and still describe the system.
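A small sketch of this non-uniqueness, reading the example with population means 10, 12 and 8. The function name `group_means` and the shift of 100 are hypothetical choices made for illustration.

```python
def group_means(mu, taus):
    """Population means mu + tau_i implied by one parameter choice."""
    return [mu + t for t in taus]

first = group_means(10, [0, 2, -2])
second = group_means(3, [7, 9, 5])

# Shifting mu by any constant and every tau_i by the opposite amount
# leaves the population means unchanged (100 is an arbitrary shift).
shifted = group_means(10 + 100, [t - 100 for t in [0, 2, -2]])

print(first, second, shifted)
```

All three parameter choices describe exactly the same three population means, so the data cannot distinguish between them.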

12 Reparametrization
One way we can tackle the less than full rank model is by the simple means of converting it to a full rank model. We can then use all the machinery we have developed on the converted model. Example. Consider the one-way classification model with $k = 3$. The less than full rank model for this is $y_{ij} = \mu + \tau_i + \varepsilon_{ij}$, for $i = 1, 2, 3$ and $j = 1, 2, \ldots, n_i$. However, we can write the mean of each population as $\mu_i = \mu + \tau_i$.

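13 Then we can recast the model as $y_{ij} = \mu_i + \varepsilon_{ij}$ with corresponding matrices
$$X = \begin{bmatrix} 1_{n_1} & 0 & 0 \\ 0 & 1_{n_2} & 0 \\ 0 & 0 & 1_{n_3} \end{bmatrix}, \quad \beta = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix},$$
where $1_{m}$ denotes a column of $m$ ones and each $0$ is a column of zeros of matching length.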

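14 It is apparent now that the columns of $X$ are linearly independent, and so this is a full rank model that we can work with directly. Simple matrix calculations give us
$$X^T X = \begin{bmatrix} n_1 & 0 & 0 \\ 0 & n_2 & 0 \\ 0 & 0 & n_3 \end{bmatrix}, \quad (X^T X)^{-1} = \begin{bmatrix} 1/n_1 & 0 & 0 \\ 0 & 1/n_2 & 0 \\ 0 & 0 & 1/n_3 \end{bmatrix},$$
$$X^T y = \begin{bmatrix} \sum_{i=1}^{n_1} y_{1i} \\ \sum_{i=1}^{n_2} y_{2i} \\ \sum_{i=1}^{n_3} y_{3i} \end{bmatrix}, \quad b = (X^T X)^{-1} X^T y = \begin{bmatrix} \sum_{i=1}^{n_1} y_{1i}/n_1 \\ \sum_{i=1}^{n_2} y_{2i}/n_2 \\ \sum_{i=1}^{n_3} y_{3i}/n_3 \end{bmatrix}.$$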

15 Therefore, the least squares estimates for each of the population means are the means of the samples drawn from that population: $\hat{\mu}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}$. Linear functions of the parameters, of the form $t^T \beta$, are estimated using $t^T b$. For example, the function $\mu_1 - \mu_2$ is estimated by $\frac{1}{n_1} \sum_{i=1}^{n_1} y_{1i} - \frac{1}{n_2} \sum_{i=1}^{n_2} y_{2i}$.

16 The standard assumption that the errors are normally distributed with mean $0$ and variance $\sigma^2 I$ is interpreted in this context to mean that all populations have a common variance $\sigma^2$ (but different means). The standard estimator for this variance is
$$s^2 = \frac{y^T y - y^T X (X^T X)^{-1} X^T y}{n - p} = \frac{y^T y - y^T X b}{n - 3}.$$

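17 Then
$$s^2 = \frac{1}{n-3} \left( \sum_{i=1}^{3} \sum_{j=1}^{n_i} y_{ij}^2 - \begin{bmatrix} \sum_{i=1}^{n_1} y_{1i} & \sum_{i=1}^{n_2} y_{2i} & \sum_{i=1}^{n_3} y_{3i} \end{bmatrix} \begin{bmatrix} \sum_{i=1}^{n_1} y_{1i}/n_1 \\ \sum_{i=1}^{n_2} y_{2i}/n_2 \\ \sum_{i=1}^{n_3} y_{3i}/n_3 \end{bmatrix} \right)$$
$$= \frac{1}{n-3} \left( \sum_{i=1}^{3} \sum_{j=1}^{n_i} y_{ij}^2 - \sum_{i=1}^{3} \frac{1}{n_i} \Big( \sum_{j=1}^{n_i} y_{ij} \Big)^2 \right) = \frac{1}{n-3} \sum_{i=1}^{3} \sum_{j=1}^{n_i} \Big( y_{ij} - \frac{1}{n_i} \sum_{k=1}^{n_i} y_{ik} \Big)^2.$$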

18 This can be written as a pooled variance
$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + (n_3 - 1)s_3^2}{(n_1 - 1) + (n_2 - 1) + (n_3 - 1)},$$
where
$$s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} \Big( y_{ij} - \frac{1}{n_i} \sum_{k=1}^{n_i} y_{ik} \Big)^2$$
are the individual population variance estimators.
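The agreement between the matrix formula for $s^2$ and the pooled variance can be checked on a toy data set. This is a sketch only: the sample values below are hypothetical and do not come from the text.

```python
from fractions import Fraction

# Hypothetical samples from k = 3 populations (made up for illustration).
samples = [[Fraction(4), Fraction(6), Fraction(5)],
           [Fraction(9), Fraction(7)],
           [Fraction(1), Fraction(2), Fraction(3), Fraction(6)]]

n = sum(len(s) for s in samples)
k = len(samples)

# Least squares estimates in the cell-means model: the group sample means.
b = [sum(s) / len(s) for s in samples]

# s^2 = (y'y - y'Xb) / (n - 3), where y'Xb = sum_i (total of group i) * b_i.
yty = sum(y * y for s in samples for y in s)
ytXb = sum(sum(s) * b_i for s, b_i in zip(samples, b))
s2_direct = (yty - ytXb) / (n - k)

def sample_variance(s):
    """Usual unbiased variance estimator within one population."""
    mean = sum(s) / len(s)
    return sum((y - mean) ** 2 for y in s) / (len(s) - 1)

# The same quantity written as a pooled variance.
s2_pooled = (sum((len(s) - 1) * sample_variance(s) for s in samples)
             / sum(len(s) - 1 for s in samples))

print(b, s2_direct, s2_pooled)
```

Exact fractions are used so that the two expressions can be compared with `==` rather than with a floating-point tolerance.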

19 In general, it is always possible to reparametrize a less than full rank model into a full rank model. However, this is not always desirable. For the one-way classification model, we have a nice interpretation of the new parameters as the population means, but such an interpretation is not always available. Example. Consider the two-way classification model (without interaction), with one sample from each combination of factors and two levels of each factor: $y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}$, for $i, j = 1, 2$. We will study this model with more generality later.

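20 The design matrix for this model, with columns ordered $\mu, \tau_1, \tau_2, \beta_1, \beta_2$ and rows ordered $(i,j) = (1,1), (1,2), (2,1), (2,2)$, is
$$X = \begin{bmatrix} 1 & 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 \end{bmatrix}.$$
It is obvious that the first column is the sum of the next two columns, so the rank of $X$ is at most 4. However, the sum of the 2nd and 3rd columns is equal to the sum of the 4th and 5th, so in fact $r(X) = 3$. This means that we would have to remove 2 parameters, making interpretation much harder! Fortunately, we do not have to reparametrize our models: we can develop theory for the less than full rank model.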
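The rank claim for the two-way design can be verified directly. This sketch assumes the row order $(1,1), (1,2), (2,1), (2,2)$ and column order $\mu, \tau_1, \tau_2, \beta_1, \beta_2$; the `rank` helper is a hypothetical utility written for this check.

```python
from fractions import Fraction

def rank(matrix):
    """Rank by Gauss-Jordan elimination over exact fractions."""
    m = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

# Columns: mu, tau_1, tau_2, beta_1, beta_2 (assumed ordering).
X = [[1, 1, 0, 1, 0],
     [1, 1, 0, 0, 1],
     [1, 0, 1, 1, 0],
     [1, 0, 1, 0, 1]]

# The two linear dependencies among the columns described in the text.
first_is_sum_of_taus = all(row[0] == row[1] + row[2] for row in X)
taus_match_betas = all(row[1] + row[2] == row[3] + row[4] for row in X)

print(rank(X), first_is_sum_of_taus, taus_match_betas)
```

Five parameters with rank three means two of them would have to be dropped in a full rank reparametrization.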

21 Conditional inverses
The starting point of our theory is (as might be guessed) more linear algebra. This time we introduce the concept of conditional inverses. Definition. Let $A$ be an $n \times p$ matrix. The $p \times n$ matrix $A^c$ is called a conditional inverse for $A$ if and only if $A A^c A = A$.

22 The first thing we note is that if $A$ is nonsingular and square, then $A^{-1} = A^c$, so conditional inverses are just an extension of regular inverses to non-square and singular matrices. Example. Consider a matrix $A$ and a candidate conditional inverse $A_1$ (the entries of both matrices are missing from this transcript).

23 Then direct multiplication gives $A A_1 A = A$ (the intermediate products are missing from this transcript). Therefore $A_1$ is a conditional inverse for $A$.

24 But it can also be shown that a second matrix $A_2$ (entries missing from this transcript) is also a conditional inverse for $A$! So conditional inverses are not unique. That is why we speak of a conditional inverse for $A$, not the conditional inverse for $A$. Of course, if $A$ is nonsingular, then the conditional inverse is uniquely the regular inverse. Since the $A$ in the above example has two different conditional inverses, this shows that $A$ is singular.

25 A matrix has a regular inverse only under restrictive conditions: it must be square and nonsingular. No such condition is needed for a conditional inverse. Theorem. Let $A$ be an $n \times p$ matrix. Then $A$ has a conditional inverse.

26 Proof. Let $A$ have rank $r$. It is possible to perform a series of elementary row and column operations (scaling, interchange, and addition of a multiple of one row or column to another) on $A$ to reduce it to the form
$$B = \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix}.$$
If we denote the matrices of the row and column operations by $P$ and $Q$ (which are nonsingular), then we get $PAQ = B$.

27 Now consider the $p \times n$ matrix
$$B^T = \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix},$$
where the $0$s are appropriately dimensioned. It is not too much work to see that $B B^T B = B$, so $B^T$ is a conditional inverse of $B$.

28 Now since $P$ and $Q$ are nonsingular, $A = P^{-1} B Q^{-1}$. Then
$$A (Q B^T P) A = P^{-1} B Q^{-1} Q B^T P P^{-1} B Q^{-1} = P^{-1} B B^T B Q^{-1} = P^{-1} B Q^{-1} = A.$$
By definition, $Q B^T P$ is a conditional inverse for $A$. Therefore $A$ has a conditional inverse.

29 Finding a conditional inverse
How do we find a conditional inverse? The above theorem gives one way, but there is an easier way:
1. Find a minor $M$ of $A$ which is nonsingular and of dimension $r(A) \times r(A)$.
2. Find $M^{-1}$ and $(M^{-1})^T$.
3. Replace $M$ in $A$ with $(M^{-1})^T$ and the other entries with zeros.
4. Transpose the resulting matrix.
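The four steps above can be sketched in code. The version below is deliberately restricted to rank-2 matrices with a $2 \times 2$ minor to stay short; the function names and the matrix `A` are assumptions chosen for illustration, and `A` is not necessarily the matrix used in the text's example.

```python
from fractions import Fraction

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def inverse_2x2(M):
    (a, b), (c, d) = M
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

def conditional_inverse(A, rows, cols):
    """Steps 1-4 for a rank-2 matrix, using the 2x2 minor at rows/cols."""
    M = [[A[i][j] for j in cols] for i in rows]           # step 1
    M_inv_T = [list(r) for r in zip(*inverse_2x2(M))]     # step 2
    # Step 3: put (M^-1)^T where M sat in A, zeros everywhere else.
    filled = [[Fraction(0)] * len(A[0]) for _ in A]
    for r_idx, i in enumerate(rows):
        for c_idx, j in enumerate(cols):
            filled[i][j] = M_inv_T[r_idx][c_idx]
    return [list(r) for r in zip(*filled)]                # step 4

# Illustrative singular matrix of rank 2 (first column = sum of the others).
A = [[2, 1, 1],
     [1, 1, 0],
     [1, 0, 1]]

Ac1 = conditional_inverse(A, rows=(0, 1), cols=(0, 1))  # upper-left minor
Ac2 = conditional_inverse(A, rows=(1, 2), cols=(1, 2))  # lower-right minor

for Ac in (Ac1, Ac2):
    assert matmul(matmul(A, Ac), A) == A   # defining property A A^c A = A
print(Ac1, Ac2)
```

Choosing a different nonsingular minor yields a different matrix, yet both satisfy $A A^c A = A$, which illustrates why conditional inverses are not unique.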

30 Example. In the earlier example, it can be seen that $r(A) = 2$, so we take the principal $2 \times 2$ minor $M$ (the entries of $A$ and $M$ are missing from this transcript).

31 We then compute $(M^{-1})^T$, place it in the position of $M$ with zeros elsewhere, and transpose to obtain $A^c$ (the numerical working is missing from this transcript). This is the conditional inverse $A_1$ in the earlier example, so we can see that it works. On the other hand, if we take the lower left $2 \times 2$ minor, following the procedure gives us $A_2$. So this procedure can produce more than one conditional inverse.

32 Conditional inverse properties
Let $A$ be an $n \times p$ matrix of rank $r$, where $n \geq p \geq r$. Then:
- $A^c A$ and $A A^c$ are idempotent;
- $r(A A^c) = r(A^c A) = r$;
- $(A^c)^T = (A^T)^c$;
- $A = A (A^T A)^c (A^T A)$ and $A^T = (A^T A)(A^T A)^c A^T$;
- $I - A^c A$ is idempotent.

33 More properties
We say that an expression involving a conditional inverse is unique if it is the same no matter what conditional inverse we use.
- $A (A^T A)^c A^T$ is unique, symmetric, and idempotent;
- $r(A (A^T A)^c A^T) = r$;
- $I - A (A^T A)^c A^T$ is unique, symmetric and idempotent;
- $r(I - A (A^T A)^c A^T) = n - r$.

34 Example proof. Symmetry: $[A (A^T A)^c A^T]^T = A [(A^T A)^c]^T A^T = A [(A^T A)^T]^c A^T = A (A^T A)^c A^T$. Idempotency: $A (A^T A)^c A^T A (A^T A)^c A^T = [A (A^T A)^c A^T A] (A^T A)^c A^T = A (A^T A)^c A^T$, using $A (A^T A)^c A^T A = A$.

35 Solving the normal equations
Now that we have developed the machinery, we can try to solve the normal equations $X^T X b = X^T y$. First, we must make sure that they have a solution! Theorem. The system $Ax = g$ is consistent if and only if the rank of $[A \mid g]$ is equal to the rank of $A$.

36 Proof. ($\Leftarrow$) Assume that $r([A \mid g]) = r(A)$. Because adding $g$ does not add to the rank, $g$ must be a linear combination of the columns of $A$. Therefore there exist constants $x_1, x_2, \ldots, x_p$ so that $x_1 a_1 + x_2 a_2 + \cdots + x_p a_p = g$, where $a_i$ is the $i$th column of $A$. But if we put this into matrix notation and set $x = [x_1\ \ x_2\ \ \cdots\ \ x_p]^T$, then this is exactly the system $Ax = g$. Therefore the system is consistent.

37 Theorem. Let $y = X\beta + \varepsilon$ be a linear model. Then the normal equations $X^T X b = X^T y$ are consistent. Proof. It is obvious that $r(X^T X) \leq r([X^T X \mid X^T y])$, as adding a column cannot decrease the number of linearly independent columns.

38 However, using rank properties from earlier on, $r([X^T X \mid X^T y]) = r(X^T [X \mid y]) \leq r(X^T) = r(X^T X)$. Therefore $r([X^T X \mid X^T y]) = r(X^T X)$ and the previous theorem shows that the normal equations are consistent.

40 Now that we know the normal equations have a solution, how do we find it? We use conditional inverses. Theorem. Let $Ax = g$ be a consistent system. Then $A^c g$ is a solution to the system, where $A^c$ is any conditional inverse for $A$.

41 Proof. Since the system is consistent, $Ax = g$ for some $x$, so $A A^c g = A A^c A x = A x = g$. Therefore, $A^c g$ solves the system. From this theorem, we see that $b = (X^T X)^c X^T y$ solves the normal equations, for any conditional inverse. However, in the less than full rank model, different conditional inverses may result in different solutions.

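42 Example. Suppose that for a particular linear model, we derive
$$X^T X = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}, \quad X^T y = \begin{bmatrix} 14 \\ 6 \\ 8 \end{bmatrix}.$$
This could potentially arise from a two-class classification model with one sample from each class:
$$X = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}.$$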

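43 The normal equations are then
$$\begin{bmatrix} 2 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} 14 \\ 6 \\ 8 \end{bmatrix}.$$
Since the first column of $X^T X$ is the sum of the last two, $X^T X$ is not of full rank. However, since the first two columns are not multiples of each other, $r(X^T X) = 2$.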

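44 To find a conditional inverse of $X^T X$, we apply the algorithm, using the nonsingular minor $\begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix}$. This gives us
$$(X^T X)^c = \begin{bmatrix} 1 & -1 & 0 \\ -1 & 2 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
and therefore
$$b = (X^T X)^c X^T y = \begin{bmatrix} 1 & -1 & 0 \\ -1 & 2 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 14 \\ 6 \\ 8 \end{bmatrix} = \begin{bmatrix} 8 \\ -2 \\ 0 \end{bmatrix}.$$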

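45 However, using the minor $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ gives the conditional inverse
$$(X^T X)^c = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},$$
which gives the solution
$$b = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 14 \\ 6 \\ 8 \end{bmatrix} = \begin{bmatrix} 0 \\ 6 \\ 8 \end{bmatrix}.$$
Both these solutions solve the normal equations, and are equally valid! This is the problem with the less than full rank model.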
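The two-class example can be checked numerically. The matrices in the transcript are not fully legible, so the design `X` and the responses `y = (6, 8)` below are assumptions, chosen because they are consistent with the two solutions quoted in the example; the helper names are likewise hypothetical.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

# Assumed design: intercept plus one indicator per class, one sample each.
X = [[1, 1, 0],
     [1, 0, 1]]
y = [[6], [8]]          # assumed responses

XtX = matmul(transpose(X), X)
Xty = matmul(transpose(X), y)

# Two conditional inverses of XtX, from two different 2x2 minors.
c1 = [[1, -1, 0], [-1, 2, 0], [0, 0, 0]]   # upper-left minor [[2, 1], [1, 1]]
c2 = [[0, 0, 0], [0, 1, 0], [0, 0, 1]]     # lower-right minor [[1, 0], [0, 1]]

b1 = matmul(c1, Xty)
b2 = matmul(c2, Xty)

# Both candidates satisfy the normal equations X'X b = X'y.
for b in (b1, b2):
    assert matmul(XtX, b) == Xty
print(b1, b2)
```

The two solutions differ in every component, yet each reproduces $X^T y$ exactly when multiplied by $X^T X$.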

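46 Example. Consider the earlier carbon removal example. With three observations per treatment, we have
$$X^T X = \begin{bmatrix} 9 & 3 & 3 & 3 \\ 3 & 3 & 0 & 0 \\ 3 & 0 & 3 & 0 \\ 3 & 0 & 0 & 3 \end{bmatrix},$$
so a conditional inverse (taking the lower right $3 \times 3$ minor) is
$$(X^T X)^c = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1/3 & 0 & 0 \\ 0 & 0 & 1/3 & 0 \\ 0 & 0 & 0 & 1/3 \end{bmatrix}.$$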

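47 We can also calculate $X^T y$, whose entries are the grand total followed by the three treatment totals (the numerical values are missing from this transcript). Using the conditional inverse gives us a solution (but not the solution) to the normal equations: $b = (X^T X)^c X^T y = [0\ \ \bar{y}_1\ \ \bar{y}_2\ \ \bar{y}_3]^T$, where $\bar{y}_i$ is the sample mean for treatment $i$.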

48 In fact, if the model is less than full rank, the normal equations have an infinite number of solutions. Theorem. Let $Ax = g$ be a consistent system. Then $x = A^c g + (I - A^c A) z$ solves the system, where $z$ is an arbitrary $p \times 1$ vector.

49 Proof. We know that $A^c g$ solves the system, so
$$Ax = A[A^c g + (I - A^c A) z] = A A^c g + (A - A A^c A) z = g + (A - A) z = g.$$
For the normal equations, this means that any vector of the form $b = (X^T X)^c X^T y + [I - (X^T X)^c X^T X] z$ also satisfies the equations.

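50 Example. In the two-class example above, one solution to the normal equations was $[8\ \ {-2}\ \ 0]^T$. Using the first conditional inverse found,
$$(X^T X)^c X^T X = \begin{bmatrix} 1 & -1 & 0 \\ -1 & 2 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 2 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & -1 \\ 0 & 0 & 0 \end{bmatrix}.$$
Let $z = [z_1\ \ z_2\ \ z_3]^T$ be chosen arbitrarily. Then another solution to the normal equations is
$$b = \begin{bmatrix} 8 \\ -2 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 & 0 & -1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix} = \begin{bmatrix} 8 - z_3 \\ -2 + z_3 \\ z_3 \end{bmatrix}.$$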
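The family of solutions $b = (X^T X)^c X^T y + [I - (X^T X)^c X^T X] z$ can be explored numerically. This sketch assumes the two-class system $X^T X$, $X^T y = (14, 6, 8)^T$ as read from the example; the function name `general_solution` and the particular vectors `z` are hypothetical.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Assumed two-class system (values as reconstructed from the example).
XtX = [[2, 1, 1], [1, 1, 0], [1, 0, 1]]
Xty = [[14], [6], [8]]
c = [[1, -1, 0], [-1, 2, 0], [0, 0, 0]]   # a conditional inverse of XtX

def general_solution(z):
    """Return c g + (I - c A) z as a column vector."""
    cA = matmul(c, XtX)
    identity_minus_cA = [[(1 if i == j else 0) - cA[i][j] for j in range(3)]
                         for i in range(3)]
    particular = matmul(c, Xty)
    shift = matmul(identity_minus_cA, z)
    return [[p[0] + s[0]] for p, s in zip(particular, shift)]

# Arbitrary choices of z; each one yields a solution.
for z in ([[0], [0], [0]], [[1], [2], [3]], [[-5], [7], [8]]):
    b = general_solution(z)
    assert matmul(XtX, b) == Xty
    print(z, "->", b)
```

Only the third component of `z` matters here, because the first two columns of $I - (X^T X)^c X^T X$ are zero for this conditional inverse.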

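51 Example. In the carbon removal example, the conditional inverse with diagonal entries $0, 1/3, 1/3, 1/3$ gives us
$$(X^T X)^c X^T X = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix},$$
and so, for an arbitrary $z$ with first entry $z_1$, another solution to the normal equations is
$$b = (X^T X)^c X^T y + [I - (X^T X)^c X^T X] z = [z_1\ \ \bar{y}_1 - z_1\ \ \bar{y}_2 - z_1\ \ \bar{y}_3 - z_1]^T$$
(the numerical values used on the original slide are missing from this transcript).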

52 The converse of the above theorem is also true: all solutions to the system can be expressed in this form. Theorem. Let $Ax = g$ be a consistent system and let $x$ be any solution to the system. Then $x = A^c g + (I - A^c A) z$, where $z = x$.

53 Proof. Since $x$ solves the system, taking $z = x$ gives
$$A^c g + (I - A^c A) z = A^c g + (I - A^c A) x = A^c g + x - A^c A x = A^c g + x - A^c g = x.$$
For the normal equations, this means that any solution can be expressed as $b = (X^T X)^c X^T y + [I - (X^T X)^c X^T X] z$ for any conditional inverse $(X^T X)^c$.

54 Example. In the two-class example, we found the solution $b_1 = [8\ \ {-2}\ \ 0]^T$ using our original conditional inverse. But we also noted that the conditional inverse
$$(X^T X)^c_2 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
produces the solution $b_2 = [0\ \ 6\ \ 8]^T$.

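55 Using the theorem with $z = b_1$, the first solution can be written in terms of the second solution:
$$b_1 = (X^T X)^c_2 X^T y + (I - (X^T X)^c_2 X^T X) b_1 = \begin{bmatrix} 0 \\ 6 \\ 8 \end{bmatrix} + \begin{bmatrix} 1 & 0 & 0 \\ -1 & 0 & 0 \\ -1 & 0 & 0 \end{bmatrix} \begin{bmatrix} 8 \\ -2 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 6 \\ 8 \end{bmatrix} + \begin{bmatrix} 8 \\ -8 \\ -8 \end{bmatrix} = \begin{bmatrix} 8 \\ -2 \\ 0 \end{bmatrix}.$$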

57 Estimability Now we know how to solve the normal equations; furthermore, we know how to find all solutions for them. But which solution do we want? Which one is the best? All of them! They are all equally valid. This means that we can never estimate the parameters.

59 However, not all hope is lost. There is at least one thing which is not arbitrary. It is the value of the response variable, $y$. No matter what the parameters are estimated to be, $y$ will never change! In fact, there exist linear combinations of the parameters that will always be estimated at the same value, no matter what solution we use for the normal equations. We call these linear combinations estimable.

60 As we might guess, combinations which are estimable can be linked to the response variable in some way. Formally: Definition. Let $y = X\beta + \varepsilon$ be a linear model. A function $t^T \beta$ is said to be estimable if there exists a vector $c$ such that $E[c^T y] = t^T \beta$. Another way of looking at it is that there must exist a linear unbiased estimator for $t^T \beta$.

61 We look at some equivalent conditions to estimability. Theorem. Let $y = X\beta + \varepsilon$ be a linear model where $\varepsilon$ has mean $0$ and variance $\sigma^2 I$. Then $t^T \beta$ is estimable if and only if there is a solution to the linear system $X^T X z = t$. Proof. ($\Leftarrow$) Let $z$ be a solution to $X^T X z = t$ and put $c = Xz$. Then $E[c^T y] = E[z^T X^T y] = z^T X^T E[y] = z^T X^T X \beta = t^T \beta$, so $t^T \beta$ is estimable.

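62 Example. Consider our two-class example. As a reminder, we had
$$X^T X = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}.$$
Consider the combination of parameters $\beta_1 - \beta_2$. This corresponds to $t^T \beta$ where $t = [0\ \ 1\ \ {-1}]^T$.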

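63 Now we look for a solution to the system
$$\begin{bmatrix} 2 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \\ z_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix}.$$
A little thought shows that this system has the solution $z_1 = 0$, $z_2 = 1$, $z_3 = -1$, so $\beta_1 - \beta_2$ is estimable.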

64 Theorem. Let $y = X\beta + \varepsilon$ be a linear model where $\varepsilon$ has mean $0$ and variance $\sigma^2 I$. Then $t^T \beta$ is estimable if and only if $t^T (X^T X)^c X^T X = t^T$ for any conditional inverse of $X^T X$. Proof. ($\Leftarrow$) Assume that $t^T (X^T X)^c X^T X = t^T$. Taking transposes gives $X^T X [(X^T X)^c]^T t = t$. Since $X^T X$ is symmetric, $[(X^T X)^c]^T$ is itself a conditional inverse of $X^T X$, so we may write this as $X^T X (X^T X)^c t = t$. This means that $(X^T X)^c t$ is a solution to the system $X^T X z = t$, and the previous theorem implies that $t^T \beta$ is estimable.

65 ($\Rightarrow$) Suppose that $t^T \beta$ is estimable. By the previous theorem, there exists a solution to the system $X^T X z = t$. Using the conditional inverse, we know that a solution is $z = (X^T X)^c t$. In other words, $X^T X (X^T X)^c t = t$, and by taking transposes as above, we see that this gives $t^T (X^T X)^c X^T X = t^T$.

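66 Example. Consider the previous example. Let us take the conditional inverse
$$(X^T X)^c = \begin{bmatrix} 1 & -1 & 0 \\ -1 & 2 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
and consider the same quantity, $\beta_1 - \beta_2$, which corresponds to $t = [0\ \ 1\ \ {-1}]^T$. Then
$$t^T (X^T X)^c (X^T X) = [0\ \ 1\ \ {-1}] \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & -1 \\ 0 & 0 & 0 \end{bmatrix} = [0\ \ 1\ \ {-1}] = t^T,$$
so again we see that $\beta_1 - \beta_2$ is estimable.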

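67 On the other hand, suppose we take $t = [1\ \ 0\ \ 0]^T$ so that $t^T \beta = \beta_0$. Then we have
$$t^T (X^T X)^c (X^T X) = [1\ \ 0\ \ 0] \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & -1 \\ 0 & 0 & 0 \end{bmatrix} = [1\ \ 0\ \ 1] \neq t^T,$$
so $\beta_0$ is not estimable.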
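The estimability criterion $t^T (X^T X)^c X^T X = t^T$ is easy to test mechanically. The sketch below assumes the two-class $X^T X$ as reconstructed above and two of its conditional inverses; `is_estimable` is a hypothetical helper name.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

XtX = [[2, 1, 1], [1, 1, 0], [1, 0, 1]]      # assumed two-class X'X
c1 = [[1, -1, 0], [-1, 2, 0], [0, 0, 0]]     # one conditional inverse
c2 = [[0, 0, 0], [0, 1, 0], [0, 0, 1]]       # another conditional inverse

def is_estimable(t, cond_inv):
    """Check t' (X'X)^c X'X == t' for a coefficient vector t (a list)."""
    H = matmul(cond_inv, XtX)
    return matmul([t], H) == [t]

candidates = {
    "beta_1 - beta_2": [0, 1, -1],
    "beta_0": [1, 0, 0],
    "beta_0 + beta_1": [1, 1, 0],
}
for name, t in candidates.items():
    # The verdict should not depend on which conditional inverse is used.
    print(name, is_estimable(t, c1), is_estimable(t, c2))
```

The class mean $\beta_0 + \beta_1$ and the difference $\beta_1 - \beta_2$ pass the check with either conditional inverse, while $\beta_0$ alone fails with both.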

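68 Example. We return to the carbon removal example. We are interested in seeing if the various carbon removal treatments have (significantly) different means. To test this, we look at the quantities $\tau_1 - \tau_2$ and $\tau_1 - \tau_3$. If both of these are $0$, then the treatments are the same. With the conditional inverse with diagonal entries $0, 1/3, 1/3, 1/3$, we have
$$(X^T X)^c X^T X = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix}$$
and the coefficient vectors $t_1 = [0\ \ 1\ \ {-1}\ \ 0]^T$, $t_2 = [0\ \ 1\ \ 0\ \ {-1}]^T$.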

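69 Writing $H$ for the matrix $(X^T X)^c X^T X$, whose rows are $[0\ 0\ 0\ 0]$, $[1\ 1\ 0\ 0]$, $[1\ 0\ 1\ 0]$ and $[1\ 0\ 0\ 1]$, we have $t_1^T H = [0\ \ 1\ \ {-1}\ \ 0]\, H = [0\ \ 1\ \ {-1}\ \ 0] = t_1^T$, so $t_1^T \beta = \tau_1 - \tau_2$ is estimable. Similarly, $t_2^T H = [0\ \ 1\ \ 0\ \ {-1}]\, H = [0\ \ 1\ \ 0\ \ {-1}] = t_2^T$, so $t_2^T \beta = \tau_1 - \tau_3$ is also estimable.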

70 Using our definition of estimable, we can prove formally that no matter what conditional inverse we use, we will still generate the same estimate for an estimable quantity. First we will state a supporting lemma. Lemma. Let $y = X\beta + \varepsilon$ where $\varepsilon$ has mean $0$ and variance $\sigma^2 I$. The best linear unbiased estimator for any estimable function $t^T \beta$ is $z^T X^T y$, where $z$ is a solution to the system $X^T X z = t$.

71 Theorem (A Gauss-Markov Theorem). Let $y = X\beta + \varepsilon$ be a linear model where $\varepsilon$ has mean $0$ and variance $\sigma^2 I$. Suppose $t^T \beta$ is estimable. Then any solution to the system $X^T X z = t$ gives the same estimate for $t^T \beta$. Furthermore, this estimate is $t^T b$, where $b$ is any solution to the normal equations. Lastly, this estimate is BLUE. Proof. Suppose we have two solutions to the system $X^T X z = t$, called $z_0$ and $z_1$. Let $b$ be any solution to the normal equations, which means that $X^T X b = X^T y$.

72 From the previous lemma, the best linear unbiased estimator of $t^T \beta$ is $z_0^T X^T y = z_0^T X^T X b = (X^T X z_0)^T b = t^T b$. Similarly, $z_1^T X^T y = t^T b = z_0^T X^T y$. This shows that the best linear unbiased estimator is unique, and equal to $t^T b$.

73 Example. Let's look again at the two-class example. We have shown that $\beta_1 - \beta_2$ is estimable. We also know that solutions to the normal equations include $b = [8\ \ {-2}\ \ 0]^T$ and $b = [0\ \ 6\ \ 8]^T$. If we want to estimate $\beta_1 - \beta_2$, we would use $t^T b = [0\ \ 1\ \ {-1}]\,[8\ \ {-2}\ \ 0]^T = -2$.

74 However, from the above theorem, we can also use $t^T b = [0\ \ 1\ \ {-1}]\,[0\ \ 6\ \ 8]^T = -2$. It is not a coincidence that this estimate is the same as the previous one! The theorem shows that any solution to the normal equations, using any conditional inverse, will produce exactly the same estimate. In other words, the estimator is unique.
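The invariance of $t^T b$ can be seen by sweeping through many solutions of the normal equations. This is a sketch under the assumption that the two-class system is $X^T X b = (14, 6, 8)^T$ as reconstructed above; the names `solution` and `null_direction` are introduced here for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

XtX = [[2, 1, 1], [1, 1, 0], [1, 0, 1]]   # assumed two-class X'X
Xty = [14, 6, 8]                           # assumed X'y

b1 = [8, -2, 0]                 # one solution of the normal equations
null_direction = [-1, 1, 1]     # X'X maps this vector to zero

def solution(s):
    """Every vector b1 + s * null_direction solves the normal equations."""
    return [b + s * d for b, d in zip(b1, null_direction)]

t_contrast = [0, 1, -1]     # beta_1 - beta_2 (estimable)
t_intercept = [1, 0, 0]     # beta_0 (not estimable)

estimates_contrast = []
estimates_intercept = []
for s in range(-3, 9):
    b = solution(s)
    assert [dot(row, b) for row in XtX] == Xty
    estimates_contrast.append(dot(t_contrast, b))
    estimates_intercept.append(dot(t_intercept, b))

print(set(estimates_contrast), sorted(set(estimates_intercept)))
```

The estimable contrast takes a single value across all solutions, whereas the value attached to $\beta_0$ changes with every solution, which is exactly why it is not estimable.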

75 Example. We look at the carbon removal example. We have shown that $\tau_1 - \tau_2$ and $\tau_1 - \tau_3$ are estimable. We estimate them by $t_1^T b = [0\ \ 1\ \ {-1}\ \ 0]\, b = -4.3$ and $t_2^T b = [0\ \ 1\ \ 0\ \ {-1}]\, b = 8.2$ respectively (the entries of $b$ are missing from this transcript). The Gauss-Markov theorem shows that no matter what conditional inverse we use to calculate $b$, these estimates will always remain the same.

76 Estimability theorems
Now that we have defined estimability, we would like to know which quantities are estimable and which are not (so that we can decide what we want to find out before we start the study!). The first quantities which are definitely estimable are the expected values of the elements of $y$; this is how we defined estimability, after all! Theorem. Let $y = X\beta + \varepsilon$ be a linear model. Then the elements of $X\beta$ are estimable.

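77 Proof. We know that $E[y] = X\beta$. Therefore, taking $c^T$ to be each of $[1\ \ 0\ \ \cdots\ \ 0]$, $[0\ \ 1\ \ \cdots\ \ 0]$, ..., $[0\ \ 0\ \ \cdots\ \ 1]$ in turn gives $E[c^T y] = c^T X\beta$, which is estimable by definition. But these functions are exactly the elements of $X\beta$, so the elements of $X\beta$ are estimable.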

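78 Example. Consider the carbon removal example. We have
$$X = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix}, \quad \beta = \begin{bmatrix} \mu \\ \tau_1 \\ \tau_2 \\ \tau_3 \end{bmatrix}.$$
We showed earlier that we cannot estimate the parameter vector $\beta$.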

79 However, the real quantities of interest in this model are the mean responses from the three treatments. These are $\mu + \tau_1$, $\mu + \tau_2$, and $\mu + \tau_3$. We can see that $\mu + \tau_1 = [1\ \ 1\ \ 0\ \ 0]\,\beta$, $\mu + \tau_2 = [1\ \ 0\ \ 1\ \ 0]\,\beta$ and $\mu + \tau_3 = [1\ \ 0\ \ 0\ \ 1]\,\beta$, and each of these is an element of $X\beta$. Therefore, they are estimable. We would estimate them by replacing $\beta$ with $b$, where $b$ is any solution to the normal equations (it does not matter which). In fact, in a classification model with any $k$, $\mu + \tau_i$ is always estimable.

81 We know that elements of $X\beta$ are estimable; what else? If we combine estimable quantities (in a linear manner), the result should be estimable. Theorem. Let $t_1^T \beta, t_2^T \beta, \ldots, t_k^T \beta$ all be estimable functions, and let $z = a_1 t_1^T \beta + a_2 t_2^T \beta + \cdots + a_k t_k^T \beta$. Then $z$ is estimable, and the best linear unbiased estimator for $z$ is $a_1 t_1^T b + a_2 t_2^T b + \cdots + a_k t_k^T b$.

82 Proof. By definition, $z = (a_1 t_1 + a_2 t_2 + \cdots + a_k t_k)^T \beta$. Since all the functions are estimable,
$$(a_1 t_1 + \cdots + a_k t_k)^T (X^T X)^c X^T X = a_1 t_1^T (X^T X)^c X^T X + \cdots + a_k t_k^T (X^T X)^c X^T X = (a_1 t_1 + \cdots + a_k t_k)^T.$$

83 Therefore $z$ is estimable, with estimator $(a_1 t_1 + a_2 t_2 + \cdots + a_k t_k)^T b$. Of particular interest in many studies is the way different populations compare against each other. To attach a numerical value to these comparisons, we form linear combinations $a_1 \tau_1 + a_2 \tau_2 + \cdots + a_k \tau_k$, where $\sum_{i=1}^{k} a_i = 0$. These treatment contrasts wipe out the effect of the overall mean response, so as to get a better picture of the differences between populations.

84 In a one-way classification model, any treatment contrast is estimable. We show this by noting that if $z = a_1 \tau_1 + a_2 \tau_2 + \cdots + a_k \tau_k$ is a treatment contrast, then
$$z = \Big( \sum_{i=1}^{k} a_i \Big) \mu + a_1 \tau_1 + a_2 \tau_2 + \cdots + a_k \tau_k = a_1 (\mu + \tau_1) + a_2 (\mu + \tau_2) + \cdots + a_k (\mu + \tau_k)$$
is a linear combination of the estimable functions $\mu + \tau_i$, and is therefore itself estimable.

85 Of particular interest among treatment contrasts is the contrast of the form $\tau_i - \tau_j$, for some $i \neq j$. This is because $\tau_i - \tau_j = (\mu + \tau_i) - (\mu + \tau_j)$ is the difference between the mean response in population $i$ and the mean response in population $j$. If we write $\bar{y}_i$ for the sample mean from population $i$, then we would expect to estimate this contrast by the corresponding difference in sample means, $\bar{y}_i - \bar{y}_j$. We can show using the theory we have developed that this is in fact the case.

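86 Example. We do this for $k = 3$ and the contrast $\tau_1 - \tau_2$. Our matrices are
$$X = \begin{bmatrix} 1_{n_1} & 1_{n_1} & 0 & 0 \\ 1_{n_2} & 0 & 1_{n_2} & 0 \\ 1_{n_3} & 0 & 0 & 1_{n_3} \end{bmatrix}, \quad y = [y_{11}\ \cdots\ y_{1 n_1}\ \ y_{21}\ \cdots\ y_{2 n_2}\ \ y_{31}\ \cdots\ y_{3 n_3}]^T, \quad \beta = [\mu\ \ \tau_1\ \ \tau_2\ \ \tau_3]^T,$$
where $1_m$ denotes a column of $m$ ones and each $0$ is a column of zeros of matching length. Direct multiplication gives the results on the next slide.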

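87 $$X^T y = \begin{bmatrix} \sum_{i=1}^{3} \sum_{j=1}^{n_i} y_{ij} \\ \sum_{j=1}^{n_1} y_{1j} \\ \sum_{j=1}^{n_2} y_{2j} \\ \sum_{j=1}^{n_3} y_{3j} \end{bmatrix}, \quad X^T X = \begin{bmatrix} n & n_1 & n_2 & n_3 \\ n_1 & n_1 & 0 & 0 \\ n_2 & 0 & n_2 & 0 \\ n_3 & 0 & 0 & n_3 \end{bmatrix}.$$
We can use the conditional inverse algorithm on the lower right corner of $X^T X$ to get
$$(X^T X)^c = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1/n_1 & 0 & 0 \\ 0 & 0 & 1/n_2 & 0 \\ 0 & 0 & 0 & 1/n_3 \end{bmatrix}.$$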

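88 Therefore a solution to the normal equations is $b = (X^T X)^c X^T y = [0\ \ \bar{y}_1\ \ \bar{y}_2\ \ \bar{y}_3]^T$. We can write $\tau_1 - \tau_2$ as $[0\ \ 1\ \ {-1}\ \ 0]\,\beta$, so the best linear unbiased estimator for $\tau_1 - \tau_2$ is $[0\ \ 1\ \ {-1}\ \ 0]\,[0\ \ \bar{y}_1\ \ \bar{y}_2\ \ \bar{y}_3]^T = \bar{y}_1 - \bar{y}_2$. If we took any other conditional inverse, we would get the same result.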

89 Example. In the carbon removal example, we showed that $\tau_1 - \tau_2$ and $\tau_1 - \tau_3$ are estimable. Both of these are contrasts, so we can say straight off that they are estimable (without doing the calculations).

90 Estimating $\sigma^2$ in the less than full rank model
In the full rank model, we estimated $\sigma^2$ by $s^2 = \frac{SS_{Res}}{n - p}$, where $n$ is the sample size, $p$ is the number of parameters, and $SS_{Res}$ is the sum of squares of the residuals: $SS_{Res} = (y - Xb)^T (y - Xb) = y^T [I - X (X^T X)^{-1} X^T] y$. We would like to find a corresponding expression for the less than full rank model, but obviously it will not be the same (since $(X^T X)^{-1}$ does not exist).

91 We still define the residual sum of squares as $SS_{Res} = (y - Xb)^T (y - Xb)$, where $b$ is any solution to the normal equations. The important thing is that although $b$ can vary, $Xb$ will not, because the elements of $X\beta$ are estimable. Therefore $SS_{Res}$ is invariant to the choice of $b$. Next, we find the equivalent expression for $SS_{Res}$. Theorem. $SS_{Res} = y^T [I - X (X^T X)^c X^T] y$.

92 Proof. Let $b = (X^T X)^c X^T y$ and recall that $X (X^T X)^c X^T X = X$. Then
$$SS_{Res} = (y^T - b^T X^T)(y - Xb) = y^T y - 2 y^T X b + b^T X^T X b$$
$$= y^T y - 2 y^T X (X^T X)^c X^T y + y^T X (X^T X)^c X^T X (X^T X)^c X^T y$$
$$= y^T y - 2 y^T X (X^T X)^c X^T y + y^T X (X^T X)^c X^T y = y^T [I - X (X^T X)^c X^T] y.$$

93 How do we now find an estimator for $\sigma^2$? Using the quadratic forms theory that we developed earlier, we know that
$$E[SS_{Res}] = E[y^T (I - X (X^T X)^c X^T) y] = \mathrm{tr}(I - X (X^T X)^c X^T) \sigma^2 + (X\beta)^T (I - X (X^T X)^c X^T) X\beta$$
$$= \mathrm{tr}(I - X (X^T X)^c X^T) \sigma^2 + \beta^T X^T X \beta - \beta^T X^T X (X^T X)^c X^T X \beta$$
$$= \mathrm{tr}(I - X (X^T X)^c X^T) \sigma^2 + \beta^T X^T X \beta - \beta^T X^T X \beta = \mathrm{tr}(I - X (X^T X)^c X^T) \sigma^2.$$

94 It can be shown that $I - X (X^T X)^c X^T$ is symmetric and idempotent, so its trace equals its rank and $E[SS_{Res}] = r(I - X (X^T X)^c X^T) \sigma^2 = (n - r) \sigma^2$, where $r = r(X)$, the rank of $X$. This gives us the following theorem. Theorem. Let $y = X\beta + \varepsilon$ be a linear model, where $X$ has rank $r$ and $\varepsilon$ has mean $0$ and variance $\sigma^2 I$. Then an unbiased estimator for $\sigma^2$ is $\frac{SS_{Res}}{n - r}$.

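95 Example. We return to the carbon removal example. The fitted values $Xb$ repeat each treatment's sample mean once per observation: $Xb = [\bar{y}_1\ \ \bar{y}_1\ \ \bar{y}_1\ \ \bar{y}_2\ \ \bar{y}_2\ \ \bar{y}_2\ \ \bar{y}_3\ \ \bar{y}_3\ \ \bar{y}_3]^T$.

96 So the residuals are $y - Xb$, with entries $y_{ij} - \bar{y}_i$ (the numerical vectors on these two slides are missing from this transcript).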

97 This means $SS_{Res} = (y - Xb)^T (y - Xb) = 1.3$. The rank of $X$ is easily seen to be 3, so $s^2 = \frac{1.3}{9 - 3} = 0.217$.
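The arithmetic of this example can be reproduced end to end. The data table is missing from the transcript, so the responses below are hypothetical values, chosen only so that they reproduce the summaries quoted in the text ($SS_{Res} = 1.3$, $s^2 \approx 0.217$, contrast estimates $-4.3$ and $8.2$); they should not be read as the study's actual measurements.

```python
from fractions import Fraction

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

# Hypothetical responses for the three treatments (assumption, see above).
groups = [["34.6", "35.1", "35.3"],
          ["38.8", "39.0", "40.1"],
          ["26.7", "26.7", "27.0"]]
y = [[Fraction(v)] for g in groups for v in g]
X = [[1] + [1 if j == i else 0 for j in range(3)]
     for i, g in enumerate(groups) for _ in g]

XtX = matmul(transpose(X), X)
Xty = matmul(transpose(X), y)

third = Fraction(1, 3)
cond_inv = [[0, 0, 0, 0],
            [0, third, 0, 0],
            [0, 0, third, 0],
            [0, 0, 0, third]]
assert matmul(matmul(XtX, cond_inv), XtX) == XtX   # it is a conditional inverse

b = matmul(cond_inv, Xty)          # one solution of the normal equations
fitted = matmul(X, b)
ss_res = sum((yi[0] - fi[0]) ** 2 for yi, fi in zip(y, fitted))

n = len(y)
r = 3                              # rank of X, as stated in the text
s2 = ss_res / (n - r)
print(float(ss_res), round(float(s2), 3))
```

Note that the divisor is $n - r = 6$, not $n - p = 5$: the number of columns of $X$ overstates the number of independent parameters.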

98 Interval estimation in the less than full rank model As for the full rank model, we have estimated what we could estimate. The next step is to try and find confidence intervals for our estimates. So far, we have not assumed that the error vector ε is normally distributed. However, to find confidence intervals, we need some idea of the distribution of the variables, so we make that assumption now.

99 Recall that in the full rank model, we generated confidence intervals by finding a $t$-distributed quantity, which was created by dividing a standard normal variable by the square root of an independent $\chi^2$ variable over its degrees of freedom. The $\chi^2$ variable was $\frac{SS_{Res}}{\sigma^2}$, which had $n - p$ degrees of freedom. The $\sigma^2$ term was not known, but cancelled out another $\sigma^2$ term in the numerator to leave us with something that we could calculate.

100 We can do pretty much the same thing for the less than full rank model. Theorem. Let $y = X\beta + \varepsilon$ be a linear model, where $\varepsilon$ is a normal random vector with mean $0$ and variance $\sigma^2 I$. Then $\frac{(n - r) s^2}{\sigma^2} = \frac{SS_{Res}}{\sigma^2}$ has a $\chi^2$ distribution with $n - r$ degrees of freedom. The proof of this theorem is very similar to that for the full rank case, so we will not repeat it.

101 The steps to derive a confidence interval are very similar to those for the full rank case, but with two small differences. Firstly, we can only find confidence intervals for quantities that are estimable! Secondly, we replace the inverse $(X^T X)^{-1}$ by the conditional inverse $(X^T X)^c$. All other steps are the same.

102 This gives us the confidence interval for the (estimable) quantity $t^T \beta$, using a $t$ distribution with $n - r$ degrees of freedom:
$$t^T b \pm t_{\alpha/2}\, s \sqrt{t^T (X^T X)^c t}.$$
This formula can also be used to find confidence intervals for the individual parameters, provided that they are estimable.

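103 Example. We return again to the carbon removal example. Suppose we want to find a 95% confidence interval for $\tau_1 - \tau_2$. We have $t = [0\ \ 1\ \ {-1}\ \ 0]^T$, $s^2 = 0.217$ and $t_{0.025} = 2.45$, using $n - r = 9 - 3 = 6$ degrees of freedom. We also use the conditional inverse
$$(X^T X)^c = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1/3 & 0 & 0 \\ 0 & 0 & 1/3 & 0 \\ 0 & 0 & 0 & 1/3 \end{bmatrix}.$$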

104 This gives the confidence interval
$$t^T b \pm t_{\alpha/2}\, s \sqrt{t^T (X^T X)^c t} = -4.3 \pm 2.45 \sqrt{0.217} \sqrt{\tfrac{1}{3} + \tfrac{1}{3}} = -4.3 \pm 0.93 = (-5.23, -3.37).$$
In particular, since the whole interval lies below zero, we can say with 95% confidence that the first carbon removal treatment is not as effective as the second.
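The interval can be recomputed from its ingredients. This sketch takes the point estimate $-4.3$, $s^2 = 0.217$ and the quantile $2.45$ as read from the example (the minus sign on the estimate is an assumption, consistent with the stated conclusion), and assumes the diagonal conditional inverse with entries $0, 1/3, 1/3, 1/3$.

```python
import math

t = [0, 1, -1, 0]                  # coefficients of tau_1 - tau_2
cond_inv_diag = [0, 1/3, 1/3, 1/3] # diagonal of the assumed (X'X)^c
estimate = -4.3                    # t'b, as read from the example
s2 = 0.217                         # variance estimate from the example
t_crit = 2.45                      # t quantile for 6 d.f., from the example

# t'(X'X)^c t for a diagonal conditional inverse.
quad_form = sum(ti * ci * ti for ti, ci in zip(t, cond_inv_diag))

half_width = t_crit * math.sqrt(s2) * math.sqrt(quad_form)
lower, upper = estimate - half_width, estimate + half_width
print(round(half_width, 2), round(lower, 2), round(upper, 2))
```

The quantile is passed in as a constant because the Python standard library has no $t$-distribution quantile function; a statistics package could supply it instead.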

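105 Example. We showed earlier that in a general one-way classification model with $k = 3$ populations, the contrast $\tau_1 - \tau_2$ can be estimated by the difference in the respective sample means, $\bar{y}_1 - \bar{y}_2$. We also had $t = [0\ \ 1\ \ {-1}\ \ 0]^T$ and
$$(X^T X)^c = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1/n_1 & 0 & 0 \\ 0 & 0 & 1/n_2 & 0 \\ 0 & 0 & 0 & 1/n_3 \end{bmatrix}.$$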

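106 Therefore we have
$$t^T (X^T X)^c t = [0\ \ 1\ \ {-1}\ \ 0] \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1/n_1 & 0 & 0 \\ 0 & 0 & 1/n_2 & 0 \\ 0 & 0 & 0 & 1/n_3 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \\ -1 \\ 0 \end{bmatrix} = \frac{1}{n_1} + \frac{1}{n_2}$$
and the confidence interval is
$$\bar{y}_1 - \bar{y}_2 \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}.$$
You have probably seen this formula before! The linear models framework allows us to derive it from first principles.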