Linear Models: The full rank model estimation


2 Linear models. We remind ourselves what a linear model is. We have n subjects, labelled 1 to n. We wish to analyse or predict the behaviour of a measurement or property of each subject (the y variable), denoted y_1, y_2, ..., y_n. Each subject has certain other properties that we know or have measured (the x variables); subject i has k of these properties, x_{i1}, x_{i2}, ..., x_{ik}.

3 The linear model is y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ... + β_k x_{ik} + ε_i for all i = 1, 2, ..., n.

4 In matrix form this reads
[y_1; y_2; ...; y_n] = [1 x_{11} x_{12} ... x_{1k}; 1 x_{21} x_{22} ... x_{2k}; ...; 1 x_{n1} x_{n2} ... x_{nk}] [β_0; β_1; β_2; ...; β_k] + [ε_1; ε_2; ...; ε_n],
or, more compactly, y = Xβ + ε.

5 Note that under the terminology we have developed, y and ε are random vectors. A common assumption is that ε is a normal random vector with mean 0 and variance σ²I. As mentioned before, X and β are NOT random vectors. Although it is common for X to be a measurement, technically treating it as random is wrong: the model treats X as deterministic, and only the error term is subject to random variation.

6 The full rank model. The full rank model is very simple: it arises when the design matrix X has full rank, i.e. r(X) = k + 1. This small condition has critical importance in the analysis of the model, so much so that we divide the cases into full rank and less than full rank. For this section, we assume that X is of full rank. This means that X^T X is invertible, i.e. (X^T X)^{-1} exists.
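
As a quick illustrative sketch (not part of the original notes), the full rank condition can be checked numerically; NumPy is assumed, and the small design matrix below is invented for illustration.

```python
import numpy as np

# Hypothetical 5 x 3 design matrix: an intercept column plus two x-variables.
X = np.array([
    [1.0,  3.0, 1.2],
    [1.0,  7.0, 1.5],
    [1.0, 12.0, 2.0],
    [1.0, 20.0, 2.4],
    [1.0, 25.0, 3.1],
])

print("r(X) =", np.linalg.matrix_rank(X))        # equals k + 1 = 3 when X has full rank
print("det(X^T X) =", np.linalg.det(X.T @ X))    # nonzero exactly when X^T X is invertible
```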

7 Example. We want to analyse the selling price of a house (y). We think that the price of a house depends on two variables, its age (x_1) and its area (x_2). Our linear model takes the form y = β_0 + β_1 x_1 + β_2 x_2 + ε. We sample 5 random houses and obtain data on Price (in units of $10k), Age (in years) and Area (in units of 100 m²).

8 The model generates the 5 linear equations
50 = β_0 + 1β_1 + 1β_2 + ε_1
40 = β_0 + 5β_1 + 1β_2 + ε_2
52 = β_0 + 5β_1 + 2β_2 + ε_3
47 = β_0 + x_{41}β_1 + 2β_2 + ε_4
65 = β_0 + x_{51}β_1 + 3β_2 + ε_5

9 The matrix form of the model is y = Xβ + ε, where y is the 5 × 1 vector of observed prices and X is the 5 × 3 matrix whose ith row is (1, x_{i1}, x_{i2}).

10 Here β = (β_0, β_1, β_2)^T and ε = (ε_1, ε_2, ε_3, ε_4, ε_5)^T. Direct calculation will show that X is of full rank. This is an example of multiple regression.

11 Example. Simple linear regression can be cast in the framework of a linear model, where the response variable y depends on only one variable x: y = β_0 + β_1 x + ε. If we have n responses, this gives the linear equations y_1 = β_0 + β_1 x_1 + ε_1, y_2 = β_0 + β_1 x_2 + ε_2, ..., y_n = β_0 + β_1 x_n + ε_n.

12 In the matrix formulation we have y = (y_1, y_2, ..., y_n)^T, X is the n × 2 matrix whose ith row is (1, x_i), β = (β_0, β_1)^T and ε = (ε_1, ε_2, ..., ε_n)^T. We will show later how the linear model framework can be used to derive the well-known regression formulas for the parameters β_0 and β_1.

13 Parameter estimation using least squares. The first thing we want to do with the linear model is to estimate the parameters β_0, β_1, ..., β_k. We do this using the method of least squares. Firstly, we assume that the error vector ε has mean 0 and variance σ²I; in other words, that the model is unbiased and that the errors have a common variance and are uncorrelated with each other. We do NOT necessarily assume that the errors are independent of each other.

14 Since the error term is the only random term in the model, this means that E[y] = Xβ and var y = σ²I. In particular, this means that the expected value of each response is a linear function of the parameters (as you would expect in a linear model).

15 How can we estimate the true values of the parameters? Consider the elements of the error vector: ε_i = y_i − (β_0 + β_1 x_{i1} + β_2 x_{i2} + ... + β_k x_{ik}) = y_i − E[y_i]. What happens to this value if we have the correct model (including correct parameter values)?

16 If we have the correct parameter values and model, the error is likely to be much smaller than it would be if we had the wrong parameter values. Therefore, it makes sense to estimate the parameters in such a way that the errors are minimised in some sense. However, we do not know the true expected values and as such cannot calculate the errors! Instead, we calculate the residuals.

17 Suppose that we have some estimates of the parameters, b_0, b_1, ..., b_k. Then we can estimate the expected value of y_i by Ê[y_i] = b_0 + b_1 x_{i1} + ... + b_k x_{ik}. The ith residual is defined to be the difference between the observed value and the estimated value: e_i = y_i − Ê[y_i]. If our estimates are good, the residuals should be very close to the errors.

18 Since we want to minimise the errors but cannot calculate them, we instead choose our estimates to minimise the residuals. But how do we minimise? The natural answer is to minimise the sum of the residuals, but on further inspection this is a bad idea. Residuals can be either positive or negative, and when summed they cancel each other out, making us think we have a good fit. However, we want to eliminate residuals which are large in either the positive or the negative direction. In fact, under some not too restrictive conditions, it can be shown that the residuals will always sum to 0!

19 What about summing the absolute values of the residuals? This is not incorrect, but it is a bit inconvenient mathematically because the absolute value function is not differentiable everywhere. To avoid this, we minimise the sum of the squares of the residuals: min Σ_{i=1}^n e_i² = min e^T e. This results in the least squares estimators of the parameters.

20 How do we calculate the least squares estimators? First we write our observed responses in terms of the (as yet unknown) estimated parameters and the residuals: y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + ... + b_k x_{ik} + e_i.

21 Then we define the vectors of estimated parameters and residuals: b = (b_0, b_1, ..., b_k)^T and e = (e_1, e_2, ..., e_n)^T. Then we can write our observed responses as y = Xb + e. This looks like the linear model, but it is slightly different: it uses the estimated parameters rather than the actual parameters, and therefore we have the residuals instead of the errors.

22 Now we want to minimise
e^T e = (y − Xb)^T (y − Xb)
      = y^T y − y^T Xb − b^T X^T y + b^T X^T Xb
      = y^T y − 2y^T Xb + b^T X^T Xb
      = y^T y − 2(X^T y)^T b + b^T (X^T X)b
with respect to b (since that is the only thing we can control).

23 For this expression to be at a minimum, we need ∂(e^T e)/∂b = 0. This is where we can use the vector differentiation that we developed earlier! First, ∂(y^T y)/∂b = 0, since y (our measurements) does not depend on b (our parameter estimates).

24 Next, ∂(−2(X^T y)^T b)/∂b = −2X^T y, and ∂(b^T (X^T X)b)/∂b = (X^T X)b + (X^T X)^T b = 2(X^T X)b. Therefore, to be at a minimum we need −2X^T y + 2(X^T X)b = 0.

25 Rearranging gives the normal equations: X^T Xb = X^T y. As we observed before, because X is of full rank, X^T X has an inverse. Therefore we can solve for b to find the least squares estimator b = (X^T X)^{-1} X^T y.

26 Theorem. Let y = Xβ + ε, where X is an n × (k + 1) matrix of full rank, β is a (k + 1) × 1 vector of unknown parameters, and ε is an n × 1 random vector with mean 0 and variance σ²I. Then the least squares estimator for β is given by b = (X^T X)^{-1} X^T y.
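
The theorem translates directly into a few lines of NumPy. This is a hedged sketch with invented data, not part of the notes; in practice np.linalg.lstsq is preferred over forming (X^T X)^{-1} explicitly, but the two agree for a full rank design.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # n x (k+1), full rank
beta = np.array([2.0, -1.0, 0.5])                            # "true" parameters (invented)
y = X @ beta + rng.normal(scale=0.3, size=n)                 # responses drawn from the model

b_formula = np.linalg.inv(X.T @ X) @ X.T @ y                 # b = (X^T X)^{-1} X^T y
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)              # numerically stabler solver
print(np.allclose(b_formula, b_lstsq))                       # True
```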

27 Example. We return to the house price example presented earlier. Our data are the house prices (the response vector y) and the design matrix X built from a column of 1s, the ages and the areas. Matrix calculations give X^T X and X^T y.

28 We can then find the inverse of X^T X, which gives the least squares estimator as b = (X^T X)^{-1} X^T y.

29 Therefore our estimated model is y = b_0 + b_1 x_1 + b_2 x_2 + e, with the numerical values of b_0, b_1 and b_2 computed above.

30 Example. Recall that the simple linear regression model can be written as a linear model with two parameters, y = β_0 + β_1 x + ε, which gives y = (y_1, y_2, ..., y_n)^T and X the n × 2 matrix whose ith row is (1, x_i).

31 Then (since X is n × 2), with all sums running over i = 1, ..., n,
X^T X = [ n, Σ x_i ; Σ x_i, Σ x_i² ]   and   X^T y = [ Σ y_i ; Σ x_i y_i ].

32 Finding the inverse of the 2 × 2 matrix X^T X gives us
(X^T X)^{-1} = (1 / (n Σ x_i² − (Σ x_i)²)) [ Σ x_i², −Σ x_i ; −Σ x_i, n ].
Therefore the least squares estimator for β is

33 b = (X^T X)^{-1} X^T y
  = (1 / (n Σ x_i² − (Σ x_i)²)) [ Σ x_i², −Σ x_i ; −Σ x_i, n ] [ Σ y_i ; Σ x_i y_i ]
  = (1 / (n Σ x_i² − (Σ x_i)²)) [ Σ x_i² Σ y_i − Σ x_i Σ x_i y_i ; n Σ x_i y_i − Σ x_i Σ y_i ].

34 Breaking it down into individual coefficients, the estimator for the slope of the regression line is
b_1 = (n Σ x_i y_i − Σ x_i Σ y_i) / (n Σ x_i² − (Σ x_i)²),
which may be familiar to you from standard linear regression courses. The estimator for the intercept of the regression line is
b_0 = (Σ x_i² Σ y_i − Σ x_i Σ x_i y_i) / (n Σ x_i² − (Σ x_i)²),
which may not look familiar. Usually the estimator is given as ȳ − b_1 x̄, but it is quite simple to show that they are the same.
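
A small sketch (invented data, NumPy assumed, not part of the notes) confirming that the scalar formulas above reproduce the familiar form b_0 = ȳ − b_1 x̄.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=20)                      # invented predictor values
y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=20)       # invented responses
n = len(x)

Sx, Sy, Sxx, Sxy = x.sum(), y.sum(), (x * x).sum(), (x * y).sum()
b1 = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)           # slope formula from the notes
b0 = (Sxx * Sy - Sx * Sxy) / (n * Sxx - Sx ** 2)         # intercept formula from the notes

print(np.isclose(b0, y.mean() - b1 * x.mean()))          # matches the usual ybar - b1*xbar
```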

35 How good is the least squares estimator, really? Here is where some random vector theory comes in handy! Theorem. In the above model, the least squares estimator b = (X^T X)^{-1} X^T y is an unbiased estimator for β; in other words, E[b] = β. Furthermore, var b = (X^T X)^{-1} σ².

36 Proof.
E[b] = E[(X^T X)^{-1} X^T y] = (X^T X)^{-1} X^T E[y] = (X^T X)^{-1} X^T (Xβ) = Iβ = β.
var b = var((X^T X)^{-1} X^T y) = (X^T X)^{-1} X^T (σ²I) ((X^T X)^{-1} X^T)^T = (X^T X)^{-1} X^T X ((X^T X)^T)^{-1} σ² = (X^T X)^{-1} σ².
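
The unbiasedness and variance claims can also be checked by simulation. The following Monte Carlo sketch (invented design and parameters, NumPy assumed) is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 30, 0.7
X = np.column_stack([np.ones(n), rng.uniform(0.0, 5.0, size=n)])   # a fixed, full rank design
beta = np.array([1.0, 2.0])

estimates = []
for _ in range(20000):                                   # repeatedly sample y from the model
    y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
estimates = np.array(estimates)

print(estimates.mean(axis=0))                            # close to beta       (E[b] = beta)
print(np.cov(estimates.T))                               # close to the matrix below
print(sigma ** 2 * np.linalg.inv(X.T @ X))               # (X^T X)^{-1} sigma^2
```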

39 We now know that the least squares estimator is good but really, how good? How about the best? But what do we mean by the best? If you think about it, our estimators are really just random variables which are functions of our observations. As shown before, they have an expectation and a variance. The expectation is already as good as it can be. The ideal situation would now be for the variance to be 0, because then we would know the actual parameters with certainty. Unfortunately, this is generally impossible, but in lieu of that, we want the variances to be as low as possible.

40 There are many kinds of estimators, but we concentrate on linear estimators. These are estimators which take the form Ly, where L is some matrix of constants. In particular, the least squares estimator is a linear estimator with L = (X^T X)^{-1} X^T. Definition. Suppose you have a model with parameters β and some linear estimator b for these parameters. If E[b] = β and the variances of b_0, b_1, ..., b_k are minimised over all linear unbiased estimators, then b is called a best linear unbiased estimator of β (or BLUE for short).

41 Gauss-Markov Theorem. Theorem (Gauss-Markov). Let y = Xβ + ε, where X is an n × (k + 1) matrix of full rank, β is a (k + 1) × 1 vector of unknown parameters, and ε is an n × 1 random vector with mean 0 and variance σ²I. Then the least squares estimator b is the best linear unbiased estimator for β.

42 Proof. Suppose that we have another unbiased linear estimator for β, called b*. We can express this in the form b* = [(X^T X)^{-1} X^T + B]y, where B is a (k + 1) × n matrix. We then take expectations of both sides: E[b*] = [(X^T X)^{-1} X^T + B]E[y] = [(X^T X)^{-1} X^T + B]Xβ = [I + BX]β.

43 Since b* is an unbiased estimator for β, we know that E[b*] = β. Therefore [I + BX]β = β, which means that BX = 0. Now look at the variance of the estimator:
var b* = var([(X^T X)^{-1} X^T + B]y)
       = [(X^T X)^{-1} X^T + B] σ²I [(X^T X)^{-1} X^T + B]^T
       = σ² [(X^T X)^{-1} X^T + B][X(X^T X)^{-1} + B^T]
       = σ² [(X^T X)^{-1} X^T X(X^T X)^{-1} + (X^T X)^{-1} X^T B^T + BX(X^T X)^{-1} + BB^T].

44 Since BX = 0, we also have X^T B^T = (BX)^T = 0. Then
var b* = σ² [(X^T X)^{-1} + BB^T] = (X^T X)^{-1} σ² + BB^T σ² = var b + BB^T σ².
Now we want to minimise the variances of b*_0, b*_1, ..., b*_k, with the covariances being relatively unimportant.

45 The variances are given by the diagonal elements of the covariance matrix:
var b*_i = [var b*]_{ii} = var b_i + σ² Σ_{j=1}^n B_{ij}².
Each term in the sum is non-negative, so the variance of b*_i can never go below var b_i. Moreover, the minimum is obtained if and only if B_{ij} = 0 for all i, j, in which case B = 0 and b* = b. This means that not only is the least squares estimator the best linear unbiased estimator for β, but it is the only BLUE for β.

46 Example. Consider the house price example. The variance of the least squares estimators is given by (X^T X)^{-1} σ². This means that there is no linear unbiased estimator of β_0 which has a smaller variance than 2.31σ², and no linear unbiased estimator of β_1 which has a smaller variance than 0.03σ², etc. This is true even though we don't (yet) know what σ² is (although we will get to that!).

48 This is all well and good, but what if we want to estimate something other than the parameters? In particular, we are often interested in estimating some linear function of the parameters, t^T β, where t is a (k + 1) × 1 vector of constants. How can we estimate these? It turns out that the correct answer is the obvious one: we simply take the identical linear function of the least squares estimator.

49 Theorem. Let y = Xβ + ε, where X is an n × (k + 1) matrix of full rank, β is a (k + 1) × 1 vector of unknown parameters, and ε is an n × 1 random vector with mean 0 and variance σ²I. Let t be a (k + 1) × 1 vector of constants. Then the best linear unbiased estimator for t^T β is t^T b, where b is the least squares estimator for β. The proof of this theorem is very similar to that of the Gauss-Markov theorem.

50 The most common use of the above theorem is to predict the value of the response variable, given certain values of the x's. Example. Consider the house price example we have seen before. The model is y = β_0 + β_1 x_1 + β_2 x_2 + ε, where y is the house price, x_1 is its age, and x_2 is its area. Suppose we are now given a specific house with age x_1* and area x_2*, and we wish to estimate what price it will fetch.

51 Then we want to estimate the linear function of the parameters E[y] = β_0 + β_1 x_1* + β_2 x_2* = t^T β, where t = [1, x_1*, x_2*]^T. Therefore an unbiased estimator for the house price is t^T b = [1, x_1*, x_2*] b = b_0 + b_1 x_1* + b_2 x_2*, where b is the least squares estimator for β.

52 For example, suppose we have a house which is 15 years old and has an area of 250 m². From previous examples we have the least squares estimator b, and therefore the estimated price of the house is Ê[y] = b_0 + 15 b_1 + 2.5 b_2 = 57.02. In other words, we expect the house to sell for $57,020.

53 Variance estimation. Another quantity we will want to estimate is the variance of the errors, which is also the variance of the response variables. Remember that we are assuming that y has the covariance matrix σ²I; in other words, that each observation has 0 correlation with the others, and identical variance to the others. We will want to estimate σ². One reason that we want to do this is so that we can create confidence intervals for the true values of the parameters.

54 How should we estimate σ²? If you think about it, σ² can be written as
σ² = E[(y − Xβ)^T (y − Xβ) / n],
and so a reasonable estimator for the variance is
σ̂² = (y − Xb)^T (y − Xb) / n.
The numerator of this estimator is the sum of the squares of the residuals. Since the residuals reflect variation that is not explained by the model, this seems reasonable.

55 Is the estimator unbiased? We take expectations:
E[σ̂²] = (1/n) E[(y − Xb)^T (y − Xb)]
       = (1/n) E[(y − X(X^T X)^{-1} X^T y)^T (y − X(X^T X)^{-1} X^T y)]
       = (1/n) E[y^T (I − X(X^T X)^{-1} X^T)(I − X(X^T X)^{-1} X^T) y].
It is a simple linear algebra exercise to show that I − X(X^T X)^{-1} X^T is idempotent, which gives
E[σ̂²] = (1/n) E[y^T (I − X(X^T X)^{-1} X^T) y].

56 Now recall from random vector theory that the expectation of a quadratic form y^T Ay is E[y^T Ay] = tr(AV) + μ^T Aμ, where μ is the mean of y and V is the variance of y. But in this case,
μ^T Aμ = (Xβ)^T (I − X(X^T X)^{-1} X^T) Xβ = β^T X^T Xβ − β^T X^T X(X^T X)^{-1} X^T Xβ = 0

57 and tr(av ) = tr((i n X (X T X ) 1 X T )σ 2 I n ) = σ 2 (tr(i n ) tr(x (X T X ) 1 X T )) = σ 2 (n tr((x T X ) 1 X T X )) = σ 2 (n tr(i k+1 )) = σ 2 (n (k + 1))

58 Therefore E[σ̂²] = σ² (n − (k + 1)) / n. This is not quite what we want! To ensure unbiasedness, we instead use the estimator
s² = (n / (n − (k + 1))) [(y − Xb)^T (y − Xb) / n] = (y − Xb)^T (y − Xb) / (n − (k + 1)).

59 If we write the sum of squares of the residuals as SS_Res = (y − Xb)^T (y − Xb) and denote the number of parameters by p = k + 1, then this gives us the standard form of the estimator: s² = SS_Res / (n − p). We call this quantity the sample variance.
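
As a sketch (not from the notes, NumPy assumed), the sample variance can be computed directly from the residuals; p here counts all columns of X, including the intercept column, matching p = k + 1 above.

```python
import numpy as np

def sample_variance(X, y):
    """Unbiased estimate s^2 = SS_Res / (n - p) for a full rank linear model."""
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares estimator
    residuals = y - X @ b
    ss_res = residuals @ residuals              # SS_Res = e^T e
    return ss_res / (n - p)
```

For regression through the origin (discussed later), one would simply pass X without the column of 1s, so that p automatically equals k.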

60 Example. Back to the house price example. We have estimated our parameters b above, and so our residuals are e = y − Xb.

61 Therefore SS_Res = (−2.83)² + (−1.55)² + (−5.6)² + ··· (summing all five squared residuals), and the sample variance is s² = SS_Res / (5 − 3).

62 Example. A study is designed to predict the extent of the cracking of latex paint in field conditions, based on the extent of the cracking in accelerated tests in the laboratory. A simple linear regression model with two parameters is assumed, and we collect data on the test cracking (x) and the actual cracking (y) for six specimens.

63 The data matrices y and X are formed as before (a column of 1s and the test cracking values). By direct calculation we obtain X^T X and X^T y.

64 This gives the least squares estimator of the parameters, b = (X^T X)^{-1} X^T y. Then the estimated expected cracking rate for each entry is Xb.

65 The residuals are e = y − Xb. Therefore we can estimate the common variance of the response variables to be s² = e^T e / (n − p) = Σ_{i=1}^6 e_i² / (6 − 2) = 0.27.

66 Regression through the origin. We note that so far we have always considered the linear model to include a parameter β_0, which is associated with a column of 1s in the design matrix X. In multiple regression terms, this parameter is called the intercept. Sometimes it is reasonable to assume (from prior knowledge of the data) that no intercept is needed; then we can cut it out. However, surprisingly little changes. The model becomes y = β_1 x_1 + β_2 x_2 + ... + β_k x_k + ε, but to analyse it, the design matrix simply loses the first column, the parameter vector loses the first row, and everything proceeds as normal.

67 Thus the least squares estimator is still b = (X^T X)^{-1} X^T y and the variance estimator is still s² = SS_Res / (n − p). The only adjustment that we need to make is that the number of parameters, p, is now k instead of k + 1.

68 Maximum likelihood estimation. You may have come across maximum likelihood estimation before: this is where we estimate parameters so that the probability of the observed values occurring is maximised. We can in fact apply this methodology to estimate the parameters of the linear model. Firstly, we need to assume a distribution for the errors. We assume that the errors are jointly normally distributed and, as before, have mean 0 and variance σ²I. In particular, this means that the errors are independent, not just uncorrelated.

69 Now, given observed values of the response variables y, the observed errors are y − Xβ (although since we have not yet estimated β, these are unknown). Since the errors are independent, the likelihood function is given by
L(ε; σ²) = Π_{i=1}^n (1/(σ√(2π))) e^{−ε_i²/(2σ²)}
         = (2πσ²)^{−n/2} e^{−(1/(2σ²)) Σ_{i=1}^n ε_i²}
         = (2πσ²)^{−n/2} e^{−(1/(2σ²)) (y − Xβ)^T (y − Xβ)}.

70 We want to maximise this quantity with respect to β to generate maximum likelihood estimators for β. To do this, we differentiate with respect to β and set the result to be 0. Actually, this turns out to be very hard, so we do the equivalent: differentiate ln L(ε; σ²) with respect to β and set the result to be 0. We have
ln L(ε; σ²) = −(n/2) ln(2πσ²) − (1/(2σ²)) (y − Xβ)^T (y − Xβ).

71 Note that the first term is constant in β, so using the derivative rules we worked out earlier,
∂/∂β ln L(ε; σ²) = −(1/(2σ²)) ∂/∂β (y^T y − 2(X^T y)^T β + β^T (X^T X)β) = −(1/(2σ²)) (−2X^T y + 2(X^T X)β) = 0,
which gives (X^T X)β = X^T y. These are exactly the normal equations!

72 Theorem. Let y = Xβ + ε, where X is an n × (k + 1) matrix of full rank, β is a (k + 1) × 1 vector of unknown parameters, and ε is an n × 1 normally distributed random vector with mean 0 and variance σ²I. Then the maximum likelihood estimator for β is also the least squares estimator, b = (X^T X)^{-1} X^T y.

73 Theorem. Let y = Xβ + ε, where X is an n × (k + 1) matrix of full rank, β is a (k + 1) × 1 vector of unknown parameters, and ε is an n × 1 normally distributed random vector with mean 0 and variance σ²I. Then the maximum likelihood estimator for σ² is given by σ̃² = SS_Res / n. Note that this is not the least squares estimator! So maximum likelihood and least squares do not necessarily agree; it just happens that they do for the parameter estimators (which means they must be good!).
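
A short sketch (invented data, NumPy assumed, not part of the notes) showing the two variance estimators side by side; they differ only in the divisor.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # invented full rank design
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_res = np.sum((y - X @ b) ** 2)

print("maximum likelihood:", ss_res / n)        # SS_Res / n, biased low by a factor (n - p)/n
print("sample variance   :", ss_res / (n - p))  # SS_Res / (n - p), the unbiased s^2
```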

74 Sufficiency. The Gauss-Markov theorem states that the least squares estimators are the best linear unbiased estimators for the parameters β. The above theorem also shows that if we assume that the errors are normally distributed, the least squares estimator is also the maximum likelihood estimator. In fact, given the assumption of normality, we can say something stronger: the least squares estimators use all of the relevant information about the parameters that is contained in the observed response variables. For a more technical statement, we define the notion of sufficiency.

75 Definition (Fisher-Neyman factorization). Let X denote a random variable whose density depends on a single parameter θ, and let X_1, X_2, ..., X_n be a random sample drawn from this distribution, with joint density f(x_1, x_2, ..., x_n; θ). Then the statistic Y = u(X_1, X_2, ..., X_n) is sufficient for θ if and only if f can be expressed as f(x_1, x_2, ..., x_n; θ) = g(y; θ) h(x_1, x_2, ..., x_n). In other words, we must be able to factorise the density into one part which depends only on Y and θ (and not on the x's directly), and another part which depends only on the x's (and not at all on θ).

76 Example. Suppose we have a random sample from a Poisson distribution with parameter λ. The density for a single one of these variables (X_1, say) is f(x_1; λ) = e^{−λ} λ^{x_1} / x_1!. Because the samples are independent, the joint density is the product of all the individual densities.

77 f(x_1, x_2, ..., x_n; λ) = Π_{i=1}^n f(x_i; λ) = Π_{i=1}^n e^{−λ} λ^{x_i} / x_i! = e^{−nλ} λ^{Σ x_i} (Π_{i=1}^n x_i!)^{-1}. It can now be seen that the statistic Σ_{i=1}^n x_i is sufficient for λ.

78 Theorem. Let y = Xβ + ε, where X is an n × (k + 1) matrix of full rank, β is a (k + 1) × 1 vector of unknown parameters, and ε is an n × 1 normally distributed random vector with mean 0 and variance σ²I. Then the estimators of the parameters and the variance, b = (X^T X)^{-1} X^T y and s² = SS_Res / (n − p), are jointly sufficient for β and σ².

79 Interval estimation. Having created very good point estimates for the parameters, we would now like to find interval estimates to get a truer picture of where the parameters might lie. To do this, we remind ourselves of the Student's t distribution in a way which may not be familiar. Definition. Let Z be a standard normal random variable and let X²_γ be an independent χ² random variable with γ degrees of freedom. Then Z / √(X²_γ / γ) has a t distribution with γ degrees of freedom.

80 Now, to find an interval estimate for the parameters, we first need to know what the distribution of our least squares estimator is. For us to know this, we have to make some assumption about the distribution of the response variables (or errors); remember that we do not need any such assumption to derive the least squares estimators. We go with the old standby: the errors are jointly normally distributed. This means that the response variables are normally distributed too. But b = (X^T X)^{-1} X^T y is a linear combination of the y's!

81 Theorem. Let y = Xβ + ε, where X is of full rank and ε is a normally distributed random vector with mean 0 and variance σ²I. Then b is normally distributed with mean β and variance (X^T X)^{-1} σ². We can use this theorem to give us confidence intervals for the parameters, except there is one small problem: we do not know the real variance σ². We can use s² instead, but that is itself a random variable, so we would like to know its distribution.

82 Theorem. Let y = Xβ + ε, where X is an n × p matrix of full rank and ε is a normally distributed random vector with mean 0 and variance σ²I. Then (n − p)s²/σ² = SS_Res/σ² has a χ² distribution with n − p degrees of freedom. Proof. We have shown earlier (but it is easy to work out) that the residual sum of squares can be expressed as the quadratic form SS_Res = (y − Xb)^T (y − Xb) = y^T [I − X(X^T X)^{-1} X^T] y.

83 It can also be shown that I − X(X^T X)^{-1} X^T is symmetric and idempotent and that it has rank n − p. By assumption, y is a normal random vector with mean Xβ and variance σ²I. By an earlier corollary on the distribution of a quadratic form, (1/σ²) y^T [I − X(X^T X)^{-1} X^T] y has a noncentral χ² distribution, with n − p degrees of freedom and noncentrality parameter λ = (1/(2σ²)) μ^T [I − X(X^T X)^{-1} X^T] μ.

84 But μ = Xβ, so λ = (1/(2σ²)) (Xβ)^T [I − X(X^T X)^{-1} X^T] Xβ = (1/(2σ²)) [β^T X^T Xβ − β^T X^T X(X^T X)^{-1} X^T Xβ] = 0. This means that SS_Res/σ² has a (central) χ² distribution with n − p degrees of freedom.

85 We now have a normal random vector, b, and a χ² variable, SS_Res/σ². We can potentially combine these two together to make a single t-distributed variable, but we need to know that they are independent. Theorem. Let y = Xβ + ε, where X is an n × p matrix of full rank and ε is a normally distributed random vector with mean 0 and variance σ²I. Then b and SS_Res/σ² are independent.

86 Proof. We use the result for the independence of a vector By and a quadratic form y^T Ay: we need BVA = 0, where V is the variance of y. We have b = (X^T X)^{-1} X^T y and SS_Res/σ² = y^T [I − X(X^T X)^{-1} X^T] y / σ², and so
BVA = (X^T X)^{-1} X^T σ²I [I − X(X^T X)^{-1} X^T] / σ² = (X^T X)^{-1} X^T − (X^T X)^{-1} X^T X(X^T X)^{-1} X^T = 0.

87 We are finally ready to use these results to create confidence intervals for the parameters. The first thing that we will try to do is to find a confidence interval for a single parameter, β_i. Consider the covariance matrix of b,
(X^T X)^{-1} σ² = [ c_00 c_01 ... c_0k ; c_10 c_11 ... c_1k ; ... ; c_k0 c_k1 ... c_kk ] σ².

88 The least squares estimator of β_i is b_i. The variance of b_i is the ith diagonal element of the covariance matrix, denoted c_ii σ². Since b_i has a normal distribution, this means that (b_i − β_i)/(σ √c_ii) has a standard normal distribution. Of course, we do not know what σ is.

89 But from all the above theory,
((b_i − β_i)/(σ √c_ii)) / √(SS_Res/(σ²(n − p)))
has a t distribution with n − p degrees of freedom. Simplifying gives
((b_i − β_i)/(σ √c_ii)) / √(s²/σ²) = (b_i − β_i)/(s √c_ii).

90 It is now easy to derive the confidence interval with confidence 100(1 − α)%:
P[−t_{α/2} ≤ (b_i − β_i)/(s √c_ii) ≤ t_{α/2}] = 1 − α
P[−t_{α/2} s √c_ii ≤ b_i − β_i ≤ t_{α/2} s √c_ii] = 1 − α
P[b_i − t_{α/2} s √c_ii ≤ β_i ≤ b_i + t_{α/2} s √c_ii] = 1 − α.
Therefore the confidence interval (using a t distribution with n − p d.f.) is b_i ± t_{α/2} s √c_ii.
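
These intervals are easy to compute in code. The sketch below assumes NumPy and SciPy (scipy.stats.t for the critical value) and is not part of the original notes.

```python
import numpy as np
from scipy import stats

def coef_confidence_intervals(X, y, alpha=0.05):
    """100(1 - alpha)% intervals b_i +/- t_{alpha/2} * s * sqrt(c_ii) for each parameter."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    s2 = np.sum((y - X @ b) ** 2) / (n - p)                    # sample variance
    half = stats.t.ppf(1 - alpha / 2, df=n - p) * np.sqrt(s2 * np.diag(XtX_inv))
    return np.column_stack([b - half, b + half])               # one row per parameter
```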

91 Example. We want to model the amount of a chemical that dissolves in a fixed volume of water. It is known that this depends in part on the water temperature. An experiment is run 6 times, and the temperature (x) and the amount dissolved (y) are measured each time.

92 We use a simple linear regression model with two parameters, y = β_0 + β_1 x + ε. Using the least squares formulae we find b_0 = 1.44, b_1 = 0.31 and the estimate s. First we find a confidence interval on β_0, the intercept. By direct calculation, the top left entry in (X^T X)^{-1} is found to be c_00 = 0.52, and the degrees of freedom are n − p = 6 − 2 = 4.

93 A 95% confidence interval has α/2 = 0.025 and t = 2.78, so the confidence interval is 1.43 ± 2.78 s √0.52 = [−0.30, 3.17]. In other words, we are 95% confident that the true amount of chemical dissolved at 0 temperature lies between −0.30 and 3.17. Notably, since 0 lies inside this interval, we cannot say with 95% confidence that it is untrue that no chemical dissolves at 0 temperature.

94 Next we find a confidence interval on β_1, the slope of the regression. The bottom right entry in (X^T X)^{-1} is calculated to be c_11 = 0.0001, and the degrees of freedom and t-value are the same, which gives the confidence interval 0.31 ± 2.78 s √0.0001 = [0.25, 0.36]. In other words, we are 95% confident that for each rise in temperature of 1 degree, the amount of chemical dissolved goes up by an amount between 0.25 and 0.36. In particular, we are (at least) 95% sure that there is a positive relationship between temperature and chemical dissolved.

95 It is good that we can find confidence intervals for the parameters, but sometimes we want to estimate things other than just the parameters. In particular, we often want to predict the value of the response variable for a given set of inputs. This is an example of the more general case of linear functions of the parameters.

96 Remember that if we want to estimate the function t^T β, the best linear unbiased estimator is t^T b, where b is the least squares estimator of the parameters. We look at the distribution of this estimator. Since b is normally distributed, any linear combination of the b's is normally distributed. We have E[t^T b] = t^T β, since b is an unbiased estimator for β.

97 Variance results give us var(t^T b) = t^T (X^T X)^{-1} σ² t = t^T (X^T X)^{-1} t σ². Therefore
(t^T b − t^T β) / √(t^T (X^T X)^{-1} t σ²)
has a standard normal distribution. But again, we do not know what σ is!

98 The solution should not be difficult to see: since SS_Res/σ² is independent of b, it is independent of t^T b. We divide a standard normal variable by the square root of a (scaled) χ² variable to get a t-distributed variable:
((t^T b − t^T β) / √(t^T (X^T X)^{-1} t σ²)) / √(SS_Res/(σ²(n − p))) = (t^T b − t^T β) / (s √(t^T (X^T X)^{-1} t))
has a t distribution with n − p degrees of freedom. Using similar steps to before, this gives the 100(1 − α)% confidence interval t^T b ± t_{α/2} s √(t^T (X^T X)^{-1} t).

99 In particular, if we want to find a confidence interval for the expected response to a particular set of x variables, we can observe that given the x values x_1*, x_2*, ..., x_k*, the expected response is E[y] = β_0 + β_1 x_1* + ... + β_k x_k* = (x*)^T β, where x* = [1, x_1*, x_2*, ..., x_k*]^T. This is a linear function of β, and therefore the 100(1 − α)% confidence interval for it is (x*)^T b ± t_{α/2} s √((x*)^T (X^T X)^{-1} x*).
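
A sketch of this interval in code (NumPy and SciPy assumed, invented helper name, not part of the notes); x_star must include the leading 1 for the intercept.

```python
import numpy as np
from scipy import stats

def mean_response_ci(X, y, x_star, alpha=0.05):
    """CI for E[y] at x_star: (x*)^T b +/- t_{alpha/2} * s * sqrt((x*)^T (X^T X)^{-1} x*)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    s = np.sqrt(np.sum((y - X @ b) ** 2) / (n - p))
    centre = x_star @ b
    half = stats.t.ppf(1 - alpha / 2, df=n - p) * s * np.sqrt(x_star @ XtX_inv @ x_star)
    return centre - half, centre + half
```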

100 Example. In the house price example, we previously estimated the average selling price of a 15-year-old house with an area of 250 m² to be 57.02. We also found the inverse of X^T X earlier, and it is a simple calculation to evaluate s = √((y − Xb)^T (y − Xb) / (n − p)).

101 Then for a 15-year-old house with an area of 2.5, we get x* = [1, 15, 2.5]^T, and we can compute (x*)^T (X^T X)^{-1} x*, which, using a t distribution with 5 − 3 = 2 degrees of freedom, gives a 95% confidence interval of 57.02 ± 19.38 = [37.64, 76.40].

102 Prediction intervals. We must differentiate between the confidence interval given in the above example and a prediction interval. Given a set of inputs, a 95% confidence interval (say) gives an interval in which we are 95% sure the expected or mean response for those inputs lies. In contrast, given a set of inputs, a 95% prediction interval produces an interval in which we are 95% sure a single new response with those inputs will lie, not the mean. Because we are predicting the value of one response, this interval is wider than the corresponding confidence interval.

103 To find such an interval, suppose we have the inputs x* = [1, x_1*, x_2*, ..., x_k*]^T. These inputs will generate a response y* = (x*)^T β + ε*, where var ε* = σ² by assumption. This will be (point) estimated by (x*)^T b, with an error of y* − (x*)^T b = (x*)^T β + ε* − (x*)^T b.

104 Now the first term is not random, ε* is an error associated with the future observation y*, and b depends only on the current observations y. So we can say that ε* and b are independent, which gives
var(y* − (x*)^T b) = var ε* + var((x*)^T b).
Therefore
var(y* − (x*)^T b) = σ² + (x*)^T (X^T X)^{-1} x* σ² = [1 + (x*)^T (X^T X)^{-1} x*] σ²,
and since the estimator is unbiased, the expectation of y* − (x*)^T b is 0.

105 Following exactly the previous arguments, we derive that
(y* − (x*)^T b) / (s √(1 + (x*)^T (X^T X)^{-1} x*))
has a t distribution with n − p degrees of freedom, and a prediction interval is
(x*)^T b ± t_{α/2} s √(1 + (x*)^T (X^T X)^{-1} x*).
The only difference from the confidence interval is the presence of the 1, which makes the interval wider, as expected.
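
The prediction interval differs from the confidence interval sketch above only by the extra 1 under the square root; a matching sketch (NumPy and SciPy assumed, invented helper name):

```python
import numpy as np
from scipy import stats

def prediction_interval(X, y, x_star, alpha=0.05):
    """PI for a single new response at x_star; note the extra '1 +' under the root."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    s = np.sqrt(np.sum((y - X @ b) ** 2) / (n - p))
    centre = x_star @ b
    half = stats.t.ppf(1 - alpha / 2, df=n - p) * s * np.sqrt(1.0 + x_star @ XtX_inv @ x_star)
    return centre - half, centre + half
```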

106 Example. In the previous example, we estimated the average selling price of a 15-year-old house with area 250 m² to be between 37.64 and 76.4 with 95% confidence. If we are given a single 15-year-old house with area 250 m², then with 95% confidence we can say that it will sell in the range of
(x*)^T b ± t_{α/2} s √(1 + (x*)^T (X^T X)^{-1} x*) = 57.02 ± 35.78 = [21.24, 92.80].
This is wider than the confidence interval for the mean.

107 Joint confidence intervals Sometimes we would like to have confidence intervals for more than one parameter, or linear combination of parameters. It is possible to simply find confidence intervals for each parameter as before, but this is misleading. If we find more than one 95% confidence interval, we do not have 95% confidence that all of them will be satisfied at once. If we find many such confidence intervals, the laws of probability imply that it is quite likely that at least one will be wrong! We need to find joint confidence intervals on a number of parameters at the same time.

108 However, joint confidence intervals give a rectangle in which we think the parameters are located. It is more accurate to simply find a confidence region, which may not be a rectangle, in which we think the parameters are located. To do this we recall the F distribution, defining it in terms of the χ² distribution.

109 Definition. Let X²_{γ_1} and X²_{γ_2} be independent χ² random variables with γ_1 and γ_2 degrees of freedom. Then (X²_{γ_1}/γ_1) / (X²_{γ_2}/γ_2) has an F distribution with γ_1 and γ_2 degrees of freedom. Remember that the first degree of freedom is associated with the numerator and the second degree of freedom is associated with the denominator.

110 Now we derive a confidence region for β. The least squares estimator is b = (X^T X)^{-1} X^T y. It can be shown using our random vector theory that the quadratic form (b − β)^T X^T X (b − β) / σ² has a χ² distribution with p degrees of freedom (where p is the number of parameters in the model). We also know that (n − p)s²/σ² has a χ² distribution with n − p degrees of freedom.

111 Since b and s² are independent, the two χ² variables above are independent, which means that
((b − β)^T X^T X (b − β) / (pσ²)) / ((n − p)s² / ((n − p)σ²)) = (b − β)^T X^T X (b − β) / (p s²)
has an F distribution with p and n − p degrees of freedom. Because this statistic is based on b − β, which we want to be small, we use the right-hand tail of the F distribution to create a confidence region.

112 If f_α is the critical value of the F distribution with p and n − p d.f. and right-tail probability α, then
P[(b − β)^T X^T X (b − β) / (p s²) ≤ f_α] = 1 − α,
which gives the confidence region
(b − β)^T X^T X (b − β) ≤ p s² f_α.
The region can be calculated from the estimated values b, X^T X, and s².
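
As a sketch (NumPy and SciPy assumed, invented helper name, not part of the notes), membership of a candidate parameter vector in this elliptical region can be tested directly:

```python
import numpy as np
from scipy import stats

def in_joint_confidence_region(X, y, beta_candidate, alpha=0.05):
    """True if (b - beta)^T X^T X (b - beta) <= p * s^2 * f_alpha for the candidate beta."""
    n, p = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    s2 = np.sum((y - X @ b) ** 2) / (n - p)
    lhs = (b - beta_candidate) @ (X.T @ X) @ (b - beta_candidate)
    f_alpha = stats.f.ppf(1 - alpha, p, n - p)          # right-tail critical value
    return lhs <= p * s2 * f_alpha
```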

113 Example. Consider the example used in practice classes 4 and 5: modelling income against years of formal education, with data on years of education (x) and income (y).

114 You would have found the least squares estimator b, the matrix X^T X and the sample variance s² by direct calculation.

115 So a joint 95% confidence region is given by
(b − β)^T X^T X (b − β) ≤ 2 s² f_{0.05},
which, when written out, simplifies to a quadratic inequality in β_0 and β_1. This region is an ellipse (which is true in general).

116 [Figure: the joint 95% confidence ellipse for (β_0, β_1).]

117 Generalized least squares. Throughout this section, we have made the assumption that the errors ε have mean 0 and variance σ²I, and sometimes that they are normally distributed. However, these assumptions are not always correct. If the errors do not have 0 mean, then the model is probably wrong! The normality assumption is not always satisfied, but (a) normal errors occur quite often in practice and (b) there isn't all that much (easily analysed) alternative. What if the variance of ε is not σ²I?

118 Suppose that the variance of ε is a positive definite matrix V, but ε is still (jointly) normally distributed. It can be shown that the maximum likelihood estimator minimizes the function e^T V^{-1} e = (y − Xb)^T V^{-1} (y − Xb) and satisfies the equivalent of the normal equations, X^T V^{-1} Xb = X^T V^{-1} y.

119 This gives the generalized least squares estimator b = (X^T V^{-1} X)^{-1} X^T V^{-1} y. If V = σ²I, this reduces to ordinary least squares. It can be shown that under these circumstances, the Gauss-Markov theorem still holds, i.e. the generalized least squares estimator is BLUE.
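
A minimal sketch of the GLS estimator (NumPy assumed, V taken as known, not part of the original notes):

```python
import numpy as np

def generalized_least_squares(X, y, V):
    """b = (X^T V^{-1} X)^{-1} X^T V^{-1} y for a known positive definite covariance V."""
    V_inv = np.linalg.inv(V)
    return np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ y)
```

With V = σ²I the σ² cancels and this reduces to the ordinary least squares estimator, as noted above.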

120 Weighted least squares. In this situation, the errors are uncorrelated but do not have a common variance:
var ε = diag(σ_1², σ_2², ..., σ_n²).
The same formulae can be applied as above; here we are minimising
(y − Xb)^T V^{-1} (y − Xb) = Σ_{i=1}^n (e_i / σ_i)²,
so we weight each residual by the inverse of the corresponding standard deviation. So a point with high variance influences b less than a point with low variance.
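
For the weighted case, inverting the diagonal V is unnecessary: rescaling each row by 1/σ_i and applying ordinary least squares minimises exactly Σ (e_i/σ_i)². A sketch (NumPy assumed, invented helper name):

```python
import numpy as np

def weighted_least_squares(X, y, sigmas):
    """WLS with uncorrelated errors of standard deviations sigmas (one per observation)."""
    w = 1.0 / np.asarray(sigmas)          # weight each row by 1 / sigma_i
    Xw = X * w[:, None]
    yw = y * w
    b, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return b
```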


Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares Topic 4 - Analysis of Variance Approach to Regression Outline Partitioning sums of squares Degrees of freedom Expected mean squares General linear test - Fall 2013 R 2 and the coefficient of correlation

More information

Notes on Applied Linear Regression

Notes on Applied Linear Regression Notes on Applied Linear Regression Jamie DeCoster Department of Social Psychology Free University Amsterdam Van der Boechorststraat 1 1081 BT Amsterdam The Netherlands phone: +31 (0)20 444-8935 email:

More information

POLYNOMIAL FUNCTIONS

POLYNOMIAL FUNCTIONS POLYNOMIAL FUNCTIONS Polynomial Division.. 314 The Rational Zero Test.....317 Descarte s Rule of Signs... 319 The Remainder Theorem.....31 Finding all Zeros of a Polynomial Function.......33 Writing a

More information

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Module No. #01 Lecture No. #15 Special Distributions-VI Today, I am going to introduce

More information

Understanding and Applying Kalman Filtering

Understanding and Applying Kalman Filtering Understanding and Applying Kalman Filtering Lindsay Kleeman Department of Electrical and Computer Systems Engineering Monash University, Clayton 1 Introduction Objectives: 1. Provide a basic understanding

More information

Eigenvalues, Eigenvectors, Matrix Factoring, and Principal Components

Eigenvalues, Eigenvectors, Matrix Factoring, and Principal Components Eigenvalues, Eigenvectors, Matrix Factoring, and Principal Components The eigenvalues and eigenvectors of a square matrix play a key role in some important operations in statistics. In particular, they

More information

5 Homogeneous systems

5 Homogeneous systems 5 Homogeneous systems Definition: A homogeneous (ho-mo-jeen -i-us) system of linear algebraic equations is one in which all the numbers on the right hand side are equal to : a x +... + a n x n =.. a m

More information

1 Short Introduction to Time Series

1 Short Introduction to Time Series ECONOMICS 7344, Spring 202 Bent E. Sørensen January 24, 202 Short Introduction to Time Series A time series is a collection of stochastic variables x,.., x t,.., x T indexed by an integer value t. The

More information

Algebra 1 Course Title

Algebra 1 Course Title Algebra 1 Course Title Course- wide 1. What patterns and methods are being used? Course- wide 1. Students will be adept at solving and graphing linear and quadratic equations 2. Students will be adept

More information

Recall this chart that showed how most of our course would be organized:

Recall this chart that showed how most of our course would be organized: Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical

More information

3.2. Solving quadratic equations. Introduction. Prerequisites. Learning Outcomes. Learning Style

3.2. Solving quadratic equations. Introduction. Prerequisites. Learning Outcomes. Learning Style Solving quadratic equations 3.2 Introduction A quadratic equation is one which can be written in the form ax 2 + bx + c = 0 where a, b and c are numbers and x is the unknown whose value(s) we wish to find.

More information

Zeros of a Polynomial Function

Zeros of a Polynomial Function Zeros of a Polynomial Function An important consequence of the Factor Theorem is that finding the zeros of a polynomial is really the same thing as factoring it into linear factors. In this section we

More information

SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89. by Joseph Collison

SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89. by Joseph Collison SYSTEMS OF EQUATIONS AND MATRICES WITH THE TI-89 by Joseph Collison Copyright 2000 by Joseph Collison All rights reserved Reproduction or translation of any part of this work beyond that permitted by Sections

More information

Multivariate Normal Distribution

Multivariate Normal Distribution Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

More information

Review Jeopardy. Blue vs. Orange. Review Jeopardy

Review Jeopardy. Blue vs. Orange. Review Jeopardy Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 0-3 Jeopardy Round $200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?

More information

Solving Quadratic Equations

Solving Quadratic Equations 9.3 Solving Quadratic Equations by Using the Quadratic Formula 9.3 OBJECTIVES 1. Solve a quadratic equation by using the quadratic formula 2. Determine the nature of the solutions of a quadratic equation

More information

5.4 Solving Percent Problems Using the Percent Equation

5.4 Solving Percent Problems Using the Percent Equation 5. Solving Percent Problems Using the Percent Equation In this section we will develop and use a more algebraic equation approach to solving percent equations. Recall the percent proportion from the last

More information

α = u v. In other words, Orthogonal Projection

α = u v. In other words, Orthogonal Projection Orthogonal Projection Given any nonzero vector v, it is possible to decompose an arbitrary vector u into a component that points in the direction of v and one that points in a direction orthogonal to v

More information

Polynomial Invariants

Polynomial Invariants Polynomial Invariants Dylan Wilson October 9, 2014 (1) Today we will be interested in the following Question 1.1. What are all the possible polynomials in two variables f(x, y) such that f(x, y) = f(y,

More information

University of Ljubljana Doctoral Programme in Statistics Methodology of Statistical Research Written examination February 14 th, 2014.

University of Ljubljana Doctoral Programme in Statistics Methodology of Statistical Research Written examination February 14 th, 2014. University of Ljubljana Doctoral Programme in Statistics ethodology of Statistical Research Written examination February 14 th, 2014 Name and surname: ID number: Instructions Read carefully the wording

More information

DERIVATIVES AS MATRICES; CHAIN RULE

DERIVATIVES AS MATRICES; CHAIN RULE DERIVATIVES AS MATRICES; CHAIN RULE 1. Derivatives of Real-valued Functions Let s first consider functions f : R 2 R. Recall that if the partial derivatives of f exist at the point (x 0, y 0 ), then we

More information

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2014 Timo Koski () Mathematisk statistik 24.09.2014 1 / 75 Learning outcomes Random vectors, mean vector, covariance

More information

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING In this lab you will explore the concept of a confidence interval and hypothesis testing through a simulation problem in engineering setting.

More information

MATHEMATICS FOR ENGINEERING BASIC ALGEBRA

MATHEMATICS FOR ENGINEERING BASIC ALGEBRA MATHEMATICS FOR ENGINEERING BASIC ALGEBRA TUTORIAL 3 EQUATIONS This is the one of a series of basic tutorials in mathematics aimed at beginners or anyone wanting to refresh themselves on fundamentals.

More information

Matrices 2. Solving Square Systems of Linear Equations; Inverse Matrices

Matrices 2. Solving Square Systems of Linear Equations; Inverse Matrices Matrices 2. Solving Square Systems of Linear Equations; Inverse Matrices Solving square systems of linear equations; inverse matrices. Linear algebra is essentially about solving systems of linear equations,

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Solving Systems of Linear Equations

Solving Systems of Linear Equations LECTURE 5 Solving Systems of Linear Equations Recall that we introduced the notion of matrices as a way of standardizing the expression of systems of linear equations In today s lecture I shall show how

More information

Methods for Finding Bases

Methods for Finding Bases Methods for Finding Bases Bases for the subspaces of a matrix Row-reduction methods can be used to find bases. Let us now look at an example illustrating how to obtain bases for the row space, null space,

More information