# 1. Introduction to multivariate data


## 1.1 Books

- Chatfield, C. and A. J. Collins, *Introduction to Multivariate Analysis*. Chapman & Hall.
- Krzanowski, W. J., *Principles of Multivariate Analysis*. Oxford.
- Johnson, R. A. and D. W. Wichern, *Applied Multivariate Statistical Analysis*. Prentice Hall.

## 1.2 Applications

The need often arises in science, medicine and social science (business, management) to analyse data on $p$ variables (note that $p = 2$ data are bivariate). Suppose we have a simple random sample of size $n$. The sample consists of $n$ vectors of measurements on $p$ variates, i.e. $n$ $(p \times 1)$ vectors (by convention column vectors) $\mathbf{x}_1, \ldots, \mathbf{x}_n$, which are inserted as rows $\mathbf{x}_1^T, \ldots, \mathbf{x}_n^T$ into an $(n \times p)$ data matrix $X$. When $p = 2$ we can plot the rows in 2-dimensional space, but in higher dimensions, $p > 3$, other techniques are needed.

**Example 1: Classification of plants (taxonomy)**

- Variables ($p = 3$): leaf size ($x_1$), colour of flower ($x_2$), height of plant ($x_3$)
- Sample items: $n = 4$ plants from a single species
- Aims of analysis: i) understand within-species variability; ii) classify a new plant species

The data matrix then has $n = 4$ rows (plants, the items) and $p = 3$ columns (the variables).

**Example 2: Credit scoring**

- Variables: personal data held by a bank
- Items: sample of good/bad customers
- Aims of analysis: i) predict potential defaulters (CRM); ii) risk assessment for a new applicant

**Example 3: Image processing, e.g. quality control**

- Variables: "features" extracted from an image
- Items: sampled from a production line
- Aims of analysis: i) quantify "normal" variability; ii) reject faulty (off-specification) batches

## 1.3 Sample mean and covariance matrix

We shall adopt the following notation:

| Symbol | Size | Meaning |
|---|---|---|
| $\mathbf{x}$ | $(p \times 1)$ | a random vector of observations on $p$ variables |
| $X$ | $(n \times p)$ | a data matrix whose rows contain an independent random sample $\mathbf{x}_1^T, \ldots, \mathbf{x}_n^T$ of observations on $\mathbf{x}$ |
| $\bar{\mathbf{x}}$ | $(p \times 1)$ | sample mean vector $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i$ |
| $S$ | $(p \times p)$ | sample covariance matrix containing the sample covariances $s_{jk} = \frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$ |
| $R$ | $(p \times p)$ | sample correlation matrix containing the sample correlations $r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj}s_{kk}}} = \frac{s_{jk}}{s_j s_k}$, say |

Notes:

1. $\bar{x}_j$ is defined as the $j$th component of $\bar{\mathbf{x}}$ (mean of variable $j$).
2. The covariance matrix $S$ is square and symmetric ($S = S^T$), and holds the sample variances $s_{jj} = s_j^2 = \frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2$ along its main diagonal.
3. The diagonal elements of $R$ are $r_{jj} = 1$, and $|r_{jk}| \le 1$ for each $j, k$.

## 1.4 Matrix-vector representations

Given an $(n \times p)$ data matrix $X$, define $\mathbf{1}$, the $n \times 1$ vector of ones, $(1, 1, \ldots, 1)^T$.

The row sums of $X$ are obtained by pre-multiplying $X$ by $\mathbf{1}^T$:
$$\mathbf{1}^T X = \left(\sum_i x_{i1}, \ldots, \sum_i x_{ip}\right) = (n\bar{x}_1, \ldots, n\bar{x}_p) = n\bar{\mathbf{x}}^T$$
Hence
$$\bar{\mathbf{x}} = \frac{1}{n} X^T \mathbf{1} \tag{1.1}$$

The centred data matrix $X_0$ is derived from $X$ by subtracting the variable mean from each element of $X$, i.e. $x^0_{ij} = x_{ij} - \bar{x}_j$, or, equivalently, by subtracting the constant vector $\bar{\mathbf{x}}^T$ from each row of $X$:
$$X_0 = X - \mathbf{1}\bar{\mathbf{x}}^T = X - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T X = \left(I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T\right)X = HX \tag{1.2}$$
where $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is known as the centring matrix. We now define the sample covariance matrix as
$$S = \frac{1}{n} X_0^T X_0 \tag{1.3a}$$
where $X_0^T X_0$ is the centred sum of squares and products (SSP) matrix. Equivalently,
$$S = \frac{1}{n}\sum_{i=1}^n \mathbf{x}^0_i \mathbf{x}^{0T}_i \tag{1.3b}$$
where $\mathbf{x}^0_i = \mathbf{x}_i - \bar{\mathbf{x}}$ denotes the $i$th mean-corrected data point.

For any real $p \times 1$ vector $\mathbf{y}$ we then have
$$\mathbf{y}^T S \mathbf{y} = \frac{1}{n}\mathbf{y}^T X_0^T X_0 \mathbf{y} = \frac{1}{n}\mathbf{z}^T\mathbf{z} = \frac{1}{n}\|\mathbf{z}\|^2 \ge 0, \quad \text{where } \mathbf{z} = X_0\mathbf{y}$$
Hence, from the definition of a p.s.d. matrix, we have:

**Proposition.** The sample covariance matrix $S$ is positive semi-definite (p.s.d.).

**Example.** Two measurements $x_1, x_2$, made at the same position on each of 3 cans of food, resulted in a $(3 \times 2)$ data matrix $X$. Find the sample mean vector $\bar{\mathbf{x}}$ and covariance matrix $S$.
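The constructions above are easy to check numerically. A minimal NumPy sketch (the small data matrix here is an assumed illustration, not the cans example):

```python
import numpy as np

# Illustrative (n x p) data matrix: n = 5 items, p = 2 variables (assumed values)
X = np.array([[2.0, 4.0],
              [1.0, 3.0],
              [3.0, 7.0],
              [5.0, 6.0],
              [4.0, 5.0]])
n = X.shape[0]

one = np.ones(n)                       # the n-vector of ones
xbar = X.T @ one / n                   # (1.1): sample mean vector
H = np.eye(n) - np.outer(one, one)/n   # centring matrix H = I - (1/n) 1 1^T
X0 = H @ X                             # centred data matrix (1.2)
S = X0.T @ X0 / n                      # sample covariance matrix (1.3a)

# S agrees with np.cov using divisor n (ddof=0)
print(np.allclose(S, np.cov(X, rowvar=False, ddof=0)))  # True
# S is positive semi-definite: all eigenvalues are non-negative
print(np.all(np.linalg.eigvalsh(S) >= -1e-12))          # True
```

The same `H @ X` construction reappears below whenever a centred data matrix is needed.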

**Solution.** Using (1.1) and (1.3a), compute $\bar{\mathbf{x}} = \frac{1}{3}X^T\mathbf{1}$ and $S = \frac{1}{3}X_0^T X_0$. Note also that, as in (1.3b), $S$ is built up from the individual mean-corrected data points. For these data the total variation (defined below) is $\operatorname{tr}(S) = 2.33$ and the sample correlation between the two measurements is $r_{12} = 0.89$.

## 1.5 Measures of multivariate scatter

It is useful to have a single number as a measure of spread in the data. Based on $S$ we define two scalar quantities.

The **total variation** is
$$\operatorname{tr}(S) = \sum_{i=1}^p s_{ii} = \text{sum of the diagonal elements} = \text{sum of the eigenvalues of } S$$
The **generalized variance** is
$$|S| = \text{product of the eigenvalues of } S \tag{1.5}$$
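Both scatter measures can be checked against an eigendecomposition. A short sketch (the covariance matrix below is an assumed example, not the cans data):

```python
import numpy as np

S = np.array([[2.0, 0.8],
              [0.8, 1.0]])               # an assumed 2x2 sample covariance matrix

eigvals = np.linalg.eigvalsh(S)          # eigenvalues of the symmetric matrix S

total_variation = np.trace(S)            # tr(S) = sum of eigenvalues
generalized_variance = np.linalg.det(S)  # |S| = product of eigenvalues

print(np.isclose(total_variation, eigvals.sum()))        # True
print(np.isclose(generalized_variance, eigvals.prod()))  # True
```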

## 1.6 Random vectors

We will in this course generally regard the data as an independent random sample from some continuous population distribution with a probability density function
$$f(\mathbf{x}) = f(x_1, \ldots, x_p) \tag{1.6}$$
Here $\mathbf{x} = (x_1, \ldots, x_p)^T$ is regarded as a vector of $p$ random variables. Independence here refers to the rows of the data matrix. If two of the variables (columns) are, for example, height and weight of individuals (rows), then knowing one individual's weight says nothing about any other individual. However, the height and weight for any individual are correlated.

For any region $D$ in the $p$-dimensional space of the variables,
$$\Pr(\mathbf{x} \in D) = \int_D f(\mathbf{x})\, d\mathbf{x}$$

**Mean vector.** For any $j$ the population mean of $x_j$ is given by the $p$-fold integral
$$E(x_j) = \mu_j = \int x_j f(\mathbf{x})\, d\mathbf{x}$$
where the region of integration is $\mathbb{R}^p$. In vector form,
$$E(\mathbf{x}) = E\begin{pmatrix} x_1 \\ \vdots \\ x_p \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_p \end{pmatrix} = \boldsymbol{\mu} \tag{1.7}$$

**Covariance matrix.** The covariance between $x_j, x_k$ is defined as
$$\sigma_{jk} = \operatorname{Cov}(x_j, x_k) = E\left[(x_j - \mu_j)(x_k - \mu_k)\right] = E[x_j x_k] - \mu_j\mu_k$$
When $j = k$ we obtain the variance of $x_j$:
$$\sigma_{jj} = E\left[(x_j - \mu_j)^2\right]$$
The covariance matrix is the $p \times p$ matrix
$$\Sigma = (\sigma_{ij}) = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1p} \\ \vdots & \ddots & \vdots \\ \sigma_{p1} & \cdots & \sigma_{pp} \end{pmatrix}$$

The alternative notations $V(\mathbf{x})$ and $\operatorname{Cov}(\mathbf{x})$ are also used. In matrix form,
$$\Sigma = E\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\right] \tag{1.8a}$$
$$= E\left[\mathbf{x}\mathbf{x}^T\right] - \boldsymbol{\mu}\boldsymbol{\mu}^T \tag{1.8b}$$
More generally we define the covariance between two random vectors $\mathbf{x}$ $(p \times 1)$ and $\mathbf{y}$ $(q \times 1)$ as the $(p \times q)$ matrix
$$\operatorname{Cov}(\mathbf{x}, \mathbf{y}) = E\left[(\mathbf{x} - \boldsymbol{\mu}_x)(\mathbf{y} - \boldsymbol{\mu}_y)^T\right] \tag{1.9}$$

**Important property of $\Sigma$.** $\Sigma$ is a positive semi-definite matrix.

*Proof.* Let $\mathbf{a}$ $(p \times 1)$ be a constant vector; then $E(\mathbf{a}^T\mathbf{x}) = \mathbf{a}^T E(\mathbf{x}) = \mathbf{a}^T\boldsymbol{\mu}$ and
$$V(\mathbf{a}^T\mathbf{x}) = E\left[\left(\mathbf{a}^T\mathbf{x} - \mathbf{a}^T\boldsymbol{\mu}\right)^2\right] = \mathbf{a}^T E\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\right]\mathbf{a} = \mathbf{a}^T\Sigma\mathbf{a}$$
Since a variance is always a non-negative quantity we find $\mathbf{a}^T\Sigma\mathbf{a} \ge 0$. From the definition (see handout), $\Sigma$ is a positive semi-definite (p.s.d.) matrix. ∎

Suppose we have an independent random sample $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ from a distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$. What is the relation between (a) the sample and population means, and (b) the sample and population covariance matrices?

**Result 1.** We first establish the mean and covariance of the sample mean $\bar{\mathbf{x}}$:
$$E(\bar{\mathbf{x}}) = \boldsymbol{\mu} \tag{1.10a}$$
$$V(\bar{\mathbf{x}}) = \frac{1}{n}\Sigma \tag{1.10b}$$

*Proof.*
$$E(\bar{\mathbf{x}}) = E\left(\frac{1}{n}\sum_{i=1}^n \mathbf{x}_i\right) = \frac{1}{n}\sum_{i=1}^n E(\mathbf{x}_i) = \boldsymbol{\mu}$$

$$V(\bar{\mathbf{x}}) = \operatorname{Cov}\left(\frac{1}{n}\sum_{i=1}^n \mathbf{x}_i,\ \frac{1}{n}\sum_{j=1}^n \mathbf{x}_j\right) = \frac{1}{n^2}\cdot n\,\Sigma$$
noting that $\operatorname{Cov}(\mathbf{x}_i, \mathbf{x}_i) = \Sigma$ and $\operatorname{Cov}(\mathbf{x}_i, \mathbf{x}_j) = 0$ for $i \ne j$. Hence $V(\bar{\mathbf{x}}) = \frac{1}{n}\Sigma$. ∎

**Result 2.** We now examine $S$ and derive an unbiased estimator for $\Sigma$:
$$E(S) = \frac{n-1}{n}\Sigma \tag{1.11}$$

*Proof.*
$$S = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^T - \bar{\mathbf{x}}\bar{\mathbf{x}}^T$$
since $\frac{1}{n}\sum_{i=1}^n \mathbf{x}_i\bar{\mathbf{x}}^T = \bar{\mathbf{x}}\bar{\mathbf{x}}^T = \frac{1}{n}\bar{\mathbf{x}}\sum_{i=1}^n \mathbf{x}_i^T$. From (1.8b) and (1.10b) we see that
$$E\left(\mathbf{x}_i\mathbf{x}_i^T\right) = \Sigma + \boldsymbol{\mu}\boldsymbol{\mu}^T, \qquad E\left(\bar{\mathbf{x}}\bar{\mathbf{x}}^T\right) = \frac{1}{n}\Sigma + \boldsymbol{\mu}\boldsymbol{\mu}^T$$
hence
$$E(S) = \Sigma + \boldsymbol{\mu}\boldsymbol{\mu}^T - \left(\frac{1}{n}\Sigma + \boldsymbol{\mu}\boldsymbol{\mu}^T\right) = \frac{n-1}{n}\Sigma$$
Therefore an unbiased estimator of $\Sigma$ is
$$S_u = \frac{n}{n-1}S = \frac{1}{n-1}X_0^T X_0 \tag{1.12}$$
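The two divisors in (1.12) can be compared directly in code. A small sketch (the sample is randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
X = rng.normal(size=(n, p))            # an arbitrary illustrative sample

X0 = X - X.mean(axis=0)                # centred data matrix
S = X0.T @ X0 / n                      # MLE-style estimate, divisor n
S_u = X0.T @ X0 / (n - 1)              # unbiased estimate (1.12), divisor n-1

# S_u = n/(n-1) * S, and matches np.cov's default ddof=1
print(np.allclose(S_u, n/(n - 1) * S))            # True
print(np.allclose(S_u, np.cov(X, rowvar=False)))  # True
```

Note that the unbiasedness in (1.11) is a statement about expectations over repeated samples; for a single sample, $S$ and $S_u$ simply differ by the factor $n/(n-1)$.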

## 1.7 Linear transformations

Let $\mathbf{x} = (x_1, \ldots, x_p)^T$ be a random $p$-vector. It is often natural and useful to consider linear combinations of the components of $\mathbf{x}$, such as $y = x_1 + x_2$. In general we consider a transformation from the $p$-component vector $\mathbf{x}$ to a $q$-component vector $\mathbf{y}$ ($q < p$) given by
$$\mathbf{y} = A\mathbf{x} + \mathbf{b} \tag{1.13}$$
where $A$ $(q \times p)$ and $\mathbf{b}$ $(q \times 1)$ are constant matrices. Suppose that $E(\mathbf{x}) = \boldsymbol{\mu}$ and $V(\mathbf{x}) = \Sigma$; the corresponding expressions for $\mathbf{y}$ are
$$E(\mathbf{y}) = A\boldsymbol{\mu} + \mathbf{b} \tag{1.14a}$$
$$V(\mathbf{y}) = A\Sigma A^T \tag{1.14b}$$
These follow from the linearity of the expectation operator:
$$E(\mathbf{y}) = E(A\mathbf{x} + \mathbf{b}) = AE(\mathbf{x}) + \mathbf{b} = A\boldsymbol{\mu} + \mathbf{b} = \boldsymbol{\mu}_y, \text{ say}$$
and
$$V(\mathbf{y}) = E\left(\mathbf{y}\mathbf{y}^T\right) - \boldsymbol{\mu}_y\boldsymbol{\mu}_y^T = E\left[(A\mathbf{x} + \mathbf{b})(A\mathbf{x} + \mathbf{b})^T\right] - (A\boldsymbol{\mu} + \mathbf{b})(A\boldsymbol{\mu} + \mathbf{b})^T$$
Expanding both products, the terms involving $\mathbf{b}$ cancel, leaving
$$V(\mathbf{y}) = A\left[E\left(\mathbf{x}\mathbf{x}^T\right) - \boldsymbol{\mu}\boldsymbol{\mu}^T\right]A^T = A\Sigma A^T$$
as required.

## 1.8 The Mahalanobis transformation

Given a $p$-variate random variable $\mathbf{x}$ with $E(\mathbf{x}) = \boldsymbol{\mu}$ and $V(\mathbf{x}) = \Sigma$, a transformation to a standardized set of uncorrelated variates is given by the Mahalanobis transformation. Suppose $\Sigma$ is positive definite, i.e. there is no exact linear dependence in $\mathbf{x}$. Then the inverse covariance matrix $\Sigma^{-1}$ has a "square root" given by
$$\Sigma^{-1/2} = V\Lambda^{-1/2}V^T \tag{1.15}$$
where $\Sigma = V\Lambda V^T$ is the spectral decomposition (see handout), i.e. $V$ is an orthogonal matrix ($V^TV = VV^T = I_p$) whose columns are the eigenvectors of $\Sigma$, and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)$ holds the corresponding eigenvalues. The Mahalanobis transformation takes the form
$$\mathbf{z} = \Sigma^{-1/2}(\mathbf{x} - \boldsymbol{\mu}) \tag{1.16}$$

Using results (1.14a) and (1.14b) we can show that
$$E(\mathbf{z}) = \mathbf{0}, \qquad V(\mathbf{z}) = I_p$$

*Proof.*
$$E(\mathbf{z}) = E\left[\Sigma^{-1/2}(\mathbf{x} - \boldsymbol{\mu})\right] = \Sigma^{-1/2}\left[E(\mathbf{x}) - \boldsymbol{\mu}\right] = \mathbf{0}$$
$$V(\mathbf{z}) = \Sigma^{-1/2}\,\Sigma\,\Sigma^{-1/2} = I_p$$

### 1.8.1 Sample Mahalanobis transformation

Given a data matrix $X^T = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$, the sample Mahalanobis transformation
$$\mathbf{z}_i = S^{-1/2}(\mathbf{x}_i - \bar{\mathbf{x}}) \quad \text{for } i = 1, \ldots, n$$
where $S = S_x = \frac{1}{n}X^THX$ is the sample covariance matrix, creates a transformed data matrix $Z^T = (\mathbf{z}_1, \ldots, \mathbf{z}_n)$. The data matrices are related by
$$Z^T = S^{-1/2}X^TH \quad \text{or} \quad Z = HXS^{-1/2} \tag{1.17}$$
where $H$ is the centring matrix. We may easily show (exercise) that $Z$ is centred and that $S_z = I_p$.

### 1.8.2 Sample scaling transformation

A transformation of the data that scales each variable to have mean zero and variance one but preserves the correlation structure is given by
$$\mathbf{y}_i = D^{-1}(\mathbf{x}_i - \bar{\mathbf{x}}) \quad \text{for } i = 1, \ldots, n$$
where $D = \operatorname{diag}(s_1, \ldots, s_p)$. Now
$$Y^T = D^{-1}X^TH \quad \text{or} \quad Y = HXD^{-1} \tag{1.18}$$
Exercise: show that $S_y = R_x$.

### 1.8.3 A useful matrix identity

Let $\mathbf{u}, \mathbf{v}$ be $n \times 1$ vectors and form the $n \times n$ matrix $A = \mathbf{u}\mathbf{v}^T$. Then
$$|I + \mathbf{u}\mathbf{v}^T| = 1 + \mathbf{v}^T\mathbf{u} \tag{1.19}$$

*Proof.* First observe that $A$ and $I + A$ share a common set of eigenvectors, since $A\mathbf{w} = \lambda\mathbf{w} \Rightarrow (I + A)\mathbf{w} = (1 + \lambda)\mathbf{w}$. Moreover the eigenvalues of $I + A$ are $1 + \lambda_i$, where $\lambda_i$ are the eigenvalues of $A$. Now $\mathbf{u}\mathbf{v}^T$ is a rank-one matrix and therefore has a single nonzero eigenvalue (see handout). Since $(\mathbf{u}\mathbf{v}^T)\mathbf{u} = \mathbf{u}(\mathbf{v}^T\mathbf{u}) = \lambda\mathbf{u}$ where $\lambda = \mathbf{v}^T\mathbf{u}$, the eigenvalues of $I + \mathbf{u}\mathbf{v}^T$ are $1 + \lambda, 1, \ldots, 1$. The determinant of $I + \mathbf{u}\mathbf{v}^T$ is the product of the eigenvalues, hence the result. ∎
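Both the sample Mahalanobis transformation (1.17) and the rank-one determinant identity (1.19) can be verified numerically. A sketch with a randomly generated sample (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # a correlated sample

H = np.eye(n) - np.ones((n, n))/n          # centring matrix
S = X.T @ H @ X / n                        # sample covariance

# Inverse square root S^{-1/2} = V diag(lambda^{-1/2}) V^T as in (1.15)
lam, V = np.linalg.eigh(S)
S_inv_half = V @ np.diag(lam**-0.5) @ V.T

Z = H @ X @ S_inv_half                     # sample Mahalanobis transform (1.17)
S_z = Z.T @ Z / n
print(np.allclose(S_z, np.eye(p)))         # True: transformed data have S_z = I_p

# The identity (1.19): |I + u v^T| = 1 + v^T u
u, v = rng.normal(size=p), rng.normal(size=p)
print(np.isclose(np.linalg.det(np.eye(p) + np.outer(u, v)), 1 + v @ u))  # True
```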

# 2. Principal Components Analysis

## 2.1 Outline of technique

Let $\mathbf{x} = (x_1, x_2, \ldots, x_p)^T$ be a random vector with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$. PCA is a technique for dimensionality reduction from $p$ dimensions to $k < p$ dimensions. It tries to find, in order, the $k$ most informative linear combinations $y_1, y_2, \ldots, y_k$ of a set of variables. Here "information" will be interpreted as a percentage of the total variation (as previously defined) in $\Sigma$. The $k$ sample PCs that "explain" a given percentage of the total variation in a sample covariance matrix $S$ may be similarly defined.

## 2.2 Formulation

Let
$$y_1 = \mathbf{a}_1^T\mathbf{x}, \quad y_2 = \mathbf{a}_2^T\mathbf{x}, \quad \ldots, \quad y_p = \mathbf{a}_p^T\mathbf{x}$$
where the $y_j = a_{1j}x_1 + a_{2j}x_2 + \cdots + a_{pj}x_p$ are a sequence of standardized linear combinations (SLCs) of the $x$'s such that $\mathbf{a}_j^T\mathbf{a}_j = 1$ and $\mathbf{a}_j^T\mathbf{a}_k = 0$ for $j \ne k$; i.e. $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_p$ form an orthonormal set of $p$-vectors. Equivalently we may define $A$, the $p \times p$ matrix formed from the columns $\{\mathbf{a}_j\}$, as an orthogonal matrix, so that $A^TA = AA^T = I_p$.

We choose $\mathbf{a}_1$ to maximize $\operatorname{Var}(y_1) = \mathbf{a}_1^T\Sigma\mathbf{a}_1$ subject to $\mathbf{a}_1^T\mathbf{a}_1 = 1$. Then we choose $\mathbf{a}_2$ to maximize $\operatorname{Var}(y_2) = \mathbf{a}_2^T\Sigma\mathbf{a}_2$ subject to $\mathbf{a}_2^T\mathbf{a}_2 = 1$ and $\mathbf{a}_2^T\mathbf{a}_1 = 0$, which ensures that $y_2$ will be uncorrelated with $y_1$. Subsequent PCs are chosen as the SLCs that have maximum variance subject to being uncorrelated with all previous PCs.

NB. Sometimes the PCs are taken to be "mean-corrected" linear transformations of the $x$'s, i.e. $y_j = \mathbf{a}_j^T(\mathbf{x} - \boldsymbol{\mu})$, emphasizing that the PCs can be considered as direction vectors in $p$-space, relative to the "centre" of a distribution, along which the spread is maximized. In any case $\operatorname{Var}(y_j)$ is the same whichever definition is used.

## 2.3 Computation

To find the first PC we use the Lagrange multiplier technique for finding the maximum of a function $f(\mathbf{x})$ subject to an equality constraint $g(\mathbf{x}) = 0$. We define the Lagrangean function
$$L(\mathbf{a}_1) = \mathbf{a}_1^T\Sigma\mathbf{a}_1 - \lambda\left(\mathbf{a}_1^T\mathbf{a}_1 - 1\right)$$
where $\lambda$ is a Lagrange multiplier. Differentiating with respect to $\mathbf{a}_1$, we obtain
$$2\Sigma\mathbf{a}_1 - 2\lambda\mathbf{a}_1 = 0 \quad\Rightarrow\quad \Sigma\mathbf{a}_1 = \lambda\mathbf{a}_1$$
Therefore $\mathbf{a}_1$ should be chosen to be an eigenvector of $\Sigma$ with eigenvalue $\lambda$. Suppose the eigenvalues of $\Sigma$ are distinct and ranked in decreasing order $\lambda_1 > \lambda_2 > \cdots > \lambda_p > 0$. Then
$$\operatorname{Var}(y_1) = \mathbf{a}_1^T\Sigma\mathbf{a}_1 = \lambda\,\mathbf{a}_1^T\mathbf{a}_1 = \lambda$$
Therefore $\mathbf{a}_1$ should be chosen as the eigenvector corresponding to the largest eigenvalue $\lambda_1$ of $\Sigma$.

**Second PC.** The Lagrangean is
$$L(\mathbf{a}_2) = \mathbf{a}_2^T\Sigma\mathbf{a}_2 - \lambda\left(\mathbf{a}_2^T\mathbf{a}_2 - 1\right) - \mu\,\mathbf{a}_2^T\mathbf{a}_1$$
where $\lambda, \mu$ are Lagrange multipliers. Differentiating,
$$2\Sigma\mathbf{a}_2 - 2\lambda\mathbf{a}_2 - \mu\mathbf{a}_1 = 0$$
Premultiplying by $\mathbf{a}_1^T$, and using $\mathbf{a}_1^T\mathbf{a}_2 = 0$, gives $2\mathbf{a}_1^T\Sigma\mathbf{a}_2 - \mu = 0$. However
$$\mathbf{a}_1^T\Sigma\mathbf{a}_2 = \mathbf{a}_2^T\Sigma\mathbf{a}_1 = \lambda_1\mathbf{a}_2^T\mathbf{a}_1 = 0$$
so $\mu = 0$ and
$$\Sigma\mathbf{a}_2 = \lambda\mathbf{a}_2$$
so $\mathbf{a}_2$ is the eigenvector of $\Sigma$ corresponding to the second largest eigenvalue $\lambda_2$.
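In practice the whole sequence of PCs is obtained at once from a single eigendecomposition rather than by repeated constrained maximization. A sketch (the covariance matrix is an assumed example):

```python
import numpy as np

# An assumed 3x3 covariance matrix, for illustration only
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

lam, A = np.linalg.eigh(Sigma)        # eigh returns eigenvalues in ascending order
order = np.argsort(lam)[::-1]         # re-sort into decreasing order
lam, A = lam[order], A[:, order]      # columns of A are the loadings a_1, a_2, ...

# Each a_j is a unit vector and Var(y_j) = a_j^T Sigma a_j = lambda_j
for j in range(3):
    a = A[:, j]
    print(np.isclose(a @ a, 1.0), np.isclose(a @ Sigma @ a, lam[j]))  # True True
```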

## 2.4 Example

The covariance matrix corresponding to scaled (standardized) variables $x_1, x_2$ is
$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$
(in fact a correlation matrix). Note $\Sigma$ has total variation 2. The eigenvalues of $\Sigma$ are the roots of
$$|\Sigma - \lambda I| = 0 \quad\Rightarrow\quad (1 - \lambda)^2 - \rho^2 = 0$$
Hence $\lambda = 1 + \rho,\ 1 - \rho$. If $\rho > 0$ then $\lambda_1 = 1 + \rho$ and $\lambda_2 = 1 - \rho$. To find $\mathbf{a}_1$ we substitute $\lambda_1$ into $\Sigma\mathbf{a}_1 = \lambda_1\mathbf{a}_1$. Note: this gives just one equation in terms of the components of $\mathbf{a}_1^T = (a_{11}, a_{21})$, namely $a_{11} = a_{21}$. Applying the normalization $\mathbf{a}_1^T\mathbf{a}_1 = a_{11}^2 + a_{21}^2 = 1$, we obtain
$$\mathbf{a}_1 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$$
and similarly
$$\mathbf{a}_2 = \begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}$$
so that
$$y_1 = \frac{1}{\sqrt{2}}(x_1 + x_2), \qquad y_2 = \frac{1}{\sqrt{2}}(x_1 - x_2)$$
are the PCs, explaining respectively $50(1 + \rho)\%$ and $50(1 - \rho)\%$ of the total variation. Notice that the PCs are independent of $\rho$, while the proportion of the total variation explained by each PC does depend on $\rho$.
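This $2 \times 2$ case is easy to check numerically for a particular correlation (the value $\rho = 0.5$ below is an assumed choice):

```python
import numpy as np

rho = 0.5                              # an assumed value of the correlation
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])

lam, A = np.linalg.eigh(Sigma)         # ascending: 1 - rho, then 1 + rho
print(np.allclose(lam, [1 - rho, 1 + rho]))         # True

# First PC loading is (1, 1)/sqrt(2) up to sign
print(np.allclose(np.abs(A[:, 1]), 1/np.sqrt(2)))   # True

# Proportion of total variation explained by the first PC: (1 + rho)/2
prop_first = lam.max() / lam.sum()
print(np.isclose(prop_first, (1 + rho)/2))          # True
```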

## 2.5 PCA and spectral decomposition

Since $\Sigma$ (also $S$) is a real symmetric matrix, we know that it has the spectral decomposition (eigenanalysis)
$$\Sigma = A\Lambda A^T = \sum_{i=1}^p \lambda_i\,\mathbf{a}_i\mathbf{a}_i^T$$
where the $\{\mathbf{a}_i\}$ are the eigenvectors of $\Sigma$, which we have inserted as columns of the $(p \times p)$ matrix $A$, and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ are the corresponding eigenvalues. If some eigenvalues are not distinct, say $\lambda_k = \lambda_{k+1} = \cdots = \lambda_l$, the eigenvectors are not unique, but we may choose an orthonormal set of eigenvectors to span a subspace of dimension $l - k + 1$ (cf. the major/minor axes of the ellipse $\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$ as $b \to a$). Such a situation arises with the equicorrelation matrix (see Class Exercise).

The transformation of a random $p$-vector $\mathbf{x}$ (corrected for its mean $\boldsymbol{\mu}$) to its vector $\mathbf{y}$ of principal components (PCs) is
$$\mathbf{y} = A^T(\mathbf{x} - \boldsymbol{\mu})$$
$y_1$ is the linear combination (SLC) of $\mathbf{x}$ having maximum variance; $y_2$ is the SLC having maximum variance subject to being uncorrelated with $y_1$; etc. We have seen that $\operatorname{Var}(y_1) = \lambda_1$, $\operatorname{Var}(y_2) = \lambda_2, \ldots$

## 2.6 Explanation of variance

The interpretation of the PCs ($\mathbf{y}$) as components of variance "explaining" the total variation, i.e. the sum of the variances of the original variables ($\mathbf{x}$), is clarified by the following result.

**Result.** The sums of the variances of the original variables and of their PCs are the same.

*Proof.* A note on the trace: the sum of diagonal elements of a $(p \times p)$ square matrix $\Sigma$ is known as the trace,
$$\operatorname{tr}(\Sigma) = \sum_{i=1}^p \sigma_{ii}$$
We show from this definition that $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ whenever $AB$ and $BA$ are defined [i.e. $A$ is $(m \times n)$ and $B$ is $(n \times m)$]:
$$\operatorname{tr}(AB) = \sum_i\sum_j a_{ij}b_{ji} = \sum_j\sum_i b_{ji}a_{ij} = \operatorname{tr}(BA)$$

The sum of the variances of the PCs is
$$\sum_i \operatorname{Var}(y_i) = \sum_i \lambda_i = \operatorname{tr}(\Lambda)$$
Now $\Sigma = A\Lambda A^T$ is the spectral decomposition and $A$ is orthogonal, so $A^TA = I_p$; hence
$$\operatorname{tr}(\Sigma) = \operatorname{tr}\left(A\Lambda A^T\right) = \operatorname{tr}\left(\Lambda A^TA\right) = \operatorname{tr}(\Lambda)$$
Since $\Sigma$ is the covariance matrix of $\mathbf{x}$, the sum of its diagonal elements is the sum of the variances $\sigma_{ii}$ of the original variables. Hence the result is proved. ∎

**Consequence (interpretation of PCs).** It is therefore possible to interpret
$$\frac{\lambda_i}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}$$
as the proportion of the total variation in the original data explained by the $i$th principal component, and
$$\frac{\lambda_1 + \cdots + \lambda_k}{\lambda_1 + \cdots + \lambda_p}$$
as the proportion of the total variation explained by the first $k$ PCs. From a PCA on a $(10 \times 10)$ sample covariance matrix $S$, we could for example conclude that the first 3 PCs (out of a total of $p = 10$ PCs) account for 80% of the total variation in the data. This would mean that the variation in the data is largely confined to a 3-dimensional subspace described by the PCs $y_1, y_2, y_3$.

## 2.7 Scale invariance

This, unfortunately, is a property that PCA does not possess! In practice we often have to choose units of measurement for our individual variables $\{x_i\}$, and the amount of the total variation accounted for by a particular variable $x_i$ depends on this choice (tonnes, kg or grams). In a practical study the data vector $\mathbf{x}$ often comprises physically incomparable quantities (e.g. height, weight, temperature), so there is no "natural scaling" to adopt. One possibility is to perform PCA on the correlation matrix (effectively choosing each variable to have unit sample variance), but this is still an implicit choice of scaling. The main point is that the results of a PCA depend on the scaling adopted.

## 2.8 Principal component scores

The sample PC transformation on a data matrix $X$ takes the form, for the $r$th individual ($r$th row of the sample),
$$\mathbf{y}_r = A^T(\mathbf{x}_r - \bar{\mathbf{x}})$$
where the columns of $A$ are the eigenvectors of the sample covariance matrix $S$. Notice that the first component $y_{r1}$ corresponds to the scalar product of the first column of $A$ with $\mathbf{x}_r^0 = \mathbf{x}_r - \bar{\mathbf{x}}$, etc. The components of $\mathbf{y}_r$ are known as the (mean-corrected) principal component scores for the $r$th individual. The quantities $A^T\mathbf{x}_r$ are the raw PC scores for that individual. Geometrically, the PC scores are the coordinates of each data point with respect to new axes defined by the PCs, i.e. w.r.t. a rotated frame of reference. The scores can provide qualitative information about individuals.

## 2.9 Correlation of PCs with original variables

The correlations $\rho(x_i, y_k)$ of the $k$th PC with variable $x_i$ are an aid to interpreting the PCs. Since $\mathbf{y} = A^T(\mathbf{x} - \boldsymbol{\mu})$ we have
$$\operatorname{Cov}(\mathbf{x}, \mathbf{y}) = E\left[(\mathbf{x} - \boldsymbol{\mu})\mathbf{y}^T\right] = E\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\right]A = \Sigma A$$
and from the spectral decomposition
$$\Sigma A = A\Lambda A^TA = A\Lambda$$
Post-multiplying $A$ by a diagonal matrix has the effect of scaling its columns, so that
$$\operatorname{Cov}(x_i, y_k) = \lambda_k a_{ik}$$
is the covariance between the $i$th variable and the $k$th PC. The correlation
$$\rho(x_i, y_k) = \frac{\operatorname{Cov}(x_i, y_k)}{\sqrt{\operatorname{Var}(x_i)\operatorname{Var}(y_k)}} = \frac{\lambda_k a_{ik}}{\sqrt{\sigma_{ii}\lambda_k}} = a_{ik}\sqrt{\frac{\lambda_k}{\sigma_{ii}}}$$
and $\rho^2(x_i, y_k)$ can be interpreted as the proportion of the variation in $x_i$ explained by the $k$th PC.
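Both the scores and the variable/PC correlations can be checked on simulated data. A sketch (the mixing matrix used to induce correlation is an assumed example):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3)) @ np.array([[1.0, 0.5, 0.0],
                                         [0.0, 1.0, 0.3],
                                         [0.0, 0.0, 1.0]])  # assumed mixing
n, p = X.shape
X0 = X - X.mean(axis=0)
S = X0.T @ X0 / n

lam, A = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
lam, A = lam[order], A[:, order]

Y = X0 @ A                   # rows are the mean-corrected PC scores y_r^T
S_y = Y.T @ Y / n            # sample covariance of the scores

# Scores are uncorrelated, with variances lambda_1 >= lambda_2 >= ...
print(np.allclose(S_y, np.diag(lam)))             # True

# Correlation of variable i with PC k: a_ik * sqrt(lambda_k / s_ii)
corr = A * np.sqrt(lam) / np.sqrt(np.diag(S))[:, None]
# Squared correlations sum to 1 across PCs: together they fully explain each x_i
print(np.allclose((corr**2).sum(axis=1), 1.0))    # True
```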

**Exercise.** Find the PCs of the covariance matrix
$$\Sigma = \begin{pmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{pmatrix}$$
and show that they account for amounts 5.83, 2.00 and 0.17 of the total variation in $\Sigma$. Compute the correlations $\rho(x_i, y_k)$ and try to interpret the PCs qualitatively.

# 3. Multivariate Normal Distribution

The MVN distribution is a generalization of the univariate normal distribution, which has the density function (p.d.f.)
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}, \quad -\infty < x < \infty$$
where $\mu$ is the mean of the distribution and $\sigma^2$ the variance. In $p$ dimensions the density becomes
$$f(\mathbf{x}) = (2\pi)^{-p/2}\,|\Sigma|^{-1/2}\exp\left\{-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\right\} \tag{3.1}$$
Within the mean vector $\boldsymbol{\mu}$ there are $p$ (independent) parameters, and within the symmetric covariance matrix $\Sigma$ there are $\frac{1}{2}p(p+1)$ independent parameters [$\frac{1}{2}p(p+3)$ independent parameters in total]. We use the notation
$$\mathbf{x} \sim N_p(\boldsymbol{\mu}, \Sigma) \tag{3.2}$$
to denote a random vector $\mathbf{x}$ having the MVN distribution with $E(\mathbf{x}) = \boldsymbol{\mu}$ and $\operatorname{Cov}(\mathbf{x}) = \Sigma$. Note that MVN distributions are entirely characterized by the first and second moments of the distribution.

**Basic properties.** If $\mathbf{x}$ $(p \times 1)$ is MVN with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$:

1. Any linear combination of $\mathbf{x}$ is MVN. Let $\mathbf{y} = A\mathbf{x} + \mathbf{c}$ with $A$ $(q \times p)$ and $\mathbf{c}$ $(q \times 1)$; then $\mathbf{y} \sim N_q(\boldsymbol{\mu}_y, \Sigma_y)$, where $\boldsymbol{\mu}_y = A\boldsymbol{\mu} + \mathbf{c}$ and $\Sigma_y = A\Sigma A^T$.
2. Any subset of variables in $\mathbf{x}$ has a MVN distribution.
3. If a set of variables is uncorrelated, then they are independently distributed. In particular: i) if $\sigma_{ij} = 0$ then $x_i, x_j$ are independent; ii) if $\mathbf{x}$ is MVN with covariance matrix $\Sigma$, then $A\mathbf{x}$ and $B\mathbf{x}$ are independent if and only if
$$\operatorname{Cov}(A\mathbf{x}, B\mathbf{x}) = A\Sigma B^T = 0 \tag{3.3}$$
4. Conditional distributions are MVN.

**Result.** For the MVN distribution, variables are uncorrelated $\Leftrightarrow$ variables are independent.

*Proof.* Let $\mathbf{x}$ $(p \times 1)$ be partitioned as
$$\mathbf{x} = \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix}\begin{matrix} (q \times 1) \\ ((p - q) \times 1) \end{matrix}$$
with mean vector $\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}$ and covariance matrix
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$

i) Independent $\Rightarrow$ uncorrelated (always holds). Suppose $\mathbf{x}_1, \mathbf{x}_2$ are independent. Then $\operatorname{Cov}(\mathbf{x}_1, \mathbf{x}_2) = E\left[(\mathbf{x}_1 - \boldsymbol{\mu}_1)(\mathbf{x}_2 - \boldsymbol{\mu}_2)^T\right]$ factorizes into the product of $E\left[(\mathbf{x}_1 - \boldsymbol{\mu}_1)\right]$ and $E\left[(\mathbf{x}_2 - \boldsymbol{\mu}_2)^T\right]$, which are both zero since $E(\mathbf{x}_1) = \boldsymbol{\mu}_1$ and $E(\mathbf{x}_2) = \boldsymbol{\mu}_2$. Hence $\Sigma_{12} = 0$.

ii) Uncorrelated $\Rightarrow$ independent (for MVN). This result depends on factorizing the p.d.f. (3.1) when $\Sigma_{12} = 0$. In this case $(\mathbf{x} - \boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})$ has the partitioned form
$$\begin{pmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{pmatrix}^T\begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix}\begin{pmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{pmatrix} = (\mathbf{x}_1 - \boldsymbol{\mu}_1)^T\Sigma_{11}^{-1}(\mathbf{x}_1 - \boldsymbol{\mu}_1) + (\mathbf{x}_2 - \boldsymbol{\mu}_2)^T\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$$
so that $\exp\left\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\right\}$ factorizes into the product of $\exp\left\{-\frac{1}{2}(\mathbf{x}_1 - \boldsymbol{\mu}_1)^T\Sigma_{11}^{-1}(\mathbf{x}_1 - \boldsymbol{\mu}_1)\right\}$ and $\exp\left\{-\frac{1}{2}(\mathbf{x}_2 - \boldsymbol{\mu}_2)^T\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)\right\}$. Therefore the p.d.f. can be written as
$$f(\mathbf{x}) = g(\mathbf{x}_1)\,h(\mathbf{x}_2)$$
proving that $\mathbf{x}_1$ and $\mathbf{x}_2$ are independent. ∎

**Result.** Let $\mathbf{x} = \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix}$, partitioned as above, be MVN with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$. The conditional distribution of $\mathbf{x}_2$ given $\mathbf{x}_1$ is MVN with
$$E(\mathbf{x}_2|\mathbf{x}_1) = \boldsymbol{\mu}_2 + \Sigma_{21}\Sigma_{11}^{-1}(\mathbf{x}_1 - \boldsymbol{\mu}_1) \tag{3.4a}$$
$$\operatorname{Cov}(\mathbf{x}_2|\mathbf{x}_1) = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \tag{3.4b}$$

*Proof.* Let $\mathbf{x}_0 = \mathbf{x}_2 - \Sigma_{21}\Sigma_{11}^{-1}\mathbf{x}_1$. We first show that $\mathbf{x}_0$ and $\mathbf{x}_1$ are independent. Consider the linear transformation
$$\begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_0 \end{pmatrix} = \begin{pmatrix} I & 0 \\ -\Sigma_{21}\Sigma_{11}^{-1} & I \end{pmatrix}\begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix} = A\mathbf{x}, \text{ say} \tag{3.5}$$
This linear relationship shows that $\mathbf{x}_1, \mathbf{x}_0$ are jointly MVN (by the first property of the MVN stated above). We may show that $\mathbf{x}_1$ and $\mathbf{x}_0$ are uncorrelated in two ways. Firstly,
$$\operatorname{Cov}(\mathbf{x}_1, \mathbf{x}_0) = \operatorname{Cov}(\mathbf{x}_1, \mathbf{x}_2) - \operatorname{Cov}(\mathbf{x}_1, \mathbf{x}_1)\Sigma_{11}^{-1}\Sigma_{12} = \Sigma_{12} - \Sigma_{11}\Sigma_{11}^{-1}\Sigma_{12} = 0$$
or, writing $B = (I\ \ 0)$ and $C = (-\Sigma_{21}\Sigma_{11}^{-1}\ \ I)$ for the two block rows of $A$ in (3.5) and applying (3.3),
$$\operatorname{Cov}(\mathbf{x}_1, \mathbf{x}_0) = \operatorname{Cov}(B\mathbf{x}, C\mathbf{x}) = B\Sigma C^T = (\Sigma_{11}\ \ \Sigma_{12})\begin{pmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{pmatrix} = -\Sigma_{12} + \Sigma_{12} = 0$$
Since jointly MVN and uncorrelated, $\mathbf{x}_0$ and $\mathbf{x}_1$ are independent. Therefore
$$E(\mathbf{x}_0|\mathbf{x}_1) = E(\mathbf{x}_0) = \boldsymbol{\mu}_2 - \Sigma_{21}\Sigma_{11}^{-1}\boldsymbol{\mu}_1$$
Now since $\mathbf{x}_2 = \mathbf{x}_0 + \Sigma_{21}\Sigma_{11}^{-1}\mathbf{x}_1$,
$$E(\mathbf{x}_2|\mathbf{x}_1) = E(\mathbf{x}_0|\mathbf{x}_1) + \Sigma_{21}\Sigma_{11}^{-1}\mathbf{x}_1 = \boldsymbol{\mu}_2 + \Sigma_{21}\Sigma_{11}^{-1}(\mathbf{x}_1 - \boldsymbol{\mu}_1)$$
as required. Because $\mathbf{x}_1$ and $\mathbf{x}_0$ are independent,
$$\operatorname{Cov}(\mathbf{x}_0|\mathbf{x}_1) = \operatorname{Cov}(\mathbf{x}_0)$$

Conditional on $\mathbf{x}_1$, a given constant, $\mathbf{x}_0 = \mathbf{x}_2 - \Sigma_{21}\Sigma_{11}^{-1}\mathbf{x}_1$, i.e. $\mathbf{x}_0$ and $\mathbf{x}_2$ differ by a constant. Hence
$$\operatorname{Cov}(\mathbf{x}_2|\mathbf{x}_1) = \operatorname{Cov}(\mathbf{x}_0|\mathbf{x}_1) = \operatorname{Cov}(\mathbf{x}_0)$$
Therefore, with $C = (-\Sigma_{21}\Sigma_{11}^{-1}\ \ I)$ so that $\mathbf{x}_0 = C\mathbf{x}$,
$$\operatorname{Cov}(\mathbf{x}_2|\mathbf{x}_1) = C\Sigma C^T = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$$
as required. ∎

**Example.** Let $\mathbf{x}$ have a MVN distribution, partitioned with $\mathbf{x}_1 = (x_1, x_2)^T$ and $\mathbf{x}_2 = x_3$. Show that the conditional distribution of $(x_1, x_2)$ given $x_3$ is also MVN, with mean
$$\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \frac{x_3 - \mu_3}{\sigma_{33}}\begin{pmatrix} \sigma_{13} \\ \sigma_{23} \end{pmatrix}$$
and covariance matrix
$$\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} - \frac{1}{\sigma_{33}}\begin{pmatrix} \sigma_{13} \\ \sigma_{23} \end{pmatrix}\begin{pmatrix} \sigma_{13} & \sigma_{23} \end{pmatrix}$$
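The key algebraic steps of the proof can be verified numerically for a concrete partition. A sketch (the positive definite $\Sigma$ below is an assumed example, partitioned with $q = 1$):

```python
import numpy as np

# An assumed positive definite covariance matrix, partitioned x1 = (x_1), x2 = (x_2, x_3)
Sigma = np.array([[2.0, 0.8, 0.6],
                  [0.8, 1.5, 0.4],
                  [0.6, 0.4, 1.0]])
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

# Conditional covariance (3.4b): the Schur complement of Sigma_11
cond_cov = S22 - S21 @ np.linalg.inv(S11) @ S12

# Check against the proof's construction: x0 = x2 - S21 S11^{-1} x1 = C x,
# with Cov(x0) = C Sigma C^T
C = np.hstack([-S21 @ np.linalg.inv(S11), np.eye(2)])
print(np.allclose(C @ Sigma @ C.T, cond_cov))   # True

# x0 is uncorrelated with x1: B Sigma C^T = 0 with B = (I 0)
B = np.hstack([np.eye(1), np.zeros((1, 2))])
print(np.allclose(B @ Sigma @ C.T, 0))          # True
```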

## 3.1 Maximum-likelihood estimation

Let $X^T = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ contain an independent random sample of size $n$ from $N_p(\boldsymbol{\mu}, \Sigma)$. The maximum likelihood estimates (MLEs) of $\boldsymbol{\mu}, \Sigma$ are
$$\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} \tag{3.6a}$$
$$\hat{\Sigma} = S \tag{3.6b}$$
The likelihood function is a function of the parameters $\boldsymbol{\mu}, \Sigma$ given the data $X$:
$$L(\boldsymbol{\mu}, \Sigma|X) = \prod_{r=1}^n f(\mathbf{x}_r|\boldsymbol{\mu}, \Sigma) \tag{3.7}$$
The RHS is evaluated by substituting the individual data vectors $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ in turn into the p.d.f. of $N_p(\boldsymbol{\mu}, \Sigma)$ and taking the product:
$$\prod_{r=1}^n f(\mathbf{x}_r|\boldsymbol{\mu}, \Sigma) = (2\pi)^{-np/2}\,|\Sigma|^{-n/2}\exp\left\{-\frac{1}{2}\sum_{r=1}^n (\mathbf{x}_r - \boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_r - \boldsymbol{\mu})\right\}$$
Maximizing $L$ is equivalent to minimizing
$$l = -2\log L = -2\sum_{r=1}^n \log f(\mathbf{x}_r|\boldsymbol{\mu}, \Sigma) = K + n\log|\Sigma| + \sum_{r=1}^n (\mathbf{x}_r - \boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_r - \boldsymbol{\mu})$$
where $K$ is a constant independent of $\boldsymbol{\mu}, \Sigma$. Noting that $\mathbf{x}_r - \boldsymbol{\mu} = (\mathbf{x}_r - \bar{\mathbf{x}}) + (\bar{\mathbf{x}} - \boldsymbol{\mu})$, the final term above may be written
$$\sum_{r=1}^n (\mathbf{x}_r - \bar{\mathbf{x}})^T\Sigma^{-1}(\mathbf{x}_r - \bar{\mathbf{x}}) + \sum_{r=1}^n (\mathbf{x}_r - \bar{\mathbf{x}})^T\Sigma^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) + \sum_{r=1}^n (\bar{\mathbf{x}} - \boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_r - \bar{\mathbf{x}}) + n(\bar{\mathbf{x}} - \boldsymbol{\mu})^T\Sigma^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu})$$
where the two cross terms vanish since $\sum_r (\mathbf{x}_r - \bar{\mathbf{x}}) = 0$. Thus, up to an additive constant,
$$l(\boldsymbol{\mu}, \Sigma) = n\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}A\right) + n\,\mathbf{d}^T\Sigma^{-1}\mathbf{d} \tag{3.8a}$$
$$= n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}S\right) + \mathbf{d}^T\Sigma^{-1}\mathbf{d}\right] \tag{3.8b}$$

where we define, for ease of notation,
$$A = nS \tag{3.9a}$$
$$\mathbf{d} = \bar{\mathbf{x}} - \boldsymbol{\mu} \tag{3.9b}$$
and $S$ is the sample covariance matrix (with divisor $n$). We have made use of $nS = C^TC$, where $C$ is the $(n \times p)$ centred data matrix
$$C^T = (\mathbf{x}_1 - \bar{\mathbf{x}},\ \mathbf{x}_2 - \bar{\mathbf{x}},\ \ldots,\ \mathbf{x}_n - \bar{\mathbf{x}})$$
We see that
$$\sum_{r=1}^n (\mathbf{x}_r - \bar{\mathbf{x}})^T\Sigma^{-1}(\mathbf{x}_r - \bar{\mathbf{x}}) = \operatorname{tr}\left(\Sigma^{-1}C^TC\right) = \operatorname{tr}\left(\Sigma^{-1}A\right) = n\operatorname{tr}\left(\Sigma^{-1}S\right)$$
Notice that $l = l(\boldsymbol{\mu}, \Sigma)$ and the dependence on $\boldsymbol{\mu}$ is entirely through $\mathbf{d}$ in (3.8). Now assume that $\Sigma$ is positive definite (p.d.); then so is $\Sigma^{-1}$ (why?). Thus for all $\mathbf{d} \ne 0$ we have $\mathbf{d}^T\Sigma^{-1}\mathbf{d} > 0$, showing that $l$ is minimized with respect to $\boldsymbol{\mu}$, for fixed $\Sigma$, when $\mathbf{d} = 0$. Hence
$$\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}$$
To minimize the log-likelihood $l(\hat{\boldsymbol{\mu}}, \Sigma)$ w.r.t. $\Sigma$, note that, up to an arbitrary additive constant,
$$l(\bar{\mathbf{x}}, \Sigma) = n\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}A\right) = n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}S\right)\right]$$
Let
$$\psi(\Sigma) = n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}S\right)\right] \tag{3.10}$$
We show that
$$\psi(\Sigma) - \psi(S) = n\left[\log|\Sigma| - \log|S| + \operatorname{tr}\left(\Sigma^{-1}S\right) - p\right] = n\left[\operatorname{tr}\left(\Sigma^{-1}S\right) - \log|\Sigma^{-1}S| - p\right] \ge 0 \tag{3.11}$$

**Lemma 1.** $\Sigma^{-1}S$ is positive definite. (Proved elsewhere.)

**Lemma 2.** For any set of positive numbers, $A \ge 1 + \log G$, where $A$ and $G$ are the arithmetic and geometric means respectively.

*Proof.* For all $x$ we have $e^x \ge 1 + x$ (simple exercise), so each member $y_i > 0$ of a set $\{y_i : i = 1, \ldots, n\}$ satisfies $y_i \ge 1 + \log y_i$. Therefore
$$A = \frac{1}{n}\sum y_i \ge 1 + \frac{1}{n}\sum \log y_i = 1 + \log\left(\prod y_i\right)^{1/n} = 1 + \log G$$
as required. ∎

In (3.11), recall that for any square matrix $A$ we have $\operatorname{tr}(A) = \sum_i \lambda_i$, the sum of the eigenvalues, and $|A| = \prod_i \lambda_i$, the product of the eigenvalues. Let $\lambda_i$ $(i = 1, \ldots, p)$ be the eigenvalues of $\Sigma^{-1}S$, which are positive by Lemma 1, and substitute into (3.11):
$$\log|\Sigma^{-1}S| = \log\prod_i \lambda_i = p\log G, \qquad \operatorname{tr}\left(\Sigma^{-1}S\right) = \sum_i \lambda_i = pA$$
$$\psi(\Sigma) - \psi(S) = np\left[A - \log G - 1\right] \ge 0$$
by Lemma 2. This shows that the MLEs are as stated in (3.6). ∎
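The minimizing property $\psi(\Sigma) \ge \psi(S)$ can be probed numerically by evaluating $\psi$ at $S$ and at a few other positive definite trial matrices. A sketch on simulated data (the trial matrices are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = rng.normal(size=(n, p))
X0 = X - X.mean(axis=0)
S = X0.T @ X0 / n                        # the MLE of Sigma

def psi(Sigma):
    # n * [log|Sigma| + tr(Sigma^{-1} S)], the profiled -2 log-likelihood (3.10)
    return n*(np.log(np.linalg.det(Sigma)) + np.trace(np.linalg.solve(Sigma, S)))

# psi is minimized at Sigma = S: other p.d. trial values give larger psi
trials = [np.eye(p), 2*S, S + 0.5*np.eye(p)]
print(all(psi(Sig) >= psi(S) - 1e-9 for Sig in trials))      # True
print(np.isclose(psi(S), n*(np.log(np.linalg.det(S)) + p)))  # True
```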

## 3.2 Sampling distributions of $\bar{\mathbf{x}}$ and $S$

**The Wishart distribution (definition).** If $M$ $(p \times p)$ can be written $M = X^TX$, where $X$ $(m \times p)$ is a data matrix from $N_p(\mathbf{0}, \Sigma)$, then $M$ is said to have a Wishart distribution with scale matrix $\Sigma$ and degrees of freedom $m$. We write
$$M \sim W_p(\Sigma, m) \tag{3.12}$$
When $\Sigma = I_p$ the distribution is said to be in standard form.

Note: the Wishart distribution is the multivariate generalization of the chi-square distribution.

**Additive property of matrices with a Wishart distribution.** Let $M_1 \sim W_p(\Sigma, m_1)$ and $M_2 \sim W_p(\Sigma, m_2)$ be independent; then
$$M_1 + M_2 \sim W_p(\Sigma, m_1 + m_2)$$
This property follows from the definition of the Wishart distribution, because data matrices are additive in the sense that if
$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$
is a combined data matrix consisting of $m_1 + m_2$ rows, then $X^TX = X_1^TX_1 + X_2^TX_2$ is the matrix (known as the "Gram matrix") formed from the combined data matrix $X$.

**Case $p = 1$.** When $p = 1$ we know, from the definition of $\chi^2_m$ as the distribution of the sum of squares of $m$ independent $N(0, 1)$ variates, that
$$M = \sum_{i=1}^m x_i^2 \sim \sigma^2\chi^2_m, \quad \text{so that} \quad W_1(\sigma^2, m) = \sigma^2\chi^2_m$$

**Sampling distributions.** Let $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ be a random sample of size $n$ from $N_p(\boldsymbol{\mu}, \Sigma)$. Then:

1. The sample mean $\bar{\mathbf{x}}$ has the normal distribution $\bar{\mathbf{x}} \sim N_p\left(\boldsymbol{\mu}, \frac{1}{n}\Sigma\right)$.
2. The sample covariance matrix (the MLE $S = \frac{1}{n}C^TC$) has the Wishart distribution $nS \sim W_p(\Sigma, n - 1)$.
3. The distributions of $\bar{\mathbf{x}}$ and $S$ are independent.

## 3.3 Estimators for special circumstances

### 3.3.1 $\boldsymbol{\mu}$ proportional to a given vector

Sometimes $\boldsymbol{\mu}$ is known to be proportional to a given vector, so $\boldsymbol{\mu} = k\boldsymbol{\mu}_0$. For example, if $\mathbf{x}$ represents a sample of repeated measurements, then $\boldsymbol{\mu} = k\mathbf{1}$, where $\mathbf{1} = (1, 1, \ldots, 1)^T$ is the $p \times 1$ vector of 1's. We find the MLE of $k$ for this situation. Suppose $\Sigma$ is known and $\boldsymbol{\mu} = k\boldsymbol{\mu}_0$; the log-likelihood is
$$l = -2\log L = n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}S\right) + (\bar{\mathbf{x}} - k\boldsymbol{\mu}_0)^T\Sigma^{-1}(\bar{\mathbf{x}} - k\boldsymbol{\mu}_0)\right]$$
To minimize $l$ w.r.t. $k$, set
$$\frac{d}{dk}\left[\bar{\mathbf{x}}^T\Sigma^{-1}\bar{\mathbf{x}} - 2k\boldsymbol{\mu}_0^T\Sigma^{-1}\bar{\mathbf{x}} + k^2\boldsymbol{\mu}_0^T\Sigma^{-1}\boldsymbol{\mu}_0\right] = 0$$
from which
$$\hat{k} = \frac{\boldsymbol{\mu}_0^T\Sigma^{-1}\bar{\mathbf{x}}}{\boldsymbol{\mu}_0^T\Sigma^{-1}\boldsymbol{\mu}_0} \tag{3.13}$$
We may show that $\hat{k}$ is an unbiased estimator of $k$ and determine the variance of $\hat{k}$. In (3.13), $\hat{k}$ takes the form $\mathbf{c}^T\bar{\mathbf{x}}$ with $\mathbf{c}^T = \boldsymbol{\mu}_0^T\Sigma^{-1}/a$ and $a = \boldsymbol{\mu}_0^T\Sigma^{-1}\boldsymbol{\mu}_0$, so
$$E\left[\hat{k}\right] = \mathbf{c}^TE[\bar{\mathbf{x}}] = \frac{\boldsymbol{\mu}_0^T\Sigma^{-1}(k\boldsymbol{\mu}_0)}{a} = k$$

Hence
$$E\left[\hat{k}\right] = k \tag{3.14}$$
showing that $\hat{k}$ is an unbiased estimator. Noting that $V[\bar{\mathbf{x}}] = \frac{1}{n}\Sigma$, and therefore that $V\left(\mathbf{c}^T\bar{\mathbf{x}}\right) = \frac{1}{n}\mathbf{c}^T\Sigma\mathbf{c}$, we have
$$V\left(\hat{k}\right) = \frac{1}{n}\mathbf{c}^T\Sigma\mathbf{c} = \frac{1}{n\,\boldsymbol{\mu}_0^T\Sigma^{-1}\boldsymbol{\mu}_0} \tag{3.15}$$

### 3.3.2 Linear restriction on $\boldsymbol{\mu}$

We determine an estimator for $\boldsymbol{\mu}$ satisfying a linear restriction
$$A\boldsymbol{\mu} = \mathbf{b}$$
where $A$ is $(m \times p)$ and $\mathbf{b}$ is $(m \times 1)$. Introduce a vector $\boldsymbol{\lambda}$ of $m$ Lagrange multipliers and seek to minimize (retaining only the terms involving $\boldsymbol{\mu}$)
$$l + \boldsymbol{\lambda}^T(A\boldsymbol{\mu} - \mathbf{b}) = n\,(\bar{\mathbf{x}} - \boldsymbol{\mu})^T\Sigma^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) + \boldsymbol{\lambda}^T(A\boldsymbol{\mu} - \mathbf{b})$$
Differentiating w.r.t. $\boldsymbol{\mu}$:
$$-2n\Sigma^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) + A^T\boldsymbol{\lambda} = 0 \quad\Rightarrow\quad \boldsymbol{\mu} = \bar{\mathbf{x}} - \frac{1}{2n}\Sigma A^T\boldsymbol{\lambda} \tag{3.16}$$
We use the constraint $A\boldsymbol{\mu} = \mathbf{b}$ to evaluate the Lagrange multipliers. Premultiplying by $A$:
$$A\bar{\mathbf{x}} - \frac{1}{2n}A\Sigma A^T\boldsymbol{\lambda} = \mathbf{b} \quad\Rightarrow\quad \frac{1}{2n}\boldsymbol{\lambda} = \left(A\Sigma A^T\right)^{-1}(A\bar{\mathbf{x}} - \mathbf{b})$$
Substituting into (3.16):
$$\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} - \Sigma A^T\left(A\Sigma A^T\right)^{-1}(A\bar{\mathbf{x}} - \mathbf{b}) \tag{3.17}$$
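The restricted estimator (3.17) is easy to check: by construction it must satisfy the constraint exactly. A sketch (the sample mean, covariance and restriction below are assumed example values):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 3
xbar = rng.normal(size=p)                   # an assumed sample mean
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])         # assumed known covariance

# Assumed restriction A mu = b: the components of mu sum to 1
A = np.ones((1, p))
b = np.array([1.0])

# Restricted MLE (3.17): mu_hat = xbar - Sigma A^T (A Sigma A^T)^{-1} (A xbar - b)
adj = Sigma @ A.T @ np.linalg.solve(A @ Sigma @ A.T, A @ xbar - b)
mu_hat = xbar - adj

print(np.allclose(A @ mu_hat, b))           # True: the restriction is satisfied
```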

### 3.3.3 Covariance matrix proportional to a given matrix

We consider estimating $k$ when $\Sigma = k\Sigma_0$, where $\Sigma_0$ is given. The log-likelihood takes the form
$$l = n\left[\log|k\Sigma_0| + \frac{1}{k}\operatorname{tr}\left(\Sigma_0^{-1}S\right)\right]$$
plus terms not involving $k$. Hence
$$\frac{l}{n} = p\log k + \log|\Sigma_0| + \frac{1}{k}\operatorname{tr}\left(\Sigma_0^{-1}S\right)$$
$$\frac{1}{n}\frac{dl}{dk} = \frac{p}{k} - \frac{1}{k^2}\operatorname{tr}\left(\Sigma_0^{-1}S\right) = 0$$
$$\hat{k} = \frac{\operatorname{tr}\left(\Sigma_0^{-1}S\right)}{p} \tag{3.18}$$

# 4. Hypothesis Testing (Hotelling's $T^2$ statistic)

Consider the test of hypothesis
$$H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0 \quad \text{vs.} \quad H_A: \boldsymbol{\mu} \ne \boldsymbol{\mu}_0 \qquad (\Sigma \text{ unknown})$$

## 4.1 The union-intersection principle

We accept the hypothesis $H_0$ as valid if and only if
$$H_0(\mathbf{a}): \mathbf{a}^T\boldsymbol{\mu} = \mathbf{a}^T\boldsymbol{\mu}_0$$
is accepted for all $\mathbf{a}$. [In some sense $H_0$ is the union of all such hypotheses.] For fixed $\mathbf{a}$ we set $y = \mathbf{a}^T\mathbf{x}$, so that in the population, under $H_0$,
$$E(y) = \mathbf{a}^T\boldsymbol{\mu}_0, \qquad \operatorname{Var}(y) = \mathbf{a}^T\Sigma\mathbf{a}$$
and in our sample
$$\bar{y} = \mathbf{a}^T\bar{\mathbf{x}}, \qquad \text{s.e.}(\bar{y}) = \sqrt{\frac{\mathbf{a}^TS\mathbf{a}}{n-1}}$$
The univariate $t$-statistic for testing $H_0(\mathbf{a})$ against the alternative $E(y) \ne \mathbf{a}^T\boldsymbol{\mu}_0$ is
$$t(\mathbf{a}) = \frac{\bar{y} - \mathbf{a}^T\boldsymbol{\mu}_0}{\text{s.e.}(\bar{y})} = \frac{\sqrt{n-1}\;\mathbf{a}^T(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)}{\sqrt{\mathbf{a}^TS\mathbf{a}}}$$
The acceptance region for $H_0(\mathbf{a})$ takes the form $t^2(\mathbf{a}) \le R$ for some $R$. The multivariate acceptance region is the intersection
$$\bigcap_{\mathbf{a}}\left\{t^2(\mathbf{a}) \le R\right\} \tag{4.1}$$
which holds if and only if $\max_{\mathbf{a}} t^2(\mathbf{a}) \le R$. Therefore we adopt $\max_{\mathbf{a}} t^2(\mathbf{a})$ as the test statistic for $H_0$. Equivalently:
$$\text{maximize } (n-1)\,\mathbf{a}^T(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)^T\mathbf{a} \quad \text{subject to } \mathbf{a}^TS\mathbf{a} = 1 \tag{4.2}$$
Writing $\mathbf{d} = \bar{\mathbf{x}} - \boldsymbol{\mu}_0$, we introduce a Lagrange multiplier $\lambda$ and seek to determine $\lambda$ and $\mathbf{a}$ to satisfy
$$\frac{d}{d\mathbf{a}}\left[\mathbf{a}^T\mathbf{d}\mathbf{d}^T\mathbf{a} - \lambda\left(\mathbf{a}^TS\mathbf{a} - 1\right)\right] = 0$$

Differentiating gives
$$\mathbf{d}\mathbf{d}^T\mathbf{a} - \lambda S\mathbf{a} = 0 \tag{4.3a}$$
$$\left(S^{-1}\mathbf{d}\mathbf{d}^T - \lambda I\right)\mathbf{a} = 0 \tag{4.3b}$$
$$\left|S^{-1}\mathbf{d}\mathbf{d}^T - \lambda I\right| = 0 \tag{4.3c}$$
(4.3b) can be written $M\mathbf{a} = \lambda\mathbf{a}$, showing that $\mathbf{a}$ is an eigenvector of $M = S^{-1}\mathbf{d}\mathbf{d}^T$; (4.3c) is the determinantal equation satisfied by the eigenvalues of $S^{-1}\mathbf{d}\mathbf{d}^T$. Premultiplying (4.3a) by $\mathbf{a}^T$ gives
$$\mathbf{a}^T\mathbf{d}\mathbf{d}^T\mathbf{a} - \lambda\,\mathbf{a}^TS\mathbf{a} = 0 \quad\Rightarrow\quad \lambda = \frac{\mathbf{a}^T\mathbf{d}\mathbf{d}^T\mathbf{a}}{\mathbf{a}^TS\mathbf{a}} = \frac{t^2(\mathbf{a})}{n-1}$$
Therefore, in order to maximize $t^2(\mathbf{a})$, we choose $\lambda$ to be the largest eigenvalue of $S^{-1}\mathbf{d}\mathbf{d}^T$. This is a rank-1 matrix with the single non-zero eigenvalue
$$\lambda = \operatorname{tr}\left(S^{-1}\mathbf{d}\mathbf{d}^T\right) = \mathbf{d}^TS^{-1}\mathbf{d}$$
and the maximum of (4.2) is known as Hotelling's $T^2$ statistic,
$$T^2 = (n-1)(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)^TS^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) \tag{4.4}$$
which is $(n-1)$ times the sample Mahalanobis distance between $\bar{\mathbf{x}}$ and $\boldsymbol{\mu}_0$.

## 4.2 Distribution of $T^2$

Under $H_0$ it can be shown that
$$\frac{T^2}{n-1}\cdot\frac{n-p}{p} \sim F_{p,\,n-p} \tag{4.5}$$
where $F_{p,n-p}$ is the $F$ distribution on $p$ and $n - p$ degrees of freedom. Note that, depending on the covariance matrix used, $T^2$ has slightly different forms:
$$T^2 = (n-1)(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)^TS^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) = n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)^TS_U^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)$$
where $S_U$ is the unbiased estimator of $\Sigma$ (with divisor $n - 1$).

**Example.** In an investigation of adult intelligence, scores were obtained on two tests, "verbal" and "performance", for 101 subjects aged 60 to 64. Doppelt and Wallace (1955) reported the following mean scores and covariance matrix:
$$\bar{\mathbf{x}} = \begin{pmatrix} 55.24 \\ 34.97 \end{pmatrix}, \qquad S_U = \begin{pmatrix} 210.54 & 126.99 \\ 126.99 & 119.68 \end{pmatrix}$$

At the 0.01 (1%) level, test the hypothesis that $\mu_1 = 60$ and $\mu_2 = 50$.

**Solution.** We first compute
$$S_U^{-1} = \begin{pmatrix} 0.01319 & -0.01400 \\ -0.01400 & 0.02321 \end{pmatrix}, \qquad \mathbf{d} = \bar{\mathbf{x}} - \boldsymbol{\mu}_0 = \begin{pmatrix} -4.76 \\ -15.03 \end{pmatrix}$$
The $T^2$ statistic is then
$$T^2 = 101\begin{pmatrix} -4.76 & -15.03 \end{pmatrix}\begin{pmatrix} 0.01319 & -0.01400 \\ -0.01400 & 0.02321 \end{pmatrix}\begin{pmatrix} -4.76 \\ -15.03 \end{pmatrix} = 357.4$$
This gives
$$F = \frac{n-p}{(n-1)p}\,T^2 = \frac{99}{200}\times 357.4 = 176.9$$
The nearest tabulated 1% value corresponds to $F_{2,60}$ and is 4.98. Therefore we conclude the null hypothesis should be rejected. The sample probably arose from a population with a much lower mean vector, rather closer to the sample mean.

**Example.** The changes in levels of free fatty acid (FFA) were measured on 15 hypnotised subjects who had been asked to experience fear, depression and anger effects while under hypnosis. The mean FFA changes were
$$\bar{x}_1 = 2.699, \quad \bar{x}_2 = 2.178, \quad \bar{x}_3 = 2.558$$
Given that the covariance matrix of the stress differences $y_{i1} = x_{i1} - x_{i2}$ and $y_{i2} = x_{i1} - x_{i3}$ is
$$S_U = \begin{pmatrix} 1.7343 & 1.1666 \\ 1.1666 & 2.7733 \end{pmatrix}, \qquad S_U^{-1} = \begin{pmatrix} 0.804 & -0.338 \\ -0.338 & 0.503 \end{pmatrix}$$
test, at the 0.05 level of significance, whether each effect produced the same change in FFA.

[$T^2 = 2.68$ and $F = 1.24$ with degrees of freedom 2, 13. Do not reject the hypothesis of "no emotion effect" at the 0.05 level.]
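The arithmetic of the Doppelt and Wallace example can be reproduced directly, using the values given above:

```python
import numpy as np

n, p = 101, 2
xbar = np.array([55.24, 34.97])
S_U = np.array([[210.54, 126.99],
                [126.99, 119.68]])          # unbiased covariance estimate
mu0 = np.array([60.0, 50.0])

d = xbar - mu0
T2 = n * d @ np.linalg.solve(S_U, d)        # T^2 = n d^T S_U^{-1} d, as in (4.4)
F = T2 * (n - p) / ((n - 1) * p)            # F statistic on (p, n-p) d.f., (4.5)

print(round(T2, 1), round(F, 1))            # 357.4 176.9
```

Using `np.linalg.solve` rather than an explicit inverse avoids rounding the intermediate $S_U^{-1}$.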

## 4.3 Invariance of $T^2$

$T^2$ is unaffected by changes in the scale or origin of the (response) variables. Consider
$$\mathbf{y} = C\mathbf{x} + \mathbf{d}$$
where $C$ is $(p \times p)$ and non-singular. The null hypothesis $H_0: \boldsymbol{\mu}_x = \boldsymbol{\mu}_0$ is equivalent to $H_0: \boldsymbol{\mu}_y = C\boldsymbol{\mu}_0 + \mathbf{d}$. Under the linear transformation we have
$$\bar{\mathbf{y}} = C\bar{\mathbf{x}} + \mathbf{d}, \qquad S_y = CSC^T$$
so that
$$\frac{T_y^2}{n-1} = (\bar{\mathbf{y}} - \boldsymbol{\mu}_y)^TS_y^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}_y) = (\bar{\mathbf{x}} - \boldsymbol{\mu}_0)^TC^T\left(CSC^T\right)^{-1}C(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)$$
$$= (\bar{\mathbf{x}} - \boldsymbol{\mu}_0)^TC^TC^{-T}S^{-1}C^{-1}C(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) = (\bar{\mathbf{x}} - \boldsymbol{\mu}_0)^TS^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)$$
which demonstrates invariance.

## 4.4 Confidence region for a mean

A confidence region for $\boldsymbol{\mu}$ can be obtained, given the distribution of
$$T^2 = (n-1)(\bar{\mathbf{x}} - \boldsymbol{\mu})^TS^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) \sim \frac{p(n-1)}{n-p}F_{p,\,n-p} \tag{4.6}$$
by substituting the data values $\bar{\mathbf{x}}$ and $S$. In the example above we have $\bar{\mathbf{x}} = (55.24, 34.97)^T$,
$$100\,S_U^{-1} = \begin{pmatrix} 1.319 & -1.400 \\ -1.400 & 2.321 \end{pmatrix}$$
and $F_{2,99}(0.01)$ is approximately 4.83 (by interpolation). Hence the 99% confidence region is
$$\frac{101}{100}\left[1.319(55.24 - \mu_1)^2 - 2.800(55.24 - \mu_1)(34.97 - \mu_2) + 2.321(34.97 - \mu_2)^2\right] \le \frac{2\times 100}{99}\times 4.83 = 9.76$$
This is an ellipse in $p = 2$ dimensional space (and can be plotted). In higher dimensions an ellipsoidal confidence region is obtained.

## 4.5 Likelihood ratio tests

Given a data matrix $X$ of observations on a random vector $\mathbf{x}$ whose distribution depends on a vector of parameters $\boldsymbol{\theta}$, the likelihood ratio for testing the null hypothesis $H_0: \boldsymbol{\theta} \in \Theta_0$ against the alternative $H_1: \boldsymbol{\theta} \in \Theta_1$ is defined as
$$\lambda = \frac{\sup_{\Theta_0} L}{\sup_{\Theta_1} L}$$
where $L = L(\boldsymbol{\theta}; X)$ is the likelihood function. In a likelihood ratio test (LRT) we reject $H_0$ for low values of $\lambda$, i.e. if $\lambda < c$, where $c$ is chosen so that the probability of Type I error is $\alpha$. If we define $l_0^* = -2\log L_0^*$, where $L_0^*$ is the value of the numerator, and similarly $l_1^* = -2\log L_1^*$, the rejection criterion takes the form
$$-2\log\lambda = -2\log\frac{L_0^*}{L_1^*} \tag{4.7}$$
$$= l_0^* - l_1^* > k \tag{4.8}$$

**Result.** When $H_0$ is true, and for $n$ large, the log likelihood ratio statistic (4.8) has approximately the $\chi^2$-distribution on $r$ degrees of freedom, where $r$ equals the number of free parameters under $H_1$ minus the number of free parameters under $H_0$.

## 4.6 LRT for a mean when $\Sigma$ is known

$H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$, a given value, when $\Sigma$ is known. Given a random sample from $N_p(\boldsymbol{\mu}, \Sigma)$ resulting in $\bar{\mathbf{x}}$ and $S$, the log-likelihood given in (3.8b) is (to within an additive constant)
$$l(\boldsymbol{\mu}, \Sigma) = n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}S\right) + (\bar{\mathbf{x}} - \boldsymbol{\mu})^T\Sigma^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu})\right] \tag{4.9}$$
Under $H_0$ the value of $\boldsymbol{\mu}$ is known, and
$$l_0^* = l(\boldsymbol{\mu}_0, \Sigma) = n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}S\right) + (\bar{\mathbf{x}} - \boldsymbol{\mu}_0)^T\Sigma^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)\right]$$
Under $H_1$, with no restriction on $\boldsymbol{\mu}$, the MLE of $\boldsymbol{\mu}$ is $\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}$. Thus
$$l_1^* = n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}S\right)\right]$$
Therefore
$$-2\log\lambda = l_0^* - l_1^* = n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)^T\Sigma^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) \tag{4.10}$$

which is $n$ times the Mahalanobis distance of $\bar{\mathbf{x}}$ from $\boldsymbol{\mu}_0$. Note the similarity with Hotelling's $T^2$ statistic. Given that the distribution of $\bar{\mathbf{x}}$ under $H_0$ is $\bar{\mathbf{x}} \sim N_p\left(\boldsymbol{\mu}_0, \frac{1}{n}\Sigma\right)$, (4.10) may be written, using the transformation $\mathbf{y} = \sqrt{n}\,\Sigma^{-1/2}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)$ to a standard set of independent $N(0, 1)$ variates, as
$$-2\log\lambda = \mathbf{y}^T\mathbf{y} = \sum_{i=1}^p y_i^2 \tag{4.11}$$
We therefore have the exact distribution
$$-2\log\lambda \sim \chi^2_p \tag{4.12}$$
showing that in this case the asymptotic $\chi^2$ distribution of $-2\log\lambda$ is exact, even in the small-sample case.

**Example.** Measurements of the length of skull were made on a sample of first and second sons from 25 families:
$$\bar{\mathbf{x}} = \begin{pmatrix} 185.72 \\ 183.84 \end{pmatrix}$$
Assuming that in fact
$$\Sigma = \begin{pmatrix} 91.48 & 66.88 \\ 66.88 & 96.78 \end{pmatrix}$$
test at the 0.05 level the hypothesis $H_0: \boldsymbol{\mu} = (182, 182)^T$.

**Solution.**
$$-2\log\lambda = 25\begin{pmatrix} 3.72 & 1.84 \end{pmatrix}\begin{pmatrix} 91.48 & 66.88 \\ 66.88 & 96.78 \end{pmatrix}^{-1}\begin{pmatrix} 3.72 \\ 1.84 \end{pmatrix} = 4.19$$
Since $\chi^2_2(0.05) = 5.99$, do not reject $H_0$.
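The skull example can be checked directly, using the values given above:

```python
import numpy as np

n = 25
xbar = np.array([185.72, 183.84])
Sigma = np.array([[91.48, 66.88],
                  [66.88, 96.78]])          # covariance assumed known
mu0 = np.array([182.0, 182.0])

d = xbar - mu0
stat = n * d @ np.linalg.solve(Sigma, d)    # -2 log lambda (4.10), exactly chi^2_p

chi2_crit = 5.99                            # chi^2_2 upper 5% point
print(round(stat, 2), stat > chi2_crit)     # 4.19 False -> do not reject H0
```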

## 4.7 LRT for a mean when $\Sigma$ is unknown

Consider the test of hypothesis $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ against $H_1: \boldsymbol{\mu} \ne \boldsymbol{\mu}_0$ when $\Sigma$ is unknown. In this case $\Sigma$ must be estimated under $H_0$ and also under $H_1$. Under $H_0$, writing $\mathbf{d}_0$ for $\bar{\mathbf{x}} - \boldsymbol{\mu}_0$:
$$l(\boldsymbol{\mu}_0, \Sigma) = n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}S\right) + (\bar{\mathbf{x}} - \boldsymbol{\mu}_0)^T\Sigma^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)\right] \tag{4.13a}$$
$$= n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}S\right) + \operatorname{tr}\left(\mathbf{d}_0^T\Sigma^{-1}\mathbf{d}_0\right)\right] \tag{4.13b}$$
$$= n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}S\right) + \operatorname{tr}\left(\Sigma^{-1}\mathbf{d}_0\mathbf{d}_0^T\right)\right] \tag{4.13c}$$
$$= n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}\left(S + \mathbf{d}_0\mathbf{d}_0^T\right)\right)\right] \tag{4.13d}$$
Under $H_1$:
$$l(\hat{\boldsymbol{\mu}}, \Sigma) = n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}S\right) + (\bar{\mathbf{x}} - \hat{\boldsymbol{\mu}})^T\Sigma^{-1}(\bar{\mathbf{x}} - \hat{\boldsymbol{\mu}})\right]$$
$$l(\hat{\boldsymbol{\mu}}, \hat{\Sigma}) = n\left[\log|S| + \operatorname{tr}\left(S^{-1}S\right)\right] = n\left[\log|S| + \operatorname{tr}(I_p)\right]$$
$$l_1^* = n\log|S| + np \tag{4.14}$$
after substitution of the MLEs $\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}$ and $\hat{\Sigma} = S$ obtained previously. Comparing (4.13d) with the form minimized to obtain (4.14), we see that the MLE of $\Sigma$ under $H_0$ must be $\hat{\Sigma}_0 = S + \mathbf{d}_0\mathbf{d}_0^T$, and the corresponding value of $l = -2\log L$ is
$$l_0^* = n\log\left|S + \mathbf{d}_0\mathbf{d}_0^T\right| + np$$
Hence
$$-2\log\lambda = l_0^* - l_1^* = n\log\left|S + \mathbf{d}_0\mathbf{d}_0^T\right| - n\log|S| = n\log\left|S^{-1}\left(S + \mathbf{d}_0\mathbf{d}_0^T\right)\right|$$
$$= n\log\left|I_p + S^{-1}\mathbf{d}_0\mathbf{d}_0^T\right| = n\log\left(1 + \mathbf{d}_0^TS^{-1}\mathbf{d}_0\right) \tag{4.15}$$
making use of the useful matrix result proved in (1.8.3), that $|I_p + \mathbf{u}\mathbf{v}^T| = 1 + \mathbf{v}^T\mathbf{u}$.

Since $T^2 = (n-1)\,d_0^T S^{-1} d_0$, we have

$-2\log\lambda = n\log\!\left(1 + \frac{T^2}{n-1}\right)$   (4.16)

so $\lambda$ and $T^2$ are monotonically related. Therefore we can conclude that the LRT of $H_0: \mu = \mu_0$ when $\Sigma$ is unknown is equivalent to use of Hotelling's $T^2$ statistic.

4.8 LRT for $\Sigma = \Sigma_0$ with $\mu$ unknown

We test $H_0: \Sigma = \Sigma_0$ against $H_1: \Sigma \neq \Sigma_0$ when $\mu$ is unknown. Under $H_0$ we substitute $\hat{\mu} = \bar{x}$ into

$l(\hat{\mu}, \Sigma_0) = -\frac{n}{2}\left\{\log|\Sigma_0| + \mathrm{tr}(\Sigma_0^{-1}S) + (\bar{x}-\hat{\mu})^T\Sigma_0^{-1}(\bar{x}-\hat{\mu})\right\}$

giving

$l_0 = -\frac{n}{2}\left\{\log|\Sigma_0| + \mathrm{tr}(\Sigma_0^{-1}S)\right\}$   (4.17)

Under $H_1$ we substitute the unrestricted m.l.e.s $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$, giving as in (4.14d)

$l_1 = -\frac{n}{2}\left\{\log|S| + p\right\}$   (4.18)

Hence

$-2\log\lambda = 2(l_1 - l_0) = n\left\{\log|\Sigma_0| + \mathrm{tr}(\Sigma_0^{-1}S) - \log|S| - p\right\} = n\left\{-\log|\Sigma_0^{-1}S| + \mathrm{tr}(\Sigma_0^{-1}S) - p\right\}$   (4.19)

This statistic depends only on the eigenvalues of the positive definite matrix $\Sigma_0^{-1}S$ and has the property that $-2\log\lambda \to 0$ as $S$ approaches $\Sigma_0$. Let $A$ be the arithmetic mean and $G$ the geometric mean of the eigenvalues of $\Sigma_0^{-1}S$. Then $\mathrm{tr}(\Sigma_0^{-1}S) = pA$ and $|\Sigma_0^{-1}S| = G^p$, so

$-2\log\lambda = n\left\{pA - p\log G - p\right\} = np\left\{A - \log G - 1\right\}$   (4.20)

The general result for the distribution of (4.20) for large $n$ gives

$-2\log\lambda \sim \chi^2_r$   (4.21)

where $r = \frac{1}{2}p(p+1)$ is the number of independent parameters in $\Sigma$.
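The eigenvalue form (4.20) is convenient to compute. The sketch below (Python with numpy/scipy, not part of the original notes) evaluates the statistic for an invented $S$ with $\Sigma_0 = I_p$, and also checks it against the trace/determinant form (4.19):

```python
# Statistic (4.20): -2 log lambda = n p (A - log G - 1), with A, G the arithmetic
# and geometric means of the eigenvalues of Sigma0^{-1} S.
# Sigma0 and S below are invented, purely illustrative matrices.
import numpy as np
from scipy.stats import chi2

n, p = 50, 3
Sigma0 = np.eye(p)
S = np.array([[1.2, 0.3, 0.0],
              [0.3, 0.9, 0.1],
              [0.0, 0.1, 1.1]])

eig = np.linalg.eigvals(np.linalg.solve(Sigma0, S)).real
A = eig.mean()                         # arithmetic mean of eigenvalues
G = np.exp(np.log(eig).mean())         # geometric mean of eigenvalues
stat = n * p * (A - np.log(G) - 1.0)
r = p * (p + 1) // 2                   # d.f. = free parameters in Sigma
pval = chi2.sf(stat, df=r)
print(round(stat, 3), r)
```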

4.10 Test for sphericity

A covariance matrix is said to have the property of "sphericity" if

$\Sigma = k I_p$   (4.22)

for some $k$. We see that this is a special case of the more general situation $\Sigma = k\Sigma_0$ treated in Section (3.3.3), and the same procedure can be applied. The general log likelihood expression for a sample from the MVN distribution is

$\log L = -\frac{n}{2}\left\{\log|\Sigma| + \mathrm{tr}\!\left(\Sigma^{-1}(S + dd^T)\right)\right\}, \qquad d = \bar{x} - \mu$

Under $H_0$: $\Sigma = kI_p$ and $\hat{\mu} = \bar{x}$, so

$\log L = -\frac{n}{2}\left\{\log|kI_p| + \mathrm{tr}(k^{-1}S)\right\} = -\frac{n}{2}\left\{p\log k + k^{-1}\,\mathrm{tr}\,S\right\}$   (4.23)

Setting $\frac{\partial}{\partial k}[-2\log L] = 0$ at a minimum:

$\frac{p}{k} - \frac{\mathrm{tr}\,S}{k^2} = 0 \;\Longrightarrow\; \hat{k} = \frac{\mathrm{tr}\,S}{p}$   (4.24)

which is in fact the arithmetic mean $A$ of the eigenvalues of $S$. Substituting back into (4.23) gives

$l_0 = -\frac{np}{2}\left(\log A + 1\right)$

Under $H_1$: $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$, so

$l_1 = -\frac{n}{2}\left\{\log|S| + p\right\} = -\frac{np}{2}\left(\log G + 1\right)$

where $G$ is the geometric mean of the eigenvalues of $S$ (since $\log|S| = p\log G$). Thus

$-2\log\lambda = 2(l_1 - l_0) = np\log\frac{A}{G}$   (4.25)

The number of free parameters contained in $\Sigma$ is $1$ under $H_0$ and $\frac{1}{2}p(p+1)$ under $H_1$. Hence the appropriate distribution for comparing $-2\log\lambda$ is $\chi^2_r$ where

$r = \frac{1}{2}p(p+1) - 1 = \frac{1}{2}(p-1)(p+2)$   (4.26)
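The sphericity statistic (4.25) is again a function of eigenvalue means. A short sketch (Python with numpy/scipy, not part of the original notes; $S$ is an invented example matrix):

```python
# Sphericity LRT (4.25): -2 log lambda = n p log(A/G), with A and G the arithmetic
# and geometric means of the eigenvalues of the m.l. covariance matrix S.
import numpy as np
from scipy.stats import chi2

n, p = 40, 3
S = np.array([[2.0, 0.4, 0.1],
              [0.4, 1.5, 0.2],
              [0.1, 0.2, 1.8]])          # invented m.l. covariance matrix

eig = np.linalg.eigvalsh(S)              # eigenvalues of a symmetric matrix
A = eig.mean()
G = np.exp(np.log(eig).mean())
stat = n * p * np.log(A / G)
r = p * (p + 1) // 2 - 1                 # = (p-1)(p+2)/2 degrees of freedom
pval = chi2.sf(stat, df=r)
print(round(stat, 3), r)
```

Since $A \ge G$ always (AM-GM inequality), the statistic is non-negative, as a deviance should be.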

4.11 Test for independence

Independence of the variables $x_1, \ldots, x_p$ is manifest by a diagonal covariance matrix

$\Sigma = \mathrm{diag}(\sigma_{11}, \ldots, \sigma_{pp})$   (4.27)

We consider $H_0$: $\Sigma$ is diagonal, against the general alternative $H_1$: $\Sigma$ is unrestricted.

Under $H_0$ it is clear in fact that we will find $\hat{\sigma}_{ii} = s_{ii}$, because the estimators of $\sigma_{ii}$ for each $x_i$ are independent. We can also show this formally:

$-2\log L = n\left\{\log|\Sigma| + \mathrm{tr}\!\left(\Sigma^{-1}(S + dd^T)\right)\right\} = n\left\{\sum_{i=1}^{p}\log\sigma_{ii} + \sum_{i=1}^{p}\frac{s_{ii}}{\sigma_{ii}}\right\}$

(with $\hat{\mu} = \bar{x}$, so $d = 0$), and

$\frac{\partial}{\partial\sigma_{ii}}[-2\log L] = 0 \;\Longrightarrow\; \frac{1}{\sigma_{ii}} - \frac{s_{ii}}{\sigma_{ii}^2} = 0 \;\Longrightarrow\; \hat{\sigma}_{ii} = s_{ii}$

Therefore

$l_0 = -\frac{n}{2}\left\{\sum_{i=1}^{p}\log s_{ii} + p\right\} = -\frac{n}{2}\left\{\log|D| + p\right\}$

where $D = \mathrm{diag}(s_{11}, \ldots, s_{pp})$. Under $H_1$, as before, we find

$l_1 = -\frac{n}{2}\left\{\log|S| + p\right\}$

Therefore

$-2\log\lambda = 2(l_1 - l_0) = n\left[\log|D| - \log|S|\right] = -n\log|D^{-1}S| = -n\log|D^{-1/2}SD^{-1/2}| = -n\log|R|$   (4.28)

where $R$ is the sample correlation matrix. The number of free parameters contained in $\Sigma$ is $p$ under $H_0$ and $\frac{1}{2}p(p+1)$ under $H_1$. Hence the appropriate distribution for comparing $-2\log\lambda$ is $\chi^2_r$ where

$r = \frac{1}{2}p(p+1) - p = \frac{1}{2}p(p-1)$   (4.29)
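The statistic (4.28) needs nothing beyond the sample correlation matrix. A sketch (Python with numpy/scipy, not part of the original notes) on simulated data where $H_0$ actually holds:

```python
# Independence LRT (4.28): -2 log lambda = -n log|R|, R the sample correlation
# matrix. Data simulated with genuinely independent columns, so H0 is true.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.standard_normal((n, p))          # independent variables

R = np.corrcoef(X, rowvar=False)
stat = -n * np.log(np.linalg.det(R))     # always >= 0 since |R| <= 1
r = p * (p - 1) // 2                     # d.f. for the chi^2 approximation
pval = chi2.sf(stat, df=r)
print(round(stat, 3), round(pval, 3))
```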

4.12 Simultaneous confidence intervals (Scheffé; Roy & Bose)

The union-intersection method for deriving Hotelling's $T^2$ statistic provides "simultaneous confidence intervals" for the parameters $\mu$ when $\Sigma$ is unknown. As in the union-intersection derivation, let

$T^2 = (n-1)(\bar{x}-\mu)^T S^{-1} (\bar{x}-\mu)$   (4.30)

where $\mu$ is the unknown (true) mean. Let $t(a)$ be the univariate $t$ statistic corresponding to the linear compound $y = a^T x$. Then

$\max_a t^2(a) = T^2, \quad\text{so that}\quad t^2(a) \le T^2 \text{ for all } p\text{-vectors } a$   (4.31)

where

$t(a) = \frac{\bar{y} - \mu_y}{s_y/\sqrt{n-1}} = \sqrt{n-1}\,\frac{a^T(\bar{x}-\mu)}{\sqrt{a^T S a}}$   (4.32)

The distribution of $T^2$ is

$T^2 \sim \frac{(n-1)p}{n-p}\,F_{p,\,n-p}$

so $\Pr\!\left\{T^2 \le \frac{(n-1)p}{n-p}F_{p,\,n-p}(\alpha)\right\} = 1 - \alpha$, and therefore from (4.31), for all $p$-vectors $a$,

$\Pr\!\left\{t^2(a) \le \frac{(n-1)p}{n-p}F_{p,\,n-p}(\alpha)\right\} \ge 1 - \alpha$   (4.33)

Substituting from (4.32), the confidence statement in (4.33) is: with probability $\ge 1 - \alpha$, for all $p$-vectors $a$,

$|a^T\bar{x} - a^T\mu| \le K\sqrt{\frac{a^T S a}{n-1}}$   (4.34)

where $K$ is the constant given by

$K^2 = \frac{(n-1)p}{n-p}\,F_{p,\,n-p}(\alpha)$   (4.35)

A $100(1-\alpha)\%$ confidence interval for the linear compound $a^T\mu$ is therefore

$a^T\bar{x} \pm K\sqrt{\frac{a^T S a}{n-1}}$   (4.36)

How can we apply this result? We might be interested in a defined set of linear combinations (linear compounds) of $\mu$. The $i$th component $\mu_i$ of $\mu$ is, for example, the linear compound defined by $a^T = (0, \ldots, 1, \ldots, 0)$, the unit vector with a single $1$ in the $i$th position. For a large number of such sets of CIs we would expect $100(1-\alpha)\%$ to contain no mis-statements, while $100\alpha\%$ would contain at least one mis-statement.

We can relate the $T^2$ confidence intervals to the $T^2$ test of $H_0: \mu = \mu_0$. If this $H_0$ is rejected at significance level $\alpha$, then there exists at least one vector $a$ such that the interval (4.36) does not include the value $a^T\mu_0$.

NB. If the covariance matrix $S_u$ (with denominator $n-1$) is supplied, then in (4.36) $\sqrt{a^T S a/(n-1)}$ may be replaced by $\sqrt{a^T S_u a/n}$.

4.13 The Bonferroni method

This provides another way to construct simultaneous CIs for a small number of linear compounds of $\mu$ whilst controlling the overall level of confidence. Consider a set of events $A_1, A_2, \ldots, A_m$:

$\Pr(A_1 \cap \cdots \cap A_m) = 1 - \Pr(\bar{A}_1 \cup \cdots \cup \bar{A}_m)$

From the additive law of probabilities

$\Pr(\bar{A}_1 \cup \cdots \cup \bar{A}_m) \le \sum_{i=1}^{m}\Pr(\bar{A}_i)$

Therefore

$\Pr(A_1 \cap \cdots \cap A_m) \ge 1 - \sum_{i=1}^{m}\Pr(\bar{A}_i)$   (4.37)

Let $C_k$ denote a confidence statement about the value of some linear compound $a_k^T\mu$ with $\Pr(C_k\ \text{true}) = 1 - \alpha_k$. Then

$\Pr(\text{all } C_k\ \text{true}) \ge 1 - (\alpha_1 + \cdots + \alpha_m)$   (4.38)

Therefore we can control the overall error rate given by $\alpha = \alpha_1 + \cdots + \alpha_m$, say. For example, in order to construct simultaneous $100(1-\alpha)\%$ CIs for all $p$ components $\mu_k$ of $\mu$ we could choose $\alpha_k = \alpha/p$ $(k = 1, \ldots, p)$, leading to

$\bar{x}_1 \pm t_{n-1}\!\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{11}}{n}}, \quad \ldots, \quad \bar{x}_p \pm t_{n-1}\!\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{pp}}{n}}$

if the $s_{ii}$ derive from $S_u$.

Example
Intelligence scores: data on $n = 101$ subjects gave

$\bar{x} = \begin{pmatrix}55.24\\34.97\end{pmatrix}, \qquad S_u = \begin{pmatrix}210.54 & 126.99\\126.99 & 119.68\end{pmatrix}$

1. Construct 99% simultaneous confidence intervals for $\mu_1$, $\mu_2$ and $\mu_1 - \mu_2$.

For $\mu_1$ take $a^T = (1, 0)$, and take $\alpha = .01$:

$a^T\bar{x} = 55.24, \qquad a^T S_u a = 210.54$

$K^2 = \frac{(n-1)p}{n-p}\,F_{p,\,n-p}(\alpha) = \frac{100 \times 2}{99}\,F_{2,99}(.01) \approx 9.76, \qquad K \approx 3.12$

taking $F_{2,99}(.01) \approx 4.83$ (approx). Therefore the CI for $\mu_1$ is

$55.24 \pm 3.12\sqrt{\frac{210.54}{101}} = 55.24 \pm 4.50$

giving an interval $(50.7,\ 59.7)$.

For $\mu_2$ we already have $K$; take $a^T = (0, 1)$, then $a^T\bar{x} = 34.97$ and $a^T S_u a = 119.68$. The CI for $\mu_2$ is

$34.97 \pm 3.12\sqrt{\frac{119.68}{101}} = 34.97 \pm 3.40$

giving an interval $(31.6,\ 38.4)$.

For $\mu_1 - \mu_2$ take $a^T = (1, -1)$:

$a^T\bar{x} = 55.24 - 34.97 = 20.27$
$a^T S_u a = 210.54 - 2(126.99) + 119.68 = 76.24$

The CI for $\mu_1 - \mu_2$ is

$20.27 \pm 3.12\sqrt{\frac{76.24}{101}} = 20.27 \pm 2.71$

giving $(17.6,\ 23.0)$.

2. Construct CIs for $\mu_1$, $\mu_2$ by the Bonferroni method. Use $\alpha = .01$. Individual CIs are constructed using $\alpha_k = .01/2 = .005$ $(k = 1, 2)$. Then

$t_{100}(\alpha_k/2) = t_{100}(.0025) \approx \Phi^{-1}(.9975) = 2.81$

The CI for $\mu_1$ is

$55.24 \pm 2.81\sqrt{\frac{210.54}{101}} = 55.24 \pm 4.06, \quad\text{i.e.}\quad (51.2,\ 59.3)$

and for $\mu_2$ is

$34.97 \pm 2.81\sqrt{\frac{119.68}{101}} = 34.97 \pm 3.06, \quad\text{i.e.}\quad (31.9,\ 38.0)$

Comparing the CIs obtained by the two methods, we see that the simultaneous CIs for $\mu_1$ and $\mu_2$ are about 11% wider ($K/2.81 \approx 1.11$) than the corresponding Bonferroni CIs.

NB. If we had required 99% Bonferroni CIs for $\mu_1$, $\mu_2$ and $\mu_1 - \mu_2$, then $m = 3$ in (4.38) and $\alpha_k = .01/3 \approx .0033$. The corresponding percentage point of $t$ would be $t_{100}(.00167) \approx \Phi^{-1}(.99833) = 2.93$, leading to slightly wider CIs than obtained above.
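Both families of intervals in this example can be reproduced mechanically. The sketch below (Python with numpy/scipy, not part of the original notes) computes the Scheffé-type interval (4.36), using the $S_u$ form of the standard error, and the Bonferroni interval for $\mu_1$; it uses exact $F$ and $t$ quantiles rather than the rounded values in the notes, so the endpoints differ slightly:

```python
# Scheffe-type simultaneous CIs (4.36) vs Bonferroni CIs on the intelligence-score
# example (n = 101, p = 2); S_u is the unbiased covariance (divisor n-1).
import numpy as np
from scipy.stats import f, t

n, p, alpha = 101, 2, 0.01
xbar = np.array([55.24, 34.97])
Su = np.array([[210.54, 126.99],
               [126.99, 119.68]])

K = np.sqrt((n - 1) * p / (n - p) * f.ppf(1 - alpha, p, n - p))

def scheffe_ci(a):
    half = K * np.sqrt(a @ Su @ a / n)   # sqrt(a'S_u a / n) form of the s.e.
    return a @ xbar - half, a @ xbar + half

def bonferroni_ci(a, m):
    tval = t.ppf(1 - alpha / (2 * m), n - 1)
    half = tval * np.sqrt(a @ Su @ a / n)
    return a @ xbar - half, a @ xbar + half

a1 = np.array([1.0, 0.0])
lo1, hi1 = scheffe_ci(a1)                # simultaneous CI for mu_1
lo2, hi2 = bonferroni_ci(a1, m=2)        # Bonferroni CI for mu_1
print((round(lo1, 1), round(hi1, 1)), (round(lo2, 1), round(hi2, 1)))
```

As in the notes, the Scheffé interval comes out wider than the Bonferroni one.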

4.14 Two sample procedures

Suppose we have two independent random samples $\{x_{11}, \ldots, x_{1n_1}\}$ and $\{x_{21}, \ldots, x_{2n_2}\}$ of sizes $n_1, n_2$ from two populations

$\Pi_1: x \sim N_p(\mu_1, \Sigma), \qquad \Pi_2: x \sim N_p(\mu_2, \Sigma)$

giving rise to sample means $\bar{x}_1, \bar{x}_2$ and sample covariance matrices $S_1, S_2$. Note the assumption of a common covariance matrix $\Sigma$. We consider testing $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \neq \mu_2$. Let $d = \bar{x}_1 - \bar{x}_2$. Under $H_0$

$d \sim N_p\!\left(0, \left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)\Sigma\right)$

(a) Case of $\Sigma$ known
Analogously to the one sample case,

$\sqrt{\frac{n_1 n_2}{n}}\,\Sigma^{-1/2} d \sim N_p(0, I_p) \quad\Longrightarrow\quad \frac{n_1 n_2}{n}\, d^T\Sigma^{-1}d \sim \chi^2_p$

where $n = n_1 + n_2$.

(b) Case of $\Sigma$ unknown
We have the Wishart distributed quantities

$n_1 S_1 \sim W_p(\Sigma, n_1 - 1), \qquad n_2 S_2 \sim W_p(\Sigma, n_2 - 1)$

Let

$S_p = \frac{n_1 S_1 + n_2 S_2}{n - 2}$

be the pooled estimator of the covariance matrix $\Sigma$. Then from the additive properties of the Wishart distribution, $(n-2)S_p$ has the Wishart distribution $W_p(\Sigma, n-2)$, and

$\sqrt{\frac{n_1 n_2}{n}}\, d \sim N_p(0, \Sigma)$

It may be shown that

$T^2 = \frac{n_1 n_2}{n}\, d^T S_p^{-1} d$

has the distribution of a Hotelling's $T^2$ statistic. In fact

$T^2 \sim \frac{(n-2)p}{n-p-1}\,F_{p,\,n-p-1}$   (4.39)
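The two-sample statistic and its $F$ calibration (4.39) can be sketched as follows (Python with numpy/scipy, not part of the original notes; the two samples are simulated from the same population, so $H_0$ holds):

```python
# Two-sample Hotelling T^2 (4.39) on simulated data with a common covariance.
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(2)
n1, n2, p = 30, 35, 3
X1 = rng.standard_normal((n1, p))
X2 = rng.standard_normal((n2, p))          # same population, so H0 holds

n = n1 + n2
d = X1.mean(axis=0) - X2.mean(axis=0)
S1 = np.cov(X1, rowvar=False, bias=True)   # m.l. covariances (divisor n_i)
S2 = np.cov(X2, rowvar=False, bias=True)
Sp = (n1 * S1 + n2 * S2) / (n - 2)         # pooled estimator of Sigma

T2 = (n1 * n2 / n) * d @ np.linalg.solve(Sp, d)
Fstat = T2 * (n - p - 1) / ((n - 2) * p)   # ~ F_{p, n-p-1} under H0
pval = f.sf(Fstat, p, n - p - 1)
print(round(T2, 3), round(pval, 3))
```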

4.15 Multi-sample procedures (MANOVA)

We consider the case of $k$ samples from populations $\Pi_1, \ldots, \Pi_k$; the sample from population $\Pi_i$ is of size $n_i$. By analogy with the univariate case we can decompose the total SSP matrix into orthogonal parts. This decomposition can be represented as a Multivariate Analysis of Variance (MANOVA) table.

The MANOVA model is

$x_{ij} = \mu + \tau_i + e_{ij}, \qquad j = 1, \ldots, n_i; \quad i = 1, \ldots, k$   (4.40)

where the $e_{ij}$ are independent $N_p(0, \Sigma)$ variables. Here the parameter vector $\mu$ is the overall (grand) mean and $\tau_i$ is the $i$th treatment effect, with

$\sum_{i=1}^{k} n_i \tau_i = 0$   (4.41)

Define the $i$th sample mean as $\bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}$. The Between Groups sum of squares and cross-products (SSP) matrix is

$B = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T$   (4.42)

The grand mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{k} n_i \bar{x}_i$, where $n = \sum_i n_i$, and the Total SSP matrix is

$T = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})(x_{ij} - \bar{x})^T$   (4.43)

It can be shown algebraically that $T = B + W$, where $W$ is the Within Groups (or residual) SSP matrix given by

$W = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^T$   (4.44)

The MANOVA table is

Source of variation              Matrix of SS and cross-products (SSP)                             d.f.
Treatment                        $B = \sum_i n_i(\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T$      $k - 1$
Residual                         $W = \sum_i\sum_j (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^T$     $\sum_i n_i - k$
Total (corrected for the mean)   $T = B + W = \sum_i\sum_j (x_{ij} - \bar{x})(x_{ij} - \bar{x})^T$  $\sum_i n_i - 1$

We are interested in testing the hypothesis

$H_0: \tau_1 = \cdots = \tau_k = 0$   (4.45)

i.e. whether the samples in fact come from the same population, against the general alternative that the $\tau_i$ are not all equal. We can derive a likelihood ratio test statistic known as Wilks' $\Lambda$.

Under $H_0$ the m.l.e.s are $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$, where $nS = T$, leading to the maximized log likelihood (minimum of $-2\log L$)

$-2l_0 = np + n\log|S|$

Under $H_1$ the m.l.e.s are $\hat{\mu}_i = \bar{x}_i$ and $\hat{\Sigma} = \frac{1}{n}W$. This follows from

$-2l_1 = \min_{\Sigma,\,d_i}\left\{n\log|\Sigma| + \sum_{i=1}^{k} n_i\,\mathrm{tr}\!\left(\Sigma^{-1}(S_i + d_i d_i^T)\right)\right\} = \min_{\Sigma}\left\{n\log|\Sigma| + n\,\mathrm{tr}\!\left(\Sigma^{-1}\frac{1}{n}\sum_{i=1}^{k} n_i S_i\right)\right\}$

since $\hat{d}_i = \bar{x}_i - \hat{\mu}_i = 0$. Hence $\hat{\Sigma} = \frac{1}{n}\sum_i n_i S_i = \frac{1}{n}W$ and

$-2l_1 = np + n\log\left|\frac{1}{n}W\right|$

Therefore, since $T = nS$,

$-2\log\lambda = 2(l_1 - l_0) = n\log\frac{|T|}{|W|} = -n\log\Lambda$   (4.46)

where $\Lambda = |W|/|T|$ is known as Wilks' $\Lambda$ statistic. We reject $H_0$ for small values of $\Lambda$, or large values of $-n\log\Lambda$. Asymptotically, the rejection region is the upper tail of a $\chi^2_{p(k-1)}$ distribution: under $H_0$ the unknown $\mu$ has $p$ parameters, and under $H_1$ the number of parameters for $\mu_1, \ldots, \mu_k$ is $pk$, hence the d.f. of the $\chi^2$ is $p(k-1)$. Apart from this asymptotic result, other approximate distributions (notably Bartlett's approximation) are available, but the details are outside the scope of this course.

4.15.1 Calculation of Wilks' $\Lambda$

Result
Let $\lambda_1, \ldots, \lambda_p$ be the eigenvalues of $W^{-1}B$. Then

$\Lambda = \prod_{j=1}^{p}\frac{1}{1 + \lambda_j}$   (4.47)

Proof

$\Lambda^{-1} = \frac{|T|}{|W|} = \left|W^{-1}(W + B)\right| = \left|I_p + W^{-1}B\right| = \prod_{j=1}^{p}(1 + \lambda_j)$   (4.48)

by the useful identity proved earlier in the notes.

4.15.2 Case k = 2

We show that use of Wilks' $\Lambda$ for $k = 2$ groups is equivalent to using Hotelling's $T^2$ statistic. Specifically, we show that $\Lambda$ is a monotonic function of $T^2$, so that rejecting $H_0$ for $\Lambda < c_1$ is equivalent to rejecting $H_0$ for $T^2 > c_2$ (for some constants $c_1, c_2$).

Proof
For $k = 2$ we can show (Ex.) that

$B = \frac{n_1 n_2}{n}\,dd^T$   (4.49)

where $d = \bar{x}_1 - \bar{x}_2$. Then

$\Lambda^{-1} = \left|I_p + W^{-1}B\right| = \left|I_p + \frac{n_1 n_2}{n} W^{-1}dd^T\right| = 1 + \frac{n_1 n_2}{n}\,d^T W^{-1} d$

Now $W$ is just $(n-2)S_p$, where $S_p$ is the pooled estimator of $\Sigma$. Thus

$\Lambda^{-1} = 1 + \frac{T^2}{n-2}$   (4.50)
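The eigenvalue formula (4.47) and the direct ratio $|W|/|T|$ can be checked against each other numerically. A sketch (Python/numpy, not part of the original notes; three invented groups with shifted means):

```python
# Wilks' Lambda via the eigenvalues of W^{-1}B (4.47), checked against the direct
# ratio |W|/|T|, on simulated data with k = 3 invented groups.
import numpy as np

rng = np.random.default_rng(3)
p, sizes = 2, [20, 25, 30]
shifts = ([0.0, 0.0], [0.5, 0.0], [0.0, 0.5])
groups = [rng.standard_normal((ni, p)) + mu for ni, mu in zip(sizes, shifts)]

X = np.vstack(groups)
xbar = X.mean(axis=0)                       # grand mean

B = sum(len(g) * np.outer(g.mean(axis=0) - xbar, g.mean(axis=0) - xbar)
        for g in groups)                    # between-groups SSP
W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)  # within

lam_direct = np.linalg.det(W) / np.linalg.det(B + W)
eig = np.linalg.eigvals(np.linalg.solve(W, B)).real
lam_eigen = np.prod(1.0 / (1.0 + eig))
print(np.isclose(lam_direct, lam_eigen))
```

Since $B$ is positive semi-definite, $\Lambda$ always lies in $(0, 1]$.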

5. Discriminant Analysis (Classification)

Given $k$ populations (groups) $\Pi_1, \ldots, \Pi_k$, an individual from $\Pi_j$ has p.d.f. $f_j(x)$ for a set of $p$ measurements $x$. The purpose of discriminant analysis is to allocate an individual to one of the groups $\{\Pi_j\}$ on the basis of $x$, making as few "mistakes" as possible. For example, a patient presents at a doctor's surgery with a set of symptoms $x$; the symptoms suggest a number of possible disease groups $\{\Pi_j\}$ to which the patient might belong. What is the most likely diagnosis?

The aim initially is to find a partition of $\mathbb{R}^p$ into disjoint regions $R_1, \ldots, R_k$ together with a decision rule

$x \in R_j \;\Longrightarrow\; \text{allocate } x \text{ to } \Pi_j$

The decision rule will be more accurate if "$\Pi_j$ has most of its probability concentrated in $R_j$" for each $j$.

5.1 The maximum likelihood (ML) rule

Allocate $x$ to the population $\Pi_j$ that gives the largest likelihood to $x$, i.e. choose $j$ by

$L_j(x) = \max_{1 \le i \le k} L_i(x)$

(break ties arbitrarily).

Result
If $\{\Pi_i\}$ is the multivariate normal (MVN) population $N_p(\mu_i, \Sigma)$ for $i = 1, \ldots, k$, the ML rule allocates $x$ to the population $\Pi_i$ that minimizes the Mahalanobis distance between $x$ and $\mu_i$.

Proof

$L_i(x) = |2\pi\Sigma|^{-1/2}\exp\!\left\{-\tfrac{1}{2}(x - \mu_i)^T\Sigma^{-1}(x - \mu_i)\right\}$

so the likelihood is maximized when the exponent, i.e. the quadratic form $(x-\mu_i)^T\Sigma^{-1}(x-\mu_i)$, is minimized.

Result
When $k = 2$ the ML rule allocates $x$ to $\Pi_1$ if

$d^T\Sigma^{-1}(x - \bar{\mu}) > 0$   (5.1)

where $d = \mu_1 - \mu_2$ and $\bar{\mu} = \tfrac{1}{2}(\mu_1 + \mu_2)$, and to $\Pi_2$ otherwise.

Proof
For the two group case, the ML rule is to allocate $x$ to $\Pi_1$ if

$(x - \mu_1)^T\Sigma^{-1}(x - \mu_1) < (x - \mu_2)^T\Sigma^{-1}(x - \mu_2)$

which reduces to

$d^T\Sigma^{-1}x > \tfrac{1}{2}(\mu_1 - \mu_2)^T\Sigma^{-1}(\mu_1 + \mu_2) = d^T\Sigma^{-1}\bar{\mu}$

Hence the result.

The function

$h(x) = (\mu_1 - \mu_2)^T\Sigma^{-1}\left\{x - \tfrac{1}{2}(\mu_1 + \mu_2)\right\}$   (5.2)

is known as the discriminant function (DF). In this case the DF is linear in $x$.

5.2 Sample ML rule

In practice $\mu_1$, $\mu_2$, $\Sigma$ are estimated by, respectively, $\bar{x}_1$, $\bar{x}_2$, $S_P$, where $S_P$ is the pooled (unbiased) estimator of the covariance matrix.

Example
The eminent statistician R.A. Fisher took measurements on samples of size 50 of three types of iris. Two of the variables, $x_1$ = sepal length and $x_2$ = sepal width, gave the following data on species I and II:

$\bar{x}_1 = \begin{pmatrix}5.0\\3.4\end{pmatrix}, \qquad \bar{x}_2 = \begin{pmatrix}6.0\\2.8\end{pmatrix}$

(The data have been rounded for clarity.) Pooling the two sample covariance matrices $S_1$, $S_2$ gives

$S_p = \frac{50 S_1 + 50 S_2}{98} \approx \begin{pmatrix}0.19 & 0.09\\0.09 & 0.12\end{pmatrix}$

Hence

$\hat{d} = S_p^{-1}(\bar{x}_1 - \bar{x}_2) \approx \begin{pmatrix}-11.4\\14.1\end{pmatrix}, \qquad \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) = \begin{pmatrix}5.5\\3.1\end{pmatrix}$

giving the rule: allocate $x$ to $\Pi_1$ if

$-11.4(x_1 - 5.5) + 14.1(x_2 - 3.1) > 0$, i.e. $-11.4\,x_1 + 14.1\,x_2 + 19.0 > 0$
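The sample rule (5.2) can be coded directly from these quantities. The sketch below (Python/numpy, not part of the original notes) uses the rounded iris-style numbers quoted above; because of the rounding, the coefficient vector comes out close to, but not exactly equal to, the $(-11.4, 14.1)$ in the notes:

```python
# Sample ML discriminant rule (5.2): allocate to population 1 when h(x) > 0.
import numpy as np

xbar1 = np.array([5.0, 3.4])
xbar2 = np.array([6.0, 2.8])
Sp = np.array([[0.19, 0.09],
               [0.09, 0.12]])           # pooled covariance, as read from the notes

w = np.linalg.solve(Sp, xbar1 - xbar2)  # coefficient vector S_p^{-1}(xbar1 - xbar2)
mid = 0.5 * (xbar1 + xbar2)             # midpoint between the two group means

def h(x):
    """Linear discriminant function h(x) = w'(x - mid)."""
    return w @ (np.asarray(x) - mid)

print(h([5.1, 3.5]) > 0, h([6.1, 2.9]) > 0)
```

A point near $\bar{x}_1$ is allocated to species I ($h > 0$) and a point near $\bar{x}_2$ to species II ($h < 0$).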

5.3 Misclassification probabilities

The misclassification probabilities $p_{ij}$, defined as

$p_{ij} = \Pr[\text{allocate to } \Pi_i \text{ when in fact from } \Pi_j]$

form a $k \times k$ matrix, of which the diagonal elements $\{p_{ii}\}$ are a measure of the classifier's accuracy.

For the case $k = 2$:

$p_{12} = \Pr[h(x) > 0 \mid \Pi_2]$

Since $h(x) = d^T\Sigma^{-1}(x - \bar{\mu})$ is a linear compound of $x$, it has a (univariate) normal distribution. Given that $x$ comes from $\Pi_2$:

$E[h(x)] = d^T\Sigma^{-1}\!\left(\mu_2 - \tfrac{1}{2}(\mu_1 + \mu_2)\right) = -\tfrac{1}{2}\,d^T\Sigma^{-1}(\mu_1 - \mu_2) = -\tfrac{1}{2}\Delta^2$

where $\Delta^2 = (\mu_1 - \mu_2)^T\Sigma^{-1}(\mu_1 - \mu_2)$ is the Mahalanobis distance between $\mu_1$ and $\mu_2$. The variance of $h(x)$ is

$\mathrm{Var}[h(x)] = d^T\Sigma^{-1}\Sigma\,\Sigma^{-1}d = (\mu_1 - \mu_2)^T\Sigma^{-1}(\mu_1 - \mu_2) = \Delta^2$

Hence

$p_{12} = \Pr[h(x) > 0] = \Pr\!\left[\frac{h(x) + \tfrac{1}{2}\Delta^2}{\Delta} > \frac{\Delta}{2}\right] = \Pr\!\left[Z > \frac{\Delta}{2}\right] = 1 - \Phi\!\left(\frac{\Delta}{2}\right)$   (5.3)

By symmetry this is also $p_{21}$, i.e. $p_{12} = p_{21}$.

Example (contd.)
We can estimate the misclassification probability from the sample Mahalanobis distance between $\bar{x}_1$ and $\bar{x}_2$:

$D^2 = (\bar{x}_1 - \bar{x}_2)^T S_p^{-1} (\bar{x}_1 - \bar{x}_2) = (-1.0,\ 0.6)\begin{pmatrix}-11.4\\14.1\end{pmatrix} \approx 19.9$

$\hat{p}_{12} = 1 - \Phi\!\left(\frac{D}{2}\right) = \Phi(-2.23) \approx 0.013$

The misclassification rate is 1.3%.
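The plug-in estimate (5.3) follows in one line from $D^2$. A sketch (Python with numpy/scipy, not part of the original notes), again on the rounded iris-style inputs, which give $D^2 \approx 20$ against the $\approx 19.9$ quoted above:

```python
# Estimated misclassification probability (5.3): p12 = p21 = Phi(-D/2), with D^2
# the sample Mahalanobis distance between the two group means.
import numpy as np
from scipy.stats import norm

xbar1 = np.array([5.0, 3.4])
xbar2 = np.array([6.0, 2.8])
Sp = np.array([[0.19, 0.09],
               [0.09, 0.12]])             # pooled covariance, as read from the notes

d = xbar1 - xbar2
D2 = d @ np.linalg.solve(Sp, d)           # sample Mahalanobis distance squared
p_mis = norm.cdf(-np.sqrt(D2) / 2)        # estimated misclassification probability
print(round(D2, 1), round(p_mis, 3))
```

The estimate lands near the 1.3% misclassification rate quoted in the notes.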

### Notes for STA 437/1005 Methods for Multivariate Data

Notes for STA 437/1005 Methods for Multivariate Data Radford M. Neal, 26 November 2010 Random Vectors Notation: Let X be a random vector with p elements, so that X = [X 1,..., X p ], where denotes transpose.

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### Multivariate Normal Distribution

Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

### Hypothesis Testing in the Classical Regression Model

LECTURE 5 Hypothesis Testing in the Classical Regression Model The Normal Distribution and the Sampling Distributions It is often appropriate to assume that the elements of the disturbance vector ε within

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### 4.6 Null Space, Column Space, Row Space

NULL SPACE, COLUMN SPACE, ROW SPACE Null Space, Column Space, Row Space In applications of linear algebra, subspaces of R n typically arise in one of two situations: ) as the set of solutions of a linear

### 1 Introduction to Matrices

1 Introduction to Matrices In this section, important definitions and results from matrix algebra that are useful in regression analysis are introduced. While all statements below regarding the columns

### 1 Introduction. 2 Matrices: Definition. Matrix Algebra. Hervé Abdi Lynne J. Williams

In Neil Salkind (Ed.), Encyclopedia of Research Design. Thousand Oaks, CA: Sage. 00 Matrix Algebra Hervé Abdi Lynne J. Williams Introduction Sylvester developed the modern concept of matrices in the 9th

### Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

### Multivariate Analysis of Variance (MANOVA): I. Theory

Gregory Carey, 1998 MANOVA: I - 1 Multivariate Analysis of Variance (MANOVA): I. Theory Introduction The purpose of a t test is to assess the likelihood that the means for two groups are sampled from the

### Multivariate Statistical Inference and Applications

Multivariate Statistical Inference and Applications ALVIN C. RENCHER Department of Statistics Brigham Young University A Wiley-Interscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim

### Chapter 4: Statistical Hypothesis Testing

Chapter 4: Statistical Hypothesis Testing Christophe Hurlin November 20, 2015 Christophe Hurlin () Advanced Econometrics - Master ESA November 20, 2015 1 / 225 Section 1 Introduction Christophe Hurlin

### 1 Another method of estimation: least squares

1 Another method of estimation: least squares erm: -estim.tex, Dec8, 009: 6 p.m. (draft - typos/writos likely exist) Corrections, comments, suggestions welcome. 1.1 Least squares in general Assume Y i

### Factor analysis. Angela Montanari

Factor analysis Angela Montanari 1 Introduction Factor analysis is a statistical model that allows to explain the correlations between a large number of observed correlated variables through a small number

### Introduction to Matrix Algebra

Psychology 7291: Multivariate Statistics (Carey) 8/27/98 Matrix Algebra - 1 Introduction to Matrix Algebra Definitions: A matrix is a collection of numbers ordered by rows and columns. It is customary

### 1 Vector Spaces and Matrix Notation

1 Vector Spaces and Matrix Notation De nition 1 A matrix: is rectangular array of numbers with n rows and m columns. 1 1 1 a11 a Example 1 a. b. c. 1 0 0 a 1 a The rst is square with n = and m = ; the

### SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2015 Timo Koski Matematisk statistik 24.09.2015 1 / 1 Learning outcomes Random vectors, mean vector, covariance matrix,

### Eigenvalues, Eigenvectors, Matrix Factoring, and Principal Components

Eigenvalues, Eigenvectors, Matrix Factoring, and Principal Components The eigenvalues and eigenvectors of a square matrix play a key role in some important operations in statistics. In particular, they

### Mathematics Course 111: Algebra I Part IV: Vector Spaces

Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 1996-7 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are

### Inner Product Spaces and Orthogonality

Inner Product Spaces and Orthogonality week 3-4 Fall 2006 Dot product of R n The inner product or dot product of R n is a function, defined by u, v a b + a 2 b 2 + + a n b n for u a, a 2,, a n T, v b,

### Math 312 Homework 1 Solutions

Math 31 Homework 1 Solutions Last modified: July 15, 01 This homework is due on Thursday, July 1th, 01 at 1:10pm Please turn it in during class, or in my mailbox in the main math office (next to 4W1) Please

### Linear Algebra Review. Vectors

Linear Algebra Review By Tim K. Marks UCSD Borrows heavily from: Jana Kosecka kosecka@cs.gmu.edu http://cs.gmu.edu/~kosecka/cs682.html Virginia de Sa Cogsci 8F Linear Algebra review UCSD Vectors The length

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Chapter 3: The Multiple Linear Regression Model

Chapter 3: The Multiple Linear Regression Model Advanced Econometrics - HEC Lausanne Christophe Hurlin University of Orléans November 23, 2013 Christophe Hurlin (University of Orléans) Advanced Econometrics

### Review Jeopardy. Blue vs. Orange. Review Jeopardy

Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 0-3 Jeopardy Round \$200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?

### The Singular Value Decomposition in Symmetric (Löwdin) Orthogonalization and Data Compression

The Singular Value Decomposition in Symmetric (Löwdin) Orthogonalization and Data Compression The SVD is the most generally applicable of the orthogonal-diagonal-orthogonal type matrix decompositions Every

### MATH 304 Linear Algebra Lecture 4: Matrix multiplication. Diagonal matrices. Inverse matrix.

MATH 304 Linear Algebra Lecture 4: Matrix multiplication. Diagonal matrices. Inverse matrix. Matrices Definition. An m-by-n matrix is a rectangular array of numbers that has m rows and n columns: a 11

### Multivariate normal distribution and testing for means (see MKB Ch 3)

Multivariate normal distribution and testing for means (see MKB Ch 3) Where are we going? 2 One-sample t-test (univariate).................................................. 3 Two-sample t-test (univariate).................................................

### A matrix over a field F is a rectangular array of elements from F. The symbol

Chapter MATRICES Matrix arithmetic A matrix over a field F is a rectangular array of elements from F The symbol M m n (F) denotes the collection of all m n matrices over F Matrices will usually be denoted

### 1. The Classical Linear Regression Model: The Bivariate Case

Business School, Brunel University MSc. EC5501/5509 Modelling Financial Decisions and Markets/Introduction to Quantitative Methods Prof. Menelaos Karanasos (Room SS69, Tel. 018956584) Lecture Notes 3 1.

### SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2014 Timo Koski () Mathematisk statistik 24.09.2014 1 / 75 Learning outcomes Random vectors, mean vector, covariance

### Section 6.1 - Inner Products and Norms

Section 6.1 - Inner Products and Norms Definition. Let V be a vector space over F {R, C}. An inner product on V is a function that assigns, to every ordered pair of vectors x and y in V, a scalar in F,

### Introduction to Principal Components and FactorAnalysis

Introduction to Principal Components and FactorAnalysis Multivariate Analysis often starts out with data involving a substantial number of correlated variables. Principal Component Analysis (PCA) is a

### Factor Analysis. Factor Analysis

Factor Analysis Principal Components Analysis, e.g. of stock price movements, sometimes suggests that several variables may be responding to a small number of underlying forces. In the factor model, we

### Math 576: Quantitative Risk Management

Math 576: Quantitative Risk Management Haijun Li lih@math.wsu.edu Department of Mathematics Washington State University Week 4 Haijun Li Math 576: Quantitative Risk Management Week 4 1 / 22 Outline 1 Basics

### CONTROLLABILITY. Chapter 2. 2.1 Reachable Set and Controllability. Suppose we have a linear system described by the state equation

Chapter 2 CONTROLLABILITY 2 Reachable Set and Controllability Suppose we have a linear system described by the state equation ẋ Ax + Bu (2) x() x Consider the following problem For a given vector x in

### Least Squares Estimation

Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

### A Introduction to Matrix Algebra and Principal Components Analysis

A Introduction to Matrix Algebra and Principal Components Analysis Multivariate Methods in Education ERSH 8350 Lecture #2 August 24, 2011 ERSH 8350: Lecture 2 Today s Class An introduction to matrix algebra

### Applied Linear Algebra I Review page 1

Applied Linear Algebra Review 1 I. Determinants A. Definition of a determinant 1. Using sum a. Permutations i. Sign of a permutation ii. Cycle 2. Uniqueness of the determinant function in terms of properties

### UNIT 2 MATRICES - I 2.0 INTRODUCTION. Structure

UNIT 2 MATRICES - I Matrices - I Structure 2.0 Introduction 2.1 Objectives 2.2 Matrices 2.3 Operation on Matrices 2.4 Invertible Matrices 2.5 Systems of Linear Equations 2.6 Answers to Check Your Progress

### 1 2 3 1 1 2 x = + x 2 + x 4 1 0 1

(d) If the vector b is the sum of the four columns of A, write down the complete solution to Ax = b. 1 2 3 1 1 2 x = + x 2 + x 4 1 0 0 1 0 1 2. (11 points) This problem finds the curve y = C + D 2 t which

### CITY UNIVERSITY LONDON. BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION

No: CITY UNIVERSITY LONDON BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION ENGINEERING MATHEMATICS 2 (resit) EX2005 Date: August

### Factor Analysis. Chapter 420. Introduction

Chapter 420 Introduction (FA) is an exploratory technique applied to a set of observed variables that seeks to find underlying factors (subsets of variables) from which the observed variables were generated.

### 3. The Multivariate Normal Distribution

3. The Multivariate Normal Distribution 3.1 Introduction A generalization of the familiar bell shaped normal density to several dimensions plays a fundamental role in multivariate analysis While real data

### Summary of week 8 (Lectures 22, 23 and 24)

WEEK 8 Summary of week 8 (Lectures 22, 23 and 24) This week we completed our discussion of Chapter 5 of [VST] Recall that if V and W are inner product spaces then a linear map T : V W is called an isometry

### MATH 240 Fall, Chapter 1: Linear Equations and Matrices

MATH 240 Fall, 2007 Chapter Summaries for Kolman / Hill, Elementary Linear Algebra, 9th Ed. written by Prof. J. Beachy Sections 1.1 1.5, 2.1 2.3, 4.2 4.9, 3.1 3.5, 5.3 5.5, 6.1 6.3, 6.5, 7.1 7.3 DEFINITIONS

### NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

### 9 MATRICES AND TRANSFORMATIONS

9 MATRICES AND TRANSFORMATIONS Chapter 9 Matrices and Transformations Objectives After studying this chapter you should be able to handle matrix (and vector) algebra with confidence, and understand the

### Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume

### Factorization Theorems

Chapter 7 Factorization Theorems This chapter highlights a few of the many factorization theorems for matrices While some factorization results are relatively direct, others are iterative While some factorization

### Nonlinear Iterative Partial Least Squares Method

Numerical Methods for Determining Principal Component Analysis Abstract Factors Béchu, S., Richard-Plouet, M., Fernandez, V., Walton, J., and Fairley, N. (2016) Developments in numerical treatments for

### Vector and Matrix Norms

Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty

### MATH 304 Linear Algebra Lecture 8: Inverse matrix (continued). Elementary matrices. Transpose of a matrix.

MATH 304 Linear Algebra Lecture 8: Inverse matrix (continued). Elementary matrices. Transpose of a matrix. Inverse matrix Definition. Let A be an n n matrix. The inverse of A is an n n matrix, denoted

### Notes on Symmetric Matrices

CPSC 536N: Randomized Algorithms 2011-12 Term 2 Notes on Symmetric Matrices Prof. Nick Harvey University of British Columbia 1 Symmetric Matrices We review some basic results concerning symmetric matrices.

### Sections 2.11 and 5.8

Sections 211 and 58 Timothy Hanson Department of Statistics, University of South Carolina Stat 704: Data Analysis I 1/25 Gesell data Let X be the age in in months a child speaks his/her first word and

### LINEAR ALGEBRA W W L CHEN

LINEAR ALGEBRA W W L CHEN c W W L Chen, 1997, 2008 This chapter is available free to all individuals, on understanding that it is not to be used for financial gain, and may be downloaded and/or photocopied,

### Random Vectors and the Variance Covariance Matrix

Random Vectors and the Variance Covariance Matrix Definition 1. A random vector X is a vector (X 1, X 2,..., X p ) of jointly distributed random variables. As is customary in linear algebra, we will write

### Inner Product Spaces

Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and

### 3. Let A and B be two n n orthogonal matrices. Then prove that AB and BA are both orthogonal matrices. Prove a similar result for unitary matrices.

Exercise 1 1. Let A be an n n orthogonal matrix. Then prove that (a) the rows of A form an orthonormal basis of R n. (b) the columns of A form an orthonormal basis of R n. (c) for any two vectors x,y R

### EC327: Advanced Econometrics, Spring 2007

EC327: Advanced Econometrics, Spring 2007 Wooldridge, Introductory Econometrics (3rd ed, 2006) Appendix D: Summary of matrix algebra Basic definitions A matrix is a rectangular array of numbers, with m

### CAPM, Arbitrage, and Linear Factor Models

CAPM, Arbitrage, and Linear Factor Models CAPM, Arbitrage, Linear Factor Models 1/ 41 Introduction We now assume all investors actually choose mean-variance e cient portfolios. By equating these investors

### Principal Components Analysis (PCA)

Principal Components Analysis (PCA) Janette Walde janette.walde@uibk.ac.at Department of Statistics University of Innsbruck Outline I Introduction Idea of PCA Principle of the Method Decomposing an Association

### MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

### Similarity and Diagonalization. Similar Matrices

MATH022 Linear Algebra Brief lecture notes 48 Similarity and Diagonalization Similar Matrices Let A and B be n n matrices. We say that A is similar to B if there is an invertible n n matrix P such that

### Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

### α = u v. In other words, Orthogonal Projection

Orthogonal Projection Given any nonzero vector v, it is possible to decompose an arbitrary vector u into a component that points in the direction of v and one that points in a direction orthogonal to v

### Chapter 6: Multivariate Cointegration Analysis

Chapter 6: Multivariate Cointegration Analysis 1 Contents: Lehrstuhl für Department Empirische of Wirtschaftsforschung Empirical Research and und Econometrics Ökonometrie VI. Multivariate Cointegration

### MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS 1. SYSTEMS OF EQUATIONS AND MATRICES 1.1. Representation of a linear system. The general system of m equations in n unknowns can be written a 11 x 1 + a 12 x 2 +

### Multivariate Analysis (Slides 13)

Multivariate Analysis (Slides 13) The final topic we consider is Factor Analysis. A Factor Analysis is a mathematical approach for attempting to explain the correlation between a large set of variables

### Inverses and powers: Rules of Matrix Arithmetic

Contents 1 Inverses and powers: Rules of Matrix Arithmetic 1.1 What about division of matrices? 1.2 Properties of the Inverse of a Matrix 1.2.1 Theorem (Uniqueness of Inverse) 1.2.2 Inverse Test 1.2.3

### Bias in the Estimation of Mean Reversion in Continuous-Time Lévy Processes

Bias in the Estimation of Mean Reversion in Continuous-Time Lévy Processes Yong Bao a, Aman Ullah b, Yun Wang c, and Jun Yu d a Purdue University, IN, USA b University of California, Riverside, CA, USA

### 1 Orthogonal projections and the approximation

Math 1512 Fall 2010 Notes on least squares approximation Given n data points (x 1, y 1 ),..., (x n, y n ), we would like to find the line L, with an equation of the form y = mx + b, which is the best fit

### Topic 1: Matrices and Systems of Linear Equations.

Topic 1: Matrices and Systems of Linear Equations Let us start with a review of some linear algebra concepts we have already learned, such as matrices, determinants, etc Also, we shall review the method

### Dimensionality Reduction: Principal Components Analysis

Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely

### 3.1 Least squares in matrix form

118 3 Multiple Regression 3.1 Least squares in matrix form E Uses Appendix A.2 A.4, A.6, A.7. 3.1.1 Introduction More than one explanatory variable In the foregoing chapter we considered the simple regression

### 1 VECTOR SPACES AND SUBSPACES

1 VECTOR SPACES AND SUBSPACES What is a vector? Many are familiar with the concept of a vector as: Something which has magnitude and direction. an ordered pair or triple. a description for quantities such

### Canonical Correlation

Chapter 400 Introduction Canonical correlation analysis is the study of the linear relations between two sets of variables. It is the multivariate extension of correlation analysis. Although we will present

### Eigenvalues and Eigenvectors

Chapter 6 Eigenvalues and Eigenvectors 6. Introduction to Eigenvalues Linear equations Ax D b come from steady state problems. Eigenvalues have their greatest importance in dynamic problems. The solution

### Using the Singular Value Decomposition

Using the Singular Value Decomposition Emmett J. Ientilucci Chester F. Carlson Center for Imaging Science Rochester Institute of Technology emmett@cis.rit.edu May 9, 003 Abstract This report introduces

### 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression

Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c11 2013/9/9 page 221 le-tex 221 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial

### Mathematics for Economics (Part I) Note 5: Convex Sets and Concave Functions

Natalia Lazzati Mathematics for Economics (Part I) Note 5: Convex Sets and Concave Functions Note 5 is based on Madden (1986, Ch. 1, 2, 4 and 7) and Simon and Blume (1994, Ch. 13 and 21). Concave functions

### Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression

Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression Saikat Maitra and Jun Yan Abstract: Dimension reduction is one of the major tasks for multivariate

### Elements of probability theory

2 Elements of probability theory Probability theory provides mathematical models for random phenomena, that is, phenomena which under repeated observations yield di erent outcomes that cannot be predicted

### Understanding and Applying Kalman Filtering

Understanding and Applying Kalman Filtering Lindsay Kleeman Department of Electrical and Computer Systems Engineering Monash University, Clayton 1 Introduction Objectives: 1. Provide a basic understanding

The Hadamard Product Elizabeth Million April 12, 2007 1 Introduction and Basic Results As inexperienced mathematicians we may have once thought that the natural definition for matrix multiplication would

### Applied Multivariate Analysis

Neil H. Timm Applied Multivariate Analysis With 42 Figures Springer Contents Preface Acknowledgments List of Tables List of Figures vii ix xix xxiii 1 Introduction 1 1.1 Overview 1 1.2 Multivariate Models

### 5. Orthogonal matrices

L Vandenberghe EE133A (Spring 2016) 5 Orthogonal matrices matrices with orthonormal columns orthogonal matrices tall matrices with orthonormal columns complex matrices with orthonormal columns 5-1 Orthonormal

### Diagonal, Symmetric and Triangular Matrices

Contents 1 Diagonal, Symmetric Triangular Matrices 2 Diagonal Matrices 2.1 Products, Powers Inverses of Diagonal Matrices 2.1.1 Theorem (Powers of Matrices) 2.2 Multiplying Matrices on the Left Right by

### DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

### 2.1: MATRIX OPERATIONS

.: MATRIX OPERATIONS What are diagonal entries and the main diagonal of a matrix? What is a diagonal matrix? When are matrices equal? Scalar Multiplication 45 Matrix Addition Theorem (pg 0) Let A, B, and

### Linear Algebra: Matrices

B Linear Algebra: Matrices B 1 Appendix B: LINEAR ALGEBRA: MATRICES TABLE OF CONTENTS Page B.1. Matrices B 3 B.1.1. Concept................... B 3 B.1.2. Real and Complex Matrices............ B 3 B.1.3.

### The Scalar Algebra of Means, Covariances, and Correlations

3 The Scalar Algebra of Means, Covariances, and Correlations In this chapter, we review the definitions of some key statistical concepts: means, covariances, and correlations. We show how the means, variances,

### Chapter 17. Orthogonal Matrices and Symmetries of Space

Chapter 17. Orthogonal Matrices and Symmetries of Space Take a random matrix, say 1 3 A = 4 5 6, 7 8 9 and compare the lengths of e 1 and Ae 1. The vector e 1 has length 1, while Ae 1 = (1, 4, 7) has length

### Algebra 2 Chapter 1 Vocabulary. identity - A statement that equates two equivalent expressions.

Chapter 1 Vocabulary identity - A statement that equates two equivalent expressions. verbal model- A word equation that represents a real-life problem. algebraic expression - An expression with variables.

### Matrix Algebra LECTURE 1. Simultaneous Equations Consider a system of m linear equations in n unknowns: y 1 = a 11 x 1 + a 12 x 2 + +a 1n x n,

LECTURE 1 Matrix Algebra Simultaneous Equations Consider a system of m linear equations in n unknowns: y 1 a 11 x 1 + a 12 x 2 + +a 1n x n, (1) y 2 a 21 x 1 + a 22 x 2 + +a 2n x n, y m a m1 x 1 +a m2 x

### Lecture 9: Continuous

CSC2515 Fall 2007 Introduction to Machine Learning Lecture 9: Continuous Latent Variable Models 1 Example: continuous underlying variables What are the intrinsic latent dimensions in these two datasets?

### Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics

INTERNATIONAL BLACK SEA UNIVERSITY COMPUTER TECHNOLOGIES AND ENGINEERING FACULTY ELABORATION OF AN ALGORITHM OF DETECTING TESTS DIMENSIONALITY Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree

### CS3220 Lecture Notes: QR factorization and orthogonal transformations

CS3220 Lecture Notes: QR factorization and orthogonal transformations Steve Marschner Cornell University 11 March 2009 In this lecture I ll talk about orthogonal matrices and their properties, discuss

### Solution based on matrix technique Rewrite. ) = 8x 2 1 4x 1x 2 + 5x x1 2x 2 2x 1 + 5x 2

8.2 Quadratic Forms Example 1 Consider the function q(x 1, x 2 ) = 8x 2 1 4x 1x 2 + 5x 2 2 Determine whether q(0, 0) is the global minimum. Solution based on matrix technique Rewrite q( x1 x 2 = x1 ) =

### Chapter 6. Orthogonality

6.3 Orthogonal Matrices 1 Chapter 6. Orthogonality 6.3 Orthogonal Matrices Definition 6.4. An n n matrix A is orthogonal if A T A = I. Note. We will see that the columns of an orthogonal matrix must be