1. Introduction to multivariate data


1.1 Books

Chatfield, C. and A.J. Collins, Introduction to Multivariate Analysis. Chapman & Hall.
Krzanowski, W.J., Principles of Multivariate Analysis. Oxford, 2000.
Johnson, R.A. and D.W. Wichern, Applied Multivariate Statistical Analysis. Prentice Hall.

1.2 Applications

The need often arises in science, medicine and social science (business, management) to analyse data on $p$ variables ($p = 2$ gives bivariate data). Suppose we have a simple random sample of size $n$. The sample consists of $n$ vectors of measurements on $p$ variates, i.e. $n$ $(p \times 1)$ column vectors $x_1, \ldots, x_n$, which are inserted as rows $x_1^T, \ldots, x_n^T$ into an $(n \times p)$ data matrix $X$. When $p = 2$ we can plot the rows in 2-dimensional space, but in higher dimensions, $p > 2$, other techniques are needed.

Example 1: Classification of plants (taxonomy)
Variables ($p = 3$): leaf size ($x_1$), colour of flower ($x_2$), height of plant ($x_3$)
Sample items: $n = 4$ plants from a single species
Aims of analysis: i) understand within-species variability; ii) classify a new plant species

The data matrix is then a $4 \times 3$ array, one row per plant and one column per variable; entries marked "·" were lost in transcription:

                   x_1    x_2    x_3
    Plants   1     6.·     ·      ·
    (Items)  2     8.·     ·      ·
             3     5.3     0      9
             4     6.4     0      ·

Example 2: Credit scoring
Variables: personal data held by a bank
Items: a sample of good/bad customers
Aims of analysis: i) predict potential defaulters (CRM); ii) risk assessment for a new applicant

Example 3: Image processing, e.g. for quality control
Variables: "features" extracted from an image
Items: sampled from a production line
Aims of analysis: i) quantify "normal" variability; ii) reject faulty (off-specification) batches

1.3 Sample mean and covariance matrix

We shall adopt the following notation:

$x$ $(p \times 1)$: a random vector of observations on $p$ variables
$X$ $(n \times p)$: a data matrix whose rows contain an independent random sample $x_1^T, \ldots, x_n^T$ of observations on $x$
$\bar{x}$ $(p \times 1)$: sample mean vector, $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$
$S$ $(p \times p)$: sample covariance matrix containing the sample covariances, defined as $s_{jk} = \frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$
$R$ $(p \times p)$: sample correlation matrix containing the sample correlations, defined as $r_{jk} = s_{jk}/\sqrt{s_{jj} s_{kk}} = s_{jk}/(s_j s_k)$, say

Notes:
1. $\bar{x}_j$ is defined as the $j$th component of $\bar{x}$ (the mean of variable $j$).
2. The covariance matrix $S$ is square, symmetric ($S = S^T$), and holds the sample variances $s_{jj} = s_j^2 = \frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2$ along its main diagonal.
3. The diagonal elements of $R$ are $r_{jj} = 1$, and $|r_{jk}| \le 1$ for each $j, k$.

1.4 Matrix-vector representations

Given an $(n \times p)$ data matrix $X$, define $\mathbf{1} = (1, 1, \ldots, 1)^T$, the $(n \times 1)$ vector of ones.

The row sums of $X$ are obtained by pre-multiplying $X$ by $\mathbf{1}^T$:

    \mathbf{1}^T X = \Big( \sum_i x_{i1}, \ldots, \sum_i x_{ip} \Big) = (n\bar{x}_1, \ldots, n\bar{x}_p) = n\bar{x}^T

Hence

    \bar{x} = \frac{1}{n} X^T \mathbf{1}   (1.1)

The centred data matrix $X_0$ is derived from $X$ by subtracting the variable mean from each element of $X$, i.e. $x_{0,ij} = x_{ij} - \bar{x}_j$, or equivalently by subtracting the constant vector $\bar{x}^T$ from each row of $X$:

    X_0 = X - \mathbf{1}\bar{x}^T = X - \frac{1}{n}\mathbf{1}\mathbf{1}^T X = \Big( I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T \Big) X = HX   (1.2)

where $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is known as the centring matrix. We now define the sample covariance matrix as

    S = \frac{1}{n} X_0^T X_0   (1.3a)
      = \frac{1}{n} \sum_{i=1}^n x_{0i} x_{0i}^T   (1.3b)

$X_0^T X_0$ being the centred sum of squares and products (SSP) matrix, where $x_{0i} = x_i - \bar{x}$ denotes the $i$th mean-corrected data point.

For any real $p$-vector $y$ we then have

    y^T S y = \frac{1}{n} y^T X_0^T X_0 y = \frac{1}{n} z^T z = \frac{1}{n}\|z\|^2 \ge 0, \quad \text{where } z = X_0 y

Hence, from the definition of a p.s.d. matrix, we have:

Proposition: The sample covariance matrix $S$ is positive semi-definite (p.s.d.).

Example: Two measurements $x_1, x_2$ made at the same position on each of 3 cans of food resulted in a $(3 \times 2)$ data matrix $X$ (its individual entries were lost in transcription; the summary statistics below are recoverable). Find the sample mean vector $\bar{x}$ and covariance matrix $S$.

Solution:

    \bar{x} = \frac{1}{3}\sum_{i=1}^3 x_i = (3, 0)^T

and, with $X_0 = X - \mathbf{1}\bar{x}^T$,

    S = \frac{1}{3} X_0^T X_0 = \frac{1}{3}\begin{pmatrix} 14 & 2 \\ 2 & 8 \end{pmatrix} = \begin{pmatrix} 4.67 & 0.67 \\ 0.67 & 2.67 \end{pmatrix}

Note also that $S$ is built up from the individual data points, $S = \frac{1}{3}\sum_i x_{0i}x_{0i}^T$, and that

    R = \begin{pmatrix} 1 & 0.189 \\ 0.189 & 1 \end{pmatrix}

1.5 Measures of multivariate scatter

It is useful to have a single number as a measure of spread in the data. Based on $S$ we define two scalar quantities.

The total variation is

    \operatorname{tr}(S) = \text{trace}(S) = \sum_{i=1}^p s_{ii} = \text{sum of the diagonal elements} = \text{sum of the eigenvalues of } S

The generalized variance is

    |S| = \text{product of the eigenvalues of } S   (1.5)

In the above example

    \operatorname{tr}(S) = \frac{14}{3} + \frac{8}{3} = 7.33, \qquad |S| = \frac{14}{3}\cdot\frac{8}{3} - \Big(\frac{2}{3}\Big)^2 = 12
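These computations are easy to check numerically. Below is a minimal numpy sketch; the data matrix `X` is a hypothetical reconstruction chosen to be consistent with the summary statistics above ($\bar{x}$ and $S$), since the original entries were garbled in transcription.

```python
import numpy as np

# Hypothetical 3x2 data matrix consistent with xbar = (3, 0)^T and
# S = (1/3) X0'X0 = [[14/3, 2/3], [2/3, 8/3]] from the worked example.
X = np.array([[5.0,  2.0],
              [0.0,  0.0],
              [4.0, -2.0]])
n = X.shape[0]

xbar = X.mean(axis=0)                    # sample mean vector
H = np.eye(n) - np.ones((n, n)) / n      # centring matrix H = I - (1/n) 1 1'
X0 = H @ X                               # centred data matrix X0 = HX
S = X0.T @ X0 / n                        # sample covariance matrix (divisor n)

print(xbar)                              # [3. 0.]
print(S)                                 # [[4.667 0.667] [0.667 2.667]]
print(np.trace(S))                       # total variation ~ 7.333
print(np.linalg.det(S))                  # generalized variance ~ 12.0
```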

1.6 Random vectors

We will in this course generally regard the data as an independent random sample from some continuous population distribution with a probability density function

    f(x) = f(x_1, \ldots, x_p)   (1.6)

Here $x = (x_1, \ldots, x_p)^T$ is regarded as a vector of $p$ random variables. Independence here refers to the rows of the data matrix: if two of the variables (columns) are, for example, height and weight of individuals (rows), then knowing one individual's weight says nothing about any other individual. However, the height and weight of any one individual are correlated.

For any region $D$ in the $p$-dimensional space of the variables,

    \Pr(x \in D) = \int_D f(x)\,dx

Mean vector. For any $j$ the population mean of $x_j$ is given by the $p$-fold integral

    E(x_j) = \mu_j = \int x_j f(x)\,dx

where the region of integration is $\mathbb{R}^p$. In vector form

    E(x) = E(x_1, \ldots, x_p)^T = (\mu_1, \ldots, \mu_p)^T = \mu   (1.7)

Covariance matrix. The covariance between $x_j, x_k$ is defined as

    \sigma_{jk} = \operatorname{Cov}(x_j, x_k) = E[(x_j - \mu_j)(x_k - \mu_k)] = E[x_j x_k] - \mu_j\mu_k

When $j = k$ we obtain the variance of $x_j$:

    \sigma_{jj} = E[(x_j - \mu_j)^2]

The covariance matrix is the $p \times p$ matrix

    \Sigma = (\sigma_{ij}) = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1p} \\ \vdots & & \vdots \\ \sigma_{p1} & \cdots & \sigma_{pp} \end{pmatrix}

The alternative notations $V(x)$, $\operatorname{Cov}(x)$ are used. In matrix form

    \Sigma = E\big[ (x - \mu)(x - \mu)^T \big]   (1.8a)
           = E[xx^T] - \mu\mu^T   (1.8b)

More generally we define the covariance between two random vectors $x$ $(p \times 1)$ and $y$ $(q \times 1)$ as the $(p \times q)$ matrix

    \operatorname{Cov}(x, y) = E\big[ (x - \mu_x)(y - \mu_y)^T \big]   (1.9)

Important property of $\Sigma$: $\Sigma$ is a positive semi-definite matrix.

Proof: Let $a$ $(p \times 1)$ be a constant vector; then $E(a^T x) = a^T E(x) = a^T \mu$ and

    V(a^T x) = E\big[ (a^T x - a^T \mu)^2 \big] = a^T E\big[ (x - \mu)(x - \mu)^T \big] a = a^T \Sigma a

Since a variance is always non-negative we find $a^T \Sigma a \ge 0$. From the definition (see handout), $\Sigma$ is a positive semi-definite (p.s.d.) matrix.

Suppose we have an independent random sample $x_1, x_2, \ldots, x_n$ from a distribution with mean $\mu$ and covariance matrix $\Sigma$. What is the relation between (a) the sample and population means, (b) the sample and population covariance matrices?

Result 1: We first establish the mean and covariance of the sample mean $\bar{x}$:

    E(\bar{x}) = \mu   (1.10a)
    V(\bar{x}) = \frac{1}{n}\Sigma   (1.10b)

Proof:

    E(\bar{x}) = E\Big( \frac{1}{n}\sum_{i=1}^n x_i \Big) = \frac{1}{n}\sum_{i=1}^n E(x_i) = \mu

    V(\bar{x}) = \operatorname{Cov}\Big( \frac{1}{n}\sum_i x_i,\ \frac{1}{n}\sum_j x_j \Big) = \frac{1}{n^2}\cdot n\Sigma = \frac{1}{n}\Sigma

noting that $\operatorname{Cov}(x_i, x_i) = \Sigma$ and $\operatorname{Cov}(x_i, x_j) = 0$ for $i \ne j$.

Result 2: We now examine $S$ and derive an unbiased estimator for $\Sigma$:

    E(S) = \frac{n-1}{n}\Sigma   (1.11)

Proof:

    S = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{n}\sum_{i=1}^n x_i x_i^T - \bar{x}\bar{x}^T

since $\frac{1}{n}\sum_i x_i \bar{x}^T = \bar{x}\bar{x}^T$ and $\frac{1}{n}\sum_i \bar{x} x_i^T = \bar{x}\bar{x}^T$. From (1.8b) and (1.10b) we see that

    E(x_i x_i^T) = \Sigma + \mu\mu^T, \qquad E(\bar{x}\bar{x}^T) = \frac{1}{n}\Sigma + \mu\mu^T

hence

    E(S) = \Sigma + \mu\mu^T - \Big( \frac{1}{n}\Sigma + \mu\mu^T \Big) = \frac{n-1}{n}\Sigma

Therefore an unbiased estimate of $\Sigma$ is

    S_u = \frac{n}{n-1} S = \frac{1}{n-1} X_0^T X_0   (1.12)
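A quick Monte Carlo check of (1.10) and (1.11) is sketched below, under assumed illustrative values of $\mu$ and $\Sigma$ (not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
n, reps = 10, 20000

S_sum = np.zeros((2, 2))
for _ in range(reps):
    X = rng.multivariate_normal(mu, Sigma, size=n)
    X0 = X - X.mean(axis=0)
    S_sum += X0.T @ X0 / n               # MLE covariance, divisor n

print(S_sum / reps)                      # approximately ((n-1)/n) * Sigma
print((n - 1) / n * Sigma)               # biased by the factor (n-1)/n, as in (1.11)
```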

1.7 Linear transformations

Let $x = (x_1, \ldots, x_p)^T$ be a random $p$-vector. It is often natural and useful to consider linear combinations of the components of $x$, such as, for example, $y = x_1 + x_2$ or $y = x_1 + x_3 - x_4$. In general we consider a transformation from the $p$-component vector $x$ to a $q$-component vector $y$ $(q < p)$ given by

    y = Ax + b   (1.13)

where $A$ $(q \times p)$ and $b$ $(q \times 1)$ are constant matrices. Suppose that $E(x) = \mu$ and $V(x) = \Sigma$; the corresponding expressions for $y$ are

    E(y) = A\mu + b   (1.14a)
    V(y) = A\Sigma A^T   (1.14b)

These follow from the linearity of the expectation operator:

    E(y) = E(Ax + b) = AE(x) + E(b) = A\mu + b = \mu_y, \text{ say}

    V(y) = E[yy^T] - \mu_y\mu_y^T = E\big[ (Ax + b)(Ax + b)^T \big] - (A\mu + b)(A\mu + b)^T = A\big( E[xx^T] - \mu\mu^T \big)A^T = A\Sigma A^T

(the cross terms in $b$ cancel), as required.

1.8 The Mahalanobis transformation

Given a $p$-variate random variable $x$ with $E(x) = \mu$ and $V(x) = \Sigma$, a transformation to a standardized set of uncorrelated variates is given by the Mahalanobis transformation. Suppose $\Sigma$ is positive definite, i.e. there is no exact linear dependence in $x$. Then the inverse covariance matrix $\Sigma^{-1}$ has a "square root" given by

    \Sigma^{-1/2} = V \Lambda^{-1/2} V^T   (1.15)

where $\Sigma = V\Lambda V^T$ is the spectral decomposition (see handout), i.e. $V$ is an orthogonal matrix ($V^T V = VV^T = I_p$) whose columns are the eigenvectors of $\Sigma$, and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)$ holds the corresponding eigenvalues. The Mahalanobis transformation takes the form

    z = \Sigma^{-1/2}(x - \mu)   (1.16)

Using results (1.14a) and (1.14b) we can show that

    E(z) = 0, \qquad V(z) = I_p

Proof:

    E(z) = E\big[ \Sigma^{-1/2}(x - \mu) \big] = \Sigma^{-1/2}\big[ E(x) - \mu \big] = 0
    V(z) = \Sigma^{-1/2}\, \Sigma\, \Sigma^{-1/2} = I_p

1.8.1 Sample Mahalanobis transformation

Given a data matrix $X^T = (x_1, \ldots, x_n)$, the sample Mahalanobis transformation

    z_i = S^{-1/2}(x_i - \bar{x}), \qquad i = 1, \ldots, n

where $S = S_x = \frac{1}{n}X^T H X$ is the sample covariance matrix, creates a transformed data matrix $Z^T = (z_1, \ldots, z_n)$. The data matrices are related by

    Z^T = S^{-1/2} X^T H \quad \text{or} \quad Z = HXS^{-1/2}   (1.17)

where $H$ is the centring matrix. We may easily show (Ex.) that $Z$ is centred and that $S_z = I_p$.

1.8.2 Sample scaling transformation

A transformation of the data that scales each variable to have mean zero and variance one, but preserves the correlation structure, is given by

    y_i = D^{-1}(x_i - \bar{x}), \qquad i = 1, \ldots, n

where $D = \operatorname{diag}(s_1, \ldots, s_p)$. Now

    Y^T = D^{-1} X^T H \quad \text{or} \quad Y = HXD^{-1}   (1.18)

Ex. Show that $S_y = R_x$.

1.8.3 A useful matrix identity

Let $u, v$ be $n \times 1$ vectors and form the $n \times n$ matrix $A = uv^T$. Then

    |I + uv^T| = 1 + v^T u   (1.19)

Proof: First observe that $A$ and $I + A$ share a common set of eigenvectors, since $Av = \lambda v \Rightarrow (I + A)v = (1 + \lambda)v$. Moreover the eigenvalues of $I + A$ are $1 + \lambda_i$, where $\lambda_i$ are the eigenvalues of $A$. Now $uv^T$ is a rank-one matrix and therefore has a single nonzero eigenvalue (see handout). Since $(uv^T)u = u(v^T u) = \lambda u$, where $\lambda = v^T u$, the eigenvalues of $I + uv^T$ are $1 + \lambda, 1, \ldots, 1$. The determinant of $I + uv^T$ is the product of its eigenvalues, hence the result.
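The sample Mahalanobis transformation (1.17) can be implemented directly from the spectral decomposition of $S$; a sketch:

```python
import numpy as np

def mahalanobis_transform(X):
    """Return Z = H X S^(-1/2); Z has zero mean and sample covariance I_p."""
    n = X.shape[0]
    X0 = X - X.mean(axis=0)              # centred data HX
    S = X0.T @ X0 / n                    # sample covariance, divisor n
    lam, V = np.linalg.eigh(S)           # spectral decomposition S = V diag(lam) V'
    S_inv_half = V @ np.diag(lam ** -0.5) @ V.T
    return X0 @ S_inv_half

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)) @ np.array([[1.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 1.0]])
Z = mahalanobis_transform(X)
print(np.round(Z.mean(axis=0), 10))      # ~ zero vector
print(np.round(Z.T @ Z / 100, 10))       # ~ I_3, i.e. S_z = I_p
```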

2. Principal Components Analysis

2.1 Outline of technique

Let $x^T = (x_1, x_2, \ldots, x_p)$ be a random vector with mean $\mu$ and covariance matrix $\Sigma$. PCA is a technique for dimensionality reduction from $p$ dimensions to $k < p$ dimensions. It tries to find, in order, the $k$ most informative linear combinations $y_1, y_2, \ldots, y_k$ of the set of variables. Here "information" is interpreted as a percentage of the total variation (as previously defined) in $\Sigma$. The $k$ sample PCs that "explain" $x\%$ of the total variation in a sample covariance matrix $S$ may be similarly defined.

2.2 Formulation

Let

    y_1 = a_1^T x, \quad y_2 = a_2^T x, \quad \ldots, \quad y_p = a_p^T x

where $y_j = a_{1j}x_1 + a_{2j}x_2 + \cdots + a_{pj}x_p$ are a sequence of standardized linear combinations (SLCs) of the $x$'s such that $a_j^T a_j = 1$ and $a_j^T a_k = 0$ for $j \ne k$; i.e. $a_1, a_2, \ldots, a_p$ form an orthonormal set of $p$-vectors. Equivalently we may define $A$, the $p \times p$ matrix formed from the columns $\{a_j\}$, as an orthogonal matrix, so that $A^T A = AA^T = I_p$.

We choose $a_1$ to maximize $\operatorname{Var}(y_1) = a_1^T \Sigma a_1$ subject to $a_1^T a_1 = 1$. Then we choose $a_2$ to maximize $\operatorname{Var}(y_2) = a_2^T \Sigma a_2$ subject to $a_2^T a_2 = 1$ and $a_2^T a_1 = 0$, which ensures that $y_2$ will be uncorrelated with $y_1$. Subsequent PCs are chosen as the SLCs that have maximum variance subject to being uncorrelated with the previous PCs.

NB. Sometimes the PCs are taken to be "mean-corrected" linear transformations of the $x$'s, i.e. $y_j = a_j^T(x - \mu)$, emphasizing that the PCs can be considered as direction vectors in $p$-space, relative to the "centre" of the distribution, along which the spread is maximized. In any case $\operatorname{Var}(y_j)$ is the same whichever definition is used.

2.3 Computation

To find the first PC we use the Lagrange multiplier technique for finding the maximum of a function $f(x)$ subject to an equality constraint $g(x) = 0$. We define the Lagrangean function

    L(a_1) = a_1^T \Sigma a_1 - \lambda\big( a_1^T a_1 - 1 \big)

where $\lambda$ is a Lagrange multiplier. Differentiating, we obtain

    \frac{\partial L}{\partial a_1} = 2\Sigma a_1 - 2\lambda a_1 = 0, \qquad \Sigma a_1 = \lambda a_1

Therefore $a_1$ should be chosen to be an eigenvector of $\Sigma$, with eigenvalue $\lambda$. Suppose the eigenvalues of $\Sigma$ are distinct and ranked in decreasing order $\lambda_1 > \lambda_2 > \cdots > \lambda_p > 0$. Then

    \operatorname{Var}(y_1) = a_1^T \Sigma a_1 = \lambda\, a_1^T a_1 = \lambda

so $a_1$ should be chosen as the eigenvector corresponding to the largest eigenvalue $\lambda_1$ of $\Sigma$.

2nd PC: the Lagrangean is

    L(a_2) = a_2^T \Sigma a_2 - \lambda\big( a_2^T a_2 - 1 \big) - \mu\, a_2^T a_1

where $\lambda, \mu$ are Lagrange multipliers. Differentiating:

    \frac{\partial L}{\partial a_2} = 2\Sigma a_2 - 2\lambda a_2 - \mu a_1 = 0

Premultiplying by $a_1^T$ and using $a_1^T a_2 = 0$ gives $2a_1^T \Sigma a_2 - \mu = 0$. However

    a_1^T \Sigma a_2 = (\Sigma a_1)^T a_2 = \lambda_1 a_1^T a_2 = 0

Therefore $\mu = 0$ and

    \Sigma a_2 = \lambda a_2

so $a_2$ is the eigenvector of $\Sigma$ corresponding to the second largest eigenvalue $\lambda_2$.

2.4 Example

The covariance matrix corresponding to scaled (standardized) variables $x_1, x_2$ is

    \Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}

(in fact a correlation matrix). Note $\Sigma$ has total variation $\operatorname{tr}(\Sigma) = 2$. The eigenvalues of $\Sigma$ are the roots of $|\Sigma - \lambda I| = 0$:

    (1 - \lambda)^2 - \rho^2 = 0

Hence $\lambda = 1 + \rho,\ 1 - \rho$. If $\rho > 0$ then $\lambda_1 = 1 + \rho$, $\lambda_2 = 1 - \rho$. To find $a_1$ we substitute $\lambda_1$ into $\Sigma a_1 = \lambda_1 a_1$. Note that this gives just one equation in the components of $a_1^T = (a_{11}, a_{21})$, so $a_{11} = a_{21}$. Applying the normalization $a_{11}^2 + a_{21}^2 = 1$ we obtain

    a_1 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}, \quad \text{and similarly} \quad a_2 = \begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}

so that

    y_1 = \frac{1}{\sqrt{2}}(x_1 + x_2), \qquad y_2 = \frac{1}{\sqrt{2}}(x_1 - x_2)

are the PCs, explaining respectively $100(1+\rho)/2\,\%$ and $100(1-\rho)/2\,\%$ of the total variation. Notice that the PCs themselves are independent of $\rho$, while the proportion of the total variation explained by each PC does depend on $\rho$.
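The eigenstructure is easy to verify numerically; a sketch with the assumed illustrative value $\rho = 0.5$:

```python
import numpy as np

rho = 0.5
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])

lam, A = np.linalg.eigh(Sigma)           # eigh returns ascending eigenvalues
lam, A = lam[::-1], A[:, ::-1]           # reorder to decreasing

print(lam)                               # [1.5 0.5] = (1 + rho, 1 - rho)
print(A)                                 # columns ~ (1,1)/sqrt(2) and (1,-1)/sqrt(2)
print(100 * lam / lam.sum())             # [75. 25.] percent of total variation
```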

2.5 PCA and spectral decomposition

Since $\Sigma$ (also $S$) is a real symmetric matrix, we know that it has the spectral decomposition (eigenanalysis)

    \Sigma = A\Lambda A^T = \sum_{i=1}^p \lambda_i a_i a_i^T

where $\{a_i\}$ are the eigenvectors of $\Sigma$, which we have inserted as columns of the $(p \times p)$ matrix $A$, and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ are the corresponding eigenvalues. If some eigenvalues are not distinct, say $\lambda_k = \lambda_{k+1} = \cdots = \lambda_l$, the eigenvectors are not unique, but we may choose an orthonormal set of eigenvectors to span a subspace of dimension $l - k + 1$ (cf. the major/minor axes of the ellipse $x^2/a^2 + y^2/b^2 = 1$ as $b \to a$). Such a situation arises with the equicorrelation matrix (see Class Exercise).

The transformation of a random $p$-vector $x$ (corrected for its mean $\mu$) to the set of principal components (PCs) contained in the $p$-vector $y$ is

    y = A^T(x - \mu)

$y_1$ is the linear combination (SLC) of $x$ having maximum variance; $y_2$ is the SLC having maximum variance subject to being uncorrelated with $y_1$; and so on. We have seen that $\operatorname{Var}(y_1) = \lambda_1$, $\operatorname{Var}(y_2) = \lambda_2$, ...

2.6 Explanation of variance

The interpretation of the PCs ($y$) as components of variance "explaining" the total variation, i.e. the sum of the variances of the original variables ($x$), is clarified by the following result.

Result: The sums of the variances of the original variables and of their PCs are the same.

Proof: A note on the trace: the sum of the diagonal elements of a $(p \times p)$ square matrix $\Sigma$ is known as the trace,

    \operatorname{tr}(\Sigma) = \sum_{i=1}^p \sigma_{ii}

We show from this definition that $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ whenever $AB$ and $BA$ are defined [i.e. $A$ is $(m \times n)$ and $B$ is $(n \times m)$]:

    \operatorname{tr}(AB) = \sum_i \sum_j a_{ij} b_{ji} = \sum_j \sum_i b_{ji} a_{ij} = \operatorname{tr}(BA)

The sum of the variances of the PCs is

    \sum_i \operatorname{Var}(y_i) = \sum_i \lambda_i = \operatorname{tr}(\Lambda)

Now $\Sigma = A\Lambda A^T$ is the spectral decomposition, and $A$ is orthogonal with $A^T A = I_p$, hence

    \operatorname{tr}(\Sigma) = \operatorname{tr}(A\Lambda A^T) = \operatorname{tr}(\Lambda A^T A) = \operatorname{tr}(\Lambda)

Since $\Sigma$ is the covariance matrix of $x$, the sum of its diagonal elements $\sum_i \sigma_{ii}$ is the sum of the variances of the original variables. Hence the result is proved.

Consequence (interpretation of PCs): It is therefore possible to interpret

    \frac{\lambda_i}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}

as the proportion of the total variation in the original data explained by the $i$th principal component, and

    \frac{\lambda_1 + \cdots + \lambda_k}{\lambda_1 + \cdots + \lambda_p}

as the proportion of the total variation explained by the first $k$ PCs. From a PCA on a $(10 \times 10)$ sample covariance matrix $S$ we could, for example, conclude that the first 3 PCs (out of a total of $p = 10$ PCs) account for 80% of the total variation in the data. This would mean that the variation in the data is largely confined to a 3-dimensional subspace described by the PCs $y_1, y_2, y_3$.

2.7 Scale invariance

This, unfortunately, is a property that PCA does not possess! In practice we often have to choose units of measurement for the individual variables $\{x_i\}$, and the amount of the total variation accounted for by a particular variable $x_i$ depends on this choice (tonnes, kg or grams). In a practical study the data vector $x$ often comprises physically incomparable quantities (e.g. height, weight, temperature), so there is no "natural scaling" to adopt. One possibility is to perform PCA on a correlation matrix (effectively choosing each variable to have unit sample variance), but this is still an implicit choice of scaling. The main point is that the results of a PCA depend on the scaling adopted.

2.8 Principal component scores

The sample PC transform of a data matrix $X$ takes the form, for the $r$th individual ($r$th row of the sample),

    y_r = A^T(x_r - \bar{x})

where the columns of $A$ are the eigenvectors of the sample covariance matrix $S$. Notice that the first component of $y_r$ is the scalar product of the first column of $A$ with $x_{0r} = x_r - \bar{x}$, etc. The components of $y_r$ are known as the (mean-corrected) principal component scores for the $r$th individual. The quantities $y_r = A^T x_r$ are the raw PC scores for that individual. Geometrically the PC scores are the coordinates of each data point with respect to the new axes defined by the PCs, i.e. w.r.t. a rotated frame of reference. The scores can provide qualitative information about individuals.

2.9 Correlation of PCs with original variables

The correlations $\rho(x_i, y_k)$ of the $k$th PC with variable $x_i$ are an aid to interpreting the PCs. Since $y = A^T(x - \mu)$ we have

    \operatorname{Cov}(x, y) = E\big[ (x - \mu)y^T \big] = E\big[ (x - \mu)(x - \mu)^T A \big] = \Sigma A

and from the spectral decomposition

    \Sigma A = A\Lambda A^T A = A\Lambda

Post-multiplying $A$ by a diagonal matrix has the effect of scaling its columns, so that

    \operatorname{Cov}(x_i, y_k) = \lambda_k a_{ik}

is the covariance between the $i$th variable and the $k$th PC. The correlation is

    \rho(x_i, y_k) = \frac{\operatorname{Cov}(x_i, y_k)}{\sqrt{\operatorname{Var}(x_i)\operatorname{Var}(y_k)}} = \frac{\lambda_k a_{ik}}{\sqrt{\sigma_{ii}\lambda_k}} = a_{ik}\sqrt{\frac{\lambda_k}{\sigma_{ii}}}

and $\rho^2(x_i, y_k) = \lambda_k a_{ik}^2/\sigma_{ii}$ can be interpreted as the proportion of the variation in $x_i$ explained by the $k$th PC.
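A sketch of computing mean-corrected PC scores and the correlations $\rho(x_i, y_k) = a_{ik}\sqrt{\lambda_k/s_{ii}}$ for a sample covariance matrix, using simulated illustrative data (not from the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0, 0.0],
                            [[4.0, 2.0, 0.0],
                             [2.0, 3.0, 1.0],
                             [0.0, 1.0, 2.0]], size=200)
n = X.shape[0]
X0 = X - X.mean(axis=0)
S = X0.T @ X0 / n

lam, A = np.linalg.eigh(S)
lam, A = lam[::-1], A[:, ::-1]           # eigenvalues in decreasing order

Y = X0 @ A                               # rows are PC scores y_r = A'(x_r - xbar)
print(np.round(Y.var(axis=0), 3))        # score variances ~ eigenvalues lam

corr = A * np.sqrt(lam) / np.sqrt(np.diag(S))[:, None]   # rho(x_i, y_k)
print(np.round(corr, 3))
print(np.round((corr ** 2).sum(axis=1), 3))  # each row sums to 1 over the PCs
```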

Exercise: Find the PCs of the covariance matrix

    \Sigma = \begin{pmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{pmatrix}

and show that they account for the amounts

    \lambda_1 = 5.83, \qquad \lambda_2 = 2.00, \qquad \lambda_3 = 0.17

of the total variation in $\Sigma$. Compute the correlations $\rho(x_i, y_k)$ and try to interpret the PCs qualitatively.

3. Multivariate Normal Distribution

The MVN distribution is a generalization of the univariate normal distribution, which has the density function (p.d.f.)

    f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{(x - \mu)^2}{2\sigma^2} \Big\}, \qquad -\infty < x < \infty

where $\mu$ = mean of the distribution and $\sigma^2$ = variance. In $p$ dimensions the density becomes

    f(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\Big\{ -\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \Big\}

Within the mean vector $\mu$ there are $p$ (independent) parameters, and within the symmetric covariance matrix $\Sigma$ there are $p(p+1)/2$ independent parameters [$p(p+3)/2$ independent parameters in total]. We use the notation

    x \sim N_p(\mu, \Sigma)   (3.1)

to denote a RV $x$ having the MVN distribution with $E(x) = \mu$, $\operatorname{Cov}(x) = \Sigma$. Note that MVN distributions are entirely characterized by the first and second moments of the distribution.

Basic properties. If $x$ $(p \times 1)$ is MVN with mean $\mu$ and covariance matrix $\Sigma$:   (3.2)

1. Any linear combination of $x$ is MVN: let $y = Ax + c$ with $A$ $(q \times p)$ and $c$ $(q \times 1)$; then $y \sim N_q(\mu_y, \Sigma_y)$, where $\mu_y = A\mu + c$ and $\Sigma_y = A\Sigma A^T$.
2. Any subset of the variables in $x$ has a MVN distribution.
3. If a set of variables is uncorrelated, then they are independently distributed. In particular: i) if $\sigma_{ij} = 0$ then $x_i, x_j$ are independent; ii) if $x$ is MVN with covariance matrix $\Sigma$, then $Ax$ and $Bx$ are independent if and only if

    \operatorname{Cov}(Ax, Bx) = A\Sigma B^T = 0   (3.3)

4. Conditional distributions are MVN.

Result: For the MVN distribution, variables are uncorrelated $\iff$ variables are independent.

Proof: Let $x$ $(p \times 1)$ be partitioned as

    x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad x_1\ (q \times 1), \quad x_2\ ((p-q) \times 1)

with mean vector $\mu = (\mu_1^T, \mu_2^T)^T$ and covariance matrix

    \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}

where $\Sigma_{11}$ is $(q \times q)$, etc.

i) Independent $\Rightarrow$ uncorrelated (always holds). Suppose $x_1, x_2$ are independent. Then $\operatorname{Cov}(x_1, x_2) = E[(x_1 - \mu_1)(x_2 - \mu_2)^T]$ factorizes into the product of $E[(x_1 - \mu_1)]$ and $E[(x_2 - \mu_2)^T]$, which are both zero since $E(x_1) = \mu_1$ and $E(x_2) = \mu_2$. Hence $\Sigma_{12} = 0$.

ii) Uncorrelated $\Rightarrow$ independent (for MVN). This result depends on factorizing the p.d.f. of (3.1) when $\Sigma_{12} = 0$. In this case $(x - \mu)^T \Sigma^{-1} (x - \mu)$ has the partitioned form

    \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}^T \begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} = (x_1 - \mu_1)^T \Sigma_{11}^{-1} (x_1 - \mu_1) + (x_2 - \mu_2)^T \Sigma_{22}^{-1} (x_2 - \mu_2)

so that $\exp\{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\}$ factorizes into the product of $\exp\{-\frac{1}{2}(x_1 - \mu_1)^T \Sigma_{11}^{-1} (x_1 - \mu_1)\}$ and $\exp\{-\frac{1}{2}(x_2 - \mu_2)^T \Sigma_{22}^{-1} (x_2 - \mu_2)\}$. Therefore the p.d.f. can be written as

    f(x) = g(x_1)\,h(x_2)

proving that $x_1$ and $x_2$ are independent.

Result: Let $x = (x_1^T, x_2^T)^T$ be MVN with mean $\mu$ and covariance matrix $\Sigma$, partitioned as above. The conditional distribution of $x_2$ given $x_1$ is MVN with

    E(x_2 \mid x_1) = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1)   (3.4a)
    \operatorname{Cov}(x_2 \mid x_1) = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}   (3.4b)

Proof: Let $x_{2.1} = x_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1$. We first show that $x_{2.1}$ and $x_1$ are independent. Consider the linear transformation

    \begin{pmatrix} x_1 \\ x_{2.1} \end{pmatrix} = \begin{pmatrix} I & 0 \\ -\Sigma_{21}\Sigma_{11}^{-1} & I \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = Ax, \text{ say.}   (3.5)

This linear relationship shows that $x_1, x_{2.1}$ are jointly MVN (by the first property of the MVN stated above). We may show that $x_1$ and $x_{2.1}$ are uncorrelated in two ways. Firstly,

    \operatorname{Cov}(x_1, x_{2.1}) = \operatorname{Cov}(x_1, x_2) - \operatorname{Cov}(x_1, x_1)\Sigma_{11}^{-1}\Sigma_{12} = \Sigma_{12} - \Sigma_{11}\Sigma_{11}^{-1}\Sigma_{12} = 0

or, writing $A = \begin{pmatrix} B \\ C \end{pmatrix}$ in (3.5), with $B = (I, 0)$ and $C = (-\Sigma_{21}\Sigma_{11}^{-1}, I)$, and applying (3.3):

    \operatorname{Cov}(x_1, x_{2.1}) = \operatorname{Cov}(Bx, Cx) = B\Sigma C^T = (\Sigma_{11}, \Sigma_{12}) \begin{pmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{pmatrix} = -\Sigma_{11}\Sigma_{11}^{-1}\Sigma_{12} + \Sigma_{12} = 0

Being MVN and uncorrelated, $x_{2.1}$ and $x_1$ are independent. Therefore

    E(x_{2.1} \mid x_1) = E(x_{2.1}) = \mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\mu_1

Now since $x_2 = x_{2.1} + \Sigma_{21}\Sigma_{11}^{-1}x_1$,

    E(x_2 \mid x_1) = E(x_{2.1} \mid x_1) + \Sigma_{21}\Sigma_{11}^{-1}x_1 = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1)

as required. Because $x_1$ and $x_{2.1}$ are independent,

    \operatorname{Cov}(x_{2.1} \mid x_1) = \operatorname{Cov}(x_{2.1})

Conditional on $x_1$, a given constant, $x_2$ and $x_{2.1} = x_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1$ differ by a constant. Hence

    \operatorname{Cov}(x_2 \mid x_1) = \operatorname{Cov}(x_{2.1} \mid x_1) = \operatorname{Cov}(x_{2.1})

Therefore, with $C = (-\Sigma_{21}\Sigma_{11}^{-1}, I)$ as above,

    \operatorname{Cov}(x_2 \mid x_1) = C\Sigma C^T = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}

Example: Let $x = (x_1, x_2, x_3)^T$ have a MVN distribution with a given $(3 \times 3)$ covariance matrix $\Sigma$ (the numerical entries in this example were lost in transcription). Show that the conditional distribution of $(x_1, x_2)$ given $x_3$ is also MVN, with mean

    \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \frac{1}{\sigma_{33}} \begin{pmatrix} \sigma_{13} \\ \sigma_{23} \end{pmatrix} (x_3 - \mu_3)

and covariance matrix

    \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} - \frac{1}{\sigma_{33}} \begin{pmatrix} \sigma_{13} \\ \sigma_{23} \end{pmatrix} (\sigma_{13}, \sigma_{23})

3.1 Maximum-likelihood estimation

Let $X^T = (x_1, \ldots, x_n)$ contain an independent random sample of size $n$ from $N_p(\mu, \Sigma)$. The maximum likelihood estimates (MLEs) of $\mu$, $\Sigma$ are

    \hat{\mu} = \bar{x}   (3.6a)
    \hat{\Sigma} = S   (3.6b)

The likelihood function is a function of the parameters $\mu, \Sigma$ given the data $X$:

    L(\mu, \Sigma \mid X) = \prod_{r=1}^n f(x_r \mid \mu, \Sigma)   (3.7)

The RHS is evaluated by substituting the individual data vectors $x_1, \ldots, x_n$ in turn into the p.d.f. of $N_p(\mu, \Sigma)$ and taking the product:

    \prod_{r=1}^n f(x_r \mid \mu, \Sigma) = (2\pi)^{-np/2} |\Sigma|^{-n/2} \exp\Big\{ -\frac{1}{2}\sum_{r=1}^n (x_r - \mu)^T \Sigma^{-1} (x_r - \mu) \Big\}

Maximizing $L$ is equivalent to minimizing

    l = -2\log L = -2\sum_{r=1}^n \log f(x_r \mid \mu, \Sigma) = K + n\log|\Sigma| + \sum_{r=1}^n (x_r - \mu)^T \Sigma^{-1} (x_r - \mu)

where $K$ is a constant independent of $\mu, \Sigma$. Noting that $x_r - \mu = (x_r - \bar{x}) + (\bar{x} - \mu)$, the final term above may be written

    \sum_{r=1}^n (x_r - \bar{x})^T \Sigma^{-1} (x_r - \bar{x}) + n(\bar{x} - \mu)^T \Sigma^{-1} (\bar{x} - \mu)

the two cross terms vanishing because $\sum_r (x_r - \bar{x}) = 0$. Thus, dropping the constant $K$,

    l(\mu, \Sigma) = n\log|\Sigma| + \operatorname{tr}(\Sigma^{-1}A) + n\,d^T \Sigma^{-1} d   (3.8a)
                   = n\big[ \log|\Sigma| + \operatorname{tr}(\Sigma^{-1}S) + d^T \Sigma^{-1} d \big]   (3.8b)

where we define, for ease of notation,

    A = nS   (3.9a)
    d = \bar{x} - \mu   (3.9b)

and $S$ is the sample covariance matrix (with divisor $n$). We have made use of $A = nS = C^T C$, where $C$ is the $(n \times p)$ centred data matrix, $C^T = (x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x})$; each term $(x_r - \bar{x})^T \Sigma^{-1} (x_r - \bar{x}) = \operatorname{tr}\big( \Sigma^{-1}(x_r - \bar{x})(x_r - \bar{x})^T \big)$, so

    \sum_{r=1}^n (x_r - \bar{x})^T \Sigma^{-1} (x_r - \bar{x}) = \operatorname{tr}(\Sigma^{-1} C^T C) = \operatorname{tr}(\Sigma^{-1} A) = n\operatorname{tr}(\Sigma^{-1} S)

Notice that $l = l(\mu, \Sigma)$ and the dependence on $\mu$ is entirely through $d$ in (3.8). Now assume that $\Sigma$ is positive definite (p.d.); then so is $\Sigma^{-1}$ (why?). Thus for all $d \ne 0$ we have $d^T \Sigma^{-1} d > 0$, showing that $l$ is minimized with respect to $\mu$, for fixed $\Sigma$, when $d = 0$. Hence $\hat{\mu} = \bar{x}$.

To minimize the log-likelihood $l(\hat{\mu}, \Sigma)$ w.r.t. $\Sigma$, note that, up to an arbitrary additive constant,

    l(\bar{x}, \Sigma) = n\big[ \log|\Sigma| + \operatorname{tr}(\Sigma^{-1}S) \big]

Let

    \psi(\Sigma) = n\big[ \log|\Sigma| + \operatorname{tr}(\Sigma^{-1}S) \big]   (3.10)

We show that

    \psi(\Sigma) - \psi(S) = n\big[ \log|\Sigma| - \log|S| + \operatorname{tr}(\Sigma^{-1}S) - p \big] = n\big[ \operatorname{tr}(\Sigma^{-1}S) - \log|\Sigma^{-1}S| - p \big] \ge 0   (3.11)

Lemma 1: $S$ is positive definite. (Proved elsewhere.)
Lemma 2: For any set of positive numbers, $A \ge 1 + \log G$, where $A$ and $G$ are the arithmetic and geometric means respectively.

Proof: For all $x$ we have $e^x \ge 1 + x$ (simple exercise). For each $y_i > 0$ of a set, $i \in \{1, \ldots, n\}$, therefore

    y_i \ge 1 + \log y_i

so that

    A = \frac{1}{n}\sum_i y_i \ge 1 + \frac{1}{n}\sum_i \log y_i = 1 + \log\Big( \prod_i y_i \Big)^{1/n} = 1 + \log G

as required.

In (3.11), assuming that the eigenvalues of $\Sigma^{-1}S$ are positive, recall that for any square matrix $A$ we have $\operatorname{tr}(A) = \sum_i \lambda_i$, the sum of the eigenvalues, and $|A| = \prod_i \lambda_i$, the product of the eigenvalues. Let $\lambda_i$ $(i = 1, \ldots, p)$ be the eigenvalues of $\Sigma^{-1}S$ and substitute in (3.11):

    \log|\Sigma^{-1}S| = \log\prod_i \lambda_i = p\log G, \qquad \operatorname{tr}(\Sigma^{-1}S) = \sum_i \lambda_i = pA

    \psi(\Sigma) - \psi(S) = np\{ A - \log G - 1 \} \ge 0

This shows that the MLEs are as stated in (3.6).

3.2 Sampling distribution of $\bar{x}$ and $S$

The Wishart distribution (definition): If $M$ $(p \times p)$ can be written $M = X^T X$, where $X$ $(m \times p)$ is a data matrix from $N_p(0, \Sigma)$, then $M$ is said to have a Wishart distribution with scale matrix $\Sigma$ and degrees of freedom $m$. We write

    M \sim W_p(\Sigma, m)   (3.12)

When $\Sigma = I_p$ the distribution is said to be in standard form. Note: the Wishart distribution is the multivariate generalization of the chi-square distribution.

Additive property of matrices with a Wishart distribution: let $M_1, M_2$ be independent matrices having the Wishart distributions

    M_1 \sim W_p(\Sigma, m_1), \qquad M_2 \sim W_p(\Sigma, m_2)

Then

    M_1 + M_2 \sim W_p(\Sigma, m_1 + m_2)

This property follows from the definition of the Wishart distribution, because data matrices are additive in the sense that if

    X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}

is a combined data matrix consisting of $m_1 + m_2$ rows, then $X^T X = X_1^T X_1 + X_2^T X_2$ is the matrix (known as the "Gram matrix") formed from the combined data matrix $X$.

Case of $p = 1$: when $p = 1$ we know, from the definition of $\chi^2_r$ as the distribution of the sum of squares of $r$ independent $N(0,1)$ variates, that

    M = \sum_{i=1}^m x_i^2 \sim \sigma^2\chi^2_m, \quad \text{so that} \quad W_1(\sigma^2, m) = \sigma^2\chi^2_m

Sampling distributions: Let $x_1, x_2, \ldots, x_n$ be a random sample of size $n$ from $N_p(\mu, \Sigma)$. Then:

1. The sample mean $\bar{x}$ has the normal distribution $\bar{x} \sim N_p(\mu, \frac{1}{n}\Sigma)$.
2. The sample covariance matrix (MLE) $S = \frac{1}{n}C^T C$ has the Wishart distribution $nS \sim W_p(\Sigma, n-1)$.
3. The distributions of $\bar{x}$ and $S$ are independent.

3.3 Estimators for special circumstances

3.3.1 $\mu$ proportional to a given vector

Sometimes $\mu$ is known to be proportional to a given vector, so $\mu = k\mu_0$. For example, if $x$ represents a sample of repeated measurements then $\mu = k\mathbf{1}$, where $\mathbf{1} = (1, 1, \ldots, 1)^T$ is the $p$-vector of 1's. We find the MLE of $k$ for this situation. Suppose $\Sigma$ is known and $\mu = k\mu_0$; the log likelihood is

    l = -2\log L = n\big[ \log|\Sigma| + \operatorname{tr}(\Sigma^{-1}S) + (\bar{x} - k\mu_0)^T \Sigma^{-1} (\bar{x} - k\mu_0) \big]

We set $\partial l/\partial k = 0$ to minimize $l$ w.r.t. $k$. Expanding,

    (\bar{x} - k\mu_0)^T \Sigma^{-1} (\bar{x} - k\mu_0) = \bar{x}^T \Sigma^{-1}\bar{x} - 2k\mu_0^T \Sigma^{-1}\bar{x} + k^2\mu_0^T \Sigma^{-1}\mu_0

so

    -2\mu_0^T \Sigma^{-1}\bar{x} + 2k\mu_0^T \Sigma^{-1}\mu_0 = 0

from which

    \hat{k} = \frac{\mu_0^T \Sigma^{-1}\bar{x}}{\mu_0^T \Sigma^{-1}\mu_0}   (3.13)

We may show that $\hat{k}$ is an unbiased estimator of $k$ and determine its variance. In (3.13), $\hat{k}$ takes the form $c^T \bar{x}$ with $c^T = \mu_0^T \Sigma^{-1}/a$ and $a = \mu_0^T \Sigma^{-1}\mu_0$, so

    E[\hat{k}] = c^T E[\bar{x}] = k\, c^T \mu_0 = k\,\frac{\mu_0^T \Sigma^{-1}\mu_0}{a} = k

Hence

    E[\hat{k}] = k   (3.14)

showing that $\hat{k}$ is an unbiased estimator. Noting that $V[\bar{x}] = \frac{1}{n}\Sigma$, and therefore that $V[c^T\bar{x}] = \frac{1}{n}c^T\Sigma c$, we have

    V[\hat{k}] = \frac{1}{n}c^T\Sigma c = \frac{\mu_0^T \Sigma^{-1}\Sigma\Sigma^{-1}\mu_0}{n\,a^2} = \frac{1}{n\,\mu_0^T \Sigma^{-1}\mu_0}   (3.15)

3.3.2 Linear restriction on $\mu$

We determine an estimator for $\mu$ that satisfies a linear restriction

    A\mu = b

where $A$ is $(m \times p)$ and $b$ is $(m \times 1)$. Introduce a vector $\lambda$ of $m$ Lagrange multipliers and seek to minimize

    l + \lambda^T(A\mu - b) = n(\bar{x} - \mu)^T \Sigma^{-1} (\bar{x} - \mu) + \lambda^T(A\mu - b) + \text{terms not involving } \mu

Differentiating w.r.t. $\mu$:

    -2n\Sigma^{-1}(\bar{x} - \mu) + A^T\lambda = 0
    \mu = \bar{x} - \frac{1}{2n}\Sigma A^T\lambda   (3.16)

We use the constraint $A\mu = b$ to evaluate the Lagrange multipliers $\lambda$. Premultiplying by $A$:

    A\bar{x} - b = \frac{1}{2n}A\Sigma A^T\lambda, \qquad \lambda = 2n\,(A\Sigma A^T)^{-1}(A\bar{x} - b)

Substituting into (3.16):

    \hat{\mu} = \bar{x} - \Sigma A^T(A\Sigma A^T)^{-1}(A\bar{x} - b)   (3.17)

3.3.3 Covariance matrix proportional to a given matrix

We consider estimating $k$ when $\Sigma = k\Sigma_0$, where $\Sigma_0$ is given. The likelihood (3.8b) takes the form

    l = n\big[ \log|k\Sigma_0| + \operatorname{tr}\big( (k\Sigma_0)^{-1}S \big) \big] = n\Big[ p\log k + \frac{1}{k}\operatorname{tr}(\Sigma_0^{-1}S) \Big] + \text{terms not involving } k

Hence

    \frac{dl}{dk} = n\Big[ \frac{p}{k} - \frac{1}{k^2}\operatorname{tr}(\Sigma_0^{-1}S) \Big] = 0

    \hat{k} = \frac{\operatorname{tr}(\Sigma_0^{-1}S)}{p}   (3.18)

4. Hypothesis testing (Hotelling's $T^2$ statistic)

Consider the test of hypothesis

    H_0: \mu = \mu_0 \quad \text{against} \quad H_A: \mu \ne \mu_0 \qquad (\Sigma \text{ unknown})

4.1 The Union-Intersection Principle

We accept the hypothesis $H_0$ as valid if and only if

    H_0(a): a^T\mu = a^T\mu_0

is accepted for all $a$. [$H_0$ is in some sense the intersection of all such hypotheses; the rejection region is the corresponding union, whence the name.] For fixed $a$ we set $y = a^T x$, so that in the population, under $H_0$,

    E(y) = a^T\mu_0, \qquad V(y) = a^T\Sigma a

and in our sample

    \bar{y} = a^T\bar{x}, \qquad \text{s.e.}(\bar{y}) = \sqrt{\frac{a^T S_u a}{n}} = \sqrt{\frac{a^T S a}{n-1}}

The univariate $t$-statistic for testing $H_0(a)$ against the alternative $E(y) \ne a^T\mu_0$ is

    t(a) = \frac{\bar{y} - a^T\mu_0}{\text{s.e.}(\bar{y})} = \frac{\sqrt{n-1}\,a^T(\bar{x} - \mu_0)}{\sqrt{a^T S a}}

The acceptance region for $H_0(a)$ takes the form $t^2(a) \le R^2$ for some $R$. The multivariate acceptance region is the intersection

    \bigcap_a \{ t^2(a) \le R^2 \}   (4.1)

which holds if and only if $\max_a t^2(a) \le R^2$. Therefore we adopt $\max_a t^2(a)$ as the test statistic for $H_0$. Equivalently:

    \text{maximize } (n-1)\,a^T(\bar{x} - \mu_0)(\bar{x} - \mu_0)^T a \quad \text{subject to} \quad a^T S a = 1   (4.2)

Writing $d = \bar{x} - \mu_0$, we introduce a Lagrange multiplier $\lambda$ and seek to determine $\lambda$ and $a$ to satisfy

    \frac{d}{da}\Big[ a^T dd^T a - \lambda\big( a^T S a - 1 \big) \Big] = 0

    2dd^T a - 2\lambda Sa = 0   (4.3a)
    \big( S^{-1}dd^T - \lambda I \big)a = 0   (4.3b)
    \big| S^{-1}dd^T - \lambda I \big| = 0   (4.3c)

(4.3b) can be written $Ma = \lambda a$, showing that $a$ is an eigenvector of $M = S^{-1}dd^T$; (4.3c) is the determinantal equation satisfied by the eigenvalues of $S^{-1}dd^T$. Premultiplying (4.3a) by $a^T$ gives

    a^T dd^T a - \lambda\,a^T S a = 0, \qquad \lambda = \frac{a^T dd^T a}{a^T S a} = \frac{t^2(a)}{n-1}

Therefore, in order to maximize $t^2(a)$, we choose $\lambda$ to be the largest eigenvalue of $S^{-1}dd^T$. This is a rank-1 matrix with the single non-zero eigenvalue

    \lambda = \operatorname{tr}(S^{-1}dd^T) = d^T S^{-1} d

and the maximum of (4.2) is known as Hotelling's $T^2$ statistic:

    T^2 = (n-1)(\bar{x} - \mu_0)^T S^{-1} (\bar{x} - \mu_0)   (4.4)

which is $(n-1)$ times the sample Mahalanobis distance between $\bar{x}$ and $\mu_0$.

4.2 Distribution of $T^2$

Under $H_0$ it can be shown that

    \frac{n-p}{(n-1)p}\,T^2 \sim F_{p,n-p}   (4.5)

where $F_{p,n-p}$ is the $F$ distribution on $p$ and $n-p$ degrees of freedom. Note that, depending on the covariance matrix used, $T^2$ has slightly different forms:

    T^2 = (n-1)(\bar{x} - \mu_0)^T S^{-1} (\bar{x} - \mu_0) = n(\bar{x} - \mu_0)^T S_U^{-1} (\bar{x} - \mu_0)

where $S_U$ is the unbiased estimator of $\Sigma$ (with divisor $n-1$).

Example 1: In an investigation of adult intelligence, scores were obtained on two tests, "verbal" and "performance", for 101 subjects aged 60 to 64. Doppelt and Wallace (1955) reported the following mean scores and covariance matrix:

    \bar{x} = \begin{pmatrix} 55.24 \\ 34.97 \end{pmatrix}, \qquad S_U = \begin{pmatrix} 210.54 & 126.99 \\ 126.99 & 119.68 \end{pmatrix}

At the 0.01 (1%) level, test the hypothesis that $\mu = \mu_0 = (60, 50)^T$.

We first compute

    S_U^{-1} = \begin{pmatrix} 0.0132 & -0.0140 \\ -0.0140 & 0.0232 \end{pmatrix}, \qquad d = \bar{x} - \mu_0 = \begin{pmatrix} -4.76 \\ -15.03 \end{pmatrix}

The $T^2$ statistic is then

    T^2 = n\,d^T S_U^{-1} d = 101\big[ (4.76)^2(0.0132) - 2(4.76)(15.03)(0.0140) + (15.03)^2(0.0232) \big] = 357.4

This gives

    F = \frac{99}{100 \times 2} \times 357.4 = 176.9

The nearest tabulated 1% value corresponds to $F_{2,60}$ and is 4.98. Therefore we conclude that the null hypothesis should be rejected. The sample probably arose from a population with a much lower mean vector, rather closer to the sample mean.

Example 2: The changes in level of free fatty acid (FFA) were measured on 15 hypnotised subjects who had been asked to experience fear, depression and anger effects while under hypnosis. The mean FFA changes were

    \bar{x}_1 = 2.699 \text{ (fear)}, \qquad \bar{x}_2 = 2.178 \text{ (depression)}, \qquad \bar{x}_3 = 2.558 \text{ (anger)}

Given that the covariance matrix of the stress differences $y_{i1} = x_{i1} - x_{i2}$ and $y_{i2} = x_{i1} - x_{i3}$ is (leading digits lost in transcription)

    S_U = \begin{pmatrix} \cdot 7343 & \cdot 666 \\ \cdot 666 & \cdot 7733 \end{pmatrix}, \qquad S_U^{-1} \approx \begin{pmatrix} 0.804 & -0.338 \\ -0.338 & \cdot \end{pmatrix}

test at the 0.05 level of significance whether each effect produced the same change in FFA.

[$T^2 = 2.68$ and $F = 1.24$ with degrees of freedom 2 and 13. Do not reject the hypothesis of "no emotion effect" at the 0.05 level.]
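Example 1 can be reproduced numerically; a sketch using scipy for the $F$ quantile (summary values as reported above):

```python
import numpy as np
from scipy.stats import f

n, p = 101, 2
xbar = np.array([55.24, 34.97])
S_u = np.array([[210.54, 126.99],
                [126.99, 119.68]])
mu0 = np.array([60.0, 50.0])

d = xbar - mu0
T2 = n * d @ np.linalg.solve(S_u, d)     # T^2 = n d' S_U^{-1} d
F = T2 * (n - p) / ((n - 1) * p)         # ~ F_{p, n-p} under H0
crit = f.ppf(0.99, p, n - p)             # 1% critical value

print(round(T2, 1), round(F, 1))         # ~357.2, ~176.8 (notes: 357.4, 176.9)
print(round(crit, 2), F > crit)          # ~4.82, True -> reject H0
```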

4.3 Invariance of $T^2$

$T^2$ is unaffected by changes in the scale or origin of the (response) variables. Consider

    y = Cx + d

where $C$ is $(p \times p)$ and non-singular. The null hypothesis $H_0: \mu_x = \mu_0$ is equivalent to $H_0: \mu_y = C\mu_0 + d$. Under the linear transformation we have

    \bar{y} = C\bar{x} + d, \qquad S_y = CS_xC^T

so that

    T_y^2 = (n-1)(\bar{y} - \mu_{0y})^T S_y^{-1} (\bar{y} - \mu_{0y})
          = (n-1)(\bar{x} - \mu_0)^T C^T (CS_xC^T)^{-1} C (\bar{x} - \mu_0)
          = (n-1)(\bar{x} - \mu_0)^T C^T C^{-T} S_x^{-1} C^{-1} C (\bar{x} - \mu_0)
          = (n-1)(\bar{x} - \mu_0)^T S_x^{-1} (\bar{x} - \mu_0) = T_x^2

which demonstrates the invariance.

4.4 Confidence interval for a mean

A confidence region for $\mu$ can be obtained from the distribution of

    T^2 = (n-1)(\bar{x} - \mu)^T S^{-1} (\bar{x} - \mu) \sim \frac{(n-1)p}{n-p}F_{p,n-p}   (4.6)

by substituting the data values $\bar{x}$ and $S$. In Example 1 above we have $\bar{x} = (55.24, 34.97)^T$,

    100\,S_U^{-1} = \begin{pmatrix} 1.32 & -1.40 \\ -1.40 & 2.32 \end{pmatrix}

and $F_{2,99}(0.01)$ is approximately 4.83 (by interpolation). Hence the 99% confidence region is

    1.01\big[ 1.32(\mu_1 - 55.24)^2 - 2.80(\mu_1 - 55.24)(\mu_2 - 34.97) + 2.32(\mu_2 - 34.97)^2 \big] \le \frac{100 \times 2}{99} \times 4.83 = 9.76

This is an ellipse in $p = 2$ dimensional space (and can be plotted). In higher dimensions an ellipsoidal confidence region is obtained.

4.5 Likelihood ratio test

Given a data matrix $X$ of observations on a random vector $x$ whose distribution depends on a vector of parameters $\theta$, the likelihood ratio for testing the null hypothesis

    H_0: \theta \in \Theta_0 \quad \text{against} \quad H_1: \theta \in \Theta_1

is defined as

    \lambda = \frac{\sup_{\theta \in \Theta_0} L}{\sup_{\theta \in \Theta_1} L}

where $L = L(\theta; X)$ is the likelihood function. In a likelihood ratio test (LRT) we reject $H_0$ for low values of $\lambda$, i.e. if $\lambda < c$, where $c$ is chosen so that the probability of Type I error is $\alpha$. If we define $l_0 = -2\log L_0$, where $L_0$ is the value of the numerator, and similarly $l_1 = -2\log L_1$, the rejection criterion takes the form

    -2\log\lambda = -2\log\frac{L_0}{L_1}   (4.7)
                  = l_0 - l_1 > k   (4.8)

Result: When $H_0$ is true, and for $n$ large, the log likelihood ratio (4.8) has the $\chi^2$-distribution on $r$ degrees of freedom, where $r$ equals the number of free parameters under $H_1$ minus the number of free parameters under $H_0$.

4.6 LRT for a mean when $\Sigma$ is known

$H_0: \mu = \mu_0$, a given value, when $\Sigma$ is known. Given a random sample from $N(\mu, \Sigma)$ resulting in $\bar{x}$ and $S$, the likelihood given in (3.8b) is (to within an additive constant)

    l(\mu, \Sigma) = -2\log L = n\big[ \log|\Sigma| + \operatorname{tr}(\Sigma^{-1}S) + (\bar{x} - \mu)^T \Sigma^{-1} (\bar{x} - \mu) \big]   (4.9)

Under $H_0$ the value $\mu = \mu_0$ is known and

    l_0 = l(\mu_0, \Sigma) = n\big[ \log|\Sigma| + \operatorname{tr}(\Sigma^{-1}S) + (\bar{x} - \mu_0)^T \Sigma^{-1} (\bar{x} - \mu_0) \big]

Under $H_1$, with no restriction on $\mu$, the MLE of $\mu$ is $\hat{\mu} = \bar{x}$. Thus

    l_1 = n\big[ \log|\Sigma| + \operatorname{tr}(\Sigma^{-1}S) \big]

Therefore

    -2\log\lambda = l_0 - l_1 = n(\bar{x} - \mu_0)^T \Sigma^{-1} (\bar{x} - \mu_0)   (4.10)

which is $n$ times the Mahalanobis distance of $\bar{x}$ from $\mu_0$. Note the similarity with Hotelling's $T^2$ statistic. Given that the distribution of $\bar{x}$ under $H_0$ is $\bar{x} \sim N_p(\mu_0, \frac{1}{n}\Sigma)$, (4.10) may be written, using the transformation $y = \sqrt{n}\,\Sigma^{-1/2}(\bar{x} - \mu_0)$ to a standard set of independent $N(0, 1)$ variates, as

    -2\log\lambda = y^T y = \sum_{i=1}^p y_i^2   (4.11)

and we have the exact distribution

    -2\log\lambda \sim \chi^2_p   (4.12)

showing that in this case the asymptotic distribution of $-2\log\lambda$ is exact, even in the small-sample case.

Example: Measurements of the length of skull were made on a sample of first and second sons from 25 families:

    \bar{x} = \begin{pmatrix} 185.72 \\ 183.84 \end{pmatrix}, \qquad S = \begin{pmatrix} 91.48 & 66.88 \\ 66.88 & 96.78 \end{pmatrix}

Assuming that in fact

    \Sigma = \begin{pmatrix} 100 & 0 \\ 0 & 100 \end{pmatrix}

test at the 0.05 level the hypothesis

    H_0: \mu = \begin{pmatrix} 182 \\ 182 \end{pmatrix}

Solution:

    -2\log\lambda = 25\,(3.72,\ 1.84)\,\frac{1}{100}I_2 \begin{pmatrix} 3.72 \\ 1.84 \end{pmatrix} = 0.25\big( 3.72^2 + 1.84^2 \big) = 4.31

Since $\chi^2_2(0.05) = 5.99$, do not reject $H_0$.

4.7 LRT for mean when $\Sigma$ is unknown

Consider the test of hypothesis $H_0: \mu = \mu_0$ against $H_1: \mu \ne \mu_0$ when $\Sigma$ is unknown. In this case $\Sigma$ must be estimated under $H_0$ and also under $H_1$. Under $H_0$, writing $d_0$ for $\bar{x} - \mu_0$:

    l(\mu_0, \Sigma) = n\big[ \log|\Sigma| + \operatorname{tr}(\Sigma^{-1}S) + (\bar{x} - \mu_0)^T \Sigma^{-1} (\bar{x} - \mu_0) \big]   (4.13a)
    = n\big[ \log|\Sigma| + \operatorname{tr}(\Sigma^{-1}S) + d_0^T \Sigma^{-1} d_0 \big]   (4.13b)
    = n\big[ \log|\Sigma| + \operatorname{tr}(\Sigma^{-1}S) + \operatorname{tr}(\Sigma^{-1}d_0 d_0^T) \big]   (4.13c)
    = n\big[ \log|\Sigma| + \operatorname{tr}\big( \Sigma^{-1}(S + d_0 d_0^T) \big) \big]   (4.13d)

Under $H_1$:

    l(\hat{\mu}, \Sigma) = n\big[ \log|\Sigma| + \operatorname{tr}(\Sigma^{-1}S) + (\bar{x} - \hat{\mu})^T \Sigma^{-1} (\bar{x} - \hat{\mu}) \big] = l(\bar{x}, \Sigma)   (4.14a)

and, substituting the MLEs $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$ obtained previously,

    l_1 = n\big[ \log|S| + \operatorname{tr}(S^{-1}S) \big] = n\log|S| + np   (4.14b)

Comparing (4.13d) with (4.14a), we see that the MLE of $\Sigma$ under $H_0$ must be $\hat{\Sigma}_0 = S + d_0 d_0^T$, and that the corresponding value of $l = -2\log L$ is

    l_0 = n\log|S + d_0 d_0^T| + np

Hence

    -2\log\lambda = l_0 - l_1 = n\log|S + d_0 d_0^T| - n\log|S|
                  = n\log\big| S^{-1}(S + d_0 d_0^T) \big| = n\log\big| I_p + S^{-1}d_0 d_0^T \big|
                  = n\log\big( 1 + d_0^T S^{-1} d_0 \big)   (4.15)

making use of the useful matrix result proved in Section 1.8.3, that $|I_p + uv^T| = 1 + v^T u$.

Since

    -2\log\lambda = n\log\Big( 1 + \frac{T^2}{n-1} \Big)   (4.16)

we see that $\lambda$ and $T^2$ are monotonically related. Therefore we can conclude that the LRT of $H_0: \mu = \mu_0$ when $\Sigma$ is unknown is equivalent to the use of Hotelling's $T^2$ statistic.

4.8 LRT for $\Sigma = \Sigma_0$ with $\mu$ unknown

$H_0: \Sigma = \Sigma_0$ when $\mu$ is unknown; $H_1: \Sigma \ne \Sigma_0$. Under $H_0$ we substitute $\hat{\mu} = \bar{x}$ into

    l(\hat{\mu}, \Sigma_0) = n\big[ \log|\Sigma_0| + \operatorname{tr}(\Sigma_0^{-1}S) + (\bar{x} - \hat{\mu})^T \Sigma_0^{-1} (\bar{x} - \hat{\mu}) \big]

giving

    l_0 = n\big[ \log|\Sigma_0| + \operatorname{tr}(\Sigma_0^{-1}S) \big]   (4.17)

Under $H_1$ we substitute the unrestricted MLEs $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$, giving, as in (4.14b),

    l_1 = n\log|S| + np   (4.18)

    -2\log\lambda = l_0 - l_1 = n\big[ \log|\Sigma_0| + \operatorname{tr}(\Sigma_0^{-1}S) - \log|S| - p \big]
                  = n\big[ \operatorname{tr}(\Sigma_0^{-1}S) - \log|\Sigma_0^{-1}S| - p \big]   (4.19)

This statistic depends only on the eigenvalues of the positive definite matrix $\Sigma_0^{-1}S$ and has the property that $-2\log\lambda \to 0$ as $S$ approaches $\Sigma_0$. Let $A$ be the arithmetic mean and $G$ the geometric mean of the eigenvalues of $\Sigma_0^{-1}S$; then

    \operatorname{tr}(\Sigma_0^{-1}S) = pA, \qquad |\Sigma_0^{-1}S| = G^p

    -2\log\lambda = np\{ A - \log G - 1 \}   (4.20)

The general result for the distribution of (4.8) for large $n$ gives

    l_0 - l_1 \sim \chi^2_r   (4.21)

where $r = p(p+1)/2$ is the number of independent parameters in $\Sigma$.

4.10 Test for sphericity

A covariance matrix is said to have the property of "sphericity" if

    \Sigma = kI_p   (4.22)

for some $k$. We see that this is a special case of the more general situation $\Sigma = k\Sigma_0$ treated in Section 3.3.3, and the same procedure can be applied. The general likelihood expression for a sample from the MVN distribution is

    -2\log L = n\big[ \log|\Sigma| + \operatorname{tr}\big( \Sigma^{-1}(S + dd^T) \big) \big]

Under $H_0$, $\Sigma = kI_p$ and $\hat{\mu} = \bar{x}$ (so $d = 0$), and

    -2\log L = n\big[ \log|kI_p| + \operatorname{tr}(k^{-1}S) \big] = n\Big[ p\log k + \frac{1}{k}\operatorname{tr}(S) \Big]   (4.23)

Setting $\partial[-2\log L]/\partial k = 0$ at a minimum:

    \frac{p}{k} - \frac{1}{k^2}\operatorname{tr}(S) = 0, \qquad \hat{k} = \frac{\operatorname{tr}(S)}{p}   (4.24)

which is in fact the arithmetic mean $A$ of the eigenvalues of $S$. Substituting back into (4.23) gives

    l_0 = np(\log A + 1)

Under $H_1$, $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$, so

    l_1 = n\log|S| + np = np(\log G + 1)

thus

    -2\log\lambda = l_0 - l_1 = np\log\frac{A}{G}   (4.25)

The number of free parameters contained in $\Sigma$ is 1 under $H_0$ and $p(p+1)/2$ under $H_1$. Hence the appropriate distribution for comparing $-2\log\lambda$ is $\chi^2_r$, where

    r = \frac{p(p+1)}{2} - 1 = \frac{(p-1)(p+2)}{2}   (4.26)

4.11 Test for independence

Independence of the variables $x_1, \ldots, x_p$ is manifest by a diagonal covariance matrix

    \Sigma = \operatorname{diag}(\sigma_{11}, \ldots, \sigma_{pp})   (4.27)

We consider $H_0$: $\Sigma$ is diagonal, against the general alternative $H_1$: $\Sigma$ is unrestricted. Under $H_0$ it is clear in fact that we will find $\hat{\sigma}_{ii} = s_{ii}$, because the estimators of $\sigma_{ii}$ for each $x_i$ are independent. We can also show this formally. With $\hat{\mu} = \bar{x}$,

    -2\log L = n\big[ \log|\Sigma| + \operatorname{tr}(\Sigma^{-1}S) \big] = n\Big[ \sum_{i=1}^p \log\sigma_{ii} + \sum_{i=1}^p \frac{s_{ii}}{\sigma_{ii}} \Big]

Setting $\partial(-2\log L)/\partial\sigma_{ii} = 0$:

    \frac{1}{\sigma_{ii}} - \frac{s_{ii}}{\sigma_{ii}^2} = 0, \qquad \hat{\sigma}_{ii} = s_{ii}

Therefore

    l_0 = n\Big[ \sum_{i=1}^p \log s_{ii} + p \Big] = n\big[ \log|D| + p \big]

where $D = \operatorname{diag}(s_{11}, \ldots, s_{pp})$. Under $H_1$, as before, we find

    l_1 = n\log|S| + np

Therefore

    -2\log\lambda = l_0 - l_1 = n\big[ \log|D| - \log|S| \big] = -n\log|D^{-1}S| = -n\log|D^{-1/2}SD^{-1/2}| = -n\log|R|   (4.28)

where $R$ is the sample correlation matrix. The number of free parameters contained in $\Sigma$ is $p$ under $H_0$ and $p(p+1)/2$ under $H_1$. Hence the appropriate distribution for comparing $-2\log\lambda$ is $\chi^2_r$, where

    r = \frac{p(p+1)}{2} - p = \frac{p(p-1)}{2}   (4.29)

4.12 Simultaneous confidence intervals (Scheffé, Roy & Bose)

The union-intersection method used to derive Hotelling's $T^2$ statistic provides "simultaneous confidence intervals" for the parameters $\mu$ when $\Sigma$ is unknown. Following Section 4.1, let

    T^2 = (n-1)(\bar{x} - \mu)^T S^{-1} (\bar{x} - \mu)   (4.30)

where $\mu$ is the unknown (true) mean, and let $t(a)$ be the univariate $t$-statistic corresponding to the linear compound $y = a^T x$. Then

    \max_a t^2(a) = T^2, \quad \text{and for all } p\text{-vectors } a, \quad t^2(a) \le T^2   (4.31)

where

    t(a) = \frac{\bar{y} - \mu_y}{s_y/\sqrt{n-1}} = \frac{\sqrt{n-1}\,a^T(\bar{x} - \mu)}{\sqrt{a^T S a}}   (4.32)

From Section 4.2 the distribution of $T^2$ is given by $\frac{n-p}{(n-1)p}T^2 \sim F_{p,n-p}$, so

    \Pr\Big[ T^2 \le \frac{(n-1)p}{n-p}F_{p,n-p}(\alpha) \Big] = 1 - \alpha

and therefore, from (4.31), for all $p$-vectors $a$,

    \Pr\Big[ t^2(a) \le \frac{(n-1)p}{n-p}F_{p,n-p}(\alpha) \Big] \ge 1 - \alpha   (4.33)

Substituting from (4.32), the confidence statement in (4.33) is: with probability $\ge 1 - \alpha$, for all $p$-vectors $a$,

    |a^T\bar{x} - a^T\mu| \le K\sqrt{\frac{a^T S a}{n-1}}   (4.34)

where $K$ is the constant given by

    K^2 = \frac{(n-1)p}{n-p}F_{p,n-p}(\alpha)   (4.35)

A $100(1-\alpha)\%$ confidence interval for the linear compound $a^T\mu$ is therefore

    a^T\bar{x} \pm K\sqrt{\frac{a^T S a}{n-1}}   (4.36)

How can we apply this result? We might be interested in a defined set of linear combinations (linear compounds) of $\mu$. The $i$th component $\mu_i$ of $\mu$ is, for example, the linear compound defined by $a^T = (0, \ldots, 1, \ldots, 0)$, the unit vector with a single 1 in the $i$th position. Over a large number of such sets of CIs we would expect $100(1-\alpha)\%$ to contain no mis-statements, while $100\alpha\%$ would contain at least one mis-statement.

We can relate the $T^2$ confidence intervals to the $T^2$ test of $H_0: \mu = \mu_0$: if this $H_0$ is rejected at significance level $\alpha$, then there exists at least one vector $a$ such that the interval (4.36) does not include the value $a^T\mu_0$.

NB. If the covariance matrix $S_u$ (with denominator $n-1$) is supplied, then in (4.36) $\sqrt{a^T S a/(n-1)}$ may be replaced by $\sqrt{a^T S_u a/n}$.

4.13 The Bonferroni method

This provides another way to construct simultaneous CIs for a small number of linear compounds of $\mu$ whilst controlling the overall level of confidence. Consider a set of events $A_1, A_2, \ldots, A_m$:

    \Pr(A_1 \cap \cdots \cap A_m) = 1 - \Pr(\bar{A}_1 \cup \cdots \cup \bar{A}_m)

From the additive law of probabilities,

    \Pr(\bar{A}_1 \cup \cdots \cup \bar{A}_m) \le \sum_{i=1}^m \Pr(\bar{A}_i)

Therefore

    \Pr(A_1 \cap \cdots \cap A_m) \ge 1 - \sum_{i=1}^m \Pr(\bar{A}_i)   (4.37)

Let $C_k$ denote a confidence statement about the value of some linear compound $a_k^T\mu$, with $\Pr(C_k \text{ true}) = 1 - \alpha_k$. Then

    \Pr(\text{all } C_k \text{ true}) \ge 1 - (\alpha_1 + \cdots + \alpha_m)   (4.38)

Therefore we can control the overall error rate, given by $\alpha = \alpha_1 + \cdots + \alpha_m$, say. For example, in order to construct simultaneous $100(1-\alpha)\%$ CIs for all $p$ components $\mu_k$ of $\mu$, we could choose $\alpha_k = \alpha/p$ $(k = 1, \ldots, p)$, leading to

    \bar{x}_1 \pm t_{n-1}\Big(\frac{\alpha}{2p}\Big)\sqrt{\frac{s_{11}}{n}}, \quad \ldots, \quad \bar{x}_p \pm t_{n-1}\Big(\frac{\alpha}{2p}\Big)\sqrt{\frac{s_{pp}}{n}}

if $s_{ii}$ derives from $S_u$.

Example: Intelligence scores data on $n = 101$ subjects:

    \bar{x} = \begin{pmatrix} 55.24 \\ 34.97 \end{pmatrix}, \qquad S_U = \begin{pmatrix} 210.54 & 126.99 \\ 126.99 & 119.68 \end{pmatrix}

1. Construct 99% simultaneous confidence intervals for $\mu_1$, $\mu_2$ and $\mu_1 - \mu_2$.

For $\mu_1$ take $a^T = (1, 0)$, and take $\alpha = 0.01$:

    a^T\bar{x} = 55.24, \qquad a^T S_u a = 210.54

    K^2 = \frac{(n-1)p}{n-p}F_{p,n-p}(\alpha) = \frac{100 \times 2}{99}F_{2,99}(0.01) = 9.76, \qquad K = 3.12

taking $F_{2,99}(0.01) = 4.83$ (approx.). Therefore the CI for $\mu_1$ is

    55.24 \pm 3.12\sqrt{\frac{210.54}{101}} = 55.24 \pm 4.50

giving the interval $(50.7, 59.7)$.

For $\mu_2$ we already have $K$; take $a^T = (0, 1)$, then $a^T\bar{x} = 34.97$ and $a^T S_u a = 119.68$. The CI for $\mu_2$ is

    34.97 \pm 3.12\sqrt{\frac{119.68}{101}} = 34.97 \pm 3.40

giving the interval $(31.6, 38.4)$.

For $\mu_1 - \mu_2$ take $a^T = (1, -1)$:

    a^T\bar{x} = 55.24 - 34.97 = 20.27
    a^T S_u a = 210.54 - 2(126.99) + 119.68 = 76.24

The CI for $\mu_1 - \mu_2$ is

    20.27 \pm 3.12\sqrt{\frac{76.24}{101}} = 20.27 \pm 2.71

giving the interval $(17.6, 23.0)$.

2. Construct CIs for $\mu_1, \mu_2$ by the Bonferroni method. Use $\alpha = 0.01$. Individual CIs are constructed using $\alpha_k = 0.01/2 = 0.005$ $(k = 1, 2)$. Then

    t_{100}(\alpha_k/2) = t_{100}(0.0025) \simeq \Phi^{-1}(0.9975) = 2.81

The CI for $\mu_1$ is

    55.24 \pm 2.81\sqrt{\frac{210.54}{101}} = 55.24 \pm 4.06 \quad \Rightarrow \quad (51.2, 59.3)

and for $\mu_2$:

    34.97 \pm 2.81\sqrt{\frac{119.68}{101}} = 34.97 \pm 3.06 \quad \Rightarrow \quad (31.9, 38.0)

Comparing the CIs obtained by the two methods, we see that the simultaneous CIs for $\mu_1$ and $\mu_2$ are 8.7% wider than the corresponding Bonferroni CIs.

NB. If we had required 99% Bonferroni CIs for $\mu_1$, $\mu_2$ and $\mu_1 - \mu_2$, then $m = 3$ in (4.38) and $\alpha_k = 0.01/3 = 0.0033$. The corresponding percentage point of $t$ would be

    t_{100}(0.0017) \simeq \Phi^{-1}(0.9983) = 2.93

leading to slightly wider CIs than those obtained above.
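Both interval constructions are easy to script; a sketch for the same data, using the normal approximation to the $t$ quantile as in the notes:

```python
import numpy as np
from scipy.stats import f, norm

n, p, alpha = 101, 2, 0.01
xbar = np.array([55.24, 34.97])
S_u = np.array([[210.54, 126.99],
                [126.99, 119.68]])

# Simultaneous (Scheffe-type) intervals: a'xbar +/- K sqrt(a'S_u a / n)
K = np.sqrt((n - 1) * p / (n - p) * f.ppf(1 - alpha, p, n - p))
for a in (np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, -1.0])):
    half = K * np.sqrt(a @ S_u @ a / n)
    print(a, round(a @ xbar - half, 1), round(a @ xbar + half, 1))

# Bonferroni intervals for mu_1, mu_2: alpha_k = alpha/2, two-sided
z = norm.ppf(1 - alpha / (2 * p))        # ~2.81, as in the notes
for i in range(p):
    half = z * np.sqrt(S_u[i, i] / n)
    print(i + 1, round(xbar[i] - half, 1), round(xbar[i] + half, 1))
```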

4.14 Two sample procedures

Suppose we have two independent random samples $\{x_{11}, \ldots, x_{1n_1}\}$ and $\{x_{21}, \ldots, x_{2n_2}\}$ of sizes $n_1, n_2$ from two populations

    \Pi_1: x \sim N_p(\mu_1, \Sigma), \qquad \Pi_2: x \sim N_p(\mu_2, \Sigma)

giving rise to sample means $\bar{x}_1, \bar{x}_2$ and sample covariance matrices $S_1, S_2$. Note the assumption of a common covariance matrix $\Sigma$. We consider testing $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \ne \mu_2$. Let $d = \bar{x}_1 - \bar{x}_2$. Under $H_0$,

    d \sim N_p\Big( 0, \Big( \frac{1}{n_1} + \frac{1}{n_2} \Big)\Sigma \Big)

(a) Case of $\Sigma$ known. Analogously to the one-sample case, with $n = n_1 + n_2$,

    \sqrt{\frac{n_1 n_2}{n}}\,\Sigma^{-1/2}d \sim N(0, I_p), \qquad \frac{n_1 n_2}{n}\,d^T\Sigma^{-1}d \sim \chi^2_p

(b) Case of $\Sigma$ unknown. We have the Wishart-distributed quantities

    n_1 S_1 \sim W_p(\Sigma, n_1 - 1), \qquad n_2 S_2 \sim W_p(\Sigma, n_2 - 1)

Let

    S_p = \frac{n_1 S_1 + n_2 S_2}{n - 2}

be the pooled estimator of the covariance matrix $\Sigma$. Then, from the additive property of the Wishart distribution, $(n-2)S_p$ has the Wishart distribution $W_p(\Sigma, n-2)$, and

    \sqrt{\frac{n_1 n_2}{n}}\,d \sim N(0, \Sigma)

It may be shown that

    T^2 = \frac{n_1 n_2}{n}\,d^T S_p^{-1} d

has the distribution of a Hotelling's $T^2$ statistic. In fact

    \frac{n - p - 1}{(n-2)p}\,T^2 \sim F_{p,n-p-1}   (4.39)

4.15 Multi-sample procedures (MANOVA)

We consider the case of $k$ samples from populations $\Pi_1, \ldots, \Pi_k$; the sample from population $\Pi_i$ is of size $n_i$. By analogy with the univariate case we can decompose the total SSP matrix into orthogonal parts. This decomposition can be represented as a Multivariate Analysis of Variance (MANOVA) table. The MANOVA model is

    x_{ij} = \mu + \tau_i + e_{ij}, \qquad j = 1, \ldots, n_i, \quad i = 1, \ldots, k   (4.40)

where the $e_{ij}$ are independent $N_p(0, \Sigma)$ variables. Here the parameter vector $\mu$ is the overall (grand) mean and $\tau_i$ is the $i$th treatment effect, with

    \sum_{i=1}^k n_i\tau_i = 0   (4.41)

Define the $i$th sample mean as $\bar{x}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}$. The Between-Groups sum of squares and cross-products (SSP) matrix is

    B = \sum_{i=1}^k n_i(\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T   (4.42)

The grand mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^k n_i\bar{x}_i$, where $n = \sum_i n_i$, and the Total SSP matrix is

    T = \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij} - \bar{x})(x_{ij} - \bar{x})^T   (4.43)

It can be shown algebraically that $T = B + W$, where $W$ is the Within-Groups (or residual) SSP matrix given by

    W = \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^T   (4.44)

The MANOVA table is:

    Source of variation              SSP matrix                                                         d.f.
    Treatment                        B = \sum_i n_i(\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T         k - 1
    Residual                         W = \sum_i \sum_j (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^T       \sum_i n_i - k
    Total (corrected for the mean)   T = B + W = \sum_i \sum_j (x_{ij} - \bar{x})(x_{ij} - \bar{x})^T   \sum_i n_i - 1

We are interested in testing the hypothesis

    H_0: \mu_1 = \mu_2 = \cdots = \mu_k

i.e. whether the samples in fact come from the same population, against the general alternative that the $\mu_i$ are not all equal. We can derive a likelihood ratio test statistic known as Wilks' $\Lambda$. Under $H_0$ the MLEs are $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S = T/n$, leading to the maximized log likelihood (minimum of $-2\log L$)

    l_0 = np + n\log|S|   (4.45)

Under $H_1$ the MLEs are $\hat{\mu}_i = \bar{x}_i$ and $\hat{\Sigma} = \frac{1}{n}W$. This follows from

    \min_{\mu_i, \Sigma} \Big\{ n\log|\Sigma| + \sum_{i=1}^k n_i\operatorname{tr}\big( \Sigma^{-1}(S_i + d_i d_i^T) \big) \Big\} = \min_{\Sigma} \Big\{ n\log|\Sigma| + n\operatorname{tr}\Big( \Sigma^{-1}\frac{W}{n} \Big) \Big\}

since $\hat{d}_i = \bar{x}_i - \hat{\mu}_i = 0$ and $W = \sum_i n_i S_i$, where $S_i$ is the $i$th sample covariance matrix. Hence $\hat{\Sigma} = \frac{1}{n}W$ and

    l_1 = np + n\log\Big| \frac{1}{n}W \Big|   (4.46)

Therefore, since $T = nS$,

    -2\log\lambda = l_0 - l_1 = n\log\frac{|T|}{|W|} = -n\log\Lambda, \qquad \Lambda = \frac{|W|}{|T|}

where $\Lambda$ is known as Wilks' $\Lambda$ statistic. We reject $H_0$ for small values of $\Lambda$, or large values of $-n\log\Lambda$. Asymptotically the rejection region is the upper tail of a $\chi^2_{p(k-1)}$ distribution: under $H_0$ the unknown $\mu$ has $p$ parameters, and under $H_1$ the number of parameters for $\mu_1, \ldots, \mu_k$ is $pk$; hence the d.f. of the $\chi^2$ is $pk - p = p(k-1)$. Apart from this asymptotic result, other approximate distributions (notably Bartlett's approximation) are available, but the details are outside the scope of this course.

4.15.1 Calculation of Wilks' $\Lambda$

Result: Let $\lambda_1, \ldots, \lambda_p$ be the eigenvalues of $W^{-1}B$; then

    \Lambda = \prod_{j=1}^p (1 + \lambda_j)^{-1}   (4.47)

Proof:

    \Lambda^{-1} = \frac{|T|}{|W|} = \big| W^{-1}(W + B) \big| = \big| I + W^{-1}B \big| = \prod_{j=1}^p (1 + \lambda_j)   (4.48)

by the useful identity proved earlier in the notes.

4.15.2 Case k = 2

We show that the use of Wilks' $\Lambda$ for $k = 2$ groups is equivalent to using Hotelling's $T^2$ statistic. Specifically, we show that $\Lambda$ is a monotonic function of $T^2$; thus rejecting $H_0$ for $\Lambda < c_1$ is equivalent to rejecting $H_0$ for $T^2 > c_2$ (for some constants $c_1, c_2$).

Proof: For $k = 2$ we can show (Ex.) that

    B = \frac{n_1 n_2}{n}\,dd^T   (4.49)

where $d = \bar{x}_1 - \bar{x}_2$. Then

    \Lambda^{-1} = \big| I + W^{-1}B \big| = \Big| I + \frac{n_1 n_2}{n}W^{-1}dd^T \Big| = 1 + \frac{n_1 n_2}{n}\,d^T W^{-1} d

Now $W$ is just $(n-2)S_p$, where $S_p$ is the pooled estimator of $\Sigma$. Thus

    \Lambda^{-1} = 1 + \frac{T^2}{n-2}   (4.50)
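A sketch of computing Wilks' $\Lambda$ for $k$ groups, both as $|W|/|T|$ and via the eigenvalues of $W^{-1}B$ as in (4.47), using simulated illustrative data:

```python
import numpy as np

def wilks_lambda(groups):
    """groups: list of (n_i x p) arrays. Returns (Lambda, -n log Lambda)."""
    allx = np.vstack(groups)
    n = allx.shape[0]
    W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
    T = (allx - allx.mean(axis=0)).T @ (allx - allx.mean(axis=0))
    lam = np.linalg.det(W) / np.linalg.det(T)      # Lambda = |W| / |T|
    return lam, -n * np.log(lam)

rng = np.random.default_rng(3)
groups = [rng.normal(loc=m, size=(20, 2)) for m in (0.0, 0.3, 0.6)]
L, stat = wilks_lambda(groups)

# Check (4.47): Lambda = prod_j (1 + lambda_j)^(-1), lambda_j eigenvalues of W^{-1}B
W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
allx = np.vstack(groups)
B = (allx - allx.mean(axis=0)).T @ (allx - allx.mean(axis=0)) - W
eigs = np.linalg.eigvals(np.linalg.solve(W, B)).real
print(L, np.prod(1.0 / (1.0 + eigs)))    # the two values agree
print(stat)                              # compare with chi^2 on p(k-1) = 4 d.f.
```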

5. Discriminant Analysis (Classification)

Given $k$ populations (groups) $\Pi_1, \ldots, \Pi_k$, an individual from $\Pi_j$ has p.d.f. $f_j(x)$ for a set of $p$ measurements $x$. The purpose of discriminant analysis is to allocate an individual to one of the groups $\{\Pi_j\}$ on the basis of $x$, making as few "mistakes" as possible. For example, a patient presents at a doctor's surgery with a set of symptoms $x$. The symptoms suggest a number of possible disease groups $\{\Pi_j\}$ to which the patient might belong. What is the most likely diagnosis?

The aim initially is to find a partition of $\mathbb{R}^p$ into disjoint regions $R_1, \ldots, R_k$, together with the decision rule

    x \in R_j \Rightarrow \text{allocate } x \text{ to } \Pi_j

The decision rule will be more accurate if "$\Pi_j$ has most of its probability concentrated in $R_j$" for each $j$.

5.1 The maximum likelihood (ML) rule

Allocate $x$ to the population $\Pi_j$ that gives the largest likelihood to $x$: choose $j$ by

    L_j(x) = \max_{1 \le i \le k} L_i(x)

(break ties arbitrarily).

Result: If the $\{\Pi_i\}$ are the multivariate normal (MVN) populations $N_p(\mu_i, \Sigma)$ for $i = 1, \ldots, k$, the ML rule allocates $x$ to the population $\Pi_i$ that minimizes the Mahalanobis distance between $x$ and $\mu_i$.

Proof:

    L_i(x) = (2\pi)^{-p/2}|\Sigma|^{-1/2}\exp\Big\{ -\frac{1}{2}(x - \mu_i)^T\Sigma^{-1}(x - \mu_i) \Big\}

so the likelihood is maximized when the exponent is minimized.

Result: When $k = 2$ the ML rule allocates $x$ to $\Pi_1$ if

    d^T\Sigma^{-1}(x - \bar{\mu}) > 0   (5.1)

where $d = \mu_1 - \mu_2$ and $\bar{\mu} = \frac{1}{2}(\mu_1 + \mu_2)$, and to $\Pi_2$ otherwise.

Proof: For the two-group case, the ML rule is to allocate $x$ to $\Pi_1$ if

    (x - \mu_1)^T\Sigma^{-1}(x - \mu_1) < (x - \mu_2)^T\Sigma^{-1}(x - \mu_2)

which reduces to

    d^T\Sigma^{-1}x > \frac{1}{2}(\mu_1 - \mu_2)^T\Sigma^{-1}(\mu_1 + \mu_2) = d^T\Sigma^{-1}\cdot\frac{1}{2}(\mu_1 + \mu_2)

Hence the result. The function

    h(x) = (\mu_1 - \mu_2)^T\Sigma^{-1}\Big( x - \frac{1}{2}(\mu_1 + \mu_2) \Big)   (5.2)

is known as the discriminant function (DF). In this case the DF is linear in $x$.

5.2 Sample ML rule

In practice $\mu_1, \mu_2, \Sigma$ are estimated by, respectively, $\bar{x}_1, \bar{x}_2, S_P$, where $S_P$ is the pooled (unbiased) estimator of the covariance matrix.

Example: The eminent statistician R.A. Fisher took measurements on samples of size 50 of 3 types of iris. Two of the variables, $x_1$ = sepal length and $x_2$ = sepal width, gave the following data on species I and II (the data have been rounded for clarity; some digits of $S_P$ were lost in transcription):

    \bar{x}_1 = \begin{pmatrix} 5.0 \\ 3.4 \end{pmatrix}, \qquad \bar{x}_2 = \begin{pmatrix} 6.0 \\ 2.8 \end{pmatrix}, \qquad S_P = \frac{50S_1 + 50S_2}{98} \approx \begin{pmatrix} 0.19 & 0.09 \\ 0.09 & 0.12 \end{pmatrix}

Hence

    a = S_P^{-1}(\bar{x}_1 - \bar{x}_2) = S_P^{-1}\begin{pmatrix} -1.0 \\ 0.6 \end{pmatrix} \approx \begin{pmatrix} -11.4 \\ 14.2 \end{pmatrix}, \qquad \frac{1}{2}(\bar{x}_1 + \bar{x}_2) = \begin{pmatrix} 5.5 \\ 3.1 \end{pmatrix}

giving the rule: allocate $x$ to $\Pi_1$ (species I) if

    -11.4(x_1 - 5.5) + 14.2(x_2 - 3.1) > 0, \quad \text{i.e.} \quad -11.4x_1 + 14.2x_2 + 19.0 > 0

5.3 Misclassification probabilities

The misclassification probabilities $p_{ij}$, defined as

    p_{ij} = \Pr[\text{allocate to } \Pi_i \text{ when in fact from } \Pi_j]

form a $k \times k$ matrix, of which the diagonal elements $\{p_{ii}\}$ are a measure of the classifier's accuracy. For the case $k = 2$:

    p_{12} = \Pr[h(x) > 0 \mid \Pi_2]

Since $h(x) = d^T\Sigma^{-1}(x - \bar{\mu})$ is a linear compound of $x$, it has a (univariate) normal distribution. Given that $x \in \Pi_2$:

    E[h(x)] = d^T\Sigma^{-1}\Big( \mu_2 - \frac{1}{2}(\mu_1 + \mu_2) \Big) = -\frac{1}{2}d^T\Sigma^{-1}d = -\frac{1}{2}\Delta^2

where $\Delta^2 = (\mu_1 - \mu_2)^T\Sigma^{-1}(\mu_1 - \mu_2)$ is the squared Mahalanobis distance between $\mu_1$ and $\mu_2$. The variance of $h(x)$ is

    V[h(x)] = d^T\Sigma^{-1}\Sigma\Sigma^{-1}d = d^T\Sigma^{-1}d = \Delta^2

Hence

    p_{12} = \Pr[h(x) > 0] = \Pr\Big[ \frac{h(x) + \frac{1}{2}\Delta^2}{\Delta} > \frac{\Delta}{2} \Big] = \Pr\Big[ Z > \frac{\Delta}{2} \Big] = 1 - \Phi\Big( \frac{\Delta}{2} \Big)   (5.3)

By symmetry this is also $p_{21}$, i.e. $p_{12} = p_{21}$.

Example (contd.): We can estimate the misclassification probability from the sample Mahalanobis distance between $\bar{x}_1$ and $\bar{x}_2$:

    D^2 = (\bar{x}_1 - \bar{x}_2)^T S_P^{-1}(\bar{x}_1 - \bar{x}_2) = (-1.0,\ 0.6)\begin{pmatrix} -11.4 \\ 14.2 \end{pmatrix} \approx 19.9

    \hat{p}_{12} = 1 - \Phi\Big( \frac{D}{2} \Big) = 1 - \Phi(2.23) = 0.013

The misclassification rate is 1.3%.
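The two-group sample ML rule and its estimated error rate can be scripted as below, using the rounded and partly reconstructed summary values from the iris example (so the printed numbers differ slightly from the notes):

```python
import numpy as np
from scipy.stats import norm

xbar1 = np.array([5.0, 3.4])             # species I mean (sepal length, width)
xbar2 = np.array([6.0, 2.8])             # species II mean
S_p = np.array([[0.19, 0.09],            # pooled covariance matrix as
                [0.09, 0.12]])           # reconstructed above (approximate)

a = np.linalg.solve(S_p, xbar1 - xbar2)  # discriminant direction S_p^{-1} d
m = 0.5 * (xbar1 + xbar2)                # midpoint of the two means

def allocate(x):
    """Allocate x to species I if h(x) = a'(x - m) > 0, else species II."""
    return 1 if a @ (x - m) > 0 else 2

D2 = (xbar1 - xbar2) @ a                 # sample Mahalanobis distance D^2
p_mis = norm.cdf(-np.sqrt(D2) / 2)       # estimated misclassification probability

print(allocate(np.array([5.1, 3.5])))    # -> 1 (species I)
print(round(D2, 1), round(p_mis, 3))     # ~20.2, ~0.012 (notes: 19.9, 0.013)
```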