1. Introduction to multivariate data
- Harriet Whitehead
- 8 years ago
1.1 Books

Chatfield, C. and A.J. Collins, Introduction to multivariate analysis. Chapman & Hall.
Krzanowski, W.J., Principles of multivariate analysis. Oxford, 2000.
Johnson, R.A. and D.W. Wichern, Applied multivariate statistical analysis. Prentice Hall.

1.2 Applications

The need often arises in science, medicine and social science (business, management) to analyze data on $p$ variables (note that $p = 2$ means the data are bivariate). Suppose we have a simple random sample of size $n$. The sample consists of $n$ vectors of measurements on $p$ variates, i.e. $n$ $p$-vectors (by convention column vectors) $x_1, \dots, x_n$, which are inserted as rows $x_1^T, \dots, x_n^T$ into an $(n \times p)$ data matrix $X$. When $p = 2$ we can plot the rows in 2-dimensional space, but in higher dimensions, $p > 2$, other techniques are needed.

Example 1 Classification of plants (taxonomy)
Variables: ($p = 3$) leaf size ($x_1$), colour of flower ($x_2$), height of plant ($x_3$)
Sample items: $n = 4$ plants from a single species
Aims of analysis: i) understand within-species variability ii) classify a new plant species
The data matrix has one row per plant (item) and one column for each of the variables $x_1, x_2, x_3$.

Example 2 Credit scoring
Variables: personal data held by bank
Items: sample of good/bad customers
Aims of analysis: i) predict potential defaulters (CRM) ii) risk assessment for a new applicant
Example 3 Image processing, e.g. for quality control
Variables: "features" extracted from an image
Items: sampled from a production line
Aims of analysis: i) quantify "normal" variability ii) reject faulty (off-specification) batches

1.3 Sample mean and covariance matrix

We shall adopt the following notation:

$x$ $(p \times 1)$: a random vector of observations on $p$ variables
$X$ $(n \times p)$: a data matrix whose rows contain an independent random sample $x_1^T, \dots, x_n^T$ of observations on $x$
$\bar{x}$ $(p \times 1)$: the sample mean vector, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
$S$ $(p \times p)$: the sample covariance matrix, containing the sample covariances defined as
$$s_{jk} = \frac{1}{n}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$$
$R$ $(p \times p)$: the sample correlation matrix, containing the sample correlations defined as
$$r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj}\,s_{kk}}} = \frac{s_{jk}}{s_j s_k}, \text{ say}$$

Notes
1. $\bar{x}_j$ is defined as the $j$th component of $\bar{x}$ (the mean of variable $j$).
2. The covariance matrix $S$ is square, symmetric ($S = S^T$), and holds the sample variances $s_{jj} = s_j^2 = \frac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$ along its main diagonal.
3. The diagonal elements of $R$ are $r_{jj} = 1$, and $|r_{jk}| \le 1$ for each $j, k$.

1.4 Matrix-vector representations

Given an $(n \times p)$ data matrix $X$, define the $n \times 1$ vector of ones $\mathbf{1} = (1, 1, \dots, 1)^T$.
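The sums over the rows of $X$ (that is, the column totals) are obtained by pre-multiplying $X$ by $\mathbf{1}^T$:
$$\mathbf{1}^T X = \Big(\sum_i x_{i1}, \dots, \sum_i x_{ip}\Big) = (n\bar{x}_1, \dots, n\bar{x}_p) = n\bar{x}^T$$
Hence
$$\bar{x} = \frac{1}{n} X^T \mathbf{1} \qquad (1.1)$$

The centred data matrix $X'$ is derived from $X$ by subtracting the variable mean from each element of $X$, i.e. $x'_{ij} = x_{ij} - \bar{x}_j$, or, equivalently, by subtracting the constant vector $\bar{x}^T$ from each row of $X$:
$$X' = X - \mathbf{1}\bar{x}^T = X - \frac{1}{n}\mathbf{1}\mathbf{1}^T X = \Big(I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T\Big) X = HX \qquad (1.2)$$
where $H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is known as the centring matrix.

We now define the sample covariance matrix as $\frac{1}{n}$ times the centred sum of squares and products (SSP) matrix:
$$S = \frac{1}{n} X'^T X' \qquad (1.3a)$$
$$\phantom{S} = \frac{1}{n}\sum_{i=1}^{n} x'_i x_i'^T \qquad (1.3b)$$
where $x'_i = x_i - \bar{x}$ denotes the $i$th mean-corrected data point.

For any real $p$-vector $y$ we then have
$$y^T S y = \frac{1}{n} y^T X'^T X' y = \frac{1}{n} z^T z = \frac{1}{n}\|z\|^2 \ge 0, \quad \text{where } z = X'y.$$
Hence, from the definition of a p.s.d. matrix, we have

Proposition The sample covariance matrix $S$ is positive semi-definite (p.s.d.).

Example Two measurements $x_1, x_2$ made at the same position on each of 3 cans of food resulted in a $(3 \times 2)$ data matrix $X$. Find the sample mean vector $\bar{x}$ and covariance matrix $S$.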
4 Solution X 4 x n X X x i 3 i 3 0 S 3 X 0T X 0 5 [x ; x ; x 3 ] T :67 0:67 0:67 :67 Note also that S is built up from individual data points: S and R 0:89 0:89.5 Measures of multivariate scatter It is useful to have a single number as a measure of spread in the data. Based on S we de ne two scalar quantities The total variation is tr (S) trace (S) The generalized variance is In the above example tr (S) :33 jsj 4 3 : px s ii sum of diagonal elements j sum of eigenvalues of S jsj product of eigenvalues of S (.5) 4
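The matrix expressions of Sections 1.4 and 1.5 can be checked numerically. The following is a minimal NumPy sketch, not part of the original notes: the data matrix `X` is hypothetical, and the variable names are chosen here for illustration. It builds the centring matrix, the sample mean and the divisor-$n$ covariance matrix, then compares the total variation and generalized variance with the sum and product of the eigenvalues of $S$.

```python
import numpy as np

# Hypothetical data matrix: n = 5 items (rows), p = 3 variables (columns).
X = np.array([[2.0, 5.0, 1.0],
              [4.0, 3.0, 6.0],
              [6.0, 8.0, 2.0],
              [3.0, 1.0, 7.0],
              [5.0, 3.0, 4.0]])
n, p = X.shape

ones = np.ones((n, 1))
H = np.eye(n) - ones @ ones.T / n     # centring matrix H = I - (1/n) 1 1^T
x_bar = (X.T @ ones / n).ravel()      # sample mean vector, eq. (1.1)
X_c = H @ X                           # centred data matrix X' = HX, eq. (1.2)
S = X_c.T @ X_c / n                   # covariance matrix with divisor n, eq. (1.3a)

eigvals = np.linalg.eigvalsh(S)       # real eigenvalues of the symmetric matrix S
total_variation = np.trace(S)         # tr(S), the sum of the variances
generalized_variance = np.linalg.det(S)

print("mean vector:", x_bar)
print("total variation:", total_variation, "sum of eigenvalues:", eigvals.sum())
print("generalized variance:", generalized_variance, "product of eigenvalues:", eigvals.prod())
```

The eigenvalues are all non-negative, as the proposition that $S$ is p.s.d. requires.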
1.6 Random vectors

We will in this course generally regard the data as an independent random sample from some continuous population distribution with a probability density function
$$f(x) = f(x_1, \dots, x_p) \qquad (1.6)$$
Here $x = (x_1, \dots, x_p)$ is regarded as a vector of $p$ random variables. Independence here refers to the rows of the data matrix. If two of the variables (columns) are, for example, height and weight of individuals (rows), then knowing one individual's weight says nothing about any other individual. However, the height and weight of any one individual are correlated.

For any region $D$ in the $p$-space of the variables,
$$\Pr(x \in D) = \int_D f(x)\,dx$$

Mean vector
For any $j$ the population mean of $x_j$ is given by the $p$-fold integral
$$E(x_j) = \mu_j = \int x_j f(x)\,dx$$
where the region of integration is $\mathbb{R}^p$. In vector form
$$E(x) = E\begin{pmatrix} x_1 \\ \vdots \\ x_p \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_p \end{pmatrix} = \mu \qquad (1.7)$$

Covariance matrix
The covariance between $x_j, x_k$ is defined as
$$\sigma_{jk} = \mathrm{Cov}(x_j, x_k) = E\big[(x_j - \mu_j)(x_k - \mu_k)\big] = E[x_j x_k] - \mu_j\mu_k$$
When $j = k$ we obtain the variance of $x_j$:
$$\sigma_{jj} = E\big[(x_j - \mu_j)^2\big]$$
The covariance matrix is the $p \times p$ matrix
$$\Sigma = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1p} \\ \vdots & & \vdots \\ \sigma_{p1} & \cdots & \sigma_{pp} \end{pmatrix} = (\sigma_{ij})$$
The alternative notations $V(x) = \mathrm{Cov}(x) = \Sigma$ are used. In matrix form
$$\Sigma = E\big[(x - \mu)(x - \mu)^T\big] \qquad (1.8a)$$
$$\phantom{\Sigma} = E\big[xx^T\big] - \mu\mu^T \qquad (1.8b)$$
More generally we define the covariance between two random vectors $x$ $(p \times 1)$ and $y$ $(q \times 1)$ as the $(p \times q)$ matrix
$$\mathrm{Cov}(x, y) = E\big[(x - \mu_x)(y - \mu_y)^T\big] \qquad (1.9)$$

Important property of $\Sigma$
$\Sigma$ is a positive semi-definite matrix.

Proof Let $a$ $(p \times 1)$ be a constant vector. Then $E(a^T x) = a^T E(x) = a^T\mu$ and
$$V(a^T x) = E\big[(a^T x - a^T\mu)^2\big] = a^T E\big[(x - \mu)(x - \mu)^T\big] a = a^T \Sigma a$$
Since a variance is always a non-negative quantity, we find $a^T \Sigma a \ge 0$. From the definition (see handout), $\Sigma$ is a positive semi-definite (p.s.d.) matrix.

Suppose we have an independent random sample $x_1, x_2, \dots, x_n$ from a distribution with mean $\mu$ and covariance matrix $\Sigma$. What is the relation between (a) the sample and population means, (b) the sample and population covariance matrices?

Result 1 We first establish the mean and covariance of the sample mean $\bar{x}$:
$$E(\bar{x}) = \mu \qquad (1.10a)$$
$$V(\bar{x}) = \frac{1}{n}\Sigma \qquad (1.10b)$$

Proof
$$E(\bar{x}) = E\Big(\frac{1}{n}\sum_{i=1}^{n} x_i\Big) = \frac{1}{n}\sum_{i=1}^{n} E(x_i) = \mu$$
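$$V(\bar{x}) = \mathrm{Cov}\Big(\frac{1}{n}\sum_{i=1}^{n} x_i,\ \frac{1}{n}\sum_{j=1}^{n} x_j\Big) = \frac{1}{n^2}\cdot n\Sigma$$
noting that $\mathrm{Cov}(x_i, x_i) = \Sigma$ and $\mathrm{Cov}(x_i, x_j) = 0$ for $i \ne j$. Hence
$$V(\bar{x}) = \frac{1}{n}\Sigma$$

Result 2 We now examine $S$ and derive an unbiased estimator for $\Sigma$:
$$E(S) = \frac{n-1}{n}\Sigma \qquad (1.11)$$

Proof
$$S = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^T - \bar{x}\bar{x}^T$$
since $\frac{1}{n}\sum_{i=1}^{n} x_i \bar{x}^T = \frac{1}{n}\,\bar{x}\sum_{i=1}^{n} x_i^T = \bar{x}\bar{x}^T$. From (1.8b) and (1.10b) we see that
$$E\big[x_i x_i^T\big] = \Sigma + \mu\mu^T, \qquad E\big[\bar{x}\bar{x}^T\big] = \frac{1}{n}\Sigma + \mu\mu^T$$
hence
$$E(S) = \Sigma + \mu\mu^T - \Big(\frac{1}{n}\Sigma + \mu\mu^T\Big) = \frac{n-1}{n}\Sigma$$
Therefore an unbiased estimate of $\Sigma$ is
$$S_u = \frac{n}{n-1} S = \frac{1}{n-1} X'^T X' \qquad (1.12)$$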
1.7 Linear transformations

Let $x = (x_1, \dots, x_p)^T$ be a random $p$-vector. It is often natural and useful to consider linear combinations of the components of $x$ such as, for example, $y_1 = x_1 + x_2$ or $y_2 = x_1 + 2x_3 - x_4$. In general we consider a transformation from the $p$-component vector $x$ to a $q$-component vector $y$ ($q < p$) given by
$$y = Ax + b \qquad (1.13)$$
where $A$ $(q \times p)$ and $b$ $(q \times 1)$ are constant matrices. Suppose that $E(x) = \mu$ and $V(x) = \Sigma$. The corresponding expressions for $y$ are
$$E(y) = A\mu + b \qquad (1.14a)$$
$$V(y) = A\Sigma A^T \qquad (1.14b)$$
These follow from the linearity of the expectation operator:
$$E(y) = E(Ax + b) = A\,E(x) + E(b) = A\mu + b = \mu_y, \text{ say}$$
$$V(y) = E\big[yy^T\big] - \mu_y\mu_y^T = E\big[(Ax + b)(Ax + b)^T\big] - (A\mu + b)(A\mu + b)^T$$
$$= A\,E\big[xx^T\big]A^T + A\,E(x)\,b^T + b\,E(x^T)A^T + bb^T - A\mu\mu^T A^T - A\mu b^T - b\mu^T A^T - bb^T$$
$$= A\big(E\big[xx^T\big] - \mu\mu^T\big)A^T = A\Sigma A^T$$
as required.

1.8 The Mahalanobis transformation

Given a $p$-variate random variable $x$ with $E(x) = \mu$ and $V(x) = \Sigma$, a transformation to a standardized set of uncorrelated variates is given by the Mahalanobis transformation. Suppose $\Sigma$ is positive definite, i.e. there is no exact linear dependence in $x$. Then the inverse covariance matrix $\Sigma^{-1}$ has a "square root" given by
$$\Sigma^{-1/2} = V\Lambda^{-1/2}V^T \qquad (1.15)$$
where $\Sigma = V\Lambda V^T$ is the spectral decomposition (see handout), i.e. $V$ is an orthogonal matrix ($V^T V = VV^T = I_p$) whose columns are the eigenvectors of $\Sigma$, and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$ holds the corresponding eigenvalues. The Mahalanobis transformation takes the form
$$z = \Sigma^{-1/2}(x - \mu) \qquad (1.16)$$
Using results (1.14a) and (1.14b) we can show that
$$E(z) = 0, \qquad V(z) = I_p$$

Proof
$$E(z) = E\big[\Sigma^{-1/2}(x - \mu)\big] = \Sigma^{-1/2}\big[E(x) - \mu\big] = 0$$
$$V(z) = \Sigma^{-1/2}\,\Sigma\,\Sigma^{-1/2} = I_p$$

1.8.1 Sample Mahalanobis transformation

Given a data matrix $X^T = (x_1, \dots, x_n)$, the sample Mahalanobis transformation
$$z_i = S^{-1/2}(x_i - \bar{x}) \quad \text{for } i = 1, \dots, n,$$
where $S = S_x = \frac{1}{n}X^T H X$ is the sample covariance matrix, creates a transformed data matrix $Z^T = (z_1, \dots, z_n)$. The data matrices are related by
$$Z^T = S^{-1/2}X^T H \quad \text{or} \quad Z = HXS^{-1/2} \qquad (1.17)$$
where $H$ is the centring matrix. We may easily show (Ex.) that $Z$ is centred and that $S_z = I_p$.

1.8.2 Sample scaling transformation

A transformation of the data that scales each variable to have mean zero and variance one, but preserves the correlation structure, is given by
$$y_i = D^{-1}(x_i - \bar{x}) \quad \text{for } i = 1, \dots, n,$$
where $D = \mathrm{diag}(s_1, \dots, s_p)$. Now
$$Y^T = D^{-1}X^T H \quad \text{or} \quad Y = HXD^{-1} \qquad (1.18)$$
Ex. Show that $S_y = R_x$.

1.8.3 A useful matrix identity

Let $u, v$ be $n$-vectors and form the $n \times n$ matrix $A = uv^T$. Then
$$|I + uv^T| = 1 + v^T u \qquad (1.19)$$

Proof First observe that $A$ and $I + A$ share a common set of eigenvectors, since $Aw = \lambda w \Rightarrow (I + A)w = (1 + \lambda)w$. Moreover the eigenvalues of $I + A$ are $1 + \lambda_i$, where $\lambda_i$ are the eigenvalues of $A$. Now $uv^T$ is a rank one matrix and therefore has a single nonzero eigenvalue (see handout). Since
$$uv^T u = u\,(v^T u) = \lambda u, \quad \text{where } \lambda = v^T u,$$
the eigenvalues of $I + uv^T$ are $1 + \lambda, 1, \dots, 1$. The determinant of $I + uv^T$ is the product of the eigenvalues, hence the result.
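The two exercises of Section 1.8 and the determinant identity of Section 1.8.3 can be checked numerically. This is a minimal sketch with simulated (hypothetical) data, using NumPy's symmetric eigensolver to form the inverse square root; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated sample: n = 200 items, p = 3 variables.
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ mix
n, p = X.shape

H = np.eye(n) - np.ones((n, n)) / n            # centring matrix
S = X.T @ H @ X / n                            # sample covariance, divisor n

lam, V = np.linalg.eigh(S)                     # spectral decomposition S = V diag(lam) V^T
S_inv_sqrt = V @ np.diag(lam ** -0.5) @ V.T    # S^{-1/2}, sample version of eq. (1.15)

Z = H @ X @ S_inv_sqrt                         # sample Mahalanobis transformation, eq. (1.17)
S_z = Z.T @ Z / n                              # Z is already centred, so this is its covariance

D_inv = np.diag(1.0 / np.sqrt(np.diag(S)))     # D^{-1} with D = diag(s_1, ..., s_p)
Y = H @ X @ D_inv                              # sample scaling transformation, eq. (1.18)
S_y = Y.T @ Y / n
R = np.corrcoef(X, rowvar=False)               # sample correlation matrix of X

u, v = rng.normal(size=p), rng.normal(size=p)
lhs = np.linalg.det(np.eye(p) + np.outer(u, v))   # |I + u v^T|
rhs = 1.0 + v @ u                                 # 1 + v^T u, eq. (1.19)

print(np.round(S_z, 6))
print(np.allclose(S_y, R), np.isclose(lhs, rhs))
```

Up to rounding error, `S_z` is the identity matrix and `S_y` coincides with the correlation matrix of `X`.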
2. Principal Components Analysis

2.1 Outline of technique

Let $x^T = (x_1, x_2, \dots, x_p)$ be a random vector with mean $\mu$ and covariance matrix $\Sigma$. PCA is a technique for dimensionality reduction from $p$ dimensions to $k < p$ dimensions. It tries to find, in order, the most informative $k$ linear combinations $y_1, y_2, \dots, y_k$ of a set of variables. Here information will be interpreted as a percentage of the total variation (as previously defined) in $\Sigma$. The $k$ sample PCs that "explain" $x\%$ of the total variation in a sample covariance matrix $S$ may be similarly defined.

2.2 Formulation

Let
$$y_1 = a_1^T x, \quad y_2 = a_2^T x, \quad \dots, \quad y_p = a_p^T x$$
where $y_j = a_{1j}x_1 + a_{2j}x_2 + \dots + a_{pj}x_p$ are a sequence of standardized linear combinations (SLCs) of the $x$'s such that $a_j^T a_j = 1$ and $a_j^T a_k = 0$ for $j \ne k$, i.e. $a_1, a_2, \dots, a_p$ form an orthonormal set of $p$-vectors. Equivalently we may define $A$, the $p \times p$ matrix formed from the columns $\{a_j\}$, as an orthogonal matrix so that $A^T A = AA^T = I_p$.

We choose $a_1$ to maximize $\mathrm{Var}(y_1) = a_1^T \Sigma a_1$ subject to $a_1^T a_1 = 1$. Then we choose $a_2$ to maximize $\mathrm{Var}(y_2) = a_2^T \Sigma a_2$ subject to $a_2^T a_2 = 1$ and $a_2^T a_1 = 0$, which ensures that $y_2$ will be uncorrelated with $y_1$. Subsequent PCs are chosen as the SLCs that have maximum variance subject to being uncorrelated with previous PCs.

NB. Sometimes the PCs are taken to be "mean-corrected" linear transformations of the $x$'s, i.e.
$$y_j = a_j^T (x - \mu)$$
emphasizing that the PCs can be considered as direction vectors in $p$-space, relative to the "centre" of a distribution, in which the spread is maximized. In any case $\mathrm{Var}(y_j)$ is the same whichever definition is used.
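2.3 Computation

To find the first PC we use the Lagrange multiplier technique for finding the maximum of a function $f(x)$ subject to an equality constraint $g(x) = 0$. We define the Lagrangean function
$$L(a_1) = a_1^T \Sigma a_1 - \lambda\,(a_1^T a_1 - 1)$$
where $\lambda$ is a Lagrange multiplier. Differentiating, we obtain
$$\frac{\partial L}{\partial a_1} = 2\Sigma a_1 - 2\lambda a_1 = 0 \quad \Rightarrow \quad \Sigma a_1 = \lambda a_1$$
Therefore $a_1$ should be chosen to be an eigenvector of $\Sigma$ with eigenvalue $\lambda$. Suppose the eigenvalues of $\Sigma$ are distinct and ranked in decreasing order $\lambda_1 > \lambda_2 > \dots > \lambda_p > 0$. Since
$$\mathrm{Var}(y_1) = a_1^T \Sigma a_1 = \lambda\, a_1^T a_1 = \lambda,$$
$a_1$ should be chosen as the eigenvector corresponding to the largest eigenvalue $\lambda_1$ of $\Sigma$.

2nd PC
The Lagrangean is
$$L(a_2) = a_2^T \Sigma a_2 - \lambda\,(a_2^T a_2 - 1) - \nu\, a_2^T a_1$$
where $\lambda, \nu$ are Lagrange multipliers. Differentiating,
$$2(\Sigma - \lambda I_p)\,a_2 - \nu a_1 = 0$$
Premultiplying by $a_1^T$ gives $2a_1^T \Sigma a_2 - \nu = 0$, since $a_1^T a_2 = 0$. However
$$a_1^T \Sigma a_2 = a_2^T \Sigma a_1 = \lambda_1\, a_2^T a_1 = 0$$
Therefore $\nu = 0$ and
$$\Sigma a_2 = \lambda a_2$$
so $a_2$ is the eigenvector of $\Sigma$ corresponding to the second largest eigenvalue $\lambda_2$.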
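2.4 Example

The covariance matrix corresponding to scaled (standardized) variables $x_1, x_2$ is
$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$
(in fact a correlation matrix). Note that $\Sigma$ has total variation 2. The eigenvalues of $\Sigma$ are the roots of $|\Sigma - \lambda I| = 0$:
$$(1 - \lambda)^2 - \rho^2 = 0$$
Hence $\lambda = 1 + \rho,\ 1 - \rho$. If $\rho > 0$ then $\lambda_1 = 1 + \rho$, $\lambda_2 = 1 - \rho$. To find $a_1$ we substitute $\lambda_1$ into $\Sigma a_1 = \lambda_1 a_1$. Note that this gives just one equation in terms of the components of $a_1^T = (a_{11}, a_{21})$,
$$-\rho\, a_{11} + \rho\, a_{21} = 0,$$
so $a_{11} = a_{21}$. Applying the normalization $a_1^T a_1 = a_{11}^2 + a_{21}^2 = 1$ we obtain $a_{11} = a_{21} = 1/\sqrt{2}$. Similarly for $a_2$, so that
$$a_1 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}, \qquad a_2 = \begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}$$
and
$$y_1 = \frac{1}{\sqrt{2}}(x_1 + x_2), \qquad y_2 = \frac{1}{\sqrt{2}}(x_1 - x_2)$$
are the PCs, explaining respectively $100(1 + \rho)/2\,\%$ and $100(1 - \rho)/2\,\%$ of the total variation. Notice that the PCs are independent of $\rho$, while the proportion of the total variation explained by each PC does depend on $\rho$.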
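The two-variable example can be reproduced numerically. The sketch below is illustrative only (the function name `pca_2x2` and the value $\rho = 0.6$ are assumptions made here): it computes the eigen-decomposition of the $2 \times 2$ correlation matrix and the percentage of total variation attached to each PC.

```python
import numpy as np

def pca_2x2(rho):
    """Eigenvalues (descending), eigenvectors (columns) and % variation for [[1, rho], [rho, 1]]."""
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    lam, A = np.linalg.eigh(Sigma)          # eigh returns eigenvalues in ascending order
    order = np.argsort(lam)[::-1]           # reorder so that lambda_1 >= lambda_2
    lam, A = lam[order], A[:, order]
    return lam, A, 100.0 * lam / lam.sum()

lam, A, pct = pca_2x2(0.6)
print("eigenvalues:", lam)                  # 1 + rho and 1 - rho
print("loadings (columns, sign arbitrary):")
print(A)                                    # every entry has magnitude 1/sqrt(2)
print("% of total variation:", pct)
```

The sign of each eigenvector returned by a numerical routine is arbitrary, so the loadings may appear as the negatives of $a_1$ or $a_2$ above; the variances are unaffected.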
2.5 PCA and spectral decomposition

Since $\Sigma$ (also $S$) is a real symmetric matrix, we know that it has the spectral decomposition (eigenanalysis)
$$\Sigma = A\Lambda A^T = \sum_{i=1}^{p} \lambda_i\, a_i a_i^T$$
where $\{a_i\}$ are the eigenvectors of $\Sigma$, which we have inserted as columns of the $(p \times p)$ matrix $A$, and $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$ are the corresponding eigenvalues.

If some eigenvalues are not distinct, so $\lambda_k = \lambda_{k+1} = \dots = \lambda_l$, the eigenvectors are not unique, but we may choose an orthonormal set of eigenvectors to span a subspace of dimension $l - k + 1$ (cf. the major/minor axes of an ellipse $x^2/a^2 + y^2/b^2 = 1$ as $b \to a$). Such a situation arises with the equicorrelation matrix (see Class Exercise).

The transformation of a random $p$-vector $x$ (corrected for its mean $\mu$) to its set of principal components (PCs), contained in the $p$-vector $y$, is
$$y = A^T (x - \mu)$$
$y_1$ is the linear combination (SLC) of $x$ having maximum variance, $y_2$ is the SLC having maximum variance subject to being uncorrelated with $y_1$, etc. We have seen that $\mathrm{Var}(y_1) = \lambda_1$, $\mathrm{Var}(y_2) = \lambda_2$, ...

2.6 Explanation of variance

The interpretation of the PCs ($y$) as components of variance "explaining" the total variation, i.e. the sum of the variances of the original variables ($x$), is clarified by the following result.

Result The sum of the variances of the original variables and the sum of the variances of their PCs are the same.

Proof
A note on trace: the sum of the diagonal elements of a $(p \times p)$ square matrix $\Sigma$ is known as the trace of $\Sigma$,
$$\mathrm{tr}(\Sigma) = \sum_{i=1}^{p} \sigma_{ii}$$
We show from this definition that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ whenever $AB$ and $BA$ are defined [i.e. $A$ is $(m \times n)$ and $B$ is $(n \times m)$]:
$$\mathrm{tr}(AB) = \sum_i \sum_j a_{ij} b_{ji} = \sum_j \sum_i b_{ji} a_{ij} = \mathrm{tr}(BA)$$
The sum of the variances for the PCs is
$$\sum_i \mathrm{Var}(y_i) = \sum_i \lambda_i = \mathrm{tr}(\Lambda)$$
Now $\Sigma = A\Lambda A^T$ is the spectral decomposition and $A$ is orthogonal, so $A^T A = I_p$ and hence
$$\mathrm{tr}(\Sigma) = \mathrm{tr}(A\Lambda A^T) = \mathrm{tr}(\Lambda A^T A) = \mathrm{tr}(\Lambda)$$
Since $\Sigma$ is the covariance matrix of $x$, the sum of its diagonal elements is the sum of the variances $\sigma_{ii}$ of the original variables. Hence the result is proved.

Consequence (interpretation of PCs)
It is therefore possible to interpret
$$\frac{\lambda_i}{\lambda_1 + \lambda_2 + \dots + \lambda_p}$$
as the proportion of the total variation in the original data explained by the $i$th principal component, and
$$\frac{\lambda_1 + \dots + \lambda_k}{\lambda_1 + \lambda_2 + \dots + \lambda_p}$$
as the proportion of the total variation explained by the first $k$ PCs.

From a PCA on a $(p \times p)$ sample covariance matrix $S$, we could for example conclude that the first 3 PCs (out of a total of $p$ PCs) account for 80% of the total variation in the data. This would mean that the variation in the data is largely confined to a 3-dimensional subspace described by the PCs $y_1, y_2, y_3$.

2.7 Scale invariance

This unfortunately is a property that PCA does not possess! In practice we often have to choose units of measurement for our individual variables $\{x_i\}$, and the amount of the total variation accounted for by a particular variable $x_i$ depends on this choice (tonnes, kg or grams). In a practical study, the data vector $x$ often comprises physically incomparable quantities (e.g. height, weight, temperature), so there is no "natural scaling" to adopt. One possibility is to perform PCA on a correlation matrix (effectively choosing each variable to have unit sample variance), but this is still an implicit choice of scaling. The main point is that the results of a PCA depend on the scaling adopted.
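The proportion-of-variation interpretation and the lack of scale invariance can both be seen in a small simulation. This is a sketch under assumed, simulated data; the helper name `explained` and the rescaling factor are illustrative choices, not part of the notes.

```python
import numpy as np

def explained(S):
    """Proportion of total variation attached to each PC of S, largest first."""
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]
    return lam / lam.sum()

rng = np.random.default_rng(1)
# Hypothetical sample: 100 items, 3 independent variables with standard deviations 1, 2, 3.
X = rng.normal(size=(100, 3)) * np.array([1.0, 2.0, 3.0])
S = np.cov(X, rowvar=False, bias=True)             # divisor n, as in the notes

# Change the units of the first variable only (e.g. kg to g).
X_rescaled = X * np.array([1000.0, 1.0, 1.0])
S_rescaled = np.cov(X_rescaled, rowvar=False, bias=True)

print("original units:  ", np.round(explained(S), 3))
print("variable 1 x1000:", np.round(explained(S_rescaled), 6))

# The correlation matrix is unaffected by the change of units.
R = np.corrcoef(X, rowvar=False)
R_rescaled = np.corrcoef(X_rescaled, rowvar=False)
print("correlation matrix unchanged:", np.allclose(R, R_rescaled))
```

After rescaling, the first PC is dominated by the rescaled variable and accounts for almost all of the total variation, whereas a PCA on the correlation matrix gives the same answer in either set of units.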
2.8 Principal component scores

The sample PC transform on a data matrix $X$ takes the form, for the $r$th individual ($r$th row of the sample),
$$y'_r = A^T (x_r - \bar{x})$$
where the columns of $A$ are the eigenvectors of the sample covariance matrix $S$. Notice that the first component of $y'_r$ corresponds to the scalar product of the first column of $A$ with $x'_r = x_r - \bar{x}$, etc. The components of $y'_r$ are known as the (mean-corrected) principal component scores for the $r$th individual. The quantities
$$y_r = A^T x_r$$
are the raw PC scores for that individual. Geometrically, the PC scores are the coordinates of each data point with respect to new axes defined by the PCs, i.e. w.r.t. a rotated frame of reference. The scores can provide qualitative information about individuals.

2.9 Correlation of PCs with original variables

The correlations $\rho(x_i, y_k)$ of the $k$th PC with variable $x_i$ are an aid to interpreting the PCs. Since $y = A^T (x - \mu)$ we have
$$\mathrm{Cov}(x, y) = E\big[(x - \mu)\,y^T\big] = E\big[(x - \mu)(x - \mu)^T\big]A = \Sigma A$$
and from the spectral decomposition
$$\Sigma A = A\Lambda A^T A = A\Lambda$$
Post-multiplying $A$ by the diagonal matrix $\Lambda$ has the effect of scaling its columns, so that
$$\mathrm{Cov}(x_i, y_k) = \lambda_k a_{ik}$$
is the covariance between the $i$th variable and the $k$th PC. The correlation is
$$\rho(x_i, y_k) = \frac{\mathrm{Cov}(x_i, y_k)}{\sqrt{\mathrm{Var}(x_i)}\,\sqrt{\mathrm{Var}(y_k)}} = \frac{\lambda_k a_{ik}}{\sqrt{\sigma_{ii}}\,\sqrt{\lambda_k}} = a_{ik}\sqrt{\frac{\lambda_k}{\sigma_{ii}}}$$
and its square,
$$\rho^2(x_i, y_k) = \frac{\lambda_k a_{ik}^2}{\sigma_{ii}},$$
can be interpreted as the proportion of the variation in $x_i$ explained by the $k$th PC.
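The score and correlation formulas of Sections 2.8 and 2.9 translate directly into array operations. The following sketch uses simulated (hypothetical) data and the sample versions of the formulas, with $S$ in place of $\Sigma$; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical correlated sample: n = 150 items, p = 3 variables.
mix = np.array([[1.5, 0.4, 0.2],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.6]])
X = rng.normal(size=(150, 3)) @ mix
n, p = X.shape

x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False, bias=True)     # sample covariance, divisor n

lam, A = np.linalg.eigh(S)                 # eigenvalues ascending, eigenvectors in columns
order = np.argsort(lam)[::-1]              # reorder so that lambda_1 >= ... >= lambda_p
lam, A = lam[order], A[:, order]

scores = (X - x_bar) @ A                   # row r holds y'_r = A^T (x_r - x_bar)
S_scores = scores.T @ scores / n           # covariance of the scores: diag(lambda)

# rho(x_i, y_k) = a_ik * sqrt(lambda_k / s_ii); rows index variables, columns index PCs.
corr = A * np.sqrt(lam)[None, :] / np.sqrt(np.diag(S))[:, None]

print(np.round(S_scores, 6))
print(np.round(corr ** 2, 3))              # each row sums to 1
```

Each row of `corr ** 2` sums to one, so the squared correlations split the variance of each original variable across the $p$ PCs.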
16 Exercise Find the PC s of the covariance matrix and show that they account for amounts of the total variation in : 5:83 :00 3 0:7 Compute the correlations (x i ; y k ) and try to interpret the PC s qualitatively. 6
3. Multivariate Normal Distribution

The MVN distribution is a generalization of the univariate normal distribution, which has the density function (p.d.f.)
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big\{-\frac{(x - \mu)^2}{2\sigma^2}\Big\}, \qquad -\infty < x < \infty$$
where $\mu$ is the mean of the distribution and $\sigma^2$ the variance. In $p$ dimensions the density becomes
$$f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\exp\Big\{-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu)\Big\} \qquad (3.1)$$
Within the mean vector $\mu$ there are $p$ (independent) parameters and within the symmetric covariance matrix $\Sigma$ there are $\frac{1}{2}p(p + 1)$ independent parameters [$\frac{1}{2}p(p + 3)$ independent parameters in total]. We use the notation
$$x \sim N_p(\mu, \Sigma) \qquad (3.2)$$
to denote a RV $x$ having the MVN distribution with
$$E(x) = \mu, \qquad \mathrm{Cov}(x) = \Sigma$$
Note that MVN distributions are entirely characterized by the first and second moments of the distribution.

Basic properties
If $x$ $(p \times 1)$ is MVN with mean $\mu$ and covariance matrix $\Sigma$ as in (3.2), then:

1. Any linear combination of $x$ is MVN. Let $y = Ax + c$ with $A$ $(q \times p)$ and $c$ $(q \times 1)$; then $y \sim N_q(\mu_y, \Sigma_y)$, where $\mu_y = A\mu + c$ and $\Sigma_y = A\Sigma A^T$.
2. Any subset of variables in $x$ has a MVN distribution.
3. If a set of variables is uncorrelated, then they are independently distributed. In particular
i) if $\sigma_{ij} = 0$ then $x_i, x_j$ are independent
ii) if $x$ is MVN with covariance matrix $\Sigma$, then $Ax$ and $Bx$ are independent if and only if
$$\mathrm{Cov}(Ax, Bx) = A\Sigma B^T = 0 \qquad (3.3)$$
4. Conditional distributions are MVN.

Result For the MVN distribution, variables are uncorrelated $\Leftrightarrow$ variables are independent.
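Proof Let $x$ $(p \times 1)$ be partitioned as
$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad x_1\ (q \times 1), \quad x_2\ ((p - q) \times 1)$$
with mean vector and covariance matrix
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$
where $\Sigma_{11}$ is $(q \times q)$ and $\Sigma_{22}$ is $((p - q) \times (p - q))$.

i) Independent $\Rightarrow$ uncorrelated (always holds).
Suppose $x_1, x_2$ are independent. Then $\mathrm{Cov}(x_1, x_2) = E\big[(x_1 - \mu_1)(x_2 - \mu_2)^T\big]$ factorizes into the product of $E[(x_1 - \mu_1)]$ and $E\big[(x_2 - \mu_2)^T\big]$, which are both zero since $E(x_1) = \mu_1$ and $E(x_2) = \mu_2$. Hence $\Sigma_{12} = 0$.

ii) Uncorrelated $\Rightarrow$ independent (for MVN).
This result depends on factorizing the p.d.f. (3.1) when $\Sigma_{12} = 0$. In this case $(x - \mu)^T \Sigma^{-1}(x - \mu)$ has the partitioned form
$$\begin{pmatrix} x_1^T - \mu_1^T, & x_2^T - \mu_2^T \end{pmatrix}\begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix}\begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} = (x_1 - \mu_1)^T \Sigma_{11}^{-1}(x_1 - \mu_1) + (x_2 - \mu_2)^T \Sigma_{22}^{-1}(x_2 - \mu_2)$$
so that $\exp\{-\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu)\}$ factorizes into the product of
$$\exp\big\{-\tfrac{1}{2}(x_1 - \mu_1)^T \Sigma_{11}^{-1}(x_1 - \mu_1)\big\} \quad \text{and} \quad \exp\big\{-\tfrac{1}{2}(x_2 - \mu_2)^T \Sigma_{22}^{-1}(x_2 - \mu_2)\big\}$$
Therefore the p.d.f. can be written as
$$f(x) = g(x_1)\,h(x_2)$$
proving that $x_1$ and $x_2$ are independent.

Result Let $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$, with $x_1$ $(q \times 1)$ and $x_2$ $((p - q) \times 1)$, be MVN with mean $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and covariance matrix $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$. The conditional distribution of $x_2$ given $x_1$ is MVN with
$$E(x_2 \mid x_1) = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1) \qquad (3.4a)$$
$$\mathrm{Cov}(x_2 \mid x_1) = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \qquad (3.4b)$$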
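Proof Let $x_2' = x_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1$. We first show that $x_2'$ and $x_1$ are independent. Consider the linear transformation
$$\begin{pmatrix} x_1 \\ x_2' \end{pmatrix} = \begin{pmatrix} I & 0 \\ -\Sigma_{21}\Sigma_{11}^{-1} & I \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \qquad (3.5a)$$
$$\phantom{\begin{pmatrix} x_1 \\ x_2' \end{pmatrix}} = Ax, \text{ say.} \qquad (3.5b)$$
This linear relationship shows that $x_1, x_2'$ are jointly MVN (by the first property of the MVN stated above). We may show that $x_1$ and $x_2'$ are uncorrelated in two ways. Firstly,
$$\mathrm{Cov}(x_2', x_1) = \mathrm{Cov}(x_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1,\ x_1) = \mathrm{Cov}(x_2, x_1) - \Sigma_{21}\Sigma_{11}^{-1}\mathrm{Cov}(x_1, x_1) = \Sigma_{21} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{11} = 0$$
or, if we write $A = \begin{pmatrix} B \\ C \end{pmatrix}$ in (3.5) and apply (3.3),
$$\mathrm{Cov}(x_1, x_2') = \mathrm{Cov}(Bx, Cx) = B\Sigma C^T = \begin{pmatrix} I & 0 \end{pmatrix}\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\begin{pmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{pmatrix} = -\Sigma_{12} + \Sigma_{12} = 0$$
Since they are MVN and uncorrelated, we have shown that $x_2'$ and $x_1$ are independent. Therefore
$$E(x_2' \mid x_1) = E(x_2') = E(x_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1) = \mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\mu_1$$
Now since $x_2 = x_2' + \Sigma_{21}\Sigma_{11}^{-1}x_1$,
$$E(x_2 \mid x_1) = E(x_2' \mid x_1) + \Sigma_{21}\Sigma_{11}^{-1}x_1 = \mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\mu_1 + \Sigma_{21}\Sigma_{11}^{-1}x_1 = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1)$$
as required. Because $x_2'$ and $x_1$ are independent,
$$\mathrm{Cov}(x_2' \mid x_1) = \mathrm{Cov}(x_2')$$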
20 Conditional on x a given constant, x 0 x x i.e. x 0 and x di er by a constant. Hence Therefore where C I so Example Cov (x jx ) Cov x 0 jx Cov (x jx ) Cov x 0 CC T I I 0 I Let x have a MVN distribution with covariance matrix Show that the conditional distribution of (x ; x ) given x 3 is also MVN with mean + (x 3 3 ) and covariance matrix 4 0
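The conditional-distribution result (3.4a), (3.4b) is straightforward to evaluate numerically. The sketch below is illustrative: the function name `conditional_mvn`, the mean vector and the covariance matrix are all hypothetical values chosen here, and the conditioning block is taken to be the first $q$ components as in the statement of the result.

```python
import numpy as np

def conditional_mvn(mu, Sigma, q, x1):
    """Mean and covariance of x2 | x1, where x1 is the first q components of x."""
    mu1, mu2 = mu[:q], mu[q:]
    S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
    S21, S22 = Sigma[q:, :q], Sigma[q:, q:]
    B = np.linalg.solve(S11, S12).T        # Sigma_21 Sigma_11^{-1}, using symmetry of Sigma
    cond_mean = mu2 + B @ (x1 - mu1)       # eq. (3.4a)
    cond_cov = S22 - B @ S12               # eq. (3.4b)
    return cond_mean, cond_cov

# Hypothetical parameters (p = 3), conditioning on the first variable (q = 1).
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[4.0, 1.0, 2.0],
                  [1.0, 3.0, 0.0],
                  [2.0, 0.0, 5.0]])
cond_mean, cond_cov = conditional_mvn(mu, Sigma, q=1, x1=np.array([2.0]))
print(cond_mean)   # [2.25 3.5 ]
print(cond_cov)    # [[ 2.75 -0.5 ] [-0.5   4.  ]]
```

Note that the conditional covariance does not depend on the observed value of $x_1$, only the conditional mean does.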
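3.1 Maximum-likelihood estimation

Let $X^T = (x_1, \dots, x_n)$ contain an independent random sample of size $n$ from $N_p(\mu, \Sigma)$. The maximum likelihood estimates (MLEs) of $\mu, \Sigma$ are
$$\hat{\mu} = \bar{x} \qquad (3.6a)$$
$$\hat{\Sigma} = S \qquad (3.6b)$$
The likelihood function is a function of the parameters $\mu, \Sigma$ given the data $X$:
$$L(\mu, \Sigma \mid X) = \prod_{r=1}^{n} f(x_r \mid \mu, \Sigma) \qquad (3.7)$$
The RHS is evaluated by substituting the individual data vectors $\{x_1, \dots, x_n\}$ in turn into the p.d.f. of $N_p(\mu, \Sigma)$ and taking the product:
$$\prod_{r=1}^{n} f(x_r \mid \mu, \Sigma) = (2\pi)^{-np/2}\,|\Sigma|^{-n/2}\exp\Big\{-\tfrac{1}{2}\sum_{r=1}^{n}(x_r - \mu)^T \Sigma^{-1}(x_r - \mu)\Big\}$$
Maximizing $L$ is equivalent to minimizing
$$l = -2\log L = -2\sum_{r=1}^{n}\log f(x_r \mid \mu, \Sigma) = K + n\log|\Sigma| + \sum_{r=1}^{n}(x_r - \mu)^T \Sigma^{-1}(x_r - \mu)$$
where $K$ is a constant independent of $\mu, \Sigma$. Noting that $x_r - \mu = (x_r - \bar{x}) + (\bar{x} - \mu)$, the final term in the above may be written
$$\sum_{r=1}^{n}(x_r - \bar{x})^T \Sigma^{-1}(x_r - \bar{x}) + \sum_{r=1}^{n}(x_r - \bar{x})^T \Sigma^{-1}(\bar{x} - \mu) + \sum_{r=1}^{n}(\bar{x} - \mu)^T \Sigma^{-1}(x_r - \bar{x}) + n(\bar{x} - \mu)^T \Sigma^{-1}(\bar{x} - \mu)$$
in which the two cross terms vanish because $\sum_r (x_r - \bar{x}) = 0$. Thus, up to the additive constant $K$,
$$l(\mu, \Sigma) = n\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}A) + n\,d^T \Sigma^{-1}d \qquad (3.8a)$$
$$\phantom{l(\mu, \Sigma)} = n\big\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + d^T \Sigma^{-1}d\big\} \qquad (3.8b)$$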
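where we define, for ease of notation,
$$A = nS \qquad (3.9a)$$
$$d = \bar{x} - \mu \qquad (3.9b)$$
and $S$ is the sample covariance matrix (with divisor $n$). We have made use of $nS = C^T C$, where $C$ is the $(n \times p)$ centred data matrix
$$C^T = (x_1 - \bar{x},\ x_2 - \bar{x},\ \dots,\ x_n - \bar{x})$$
We see that
$$\sum_{r=1}^{n}(x_r - \bar{x})^T \Sigma^{-1}(x_r - \bar{x}) = \mathrm{tr}(C\Sigma^{-1}C^T) = \mathrm{tr}(\Sigma^{-1}C^T C) = \mathrm{tr}(\Sigma^{-1}A) = n\,\mathrm{tr}(\Sigma^{-1}S)$$
Notice that $l = l(\mu, \Sigma)$ and that the dependence on $\mu$ is entirely through $d$ in (3.8). Now assume that $\Sigma$ is positive definite (p.d.); then so is $\Sigma^{-1}$ (why?). Thus for all $d \ne 0$ we have $d^T \Sigma^{-1}d > 0$, showing that $l$ is minimized with respect to $\mu$, for fixed $\Sigma$, when $d = 0$. Hence
$$\hat{\mu} = \bar{x}$$
It remains to minimize the log-likelihood $l(\hat{\mu}, \Sigma)$ w.r.t. $\Sigma$. Up to an arbitrary additive constant,
$$l(\bar{x}, \Sigma) = n\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}A) = n\big\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S)\big\}$$
Let
$$\phi(\Sigma) = n\big\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S)\big\} \qquad (3.10)$$
We show that
$$\phi(\Sigma) - \phi(S) = n\big\{\log|\Sigma| - \log|S| + \mathrm{tr}(\Sigma^{-1}S) - p\big\} = n\big\{\mathrm{tr}(\Sigma^{-1}S) - \log|\Sigma^{-1}S| - p\big\} \ge 0 \qquad (3.11)$$

Lemma 1 $\Sigma^{-1}S$ is positive definite (proved elsewhere).

Lemma 2 For any set of positive numbers, $A \ge \log G + 1$, where $A$ and $G$ are the arithmetic and geometric means respectively.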
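Proof For all $x$ we have $e^x \ge 1 + x$ (simple exercise). For each $y_i > 0$ of a set $\{y_1, \dots, y_n\}$, putting $x = \log y_i$ therefore gives
$$y_i \ge 1 + \log y_i$$
Averaging over $i$,
$$\frac{1}{n}\sum_i y_i \ge 1 + \frac{1}{n}\sum_i \log y_i$$
that is,
$$A \ge 1 + \log\Big(\prod_i y_i\Big)^{1/n} = 1 + \log G$$
as required.

In (3.11), assuming that the eigenvalues of $\Sigma^{-1}S$ are positive, recall that for any square matrix $A$ we have $\mathrm{tr}(A) = \sum_i \lambda_i$, the sum of the eigenvalues, and $|A| = \prod_i \lambda_i$, the product of the eigenvalues. Let $\lambda_i$ ($i = 1, \dots, p$) be the eigenvalues of $\Sigma^{-1}S$ and substitute in (3.11):
$$\log|\Sigma^{-1}S| = \log\prod_i \lambda_i = p\log G, \qquad \mathrm{tr}(\Sigma^{-1}S) = \sum_i \lambda_i = pA$$
$$\phi(\Sigma) - \phi(S) = np\,\{A - \log G - 1\} \ge 0$$
This shows that the MLEs are as stated in (3.6).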
3.2 Sampling distribution of $\bar{x}$ and $S$

The Wishart distribution (Definition)
If $M$ $(p \times p)$ can be written $M = X^T X$, where $X$ $(m \times p)$ is a data matrix from $N_p(0, \Sigma)$, then $M$ is said to have a Wishart distribution with scale matrix $\Sigma$ and degrees of freedom $m$. We write
$$M \sim W_p(\Sigma, m) \qquad (3.12)$$
When $\Sigma = I_p$ the distribution is said to be in standard form.

Note: The Wishart distribution is the multivariate generalization of the chi-square distribution.

Additive property of matrices with a Wishart distribution
Let $M_1, M_2$ be matrices having the Wishart distribution independently,
$$M_1 \sim W_p(\Sigma, m_1), \qquad M_2 \sim W_p(\Sigma, m_2);$$
then
$$M_1 + M_2 \sim W_p(\Sigma, m_1 + m_2)$$
This property follows from the definition of the Wishart distribution, because data matrices are additive in the sense that if
$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$
is a combined data matrix consisting of $m_1 + m_2$ rows, then
$$X^T X = X_1^T X_1 + X_2^T X_2$$
is the matrix (known as the "Gram matrix") formed from the combined data matrix $X$.

Case of $p = 1$
When $p = 1$ we know, from the definition of $\chi^2_r$ as the distribution of the sum of squares of $r$ independent $N(0, 1)$ variates, that
$$M = \sum_{i=1}^{m} x_i^2 \sim \sigma^2\chi^2_m$$
so that
$$W_1(\sigma^2, m) \equiv \sigma^2\chi^2_m$$
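Sampling distributions
Let $x_1, x_2, \dots, x_n$ be a random sample of size $n$ from $N_p(\mu, \Sigma)$. Then
1. The sample mean $\bar{x}$ has the normal distribution $\bar{x} \sim N_p\big(\mu, \tfrac{1}{n}\Sigma\big)$.
2. The sample covariance matrix (the MLE, $S = \frac{1}{n}C^T C$) has the Wishart distribution $nS \sim W_p(\Sigma, n - 1)$.
3. The distributions of $\bar{x}$ and $S$ are independent.

3.3 Estimators for special circumstances

3.3.1 $\mu$ proportional to a given vector

Sometimes $\mu$ is known to be proportional to a given vector, so $\mu = k\mu_0$. For example, if $x$ represents a sample of repeated measurements then $\mu = k\mathbf{1}$, where $\mathbf{1} = (1, 1, \dots, 1)^T$ is the $p$-vector of 1's. We find the MLE of $k$ for this situation. Suppose $\Sigma$ is known and $\mu = k\mu_0$. The log likelihood is
$$l = -2\log L = n\big\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + (\bar{x} - k\mu_0)^T \Sigma^{-1}(\bar{x} - k\mu_0)\big\}$$
To minimize $l$ w.r.t. $k$, expand the quadratic form,
$$(\bar{x} - k\mu_0)^T \Sigma^{-1}(\bar{x} - k\mu_0) = \bar{x}^T \Sigma^{-1}\bar{x} - 2k\,\mu_0^T \Sigma^{-1}\bar{x} + k^2\,\mu_0^T \Sigma^{-1}\mu_0,$$
and set its derivative with respect to $k$ to zero, from which
$$\hat{k} = \frac{\mu_0^T \Sigma^{-1}\bar{x}}{\mu_0^T \Sigma^{-1}\mu_0} \qquad (3.13)$$
We may show that $\hat{k}$ is an unbiased estimator of $k$ and determine the variance of $\hat{k}$. In (3.13) $\hat{k}$ takes the form $\frac{1}{\alpha}c^T \bar{x}$ with $c^T = \mu_0^T \Sigma^{-1}$ and $\alpha = \mu_0^T \Sigma^{-1}\mu_0$, so
$$E\big[\hat{k}\big] = \frac{1}{\alpha}c^T E[\bar{x}] = \frac{k}{\alpha}c^T \mu_0$$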
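Hence, since $c^T \mu_0 = \mu_0^T \Sigma^{-1}\mu_0 = \alpha$,
$$E\big[\hat{k}\big] = k \qquad (3.14)$$
showing that $\hat{k}$ is an unbiased estimator. Noting that $\mathrm{Var}[\bar{x}] = \frac{1}{n}\Sigma$, and therefore that $\mathrm{Var}\big[c^T \bar{x}\big] = \frac{1}{n}c^T \Sigma c$, we have
$$\mathrm{Var}\big[\hat{k}\big] = \frac{1}{n\alpha^2}c^T \Sigma c = \frac{1}{n\,\mu_0^T \Sigma^{-1}\mu_0} \qquad (3.15)$$

3.3.2 Linear restriction on $\mu$

We determine an estimator for $\mu$ to satisfy a linear restriction
$$A\mu = b$$
where $A$ is $(m \times p)$ and $b$ is $(m \times 1)$. Introduce a vector $\lambda$ of $m$ Lagrange multipliers and seek to minimize
$$l + 2n\lambda^T (A\mu - b) = n\big\{(\bar{x} - \mu)^T \Sigma^{-1}(\bar{x} - \mu) + 2\lambda^T (A\mu - b)\big\} + \text{terms not involving } \mu$$
Differentiating w.r.t. $\mu$:
$$-2\Sigma^{-1}(\bar{x} - \mu) + 2A^T \lambda = 0$$
$$\bar{x} - \mu = \Sigma A^T \lambda \qquad (3.16)$$
We use the constraint $A\mu = b$ to evaluate the Lagrange multipliers $\lambda$. Premultiplying by $A$,
$$A\bar{x} - b = A\Sigma A^T \lambda \quad \Rightarrow \quad \lambda = (A\Sigma A^T)^{-1}(A\bar{x} - b)$$
Substituting into (3.16),
$$\hat{\mu} = \bar{x} - \Sigma A^T (A\Sigma A^T)^{-1}(A\bar{x} - b) \qquad (3.17)$$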
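The two restricted estimators (3.13) and (3.17) can be written as short functions. This is a sketch under stated assumptions: the function names, the covariance matrix, the sample mean and the restriction $\mu_1 = \mu_2$ are hypothetical choices made here for illustration.

```python
import numpy as np

def proportional_mean_mle(x_bar, Sigma, mu0):
    """k-hat of eq. (3.13): mu0^T Sigma^{-1} x_bar / (mu0^T Sigma^{-1} mu0)."""
    w = np.linalg.solve(Sigma, mu0)             # Sigma^{-1} mu0
    return (w @ x_bar) / (w @ mu0)

def restricted_mean_mle(x_bar, Sigma, A, b):
    """mu-hat of eq. (3.17), which satisfies the restriction A mu = b."""
    lam = np.linalg.solve(A @ Sigma @ A.T, A @ x_bar - b)   # Lagrange multipliers
    return x_bar - Sigma @ A.T @ lam

# Hypothetical known covariance matrix and observed sample mean (p = 3).
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
x_bar = np.array([1.2, 0.9, 1.1])

mu0 = np.ones(3)                                # repeated-measurements case, mu = k 1
k_hat = proportional_mean_mle(x_bar, Sigma, mu0)

A = np.array([[1.0, -1.0, 0.0]])                # single restriction: mu_1 - mu_2 = 0
b = np.array([0.0])
mu_hat = restricted_mean_mle(x_bar, Sigma, A, b)

print("k_hat:", k_hat)
print("mu_hat:", mu_hat, "A mu_hat:", A @ mu_hat)
```

If the sample mean already satisfies the restriction, the correction term in (3.17) vanishes and $\hat{\mu} = \bar{x}$.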
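3.3.3 Covariance matrix proportional to a given matrix

We consider estimating $k$ when $\Sigma = k\Sigma_0$, where $\Sigma_0$ is given. The likelihood (3.8) takes the form
$$l = n\big\{\log|k\Sigma_0| + \mathrm{tr}\big((k\Sigma_0)^{-1}S\big)\big\} = n\Big\{p\log k + \frac{1}{k}\,\mathrm{tr}(\Sigma_0^{-1}S)\Big\}$$
plus terms not involving $k$. Hence
$$\frac{dl}{dk} = n\Big\{\frac{p}{k} - \frac{1}{k^2}\,\mathrm{tr}(\Sigma_0^{-1}S)\Big\} = 0$$
$$\hat{k} = \frac{\mathrm{tr}(\Sigma_0^{-1}S)}{p} \qquad (3.18)$$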
4. Hypothesis testing (Hotelling's $T^2$-statistic)

Consider the test of hypothesis
$$H_0: \mu = \mu_0 \qquad H_A: \mu \ne \mu_0$$

4.1 The Union-Intersection Principle

We accept the hypothesis $H_0$ as valid if and only if
$$H_0(a): a^T \mu = a^T \mu_0$$
is accepted for all $a$. [In some sense the union of all such hypotheses.]

For fixed $a$ we set $y = a^T x$, so that in the population under $H_0$
$$E(y) = a^T \mu_0, \qquad \mathrm{Var}(y) = a^T \Sigma a$$
and in our sample
$$\bar{y} = a^T \bar{x}, \qquad \text{s.e.}(\bar{y}) = \sqrt{\frac{a^T S a}{n - 1}}$$
The univariate $t$-statistic for testing $H_0(a)$ against the alternative $E(y) \ne a^T \mu_0$ is
$$t(a) = \frac{\bar{y} - a^T \mu_0}{\text{s.e.}(\bar{y})} = \frac{\sqrt{n - 1}\; a^T (\bar{x} - \mu_0)}{\sqrt{a^T S a}}$$
The acceptance threshold for $H_0(a)$ takes the form $t^2(a) \le R$ for some $R$. The multivariate acceptance region is the intersection
$$\bigcap_a \big\{t^2(a) \le R\big\} \qquad (4.1)$$
which is true if and only if $\max_a t^2(a) \le R$. Therefore we adopt $\max_a t^2(a)$ as the test statistic for $H_0$. Equivalently:
$$\text{Maximize } (n - 1)\,a^T (\bar{x} - \mu_0)(\bar{x} - \mu_0)^T a \quad \text{subject to } a^T S a = 1 \qquad (4.2)$$
Writing $d = \bar{x} - \mu_0$, we introduce a Lagrangean multiplier $\lambda$ and seek to determine $\lambda$ and $a$ to satisfy
$$\frac{d}{da}\Big[a^T (\bar{x} - \mu_0)(\bar{x} - \mu_0)^T a - \lambda\, a^T S a\Big] = 0$$
$$dd^T a - \lambda S a = 0 \qquad (4.3a)$$
$$\big(S^{-1}dd^T - \lambda I\big)a = 0 \qquad (4.3b)$$
$$\big|S^{-1}dd^T - \lambda I\big| = 0 \qquad (4.3c)$$
(4.3b) can be written $Ma = \lambda a$, showing that $a$ is an eigenvector of $S^{-1}dd^T$. (4.3c) is the determinantal equation satisfied by the eigenvalues of $S^{-1}dd^T$. Premultiplying (4.3a) by $a^T$ gives
$$a^T dd^T a - \lambda\, a^T S a = 0 \quad \Rightarrow \quad \lambda = \frac{a^T dd^T a}{a^T S a} = \frac{t^2(a)}{n - 1}$$
Therefore, in order to maximize $t^2(a)$, we choose $\lambda$ to be the largest eigenvalue of $S^{-1}dd^T$. This is a rank 1 matrix with the single non-zero eigenvalue
$$\mathrm{tr}\big(S^{-1}dd^T\big) = d^T S^{-1}d$$
and the maximum of (4.2) is known as Hotelling's $T^2$ statistic
$$T^2 = (n - 1)(\bar{x} - \mu_0)^T S^{-1}(\bar{x} - \mu_0) \qquad (4.4)$$
which is $(n - 1)$ times the sample Mahalanobis distance between $\bar{x}$ and $\mu_0$.

4.2 Distribution of $T^2$

Under $H_0$ it can be shown that
$$T^2 \sim \frac{p(n - 1)}{n - p}\,F_{p,\,n-p} \qquad (4.5)$$
where $F_{p,\,n-p}$ is the $F$ distribution on $p$ and $n - p$ degrees of freedom. Note that, depending on the covariance matrix used, $T^2$ has slightly different forms:
$$T^2 = (n - 1)(\bar{x} - \mu_0)^T S^{-1}(\bar{x} - \mu_0) = n\,(\bar{x} - \mu_0)^T S_U^{-1}(\bar{x} - \mu_0)$$
where $S_U$ is the unbiased estimator of $\Sigma$ (with divisor $n - 1$).

Example 1 In an investigation of adult intelligence, scores were obtained on two tests, "verbal" and "performance", for 101 subjects aged 60 to 64. Doppelt and Wallace (1955) reported the following mean scores and covariance matrix:
$$\bar{x}_1 = 55.24, \quad \bar{x}_2 = 34.97, \qquad S_U = \begin{pmatrix} 210.54 & 126.99 \\ 126.99 & 119.68 \end{pmatrix}$$
30 At the :0 (%) level, test the hypothesis that and We rst compute S :039 :0400 U :0400 :03 d x 0 4:76 5:03 T The T statistic is then T 0 4:76 5:03 :039 :0400 4:76 :0400 :03 5:03 4:76 0 :039 4:76 5:03 : :03 :03 357:4 This gives F : :9 The nearest tabulated % value corresponds to F ;60 and is Therefore we conclude the null hypothesis should be rejected. The sample probably arose from a population with a much lower mean vector, rather closer to the sample mean. Example The change in levels of free fatty acid (FFA) were measured on 5 hypnotised subjects who had been asked to experience fear, depression and anger e ects while under hypnosis. The mean FFA changes were x :699 x :78 x 3 :558 Given that the covariance matrix of the stress di erences y i x i x i and y i x i x i3 is :7343 :666 S U :666 :7733 S 0:804 0:338 U 0:338 :7733 test at the 0.05 level of signi cance, whether each e ect produced the same change in FFA. [T :68 and F :4 with degrees of freedom,3. Do not reject the hypothesis "no emotion e ect" at the :05 level] 30
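The computation of Hotelling's $T^2$ from raw data can be sketched as follows. The data here are simulated and hypothetical, and the function name `hotelling_t2` is an illustrative choice; the sketch uses the divisor-$n$ covariance matrix as in (4.4) and converts $T^2$ to the $F$ scale using (4.5). Looking up the critical value of $F_{p,\,n-p}$ is left to tables or a statistics library.

```python
import numpy as np

def hotelling_t2(X, mu0):
    """Return (T^2, F) for H0: mu = mu0, with T^2 as in eq. (4.4) and F = (n-p) T^2 / (p (n-1))."""
    n, p = X.shape
    d = X.mean(axis=0) - mu0
    S = np.cov(X, rowvar=False, bias=True)          # divisor n
    t2 = (n - 1) * d @ np.linalg.solve(S, d)        # (n-1) d^T S^{-1} d
    f_stat = (n - p) / (p * (n - 1)) * t2           # refer to F on (p, n-p) d.f., eq. (4.5)
    return t2, f_stat

rng = np.random.default_rng(3)
X = rng.normal(loc=[0.3, -0.1, 0.2], size=(30, 3))  # hypothetical sample, n = 30, p = 3
mu0 = np.zeros(3)
t2, f_stat = hotelling_t2(X, mu0)

# The same statistic written with the unbiased estimator S_U (divisor n - 1).
n = X.shape[0]
d = X.mean(axis=0) - mu0
S_u = np.cov(X, rowvar=False)
t2_unbiased_form = n * d @ np.linalg.solve(S_u, d)

print("T^2 =", t2, " F =", f_stat)
print("n d^T S_U^{-1} d =", t2_unbiased_form)
```

The two printed forms of $T^2$ agree, which is the equivalence noted below (4.5).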
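4.3 Invariance of $T^2$

$T^2$ is unaffected by changes in the scale or origin of the (response) variables. Consider
$$y = Cx + d$$
where $C$ is $(p \times p)$ and non-singular. The null hypothesis $H_0: \mu_x = \mu_0$ is equivalent to $H_0: \mu_y = C\mu_0 + d$. Under the linear transformation we have
$$\bar{y} = C\bar{x} + d, \qquad S_y = CSC^T$$
so that
$$\frac{1}{n - 1}\,T_y^2 = (\bar{y} - \mu_y)^T S_y^{-1}(\bar{y} - \mu_y) = (\bar{x} - \mu_0)^T C^T \big(CSC^T\big)^{-1}C\,(\bar{x} - \mu_0)$$
$$= (\bar{x} - \mu_0)^T C^T (C^T)^{-1}S^{-1}C^{-1}C\,(\bar{x} - \mu_0) = (\bar{x} - \mu_0)^T S^{-1}(\bar{x} - \mu_0)$$
which demonstrates invariance.

4.4 Confidence interval for a mean

A confidence region for $\mu$ can be obtained, given the distribution
$$T^2 = (n - 1)(\bar{x} - \mu)^T S^{-1}(\bar{x} - \mu) \sim \frac{p(n - 1)}{n - p}\,F_{p,\,n-p} \qquad (4.6)$$
by substituting the data values $\bar{x}$ and $S^{-1}$. In Example 1 above we have
$$\bar{x} = (55.24,\ 34.97)^T, \qquad 100\,S^{-1} = \begin{pmatrix} 1.32 & -1.40 \\ -1.40 & 2.32 \end{pmatrix}$$
and $F_{2,\,99}(0.01)$ is approximately 4.83 (by interpolation). Hence
$$1.32\,(\mu_1 - 55.24)^2 - 2.80\,(\mu_1 - 55.24)(\mu_2 - 34.97) + 2.32\,(\mu_2 - 34.97)^2 \le \frac{2 \times 100}{99} \times 4.83 = 9.76$$
This is an ellipse in $p = 2$ dimensional space (it can be plotted). In higher dimensions an ellipsoidal confidence region is obtained.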
4.5 Likelihood ratio test

Given a data matrix $X$ of observations on a random vector $x$ whose distribution depends on a vector of parameters $\theta$, the likelihood ratio for testing the null hypothesis $H_0: \theta \in \Omega_0$ against the alternative $H_1: \theta \in \Omega_1$ is defined as
$$\lambda = \frac{\sup_{\theta \in \Omega_0} L}{\sup_{\theta \in \Omega_1} L}$$
where $L = L(\theta; X)$ is the likelihood function. In a likelihood ratio test (LRT) we reject $H_0$ for low values of $\lambda$, i.e. if $\lambda < c$, where $c$ is chosen so that the probability of Type I error is $\alpha$. If we define $l_0^* = -2\log L_0^*$, where $L_0^*$ is the value of the numerator, and similarly $l_1^* = -2\log L_1^*$, the rejection criterion takes the form
$$-2\log\lambda = -2\log\frac{L_0^*}{L_1^*} \qquad (4.7)$$
$$\phantom{-2\log\lambda} = l_0^* - l_1^* > k \qquad (4.8)$$

Result When $H_0$ is true and $n$ is large, the log likelihood ratio (4.8) has the $\chi^2$ distribution on $r$ degrees of freedom, $\chi^2_r$, where $r$ equals the number of free parameters under $H_1$ minus the number of free parameters under $H_0$.

4.6 LRT for a mean when $\Sigma$ is known

$H_0: \mu = \mu_0$, a given value, when $\Sigma$ is known. Given a random sample from $N(\mu, \Sigma)$ resulting in $\bar{x}$ and $S$, the likelihood given in (3.8b) is (to within an additive constant)
$$l(\mu, \Sigma) = n\big\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + (\bar{x} - \mu)^T \Sigma^{-1}(\bar{x} - \mu)\big\} \qquad (4.9)$$
Under $H_0$ the value of $\mu$ is known and
$$l_0^* = l(\mu_0, \Sigma) = n\big\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + (\bar{x} - \mu_0)^T \Sigma^{-1}(\bar{x} - \mu_0)\big\}$$
Under $H_1$, with no restriction on $\mu$, the m.l.e. of $\mu$ is $\hat{\mu} = \bar{x}$. Thus
$$l_1^* = n\big\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S)\big\}$$
Therefore
$$-2\log\lambda = l_0^* - l_1^* = n\,(\bar{x} - \mu_0)^T \Sigma^{-1}(\bar{x} - \mu_0) \qquad (4.10)$$
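which is $n$ times the Mahalanobis distance of $\bar{x}$ from $\mu_0$. Note the similarity with Hotelling's $T^2$ statistic. Given that the distribution of $\bar{x}$ under $H_0$ is $\bar{x} \sim N_p\big(\mu_0, \tfrac{1}{n}\Sigma\big)$, (4.10) may be written, using the transformation $y = \sqrt{n}\,\Sigma^{-1/2}(\bar{x} - \mu_0)$ to a standard set of independent $N(0, 1)$ variates, as
$$-2\log\lambda = y^T y = \sum_{i=1}^{p} y_i^2 \qquad (4.11)$$
so we have the exact distribution
$$-2\log\lambda \sim \chi^2_p \qquad (4.12)$$
showing that in this case the asymptotic distribution of $-2\log\lambda$ is exact for the small sample case.

Example Measurements of the length of skull were made on a sample of first and second sons from 25 families:
$$\bar{x} = \begin{pmatrix} 185.72 \\ 183.84 \end{pmatrix}, \qquad S = \begin{pmatrix} 91.48 & 66.88 \\ 66.88 & 96.78 \end{pmatrix}$$
Assuming that in fact $\Sigma = \begin{pmatrix} 100 & 0 \\ 0 & 100 \end{pmatrix}$, test at the 0.05 level the hypothesis $H_0: \mu = (182,\ 182)^T$.

Solution
$$-2\log\lambda = 25\begin{pmatrix} 3.72 & 1.84 \end{pmatrix}\begin{pmatrix} 0.01 & 0 \\ 0 & 0.01 \end{pmatrix}\begin{pmatrix} 3.72 \\ 1.84 \end{pmatrix} = 0.25\,\big(3.72^2 + 1.84^2\big) = 4.31$$
Since $\chi^2_2(0.05) = 5.99$, do not reject $H_0$.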
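The statistic (4.10) for the skull-length example can be evaluated in a few lines. This sketch takes the summary values of the example to be $n = 25$, $\bar{x} = (185.72, 183.84)^T$, $\mu_0 = (182, 182)^T$ and $\Sigma = 100\,I_2$; these readings of the example, the function name `lrt_known_sigma` and the hard-coded critical value 5.99 quoted in the text are assumptions of the sketch.

```python
import numpy as np

def lrt_known_sigma(x_bar, mu0, Sigma, n):
    """-2 log(lambda) = n (x_bar - mu0)^T Sigma^{-1} (x_bar - mu0), eq. (4.10)."""
    d = x_bar - mu0
    return n * d @ np.linalg.solve(Sigma, d)

# Assumed summary values for the skull-length example.
n = 25
x_bar = np.array([185.72, 183.84])
mu0 = np.array([182.0, 182.0])
Sigma = 100.0 * np.eye(2)

stat = lrt_known_sigma(x_bar, mu0, Sigma, n)
critical_value = 5.99            # chi-square on p = 2 d.f., upper 5% point, as quoted in the text
print("-2 log lambda =", round(stat, 2))
print("reject H0:", stat > critical_value)
```

With these values the statistic is about 4.31, below the 5% point, so $H_0$ is not rejected.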
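4.7 LRT for a mean when $\Sigma$ is unknown

Consider the test of hypothesis
$$H_0: \mu = \mu_0 \qquad H_1: \mu \ne \mu_0$$
when $\Sigma$ is unknown. In this case $\Sigma$ must be estimated under $H_0$ and also under $H_1$.

Under $H_0$, writing $d_0$ for $\bar{x} - \mu_0$:
$$l(\mu_0, \Sigma) = n\big\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + (\bar{x} - \mu_0)^T \Sigma^{-1}(\bar{x} - \mu_0)\big\} \qquad (4.13a)$$
$$= n\big\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + d_0^T \Sigma^{-1}d_0\big\} \qquad (4.13b)$$
$$= n\big\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + \mathrm{tr}\big(d_0^T \Sigma^{-1}d_0\big)\big\} \qquad (4.13c)$$
$$= n\big\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + \mathrm{tr}\big(\Sigma^{-1}d_0 d_0^T\big)\big\} \qquad (4.13d)$$
$$= n\big\{\log|\Sigma| + \mathrm{tr}\big(\Sigma^{-1}(S + d_0 d_0^T)\big)\big\} \qquad (4.13e)$$

Under $H_1$:
$$l(\hat{\mu}, \Sigma) = n\big\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S) + (\bar{x} - \hat{\mu})^T \Sigma^{-1}(\bar{x} - \hat{\mu})\big\} \qquad (4.14a)$$
$$= n\big\{\log|\Sigma| + \mathrm{tr}(\Sigma^{-1}S)\big\} \qquad (4.14b)$$
$$l(\hat{\mu}, \hat{\Sigma}) = n\big\{\log|S| + \mathrm{tr}(S^{-1}S)\big\} \qquad (4.14c)$$
$$= n\big\{\log|S| + \mathrm{tr}(I_p)\big\} \qquad (4.14d)$$
$$l_1^* = n\log|S| + np \qquad (4.14e)$$
after substitution of the m.l.e.s $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$ obtained previously. Comparing (4.13e) with (4.14b), we see that the m.l.e. of $\Sigma$ under $H_0$ must be
$$\hat{\Sigma} = S + d_0 d_0^T$$
and that the corresponding value of $l = -2\log L$ is
$$l_0^* = n\log|S + d_0 d_0^T| + np$$
Hence
$$l_0^* - l_1^* = n\log|S + d_0 d_0^T| - n\log|S| = n\log|S^{-1}| + n\log|S + d_0 d_0^T|$$
$$= n\log\big|S^{-1}(S + d_0 d_0^T)\big| = n\log\big|I_p + S^{-1}d_0 d_0^T\big| = n\log\big(1 + d_0^T S^{-1}d_0\big) \qquad (4.15)$$
making use of the useful matrix result proved in Section 1.8.3, that $|I_p + uv^T| = 1 + v^T u$.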
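Since

\[ -2\log\lambda = n\log\left(1 + \frac{T^2}{n-1}\right) \tag{4.16} \]

we see that $\lambda$ and $T^2$ are monotonically related. Therefore we can conclude that the LRT of $H_0: \mu = \mu_0$ when $\Sigma$ is unknown is equivalent to use of Hotelling's $T^2$ statistic.

4.8 LRT for $\Sigma = \Sigma_0$ with $\mu$ unknown

$H_0: \Sigma = \Sigma_0$ against $H_1: \Sigma \neq \Sigma_0$ when $\mu$ is unknown.

Under $H_0$ we substitute $\hat{\mu} = \bar{x}$ into

\[ l(\hat{\mu}, \Sigma_0) = n\left\{ \log|\Sigma_0| + \mathrm{tr}\,\Sigma_0^{-1}S + (\bar{x}-\hat{\mu})^T \Sigma_0^{-1} (\bar{x}-\hat{\mu}) \right\} \]

giving

\[ l_0^* = n\left\{ \log|\Sigma_0| + \mathrm{tr}\,\Sigma_0^{-1}S \right\} \tag{4.17} \]

Under $H_1$ we substitute the unrestricted m.l.e.s $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$, giving as in (4.14e)

\[ l_1^* = n\log|S| + np \tag{4.18} \]

so that

\[ -2\log\lambda = l_0^* - l_1^* = n\left\{ \log|\Sigma_0| + \mathrm{tr}\,\Sigma_0^{-1}S - \log|S| - p \right\} = n\left\{ -\log|\Sigma_0^{-1}S| + \mathrm{tr}\,\Sigma_0^{-1}S - p \right\} \tag{4.19} \]

This statistic depends only on the eigenvalues of the positive definite matrix $\Sigma_0^{-1}S$ and has the property that $l_0^* - l_1^* = -2\log\lambda \to 0$ as $S$ approaches $\Sigma_0$.

Let $A$ be the arithmetic mean and $G$ the geometric mean of the eigenvalues of $\Sigma_0^{-1}S$. Then

\[ \mathrm{tr}\,\Sigma_0^{-1}S = pA, \qquad |\Sigma_0^{-1}S| = G^p \]
\[ -2\log\lambda = n\left\{ pA - p\log G - p \right\} = np\left\{ A - \log G - 1 \right\} \tag{4.20} \]

The general result for the distribution of (4.20) for large $n$ gives

\[ l_0^* - l_1^* \sim \chi^2_r \tag{4.21} \]

where $r = \tfrac{1}{2}p(p+1)$ is the number of independent parameters in $\Sigma$.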
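As a rough check that the trace/determinant form (4.19) and the eigenvalue form (4.20) agree, the following sketch evaluates both. It assumes NumPy; the function name is hypothetical, and the inputs reuse $S$ and the hypothesised $\Sigma_0 = 100 I$ from the skull example purely as an illustration (the notes do not carry out this test).

```python
import numpy as np


def lrt_sigma_equals_sigma0(S, sigma0, n):
    """Return (-2 log lambda via eigenvalues, via trace/determinant, d.f.)."""
    S = np.asarray(S, dtype=float)
    sigma0 = np.asarray(sigma0, dtype=float)
    p = S.shape[0]
    M = np.linalg.solve(sigma0, S)               # Sigma0^{-1} S
    eig = np.linalg.eigvals(M).real              # real and positive for p.d. inputs
    A = eig.mean()                               # arithmetic mean of eigenvalues
    G = np.exp(np.log(eig).mean())               # geometric mean of eigenvalues
    stat_eig = n * p * (A - np.log(G) - 1.0)     # form (4.20)
    _, logdet = np.linalg.slogdet(M)
    stat_direct = n * (np.trace(M) - logdet - p) # form (4.19)
    df = p * (p + 1) // 2
    return float(stat_eig), float(stat_direct), df


# Illustrative inputs only: S from the skull example, Sigma0 = 100 I.
S = [[91.48, 66.88], [66.88, 96.78]]
stat_eig, stat_direct, df = lrt_sigma_equals_sigma0(S, 100.0 * np.eye(2), n=25)
print(stat_eig, stat_direct, df)
```

The statistic would then be compared with the upper tail of $\chi^2_r$ with $r$ = `df`.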
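4.10 Test for sphericity

A covariance matrix is said to have the property of "sphericity" if

\[ \Sigma = kI_p \tag{4.22} \]

for some $k$. We see that this is a special case of the more general situation $\Sigma = k\Sigma_0$ treated in Section (3.3.3). The same procedure can be applied. The general likelihood expression for a sample from the MVN distribution is, with $d = \bar{x} - \mu$,

\[ -2\log L = n\left\{ \log|\Sigma| + \mathrm{tr}\,\Sigma^{-1}\left(S + dd^T\right) \right\} \tag{4.23} \]

Under $H_0$: $\Sigma = kI_p$ and $\hat{\mu} = \bar{x}$, so

\[ -2\log L = n\left\{ \log|kI_p| + \mathrm{tr}\left(k^{-1}S\right) \right\} = n\left\{ p\log k + k^{-1}\,\mathrm{tr}\,S \right\} \]

Setting $\frac{\partial}{\partial k}\left[-2\log L\right] = 0$ at a minimum:

\[ \frac{p}{k} - \frac{1}{k^2}\,\mathrm{tr}\,S = 0 \quad\Rightarrow\quad \hat{k} = \frac{\mathrm{tr}\,S}{p} \tag{4.24} \]

which is in fact the arithmetic mean $A$ of the eigenvalues of $S$. Substituting back into (4.23) gives

\[ l_0^* = np\left(\log A + 1\right) \]

Under $H_1$: $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$, so

\[ l_1^* = n\log|S| + np = np\left(\log G + 1\right) \]

where $G$ is the geometric mean of the eigenvalues of $S$. Thus

\[ -2\log\lambda = l_0^* - l_1^* = np\log\frac{A}{G} \tag{4.25} \]

The number of free parameters contained in $\Sigma$ is 1 under $H_0$ and $\tfrac{1}{2}p(p+1)$ under $H_1$. Hence the appropriate distribution for comparing $-2\log\lambda$ is $\chi^2_r$ where

\[ r = \tfrac{1}{2}p(p+1) - 1 = \tfrac{1}{2}(p-1)(p+2) \tag{4.26} \]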
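A minimal sketch of (4.25) and (4.26), assuming NumPy. The function name is hypothetical, and the skull-example $S$ with $n = 25$ is reused only to have concrete numbers; no such test appears in the notes.

```python
import numpy as np


def sphericity_lrt(S, n):
    """Return (-2 log lambda, d.f.) for H0: Sigma = k I_p, using (4.25) and (4.26)."""
    S = np.asarray(S, dtype=float)
    p = S.shape[0]
    A = np.trace(S) / p                  # arithmetic mean of eigenvalues (= k-hat)
    _, logdet = np.linalg.slogdet(S)
    log_G = logdet / p                   # log of the geometric mean of eigenvalues
    stat = n * p * (np.log(A) - log_G)
    df = (p - 1) * (p + 2) // 2
    return float(stat), df


# Illustrative inputs only.
stat, df = sphericity_lrt([[91.48, 66.88], [66.88, 96.78]], n=25)
print(stat, df)
```

Because $A \ge G$ always holds, the statistic is non-negative, and it is zero exactly when all eigenvalues of $S$ are equal.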
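4.11 Test for independence

Independence of the variables $x_1, \ldots, x_p$ is manifest by a diagonal covariance matrix

\[ \Sigma = \mathrm{diag}(\sigma_{11}, \ldots, \sigma_{pp}) \tag{4.27} \]

We consider $H_0$: $\Sigma$ is diagonal, against the general alternative $H_1$: $\Sigma$ is unrestricted.

Under $H_0$ it is clear that we will find $\hat{\sigma}_{ii} = s_{ii}$, because the estimators of $\sigma_{ii}$ for each $x_i$ are independent. We can also show this formally. With $\hat{\mu} = \bar{x}$ (so that $d = 0$),

\[ -2\log L = n\left\{ \log|\Sigma| + \mathrm{tr}\,\Sigma^{-1}\left(S + dd^T\right) \right\} = n\left\{ \sum_{i=1}^{p} \log\sigma_{ii} + \sum_{i=1}^{p} \frac{s_{ii}}{\sigma_{ii}} \right\} \]
\[ \frac{\partial}{\partial\sigma_{ii}}\left(-2\log L\right) = 0 \quad\Rightarrow\quad \frac{1}{\sigma_{ii}} - \frac{s_{ii}}{\sigma_{ii}^2} = 0 \quad\Rightarrow\quad \hat{\sigma}_{ii} = s_{ii} \]

Therefore

\[ l_0^* = n\left\{ \sum_{i=1}^{p} \log s_{ii} + p \right\} = n\left\{ \log|D| + p \right\} \]

where $D = \mathrm{diag}(s_{11}, \ldots, s_{pp})$. Under $H_1$ as before we find

\[ l_1^* = n\log|S| + np \]

Therefore

\[ -2\log\lambda = l_0^* - l_1^* = n\left[\log|D| - \log|S|\right] = -n\log|D^{-1}S| = -n\log|D^{-1/2} S D^{-1/2}| = -n\log|R| \tag{4.28} \]

The number of free parameters contained in $\Sigma$ is $p$ under $H_0$ and $\tfrac{1}{2}p(p+1)$ under $H_1$. Hence the appropriate distribution for comparing $-2\log\lambda$ is $\chi^2_r$ where

\[ r = \tfrac{1}{2}p(p+1) - p = \tfrac{1}{2}p(p-1) \tag{4.29} \]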
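The statistic (4.28) needs only the sample correlation matrix. The sketch below assumes NumPy; the function name is hypothetical and the inputs (the skull-example $S$, $n = 25$) are used only for illustration.

```python
import numpy as np


def independence_lrt(S, n):
    """Return (-n log|R|, d.f., R) for H0: Sigma diagonal, as in (4.28)-(4.29)."""
    S = np.asarray(S, dtype=float)
    p = S.shape[0]
    scale = 1.0 / np.sqrt(np.diag(S))    # diagonal entries of D^{-1/2}
    R = S * np.outer(scale, scale)       # R = D^{-1/2} S D^{-1/2}
    _, logdet_R = np.linalg.slogdet(R)
    stat = -n * logdet_R
    df = p * (p - 1) // 2
    return float(stat), df, R


# Illustrative inputs only.
stat, df, R = independence_lrt([[91.48, 66.88], [66.88, 96.78]], n=25)
print(stat, df)
print(R)
```

For $p = 2$ the statistic reduces to $-n\log(1 - r_{12}^2)$, which gives a quick hand check.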
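4.12 Simultaneous confidence intervals (Scheffé, Roy & Bose)

The union-intersection method for deriving Hotelling's $T^2$ statistic provides "simultaneous confidence intervals" for the parameters when $\Sigma$ is unknown. Following the union-intersection derivation earlier in this chapter, let

\[ T^2 = (n-1)(\bar{x}-\mu)^T S^{-1} (\bar{x}-\mu) \tag{4.30} \]

where $\mu$ is the unknown (true) mean. Let $t(a)$ be the univariate $t$ statistic corresponding to the linear compound $y = a^T x$. Then $\max_a t^2(a) = T^2$, and for all $p$-vectors $a$

\[ t^2(a) \le T^2 \tag{4.31} \]

where

\[ t(a) = \frac{\bar{y} - \mu_y}{s_y / \sqrt{n-1}} = \frac{\sqrt{n-1}\; a^T(\bar{x}-\mu)}{\sqrt{a^T S a}} \tag{4.32} \]

From the earlier result on the distribution of $T^2$,

\[ T^2 \sim \frac{(n-1)p}{n-p}\, F_{p,\,n-p} \]

so

\[ \Pr\left\{ T^2 \le \frac{(n-1)p}{n-p}\, F_{p,\,n-p}(\alpha) \right\} = 1 - \alpha \]

and therefore, from (4.31), for all $p$-vectors $a$

\[ \Pr\left\{ t^2(a) \le \frac{(n-1)p}{n-p}\, F_{p,\,n-p}(\alpha) \right\} = 1 - \alpha \tag{4.33} \]

Substituting from (4.32), the confidence statement in (4.33) is: with probability $1-\alpha$, for all $p$-vectors $a$,

\[ \left| a^T\bar{x} - a^T\mu \right| \le \sqrt{\frac{(n-1)p}{n-p}\, F_{p,\,n-p}(\alpha)}\; \sqrt{\frac{a^T S a}{n-1}} = K_\alpha \sqrt{\frac{a^T S a}{n-1}}, \text{ say,} \tag{4.34} \]

where $K_\alpha$ is the constant

\[ K_\alpha = \sqrt{\frac{(n-1)p}{n-p}\, F_{p,\,n-p}(\alpha)} \tag{4.35} \]

A $100(1-\alpha)\%$ confidence interval for the linear compound $a^T\mu$ is therefore

\[ a^T\bar{x} \pm K_\alpha \sqrt{\frac{a^T S a}{n-1}} \tag{4.36} \]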
How can we apply this result? We might be interested in a defined set of linear combinations (linear compounds) of $\mu$. The $i$th component of $\mu$ is, for example, the linear compound defined by $a^T = (0, \ldots, 1, \ldots, 0)$, the unit vector with a single 1 in the $i$th position. For a large number of such sets of CIs we would expect $100(1-\alpha)\%$ to contain no mis-statements, while $100\alpha\%$ would contain at least one mis-statement.

We can relate the $T^2$ confidence intervals to the $T^2$ test of $H_0: \mu = \mu_0$. If this $H_0$ is rejected at significance level $\alpha$ then there exists at least one vector $a$ such that the interval (4.36) does not include the value $a^T\mu_0$.

NB. If the covariance matrix $S_u$ (with denominator $n-1$) is supplied, then in (4.36) $\sqrt{\dfrac{a^T S a}{n-1}}$ may be replaced by $\sqrt{\dfrac{a^T S_u a}{n}}$.

4.13 The Bonferroni method

This provides another way to construct simultaneous CIs for a small number of linear compounds of $\mu$ whilst controlling the overall level of confidence. Consider a set of events $A_1, A_2, \ldots, A_m$:

\[ \Pr(A_1 \cap \cdots \cap A_m) = 1 - \Pr\left(\bar{A}_1 \cup \cdots \cup \bar{A}_m\right) \]

From the additive law of probabilities

\[ \Pr\left(\bar{A}_1 \cup \cdots \cup \bar{A}_m\right) \le \sum_{i=1}^{m} \Pr\left(\bar{A}_i\right) \]

Therefore

\[ \Pr(A_1 \cap \cdots \cap A_m) \ge 1 - \sum_{i=1}^{m} \Pr\left(\bar{A}_i\right) \tag{4.37} \]

Let $C_k$ denote a confidence statement about the value of some linear compound $a_k^T\mu$ with $\Pr(C_k \text{ true}) = 1 - \alpha_k$. Then

\[ \Pr(\text{all } C_k \text{ true}) \ge 1 - (\alpha_1 + \cdots + \alpha_m) \tag{4.38} \]

Therefore we can control the overall error rate, given by $\alpha_1 + \cdots + \alpha_m = \alpha$, say. For example, in order to construct simultaneous $100(1-\alpha)\%$ CIs for all $p$ components $\mu_k$ of $\mu$ we could choose $\alpha_k = \alpha/p$ $(k = 1, \ldots, p)$, leading to

\[ \bar{x}_1 \pm t_{n-1}\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{11}}{n}}, \quad \ldots, \quad \bar{x}_p \pm t_{n-1}\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{pp}}{n}} \]

if $s_{ii}$ derives from $S_u$.
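Example

Intelligence scores data on $n = 101$ subjects:

\[ \bar{x} = \begin{pmatrix} 55.24 \\ 34.97 \end{pmatrix}, \qquad S_u = \begin{pmatrix} 210.54 & 126.99 \\ 126.99 & 119.68 \end{pmatrix} \]

1. Construct 99% simultaneous confidence intervals for $\mu_1$, $\mu_2$ and $\mu_1 - \mu_2$.

For $\mu_1$ take $a^T = (1, 0)$:

\[ a^T\bar{x} = \begin{pmatrix} 1 & 0 \end{pmatrix}\begin{pmatrix} 55.24 \\ 34.97 \end{pmatrix} = 55.24, \qquad a^T S_u a = 210.54 \]

Now take $\alpha = .01$:

\[ K_\alpha^2 = \frac{(n-1)p}{n-p}\, F_{p,\,n-p}(\alpha) = \frac{100 \times 2}{99}\, F_{2,99}(.01), \qquad K_\alpha = 3.12 \]

taking $F_{2,99}(.01) = 4.83$ (approx). Therefore the CI for $\mu_1$ is

\[ 55.24 \pm 3.12\sqrt{\frac{210.54}{101}} = 55.24 \pm 4.50 \]

giving an interval $(50.7,\ 59.7)$.

For $\mu_2$ we already have $K_\alpha$; take $a^T = (0, 1)$, then

\[ a^T\bar{x} = 34.97, \qquad a^T S_u a = 119.68 \]

The CI for $\mu_2$ is

\[ 34.97 \pm 3.12\sqrt{\frac{119.68}{101}} = 34.97 \pm 3.40 \]

giving an interval $(31.6,\ 38.4)$.

For $\mu_1 - \mu_2$ take $a^T = (1, -1)$:

\[ a^T\bar{x} = \begin{pmatrix} 1 & -1 \end{pmatrix}\begin{pmatrix} 55.24 \\ 34.97 \end{pmatrix} = 20.27 \]
\[ a^T S_u a = \begin{pmatrix} 1 & -1 \end{pmatrix}\begin{pmatrix} 210.54 & 126.99 \\ 126.99 & 119.68 \end{pmatrix}\begin{pmatrix} 1 \\ -1 \end{pmatrix} = 210.54 - 2(126.99) + 119.68 = 76.24 \]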
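The CI for $\mu_1 - \mu_2$ is

\[ 20.27 \pm 3.12\sqrt{\frac{76.24}{101}} = 20.27 \pm 2.71, \quad\text{i.e. } (17.6,\ 23.0) \]

2. Construct CIs for $\mu_1$, $\mu_2$ by the Bonferroni method. Use $\alpha = .01$.

Individual CIs are constructed using $\alpha_k = .01/2 = .005$ $(k = 1, 2)$. Then

\[ t_{100}\left(\frac{\alpha_k}{2}\right) = t_{100}(.0025) \simeq \Phi^{-1}(.9975) = 2.81 \]

The CI for $\mu_1$ is

\[ 55.24 \pm 2.81\sqrt{\frac{210.54}{101}} = 55.24 \pm 4.06, \quad\text{i.e. } (51.2,\ 59.3) \]

and for $\mu_2$ is

\[ 34.97 \pm 2.81\sqrt{\frac{119.68}{101}} = 34.97 \pm 3.06, \quad\text{i.e. } (31.9,\ 38.0) \]

Comparing the CIs obtained by the two methods, both use the same standard errors, so the ratio of their widths is $3.12/2.81 \approx 1.11$: the simultaneous CIs for $\mu_1$ and $\mu_2$ are about 11% wider than the corresponding Bonferroni CIs.

NB. If we had required 99% Bonferroni CIs for $\mu_1$, $\mu_2$ and $\mu_1 - \mu_2$, then $m = 3$ in (4.38) and $\alpha/(2m) = .01/6 = .0017$. The corresponding percentage point of $t$ would be

\[ t_{100}(.0017) \simeq \Phi^{-1}(.9983) = 2.93 \]

leading to slightly wider CIs than those obtained above.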
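The two sets of intervals in this example can be reproduced with the standard library alone. In the sketch below the critical value `F_CRIT = 4.83` is the one quoted in the text, the normal quantile stands in for $t_{100}$ exactly as the notes do, and the helper names `half_width` and `interval` are illustrative.

```python
import math
from statistics import NormalDist

n, p, alpha = 101, 2, 0.01
xbar = (55.24, 34.97)
S_u = ((210.54, 126.99), (126.99, 119.68))   # covariance matrix with divisor n - 1
F_CRIT = 4.83                                # F_{2,99}(0.01), as quoted in the text


def half_width(a, multiplier):
    """Return multiplier * sqrt(a^T S_u a / n) for a 2-vector a."""
    quad = sum(a[i] * S_u[i][j] * a[j] for i in range(p) for j in range(p))
    return multiplier * math.sqrt(quad / n)


def interval(a, multiplier):
    """Return (lower, upper) limits for the linear compound a^T mu."""
    centre = sum(ai * xi for ai, xi in zip(a, xbar))
    h = half_width(a, multiplier)
    return centre - h, centre + h


K = math.sqrt((n - 1) * p / (n - p) * F_CRIT)   # simultaneous constant, (4.35)
m = 2                                           # number of Bonferroni statements
z = NormalDist().inv_cdf(1 - alpha / (2 * m))   # normal approximation to t_100

for label, a in [("mu1", (1, 0)), ("mu2", (0, 1)), ("mu1 - mu2", (1, -1))]:
    lo, hi = interval(a, K)
    print(f"simultaneous {label}: ({lo:.2f}, {hi:.2f})")
for label, a in [("mu1", (1, 0)), ("mu2", (0, 1))]:
    lo, hi = interval(a, z)
    print(f"Bonferroni   {label}: ({lo:.2f}, {hi:.2f})")
print(f"width ratio K / z = {K / z:.3f}")
```

Small differences in the last digit relative to the worked figures are to be expected, since the text rounds the multipliers to 3.12 and 2.81 before use.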
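4.14 Two sample procedures

Suppose we have two independent random samples $\{x_{11}, \ldots, x_{1n_1}\}$ and $\{x_{21}, \ldots, x_{2n_2}\}$ of sizes $n_1$, $n_2$ from two populations

\[ \Pi_1: x \sim N_p(\mu_1, \Sigma), \qquad \Pi_2: x \sim N_p(\mu_2, \Sigma) \]

giving rise to sample means $\bar{x}_1$, $\bar{x}_2$ and sample covariance matrices $S_1$, $S_2$. Note the assumption of a common covariance matrix $\Sigma$. We consider testing

\[ H_0: \mu_1 = \mu_2 \quad\text{against}\quad H_1: \mu_1 \neq \mu_2 \]

Let $d = \bar{x}_1 - \bar{x}_2$. Under $H_0$

\[ d \sim N_p\left(0, \left(\frac{1}{n_1} + \frac{1}{n_2}\right)\Sigma\right) \]

(a) Case of $\Sigma$ known

Analogously to the one sample case,

\[ \left(\frac{n_1 n_2}{n}\right)^{1/2} \Sigma^{-1/2} d \sim N_p(0, I_p), \qquad \frac{n_1 n_2}{n}\, d^T \Sigma^{-1} d \sim \chi^2_p \]

where $n = n_1 + n_2$.

(b) Case of $\Sigma$ unknown

We have the Wishart distributed quantities

\[ n_1 S_1 \sim W_p(\Sigma, n_1 - 1), \qquad n_2 S_2 \sim W_p(\Sigma, n_2 - 1) \]

Let

\[ S_p = \frac{n_1 S_1 + n_2 S_2}{n - 2} \]

be the pooled estimator of the covariance matrix $\Sigma$. Then from the additive properties of the Wishart distribution $(n-2)S_p$ has the Wishart distribution $W_p(\Sigma, n-2)$, and

\[ \left(\frac{n_1 n_2}{n}\right)^{1/2} d \sim N_p(0, \Sigma) \]

It may be shown that

\[ T^2 = \frac{n_1 n_2}{n}\, d^T S_p^{-1} d \]

has the distribution of a Hotelling's $T^2$ statistic. In fact

\[ T^2 \sim \frac{(n-2)p}{n-p-1}\, F_{p,\,n-p-1} \tag{4.39} \]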
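A small sketch of case (b), assuming NumPy. The function `two_sample_t2` and the data rows are illustrative and not taken from the notes; the function returns $T^2$ together with the equivalent $F$ statistic implied by (4.39).

```python
import numpy as np


def two_sample_t2(x1, x2):
    """Return (T^2, F statistic, (df1, df2)) for H0: mu1 = mu2, common Sigma."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, p = x1.shape
    n2 = x2.shape[0]
    n = n1 + n2
    d = x1.mean(axis=0) - x2.mean(axis=0)
    c1 = x1 - x1.mean(axis=0)
    c2 = x2 - x2.mean(axis=0)
    # n1 S1 + n2 S2 is the sum of the two within-sample SSP matrices.
    S_p = (c1.T @ c1 + c2.T @ c2) / (n - 2)
    t2 = (n1 * n2 / n) * float(d @ np.linalg.solve(S_p, d))
    f_stat = t2 * (n - p - 1) / ((n - 2) * p)   # rescaling from (4.39)
    return t2, f_stat, (p, n - p - 1)


# Illustrative data only (rows are observations, columns are variables).
group1 = [[5.1, 3.5], [4.9, 3.0], [4.7, 3.2], [5.0, 3.6], [5.4, 3.9]]
group2 = [[7.0, 3.2], [6.4, 3.2], [6.9, 3.1], [5.5, 2.3], [6.5, 2.8]]
t2, f_stat, dfs = two_sample_t2(group1, group2)
print(t2, f_stat, dfs)
```

For $p = 1$ the statistic reduces to the square of the usual pooled two-sample $t$ statistic, which is a convenient sanity check.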
4.15 Multi-sample procedures (MANOVA)

We consider the case of $k$ samples from populations $\Pi_1, \ldots, \Pi_k$. The sample from population $\Pi_i$ is of size $n_i$. By analogy with the univariate case we can decompose the SSP matrix into orthogonal parts. This decomposition can be represented as a Multivariate Analysis of Variance (MANOVA) table.

The MANOVA model is

\[ x_{ij} = \mu + \tau_i + e_{ij}, \qquad j = 1, \ldots, n_i \text{ and } i = 1, \ldots, k \tag{4.40} \]

where the $e_{ij}$ are independent $N_p(0, \Sigma)$ variables. Here the parameter vector $\mu$ is the overall (grand) mean and $\tau_i$ is the $i$th treatment effect, with

\[ \sum_{i=1}^{k} n_i \tau_i = 0 \tag{4.41} \]

Define the $i$th sample mean as $\bar{x}_i = \dfrac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}$. The Between Groups sum of squares and cross-products (SSP) matrix is

\[ B = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T \tag{4.42} \]

The Grand Mean is $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{k} n_i \bar{x}_i$ and the Total SSP matrix is

\[ T = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})(x_{ij} - \bar{x})^T \tag{4.43} \]

It can be shown algebraically that $T = B + W$, where $W$ is the Within Groups (or residual) SSP matrix given by

\[ W = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^T \tag{4.44} \]

The MANOVA table is

Source of variation | Matrix of SS and cross-products (SSP) | Degrees of freedom (d.f.)
Treatment | $B = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T$ | $k - 1$
Residual | $W = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^T$ | $\sum_{i=1}^{k} n_i - k$
Total (corrected for the mean) | $T = B + W = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})(x_{ij} - \bar{x})^T$ | $\sum_{i=1}^{k} n_i - 1$
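The decomposition $T = B + W$ can be confirmed numerically. The following is a minimal sketch assuming NumPy; `manova_ssp` and the three small groups are hypothetical and serve only to illustrate (4.42) to (4.44).

```python
import numpy as np


def manova_ssp(groups):
    """Return (B, W, T): between-groups, within-groups and total SSP matrices."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_x = np.vstack(groups)
    grand_mean = all_x.mean(axis=0)      # equals the n_i-weighted mean of group means
    p = all_x.shape[1]
    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for g in groups:
        dev = g.mean(axis=0) - grand_mean
        B += g.shape[0] * np.outer(dev, dev)
        centred = g - g.mean(axis=0)
        W += centred.T @ centred
    centred_all = all_x - grand_mean
    T = centred_all.T @ centred_all
    return B, W, T


# Hypothetical data: k = 3 groups of unequal size, p = 2 variables.
groups = [
    [[1.0, 2.0], [2.0, 1.5], [1.5, 2.5]],
    [[3.0, 3.5], [3.5, 2.5], [4.0, 3.0], [2.5, 3.0]],
    [[5.0, 1.0], [4.5, 1.5], [5.5, 0.5]],
]
B, W, T = manova_ssp(groups)
print(np.allclose(T, B + W))
```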
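We are interested in testing the hypothesis

\[ H_0: \mu_1 = \mu_2 = \cdots = \mu_k \tag{4.42} \]

that the samples in fact come from the same population, against the general alternative

\[ H_1: \mu_i \neq \mu_j \text{ for at least one pair } i \neq j \tag{4.43} \]

We can derive a likelihood ratio test statistic known as Wilks' $\Lambda$.

Under $H_0$ the m.l.e.s are $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S$, leading to the maximized log likelihood (minimum of $-2\log L$)

\[ l_0^* = np + n\log|S| \tag{4.44} \]

Under $H_1$ the m.l.e.s are

\[ \hat{\mu}_i = \bar{x}_i, \qquad \hat{\Sigma} = \frac{1}{n}W \quad\text{where}\quad W = \sum_{i=1}^{k} W_i = \sum_{i=1}^{k} n_i S_i \]

This follows from

\[ l_1^* = \min_{\Sigma,\, d_i}\left\{ n\log|\Sigma| + \sum_{i=1}^{k} n_i\, \mathrm{tr}\,\Sigma^{-1}\left(S_i + d_i d_i^T\right) \right\} = \min_{\Sigma}\left\{ n\log|\Sigma| + n\, \mathrm{tr}\,\Sigma^{-1}\left(\frac{1}{n}\sum_{i=1}^{k} n_i S_i\right) \right\} \]

since $\hat{d}_i = \bar{x}_i - \hat{\mu}_i = 0$. Hence $\hat{\Sigma} = \frac{1}{n}W$ and

\[ l_1^* = np + n\log\left|\tfrac{1}{n}W\right| \tag{4.45} \]

Therefore, since $T = nS$,

\[ l_0^* - l_1^* = n\log\frac{|T|}{|W|} = -n\log\Lambda \tag{4.46} \]

where $\Lambda = |W|/|T|$ is known as Wilks' $\Lambda$ statistic. We reject $H_0$ for small values of $\Lambda$ or large values of $-n\log\Lambda$. Asymptotically, the rejection region is the upper tail of a $\chi^2_{p(k-1)}$ distribution. Under $H_0$ the unknown $\mu$ has $p$ parameters, and under $H_1$ the number of parameters for $\mu_1, \ldots, \mu_k$ is $pk$. Hence the d.f. of the $\chi^2$ is $p(k-1)$.

Apart from this asymptotic result, other approximate distributions (notably Bartlett's approximation) are available, but the details are outside the scope of this course.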
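4.15.1 Calculation of Wilks' $\Lambda$

Result

Let $\lambda_1, \ldots, \lambda_p$ be the eigenvalues of $W^{-1}B$. Then

\[ \Lambda = \prod_{j=1}^{p} (1 + \lambda_j)^{-1} \tag{4.47} \]

Proof

\[ \Lambda^{-1} = \frac{|T|}{|W|} = \frac{|W + B|}{|W|} = \left|W^{-1}(W + B)\right| = \left|I + W^{-1}B\right| = \prod_{j=1}^{p} (1 + \lambda_j) \tag{4.48} \]

by the useful identity proved earlier in the notes.

Case $k = 2$

We show that use of Wilks' $\Lambda$ for $k = 2$ groups is equivalent to using Hotelling's $T^2$ statistic. Specifically, we show that $\Lambda$ is a monotonic function of $T^2$. Thus to reject $H_0$ for $\Lambda < c_1$ is equivalent to rejecting $H_0$ for $T^2 > c_2$ (for some constants $c_1$, $c_2$).

Proof

For $k = 2$ we can show (Ex.) that

\[ B = \frac{n_1 n_2}{n}\, dd^T \tag{4.49} \]

where $d = \bar{x}_1 - \bar{x}_2$. Then

\[ \left|I + W^{-1}B\right| = \left|I + \frac{n_1 n_2}{n}\, W^{-1} dd^T\right| = 1 + \frac{n_1 n_2}{n}\, d^T W^{-1} d \]

Now $W$ is just $(n-2)S_p$, where $S_p$ is the pooled estimator of $\Sigma$. Thus

\[ \Lambda^{-1} = 1 + \frac{T^2}{n-2} \tag{4.50} \]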
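Both results can be verified on a small data set. This sketch assumes NumPy; the function `wilks_lambda` and the two groups of rows are illustrative only. It computes $\Lambda$ from the determinants, recomputes it from the eigenvalues of $W^{-1}B$ as in (4.47), and checks the $k = 2$ relation (4.50).

```python
import numpy as np


def wilks_lambda(groups):
    """Return (Lambda = |W|/|T|, eigenvalues of W^{-1} B, W)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_x = np.vstack(groups)
    centred_all = all_x - all_x.mean(axis=0)
    T = centred_all.T @ centred_all
    W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
    B = T - W
    lam = np.linalg.det(W) / np.linalg.det(T)
    eig = np.linalg.eigvals(np.linalg.solve(W, B)).real
    return float(lam), eig, W


# Illustrative data only: k = 2 groups, p = 2 variables.
g1 = np.array([[5.1, 3.5], [4.9, 3.0], [4.7, 3.2], [5.0, 3.6], [5.4, 3.9]])
g2 = np.array([[7.0, 3.2], [6.4, 3.2], [6.9, 3.1], [5.5, 2.3], [6.5, 2.8]])
lam, eig, W = wilks_lambda([g1, g2])

# Eigenvalue form (4.47).
lam_from_eig = 1.0 / np.prod(1.0 + eig)

# Two-sample Hotelling T^2 with S_p = W / (n - 2), then relation (4.50).
n1, n2 = len(g1), len(g2)
n = n1 + n2
d = g1.mean(axis=0) - g2.mean(axis=0)
t2 = (n1 * n2 / n) * float(d @ np.linalg.solve(W / (n - 2), d))

print(lam, lam_from_eig, 1.0 / (1.0 + t2 / (n - 2)))
```

All three printed values should coincide up to rounding error, which is the numerical counterpart of the monotonic relation between $\Lambda$ and $T^2$.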
5. Discriminant Analysis (Classification)

Given $k$ populations (groups) $\Pi_1, \ldots, \Pi_k$. An individual from $\Pi_j$ has p.d.f. $f_j(x)$ for a set of $p$ measurements $x$.

The purpose of discriminant analysis is to allocate an individual to one of the groups $\{\Pi_j\}$ on the basis of $x$, making as few "mistakes" as possible. For example, a patient presents at a doctor's surgery with a set of symptoms $x$. The symptoms suggest a number of possible disease groups $\{\Pi_j\}$ to which the patient might belong. What is the most likely diagnosis?

The aim initially is to find a partition of $\mathbb{R}^p$ into disjoint regions $R_1, \ldots, R_k$ together with a decision rule

\[ x \in R_j \;\Rightarrow\; \text{allocate } x \text{ to } \Pi_j \]

The decision rule will be more accurate if "$\Pi_j$ has most of its probability concentrated in $R_j$" for each $j$.

5.1 The maximum likelihood (ML) rule

Allocate $x$ to the population $\Pi_j$ that gives the largest likelihood to $x$. Choose $j$ by

\[ L_j(x) = \max_{1 \le i \le k} L_i(x) \]

(break ties arbitrarily).

Result

If $\{\Pi_i\}$ is the multivariate normal (MVN) population $N_p(\mu_i, \Sigma)$ for $i = 1, \ldots, k$, the ML rule allocates $x$ to the population $\Pi_i$ that minimizes the Mahalanobis distance between $x$ and $\mu_i$.

Proof

\[ L_i(x) = |2\pi\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) \right\} \]

so the likelihood is maximized when the quadratic form in the exponent is minimized.

Result

When $k = 2$ the ML rule allocates $x$ to $\Pi_1$ if

\[ d^T(x - \mu) > 0 \tag{5.1} \]

where $d = \Sigma^{-1}(\mu_1 - \mu_2)$ and $\mu = \tfrac{1}{2}(\mu_1 + \mu_2)$, and to $\Pi_2$ otherwise.

Proof

For the two group case, the ML rule is to allocate $x$ to $\Pi_1$ if

\[ (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) < (x - \mu_2)^T \Sigma^{-1} (x - \mu_2) \]
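which reduces to

\[ 2d^T x > \mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2 = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 + \mu_2) = d^T(\mu_1 + \mu_2) \]

Hence the result. The function

\[ h(x) = (\mu_1 - \mu_2)^T \Sigma^{-1} \left[ x - \tfrac{1}{2}(\mu_1 + \mu_2) \right] \tag{5.2} \]

is known as the discriminant function (DF). In this case the DF is linear in $x$.

5.2 Sample ML rule

In practice $\mu_1$, $\mu_2$, $\Sigma$ are estimated by, respectively, $\bar{x}_1$, $\bar{x}_2$, $S_p$, where $S_p$ is the pooled (unbiased) estimator of the covariance matrix.

Example

The eminent statistician R.A. Fisher analysed measurements on samples of size 50 from each of three species of iris. Two of the variables, $x_1$ = sepal length and $x_2$ = sepal width, gave the following data on species I and II:

\[ \bar{x}_1 = \begin{pmatrix} 5.0 \\ 3.4 \end{pmatrix}, \quad S_1 = \begin{pmatrix} 0.12 & 0.10 \\ 0.10 & 0.14 \end{pmatrix}, \qquad \bar{x}_2 = \begin{pmatrix} 6.0 \\ 2.8 \end{pmatrix}, \quad S_2 = \begin{pmatrix} 0.26 & 0.08 \\ 0.08 & 0.10 \end{pmatrix} \]

(The data have been rounded for clarity.) Hence

\[ S_p = \frac{50 S_1 + 50 S_2}{98} = \begin{pmatrix} 0.19 & 0.09 \\ 0.09 & 0.12 \end{pmatrix} \]
\[ d = S_p^{-1}(\bar{x}_1 - \bar{x}_2) = \begin{pmatrix} 0.19 & 0.09 \\ 0.09 & 0.12 \end{pmatrix}^{-1} \begin{pmatrix} -1.0 \\ 0.6 \end{pmatrix} = \begin{pmatrix} -11.4 \\ 14.1 \end{pmatrix} \]
\[ \bar{x} = \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2) = \begin{pmatrix} 5.5 \\ 3.1 \end{pmatrix} \]

giving the rule: allocate $x$ to $\Pi_1$ if

\[ -11.4(x_1 - 5.5) + 14.1(x_2 - 3.1) > 0 \]

i.e.

\[ -11.4x_1 + 14.1x_2 + 19.0 > 0 \]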
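The sample rule can be sketched in a few lines, assuming NumPy; the names `allocate` and `midpoint` are illustrative. Note that, starting from the rounded summaries printed above, the computed coefficients come out near $(-11.6,\ 13.6)$ rather than the quoted $(-11.4,\ 14.1)$; presumably the notes worked from less heavily rounded data, though that is only a guess. The resulting allocations are the same for points well away from the boundary.

```python
import numpy as np

# Rounded summaries for species I and II, as printed in the example.
n1 = n2 = 50
xbar1 = np.array([5.0, 3.4])
xbar2 = np.array([6.0, 2.8])
S1 = np.array([[0.12, 0.10], [0.10, 0.14]])
S2 = np.array([[0.26, 0.08], [0.08, 0.10]])

S_p = (n1 * S1 + n2 * S2) / (n1 + n2 - 2)    # pooled covariance estimate
d = np.linalg.solve(S_p, xbar1 - xbar2)      # discriminant coefficients
midpoint = 0.5 * (xbar1 + xbar2)


def allocate(x):
    """Return 1 if h(x) = d^T (x - midpoint) > 0, otherwise 2."""
    h = float(d @ (np.asarray(x, dtype=float) - midpoint))
    return 1 if h > 0 else 2


print(d)
print(allocate([5.2, 3.6]), allocate([6.3, 2.7]))
```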
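5.3 Misclassification probabilities

The misclassification probabilities $p_{ij}$, defined as

\[ p_{ij} = \Pr\left[\text{Allocate to } \Pi_i \text{ when in fact from } \Pi_j\right] \]

form a $k \times k$ matrix, of which the diagonal elements $\{p_{ii}\}$ are a measure of the classifier's accuracy. For the case $k = 2$

\[ p_{12} = \Pr\left[h(x) > 0 \mid \Pi_2\right] \]

Since $h(x) = d^T(x - \mu)$ is a linear compound of $x$, it has a (univariate) normal distribution. Given that $x \in \Pi_2$:

\[ E[h(x)] = d^T\left(\mu_2 - \tfrac{1}{2}(\mu_1 + \mu_2)\right) = \tfrac{1}{2} d^T(\mu_2 - \mu_1) = -\tfrac{1}{2}\Delta^2 \]

where $\Delta^2 = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)$ is the Mahalanobis distance between $\mu_1$ and $\mu_2$. The variance of $h(x)$ is

\[ d^T \Sigma d = (\mu_1 - \mu_2)^T \Sigma^{-1} \Sigma\, \Sigma^{-1} (\mu_1 - \mu_2) = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2) = \Delta^2 \]

so that

\[ \Pr\left[h(x) > 0\right] = \Pr\left[\frac{h(x) + \tfrac{1}{2}\Delta^2}{\Delta} > \tfrac{1}{2}\Delta\right] = \Pr\left[Z > \tfrac{1}{2}\Delta\right] = 1 - \Phi\left(\tfrac{1}{2}\Delta\right) = \Phi\left(-\tfrac{1}{2}\Delta\right) \tag{5.3} \]

By symmetry this is also $p_{21}$, i.e. $p_{12} = p_{21}$.

Example (contd.)

We can estimate the misclassification probability from the sample Mahalanobis distance between $\bar{x}_1$ and $\bar{x}_2$:

\[ D^2 = (\bar{x}_1 - \bar{x}_2)^T S_p^{-1} (\bar{x}_1 - \bar{x}_2) = \begin{pmatrix} -1.0 & 0.6 \end{pmatrix}\begin{pmatrix} -11.4 \\ 14.1 \end{pmatrix} \simeq 19.9 \]
\[ \Phi\left(-\tfrac{1}{2}D\right) = \Phi(-2.23) = 0.013 \]

The misclassification rate is 1.3%.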
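Formula (5.3) is easy to evaluate with the standard library. The sketch below takes $D^2 = 19.9$ from the example as given; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def misclassification_probability(d_squared):
    """Return Phi(-D/2) for a squared Mahalanobis distance D^2, as in (5.3)."""
    return NormalDist().cdf(-0.5 * sqrt(d_squared))


p12 = misclassification_probability(19.9)   # D^2 from the iris example
print(f"{100 * p12:.1f}%")
# prints: 1.3%
```

As the groups move closer together ($D^2 \to 0$) the probability rises to 0.5, which corresponds to allocation by a coin toss.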
More informationPortfolio selection based on upper and lower exponential possibility distributions
European Journal of Operational Research 114 (1999) 115±126 Theory and Methodology Portfolio selection based on upper and lower exponential possibility distributions Hideo Tanaka *, Peijun Guo Department
More informationMore than you wanted to know about quadratic forms
CALIFORNIA INSTITUTE OF TECHNOLOGY Division of the Humanities and Social Sciences More than you wanted to know about quadratic forms KC Border Contents 1 Quadratic forms 1 1.1 Quadratic forms on the unit
More informationCS3220 Lecture Notes: QR factorization and orthogonal transformations
CS3220 Lecture Notes: QR factorization and orthogonal transformations Steve Marschner Cornell University 11 March 2009 In this lecture I ll talk about orthogonal matrices and their properties, discuss
More informationSystems of Linear Equations
Systems of Linear Equations Beifang Chen Systems of linear equations Linear systems A linear equation in variables x, x,, x n is an equation of the form a x + a x + + a n x n = b, where a, a,, a n and
More informationconstraint. Let us penalize ourselves for making the constraint too big. We end up with a
Chapter 4 Constrained Optimization 4.1 Equality Constraints (Lagrangians) Suppose we have a problem: Maximize 5, (x 1, 2) 2, 2(x 2, 1) 2 subject to x 1 +4x 2 =3 If we ignore the constraint, we get the
More informationLINEAR ALGEBRA. September 23, 2010
LINEAR ALGEBRA September 3, 00 Contents 0. LU-decomposition.................................... 0. Inverses and Transposes................................. 0.3 Column Spaces and NullSpaces.............................
More informationExact Nonparametric Tests for Comparing Means - A Personal Summary
Exact Nonparametric Tests for Comparing Means - A Personal Summary Karl H. Schlag European University Institute 1 December 14, 2006 1 Economics Department, European University Institute. Via della Piazzuola
More informationOverview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
More informationMATH 304 Linear Algebra Lecture 9: Subspaces of vector spaces (continued). Span. Spanning set.
MATH 304 Linear Algebra Lecture 9: Subspaces of vector spaces (continued). Span. Spanning set. Vector space A vector space is a set V equipped with two operations, addition V V (x,y) x + y V and scalar
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationExploratory Factor Analysis Brian Habing - University of South Carolina - October 15, 2003
Exploratory Factor Analysis Brian Habing - University of South Carolina - October 15, 2003 FA is not worth the time necessary to understand it and carry it out. -Hills, 1977 Factor analysis should not
More informationElasticity Theory Basics
G22.3033-002: Topics in Computer Graphics: Lecture #7 Geometric Modeling New York University Elasticity Theory Basics Lecture #7: 20 October 2003 Lecturer: Denis Zorin Scribe: Adrian Secord, Yotam Gingold
More informationSolution of Linear Systems
Chapter 3 Solution of Linear Systems In this chapter we study algorithms for possibly the most commonly occurring problem in scientific computing, the solution of linear systems of equations. We start
More informationChapter 4: Vector Autoregressive Models
Chapter 4: Vector Autoregressive Models 1 Contents: Lehrstuhl für Department Empirische of Wirtschaftsforschung Empirical Research and und Econometrics Ökonometrie IV.1 Vector Autoregressive Models (VAR)...
More informationQuadratic forms Cochran s theorem, degrees of freedom, and all that
Quadratic forms Cochran s theorem, degrees of freedom, and all that Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 1, Slide 1 Why We Care Cochran s theorem tells us
More informationAlgebra I Vocabulary Cards
Algebra I Vocabulary Cards Table of Contents Expressions and Operations Natural Numbers Whole Numbers Integers Rational Numbers Irrational Numbers Real Numbers Absolute Value Order of Operations Expression
More informationGoodness of fit assessment of item response theory models
Goodness of fit assessment of item response theory models Alberto Maydeu Olivares University of Barcelona Madrid November 1, 014 Outline Introduction Overall goodness of fit testing Two examples Assessing
More informationa 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2.
Chapter 1 LINEAR EQUATIONS 1.1 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,..., a n, b are given
More informationMaster s Theory Exam Spring 2006
Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem
More informationExploratory Factor Analysis and Principal Components. Pekka Malo & Anton Frantsev 30E00500 Quantitative Empirical Research Spring 2016
and Principal Components Pekka Malo & Anton Frantsev 30E00500 Quantitative Empirical Research Spring 2016 Agenda Brief History and Introductory Example Factor Model Factor Equation Estimation of Loadings
More informationInner products on R n, and more
Inner products on R n, and more Peyam Ryan Tabrizian Friday, April 12th, 2013 1 Introduction You might be wondering: Are there inner products on R n that are not the usual dot product x y = x 1 y 1 + +
More informationA note on companion matrices
Linear Algebra and its Applications 372 (2003) 325 33 www.elsevier.com/locate/laa A note on companion matrices Miroslav Fiedler Academy of Sciences of the Czech Republic Institute of Computer Science Pod
More informationAdvanced Microeconomics
Advanced Microeconomics Ordinal preference theory Harald Wiese University of Leipzig Harald Wiese (University of Leipzig) Advanced Microeconomics 1 / 68 Part A. Basic decision and preference theory 1 Decisions
More informationNotes on Orthogonal and Symmetric Matrices MENU, Winter 2013
Notes on Orthogonal and Symmetric Matrices MENU, Winter 201 These notes summarize the main properties and uses of orthogonal and symmetric matrices. We covered quite a bit of material regarding these topics,
More informationNOTES ON LINEAR TRANSFORMATIONS
NOTES ON LINEAR TRANSFORMATIONS Definition 1. Let V and W be vector spaces. A function T : V W is a linear transformation from V to W if the following two properties hold. i T v + v = T v + T v for all
More information