# High-Order Contrasts for Independent Component Analysis

Save this PDF as:

Size: px
Start display at page:

## Transcription

3 High-Order Contrasts for Independent Component Analysis Jean-François Cardoso 2 Contrast Functions and Maximum Likelihood Identification Implicitly or explicitly, ICA tries to fit a model for the distribution of X that is a model of independent components: X = AS, where A is an invertible n n matrix and S is an n 1vector with independent entries. Estimating the parameter A from samples of X yields a separating matrix B = A 1. Even if the model X = AS is not expected to hold exactly for many real data sets, one can still use it to derive contrast functions. This section exhibits the contrast functions associated with the estimation of A by the ML principle a more detailed exposition can be found in Cardoso, 1998). Blind separation based on ML was first considered by Gaeta and Lacoume 199) but the authors used cumulant approximations as those described in section 3), Pham and Garat 1997), and Amari et al. 1996). 2.1 Likelihood. Assume that the probability distribution of each entry S i of S has a density r i ). 1 Then, the distribution P S of the random vector S has a density r ) in the form rs) = n i=1 r is i ), and the density of X for a given mixture A and a given probability density r ) is: px; A, r) = det A 1 ra 1 x), 2.1) so that the normalized) log-likelihood L T A, r) of T independent samples x1),...,xt) of X is L T A, r) def = 1 T = 1 T T log pxt); A, r) t=1 T log ra 1 xt)) log det A. 2.2) t=1 Depending on the assumptions made about the densities r 1,...,r n, several contrast functions can be derived from this log-likelihood. 2.2 LikelihoodContrast. Under mild assumptions, the normalized loglikelihood L T A, r), which is a sample average, converges for large T to its ensemble average by law of large numbers: L T A, r) = 1 T T log ra 1 xt)) log det A t=1 T Elog ra 1 x) log det A, 2.3) 1 All densities considered in this article are with respect to the Lebesgue measure on R or R n. which simple manipulations Cardoso, 1997) show to be equal to HP X ) KP Y P S ). Here and in the following, H ) and K ), respectively, denote the differential entropy and the Kullback-Leibler divergence. Since HP X ) does not depend on the model parameters, the limit for large T of L T A, r) is, up to a constant, equal to φ ML Y) def = KP Y P S ). 2.4) Therefore, the principle of ML coincides with the minimization of a specific contrast function, which is nothing but the Kullback) divergence KP Y P S ) between the distribution P Y of the output and a model distribution P S. The classic entropic contrasts follow from this observation, depending on two options: 1) trying or not to estimate P S from the data and 2) forcing or not the components to be uncorrelated. 2.3 Infomax. The technically simplest statistical assumption about P S is to select fixed densities r 1,...,r n for each component, possibly on the basis of prior knowledge. Then P S is a fixed distributional assumption, and the minimization of φ ML Y) is performed only over P Y via Y = BX. This can be rephrased: Choose B such that Y = BX is as close as possible in distribution to the hypothesized model distribution P S, the closeness in distribution being measured in the Kullback divergence. This is also the contrast function derived from the infomax principle by Bell and Sejnowski 1995). The connection between infomax and ML was noted in Cardoso 1997), MacKay 1996), and Pearlmutter and Parra 1996). 2.4 Mutual Information. The theoretically simplest statistical assumption about P S is to assume no model at all. In this case, the Kullback mismatch KP Y P S ) should be minimized not only by optimizing over B to change the distribution of Y = BX but also with respect to P S. For each fixed B, that is, for each fixed distribution P Y, the result of this minimization is theoretically very simple: the minimum is reached when P S = P Y, which denotes the distribution of independent components with each marginal distribution equal to the corresponding marginal distribution of Y. This stems from the property that KP Y P S ) = KP Y P Y ) + K P Y P S ) 2.5) for any distribution P S with independent components Cover & Thomas, 1991). Therefore, the minimum in P S of KP Y P S ) is reached by taking P S = P Y since this choice ensures K P Y P S ) =. The value of φ ML at this point then is φ MI Y) def = min P S KP Y P S ) = KP Y P Y ). 2.6)

4 High-Order Contrasts for Independent Component Analysis Jean-François Cardoso We use the index MI since this quantity is well known as the mutual information between the entries of Y. It was first proposed by Comon 1994), and it can be seen from the above as deriving from the ML principle when optimization is with respect to both the unknown system A and the distribution of S. This connection was also noted in Obradovic and Deco 1997), and the relation between infomax and mutual information is also discussed in Nadal and Parga 1994). 2.5 Minimum Marginal Entropy. An orthogonal contrast φy) is, by definition, to be optimized under the constraint that Y is spatially white: orthogonal contrasts enforce decorrelation, that is, an exact second-order independence. Any regular contrast can be used under the whiteness constraint, but by taking the whiteness constraint into account, the contrast may be given a simpler expression. This is the case of some cumulant-based contrasts described in section 3. It is also the case of φ MI Y) because the mutual information can also be expressed as φ MI Y) = n i=1 HP Y i ) HP Y ); since the entropy HP Y ) is constant under orthonormal transforms, it is equivalent to consider φ ME Y) = n HP Yi ) 2.7) i=1 to be optimized under the whiteness constraint EYY = I. This contrast could be called orthogonal mutual information, or the marginal entropy contrast. The minimum entropy idea holds more generally under any volumepreserving transform Obradovic & Deco, 1997). 2.6 Empirical Contrast Functions. Among all the above contrasts, only φ ML or its orthogonal version are easily optimized by a gradient technique because the relative gradient of φ ML simply is the matrix EHY) with H ) defined in equation 1.4. Therefore, the relative gradient algorithm, equation 1.1, can be employed using either this function H ) or its symmetrized form, equation 1.5, if one chooses to enforce decorrelation. However, this contrast is based on a prior guess P S about the distribution of the components. If the guess is too far off, the algorithm will fail to discover independent components that might be present in the data. Unfortunately, evaluating the gradient of contrasts based on mutual information or minimum marginal entropy is more difficult because it does not reduce to the expectation of a simple function of Y; for instance, Pham 1996) minimizes explicitly the mutual information, but the algorithm involves a kernel estimation of the marginal distributions of Y. An intermediate approach is to consider a parametric estimation of these distributions as in Moulines, Cardoso, and Gassiat 1997) or Pearlmutter and Parra 1996), for instance. Therefore, all these contrasts require that the distributions of components be known, or approx- imated or estimated. As we shall see next, this is also what the cumulant approximations to contrast functions are implicitly doing. 3 Cumulants This section presents higher-order approximations to entropic contrasts, some known and some novel. To keep the exposition simple, it is restricted to symmetric distributions for which odd-order cumulants are identically zero) and to cumulants of orders 2 and 4. Recall that for random variables X 1,...,X 4,second-order cumulants arecumx 1, X 2 ) def def = E X 1 X 2 where X i = X i EX i and the fourth-order cumulants are CumX 1, X 2, X 3, X 4 ) = E X 1 X 2 X 3 X 4 E X 1 X 2 E X 3 X 4 E X 1 X 3 E X 2 X 4 E X 1 X 4 E X 2 X ) The variance and the kurtosis of a real random variable X are defined as σ 2 X) def = CumX, X) = E X 2, kx) def = CumX, X, X, X) = E X 4 3E 2 X 2, 3.2) that is, they are the second- and fourth-order autocumulants. A cumulant involving at least two different variables is called a cross-cumulant. 3.1 Cumulant-Based Approximations to Entropic Contrasts. Cumulants are useful in many ways. In this section, they show up because the probability density of a scalar random variable U close to the standard normal nu) = 2π) 1/2 exp u 2 /2 can be approximated as pu) nu) 1 + σ 2 U) 1 h 2 u) + ku) ) 2 4! h 4u), 3.3) where h 2 u) = u 2 1 and h 4 u) = u 4 6u 2 + 3, respectively, are the secondand fourth-order Hermite polynomials. This expression is obtained by retaining the leading terms in an Edgeworth expansion McCullagh, 1987). If U and V are two real random variables with distributions close to the standard normal, one can, at least formally, use expansion 3.3 to derive an approximation to KP U P V ). This is KP U P V ) 1 4 σ 2 U) σ 2 V)) ku) kv))2, 3.4) which shows how the pair σ 2, k) of cumulants of order 2 and 4 play in some sense the role of a local coordinate system around nu) with the quadratic

5 High-Order Contrasts for Independent Component Analysis Jean-François Cardoso form 3.4 playing the role of a local metric. This result generalizes to multivariates, in which case we denote for conciseness R U ij = CumU i, U j ) and Q U ijkl = CumU i, U j, U k, U l ) and similarly for another random n-vector V with entries V 1,...,V n. We give without proof the following approximation: KP U P V ) K 24 P U P V ) def = 1 4 ij ) 2 R U ij R V ij ijkl Q U ijkl QV ijkl) ) Expression 3.5 turns out to be the simplest possible multivariate generalization of equation 3.4 the two terms in equation 3.5 are a double sum over all the n 2 pairs of indices and a quadruples over all the n 4 quadruples of indices). Since the entropic contrasts listed above have all been derived from the Kullback divergence, cumulant approximations to all these contrasts can be obtained by replacing the Kullback mismatch KP U P V ) by a cruder measure: its approximation is a cumulant mismatch by equation Approximation to the Likelihood Contrast. The infomax-ml contrast φ ML Y) = KP Y P S ) for ICA see equation 2.4) is readily approximated by using expression 3.5. The assumption P S on the distribution of S is now replaced by an assumption about the cumulants of S. This amounts to very little: all the cross-cumulants of S being thanks to the assumption of independent sources, it is needed only to specify the autocumulants σ 2 S i ) and ks i ). The cumulant approximation see equation 3.5) to the infomax-ml contrast becomes: φ ML Y) K 24 P Y P S ) = 1 4 ) 2 R Y ij σ 2 S i )δ ij ij ijkl Q Y ijkl ks i)δ ijkl ) 2, 3.6) where the Kronecker symbol δ equals 1 with identical indices and otherwise Approximation to the Mutual Information Contrast. The mutual information contrast φ MI Y) was obtained by minimizing KP Y P S ) over all the distributions P S with independent components. In the cumulant approximation, this is trivially done: the free parameters for P S are σ 2 S i ) and ks i ). Each of these scalars enters in only one term of the sums in equation 3.6 so that the minimization is achieved for σ 2 S i ) = R Y ii and ks i ) = Q Y iiii.in other words, the construction of the best approximating distribution with independent marginals P Y, which appears in equation 2.5, boils down, in the cumulant approximation, to the estimation of the variance and kurtosis of each entry of Y. Fitting both σ 2 S i ) and ks i ) to R Y ii and QY iii, respectively, has the effect of exactly cancelling the diagonal terms in equation 3.6, leaving only φ MI Y) φ24 MI def Y) = 1 4 ij ii ) 2 R Y 1 ij + 48 ijkl iiii Q Y ijkl) 2, 3.7) which is our cumulant approximation to the mutual information contrast in equation 2.6. The first term is understood as the sum over all the pairs of distinct indices; the second term is a sum over all quadruples of indices that are not all identical. It contains only off-diagonal terms, that is, cross-cumulants. Since cross-cumulants of independent variables identically vanish, it is not surprising to see the mutual information approximated by a sum of squared cross-cumulants Approximation to the Orthogonal Likelihood Contrast. The cumulant approximation to the orthogonal likelihood is fairly simple. The orthogonal approach consists of first enforcing the whiteness of Y that is R Y ij = δ ij or R Y = I. In other words, it consists of normalizing the components by assuming that σ 2 S i ) = 1 and making sure the second-order mismatch is zero. This is equivalent to replacing the weight 1 4 in equation 3.6 by an infinite weight, hence reducing the problem to the minimization under the whiteness constraint) of the fourth-order mismatch, or the second quadruple) sum in equation 3.6. Thus, the orthogonal likelihood contrast is approximated by φ OML 24 Y) def = 1 48 ijkl Q Y ijkl ks i)δ ijkl ) ) This contrast has an interesting alternate expression. Developing the squares gives φ24 OML Y) = 1 Q Y ijkl 48 )2 + 1 k 2 S i )δijkl ks i )δ ijkl Q Y ijkl 48. ijkl ijkl The first sum above is constant under the whiteness constraint this is readily checked using equation 3.13 for an orthonormal transform), and the second sum does not depend on Y; finally the last sum contains only diagonal nonzero terms. It follows that: φ24 OML Y) = c 1 ks i )Q Y iiii 24 i = 1 ks i )ky i ) = c 1 ks i )EȲi , 3.9) i i ijkl

6 High-Order Contrasts for Independent Component Analysis Jean-François Cardoso where c= denotes an equality up to a constant. An interpretation of the second equality is that the contrast is minimized by maximizing the scalar product between the vector [ky 1 ),...,ky n )] of the kurtosis of the components and the corresponding vector of hypothesized kurtosis [ks 1 ),...,ks n )]. The last equality stems from the definition in equation 3.2 of the kurtosis and the constancy of EȲi 2 under the whiteness constraint. This last form is remarkable because it shows that for zero-mean observations, φ24 OML Y) = c ElY),where ly) = 1 24 i ks i)yi 4,so the contrast is just the expectation ofa simple function of Y. We can expect simple techniques for its maximization Approximation to the Minimum Marginal Entropy Contrast. Under the whiteness constraint, the first sum in the approximation, equation 3.7, is zero this is the whiteness constraint) so that the approximation to mutual information φ MI Y) reduces to the last term: φ ME Y) φ24 ME def Y) = 1 48 ijkl iiii ) 2 Q Y c= 1 ijkl 48 i Q Y iiii) ) Again, the last equality up to constant follows from the constancy of ijkl Q Y ijkl )2 under the whiteness constraint. These approximations had already been obtained by Comon 1994) from an Edgeworth expansion. They say something simple: Edgeworth expansions suggest testing the independence between the entries of Y by summing up all the squared cross-cumulants. In the course of this article, we will find two similar contrast functions. The JADE contrast, φ JADE Y) def = ijkl iikl Q Y ijkl) 2, 3.11) also is a sum of squared cross-cumulants the notation indicates a sum is over all the quadruples ijkl) of indices with i j). Its interest is to be also a criterion of joint diagonality of cumulants matrices. The SHIBBS criterion, φ SH Y) def = ijkl iikk Q Y ijkl) 2, 3.12) is also introduced in section 4.3 as governing a similar but less memorydemanding algorithm. It also involves only cross-cumulants: those with indices ijkl) such that i j or k l. 3.2 Cumulants and Algebraic Structures. Previous sections reviewed the use of cumulants in designing contrast functions. Another thread of ideas using cumulants stems from the method of moments. Such an approach is called for by the multilinearity of the cumulants. Under a linear transform Y = BX, which also reads Y i = p b ipx p, the cumulants of order 4 for instance) transform as: CumY i, Y j, Y k, Y l ) = pqrs b ip b jq b kr b ls CumX p, X q, X r, X s ), 3.13) which can easily be exploited for our purposes since the ICA model is linear. Using this fact and the assumption of independence by which CumS p, S q, S r, S s ) = ks p )δp, q, r, s), we readily obtain the simple algebraic structure of the cumulants of X = AS when S has independent entries, CumX i, X j, X k, X l ) = n ks u )a iu a ju a ku a lu, 3.14) u=1 where a ij denotes the ij)th entry of matrix A. When estimates ĈumX i, X j, X k, X l ) are available, one may try to solve equation 3.4 in the coefficients a ij of A. This is tantamount to cumulant matching on the empirical cumulants of X. Because of the strong algebraic structure of equation 3.14, one may try to devise fourth-order factorizations akin to the familiar second-order singular value decomposition SVD) or eigenvalue decomposition EVD) see Cardoso, 1992; Comon, 1997; De Lathauwer, De Moor, & Vandewalle, 1996). However, these approaches are generally not equivalent to the optimization of a contrast function, resulting in estimates that are generally not equivariant Cardoso, 1995). This point is illustrated below; we introduce cumulant matrices whose simple structure offers straightforward identification techniques, but we stress, as one of their important drawbacks, their lack of equivariance. However, we conclude by showing how the algebraic point of view and the statistical equivariant) point of view can be reconciled Cumulant Matrices. The algebraic nature of cumulants is tensorial McCullagh, 1987), but since we will concern ourselves mainly with secondand fourth-order statistics, a matrix-based notation suffices for the purpose of our exposition; we only introduce the notion of cumulant matrix defined as follows. Given a random n 1 vector X and any n n matrix M, we define the associated cumulant matrix Q X M) as the n n matrix defined component-wise by [Q X M)] ij def = n CumX i, X j, X k, X l )M kl. 3.15) k,l=1 If X is centered, the definition in equation 3.1 shows that Q X M) = E{X MX) XX } R X trmr X ) R X MR X R X M R X,3.16)

7 High-Order Contrasts for Independent Component Analysis Jean-François Cardoso where tr ) denotes the trace and R X denotes the covariance matrix of X, that is, [R X ] ij = CumX i, X j ). Equation 3.16 could have been chosen as an index-free definition of cumulant matrices. It shows that a given cumulant matrix can be computed or estimated at a cost similar to the estimation cost of a covariance matrix; there is no need to compute the whole set of fourthorder cumulants to obtain the value of Q X M) for a particular value of M. Actually, estimating a particular cumulant matrix is one way of collecting part of the fourth-order information in X; collecting the whole fourth-order information requires the estimation of On 4 ) fourth-order cumulants. The structure of a cumulant matrix Q X M) in the ICA model is easily deduced from equation 3.14: Q X M) = A M)A ) M) = Diag ks 1 ) a 1 Ma 1,...,kS n ) a n Ma n, 3.17) where a i denotes the ith column of A, that is, A = [a 1,...,a n ]. In this factorization, the generally unknown) kurtosis enter only in the diagonal matrix M), a fact implicitly exploited by the algebraic techniques described below. 3.3 Blind Identification Using Algebraic Structures. In section 3.1, contrast functions were derived from the ML principle assuming the model X = AS. In this section, we proceed similarly: we consider cumulant-based blind identification of A assuming X = AS from which the structures 3.14 and 3.17 result. Recallthat the orthogonal approach can be implemented by first sphering explicitly vector X. Let W be a whitening, and denote Z def = WX the sphered vector. Without loss of generality, the model can be normalized by assuming that the entries of S have unit variance so that S is spatially white. Since Z = WX = WAS is also white by construction, the matrix U def = WA must be orthonormal: UU = I. Therefore sphering yields the model Z = US with U orthonormal. Of course, this is still a model of independent components so that, similar to equation 3.17, we have for any matrix M the structure of the corresponding cumulant matrix of Z, Q Z M) = U M)U ) M) = Diag ks 1 ) u 1 Mu 1,...,kS n ) u n Mu n, 3.18) where u i denotes the ith column of U. In a practical orthogonal statisticbased technique, one would first estimate a whitening matrix Ŵ, estimate some cumulants of Z = ŴX, compute an orthonormal estimate Û of U using these cumulants, and finally obtain an estimate Â of A as Â = Ŵ 1Û or obtain a separating matrix as B = Û 1Ŵ = Û Ŵ Nonequivariant Blind Identification Procedures. We first present two blind identification procedures that exploit in a straightforward manner the structure 3.17; we explain why, in spite of attractive computational simplicity, they are not well behaved not equivariant) and how they can be fixed for equivariance. The first idea is not based on an orthogonal approach. Let M 1 and M 2 def be two arbitrary n n matrices, and define Q 1 = Q X def M 1 ) and Q 2 = Q X M 2 ). According to equation 3.17, if X = AS we have Q 1 = A 1 A and Q 2 = A 2 A with 1 and 2 two diagonal matrices. Thus, G def = Q 1 Q 1 2 = A 1 A )A 2 A ) 1 = A A 1, where is the diagonal matrix It follows that GA = A, meaning that the columns of A are the eigenvectors of G possibly up to scale factors). An extremely simple algorithm for blind identification of A follows: Select two arbitrary matrices M 1 and M 2 ; compute sample estimates ˆQ 1 and ˆQ 2 using equation 3.16; find the columns of A as the eigenvectors of ˆQ 1 ˆQ 1 2. There is at least one problem with this idea: we have assumed invertible matrices throughout the derivation, and this may lead to instability. However, this specific problem may be fixed by sphering, as examined next. Consider now the orthogonal approach as outlined above. Let M be some arbitrary matrix M, and note that equation 3.18 is an eigendecomposition: the columns of U are the eigenvectors of Q Z M), which are orthonormal indeed because Q Z M) is symmetric. Thus, in the orthogonal approach, another immediate algorithm for blind identification is to estimate U as an orthonormal) diagonalizer of an estimate of Q Z M). Thanks to sphering, problems associated with matrix inversion disappear, but a deeper problem associated with these simple algebraic ideas remains and must be addressed. Recall that the eigenvectors are uniquely determined 2 if and only if the eigenvalues are all distinct. Therefore, we need to make sure that the eigenvalues of Q Z M) are all distinct in order to preserve blind identifiability based on Q Z M). According to equation 3.18, these eigenvalues depend on the sphered) system, which is unknown. Thus, it is not possible to determine a priori if a given matrix M corresponds to distinct eigenvalues of Q Z M). Of course, if M is randomly chosen, then the eigenvalues are distinct with probability 1, but we need more than this in practice because the algorithms use only sample estimates of the cumulant matrices. A small error in the sample estimate of Q Z M) can induce a large deviation of the eigenvectors if the eigenvalues are not well enough separated. Again, this is impossible to guarantee a priori because an appropriate selection of M requires prior knowledge about the unknown mixture. In summary, the diagonalization of a single cumulant matrix is computa- 2 In fact, determined only up to permutations and signs that do not matter in an ICA context.

8 High-Order Contrasts for Independent Component Analysis Jean-François Cardoso tionally attractive and can be proved to be almost surely consistent, but it is not satisfactory because the nondegeneracy of the spectrum cannot be controlled. As a result, the estimation accuracy from a finite number of samples depends on the unknown system and is therefore unpredictable in practice; this lack of equivariance is hardly acceptable. One may also criticize these approaches on the ground that they rely on only a small part of the fourth-order information summarized in an n n cumulant matrix) rather than trying to exploit more cumulants there are On 4 ) fourth-order independent cumulant statistics). We examine next how these two problems can be alleviated by jointly processing several cumulant matrices Recovering Equivariance. Let M ={M 1,...,M P } be a set of P matrices of size n n and denote Q i = Q Z M i ) for 1 i P the associated def cumulant matrices for the sphered data Z = US. Again, as above, for all i we have Q i = U i U with i a diagonal matrix given by equation As a measure of nondiagonality of a matrix F, define OffF) as the sum of the squares of the nondiagonal elements: OffF) def = ) 2 fij. 3.19) i j We have in particular OffU Q i U) = Off i ) = since Q i = U i U and U U = I. For any matrix set M and any orthonormal matrix V, we define the following nonnegative joint diagonality criterion, D M V) def = M i M OffV Q Z M i )V), 3.2) which measures how close to diagonality an orthonormal matrix V can simultaneously bring the cumulants matrices generated by M. To each matrix set M is associated a blind identification algorithm as follows: 1) find a sphering matrix W to whiten in the data X into Z = WX;2) estimate the cumulant matrices Q Z M) for all M M by a sample version of equation 3.16;3) minimize the joint diagonality criterion, equation 3.2, that is, make the cumulant matrices as diagonal as possible by an orthonormal transform V; 4) estimate A as A = VW 1 or its inverse as B = V W or the component vector as Y = V Z = V WX. Such an approach seems to be able to alleviate the drawbacks mentioned above. Finding the orthonormal transform as the minimizer of a set of cumulant matrices goes in the right direction because it involves a larger number of fourth-order statistics and because it decreases the likelihood of degenerate spectra. This argument can be made rigorous by considering a maximal set of cumulant matrices. By definition, this is a set obtained whenever M is an orthonormal basis for the linear space of n n matrices. Such a basis contains n 2 matrices so that the corresponding cumulant matrices total n 2 n 2 = n 4 entries, that is, as many as the whole fourth-order cumulant set. For any such maximal set Cardoso & Souloumiac, 1993): D M V) = φ JADE Y) with Y = V Z, 3.21) where φ JADE Y) is the contrast function defined at equation The joint diagonalization of a maximal set guarantees blind identifiability of A if ks i ) = for at most one entry S i of S Cardoso & Souloumiac, 1993). This is a necessary condition for any algorithm using only second- and fourth-order statistics Comon, 1994). Akey point is made by relationship We managed to turn an algebraic property diagonality) of the cumulants of the sphered) observations into a contrast function a functional of the distribution of the output Y = V Z. This fact guarantees that the resulting estimates are equivariant Cardoso, 1995). The price to pay with this technique for reconciling the algebraic approach with the naturally equivariant contrast-based approach is twofold: it entails the computation of a large actually, maximal) set of cumulant matrices and the joint diagonalization of P = n 2 matrices, which is at least as costly as P times the diagonalization of a single matrix. However, the overall computational burden may be similar see examples in section 5) to the cost of adaptive algorithms. This is because the cumulant matrices need to be estimated once for a given data set and because it exists as a reasonably efficient joint diagonalization algorithm see section 4) that is not based on gradient-style optimization; it thus preserves the possibility of exploiting the underlying algebraic nature of the contrast function, equation Several tricks for increasing efficiency are also discussed in section 4. 4 Jacobi Algorithms This section describes algorithms for ICA sharing a common feature: a Jacobi optimization of an orthogonal contrast function as opposed to optimization by gradient-like algorithms. The principle of Jacobi optimization is applied to a data-based algorithm, a statistic-based algorithm, and a mixed approach. The Jacobi method is an iterative technique of optimization over the set of orthonormal matrices. The orthonormal transform is obtained as a sequence ofplane rotations. Eachplane rotation is a rotation applied to a pair of coordinates hence the name: the rotation operates in a two-dimensional plane). If Y is an n 1 vector, the i, j)th plane rotation by an angle θ ij changes the coordinates i and j of Y according to [ Yi Y j ] [ cosθij ) sinθ ij ) sinθ ij ) cosθ ij ) ][ Yi Y j ], 4.1) while leaving the other coordinates unchanged. Asweep is one pass through

9 High-Order Contrasts for Independent Component Analysis Jean-François Cardoso all the nn 1)/2 possible pairs of distinct indices. This idea is classic in numerical analysis Golub & Van Loan, 1989); it can be considered in a wider context for the optimization of any function of an orthonormal matrix. Comon introduced the Jacobi technique for ICA see Comon, 1994 for a data-based algorithm and an earlier reference in it for the Jacobi update of high-order cumulant tensors). Such a data-based Jacobi algorithm for ICA works through a sequence of Jacobi sweeps on the sphered data until a given orthogonal contrast φy) is optimized. This can be summarized as: 1. Initialization. Compute a whitening matrix W and set Y = WX. 2. One sweep. For all nn 1)/2 pairs, that is for 1 i < j n, do: a. Compute the Givens angle θ ij, optimizing φy) when the pair Y i, Y j ) is rotated. b. Ifθ ij <θ min, do rotate the pairy i, Y j ) according to equation If no pair has been rotated in previous sweep, end. Otherwise go to 2 for another sweep. Thus, the Jacobi approach considers a sequence of two-dimensional ICA problems. Of course, the updating step 2b on a pair i, j) partially undoes the effect of previous optimizations on pairs containing either i or j. For this reason, it is necessary to go through several sweeps before optimization is completed. However, Jacobialgorithms are often very efficient and converge in a small number of sweeps see the examples in section 5), and a key point is that each plane rotation depends on a single parameter, the Givens angle θ ij, reducing the optimization subproblem at each step to a one-dimensional optimization problem. An important benefit of basing ICA on fourth-order contrasts becomes apparent: because fourth-order contrasts are polynomial in the parameters, the Givens angles can often be found in close form. In the above scheme, θ min is a small angle, which controls the accuracy of the optimization. In numerical analysis, it is determined according to machine precision. For a statistical problem as ICA, θ min should be selected in such a way that rotations by a smaller angle are not statistically significant. In our experiments, we take θ min to scale as 1/ T, typically: θ min = 1 2. T This scaling can be related to the existence of a performance bound in the orthogonal approach to ICACardoso, 1994). This value does not seem to be critical, however, because we have found Jacobi algorithms to be very fast at finishing. In the remainder of this section, we describe three possible implementations of these ideas. Each one corresponds to a different type of contrast function and to different options about updating. Section 4.1 describes a data-based algorithm optimizing φ OML Y); section 4.2 describes a statisticbased algorithm optimizingφ JADE Y);section 4.3presents a mixed approach optimizing φ SH Y); finally, section 4.4 discusses the relationships between these contrast functions. 4.1 A Data-Based Jacobi Algorithm: MaxKurt. We start by a Jacobitechnique for optimizing the approximation, equation 3.9, to the orthogonal likelihood. For the sake of exposition, we consider a simplified version of φ OML Y) obtained by setting ks 1 ) = ks 2 ) =... = ks n ) = k, in which case the minimization of contrast function, equation 3.9, is equivalent to the minimization of φ MK Y) def = k i Q Y iiii. 4.2) This criterion is also studied by Moreau and Macchi 1996), who propose a two-stage adaptive procedure for its optimization; it also serves as a starting point for introducing the one-stage adaptive algorithm of Cardoso and Laheld 1996). Denote G ij θ) the plane rotation matrix that rotates the pair i, j) by an angle θ as in step 2b above. Then simple trigonometry yields: φ MK G ij θ)y) = µ ij kλ ij cos4θ ij )), 4.3) where µ ij does not depend on θ and λ ij is nonnegative. The principal determination of angle ij is characterized by ij = 1 ) 4Q 4 arctan Y iiij 4QY ijjj, QY iiii + QY jjjj 6QY iijj, 4.4) where arctany, x) denotes the angle α π,π] such that cosα) = x x 2 +y 2 y and sinα) =.IfY is a zero-mean sphered vector, expression 4.3 x 2 +y2 further simplifies to ij = 1 ) )) 4E 4 arctan Yi 3 Y j Y i Yj 3, E Yi 2 Y2 j )2 4Yi 2 Y2 j. 4.5) The computations are given in the appendix. It is now immediate to minimize φ MK Y) for each pair of components and for either choice of the sign of k. If one looks for components with positive kurtosis often called supergaussian), the minimization of φ MK Y) is identical to the maximization of the sum of the kurtosis of the components since we have k > in this case. The Givens angle simply is θ = ij since this choice makes the cosine in equation 4.2 equal to its maximum value. We refer to the Jacobi algorithm outlined above as MaxKurt. A Matlab implementation is listed in the appendix, whose simplicity is consistent with the data-based approach. Note, however, that it is also possible to use

12 High-Order Contrasts for Independent Component Analysis Jean-François Cardoso point of the SHIBBS algorithm is that for any pair 1 i < j n: CumY i, Y j, Y k, Y k ) CumY i, Y i, Y k, Y k ) k CumY j, Y j, Y k, Y k ) ) =, 4.7) and we can prove that this is also the stationarity condition of φ SH Y).We do not include the proofs of these statements, which are purely technical. 4.4 Comparing Fourth-Order Orthogonal Contrasts. We have considered two approximations, φ JADE Y) and φ SH Y), to the minimum marginal entropy/mutual information contrast φ MI I), which are based on fourthorder cumulants and can be optimized by Jacobi technique. The approximation φ24 ME Y) proposed by Comon also belongs to this category. One may wonder about the relative statistical merits of these three approximations. The contrast φ24 ME Y) stems from an Edgeworth expansion for approximating φ ME Y), which in turn has been shown to derive from the ML principle see section 2). Since ML estimation offers asymptotic) optimality properties, one may be tempted to conclude to the superiority ofφ24 ME Y). However, this is not the case, as discussed now. First, when the ICA model holds, it can be shown that even though φ ME 24 Y) and φjade Y) are different criteria, they have the same asymptotic performance when applied to sample statistics Souloumiac & Cardoso, 1991). This is also true of φ SH Y) since we have seen that it is equivalent to JADE in this case a more rigorous proof is possible, based on equation 4.7, but is not included). Second, when the ICA model does not hold, the notion of identification accuracy does not make sense anymore, but one would certainly favor an orthogonal contrast reaching its minimum at a point as close as possible to the point where the true mutual information φ ME Y) is minimized. However, it seems difficult to find a simple contrast such as those considered here) that would be a good approximation to φ ME Y) for any wide class of distributions of X. Note that the ML argument in favor of φ ME 24 Y) is based on an Edgeworth expansion that is valid for almost gaussian distributions those distributions that make ICA very difficult and of dubious significance: In practice, ICA should be restricted to data sets where the components show a significant amount of nongaussianity, in which case the Edgeworth expansions cannot be expected to be accurate. There is another way than Edgeworth expansion for arriving at φ ME 24 Y). Consider cumulant matching: the matching of the cumulants of Y to the corresponding cumulants of a hypothetical vector S with independent components. The orthogonal contrast functions φ JADE Y), φ SH Y), and φ ME 24 Y) can be seen as matching criteria because they penalize the deviation of the cross-cumulants of Y from zero which is the value of cross-cumulants of a vector S with independent components, indeed), and they do so under the constraint that Y is white that is, by enforcing an exact match of the second-order cumulants of Y. It is possible to devise an asymptotically optimal matching criterion by taking into account the variability of the sample estimates of the cumulants. Such a computation is reported in Cardoso et al. 1996) for the matching of all second- and fourth-order cumulants of complex-valued signals, but a similar computation is possible for real-valued problems. It shows that the optimal weighting of the cross-cumulants depends on the distributions of the components so that the flat weighting of all the cross-cumulants, as in equation 3.1, is not the best one in general. However, in the limit of almost gaussian signals, the optimal weights tend to values corresponding precisely to the contrast K 24 Y S) defined in equation 3.5. This is not unexpected and confirms that the crude cumulant expansion used in deriving equation 3.5 is sensible, though not optimal for significantly nongaussian components. It seems from the definitions of φ JADE Y), φ SH Y), and φ24 ME Y) that these different contrasts involve different types of cumulants. This is, however, an illusion because the compact definitions given above do not take into account the symmetries of the cumulants: the same cross-cumulant may be counted several times in each of these contrasts. For instance, the definition of JADE excludes the cross-cumulant CumY 1, Y 1, Y 2, Y 3 ) but includes the cross-cumulant CumY 1, Y 2, Y 1, Y 3 ), which is identical. Thus, in order to determine if any bit of fourth-order information is ignored by any particular contrast, a nonredundant description should be given. All the possible crosscumulants come in four different patterns of indices: ijkl), iikl), iijj), and ijjj). Nonredundant expressions in terms of these patterns are in the form: φ[y] = C a i<j<k<l e ijkl + C b e iikl + e kkil + e llik ) i<k<l ) + C c e iijj + C d eijjj + e jjji, i<j i<j where e ijkl def = CumY i, Y j, Y k, Y l ) 2 and the C i s are numerical constants. It remains to count how many times a unique cumulant appears in the redundant definitions of the three approximations to mutual information considered so far. We give only the result of this uninspiring task in Table 1, which shows that all the cross-cumulants are actually included in the three contrasts, which therefore differ only by the different scalar weights given to each particular type. It means that the three contrasts essentially do the same thing. In particular,when the number n ofcomponents is large enough, the number of cross-cumulants of type [ijkl] all indices distinct) grows as On 4 ), while the number of other types grows as On 3 ) at most. Therefore, the [ijkl] type outnumbers all the other types for large n: one may conjecture the equivalence of the three contrasts in this limit. Unfortunately, it seems

13 High-Order Contrasts for Independent Component Analysis 181 Table 1: Number of Times a Cross-Cumulant of a Given Type Appears in a Given Contrast. 182 Jean-François Cardoso JADE Difference Constants C a C b C c C d Pattern ijkl iikl iijj ijjj Comon ICA JADE SHIBBS difficult to draw more conclusions. For instance, we have mentioned the asymptotic equivalence between Comon s contrast and the JADE contrast for any n, but it does not reveal itself directly in the weight table. 5 A Comparison on Biomedical Data The performance of the algorithms presented above is illustrated using the averaged event-related potential ERP) data recorded and processed by Makeig and coworkers. A detailed account of their analysis is in Makeig, Bell, Jung, and Sejnowski 1997). For our comparison, we use the data set and the logistic ICA algorithm provided with version 3.1 of Makeig s ICA toolbox. 3 The data set contains 624 data points of averaged ERP sampled from 14 EEG electrodes. The implementation of the logistic ICA provided in the toolbox is somewhat intermediate between equation 1.1 and its off-line counterpart: HY) is averaged through subblocks of the data set. The nonlinear function is taken to be ψy) = 2 1+e y 1 = tanh y 2. This is minus the log-derivative ψy) = r y) ry) of the density ry) = β 1 coshy/2) β is a normalization constant). Therefore, this method maximizes over A the likelihood of model X = AS under the assumptions that S has independent components 1 with densities equal to β coshy/2). Figure 1 shows the components Y JADE produced by JADE first column) and the components Y LICA produced by the logistic ICA included in Makeig s toolbox, which was run with all the default options; the third column shows the difference between the components at the same scale. This direct comparison is made possible with the following postprocessing: the components Y LICA were normalized to have unit variance and were sorted by increasing values of kurtosis. The components Y JADE have unit variance by construction; they were sorted and their signs were changed to match Y LICA. Figure 1 shows that Y JADE and Y LICA essentially agree on 9 of 14 components. 3 Available from scott/. Figure 1: The source signals estimated by JADE and the logistic ICA and their differences.

15 High-Order Contrasts for Independent Component Analysis Jean-François Cardoso 2 JADE Difference Figure 3: Eigenvalues of the covariance matrix R X 1log 1 λ i )). of the data in db i.e., Table 2: Number of Floating-Point Operations and CPU Time. Flops CPU Secs. Flops CPU Secs. Method 14 Components 12 Components 5.5e e JADE 4.e e SHIBBS 5.61e e MaxKurt 1.19e e+6.54 slower than JADE here because the data set is not large enough to give it an edge. These remarks are even more marked when comparing the figures obtained in the extraction of 12 components. It should be clear that these figures do not prove much because they are representative of only a particular data set and of particular implementations of the algorithms, as well as of the various parameters used for tuning the algorithms. However, they do disprove the claim that algebraic-cumulant methods are of no practical value. 6 Summary and Conclusions The definitions of classic entropic contrasts for ICA can all be understood from an ML perspective. An approximation of the Kullback-Leibler divergence yields cumulant-based approximations of these contrasts. In the or- Figure 4: The 12 source signals estimated by JADE and a logistic ICA out of the first 12 principal components of the original data.

17 High-Order Contrasts for Independent Component Analysis Jean-François Cardoso amples presented in this article, but it could indeed show up in other data sets. Appendix: Derivation and Implementation of MaxKurt A.1 Givens Angles for MaxKurt. An explicit form of the MaxKurt contrast as a function of the Givens angles is derived. For conciseness, we denote [ijkl] = Q Y ijkl and we define a ij = [iiii] + [jjjj] 6[iijj] 4 b ij = [iiij] [jjji] λ ij = a 2 ij + b2 ij. A.1) The sum of the kurtosis for the pair of variables Y i and Y j after they have been rotated by an angle θ depends on θ as follows where we set c = cosθ) and s = sinθ)): kcosθ)y i + sinθ)y j ) + k sinθ)y i + cosθ)y j ) A.2) = c 4 [iiii] + 4c 3 s[iiij] + 6c 2 s 2 [iijj] + 4cs 3 [ijjj] + s 4 [jjjj] A.3) + s 4 [iiii] 4s 3 c[iiij] + 6s 2 c 2 [iijj] 4sc 3 [ijjj] + c 4 [jjjj] A.4) = c 4 + s 4 )[iiii] + [jjjj]) + 12c 2 s 2 [iijj] + 4csc 2 s 2 )[iiij] [jjji]) A.5) c = 8c 2 2 [iiii] + [jjjj] 6[iijj] s + 4csc 2 s 2 )[iiij] [jjji]) A.6) 4 = 2sin 2 c 2θ)a ij + 2sin2θ)cos2θ)b ij = cos4θ)aij + sin4θ)b ij A.7) = λ ij cos4θ)cos4 ij ) + sin4θ)sin4 ij ) ) = λ ij cos4θ ij )).A.8) where the angle 4 ij is defined by cos4 ij ) = a ij a 2 ij + b2 ij sin4 ij ) = b ij a 2 ij + b2 ij. A.9) This is obtained by using the multilinearity and the symmetries of the cumulants at lines A.3 and A.4, followed by elementary trigonometrics. If Y i and Y j are zero-mean and sphered, EY i Y j = δ ij, we have [iiii] = Q Y iiii = EY4 i 3E2 Y 2 i = EY 4 i 3 and for i j:[iiij] = QY iiij = EY3 i Y j as well as [iijj] = Q Y iijj = EY2 i Y2 j 1. Hence an alternate expression for a ij and b ij is: a ij = 1 4 E Y 4 i + Y4 j 6Y 2 i Y2 j ) ) b ij = E Yi 3 Y j Y i Yj 3. A.1) It may be interesting to note that all the moments required to determine the Givens angle for a given pair i, j) can be expressed in terms of the two variables ξ ij = Y i Y j and η ij = Yi 2 Y2 j. Indeed, it is easily checked that for a zero-mean sphered pair Y i, Y j ), one has a ij = 1 4 E η 2 ij 4ξ 2 ij ) b ij = E η ij ξ ij ). A.11) A.2 A Simple Matlab Implementation of MaxKurt. A Matlab implementation could be as follows, where we have tried to maximize readability but not the numerical efficiency: function Y = maxkurtx) % [n T] = sizex) ; Y = X - meanx,2)*ones1,t); % Remove the mean Y = invsqrtmx*x /T))*Y ; % Sphere the data encore = 1 ; % Go for first sweep while encore, encore=; for p=1:n-1, % These two loops go for q=p+1:n, % through all pairs xi = Yp,:).*Yq,:); eta = Yp,:).*Yp,:) - Yq,:).*Yq,:); Omega = atan2 4*eta*xi ), eta*eta - 4*xi*xi ) ); end end end return if absomega) >.1/sqrtT) % A statistically small % angle encore = 1 ; % This will not be the %last sweep c = cosomega/4); s = sinomega/4); Y[p q],:) = [ c s ; -s c ] * Y[p q],:) ; % Plane % rotation end Acknowledgments We express our thanks to Scott Makeig and to his coworkers for releasing to the public domain some of their codes and the ERP data set. We also thank the anonymous reviewers whose constructive comments greatly helped in revising the first version of this article.

18 High-Order Contrasts for Independent Component Analysis Jean-François Cardoso References Amari, S.-I. 1996). Neural learning in structured parameter spaces Natural Riemannian gradient. In Proc. NIPS. Amari, S.-I., Cichocki, A., & Yang, H. 1996). A new learning algorithm for blind signal separation. In Advances in neural information processing systems, 8 pp ). Cambridge, MA: MIT Press. Bell, A. J., & Sejnowski, T. J. 1995). An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7, Cardoso, J.-F. 1992). Fourth-order cumulant structure forcing. Application to blind array processing. In Proc. 6th SSAP workshop on statistical signal and array processing pp ). Cardoso, J.-F. 1994). On the performance of orthogonal source separation algorithms. In Proc. EUSIPCO pp ). Edinburgh. Cardoso, J.-F. 1995). The equivariant approach to source separation. In Proc. NOLTA pp. 55 6). Cardoso, J.-F. 1997). Infomax and maximum likelihood for source separation. IEEE Letters on Signal Processing, 4, Cardoso, J.-F. 1998). Blind signal separation: statistical principles. Proc. of the IEEE. Special issue on blind identification and estimation. Cardoso, J.-F., Bose, S., & Friedlander, B. 1996). On optimal source separation based on second and fourth order cumulants. In Proc.IEEEWorkshopon SSAP. Corfu, Greece. Cardoso, J.-F., &Laheld, B. 1996). Equivariant adaptive source separation. IEEE Trans. on Sig. Proc., 44, Cardoso, J.-F., & Souloumiac, A. 1993). Blind beamforming for non Gaussian signals. IEEE Proceedings-F, 14, Comon, P. 1994). Independent component analysis, a new concept? Signal Processing, 36, Comon, P. 1997). Cumulant tensors. In Proc. IEEE SP Workshop on HOS. Banff, Canada. Cover, T. M., & Thomas, J. A. 1991). Elements of information theory. New York: Wiley. De Lathauwer, L., De Moor, B., &Vandewalle, J. 1996). Independent component analysis based on higher-order statistics only. In Proc. IEEE SSAP Workshop pp ). Gaeta, M., & Lacoume, J. L. 199). Source separation without a priori knowledge: The maximum likelihood solution. In Proc. EUSIPCO pp ). Golub, G. H., & Van Loan, C. F. 1989). Matrix computations. Baltimore: Johns Hopkins University Press. MacKay, D. J. C. 1996). Maximum likelihood and covariant algorithms for independent component analysis. In preparation. Unpublished manuscript. Makeig, S., Bell, A., Jung, T.-P., & Sejnowski, T. J. 1997). Blind separation of auditory event-related brain responses into independent components. Proc. Nat. Acad. Sci. USA, 94, McCullagh, P. 1987). Tensor methods in statistics. London: Chapman and Hall. Moreau, E., & Macchi, O. 1996). High order contrasts for self-adaptive source separation. International Journal of Adaptive Control and Signal Processing, 1, Moulines, E., Cardoso, J.-F., & Gassiat, E. 1997). Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models. In Proc. ICASSP 97 pp ). Nadal, J.-P., & Parga, N. 1994). Nonlinear neurons in the low-noise limit: A factorial code maximizes information transfer. NETWORK, 5, Obradovic, D., & Deco, G. 1997). Unsupervised learning for blind source separation: An information-theoretic approach. In Proc. ICASSP pp ). Pearlmutter, B. A., & Parra, L. C. 1996). A context-sensitive generalization of ICA. In International Conference on Neural Information Processing Hong Kong). Pham, D.-T. 1996). Blind separation of instantaneous mixture of sources via an independent component analysis. IEEE Trans. on Sig. Proc., 44, Pham, D.-T., & Garat, P. 1997). Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Tr. SP, 45, Souloumiac, A.,&Cardoso, J.-F.1991).Comparaison de méthodes de séparation de sources. In Proc. GRETSI. Juan les Pins, France pp ). Received February 7, 1998; accepted June 25, 1998.

### Introduction. Independent Component Analysis

Independent Component Analysis 1 Introduction Independent component analysis (ICA) is a method for finding underlying factors or components from multivariate (multidimensional) statistical data. What distinguishes

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

### Summary of week 8 (Lectures 22, 23 and 24)

WEEK 8 Summary of week 8 (Lectures 22, 23 and 24) This week we completed our discussion of Chapter 5 of [VST] Recall that if V and W are inner product spaces then a linear map T : V W is called an isometry

### ELEC-E8104 Stochastics models and estimation, Lecture 3b: Linear Estimation in Static Systems

Stochastics models and estimation, Lecture 3b: Linear Estimation in Static Systems Minimum Mean Square Error (MMSE) MMSE estimation of Gaussian random vectors Linear MMSE estimator for arbitrarily distributed

### Solution of Linear Systems

Chapter 3 Solution of Linear Systems In this chapter we study algorithms for possibly the most commonly occurring problem in scientific computing, the solution of linear systems of equations. We start

### Nonlinear Iterative Partial Least Squares Method

Numerical Methods for Determining Principal Component Analysis Abstract Factors Béchu, S., Richard-Plouet, M., Fernandez, V., Walton, J., and Fairley, N. (2016) Developments in numerical treatments for

### Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

### Independent Component Analysis: Algorithms and Applications

Independent Component Analysis: Algorithms and Applications Aapo Hyvärinen and Erkki Oja Neural Networks Research Centre Helsinki University of Technology P.O. Box 5400, FIN-02015 HUT, Finland Neural Networks,

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Factorization Theorems

Chapter 7 Factorization Theorems This chapter highlights a few of the many factorization theorems for matrices While some factorization results are relatively direct, others are iterative While some factorization

### Iterative Methods for Computing Eigenvalues and Eigenvectors

The Waterloo Mathematics Review 9 Iterative Methods for Computing Eigenvalues and Eigenvectors Maysum Panju University of Waterloo mhpanju@math.uwaterloo.ca Abstract: We examine some numerical iterative

### LINEAR ALGEBRA W W L CHEN

LINEAR ALGEBRA W W L CHEN c W W L Chen, 1997, 2008 This chapter is available free to all individuals, on understanding that it is not to be used for financial gain, and may be downloaded and/or photocopied,

### Linear Systems. Singular and Nonsingular Matrices. Find x 1, x 2, x 3 such that the following three equations hold:

Linear Systems Example: Find x, x, x such that the following three equations hold: x + x + x = 4x + x + x = x + x + x = 6 We can write this using matrix-vector notation as 4 {{ A x x x {{ x = 6 {{ b General

### CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES. From Exploratory Factor Analysis Ledyard R Tucker and Robert C.

CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES From Exploratory Factor Analysis Ledyard R Tucker and Robert C MacCallum 1997 180 CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES In

### Using the Singular Value Decomposition

Using the Singular Value Decomposition Emmett J. Ientilucci Chester F. Carlson Center for Imaging Science Rochester Institute of Technology emmett@cis.rit.edu May 9, 003 Abstract This report introduces

### Eigenvalues and Eigenvectors

Chapter 6 Eigenvalues and Eigenvectors 6. Introduction to Eigenvalues Linear equations Ax D b come from steady state problems. Eigenvalues have their greatest importance in dynamic problems. The solution

### State of Stress at Point

State of Stress at Point Einstein Notation The basic idea of Einstein notation is that a covector and a vector can form a scalar: This is typically written as an explicit sum: According to this convention,

### Numerical Analysis Lecture Notes

Numerical Analysis Lecture Notes Peter J. Olver 5. Inner Products and Norms The norm of a vector is a measure of its size. Besides the familiar Euclidean norm based on the dot product, there are a number

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Mathematics Course 111: Algebra I Part IV: Vector Spaces

Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 1996-7 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are

### The Characteristic Polynomial

Physics 116A Winter 2011 The Characteristic Polynomial 1 Coefficients of the characteristic polynomial Consider the eigenvalue problem for an n n matrix A, A v = λ v, v 0 (1) The solution to this problem

### Nonlinear Blind Source Separation and Independent Component Analysis

Nonlinear Blind Source Separation and Independent Component Analysis Prof. Juha Karhunen Helsinki University of Technology Neural Networks Research Centre Espoo, Finland Helsinki University of Technology,

### 1 Introduction. 2 Matrices: Definition. Matrix Algebra. Hervé Abdi Lynne J. Williams

In Neil Salkind (Ed.), Encyclopedia of Research Design. Thousand Oaks, CA: Sage. 00 Matrix Algebra Hervé Abdi Lynne J. Williams Introduction Sylvester developed the modern concept of matrices in the 9th

### Structure of the Root Spaces for Simple Lie Algebras

Structure of the Root Spaces for Simple Lie Algebras I. Introduction A Cartan subalgebra, H, of a Lie algebra, G, is a subalgebra, H G, such that a. H is nilpotent, i.e., there is some n such that (H)

### Inner Product Spaces and Orthogonality

Inner Product Spaces and Orthogonality week 3-4 Fall 2006 Dot product of R n The inner product or dot product of R n is a function, defined by u, v a b + a 2 b 2 + + a n b n for u a, a 2,, a n T, v b,

### Factor analysis. Angela Montanari

Factor analysis Angela Montanari 1 Introduction Factor analysis is a statistical model that allows to explain the correlations between a large number of observed correlated variables through a small number

### 1 Short Introduction to Time Series

ECONOMICS 7344, Spring 202 Bent E. Sørensen January 24, 202 Short Introduction to Time Series A time series is a collection of stochastic variables x,.., x t,.., x T indexed by an integer value t. The

### 1 VECTOR SPACES AND SUBSPACES

1 VECTOR SPACES AND SUBSPACES What is a vector? Many are familiar with the concept of a vector as: Something which has magnitude and direction. an ordered pair or triple. a description for quantities such

### CS3220 Lecture Notes: QR factorization and orthogonal transformations

CS3220 Lecture Notes: QR factorization and orthogonal transformations Steve Marschner Cornell University 11 March 2009 In this lecture I ll talk about orthogonal matrices and their properties, discuss

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation

### Chapter 6. Orthogonality

6.3 Orthogonal Matrices 1 Chapter 6. Orthogonality 6.3 Orthogonal Matrices Definition 6.4. An n n matrix A is orthogonal if A T A = I. Note. We will see that the columns of an orthogonal matrix must be

### 3 Orthogonal Vectors and Matrices

3 Orthogonal Vectors and Matrices The linear algebra portion of this course focuses on three matrix factorizations: QR factorization, singular valued decomposition (SVD), and LU factorization The first

### CONTROLLABILITY. Chapter 2. 2.1 Reachable Set and Controllability. Suppose we have a linear system described by the state equation

Chapter 2 CONTROLLABILITY 2 Reachable Set and Controllability Suppose we have a linear system described by the state equation ẋ Ax + Bu (2) x() x Consider the following problem For a given vector x in

### MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

### ADVANCED LINEAR ALGEBRA FOR ENGINEERS WITH MATLAB. Sohail A. Dianat. Rochester Institute of Technology, New York, U.S.A. Eli S.

ADVANCED LINEAR ALGEBRA FOR ENGINEERS WITH MATLAB Sohail A. Dianat Rochester Institute of Technology, New York, U.S.A. Eli S. Saber Rochester Institute of Technology, New York, U.S.A. (g) CRC Press Taylor

### Tensors on a vector space

APPENDIX B Tensors on a vector space In this Appendix, we gather mathematical definitions and results pertaining to tensors. The purpose is mostly to introduce the modern, geometrical view on tensors,

### Mathematics of Cryptography

CHAPTER 2 Mathematics of Cryptography Part I: Modular Arithmetic, Congruence, and Matrices Objectives This chapter is intended to prepare the reader for the next few chapters in cryptography. The chapter

### A Introduction to Matrix Algebra and Principal Components Analysis

A Introduction to Matrix Algebra and Principal Components Analysis Multivariate Methods in Education ERSH 8350 Lecture #2 August 24, 2011 ERSH 8350: Lecture 2 Today s Class An introduction to matrix algebra

### Inner Product Spaces

Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and

### Review Jeopardy. Blue vs. Orange. Review Jeopardy

Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 0-3 Jeopardy Round \$200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?

### Eigenvalues, Eigenvectors, Matrix Factoring, and Principal Components

Eigenvalues, Eigenvectors, Matrix Factoring, and Principal Components The eigenvalues and eigenvectors of a square matrix play a key role in some important operations in statistics. In particular, they

### Similarity and Diagonalization. Similar Matrices

MATH022 Linear Algebra Brief lecture notes 48 Similarity and Diagonalization Similar Matrices Let A and B be n n matrices. We say that A is similar to B if there is an invertible n n matrix P such that

### Tangent and normal lines to conics

4.B. Tangent and normal lines to conics Apollonius work on conics includes a study of tangent and normal lines to these curves. The purpose of this document is to relate his approaches to the modern viewpoints

### MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS 1. SYSTEMS OF EQUATIONS AND MATRICES 1.1. Representation of a linear system. The general system of m equations in n unknowns can be written a 11 x 1 + a 12 x 2 +

### LINEAR ALGEBRA. September 23, 2010

LINEAR ALGEBRA September 3, 00 Contents 0. LU-decomposition.................................... 0. Inverses and Transposes................................. 0.3 Column Spaces and NullSpaces.............................

### Elasticity Theory Basics

G22.3033-002: Topics in Computer Graphics: Lecture #7 Geometric Modeling New York University Elasticity Theory Basics Lecture #7: 20 October 2003 Lecturer: Denis Zorin Scribe: Adrian Secord, Yotam Gingold

### Linear Algebra Done Wrong. Sergei Treil. Department of Mathematics, Brown University

Linear Algebra Done Wrong Sergei Treil Department of Mathematics, Brown University Copyright c Sergei Treil, 2004, 2009, 2011, 2014 Preface The title of the book sounds a bit mysterious. Why should anyone

### Matrix Algebra LECTURE 1. Simultaneous Equations Consider a system of m linear equations in n unknowns: y 1 = a 11 x 1 + a 12 x 2 + +a 1n x n,

LECTURE 1 Matrix Algebra Simultaneous Equations Consider a system of m linear equations in n unknowns: y 1 a 11 x 1 + a 12 x 2 + +a 1n x n, (1) y 2 a 21 x 1 + a 22 x 2 + +a 2n x n, y m a m1 x 1 +a m2 x

### Markov Chains, Stochastic Processes, and Advanced Matrix Decomposition

Markov Chains, Stochastic Processes, and Advanced Matrix Decomposition Jack Gilbert Copyright (c) 2014 Jack Gilbert. Permission is granted to copy, distribute and/or modify this document under the terms

### 3. Let A and B be two n n orthogonal matrices. Then prove that AB and BA are both orthogonal matrices. Prove a similar result for unitary matrices.

Exercise 1 1. Let A be an n n orthogonal matrix. Then prove that (a) the rows of A form an orthonormal basis of R n. (b) the columns of A form an orthonormal basis of R n. (c) for any two vectors x,y R

### Hypothesis Testing in the Classical Regression Model

LECTURE 5 Hypothesis Testing in the Classical Regression Model The Normal Distribution and the Sampling Distributions It is often appropriate to assume that the elements of the disturbance vector ε within

### Diagonalisation. Chapter 3. Introduction. Eigenvalues and eigenvectors. Reading. Definitions

Chapter 3 Diagonalisation Eigenvalues and eigenvectors, diagonalisation of a matrix, orthogonal diagonalisation fo symmetric matrices Reading As in the previous chapter, there is no specific essential

### The basic unit in matrix algebra is a matrix, generally expressed as: a 11 a 12. a 13 A = a 21 a 22 a 23

(copyright by Scott M Lynch, February 2003) Brief Matrix Algebra Review (Soc 504) Matrix algebra is a form of mathematics that allows compact notation for, and mathematical manipulation of, high-dimensional

### Communication on the Grassmann Manifold: A Geometric Approach to the Noncoherent Multiple-Antenna Channel

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 2, FEBRUARY 2002 359 Communication on the Grassmann Manifold: A Geometric Approach to the Noncoherent Multiple-Antenna Channel Lizhong Zheng, Student

### Bindel, Fall 2012 Matrix Computations (CS 6210) Week 8: Friday, Oct 12

Why eigenvalues? Week 8: Friday, Oct 12 I spend a lot of time thinking about eigenvalue problems. In part, this is because I look for problems that can be solved via eigenvalues. But I might have fewer

### 2.5 Gaussian Elimination

page 150 150 CHAPTER 2 Matrices and Systems of Linear Equations 37 10 the linear algebra package of Maple, the three elementary 20 23 1 row operations are 12 1 swaprow(a,i,j): permute rows i and j 3 3

### Understanding and Applying Kalman Filtering

Understanding and Applying Kalman Filtering Lindsay Kleeman Department of Electrical and Computer Systems Engineering Monash University, Clayton 1 Introduction Objectives: 1. Provide a basic understanding

### Basics Inversion and related concepts Random vectors Matrix calculus. Matrix algebra. Patrick Breheny. January 20

Matrix algebra January 20 Introduction Basics The mathematics of multiple regression revolves around ordering and keeping track of large arrays of numbers and solving systems of equations The mathematical

### Least-Squares Intersection of Lines

Least-Squares Intersection of Lines Johannes Traa - UIUC 2013 This write-up derives the least-squares solution for the intersection of lines. In the general case, a set of lines will not intersect at a

### MATH 240 Fall, Chapter 1: Linear Equations and Matrices

MATH 240 Fall, 2007 Chapter Summaries for Kolman / Hill, Elementary Linear Algebra, 9th Ed. written by Prof. J. Beachy Sections 1.1 1.5, 2.1 2.3, 4.2 4.9, 3.1 3.5, 5.3 5.5, 6.1 6.3, 6.5, 7.1 7.3 DEFINITIONS

### Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics

INTERNATIONAL BLACK SEA UNIVERSITY COMPUTER TECHNOLOGIES AND ENGINEERING FACULTY ELABORATION OF AN ALGORITHM OF DETECTING TESTS DIMENSIONALITY Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### SF2940: Probability theory Lecture 8: Multivariate Normal Distribution

SF2940: Probability theory Lecture 8: Multivariate Normal Distribution Timo Koski 24.09.2015 Timo Koski Matematisk statistik 24.09.2015 1 / 1 Learning outcomes Random vectors, mean vector, covariance matrix,

### 1 Sets and Set Notation.

LINEAR ALGEBRA MATH 27.6 SPRING 23 (COHEN) LECTURE NOTES Sets and Set Notation. Definition (Naive Definition of a Set). A set is any collection of objects, called the elements of that set. We will most

### BLOCK JACOBI-TYPE METHODS FOR LOG-LIKELIHOOD BASED LINEAR INDEPENDENT SUBSPACE ANALYSIS

BLOCK JACOBI-TYPE METHODS FOR LOG-LIKELIHOOD BASED LINEAR INDEPENDENT SUBSPACE ANALYSIS Hao Shen, Knut Hüper National ICT Australia, Australia, and The Australian National University, Australia Martin

### ISOMETRIES OF R n KEITH CONRAD

ISOMETRIES OF R n KEITH CONRAD 1. Introduction An isometry of R n is a function h: R n R n that preserves the distance between vectors: h(v) h(w) = v w for all v and w in R n, where (x 1,..., x n ) = x

### Linear Algebra Notes for Marsden and Tromba Vector Calculus

Linear Algebra Notes for Marsden and Tromba Vector Calculus n-dimensional Euclidean Space and Matrices Definition of n space As was learned in Math b, a point in Euclidean three space can be thought of

### (Quasi-)Newton methods

(Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650

### DETERMINANTS. b 2. x 2

DETERMINANTS 1 Systems of two equations in two unknowns A system of two equations in two unknowns has the form a 11 x 1 + a 12 x 2 = b 1 a 21 x 1 + a 22 x 2 = b 2 This can be written more concisely in

### By choosing to view this document, you agree to all provisions of the copyright laws protecting it.

This material is posted here with permission of the IEEE Such permission of the IEEE does not in any way imply IEEE endorsement of any of Helsinki University of Technology's products or services Internal

### The Singular Value Decomposition in Symmetric (Löwdin) Orthogonalization and Data Compression

The Singular Value Decomposition in Symmetric (Löwdin) Orthogonalization and Data Compression The SVD is the most generally applicable of the orthogonal-diagonal-orthogonal type matrix decompositions Every

### CSE 494 CSE/CBS 598 (Fall 2007): Numerical Linear Algebra for Data Exploration Clustering Instructor: Jieping Ye

CSE 494 CSE/CBS 598 Fall 2007: Numerical Linear Algebra for Data Exploration Clustering Instructor: Jieping Ye 1 Introduction One important method for data compression and classification is to organize

### Numerical Analysis Lecture Notes

Numerical Analysis Lecture Notes Peter J. Olver 6. Eigenvalues and Singular Values In this section, we collect together the basic facts about eigenvalues and eigenvectors. From a geometrical viewpoint,

### Linear Algebra Done Wrong. Sergei Treil. Department of Mathematics, Brown University

Linear Algebra Done Wrong Sergei Treil Department of Mathematics, Brown University Copyright c Sergei Treil, 2004, 2009, 2011, 2014 Preface The title of the book sounds a bit mysterious. Why should anyone

### Math Review. for the Quantitative Reasoning Measure of the GRE revised General Test

Math Review for the Quantitative Reasoning Measure of the GRE revised General Test www.ets.org Overview This Math Review will familiarize you with the mathematical skills and concepts that are important

### a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2.

Chapter 1 LINEAR EQUATIONS 1.1 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,..., a n, b are given

### Probability and Statistics

CHAPTER 2: RANDOM VARIABLES AND ASSOCIATED FUNCTIONS 2b - 0 Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be

### Linear Codes. Chapter 3. 3.1 Basics

Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length

### ASEN 3112 - Structures. MDOF Dynamic Systems. ASEN 3112 Lecture 1 Slide 1

19 MDOF Dynamic Systems ASEN 3112 Lecture 1 Slide 1 A Two-DOF Mass-Spring-Dashpot Dynamic System Consider the lumped-parameter, mass-spring-dashpot dynamic system shown in the Figure. It has two point

### The equivalence of logistic regression and maximum entropy models

The equivalence of logistic regression and maximum entropy models John Mount September 23, 20 Abstract As our colleague so aptly demonstrated ( http://www.win-vector.com/blog/20/09/the-simplerderivation-of-logistic-regression/

### Notes for STA 437/1005 Methods for Multivariate Data

Notes for STA 437/1005 Methods for Multivariate Data Radford M. Neal, 26 November 2010 Random Vectors Notation: Let X be a random vector with p elements, so that X = [X 1,..., X p ], where denotes transpose.

### Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Chapter 10 Boundary Value Problems for Ordinary Differential Equations Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign

### Statistical machine learning, high dimension and big data

Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP - Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,

### α = u v. In other words, Orthogonal Projection

Orthogonal Projection Given any nonzero vector v, it is possible to decompose an arbitrary vector u into a component that points in the direction of v and one that points in a direction orthogonal to v

### Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

### Lecture Topic: Low-Rank Approximations

Lecture Topic: Low-Rank Approximations Low-Rank Approximations We have seen principal component analysis. The extraction of the first principle eigenvalue could be seen as an approximation of the original

### CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

### Section 1.1. Introduction to R n

The Calculus of Functions of Several Variables Section. Introduction to R n Calculus is the study of functional relationships and how related quantities change with each other. In your first exposure to

### 26. Determinants I. 1. Prehistory

26. Determinants I 26.1 Prehistory 26.2 Definitions 26.3 Uniqueness and other properties 26.4 Existence Both as a careful review of a more pedestrian viewpoint, and as a transition to a coordinate-independent

### ECEN 5682 Theory and Practice of Error Control Codes

ECEN 5682 Theory and Practice of Error Control Codes Convolutional Codes University of Colorado Spring 2007 Linear (n, k) block codes take k data symbols at a time and encode them into n code symbols.

### Linear Least Squares

Linear Least Squares Suppose we are given a set of data points {(x i,f i )}, i = 1,...,n. These could be measurements from an experiment or obtained simply by evaluating a function at some points. One

### Lecture Notes to Accompany. Scientific Computing An Introductory Survey. by Michael T. Heath. Chapter 10

Lecture Notes to Accompany Scientific Computing An Introductory Survey Second Edition by Michael T. Heath Chapter 10 Boundary Value Problems for Ordinary Differential Equations Copyright c 2001. Reproduction

### Notes on Factoring. MA 206 Kurt Bryan

The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor

### Numerical Methods I Solving Linear Systems: Sparse Matrices, Iterative Methods and Non-Square Systems

Numerical Methods I Solving Linear Systems: Sparse Matrices, Iterative Methods and Non-Square Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course G63.2010.001 / G22.2420-001,