BLOCK JACOBI-TYPE METHODS FOR LOG-LIKELIHOOD BASED LINEAR INDEPENDENT SUBSPACE ANALYSIS



Hao Shen, Knut Hüper
National ICT Australia, Australia, and The Australian National University, Australia
Emails: Hao.Shen@rsise.anu.edu.au, Knut.Hueper@nicta.com.au

Martin Kleinsteuber
Department of Mathematics, University of Würzburg, Germany
Email: Kleinsteuber@mathematik.uni-wuerzburg.de

(National ICT Australia is funded by the Australian Government's Department of Communications, Information Technology and the Arts and the Australian Research Council through Backing Australia's Ability and the ICT Research Centre of Excellence programs.)

ABSTRACT

Independent Subspace Analysis (ISA) is a natural generalisation of Independent Component Analysis (ICA) incorporating invariant feature subspaces: mutual statistical independence is required between subspaces, while statistical dependence is still allowed between components within the same subspace. In this paper, we develop a general scheme of block Jacobi-type ISA methods which optimise a popular family of log-likelihood based ISA contrast functions. It turns out that the block Jacobi-type ISA method is an efficient tool for both parametric and nonparametric approaches. A rigorous analysis of the local convergence properties is provided in a general setting. A concrete realisation of the block Jacobi-type ISA method, employing a Newton step strategy, is proposed and shown to be locally quadratically convergent to a correct subspace separation. The performance of the proposed algorithms is investigated by numerical experiments.

1. INTRODUCTION

As a generalisation of standard blind source separation (BSS), the so-called multidimensional blind source separation (MBSS) studies the problem of extracting sources in terms of groups rather than individual signals. Following the success of Independent Component Analysis (ICA) in solving BSS, an analogous statistical tool has been proposed for solving MBSS, namely Multidimensional Independent Component Analysis (MICA) [1], which assumes that components from different groups are mutually statistically independent, while statistical dependence is still allowed between components in the same subspace. Incorporated with invariant feature subspaces, MICA is also referred to as Independent Subspace Analysis (ISA) [2].

In this work, we study the problem of linear ISA from an optimisation point of view. The pioneering work [1] shows that, in general, any standard ICA algorithm can be adapted to solve MICA in two steps: (i) use a standard ICA method to estimate all individual signals; (ii) construct mutually statistically independent subspaces by grouping dependent signals together. The Jacobi-type method is an important tool for solving the standard linear ICA problem. It jointly diagonalises a given set of commuting symmetric matrices, which are constructed in accordance with certain ICA models, such as JADE or MaxKurt [3]. Apart from a full joint diagonalisation of a set of symmetric matrices, it has been shown that the problem of MICA can be solved by a joint block diagonalisation with respect to a fixed block structure [4]. Recently, a class of MICA methods based on joint block diagonalisation has been developed in [4, 5] by performing a standard Jacobi-type method as in [6], followed by certain permutations on the columns of the demixing matrix, to obtain a block diagonalisation of a set of symmetric matrices.
Although the efficiency of these approaches has been verified by numerical evidence, to the best of our knowledge no theory has been developed so far that guarantees that the efficiency and convergence properties of standard ICA algorithms carry over to their MICA/ISA counterparts.

It is well known that the Jacobi-type method is essentially an optimisation procedure. Instead of optimising over a single parameter at a time, the standard Jacobi-type method has been generalised to the so-called block Jacobi-type method [7], which optimises over several parameters simultaneously. It has also been shown that a convenient setting for linear ISA is indeed a flag manifold [8]. In this paper, we develop a general scheme of block Jacobi-type ISA methods on flag manifolds, in order to optimise a popular family of log-likelihood based ISA contrast functions.

The paper is organised as follows. Section 2 briefly introduces the linear ISA model with log-likelihood based ISA contrast functions and a block Jacobi-type method on a flag manifold. In Section 3, we give a critical point analysis of the log-likelihood based ISA contrast function, followed by a study of the Hessian, and propose a general scheme of block Jacobi-type ISA methods. Local convergence results of the proposed methods are presented without proof. By using a Newton step strategy, a concrete block Jacobi-type ISA method is formulated; it shares the same local convergence properties as the general scheme. Analogous results for a similar nonparametric ISA approach, which is based on kernel density estimation, are also discussed. Finally, numerical experiments in Section 4 investigate the performance of the proposed algorithms.
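Since the ICA-then-group strategy recalled above also serves as the comparison method in the experiments of Section 4.2, a minimal sketch may help fix ideas. This is our own illustration, not part of the paper: `ica_demix` stands in for any standard ICA routine, and energy correlation is used as a simple dependence heuristic for the grouping step.

```python
import numpy as np

def ica_then_group(Z, d, ica_demix):
    """Two-step ISA baseline: (i) run a standard ICA on the mixtures Z (m x n),
    (ii) group the recovered components into subspaces of dimension d by the
    correlation of their energies (a common dependence heuristic).
    `ica_demix` is a user-supplied ICA routine returning the recovered signals;
    it is a placeholder, not an algorithm from the paper."""
    S_hat = ica_demix(Z)                      # (m, n) recovered components
    E = S_hat ** 2                            # component energies
    C = np.abs(np.corrcoef(E))                # energy correlation as dependence measure
    m = S_hat.shape[0]
    groups, unassigned = [], list(range(m))
    while unassigned:
        i = unassigned.pop(0)
        # greedily attach the d-1 components most dependent on component i
        ranked = sorted(unassigned, key=lambda j: -C[i, j])
        grp = [i] + ranked[:d - 1]
        for j in grp[1:]:
            unassigned.remove(j)
        groups.append(grp)
    return S_hat, groups
```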

2. PRELIMINARIES: LINEAR ISA MODEL AND BLOCK JACOBI-TYPE METHODS

2.1. Linear ISA Model and Linear ISA Contrast

In this work, we study the standard noiseless linear instantaneous ISA model; refer to [1] for more details. Let

Z = A S,   (1)

where S = [s_1, ..., s_n] ∈ R^{m×n} represents n samples of m sources with m ≤ n, which consist of p mutually statistically independent groups with the dimension of each subspace being d_i, for i = 1, ..., p, and Σ_{i=1}^p d_i = m. The matrix A ∈ R^{m×m} is the full rank mixing matrix, and Z = [z_1, ..., z_n] ∈ R^{m×n} represents the observed mixtures. It is important to notice that mutual statistical independence is ensured only if the sample size n tends to infinity. Nevertheless, for our theoretical analysis in Section 3, we assume that the independence holds even if the sample size is finite.

The task of the linear ISA model is to recover the source signals S in mutually statistically independent groups, based only on the observations Z, via a linear transformation

Q = B^T Z,   (2)

where B ∈ R^{m×m} is the full rank demixing matrix and Q ∈ R^{m×n} represents p independent groups of dependent signals. Let B = [b_1, ..., b_p] ∈ R^{m×m} with b_i ∈ R^{m×d_i} and rk b_i = d_i for i = 1, ..., p. If B* = [b*_1, ..., b*_p] ∈ R^{m×m} is a correct demixing matrix, i.e., every b*_i extracts a statistically independent group of d_i dependent signals, then any B with span b_i = span b*_i, for all i = 1, ..., p, provides a correct separation of independent groups as well.

Let us define r_i := Σ_{j=1}^i d_j for all i = 1, ..., p. It is clear that 0 < r_1 < ... < r_p = m is an increasing sequence of integers. The solution set of the linear ISA problem can then be identified with the collection of ordered sets of p vector subspaces V_i of R^m with dim V_i = r_i for i = 1, ..., p and V_1 ⊂ ... ⊂ V_p = R^m, i.e., the flag manifold Fl(r_1, ..., r_p). In this work, we only study the situation where all independent subspaces have the same dimension d. For the sake of simplicity, in the following we write Fl(p, d) for the flag manifold Fl(r_1, ..., r_p) with r_i = i·d for all i = 1, ..., p.

As in linear ICA, the so-called whitening of the mixtures can be applied to simplify the demixing ISA model (2) as follows

Y = X^T W,   (3)

where W = [w_1, ..., w_n] = V Z ∈ R^{m×n} is the whitened observation (V ∈ R^{m×m} is the whitening matrix), X ∈ R^{m×m} is an orthogonal matrix acting as the demixing matrix, and Y = [y_1, ..., y_n] ∈ R^{m×n} contains the reconstructed p independent groups of signals.

Let us denote the special orthogonal group of order m by SO(m) := { X ∈ R^{m×m} | X^T X = I, det(X) = 1 }. Let X = [x_1, ..., x_p] ∈ SO(m) with x_i ∈ R^{m×d}, i.e., x_i^T x_i = I_d. We define an equivalence relation ∼ on SO(m) as follows: for any X_1, X_2 ∈ SO(m), X_1 ∼ X_2 if and only if span x_{1,i} = span x_{2,i} for all i = 1, ..., p. We denote the equivalence class containing X ∈ SO(m) under ∼ by [X]. Obviously, every equivalence class [X], for X ∈ SO(m), identifies exactly one point in Fl(p, d), and X is a representative of [X] ∈ Fl(p, d).
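For concreteness, here is a minimal numpy sketch (our own illustration; function and variable names are not from the paper) of the whitening step leading to (3) and of splitting a representative X ∈ SO(m) into its column blocks x_1, ..., x_p.

```python
import numpy as np

def whiten(Z):
    """Whiten the observed mixtures Z (m x n): return W = V Z whose empirical
    covariance is the identity, together with the whitening matrix V.
    Assumes a full-rank empirical covariance."""
    Zc = Z - Z.mean(axis=1, keepdims=True)
    C = Zc @ Zc.T / Zc.shape[1]                      # empirical covariance
    evals, evecs = np.linalg.eigh(C)
    V = evecs @ np.diag(evals ** -0.5) @ evecs.T     # symmetric whitening matrix
    return V @ Zc, V

def column_blocks(X, d):
    """Split an orthogonal demixing matrix X in SO(m) into its p = m/d column
    blocks x_1, ..., x_p. Only span(x_k) matters, so X is merely a representative
    of the equivalence class [X] on the flag manifold Fl(p, d)."""
    m = X.shape[0]
    return [X[:, k * d:(k + 1) * d] for k in range(m // d)]
```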
The key idea of linear ISA is to maximise the mutual statistical independence between the norms of the projections of the observations onto a set of linear subspaces. Minimisation of the negative log-likelihood of the recovered signals is a widely used independence criterion in standard ICA. We adapt the same criterion to the linear ISA case as follows

F: Fl(p, d) → R,   F([X]) := −E_i[ Σ_{k=1}^p log ψ( ‖x_k x_k^T w_i‖ ) ],   (4)

where ψ(·) is the differentiable probability density function (PDF) of the norm of the projection of the observations onto a certain linear subspace, and E_i is the empirical mean over the sample index i. It is easily seen that the ISA contrast function (4) is independent of the concrete representative of an equivalence class [X]. The PDF ψ is usually chosen hypothetically, based on the application. For the sake of simplicity, we write G(a) := −log ψ(a). This can be considered as a special parametric approach.
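To make (4) concrete, the following small numpy sketch (our own illustration, not the authors' implementation) evaluates the contrast for a candidate orthogonal demixing matrix X on whitened data W, using the example density ψ(a) ∝ 1/cosh(a), i.e. G(a) = log cosh(a) up to an additive constant, which is the model also used in Section 4.1.

```python
import numpy as np

def isa_contrast(X, W, d):
    """Negative log-likelihood ISA contrast (4):
    F([X]) = E_i sum_k G(||x_k x_k^T w_i||) with G(a) = -log psi(a).
    psi(a) proportional to 1/cosh(a) is used purely as an example density,
    so G(a) = log cosh(a) up to an additive constant."""
    m, n = W.shape
    p = m // d
    value = 0.0
    for k in range(p):
        xk = X[:, k * d:(k + 1) * d]                  # m x d block
        proj = xk @ (xk.T @ W)                        # projections x_k x_k^T w_i
        norms = np.linalg.norm(proj, axis=0)          # ||x_k x_k^T w_i|| per sample
        value += np.mean(np.log(np.cosh(norms)))      # empirical mean over samples
    return value
```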

2.2. Block Jacobi-type Methods on Flag Manifolds

Block Jacobi-type procedures were developed as a generalisation of the standard Jacobi method, in terms of grouped variables, for solving symmetric eigenvalue or singular value problems [9]. Recent work in [7] formulates the so-called block Jacobi-type method as an optimisation approach on manifolds. We now adapt the general formulation of [7] to the present setting, the flag manifold Fl(p, d).

Denote the vector space of all m×m skew-symmetric matrices by so(m) := { Ω ∈ R^{m×m} | Ω^T = −Ω }. Let m = p·d. We fix a subspace B(p, d) ⊂ so(m) with the following block structure: any Ω ∈ B(p, d) consists of p×p blocks of dimension d×d, and the (d×d)-diagonal blocks ω_ll of Ω = (ω_kl)_{k,l=1}^p ∈ B(p, d) are all equal to zero. For example, for p = 3, an Ω ∈ B(3, d) looks like

Ω = [ 0         ω_12       ω_13
      −ω_12^T    0          ω_23
      −ω_13^T    −ω_23^T     0  ],   (5)

where ω_kl = −ω_lk^T ∈ R^{d×d}. By means of the matrix exponential map, a local parameterisation µ_X of Fl(p, d) around [X] is defined as

µ_X: B(p, d) → Fl(p, d),   µ_X(Ω) := [X e^Ω].   (6)

The tangent space of Fl(p, d) at [X] is then

T_[X] Fl(p, d) = (d/dt) µ_X(t B(p, d)) |_{t=0}.   (7)

Now let us decompose B(p, d) as follows

B(p, d) = ⊕_{1≤k<l≤p} B_kl(p, d),   (8)

where all blocks of Ω ∈ B_kl(p, d) ≅ R^{d×d} are equal to zero except for the kl-th and lk-th blocks. We then define

V_kl^[X] := (d/dt) µ_X(t B_kl(p, d)) |_{t=0}.   (9)

It is clear that (V_kl^[X])_{1≤k<l≤p} gives a direct sum decomposition of the tangent space T_[X] Fl(p, d) as well, i.e.,

T_[X] Fl(p, d) = ⊕_{1≤k<l≤p} V_kl^[X].   (10)

The smooth maps

τ_kl: B_kl(p, d) × Fl(p, d) → Fl(p, d),   τ_kl(Ω, [X]) := µ_X(Ω),   (11)

for all 1 ≤ k < l ≤ p, are referred to as the basic transformations. Let f: Fl(p, d) → R be a smooth cost function. A block Jacobi-type method for minimising f can be summarised as follows.

Algorithm 1: Block Jacobi-type method on Fl(p, d)
Step 1: Given an initial guess [X] ∈ Fl(p, d) and a set of basic transformations τ_kl, for all 1 ≤ k < l ≤ p, as defined in (11).
Step 2: (Sweep) Let [X_old] = [X]. For 1 ≤ k < l ≤ p:
  (i) compute Ω* := argmin_{Ω ∈ B_kl(p,d)} (f ∘ τ_kl)(Ω, [X]);
  (ii) update [X] ← τ_kl(Ω*, [X]).
Step 3: If δ([X_old], [X]) is small enough, stop. Otherwise, go to Step 2.

Here δ([X_old], [X]) represents a certain distance measure between two points on Fl(p, d). Following the corresponding convergence result in [7], we state the following theorem without proof.

Theorem 1. Let f: Fl(p, d) → R be a smooth cost function and [X*] ∈ Fl(p, d) a local minimum of f. If the Hessian H_f([X*]) is nondegenerate and the vector subspaces V_kl^[X*], for all 1 ≤ k < l ≤ p, as in (9) are mutually orthogonal with respect to the Hessian H_f([X*]), then the block Jacobi-type method converges locally quadratically fast.
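A schematic rendering of Algorithm 1 in Python may help fix ideas. It is a sketch under our own naming conventions: `solve_subproblem` stands for whatever strategy is chosen for Step 2(i), e.g. the single Newton step developed in Section 3.2, and the stopping test uses a crude matrix distance in place of δ.

```python
import numpy as np
from itertools import combinations
from scipy.linalg import expm

def block_jacobi(X, d, solve_subproblem, tol=1e-10, max_sweeps=50):
    """Block Jacobi-type method on Fl(p, d) as in Algorithm 1.
    `solve_subproblem(X, k, l)` must return the (d x d) block omega_kl that
    parameterises the chosen update in B_kl(p, d); this callback is a
    placeholder for the step described in the text."""
    m = X.shape[0]
    p = m // d
    for _ in range(max_sweeps):
        X_old = X.copy()
        for k, l in combinations(range(p), 2):       # sweep over all k < l
            omega = solve_subproblem(X, k, l)        # d x d block
            Omega = np.zeros((m, m))                 # embed into B_kl(p, d)
            Omega[k * d:(k + 1) * d, l * d:(l + 1) * d] = omega
            Omega[l * d:(l + 1) * d, k * d:(k + 1) * d] = -omega.T
            X = X @ expm(Omega)                      # basic transformation tau_kl
        if np.linalg.norm(X - X_old) < tol:          # crude stand-in for delta
            break
    return X
```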

3. BLOCK JACOBI-TYPE ISA METHODS

3.1. Analysis of Linear ISA Contrasts

In this section, we first show that the log-likelihood based linear ISA contrast function (4) fulfils the conditions stated in Theorem 1, i.e., one can develop a scheme of block Jacobi-type methods, minimising the negative log-likelihood based ISA contrasts, with local quadratic convergence properties.

By the chain rule, the first derivative of the contrast F is calculated as

d/dt (F ∘ µ_X)(tΩ) |_{t=0} = Σ_{1≤k<l≤p} tr[ ω_kl^T ( u_kl(X) − u_lk(X) ) ],   (12)

where u_kl(X), u_lk(X) ∈ R^{d×d} with

u_kl(X) = E_i[ G'(w_i^T x_l x_l^T w_i) x_k^T w_i w_i^T x_l ],  and
u_lk(X) = E_i[ G'(w_i^T x_k x_k^T w_i) x_k^T w_i w_i^T x_l ].   (13)

It can be shown that if [X*] ∈ Fl(p, d) is a correct demixing point, then, by the whitening properties of the sources, the term u_kl(X*) is equal to 0 for all k ≠ l. Thus the first derivative of F vanishes at [X*], i.e., a correct demixing point [X*] is indeed a critical point of F. Note that there exist more critical points than just the correct separation points.

By a straightforward computation, the second derivative of F at a correct separation point [X*] is calculated as

d²/dt² (F ∘ µ_X*)(tΩ) |_{t=0} = Σ_{1≤k<l≤p} tr[ ω_kl^T ( v_kk(X*) + v_ll(X*) ) ω_kl ],   (14)

where

v_kk(X) = E_i[ G''(w_i^T x_k x_k^T w_i) x_k^T w_i w_i^T x_k ] − E_i[ G'(w_i^T x_k x_k^T w_i) x_k^T w_i w_i^T x_k ] + E_i[ G'(w_i^T x_k x_k^T w_i) ] I_d.   (15)

It is clear that the Hessian of F evaluated at [X*] is indeed block diagonal, with each diagonal block of size d×d. Note that the properties in (15) hold true only if statistical independence can be ensured for the sources.

3.2. A Block Jacobi-type ISA Method

According to the results in Section 3.1, we now develop a scheme of block Jacobi-type linear ISA methods. For any 1 ≤ k < l ≤ p, we denote the restriction

µ_X^kl := µ_X |_{B_kl(p,d)}.   (16)

Each partial step in a Jacobi-type sweep (Step 2 in Algorithm 1) requires solving an unconstrained optimisation problem of the form

F ∘ µ_X^kl : B_kl(p, d) ≅ R^{d×d} → R.   (17)

As stated in Algorithm 1, one needs to solve the above subproblem for a global optimum. Unfortunately, this does not seem feasible in the present case (17). Nevertheless, we can still draw the following theoretical conclusion.

Corollary 2. Let [X*] ∈ Fl(p, d) be a correct separation point of a linear ISA problem. Then the block Jacobi-type linear ISA method in the fashion of Algorithm 1 is locally quadratically convergent to [X*].

It is well known that the performance of block Jacobi-type methods depends critically on the methods used to solve the subproblems. In the rest of this section, we formulate a Newton step based realisation of the block Jacobi-type linear ISA method, i.e., rather than seeking a local or global minimum of the restricted subproblem (17), we apply a single Newton optimisation step on each basic transformation. Similar techniques have already been used in [10, 11]. The resulting algorithm preserves the same local quadratic convergence properties as Algorithm 1.

The first and second derivatives of F ∘ µ_X^kl are computed as

d/dt (F ∘ µ_X^kl)(tΩ) |_{t=0} = tr[ ω_kl^T ( u_kl(X) − u_lk(X) ) ],   (18)
d²/dt² (F ∘ µ_X^kl)(tΩ) |_{t=0} = tr[ ω_kl^T ( h_kl^(X)(Ω) + h_lk^(X)(Ω) ) ],

where Ω ∈ B_kl(p, d) and

h_kl^(X)(Ω) = E_i[ G''(w_i^T x_k x_k^T w_i) (x_k^T w_i w_i^T x_l)(w_i^T x_k ω_kl x_l^T w_i) ] − E_i[ G'(w_i^T x_k x_k^T w_i) (x_k^T w_i w_i^T x_k ω_kl) ] + E_i[ G'(w_i^T x_k x_k^T w_i) (ω_kl x_l^T w_i w_i^T x_l) ].   (19)

Thus, a single Newton step is computed by solving the following linear system for Ω ∈ B_kl(p, d):

h_kl^(X)(Ω) + h_lk^(X)(Ω) = u_lk(X) − u_kl(X).   (20)

By recursively iterating the above Newton step on each basic transformation, one obtains the corresponding Newton step based block Jacobi-type ISA method. Its local convergence properties are stated as follows; due to page limits, we omit the proof.

Proposition 3. Let [X*] ∈ Fl(p, d) be a correct separation point of a linear ISA problem. Then the block Jacobi-type linear ISA method employing a single Newton step as in (20) on each basic transformation τ_kl, for all 1 ≤ k < l ≤ p, is locally quadratically convergent to [X*].

3.3. ISA Contrast Using Kernel Density Estimation

As in ICA, the true distribution of the norm of the projected observations is generally unknown. By employing the kernel density estimation technique, a popular nonparametric approach, an empirical negative log-likelihood of the norms of the projected components can be formulated as follows, see [12] for more details:

F̂ : Fl(p, d) → R,   F̂([X]) := −E_i[ Σ_{k=1}^p log( (1/h) E_j[ φ( (w_ij^T x_k x_k^T w_ij)/h ) ] ) ],   (21)

where w_ij := w_i − w_j ∈ R^m is the difference between the i-th and j-th samples, φ: R → R is an appropriate kernel function, e.g., the Gaussian kernel φ(a) = exp(−a), and h ∈ R_+ is the kernel bandwidth.
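For illustration, the following numpy sketch (ours, not the authors' implementation) evaluates the empirical contrast (21) with the Gaussian kernel φ(a) = exp(−a). Note the O(n²) cost of forming all pairwise differences, which is one reason efficient implementations (cf. [13]) matter in practice.

```python
import numpy as np

def empirical_isa_contrast(X, W, d, h):
    """Kernel-density-estimation based ISA contrast (21): for every sample i
    and block k, the density of the projected norm is estimated from the
    pairwise differences w_ij = w_i - w_j with kernel phi(a) = exp(-a) applied
    to the squared projected distances divided by the bandwidth h."""
    m, n = W.shape
    p = m // d
    value = 0.0
    for k in range(p):
        xk = X[:, k * d:(k + 1) * d]
        P = xk.T @ W                                       # d x n block coordinates
        sq = np.sum(P ** 2, axis=0)
        D2 = sq[:, None] + sq[None, :] - 2.0 * (P.T @ P)   # ||x_k^T (w_i - w_j)||^2
        D2 = np.maximum(D2, 0.0)                           # guard rounding errors
        dens = np.mean(np.exp(-D2 / h), axis=1) / h        # KDE value per sample i
        value -= np.mean(np.log(dens))                     # empirical neg. log-likelihood
    return value
```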
Following more tedious but analogous computations as for the general contrast function (4), one can show that (i) a correct separation point [X*] is a critical point of F̂, and (ii) the Hessian of F̂ at [X*] is also block diagonal with respect to the fixed d×d block structure. It then follows directly that the block Jacobi-type method is indeed an efficient tool for minimising the empirical ISA contrast function (21). A block Jacobi-type ISA method for optimising the contrast function (21) can be formulated in the same fashion as in Section 3.2. The convergence properties stated for the general setting in Corollary 2 and Proposition 3 still apply to the empirical situation here. Due to page limits, all descriptions of the algorithm and the proofs of the corresponding convergence results are omitted; for further details, we refer to our forthcoming journal paper.

4. NUMERICAL EXPERIMENTS

In this section, we present two experiments to illustrate the properties of the proposed ISA methods. Section 4.1 demonstrates the local quadratic convergence properties of the Newton step based block Jacobi-type ISA method on an ideal example. In Section 4.2, the Newton step based empirical block Jacobi-type ISA method proposed in Section 3.3 is compared with an ICA based ISA approach in terms of separation quality.

4.1. Experiment 1

As pointed out before, in general, statistical independence holds only if the sample size tends to infinity. This indicates that the theoretical results of Corollary 2 and Proposition 3 cannot generally be observed or verified in a real environment. Nevertheless, in this experiment, by constructing an ideal dataset where statistical independence can be ensured, we illustrate the theoretical convergence result of the Newton step based block Jacobi-type ISA method, i.e., the result in Proposition 3.

Let us first specify the ideal dataset, which consists of three statistically independent signal groups with two dependent signals per group, as shown in Fig. 1, and approximate the PDF by ψ(a) = 1/cosh(a). The convergence is measured by the distance of the current iterate [X_k] to the accumulation point [X*], defined as follows: for [X_1], [X_2] ∈ Fl(p, d),

δ([X_1], [X_2]) := Σ_{i=1}^p ‖ x_{1,i} x_{1,i}^T − x_{2,i} x_{2,i}^T ‖_F,   (22)

where ‖·‖_F is the Frobenius norm. The numerical results in Fig. 2 evidently verify the local quadratic convergence of the Newton step based block Jacobi-type linear ISA method stated in Proposition 3.

Fig. 1. A toy ideal dataset. [panels S1-S6 plotted against the sample index]

Fig. 2. Convergence properties of the Newton step based block Jacobi-type linear ISA method. [δ([X_k], [X*]) versus sweep k]

4.2. Experiment 2

In this experiment, we investigate the separation performance of the Newton step based empirical block Jacobi-type ISA method proposed in Section 3.3. It is compared with the popular approach of applying an ICA method followed by a regrouping process (referred to here as ICA-Group ISA). By fixing the dimension of each subspace to d = 1, the block Jacobi-type ISA methods proposed in Section 3 can easily be adapted to solve the standard linear ICA problem, i.e., they yield a standard Jacobi-type ICA method; we refer to [13] for implementation details. For each test, both methods are initialised at the same separation point, which is close to an optimal solution.

Our test data is generated as follows. First, we take three statistically independent signals at random, with a fixed sample size n, from the benchmark speech dataset provided by the Brain Science Institute, RIKEN, see http://www.bsp.brain.riken.jp/data. By generating a distortion of each signal, we end up with test data having three statistically independent signal groups with two dependent signals per group.

To measure the separation quality in the ISA scenario, a so-called multidimensional performance index (MPI) has been proposed in [4] as a generalisation of the Amari error [14]:

d(C) := Σ_{i=1}^p ( Σ_{j=1}^p ‖c_ij‖ / max_j ‖c_ij‖ − 1 ) + Σ_{j=1}^p ( Σ_{i=1}^p ‖c_ij‖ / max_i ‖c_ij‖ − 1 ),   (23)

where C = (c_ij)_{i,j=1}^p ∈ R^{m×m} with c_ij ∈ R^{d×d} and m = p·d. For a given separation point [X] ∈ Fl(p, d), we define C := X^T V A. Here ‖·‖ represents a certain matrix norm; as suggested in [4], for a given c ∈ R^{d×d}, we take ‖c‖ to be the absolute value of the largest eigenvalue of c. Generally, the smaller the index, the better the separation.
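As an illustration of (23), here is a small numpy routine (our own sketch, not the evaluation code of the paper) that computes the MPI for a block-structured matrix C with d×d blocks, using the largest-magnitude eigenvalue of each block as the block norm, as suggested above.

```python
import numpy as np

def multidim_performance_index(C, d):
    """Multidimensional performance index (23) for an m x m matrix C with
    d x d blocks c_ij; the block 'norm' is taken as the largest-magnitude
    eigenvalue of the block. Smaller values indicate better separation."""
    p = C.shape[0] // d
    N = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            block = C[i * d:(i + 1) * d, j * d:(j + 1) * d]
            N[i, j] = np.max(np.abs(np.linalg.eigvals(block)))
    rows = np.sum(N / N.max(axis=1, keepdims=True), axis=1) - 1.0   # row-wise terms
    cols = np.sum(N / N.max(axis=0, keepdims=True), axis=0) - 1.0   # column-wise terms
    return rows.sum() + cols.sum()
```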

After replicating the test a number of times, we present quartile-based boxplots of the MPI scores in Fig. 3. They show that the Newton step based empirical block Jacobi-type ISA method consistently outperforms the ICA-Group approach in terms of separation quality.

Fig. 3. Separation performance of the proposed method. [boxplots of the MPI for ICA-Group ISA and the block Jacobi-type ISA]

5. REFERENCES

[1] J.-F. Cardoso, "Multidimensional independent component analysis," in Proceedings of the 23rd IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1998), Seattle, WA, USA, 1998, pp. 1941-1944.

[2] A. Hyvärinen and P. O. Hoyer, "Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces," Neural Computation, vol. 12, no. 7, pp. 1705-1720, 2000.

[3] J.-F. Cardoso, "High-order contrasts for independent component analysis," Neural Computation, vol. 11, no. 1, pp. 157-192, 1999.

[4] F. J. Theis, "Blind signal separation into groups of dependent signals using joint block diagonalization," in IEEE International Symposium on Circuits and Systems (ISCAS 2005), Kobe, Japan, 2005, pp. 5878-5881.

[5] F. J. Theis, "Multidimensional independent component analysis using characteristic functions," in Proceedings of the 13th European Signal Processing Conference (EUSIPCO 2005), Antalya, Turkey, 2005.

[6] J.-F. Cardoso and A. Souloumiac, "Jacobi angles for simultaneous diagonalisation," SIAM Journal on Matrix Analysis and Applications, vol. 17, no. 1, pp. 161-164, 1996.

[7] K. Hüper, A Calculus Approach to Matrix Eigenvalue Algorithms, Habilitation Dissertation, Department of Mathematics, University of Würzburg, Germany, July 2002.

[8] Y. Nishimori, S. Akaho, and M. Plumbley, "Riemannian optimization method on the flag manifold for independent subspace analysis," in Proceedings of the 6th International Conference on Independent Component Analysis and Blind Source Separation (ICA 2006), Lecture Notes in Computer Science, vol. 3889, pp. 295-302, Springer-Verlag, Berlin/Heidelberg, 2006.

[9] G. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, 2nd edition, 1989.

[10] J. J. Modi and J. D. Pryce, "Efficient implementation of Jacobi's diagonalization method on the DAP," Numerische Mathematik, vol. 46, no. 3, pp. 443-454, 1985.

[11] J. Götze, S. Paul, and M. Sauer, "An efficient Jacobi-like algorithm for parallel eigenvalue computation," IEEE Transactions on Computers, vol. 42, no. 9, pp. 1058-1065, 1993.

[12] R. Boscolo, H. Pan, and V. P. Roychowdhury, "Independent component analysis based on nonparametric density estimation," IEEE Transactions on Neural Networks, vol. 15, no. 1, pp. 55-65, 2004.

[13] H. Shen, M. Kleinsteuber, and K. Hüper, "Efficient geometric methods for kernel density estimation based independent component analysis," to appear at EUSIPCO 2007, Poznań, Poland, September 3-7, 2007.

[14] S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation," in Advances in Neural Information Processing Systems, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds., vol. 8, pp. 757-763, The MIT Press, 1996.