Understanding Big Data Spectral Clustering
To cite this version: Romain Couillet, Florent Benaych-Georges. Understanding Big Data Spectral Clustering. 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), Dec 2015, Cancun, Mexico. <hal >. Submitted on 25 Sep 2015.

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Romain Couillet (CentraleSupélec, LSS, Université Paris-Sud, Gif-sur-Yvette, France) and Florent Benaych-Georges (MAP5, UMR CNRS 8145, Université Paris Descartes, Paris, France)

Abstract: This article introduces an original approach to understand the behavior of standard kernel spectral clustering algorithms (such as the Ng-Jordan-Weiss method) for large dimensional datasets. Precisely, using advanced methods from the field of random matrix theory and assuming Gaussian data vectors, we show that the Laplacian of the kernel matrix can asymptotically be well approximated by an analytically tractable equivalent random matrix. The study of the latter unveils the mechanisms at play, and in particular the impact of the choice of the kernel function and some theoretical limits of the method. Despite our Gaussian assumption, we also observe that the predicted theoretical behavior is a close match to that experienced on real datasets (taken from the MNIST database).

I. INTRODUCTION

Letting $x_1, \ldots, x_n \in \mathbb{R}^p$ be $n$ data vectors, kernel spectral clustering consists in a variety of algorithms designed to cluster these data in an unsupervised manner by retrieving information from the leading eigenvectors of (a possibly modified version of) the so-called kernel matrix $K = \{K_{ij}\}_{i,j=1}^n$ with, e.g., $K_{ij} = f(\|x_i - x_j\|^2/p)$ for some (usually decreasing) $f: \mathbb{R}^+ \to \mathbb{R}^+$. There are multiple reasons (see, e.g., [1]) to expect that the aforementioned eigenvectors contain information about the optimal data clustering. One of the most prominent of those was put forward by Ng, Jordan, and Weiss in [2], who notice that, if the data are ideally well split in $k$ classes $\mathcal{C}_1, \ldots, \mathcal{C}_k$ that ensure $f(\|x_i - x_j\|^2/p) = 0$ if and only if $x_i$ and $x_j$ belong to distinct classes, then the eigenvectors associated with the $k$ smallest eigenvalues of $I_n - D^{-1/2} K D^{-1/2}$, with $D \triangleq \mathcal{D}(K 1_n)$, live in the span of the canonical class-wise basis vectors. In the non-trivial case where such a separating $f$ does not exist, one would thus
expect the leading eigenvectors to be instead perturbed versions of these indicator vectors. We shall precisely study the matrix $I_n - D^{-1/2} K D^{-1/2}$ in this article. Nonetheless, despite this conspicuous argument, very little is known about the performance of kernel spectral clustering in actual working conditions. In particular, to the authors' knowledge, there exists no contribution addressing the case of arbitrary $p$ and $n$. In this article, we propose a new approach consisting in assuming that both $p$ and $n$ are large, and exploiting recent results from random matrix theory. Our method is inspired by [3], which studies the asymptotic distribution of the eigenvalues of $K$ for i.i.d. vectors $x_i$. We generalize here [3] by assuming that the $x_i$'s are drawn from a mixture of $k$ Gaussian vectors having means $\mu_1, \ldots, \mu_k$ and covariances $C_1, \ldots, C_k$. We then go further by studying the resulting model and showing that $L = D^{-1/2} K D^{-1/2}$ can be approximated by a matrix of the so-called spiked model type [4], [5], that is, a matrix with clustered eigenvalues and a few isolated outliers. (Couillet's work is supported by the RMT4GRAPH project, ANR-14-CE.) Among other results, our main findings are: (i) in the large $n, p$ regime, only a very local aspect of the kernel function really matters for clustering; (ii) there exists a critical growth regime (with $p$ and $n$) of the $\mu_i$'s and $C_i$'s for which spectral clustering leads to non-trivial misclustering probability; (iii) we precisely analyze elementary toy models, in which the number of exploitable eigenvectors and the influence of the kernel function may vary significantly. On top of these theoretical findings, we shall observe that, quite unexpectedly, the kernel spectral algorithms behave similarly to our theoretical findings on real datasets. We precisely see that clustering performed upon a subset of the MNIST (handwritten digits) database behaves as though the vectorized images were extracted from a Gaussian mixture.

Notations: The norm $\|\cdot\|$ stands for the Euclidean norm for vectors and the operator norm for
matrices. The vector $1_m \in \mathbb{R}^m$ stands for the vector filled with ones. The operator $\mathcal{D}(v) = \mathcal{D}(\{v_a\}_{a=1}^k)$ is the diagonal matrix having $v_1, \ldots, v_k$ (scalars or vectors) down its diagonal. The Dirac mass at $x$ is $\delta_x$. Almost sure convergence is denoted $\overset{\mathrm{a.s.}}{\longrightarrow}$, and convergence in distribution $\overset{\mathcal{D}}{\longrightarrow}$.

II. MODEL AND THEORETICAL RESULTS

Let $x_1, \ldots, x_n \in \mathbb{R}^p$ be independent vectors with $x_{n_1 + \cdots + n_{l-1} + 1}, \ldots, x_{n_1 + \cdots + n_l} \in \mathcal{C}_l$ for each $l \in \{1, \ldots, k\}$, where $n_0 = 0$ and $n_1 + \cdots + n_k = n$. Class $\mathcal{C}_a$ encompasses data $x_i = \mu_a + w_i$ for some $\mu_a \in \mathbb{R}^p$ and $w_i \sim \mathcal{N}(0, C_a)$, with $C_a \in \mathbb{R}^{p \times p}$ nonnegative definite. We shall consider the large dimensional regime where both $n$ and $p$ grow simultaneously large. In this regime, we shall require the $\mu_i$'s and $C_i$'s to behave in a precise manner. As a matter of fact, we may state as a first result that the following set of assumptions forms the exact regime under which spectral clustering is a non-trivial problem.

Assumption 1 (Growth Rate): As $n \to \infty$, $\frac{p}{n} \to c_0 > 0$ and $\frac{n_a}{n} \to c_a > 0$ (we will write $c = [c_1, \ldots, c_k]^T$). Besides,
1) for $\mu \triangleq \sum_{a=1}^k \frac{n_a}{n} \mu_a$ and $\mu_a^\circ \triangleq \mu_a - \mu$, $\|\mu_a^\circ\| = O(1)$;
2) for $C \triangleq \sum_{a=1}^k \frac{n_a}{n} C_a$ and $C_a^\circ \triangleq C_a - C$, $\|C_a^\circ\| = O(1)$ and $\mathrm{tr}\, C_a^\circ = O(\sqrt{n})$;
3) as $p \to \infty$, $\frac{2}{p} \mathrm{tr}\, C \to \tau > 0$.
The value $\tau$ is important since $\frac{1}{p} \|x_i - x_j\|^2 \overset{\mathrm{a.s.}}{\longrightarrow} \tau$ uniformly over $i \neq j$ in $\{1, \ldots, n\}$.
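This concentration of pairwise distances, and the clustering pipeline it feeds, can be illustrated numerically. The sketch below is not the authors' code: the dimensions, class sizes, and mean separation are arbitrary illustrative choices. It draws a two-class Gaussian mixture fitting Assumption 1, checks that the off-diagonal entries of $\{\frac{1}{p}\|x_i - x_j\|^2\}$ gather around $\tau = \frac{2}{p}\mathrm{tr}\, C$, and then runs Ng-Jordan-Weiss-style spectral clustering on the normalized Laplacian:

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, n_a = 400, 2, 100          # illustrative dimensions, not from the paper
n = k * n_a
# Two classes with C_1 = C_2 = I_p and a mean separation of norm 4 along the
# first coordinate (so ||mu_a - mu|| = O(1), as required by Assumption 1).
mus = np.zeros((k, p))
mus[1, 0] = 4.0
labels = np.repeat(np.arange(k), n_a)
X = rng.standard_normal((n, p)) + mus[labels]      # x_i = mu_a + w_i

# Scaled squared distances: off-diagonal entries concentrate around
# tau = (2/p) tr C = 2 here (item 3 of Assumption 1).
sqnorm = (X ** 2).sum(1)
dist2 = (sqnorm[:, None] + sqnorm[None, :] - 2 * X @ X.T) / p
tau = 2.0
off_diag = dist2[~np.eye(n, dtype=bool)]
print(round(float(off_diag.mean()), 1))            # close to tau

# Kernel matrix K_ij = f(||x_i - x_j||^2 / p) with f(x) = exp(-x/2),
# then the normalized Laplacian I_n - D^{-1/2} K D^{-1/2}.
K = np.exp(-dist2 / 2)
d_inv_sqrt = 1.0 / np.sqrt(K.sum(1))
Lap = np.eye(n) - d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]

# The eigenvector of the second-smallest eigenvalue carries the class
# structure; thresholding it at its median splits the two classes.
eigvals, eigvecs = np.linalg.eigh(Lap)             # ascending eigenvalues
u = eigvecs[:, 1]
pred = (u > np.median(u)).astype(int)
acc = max((pred == labels).mean(), ((1 - pred) == labels).mean())
print(round(float(acc), 2))
```

With these (arbitrary) parameters the mean separation is strong enough for an isolated eigenvalue to appear, so the second eigenvector recovers the classes almost perfectly; shrinking the separation toward the phase-transition regime studied below degrades the accuracy toward coin flipping.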
We now define the kernel function as follows.

Assumption 2 (Kernel function): The function $f$ is three-times continuously differentiable around $\tau$, and $f(\tau) > 0$.

Having defined $f$, we introduce the kernel matrix
$$ K \triangleq \left\{ f\!\left( \frac{1}{p} \|x_i - x_j\|^2 \right) \right\}_{i,j=1}^n. $$
From the previous remark on $\tau$, note that all non-diagonal elements of $K$ tend to $f(\tau)$, and thus $K$ can be entry-wise developed using a Taylor expansion. However, our interest is in (a slightly modified form of) the Laplacian matrix $L \triangleq n D^{-1/2} K D^{-1/2}$, where $D = \mathcal{D}(K 1_n)$ is usually referred to as the degree matrix. Under Assumption 1, $L$ is essentially a rank-one matrix with $D^{1/2} 1_n$ as leading eigenvector (with eigenvalue $n$). To avoid technical difficulties, we shall study the equivalent matrix
$$ L' \triangleq n D^{-1/2} K D^{-1/2} - n \frac{D^{1/2} 1_n 1_n^T D^{1/2}}{1_n^T D 1_n} \qquad (1) $$
(it is clearly equivalent to study $L$ or $L'$, which have the same eigenvalue-eigenvector pairs, except that the pair $(n, D^{1/2} 1_n)$ of $L$ is turned into $(0, D^{1/2} 1_n)$ for $L'$), which we shall show to have all its eigenvalues of order $O(1)$. Our main technical result shows that there is a matrix $\hat{L}$ such that $\|L' - \hat{L}\| \to 0$ almost surely, where $\hat{L}$ follows a tractable random matrix model. Before introducing the latter, we need the following fundamental notations, first for deterministic elements (capital $M$ stands here for means, while $t$, $T$ account for the vector and matrix of traces, and $P$ is the projection matrix onto the space orthogonal to $1_n$):
$$ M \triangleq [\mu_1^\circ, \ldots, \mu_k^\circ] \in \mathbb{R}^{p \times k}, \quad t \triangleq \left\{ \frac{1}{\sqrt{p}} \mathrm{tr}\, C_a^\circ \right\}_{a=1}^k \in \mathbb{R}^k, \quad T \triangleq \left\{ \frac{1}{p} \mathrm{tr}\, C_a^\circ C_b^\circ \right\}_{a,b=1}^k \in \mathbb{R}^{k \times k}, $$
$$ J \triangleq [j_1, \ldots, j_k] \in \mathbb{R}^{n \times k}, \quad P \triangleq I_n - \frac{1}{n} 1_n 1_n^T \in \mathbb{R}^{n \times n}, $$
where $j_a \in \mathbb{R}^n$ is the canonical vector of class $\mathcal{C}_a$, defined by $(j_a)_i = \delta_{x_i \in \mathcal{C}_a}$, and then for random elements
$$ W \triangleq [w_1, \ldots, w_n] \in \mathbb{R}^{p \times n}, \quad \Phi \triangleq \frac{1}{\sqrt{p}} W^T M \in \mathbb{R}^{n \times k}, \quad \psi \triangleq \frac{1}{p} \left\{ \|w_i\|^2 - E[\|w_i\|^2] \right\}_{i=1}^n \in \mathbb{R}^n. $$

Theorem 1 (Random Matrix Equivalent): Let Assumptions 1 and 2 hold and let $L'$ be defined by (1). Then, as $n \to \infty$, $\|L' - \hat{L}\| \overset{\mathrm{a.s.}}{\longrightarrow} 0$, where $\hat{L}$ is given by
$$ \hat{L} \triangleq -2 \frac{f'(\tau)}{f(\tau)} \left[ \frac{1}{p} P W^T W P + U B U^T \right] + F(\tau) I_n $$
with $F(\tau) = \frac{f(0) - f(\tau) + \tau f'(\tau)}{f(\tau)}$,
$$ U \triangleq \left[ \frac{1}{\sqrt{p}} J, \ \Phi, \ \psi \right], \qquad B \triangleq \begin{pmatrix} B_{11} & I_k - 1_k c^T & \left( \frac{5 f'(\tau)}{8 f(\tau)} - \frac{f''(\tau)}{2 f'(\tau)} \right) t \\ I_k - c 1_k^T & 0_{k \times k} & 0_k \\ \left( \frac{5 f'(\tau)}{8 f(\tau)} - \frac{f''(\tau)}{2 f'(\tau)} \right) t^T & 0_k^T & \frac{5 f'(\tau)}{8 f(\tau)} \end{pmatrix}, $$
$$ B_{11} = M^T M + \left( \frac{5 f'(\tau)}{8 f(\tau)} - \frac{f''(\tau)}{2 f'(\tau)} \right) t t^T - \frac{f''(\tau)}{f'(\tau)} T + \frac{p}{n} F(\tau) 1_k 1_k^T, $$
and the case $f'(\tau) = 0$ is obtained by extension by continuity ($f'(\tau) B$ being well defined in the limit $f'(\tau) \to 0$).

From a mathematical standpoint, excluding the identity matrix, when $f'(\tau) \neq 0$, $\hat{L}$ follows a spiked random matrix model, that is, its eigenvalues congregate in bulks except for a few isolated eigenvalues, the eigenvectors of which align to some extent with the eigenvectors of $U B U^T$. When $f'(\tau) = 0$, $\hat{L}$ is merely a small rank matrix. In both cases, the isolated eigenvalue-eigenvector pairs of $\hat{L}$ are amenable to analysis. From a practical aspect, note that $U$ is notably constituted by the vectors $j_a$, while $B$ contains the information about the inter-class mean deviations through $M$, and about the inter-class covariance deviations through $t$ and $T$. As such, the aforementioned isolated eigenvalue-eigenvector pairs are expected to correlate with the canonical class basis $J$, all the more so that $M$, $t$, $T$ have sufficiently strong norms. From the point of view of the kernel function $f$, note that, if $f'(\tau) = 0$, then $M$ vanishes from the expression of $\hat{L}$, thus not allowing spectral clustering to rely on differences in means. Similarly, if $f''(\tau) = 0$, then $T$ vanishes, and thus differences in shape between the covariance matrices cannot be discriminated upon. Finally, if $\frac{5 f'(\tau)}{8 f(\tau)} = \frac{f''(\tau)}{2 f'(\tau)}$, then differences in covariance traces are seemingly not exploitable.

Before introducing our main results, we need the following technical assumption, which ensures that $\frac{1}{p} P W^T W P$ does not in general produce isolated eigenvalues itself (and thus, that the isolated eigenvalues of $\hat{L}$ are solely due to $U B U^T$).

Assumption 3 (Spike control): With $\lambda_1(C_a) \leq \cdots \leq \lambda_p(C_a)$ the eigenvalues of $C_a$, for each $a$, as $n \to \infty$, $\frac{1}{p} \sum_{i=1}^p \delta_{\lambda_i(C_a)} \overset{\mathcal{D}}{\longrightarrow} \nu_a$, with support $\mathrm{supp}(\nu_a)$, and $\max_{1 \leq i \leq p} \mathrm{dist}(\lambda_i(C_a), \mathrm{supp}(\nu_a)) \to 0$.

Theorem 2 (Isolated eigenvalues): Let Assumptions 1-3 hold and define, for $z \in \mathbb{R}$, the $k \times k$ matrix
$$ G_z = h(\tau, z) I_k + D_{\tau, z} - \Gamma_z $$
(again here, the case $f'(\tau) = 0$ is obtained by extension by continuity)
where
$$ h(\tau, z) = 1 + \left( \frac{5 f'(\tau)}{8 f(\tau)} - \frac{f''(\tau)}{2 f'(\tau)} \right) \sum_{i=1}^k c_i g_i(z)^2 \frac{2}{p} \mathrm{tr}\, C_i^2, $$
$$ D_{\tau, z} = h(\tau, z) M^T \left( I_p + \sum_{j=1}^k c_j g_j(z) C_j \right)^{-1} M - h(\tau, z) \frac{f''(\tau)}{f'(\tau)} T + \left( \frac{5 f'(\tau)}{8 f(\tau)} - \frac{f''(\tau)}{2 f'(\tau)} \right) t t^T, $$
$$ \Gamma_z = \mathcal{D}\left( \{ c_a g_a(z) \}_{a=1}^k \right) - \left\{ \frac{c_a g_a(z)\, c_b g_b(z)}{\sum_{i=1}^k c_i g_i(z)} \right\}_{a,b=1}^k, $$
and $g_1(z), \ldots, g_k(z)$ are, for well-chosen $z$, the unique solutions to the system
$$ g_a(z) = c_0 \left( -z + \frac{1}{p} \mathrm{tr}\, C_a \left( I_p + \sum_{i=1}^k c_i g_i(z) C_i \right)^{-1} \right)^{-1}. $$
Let $\rho$, away from the eigenvalue support of $\frac{1}{p} P W^T W P$, be such that $h(\tau, \rho) \neq 0$ and $G_\rho$ has a zero eigenvalue of multiplicity $m_\rho$. Then there exist $m_\rho$ eigenvalues of $L'$ asymptotically close to
$$ -2 \frac{f'(\tau)}{f(\tau)} \rho + \frac{f(0) - f(\tau) + \tau f'(\tau)}{f(\tau)}. $$

We now turn to the more interesting result concerning the eigenvectors. This result is divided in two formulas, concerning (i) the eigenvector $D^{1/2} 1_n$ associated with the eigenvalue $n$ of $L$ and (ii) the remaining eigenvectors, associated with the eigenvalues exhibited in Theorem 2.

Proposition 1 (Eigenvector $D^{1/2} 1_n$): Let Assumptions 1 and 2 hold. Then, for some $\varphi \sim \mathcal{N}(0, I_n)$, almost surely,
$$ \frac{D^{1/2} 1_n}{\sqrt{1_n^T D 1_n}} = \frac{1}{\sqrt{n}} \left[ 1_n + \frac{f'(\tau)}{2 f(\tau)} \left( \frac{1}{\sqrt{c_0 n}} \mathcal{D}\left( \{ t_a 1_{n_a} \}_{a=1}^k \right) 1_n + \mathcal{D}\left( \left\{ \sqrt{\tfrac{2}{p} \mathrm{tr}(C_a^2)}\, 1_{n_a} \right\}_{a=1}^k \right) \varphi \right) \right] + o(1). $$

Theorem 3 (Eigenvector projections): Let Assumptions 1-3 hold. Let also $\lambda_j^p, \ldots, \lambda_{j+m_\rho}^p$ be isolated eigenvalues of $L'$ all converging to $\rho$ as per Theorem 2, and let $\Pi_\rho$ be the projector on the eigenspace associated with these eigenvalues. Then,
$$ \frac{1}{n} J^T \Pi_\rho J = h(\tau, \rho) \sum_{i=1}^{m_\rho} \frac{(V_{r,\rho})_i (V_{l,\rho})_i^T}{(V_{l,\rho})_i^T G'_\rho (V_{r,\rho})_i} + o(1) $$
almost surely, where $V_{r,\rho}, V_{l,\rho} \in \mathbb{C}^{k \times m_\rho}$ are sets of right and left eigenvectors of $G_\rho$ associated with the eigenvalue zero, and $G'_\rho$ is the derivative of $G_z$ along $z$ taken at $z = \rho$.

From Proposition 1, we get that $D^{1/2} 1_n$ is centered around the sum of the class-wise vectors $t_a j_a$, with fluctuations of amplitude $\frac{2}{p} \mathrm{tr}(C_a^2)$. As for Theorem 3, it states that, as $p, n$ grow large, the alignment between the isolated eigenvectors of $L'$ and the canonical class basis $j_1, \ldots, j_k$ tends to be deterministic in a theoretically tractable manner. In particular, the quantity
$$ \mathrm{tr}\left( \frac{1}{n} \mathcal{D}(c^{-\frac{1}{2}}) J^T \Pi_\rho J\, \mathcal{D}(c^{-\frac{1}{2}}) \right) \in [0, m_\rho] $$
evaluates the
alignment between $\Pi_\rho$ and the canonical class basis, thus providing a first hint at the expected performance of spectral clustering. A second interest of Theorem 3 is that, for eigenvectors $\hat{u}$ of $L'$ of multiplicity one (so $\Pi_\rho = \hat{u} \hat{u}^T$), the diagonal elements of $\frac{1}{n} \mathcal{D}(c^{-\frac{1}{2}}) J^T \Pi_\rho J \mathcal{D}(c^{-\frac{1}{2}})$ provide the squared mean values of the successive first $j_1$, then next $j_2$, etc., elements of $\hat{u}$. The off-diagonal elements of $\frac{1}{n} \mathcal{D}(c^{-\frac{1}{2}}) J^T \Pi_\rho J \mathcal{D}(c^{-\frac{1}{2}})$ then allow one to decide on the signs of $\hat{u}^T j_i$ for each $i$. These pieces of information are crucial to estimate the expected performance of spectral clustering. However, the statements of Theorems 2 and 3 are difficult to interpret as they stand. They become more explicit when applied to simpler scenarios, and allow one to draw interesting conclusions. This is the target of the next section.

III. SPECIAL CASES

In this section, we apply Theorems 2 and 3 to the cases where: (i) $C_i = \beta I_p$ for all $i$, with $\beta > 0$; (ii) all $\mu_i$'s are equal and $C_i = (1 + \gamma_i p^{-\frac{1}{2}}) \beta I_p$.

Assume first that $C_i = \beta I_p$ for all $i$. Then, letting $\ell$ be an isolated eigenvalue of $\beta I_p + M \mathcal{D}(c) M^T$, we get that, if
$$ \ell - \beta > \frac{\beta}{\sqrt{c_0}} \qquad (2) $$
then the matrix $L'$ has an eigenvalue (asymptotically) equal to
$$ -2 \frac{f'(\tau)}{f(\tau)} \left( \ell + \frac{\beta^2}{c_0 (\ell - \beta)} \right) + \frac{f(0) - f(\tau) + \tau f'(\tau)}{f(\tau)}. \qquad (3) $$
Besides, we find that
$$ \frac{1}{n} J^T \Pi_\rho J = \frac{1}{\ell} \left( 1 - \frac{c_0 \beta^2}{(\ell - \beta)^2} \right) \mathcal{D}(c) M^T \Upsilon_\rho \Upsilon_\rho^T M \mathcal{D}(c) + o(1) $$
almost surely, where $\Upsilon_\rho \in \mathbb{R}^{p \times m_\rho}$ are the eigenvectors of $\beta I_p + M \mathcal{D}(c) M^T$ associated with the eigenvalue $\ell$. Aside from the very simple result in itself, note that the choice of $f$ is (asymptotically) irrelevant here. Note also that $M \mathcal{D}(c) M^T$ plays an important role, as its eigenvectors rule the behavior of the eigenvectors of $L'$ used for clustering.

Assume now instead that, for each $i$, $\mu_i = \mu$ and $C_i = (1 + \gamma_i p^{-\frac{1}{2}}) \beta I_p$ for some fixed $\gamma_1, \ldots, \gamma_k \in \mathbb{R}$, and we shall denote $\gamma = [\gamma_1, \ldots, \gamma_k]^T$. Then, if condition (2) is met, we now find after calculus that there exists at most one isolated eigenvalue in $L'$ (beside $n$), again equal (in the limit) to (3), but now for
$$ \ell = \beta^2 \left( \frac{5 f'(\tau)}{8 f(\tau)} - \frac{f''(\tau)}{2 f'(\tau)} \right)^2 \left( 2 + \sum_{i=1}^k c_i \gamma_i^2 \right). $$
Moreover,
$$ \frac{1}{n} J^T \Pi_\rho J = \frac{1}{\sum_{i=1}^k c_i \gamma_i^2} \mathcal{D}(c) \gamma \gamma^T \mathcal{D}(c) + o_P(1). $$
If (2) is not met, there is no isolated eigenvalue beside $n$. We note here the importance of an appropriate choice of $f$. Also observe that $\frac{1}{n} \mathcal{D}(c^{-\frac{1}{2}}) J^T \Pi_\rho J \mathcal{D}(c^{-\frac{1}{2}})$ is proportional to
$\mathcal{D}(c^{\frac{1}{2}}) \gamma \gamma^T \mathcal{D}(c^{\frac{1}{2}})$, and thus the eigenvector aligns strongly with $\mathcal{D}(c^{\frac{1}{2}}) \gamma$ itself. The entries of $\mathcal{D}(c^{\frac{1}{2}}) \gamma$ should thus be quite distinct to achieve good clustering performance.

IV. SIMULATIONS

We complete this article by demonstrating that our results, which in theory apply only to Gaussian $x_i$'s, show a surprisingly similar behavior when applied to real datasets. Here we consider the clustering of $n = 3 \times 64 = 192$ vectorized images of size $p = 784$ from the MNIST training set database (numbers 0, 1, and 2, as shown in Figure 1).

Fig. 1. Samples from the MNIST database, without and with 10 dB noise.

Means and covariances are empirically obtained from the full set of MNIST images. The matrix $L'$ is constructed based on $f(x) = \exp(-x/2)$. Figure 2 shows that the eigenvalues of $L'$ and $\hat{L}$, both in the main bulk and outside, are quite close to one another (precisely, $\|L' - \hat{L}\| / \|L'\| \approx 0.1$).

Fig. 2. Eigenvalues of $L'$ and $\hat{L}$, MNIST data, $p = 784$, $n = 192$.

As for the eigenvectors (displayed in decreasing eigenvalue order), they are in an almost perfect match, as shown in Figure 3. The latter also shows, in thick (blue) lines, the theoretical approximated (signed) diagonal values of $\frac{1}{n} \mathcal{D}(c^{-\frac{1}{2}}) J^T \Pi_\rho J \mathcal{D}(c^{-\frac{1}{2}})$, which also show an extremely accurate match to the empirical class-wise means. Here, the k-means algorithm applied to the four displayed eigenvectors has a correct clustering rate of 86%. Introducing a 10 dB random additive noise to the same MNIST data (see images in Figure 1) brings the approximation error down to $\|L' - \hat{L}\| / \|L'\| \approx 0.04$ and the k-means correct clustering probability to 78% (with only two theoretically exploitable eigenvectors instead of previously four).

Fig. 3. Leading four eigenvectors of $L'$ (red) versus $\hat{L}$ (black) and theoretical class-wise means (blue); MNIST data.

V. CONCLUDING REMARKS

The random matrix analysis of kernel matrices constitutes a first step towards a precise understanding of the underlying mechanism of kernel spectral clustering. Our first
theoretical findings allow one to already have a partial understanding of the leading kernel matrix eigenvectors on which clustering is based. Notably, we precisely identified the (asymptotic) linear combination of the class-basis canonical vectors around which the eigenvectors are centered. Currently on-going work aims at additionally studying the fluctuations of the eigenvectors around the identified means. With all this information, it shall then be possible to precisely evaluate the performance of algorithms such as k-means on the studied datasets. This innovative approach to spectral clustering analysis, we believe, will subsequently allow experimenters to get a clearer picture of the differences between the various classical spectral clustering algorithms (beyond the present Ng-Jordan-Weiss algorithm), and shall eventually allow for the development of finer and better performing techniques, in particular when dealing with high dimensional datasets.

REFERENCES

[1] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[2] A. Y. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," Proceedings of Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, vol. 14, pp. 849-856, 2001.
[3] N. El Karoui, "The spectrum of kernel random matrices," The Annals of Statistics, vol. 38, no. 1, pp. 1-50, 2010.
[4] F. Benaych-Georges and R. R. Nadakuditi, "The singular values and vectors of low rank perturbations of large rectangular random matrices," Journal of Multivariate Analysis, vol. 111, pp. 120-135, 2012.
[5] F. Chapon, R. Couillet, W. Hachem, and X. Mestre, "The outliers among the singular values of large rectangular random matrices with additive fixed rank deformation," Markov Processes and Related Fields, vol. 20, 2014.
More informationInner Product Spaces
Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and
More informationAu = = = 3u. Aw = = = 2w. so the action of A on u and w is very easy to picture: it simply amounts to a stretching by 3 and 2, respectively.
Chapter 7 Eigenvalues and Eigenvectors In this last chapter of our exploration of Linear Algebra we will revisit eigenvalues and eigenvectors of matrices, concepts that were already introduced in Geometry
More informationTowards Unified Tag Data Translation for the Internet of Things
Towards Unified Tag Data Translation for the Internet of Things Loïc Schmidt, Nathalie Mitton, David Simplot-Ryl To cite this version: Loïc Schmidt, Nathalie Mitton, David Simplot-Ryl. Towards Unified
More informationDecember 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation
More informationPerformance Evaluation of Encryption Algorithms Key Length Size on Web Browsers
Performance Evaluation of Encryption Algorithms Key Length Size on Web Browsers Syed Zulkarnain Syed Idrus, Syed Alwee Aljunid, Salina Mohd Asi, Suhizaz Sudin To cite this version: Syed Zulkarnain Syed
More information1 Example of Time Series Analysis by SSA 1
1 Example of Time Series Analysis by SSA 1 Let us illustrate the 'Caterpillar'-SSA technique [1] by the example of time series analysis. Consider the time series FORT (monthly volumes of fortied wine sales
More informationQUALITY ENGINEERING PROGRAM
QUALITY ENGINEERING PROGRAM Production engineering deals with the practical engineering problems that occur in manufacturing planning, manufacturing processes and in the integration of the facilities and
More informationMath 550 Notes. Chapter 7. Jesse Crawford. Department of Mathematics Tarleton State University. Fall 2010
Math 550 Notes Chapter 7 Jesse Crawford Department of Mathematics Tarleton State University Fall 2010 (Tarleton State University) Math 550 Chapter 7 Fall 2010 1 / 34 Outline 1 Self-Adjoint and Normal Operators
More informationRank one SVD: un algorithm pour la visualisation d une matrice non négative
Rank one SVD: un algorithm pour la visualisation d une matrice non négative L. Labiod and M. Nadif LIPADE - Universite ParisDescartes, France ECAIS 2013 November 7, 2013 Outline Outline 1 Data visualization
More informationAdaptive Online Gradient Descent
Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650
More informationState of Stress at Point
State of Stress at Point Einstein Notation The basic idea of Einstein notation is that a covector and a vector can form a scalar: This is typically written as an explicit sum: According to this convention,
More informationSACOC: A spectral-based ACO clustering algorithm
SACOC: A spectral-based ACO clustering algorithm Héctor D. Menéndez, Fernando E. B. Otero, and David Camacho Abstract The application of ACO-based algorithms in data mining is growing over the last few
More informationa 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2.
Chapter 1 LINEAR EQUATIONS 1.1 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,..., a n, b are given
More informationSubspace Analysis and Optimization for AAM Based Face Alignment
Subspace Analysis and Optimization for AAM Based Face Alignment Ming Zhao Chun Chen College of Computer Science Zhejiang University Hangzhou, 310027, P.R.China zhaoming1999@zju.edu.cn Stan Z. Li Microsoft
More informationMultivariate Normal Distribution
Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues
More informationVector and Matrix Norms
Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty
More informationStudy on Cloud Service Mode of Agricultural Information Institutions
Study on Cloud Service Mode of Agricultural Information Institutions Xiaorong Yang, Nengfu Xie, Dan Wang, Lihua Jiang To cite this version: Xiaorong Yang, Nengfu Xie, Dan Wang, Lihua Jiang. Study on Cloud
More informationTowards Collaborative Learning via Shared Artefacts over the Grid
Towards Collaborative Learning via Shared Artefacts over the Grid Cornelia Boldyreff, Phyo Kyaw, Janet Lavery, David Nutter, Stephen Rank To cite this version: Cornelia Boldyreff, Phyo Kyaw, Janet Lavery,
More informationDecision-making with the AHP: Why is the principal eigenvector necessary
European Journal of Operational Research 145 (2003) 85 91 Decision Aiding Decision-making with the AHP: Why is the principal eigenvector necessary Thomas L. Saaty * University of Pittsburgh, Pittsburgh,
More informationGlobal Identity Management of Virtual Machines Based on Remote Secure Elements
Global Identity Management of Virtual Machines Based on Remote Secure Elements Hassane Aissaoui, P. Urien, Guy Pujolle To cite this version: Hassane Aissaoui, P. Urien, Guy Pujolle. Global Identity Management
More informationThe Quantum Harmonic Oscillator Stephen Webb
The Quantum Harmonic Oscillator Stephen Webb The Importance of the Harmonic Oscillator The quantum harmonic oscillator holds a unique importance in quantum mechanics, as it is both one of the few problems
More informationSPECTRAL POLYNOMIAL ALGORITHMS FOR COMPUTING BI-DIAGONAL REPRESENTATIONS FOR PHASE TYPE DISTRIBUTIONS AND MATRIX-EXPONENTIAL DISTRIBUTIONS
Stochastic Models, 22:289 317, 2006 Copyright Taylor & Francis Group, LLC ISSN: 1532-6349 print/1532-4214 online DOI: 10.1080/15326340600649045 SPECTRAL POLYNOMIAL ALGORITHMS FOR COMPUTING BI-DIAGONAL
More informationWhat Development for Bioenergy in Asia: A Long-term Analysis of the Effects of Policy Instruments using TIAM-FR model
What Development for Bioenergy in Asia: A Long-term Analysis of the Effects of Policy Instruments using TIAM-FR model Seungwoo Kang, Sandrine Selosse, Nadia Maïzi To cite this version: Seungwoo Kang, Sandrine
More informationSurgical Tools Recognition and Pupil Segmentation for Cataract Surgical Process Modeling
Surgical Tools Recognition and Pupil Segmentation for Cataract Surgical Process Modeling David Bouget, Florent Lalys, Pierre Jannin To cite this version: David Bouget, Florent Lalys, Pierre Jannin. Surgical
More informationMachine Learning and Pattern Recognition Logistic Regression
Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,
More informationNumerical Analysis Lecture Notes
Numerical Analysis Lecture Notes Peter J. Olver 5. Inner Products and Norms The norm of a vector is a measure of its size. Besides the familiar Euclidean norm based on the dot product, there are a number
More informationLOOKING FOR A GOOD TIME TO BET
LOOKING FOR A GOOD TIME TO BET LAURENT SERLET Abstract. Suppose that the cards of a well shuffled deck of cards are turned up one after another. At any time-but once only- you may bet that the next card
More informationMehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics
INTERNATIONAL BLACK SEA UNIVERSITY COMPUTER TECHNOLOGIES AND ENGINEERING FACULTY ELABORATION OF AN ALGORITHM OF DETECTING TESTS DIMENSIONALITY Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree
More informationSupervised Feature Selection & Unsupervised Dimensionality Reduction
Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or
More informationIntroduction to Matrix Algebra
Psychology 7291: Multivariate Statistics (Carey) 8/27/98 Matrix Algebra - 1 Introduction to Matrix Algebra Definitions: A matrix is a collection of numbers ordered by rows and columns. It is customary
More informationNonlinear Iterative Partial Least Squares Method
Numerical Methods for Determining Principal Component Analysis Abstract Factors Béchu, S., Richard-Plouet, M., Fernandez, V., Walton, J., and Fairley, N. (2016) Developments in numerical treatments for
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
More informationLINEAR ALGEBRA. September 23, 2010
LINEAR ALGEBRA September 3, 00 Contents 0. LU-decomposition.................................... 0. Inverses and Transposes................................. 0.3 Column Spaces and NullSpaces.............................
More informationSimilar matrices and Jordan form
Similar matrices and Jordan form We ve nearly covered the entire heart of linear algebra once we ve finished singular value decompositions we ll have seen all the most central topics. A T A is positive
More informationLinear Algebra Review. Vectors
Linear Algebra Review By Tim K. Marks UCSD Borrows heavily from: Jana Kosecka kosecka@cs.gmu.edu http://cs.gmu.edu/~kosecka/cs682.html Virginia de Sa Cogsci 8F Linear Algebra review UCSD Vectors The length
More informationSolution of Linear Systems
Chapter 3 Solution of Linear Systems In this chapter we study algorithms for possibly the most commonly occurring problem in scientific computing, the solution of linear systems of equations. We start
More informationUniversity of Lille I PC first year list of exercises n 7. Review
University of Lille I PC first year list of exercises n 7 Review Exercise Solve the following systems in 4 different ways (by substitution, by the Gauss method, by inverting the matrix of coefficients
More informationShips Magnetic Anomaly Computation With Integral Equation and Fast Multipole Method
Ships Magnetic Anomaly Computation With Integral Equation and Fast Multipole Method T. S. Nguyen, Jean-Michel Guichon, Olivier Chadebec, Patrice Labie, Jean-Louis Coulomb To cite this version: T. S. Nguyen,
More informationApplication-Aware Protection in DWDM Optical Networks
Application-Aware Protection in DWDM Optical Networks Hamza Drid, Bernard Cousin, Nasir Ghani To cite this version: Hamza Drid, Bernard Cousin, Nasir Ghani. Application-Aware Protection in DWDM Optical
More informationUndulators and wigglers for the new generation of synchrotron sources
Undulators and wigglers for the new generation of synchrotron sources P. Elleaume To cite this version: P. Elleaume. Undulators and wigglers for the new generation of synchrotron sources. Journal de Physique
More informationMaster s Theory Exam Spring 2006
Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem
More informationModern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem
More informationA modeling approach for locating logistics platforms for fast parcels delivery in urban areas
A modeling approach for locating logistics platforms for fast parcels delivery in urban areas Olivier Guyon, Nabil Absi, Dominique Feillet, Thierry Garaix To cite this version: Olivier Guyon, Nabil Absi,
More informationPartial Least Squares (PLS) Regression.
Partial Least Squares (PLS) Regression. Hervé Abdi 1 The University of Texas at Dallas Introduction Pls regression is a recent technique that generalizes and combines features from principal component
More information