Memory and Computation Efficient PCA via Very Sparse Random Projections


Farhad Pourkamali-Anaraki, Shannon M. Hughes
Department of Electrical, Computer, and Energy Engineering, University of Colorado at Boulder, CO, 80309, USA

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

Algorithms that can efficiently recover principal components in very high-dimensional, streaming, and/or distributed data settings have become an important topic in the literature. In this paper, we propose an approach to principal component estimation that utilizes projections onto very sparse random vectors with Bernoulli-generated nonzero entries. Indeed, our approach is simultaneously efficient in memory/storage space, efficient in computation, and produces accurate PC estimates, while also allowing for rigorous theoretical performance analysis. Moreover, one can tune the sparsity of the random vectors deliberately to achieve a desired point on the tradeoffs between memory, computation, and accuracy. We rigorously characterize these tradeoffs and provide statistical performance guarantees. In addition to these very sparse random vectors, our analysis also applies to more general random projections. We present experimental results demonstrating that this approach allows for simultaneously achieving a substantial reduction of the computational complexity and memory/storage space, with little loss in accuracy, particularly for very high-dimensional data.

1. Introduction

Principal component analysis (PCA) is a fundamental tool in unsupervised learning and data analysis that finds the low-dimensional linear subspace that minimizes the mean-squared error between the original data and the data projected onto the subspace. The principal components (PCs) can be obtained by a singular value decomposition (SVD) of the data matrix or eigendecomposition of the data's covariance matrix. PCA is frequently used for dimensionality reduction, feature extraction, and as a pre-processing step for learning and recognition tasks such as classification.

There is a wealth of existing literature that develops computationally efficient approaches to computing these PCs. However, the overwhelming majority of this literature assumes ready access to the stored full data samples, and this full data access is not always possible in modern data settings. Modern data acquisition capabilities have increased massively in recent years, which can lead to a wealth of rapidly changing high-dimensional data. Hence, in very large database environments, it may not be feasible or practical to access all the data in storage (Muthukrishnan, 2005). Moreover, in applications such as sensor networks, distributed databases, and surveillance, data is typically distributed over many sensors. Accessing all the data at once requires tremendous communication costs between the sensors and a central processing unit. Algorithms that don't require access to all the data can help reduce this communication cost (Balcan et al., 2013). A third case is streaming data, where one must acquire and store the data in real time to have full access, which may not be feasible.

One promising strategy to address these issues in a computationally efficient way, which also allows for rigorous theoretical analysis, is to use very sparse random projections. Random projections provide informative lower-dimensional representations of high-dimensional data, thereby saving memory and computation. They are widely used in many applications, including databases and data stream processing (Li et al., 2006; Indyk, 2006) and compressive sensing (Donoho, 2006).

Initial attempts have been made to perform PCA using only the information embedded in random projections. Unfortunately, however, theoretical guarantees have generally only been given for random vectors with i.i.d. entries drawn from the Gaussian distribution. This common choice is convenient in terms of theoretical analysis, but undesirable in practice. Such dense random vectors require relatively high storage space, and high computation because of the large amount of floating point arithmetic needed to compute each projection.

In this paper, we instead aim to recover PCs from very sparse random projections with Bernoulli entries. These sparse random projections can be implemented using simple database operations. For example, this type of random projection can be obtained by simply adding two small subsets of the entries of a data sample and then subtracting the results. They thus require little computation or data access. For distributed data, this type of sparse Bernoulli projection could be obtained via localized aggregation in the network requiring minimal communication (assuming all sensors can communicate with one another). (If a network topology must be respected, the sparse random projections could presumably be adjusted accordingly, but we have not yet analyzed this case.) In short, very sparse random projections are or could potentially be extremely practical for a variety of situations.

Our theoretical analysis begins by assuming a probabilistic generative model for the data, related to the spiked covariance model. Under this model, we show that PCs computed from very sparse random projections are close estimators of the true underlying PCs. Moreover, one can adjust the sparsity of the random projections as desired to greatly reduce memory and computation (at the cost of some accuracy). We give rigorous theoretical analysis of the resulting tradeoffs between memory, computation, and accuracy as we vary sparsity, showing that efficiency in memory and computation may be gained with little sacrifice in accuracy. In fact, our analysis will also apply more generally to any random projections with i.i.d. zero-mean entries and bounded second-, fourth-, sixth- and eighth-order moments, although we focus on the sparse-Bernoulli case.

In Section 2, we present a brief review of related work. The model assumptions and notation are in Section 3. We present an overview of the main contributions in Section 4. In Section 5, the main results are stated with some discussion of their consequences. Proofs are reserved to the supplementary material.
Finally, we present experimental results demonstrating the performance and efficiency of our approach compared with prior work in Section 6.

2. Related Work

Algorithms that can efficiently recover PCs from a collection of full data samples have been an important topic in the literature for decades. A comprehensive survey of these algorithms can be found in (Halko et al., 2011b; Gilbert et al., 2012) and the references therein. This includes several lines of work. The first involves techniques that are based on dimensionality reduction, sketching, and sub-sampling for low-rank matrix approximation such as (Halko et al., 2011a). In these methods, the computational complexity is typically reduced by performing SVD on the smaller matrix obtained by sketching or sub-sampling. However, these methods require accessible storage of all the data samples. This may not be practical for modern data processing applications where data samples are too vast or generated too quickly to be stored accessibly.

The second line of work involves online algorithms specifically tailored to have extremely low memory complexity such as (Arora et al., 2012) and the references therein. Typically, these algorithms assume that the data is streaming by, that real-time PC estimates are needed, and they obtain these by solving a stochastic optimization problem, in which each arriving data sample is used to update the PCs in an iterative procedure. As a couple of recent examples of this line of work, (Mitliagkas et al., 2013) show that a blockwise stochastic variant of the power method can recover PCs in this low-memory setting from $O(p \log p)$ samples, although the computational cost is not examined. Meanwhile, (Arora et al., 2013) bound the generalization error of PCs learned with their algorithm to new data samples and also analyze its computational cost.

Our problem lies somewhere between the above two lines of work. We don't assume that memory/data access is not a concern, but at the same time, we also don't assume the extremely restrictive setting where one-sample-at-a-time real-time PC updates are required. Instead, we aim to reduce both memory and computation simultaneously for PCA across a broad class of big data settings, e.g. for enormous databases where loading into local memory may be difficult or costly, for streaming data when PC estimates do not have to be real-time, or for distributed data. We also aim to provide tunable tradeoffs for the amount of accuracy that will be sacrificed for each given reduction in memory/computation, in order to aid in choosing a desired balance point between these.

To do this, we recover PCs from random projections. There have been some related prior attempts to extract PCs from random projections of data (Fowler, 2009; Qi and Hughes, 2012). In both, the problem of recovering PCs from random projections has been considered only for dense Gaussian random projections. However, dense vectors are undesirable for practical applications since they require relatively high storage space and computation (including lots of floating point arithmetic) as noted in the introduction. Our work will make use of sparse random vectors with Bernoulli entries, which will be more efficiently implementable in a large database environment.

Chen et al. (2013) have estimated the covariance matrix of data from general sub-Gaussian random projections to reduce memory use. However, convergence guarantees are given only for the case of infinite data samples, making it hard to realistically use these results in memory/computation vs. accuracy tradeoffs, and computational cost is not examined. We will address both these issues.

As a final note, we observe that our work also can be viewed as an example of emerging ideas in computational statistics (see (Chandrasekaran and Jordan, 2013)) in which tradeoffs between computational complexity, dataset size, and estimation accuracy are explicitly characterized, so that a user may choose to reduce computation in very high-dimensional data settings with knowledge of the risk to the accuracy of the result.

3. Problem Formulation and Notation

In this paper, we focus on a statistical model for the data that is applicable to various scenarios. Assume that our original data in $\mathbb{R}^p$ are centered at $\bar{x} \in \mathbb{R}^p$ and $\{v_i\}_{i=1}^d \subset \mathbb{R}^p$ are the $d$ orthonormal PCs. We consider the following probabilistic generative model for the data samples:

$$x_i = \bar{x} + \sum_{j=1}^{d} w_{ij}\,\sigma_j v_j + z_i, \qquad i = 1, \ldots, n,$$

where $\{w_i\}_{i=1}^n$ and $\{z_i\}_{i=1}^n$ are drawn i.i.d. from $\mathcal{N}(0, I_{d \times d})$ and $\mathcal{N}(0, \frac{\epsilon^2}{p} I_{p \times p})$, respectively. Also, $\{\sigma_i\}_{i=1}^d$ are scalar constants reflecting the energy of the data in each principal direction such that $\sigma_1 > \sigma_2 > \ldots > \sigma_d > 0$. The additive noise term $z_i$ allows for some error in our assumptions. Note that the underlying covariance matrix of the data is $C_{true} \triangleq \sum_{j=1}^{d} \sigma_j^2 v_j v_j^T$, and the signal-to-noise ratio is $\mathrm{SNR} = h/\epsilon^2$, where $h \triangleq \sum_{j=1}^{d} \sigma_j^2$. In fact, this model is related to the spiked covariance model (Johnstone, 2001) in which the data's covariance matrix is assumed to be a low-rank perturbation of the identity matrix.

We then introduce a very general class of random projections. Assume that matrices $\{R_i\}_{i=1}^n \subset \mathbb{R}^{p \times m}$, $m < p$, are formed by drawing each of their i.i.d. entries from a distribution whose mean $\mu_1$ is assumed to be zero and whose $k$-th order moments, $\mu_k$, are assumed finite for $k = 2, 4, 6, 8$. In particular, we will be interested in a popular class of sparse random projections, but our analysis will apply to any distribution satisfying these assumptions.

Each random projection $y_i \in \mathbb{R}^m$ is then obtained by taking inner products of the data sample $x_i \in \mathbb{R}^p$ with the random vectors comprising the columns of $R_i$, i.e. $y_i = R_i^T x_i$.
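The generative model and the projection step $y_i = R_i^T x_i$ can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names, the sizes, the Gaussian choice for the entries of $R_i$, and the reading of the noise covariance as $(\epsilon^2/p) I$ are all assumptions made for the example.

```python
import numpy as np

def sample_data(n, p, sigmas, center, eps, rng):
    """Draw x_i = center + sum_j w_ij * sigma_j * v_j + z_i; return (X, V)."""
    d = len(sigmas)
    # Orthonormal PCs: Q factor of a random Gaussian matrix (illustrative choice).
    V, _ = np.linalg.qr(rng.standard_normal((p, d)))
    W = rng.standard_normal((n, d))                      # w_i ~ N(0, I_d)
    Z = rng.standard_normal((n, p)) * eps / np.sqrt(p)   # z_i ~ N(0, (eps^2/p) I_p), assumed
    X = center + (W * np.asarray(sigmas)) @ V.T + Z
    return X, V

def project(X, m, rng):
    """Draw one p-by-m matrix R_i per sample; return (R, Y) with y_i = R_i^T x_i."""
    n, p = X.shape
    R = rng.standard_normal((n, p, m))   # any i.i.d. zero-mean entries fit the model
    Y = np.einsum("ipm,ip->im", R, X)
    return R, Y

rng = np.random.default_rng(0)
X, V = sample_data(n=200, p=50, sigmas=[3.0, 2.0], center=np.ones(50), eps=0.5, rng=rng)
R, Y = project(X, m=10, rng=rng)
```

Storing every $R_i$ explicitly, as done here, is only for clarity; in practice one would regenerate each matrix from a seed or keep it in a sparse format.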
The main goal of this paper is to provide theoretical guarantees for estimating the center and PCs of $\{x_i\}_{i=1}^n$ from these random projections.

4. Our Contributions

In this paper, we introduce two estimators for the center and underlying covariance matrix of data $\{x_i\}_{i=1}^n$ from sparse random projections $\{y_i = R_i^T x_i\}_{i=1}^n$. In typical PCA, the center is estimated using the empirical center $\bar{x}_{emp} = \frac{1}{n}\sum_{i=1}^{n} x_i$. PCs are then obtained by eigendecomposition of the empirical covariance matrix $C_{emp} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x}_{emp})(x_i - \bar{x}_{emp})^T$, which typically comes close to the true covariance matrix (Vershynin, 2012). Similar to typical PCA, we show that the empirical center and empirical covariance matrix of the new data samples $\{R_i y_i\}_{i=1}^n$ (scaled by a known factor) result in accurate estimates of the original center $\bar{x}$ and the true underlying covariance matrix $C_{true}$. (Note that $R_i y_i$ approximately represents a projection in $\mathbb{R}^p$ of $x_i$ onto the column space of $R_i$, but we have eliminated a computationally expensive matrix inverse here.) We will provide rigorous theoretical analysis for the performance of these estimators in terms of parameters such as the measurement ratio $m/p$, number of samples $n$, SNR, and moments $\mu_k$.

Our approach is quite general and we believe it can eventually be applicable to various data processing applications in which the data is very high-dimensional, streaming, or distributed. Particularly for the case of distributed data, we may need to adjust the set-up to ensure the random projections respect network topology, but we believe it could be done following the strategies in (Wang et al., 2012a;b).

We will be particularly interested in applying our general distribution results to the case of very sparse measurement matrices. Achlioptas (2001) first showed that, in the classic Johnson-Lindenstrauss result on pairwise distance preservation, the dense Gaussian projection matrices can be replaced with sparse projection matrices, where each entry is distributed on $\{-1, 0, 1\}$ with probabilities $\{\frac{1}{6}, \frac{2}{3}, \frac{1}{6}\}$, achieving a three-fold speedup in processing time. Li et al. (2006) then drew each entry from $\{-1, 0, 1\}$ with probabilities $\{\frac{1}{2s}, 1 - \frac{1}{s}, \frac{1}{2s}\}$, achieving a more significant $s$-fold speedup in processing time. In this paper, we refer to this second distribution as a sparse-Bernoulli distribution with sparsity parameter $s$.

Sparse random projections have been applied in many other applications to substantially reduce computational complexity and memory requirements (Omidiran and Wainwright, 2010; Zhang et al., 2012). Motivated by the success of these methods, we propose to recover PCs from sparse random projections of the data, in which each entry of $\{R_i\}_{i=1}^n$ is drawn i.i.d. from the sparse-Bernoulli distribution. In this case, each column of $\{R_i\}_{i=1}^n$ has $p/s$ nonzero entries, on average. This choice has the following properties simultaneously.

The computation cost for obtaining each projection is $O(\frac{mp}{s})$ and thus the cost to acquire/access/hold in memory the data needed for the algorithm is $O(\frac{mpn}{s})$. Specifically, we are interested in choosing $m$ and $s$ so that the compression factor $\gamma \triangleq \frac{m}{s} < 1$. In this case, our framework requires significantly less computation cost and storage space. First, the computation cost to acquire/access each data sample is $O(\gamma p)$, $\gamma < 1$, in contrast to the cost for acquiring each original data sample, $O(p)$. This results in a substantial cost reduction for the sensing process, e.g. for streaming data. Second, once acquired, observe that the projected data samples $\{R_i y_i\}_{i=1}^n \subset \mathbb{R}^p$ will be sparse, having at most $O(\gamma p)$ nonzero entries each. This results in a significant reduction, $O(\gamma p n)$ as opposed to $O(pn)$, in memory/storage requirements and/or communication cost, e.g. transferring distributed data to a central processing unit.

Given the sparse data matrix formed by $\{R_i y_i\}_{i=1}^n$, one can make use of efficient algorithms for performing (partial) SVD on very large sparse matrices, such as the Lanczos algorithm (Golub and Van Loan, 2012) and svds in MATLAB. In general, for a $p \times n$ matrix, the computational cost of SVD is $O(p^2 n)$. However, for large sparse matrices such as ours, the cost can be reduced to $O(\gamma p^2 n)$ (Lin and Gunopulos, 2003).

In the remainder of this paper, we will characterize the accuracy of the estimated center and PCs in terms of $m$, $p$, $n$, SNR, moments of the distribution (which for sparse-Bernoulli will scale with $s$), etc. As we will see, under certain conditions on the PCs, we may choose $\gamma$ as low as $\gamma \propto \frac{1}{p}$ for constant accuracy. Hence, assuming $n = O(p)$ samples, the memory/storage requirements for our approach can scale with $p$ in contrast to $p^2$ for standard algorithms that store the full data, and a similar factor of $p$ savings in computation can be achieved compared with regular SVD. Less aggressive savings will also be available for other PC types.

5. Main Results

We present the main results of our work in this section, with all proofs delayed to the supplemental material. Interestingly, we will see that the shape of the distribution for each entry of $\{R_i\}_{i=1}^n$ plays an important role in our results. The kurtosis, defined as $\kappa \triangleq \frac{\mu_4}{\mu_2^2} - 3$, is a measure of peakedness and heaviness of tail for a distribution. It can also be thought of as a measure of non-Gaussianity, since the kurtosis of the Gaussian distribution is zero. It turns out that the distribution's kurtosis is a key factor in determining PC estimation accuracy. For sparse-Bernoulli, the kurtosis increases with increasing sparsity parameter $s$.

5.1. Mean and Variance of Center Estimator

Theorem 1. Assume that $\{R_i\}_{i=1}^n$, $\{x_i\}_{i=1}^n$, $\{y_i\}_{i=1}^n$, $m$, $n$, and $\mu_2$ are as defined in Section 3, and define the $n$-sample center estimator

$$\hat{x}_n = \frac{1}{m\mu_2}\,\frac{1}{n}\sum_{i=1}^{n} R_i y_i.$$

Then, the mean of the estimator $\hat{x}_n$ is the true center of the original data $\bar{x}$, i.e. $E[\hat{x}_n] = \bar{x}$, for all $n$, including the base case $n = 1$. Furthermore, as $n \to \infty$, the estimator $\hat{x}_n$ converges to the true center: $\lim_{n \to \infty} \hat{x}_n = \bar{x}$.

We see that the empirical center of $\{R_i y_i\}_{i=1}^n$ is a (scaled) unbiased estimator for the true center $\bar{x}$. Note that this theorem does not depend on the number of projections $m$ or sparsity parameter $s$, and thus does not depend on $\gamma$, as a sufficiently high number of samples will compensate for unfavorable values of these parameters. We further note that, when $n \to \infty$, there is no difference between the Gaussian, very sparse, or other choices of random projections. This is consistent with the observation that random projection matrices consisting of i.i.d. entries must only be zero mean to preserve pairwise distances in the Johnson-Lindenstrauss theorem (Li et al., 2006).

Theorem 2. Assume that $\{R_i\}_{i=1}^n$, $\{x_i\}_{i=1}^n$, $\{y_i\}_{i=1}^n$, $m$, $n$, $p$, $\mu_2$, $h$, and SNR are as defined in Section 3, and kurtosis $\kappa$ is as defined above. Then, the variance of the unbiased center estimator $\hat{x}_n = \frac{1}{m\mu_2}\frac{1}{n}\sum_{i=1}^{n} R_i y_i$ is

$$\mathrm{Var}(\hat{x}_n) = \frac{1}{n}\left[\frac{1}{m/p}\left(\frac{\kappa + 1}{p} + 1\right)\|\bar{x}\|_2^2 + \left(\frac{1}{m/p}\left(\frac{\kappa + 1}{p} + 1\right) + 1\right) h \left(1 + \frac{1}{\mathrm{SNR}}\right)\right]. \qquad (5.1)$$

We see that as the number of samples $n$ and measurement ratio $m/p$ increase, the variance of this estimator decreases at rate $\frac{1}{n}$ and close to $\frac{1}{m/p}$. Interestingly, the power of the signal, i.e. $h = \sum_{j=1}^{d} \sigma_j^2$, works against the accuracy of the estimator. The intuition for this is that, for the center estimation problem, it is desirable to have all the data samples close to the center, which happens for small $h$. For sparse random projections, we observe that the kurtosis is $\kappa = s - 3$ and thus $\frac{\kappa}{p} \approx \frac{s}{p}$. Hence, variance scales with increasing sparsity, although sufficient data samples $n$ are enough to combat this effect. Indeed, when $s > p$, the variance increases heavily since many of the random vectors are zero, and thus the corresponding projections cannot capture any information about the original data.
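To make the center estimator of Theorem 1 concrete, the sketch below draws sparse-Bernoulli matrices and forms $\hat{x}_n = \frac{1}{m\mu_2}\frac{1}{n}\sum_i R_i y_i$, using $\mu_2 = 1/s$ for this distribution. The helper names, the problem sizes, and the particular center and PC are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

def sparse_bernoulli(shape, s, rng):
    """Entries in {-1, 0, +1} with probabilities {1/(2s), 1 - 1/s, 1/(2s)}."""
    return rng.choice([-1.0, 0.0, 1.0], size=shape,
                      p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

def center_estimate(R, Y, mu2):
    """x_hat = (1 / (m * mu2)) * mean_i(R_i y_i)."""
    n, p, m = R.shape
    back = np.einsum("ipm,im->ip", R, Y)   # row i holds R_i y_i
    return back.mean(axis=0) / (m * mu2)

rng = np.random.default_rng(1)
n, p, m, s = 5000, 50, 10, 3               # hypothetical sizes
center = np.ones(p)
v = np.ones(p) / np.sqrt(p)                # one smooth PC
X = (center
     + rng.standard_normal((n, 1)) * v                   # signal along v
     + 0.1 * rng.standard_normal((n, p)) / np.sqrt(p))   # small isotropic noise

R = sparse_bernoulli((n, p, m), s, rng)
Y = np.einsum("ipm,ip->im", R, X)          # y_i = R_i^T x_i
x_hat = center_estimate(R, Y, mu2=1.0 / s)
rel_err = np.linalg.norm(x_hat - center) / np.linalg.norm(center)
```

With these sizes the relative error is a few percent; increasing `n` shrinks it at the $1/\sqrt{n}$ rate implied by the $1/n$ variance decay.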
Overall, this result shows an explicit tradeoff between reducing $n$ or increasing $s$ to reduce memory/computation and the variance of the resulting estimator. Finally, given this mean and variance, probabilistic error bounds can be immediately obtained via Chebyshev, Bernstein, etc. inequalities.

5.2. Mean and Variance of Covariance Estimator

Theorem 3. Assume that $\{R_i\}_{i=1}^n$, $\{x_i\}_{i=1}^n$, $\{y_i\}_{i=1}^n$, $m$, $n$, $p$, $\mu_2$, $h$, $\epsilon$, and $C_{true}$ are as defined in Section 3, and $\kappa$ is the kurtosis. Moreover, assume that $\{x_i\}_{i=1}^n$ are centered at $\bar{x} = 0$. Define the $n$-sample covariance estimator

$$\hat{C}_n = \frac{1}{(m^2 + m)\mu_2^2}\,\frac{1}{n}\sum_{i=1}^{n} R_i y_i y_i^T R_i^T.$$

Then, for all $n$, the mean of this estimator is

$$E[\hat{C}_n] = \hat{C}_{true} + E,$$

where $\hat{C}_{true} \triangleq C_{true} + \alpha I_{p \times p}$,

$$\alpha \triangleq h\left\{\frac{1}{m+1} + \left(\frac{\kappa + p}{p(m+1)} + \frac{1}{p}\right)\frac{1}{\mathrm{SNR}}\right\},$$

and $E \triangleq \frac{\kappa}{m+1}\sum_{j=1}^{d} \sigma_j^2\,\mathrm{diag}(v_j v_j^T)$, where $\mathrm{diag}(A)$ denotes the matrix formed by zeroing all but the diagonal entries of $A$. Furthermore, let $C_\infty \triangleq \hat{C}_{true} + E$. Then, as $n \to \infty$, the estimator $\hat{C}_n$ converges to $C_\infty$: $\lim_{n \to \infty} \hat{C}_n = C_\infty$.

We observe that the limit of the estimator $\hat{C}_n$ has two components. The first, $\hat{C}_{true}$, has the same eigenvectors as $C_{true}$ with slightly perturbed eigenvalues ($\alpha$ tends to be very small in high dimensions), and the other, $E$, is an error perturbation term. Both $\alpha$ and $E$ scale with the kurtosis, reflecting the necessary tradeoff between increasing sparsity (decreasing memory/computation) and maintaining accuracy.

We first consider a simple example to gain some intuition for this theorem. A set of data samples $\{x_i\}_{i=1}^n \subset \mathbb{R}^p$ is generated from one PC. We also generate the measurement matrices $\{R_i\}_{i=1}^n \subset \mathbb{R}^{p \times m}$, at a fixed measurement ratio $m/p$, with i.i.d. entries both for the Gaussian distribution and the sparse-Bernoulli distribution for various values of the sparsity parameter $s$. In Fig. 5.1, we view two dimensions (the original PC's and one other) of the data $\{x_i\}_{i=1}^n$ and the scaled projected data $\frac{1}{\sqrt{(m^2+m)\mu_2^2}}\{R_i y_i\}_{i=1}^n$, represented by blue dots and red circles respectively. We see that the projected data samples are scattered somewhat into other directions for all four cases. However, the amount of scattered energy for the Gaussian and for sparse-Bernoulli with $s = 3$ is quite small. This can be easily verified from the fact that the amount of perturbation depends on the kurtosis, and for both cases the kurtosis is $\kappa = 0$. As we increase the parameter $s$, the kurtosis $\kappa = s - 3$ gets larger, and this is consistent with the observation that the projected data samples get more scattered into other directions. We also note the similarity of our findings to the result of (Li et al., 2006) that the variance of the pairwise distances in Johnson-Lindenstrauss depends on the kurtosis of the distribution being used for random projections. Despite the perturbation, in all cases, the PC can be recovered accurately. Note also that scaling the projected data points by $1/\sqrt{(m^2+m)\mu_2^2}$ preserves the energy in the direction of the PC (i.e. the eigenvalue).

In Theorem 3, we see that $C_{true}$ and $\hat{C}_{true}$ have the same set of eigenvectors, with the eigenvalues of $C_{true}$ increased by $\alpha = h\left\{\frac{1}{m+1} + \left(\frac{\kappa + p}{p(m+1)} + \frac{1}{p}\right)\frac{1}{\mathrm{SNR}}\right\}$. Thus, $\alpha$ is a decreasing function of $p$, $m/p$ and SNR, and in particular goes to 0 as $p \to \infty$ for constant projection ratio $m/p$. This is illustrated in Fig. 5.2. Thus, surprisingly, in the high-dimensional regime, the amount of perturbation of eigenvalues becomes increasingly negligible even for small measurement ratios.

Now, let's examine the error matrix $E$.
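As a numerical companion to Theorem 3, the following sketch forms the covariance estimator from sparse-Bernoulli projections of single-PC data and compares its top eigenvector with the true PC. With $s = 3$ the kurtosis is zero, so $E = 0$ and the expected trace of $\hat{C}_n$ is $h + p\alpha$. The sizes, the seed, and the noise scaling $\epsilon^2/p$ are assumptions made for illustration.

```python
import numpy as np

def covariance_estimate(R, Y, mu2):
    """C_hat = (1 / ((m^2 + m) * mu2^2)) * mean_i (R_i y_i)(R_i y_i)^T."""
    n, p, m = R.shape
    U = np.einsum("ipm,im->ip", R, Y)      # row i holds R_i y_i
    return (U.T @ U) / (n * (m**2 + m) * mu2**2)

rng = np.random.default_rng(2)
n, p, m, s = 4000, 40, 12, 3               # hypothetical sizes; s = 3 gives kurtosis 0
h, eps2 = 1.0, 0.2                         # signal and noise energy, so SNR = 5
v = rng.standard_normal(p)
v /= np.linalg.norm(v)                     # the single true PC

# Centered data: x_i = w_i * sqrt(h) * v + z_i, with z_i ~ N(0, (eps2 / p) I) assumed.
X = (np.sqrt(h) * rng.standard_normal((n, 1)) * v
     + np.sqrt(eps2 / p) * rng.standard_normal((n, p)))

R = rng.choice([-1.0, 0.0, 1.0], size=(n, p, m),
               p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
Y = np.einsum("ipm,ip->im", R, X)
C_hat = covariance_estimate(R, Y, mu2=1.0 / s)

eigvals, eigvecs = np.linalg.eigh(C_hat)   # eigenvalues in ascending order
alignment = abs(eigvecs[:, -1] @ v)        # |<estimated PC, true PC>|

kappa = s - 3
alpha = h * (1 / (m + 1) + ((kappa + p) / (p * (m + 1)) + 1 / p) / (h / eps2))
expected_trace = h + p * alpha             # trace of C_true + alpha * I (E = 0 here)
```

Comparing `np.trace(C_hat)` against `expected_trace` gives a quick sanity check on the eigenvalue offset $\alpha$, while `alignment` close to 1 indicates that the PC direction survives the projection.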
We observe that $E$ can be viewed as representing a bias of the estimated PCs towards the nearest canonical basis vectors; it stems from anisotropy in the distribution for $R_i$ when this is non-Gaussian (note $\kappa = 0$, and thus $E = 0$, for the Gaussian case). In later sections, we will use the 2-norm of $E$, $\|E\|_2$, to bound the angle between the estimated and true PCs. Indeed, we find that, for constant $\tau \triangleq \|E\|_2 / h$, the same angular PC estimation error is achieved. We now study $\|E\|_2$, leading to useful observations, for several types of PCs. (An expanded discussion with full derivations is included in the supplementary materials.)

(1) Smooth PCs: It has frequently been observed that sparse-Bernoulli random projections are most effective on vectors that are smooth (Ailon and Chazelle, 2009), meaning that their maximum entry is of size $O(\frac{1}{\sqrt{p}})$. Large images, videos, and other natural signals with distributed energy are obvious examples of this type. (Other signals are often preconditioned to be smooth via multiplication with a Hadamard conditioning matrix.) We may easily observe then that $\|E\|_2 \le \frac{\kappa}{m+1}\, h\, \mu_{max}^2$, or $\tau \le \frac{\kappa}{m+1}\, \mu_{max}^2$, where $\mu_{max}$ is the mutual coherence (Elad, 2007) between the PCs and the canonical basis, and we note $\frac{\kappa}{m+1} \approx \frac{1}{\gamma}$. As we will see in Section 5.3, we will want to keep $\tau$ small enough to guarantee a certain fixed angular error. In fact, this can be satisfied by requiring $\gamma \ge C(\theta_0)\, \mu_{max}^2$, where $C(\theta_0)$ is a constant depending on the error. Hence, for smooth PCs, we need only have $\gamma \propto \frac{1}{p}$, reducing memory and computation by a rather remarkable factor of $p$.

[Figure 5.2. Variation of the parameter $\alpha/h$ for two values of the kurtosis $\kappa$, panels (a) and (b), varying $p$ and measurement ratio $m/p$, and fixed SNR = 5. Horizontal axis: measurement ratio $m/p$; vertical axis: $\alpha/h$; one curve per value of $p$.]

(2) All Sparse PCs: In the case of all sparse PCs, we may write $E$ as $E = \frac{\kappa}{m+1} C_{true} + E'$, where $\|E'\|_2 \le$ […] and $\mu_{min} \triangleq \min_{1 \le i \le d} \max_{1 \le j \le p} |\langle v_i, e_j \rangle|$ represents the closeness of the PCs to the canonical basis $\{e_j\}_{j=1}^p$. Thus, unlike for other sparse-Bernoulli applications, we find that sparse PCs can still be recovered very well here, although the eigenvalues may be heavily scaled by the known factor $1 + \frac{\kappa}{m+1}$. Doing this, and taking $E'$ as the resulting error term, we can let $\gamma \propto$ […] (a quantity depending on $\mu_{min}$) to maintain constant $\tau$.

(3) Neither Sparse nor Smooth PCs: In this case, we can still apply the analysis for case (1), just with a larger $\mu_{max}$ and less aggressive memory/computation savings.

(4) Mixture of PC Types: In this case, we may split $E$ into two error matrices, associated with each of the sparse and non-sparse PCs. Recovery of the $d$-dimensional PC subspace still performs well here. However, if the eigenvalues $\{\sigma_j^2\}_{j=1}^d$ do not decay sufficiently fast, scaling of the eigenvalues for the sparse PCs may reorder the individual components. Please see the supplementary material for further discussions and simulations.

Theorem 4. Assume that $\{R_i\}_{i=1}^n$, $\{x_i\}_{i=1}^n$, $\{y_i\}_{i=1}^n$, $m$, $n$, $p$, $\mu_k$, $h$, and SNR are as defined in Section 3. Consider the covariance matrix estimator $\hat{C}_n = \frac{1}{(m^2+m)\mu_2^2}\frac{1}{n}\sum_{i=1}^{n} R_i y_i y_i^T R_i^T$. Then, the deviation of our $n$-sample estimator from its mean value is upper bounded:

$$E\left[\|\hat{C}_n - C_\infty\|_F^2\right] \le \frac{1}{n}\left(\,[\ldots]\,\right) \qquad (5.2)$$

where […]. Note that the bound has various terms that scale with $p$, $m/p$, and the higher order moments $\mu_8/\mu_2^4$, $\mu_6/\mu_2^3$, and $\mu_4/\mu_2^2$.

[Figure 5.1. Accurate recovery of the PC under random projections using both Gaussian and sparse random projection matrices for various values of $s$. Panels: (a) Gaussian, (b) $s = 3$, (c) and (d) two larger values of $s$; both axes are labeled "Dimension". In each panel, the $n$ data samples are uniformly distributed on a line in $\mathbb{R}^p$, and $\{R_i\}_{i=1}^n \subset \mathbb{R}^{p \times m}$ are generated with i.i.d. entries drawn from (a) $\mathcal{N}(0, 1)$ and (b, c, d) the sparse-Bernoulli distribution. In each panel, we view two dimensions (the original PC's and one other) of the data $\{x_i\}_{i=1}^n$ (blue dots) and the scaled projected data $\frac{1}{\sqrt{(m^2+m)\mu_2^2}}\{R_i R_i^T x_i\}_{i=1}^n$ (red circles). We observe that, in all cases, the projected data samples are symmetrically distributed around the PC, and the inner product magnitude between the PC estimated from the projected data and the true PC is at least 0.998.]

We see that as the number of data samples $n$ increases, the variance decreases at rate $\frac{1}{n}$, converging quickly to the limit. Moreover, the variance of our estimator is a decreasing function of the measurement ratio $m/p$ and SNR. We further note that the moment-dependent part of the bound gives us important information about the effect of the tails of the distribution on the convergence rate of the covariance estimator. More precisely, for sparse random projections, we see that $\mu_8/\mu_2^4 = s^3$, $\mu_6/\mu_2^3 = s^2$, and $\mu_4/\mu_2^2 = s$. Hence, for a fixed number of data samples, decreasing the compression factor $\gamma$ leads to an increase of the variance and a loss in accuracy, as we will see in Section 6. This is as we would expect, since there is an inherent tradeoff between saving computation and memory and the accuracy.
However, characterizing this tradeoff allows $\gamma$ to be chosen in an informed way for large datasets.

5.3. Memory, Computation and PC Accuracy Tradeoffs

We now use the covariance matrix estimator results to bound the error of its eigenvalues and eigenvectors, using related results from matrix perturbation theory. First, note that using the variance of our estimator (Eq. 5.2) in the Chebyshev inequality yields $\|\hat{C}_n - C_\infty\|_F \le \varepsilon$ with probability at least $1 - \frac{1}{n\varepsilon^2}(\,[\ldots]\,)$, where the bracketed term is that of Eq. 5.2. Hence, since $\hat{C}_n - \hat{C}_{true} = (\hat{C}_n - C_\infty) + E$,

$$\|\hat{C}_n - \hat{C}_{true}\|_2 \le \|\hat{C}_n - C_\infty\|_F + \|E\|_2 \le \varepsilon + \|E\|_2 \qquad (5.3)$$

with the same probability. In fact, Eq. 5.3 can be used to characterize tradeoffs between memory, computation, and PC estimation accuracy (as an angle between estimated subspaces) in terms of our parameters $n$, $m/p$, etc. For simplicity in what follows and to help keep the intuition clear, we focus on the case where the number of samples $n \to \infty$ and $\varepsilon \to 0$ in Eq. 5.3 above. However, it is trivial to adjust these results to the case of finite $n$ by including a nonzero $\varepsilon$ in the derivations that follow.

For illustrative purposes, we start by analyzing the case of a single PC and use the following Lemma. In the following, $\lambda(A)$ and $\lambda_i(A)$ denote the set of all eigenvalues and the $i$-th eigenvalue of $A$, respectively.

Lemma 5. (Hogben, 2006; Davis and Kahan, 1970) Suppose $A$ is a real symmetric matrix and $\tilde{A} = A + E$ is the perturbed matrix. Assume that $(\tilde{\lambda}, \tilde{v})$ is an exact eigenpair of $\tilde{A}$ where $\|\tilde{v}\|_2 = 1$. Then
(a) $|\tilde{\lambda} - \lambda| \le \|E\|_2$ for some eigenvalue $\lambda$ of $A$.
(b) Let $\lambda$ be the closest eigenvalue of $A$ to $\tilde{\lambda}$ and $v$ be its associated eigenvector with $\|v\|_2 = 1$, and let $\delta = \min_{\beta \in \lambda(A),\, \beta \ne \lambda} |\tilde{\lambda} - \beta|$. If $\delta > 0$, then

$$\sin \angle(\tilde{v}, v) \le \frac{\|E\|_2}{\delta} \qquad (5.4)$$

where $\angle(\tilde{v}, v)$ denotes the canonical angle between the two eigenvectors.

We will use this Lemma to bound the angle between the PC estimate from $\hat{C}_n$ and the true PC in the single PC case. Since $C_{true}$ has only one eigenpair $(h, v)$ with nonzero eigenvalue, $\hat{C}_{true}$ has an eigenpair $(h + \alpha, v)$ and $\lambda_i(\hat{C}_{true}) = \alpha$, $i = 2, \ldots, p$. From Lemma 5, we see that the largest eigenvalue of $\hat{C}_n$ satisfies $|\lambda_1(\hat{C}_n) - (h + \alpha)| \le \|E\|_2 = \tau h$. We find the parameter $\delta$:

$$\delta = \min_{i = 2, \ldots, p} \left|\lambda_1(\hat{C}_n) - \lambda_i(\hat{C}_{true})\right| = \left|\lambda_1(\hat{C}_n) - \alpha\right| \ge h - \|E\|_2 = h(1 - \tau). \qquad (5.5)$$

We then get the following tradeoff between the accuracy of the estimated eigenvector and the parameters of our model:

$$\sin \angle(\tilde{v}, v) \le \frac{\tau}{1 - \tau}. \qquad (5.6)$$

This equation allows us to characterize the statistical tradeoff between the sparsity parameter $s$ and the accuracy of the estimated PC. Observe that this is the same $\tau = \|E\|_2 / h$ that we discussed in Section 5.2. To ensure a fixed maximum angular error $\theta_0$ for PC estimation, i.e. $\sin \angle(\tilde{v}, v) \le \sin \theta_0$, we should choose $\gamma$ such that $\tau \le \frac{\sin \theta_0}{1 + \sin \theta_0}$. For smooth PCs, we may satisfy this by choosing $\gamma \ge C(\theta_0)\, \mu_{max}^2$ for $C(\theta_0) \triangleq \frac{1 + \sin \theta_0}{\sin \theta_0}$, which gives $\gamma = O(\frac{1}{p})$. Hence, the memory/storage requirements of our method can scale with $p$ in contrast to standard algorithms that scale with $p^2$, while the computational complexity of SVD can scale with $p^2$ as opposed to $p^3$. Although the smooth case is of special interest, less aggressive, but still substantial, savings are also available for other PC types.

For the general case of $d$ PCs, we consider the eigendecomposition of the perturbed matrix $\hat{C}_n$ and of $\hat{C}_{true}$:

$$\hat{C}_{true} = \begin{bmatrix} V & V_\perp \end{bmatrix} \begin{bmatrix} S & 0 \\ 0 & S_\perp \end{bmatrix} \begin{bmatrix} V^T \\ V_\perp^T \end{bmatrix}, \qquad \hat{C}_n = \begin{bmatrix} \tilde{V} & \tilde{V}_\perp \end{bmatrix} \begin{bmatrix} \tilde{S} & 0 \\ 0 & \tilde{S}_\perp \end{bmatrix} \begin{bmatrix} \tilde{V}^T \\ \tilde{V}_\perp^T \end{bmatrix}.$$

The distance between each perturbed eigenvalue and the corresponding original eigenvalue depends on the amount of perturbation. We now have that $|\lambda_j(\hat{C}_n) - \lambda_j(\hat{C}_{true})| \le \|E\|_2 = \tau h$ for all $j = 1, \ldots, d$. Moreover, it is possible to quantify the rotation of eigenvectors using the notion of the canonical angle matrix defined in (Davis and Kahan, 1970). Note that $V, \tilde{V} \in \mathbb{R}^{p \times d}$ are the first $d$ (true and estimated) PCs. The canonical angles between them are defined as $\theta_i = \arccos \zeta_i$, where $\{\zeta_i\}_{i=1}^d$ are the singular values of $(\tilde{V}^T \tilde{V})^{-1/2}\, \tilde{V}^T V\, (V^T V)^{-1/2}$, in our case just $\tilde{V}^T V$. The canonical angle matrix is then defined as $\Theta(\tilde{V}, V) = \mathrm{diag}(\theta_1, \ldots, \theta_d)$.
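The canonical angles just defined can be computed directly from the singular values of $\tilde{V}^T V$ when both matrices have orthonormal columns. The helper below is a minimal sketch; the function name and the toy subspaces are chosen here for illustration.

```python
import numpy as np

def canonical_angles(V_est, V_true):
    """Canonical angles (radians) between the column spaces of two
    p-by-d matrices with orthonormal columns, in ascending order."""
    sing_vals = np.linalg.svd(V_est.T @ V_true, compute_uv=False)
    # Clip guards against round-off pushing a singular value slightly above 1.
    return np.arccos(np.clip(sing_vals, 0.0, 1.0))

I5 = np.eye(5)
same = canonical_angles(I5[:, :2], I5[:, :2])     # identical subspaces
orth = canonical_angles(I5[:, :2], I5[:, 2:4])    # mutually orthogonal subspaces

t = 0.3                                           # rotate e1 towards e2 by t radians
v_rot = (np.cos(t) * I5[:, 0] + np.sin(t) * I5[:, 1]).reshape(-1, 1)
rotated = canonical_angles(v_rot, I5[:, :1])
```

The largest entry of the returned array is the quantity controlled by the sine bounds in this section.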
Based on te results given in (Davis and Kaan, 97; Gilbert et al., ): sin ( e V, V ) apple kek Noralized Center Error Noralized Singular Value Error n=*p,γ =/5 n=*p, γ=/ n=3*p,γ=/5 n=3*p, γ=/ 6 8 Diension (p) (a) γ=/5 γ=/ γ=/ 6 8 Diension (p) Inner Product Magnitude γ=/5 γ=/ γ=/ 6 8 Diension (p) (c) (d) Figure 6.. Results for syntetic data: (a) noralized estiation error for te center for varying n and, (b) agnitude of te inner product between te estiated and true PC for varying, (c) noralized estiation error for for varying, and (d) coputation tie to perfor te SVD for te original vs. randoly projected data for varying. were,in appleiappled,applejapplep d (S ) fs ii >. Using te sae logic as in 5.5, we find d jj. Hence, Tie in sec. (b) γ=/5 γ=/ γ=/ SVD 6 8 Diension (p) coosing s,, etc. suc tat satisfies < d, te axiu canonical angle between e V and V satisfies sin i apple d, i =,...,d. (5.7) Tis is te sae for we saw in Eq Hence, for soot PCs, we ay again coose / p. 6. Experiental Results In tis section, we exaine te tradeoffs between eory, coputation, and accuracy for te sparse rando projections approac on bot syntetic and real-world datasets. First, we syntetically generate saples {x i } n i= Rp distributed along one PC wit =. Eac entry of te center and PC is drawn fro te unifor distribution on [, ) and [, ), respectively. Te PC is ten noralized to ave unit `-nor. We consider a relatively noisy situation wit SNR=. We ten estiate te center of te original data fro te sparse rando projections, were /p=., for varying n and copression factors. Our results are averaged over independent trials. Fig. 6.(a) sows te accuracy for te estiated center, were te error is te distance between te estiated and te true center noralized by te true center s nor. As expected, wen n or diension p increase, te copression factor can be tuned to acieve a substantial reduction of storage space wile obtaining accurate estiates. Tis is desirable for igdiensional data strea processing. 
We then fix n=p, and plot the inner product magnitude between the estimated and true PC in Fig. 6.(b) and the computation time in Fig. 6.(d) for varying $\gamma$. We observe that, despite saving nearly two orders of magnitude in computation time and also in memory (note = 5, , ) compared to PCA on the full data, the PC is well-estimated. Moreover, the approach remains increasingly effective for higher dimensions, which is of crucial importance for modern data processing applications. We further note that, as the dimension increases, we can decrease the compression factor while still achieving a desired performance. For example, $\gamma$ = for p= 3 and $\gamma$ = for p= have almost the same accuracy. This is consistent with the observation from before. / p

We also plot the estimation error for the singular value in Fig. 6.(c). The error is the distance between the singular value obtained by performing SVD on $\{R_i y_i\}_{i=1}^{n}$ and on the original data $\{x_i\}_{i=1}^{n}$, normalized by the latter value.

Figure 6.. Results for the MNIST dataset. Our proposed approach is compared with two methods: (1) performing MATLAB's svds on the full original data, (2) BSOI (Mitliagkas et al., 2013). Plot of (a) performance accuracy based on the explained variance and (b) computation time for performing SVD. We see that our approach performs as well as SVD on the original data and outperforms BSOI with significantly less computation time. [Panels plot Explained Variance and Time in sec. (log scale) against Number of PCs.]

Finally, we consider the MNIST dataset to see a real-world application outside the spiked covariance model. This dataset contains 70,000 samples of handwritten digits, which we have resized to pixels. Hence, we have 70,000 samples in R 6. To evaluate the performance of our method, we use the explained variance described in (Mitliagkas et al., 2013). Given estimates of $d$ PCs $\tilde{V} \in \mathbb{R}^{p \times d}$ and the data matrix $X$, the fraction of explained variance is defined as $\mathrm{tr}(\tilde{V}^T X X^T \tilde{V}) / \mathrm{tr}(X X^T)$.
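The explained-variance metric defined above is straightforward to evaluate. The following NumPy sketch assumes that $X$ is stored as a $p \times n$ array with samples as columns and that the estimated PCs have orthonormal columns; the function name `explained_variance` and the random test data are illustrative, not taken from the paper.

```python
import numpy as np

def explained_variance(V_est, X):
    """Fraction tr(V^T X X^T V) / tr(X X^T) for X of shape (p, n) and
    V_est of shape (p, d), assumed to have orthonormal columns."""
    # tr(V^T X X^T V) = ||V^T X||_F^2 and tr(X X^T) = ||X||_F^2, so the
    # p x p matrix X X^T never has to be formed explicitly.
    num = np.linalg.norm(V_est.T @ X, "fro") ** 2
    den = np.linalg.norm(X, "fro") ** 2
    return num / den

rng = np.random.default_rng(0)
p, n, d = 30, 200, 5  # illustrative sizes (assumption)
X = rng.standard_normal((p, n))

# Reference PCs from a plain SVD of the data matrix.
U, svals, _ = np.linalg.svd(X, full_matrices=False)
frac = explained_variance(U[:, :d], X)
print(f"explained variance with {d} PCs: {frac:.3f}")
```

For the exact top-$d$ left singular vectors, this fraction equals the sum of the $d$ largest squared singular values divided by the sum of all squared singular values, which gives a convenient sanity check when evaluating approximate PC estimates.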
We compare the performance of our approach with (1) performing SVD (using MATLAB svds) on the original data that are fully acquired and stored, and as a useful point of comparison, with (2) the online algorithm Block-Stochastic Orthogonal Iteration (BSOI) (Mitliagkas et al., 2013), where the data samples are fully acquired but not stored. We show the results in Fig. 6. for the measurement ratio $m/p$=.. In terms of accuracy, our approach performs about as well as SVD on the original data, and has slightly better performance compared to BSOI. The sparse random projections result in a significant reduction of computational complexity, with one order and two orders of magnitude speedup compared to the original SVD and BSOI, respectively. In terms of memory requirements, 3 MB is needed to store the original data. However, the required memory for our framework is MB for $\gamma$ = and MB for $\gamma$ = . The projected data thus can easily reside in the main memory.

Moreover, we have compared our method with the fast randomized SVD algorithm in (Halko et al., 2011a). The estimation accuracy of this method is very close to SVD on the original data, and the computation time is about . seconds, which is slightly less than the computation time of our method. This is as we would expect, since fast randomized SVD is designed specifically for low computational complexity. However, (Halko et al., 2011a) is a full data method, meaning that it is assumed that the full data is available for computation and does not require time or cost to access. Our approach performs approximately as well in similar computation time while also allowing a reduction in memory (or data access or data communication costs) by a factor of , in this case and . This can be a significant advantage in the case where data is stored in a large database system or distributed network.

This example indicates that our approach results in a significant simultaneous reduction of memory and/or computational cost with little loss in accuracy.

7. Conclusions

We have presented a memory- and computation-efficient approach for estimation of PCs via very sparse random projections.
This approach substantially reduces both the memory and the computation required for PC estimation, while still providing high accuracy. More importantly, it allows us to rigorously analyze each of memory, computation, and accuracy in terms of the sparsity of the projection, for various PC models. Thus, we have been able to give provable tradeoffs between memory, computation, and accuracy. Furthermore, a user of this approach could even use the sparsity of the projections to tune to any desired point on this three-way tradeoff. We believe that this approach could be valuable for various important modern data processing applications such as massive databases, distributed networks, and high-dimensional data stream processing, although we have not focused on the specific details of these in favor of more theoretical analysis. Indeed, we observe that our approach performs well in initial practical simulations, e.g. for the MNIST dataset, with large reduction of both memory and computation without sacrificing accuracy.

Acknowledgements: This material is based upon work supported by the National Science Foundation under Grant CCF-7775.

References

D. Achlioptas. Database-friendly random projections. In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 274-281, 2001.

N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39:302-322, 2009.

R. Arora, A. Cotter, K. Livescu, and N. Srebro. Stochastic optimization for PCA and PLS. In 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2012.

R. Arora, A. Cotter, and N. Srebro. Stochastic optimization of PCA with capped MSG. In NIPS, pages 1815-1823, 2013.

M. Balcan, S. Ehrlich, and Y. Liang. Distributed k-means and k-median clustering on general topologies. In NIPS, pages 1995-2003, 2013.

V. Chandrasekaran and M. Jordan. Computational and statistical tradeoffs via convex relaxation. Proc. of the National Academy of Sciences, 110:E1181-E1190, 2013.

Y. Chen, Y. Chi, and A. Goldsmith. Exact and stable covariance estimation from quadratic sampling via convex programming. arXiv preprint arXiv:1310.0807, 2013.

C. Davis and W. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM J. on Numerical Analysis, 7:1-46, 1970.

D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52:1289-1306, 2006.

M. Elad. Optimized projections for compressed sensing. IEEE Trans. SP, 55:5695-5702, 2007.

J. Fowler. Compressive-projection principal component analysis. IEEE Trans. on Image Process., pages 2230-2242, 2009.

A. Gilbert, J. Park, and M. Wakin. Sketched SVD: Recovering spectral features from compressive measurements. arXiv preprint arXiv:1211.0361, 2012.

G. H. Golub and C. F. Van Loan. Matrix computations, volume 3. JHU Press, 2012.

N. Halko, P. Martinsson, Y. Shkolnisky, and M. Tygert. An algorithm for the principal component analysis of large data sets. SIAM Journal on Scientific Computing, 33(5):2580-2594, 2011a.

N. Halko, P. Martinsson, and J. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011b.

L. Hogben. Handbook of linear algebra. CRC Press, 2006.

P. Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM), 53(3):307-323, 2006.

I. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 29(2):295-327, 2001.

P. Li, T. Hastie, and K. Church. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 287-296, 2006.

J. Lin and D. Gunopulos. Dimensionality reduction by random projection and latent semantic indexing. In Proceedings of the Text Mining Workshop, at the 3rd SIAM International Conference on Data Mining, 2003.

I. Mitliagkas, C. Caramanis, and P. Jain. Memory limited, streaming PCA. In NIPS, 2013.

S. Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005.

D. Omidiran and M. Wainwright. High-dimensional variable selection with sparse random projections: measurement sparsity and statistical efficiency. The Journal of Machine Learning Research, 11:2361-2386, 2010.

H. Qi and S. Hughes. Invariance of principal components under low-dimensional random projection of the data. In ICIP, pages 937-940, 2012.

R. Vershynin. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 25(3):655-686, 2012.

M. Wang, W. Xu, E. Mallada, and A. Tang. Sparse recovery with graph constraints: Fundamental limits and measurement construction. In IEEE Proceedings INFOCOM, 2012a.

M. Wang, W. Xu, E. Mallada, and A. Tang. Sparse recovery with graph constraints. CoRR, abs/1207.2829, 2012b.

K. Zhang, L. Zhang, and M. Yang. Real-time compressive tracking. In Computer Vision ECCV. Springer, 2012.