Robust and Scalable Algorithms for Big Data Analytics
Georgios B. Giannakis
Acknowledgment: Drs. G. Mateos, K. Slavakis, G. Leus, and M. Mardani
Arlington, VA, USA, March 22, 2013
Roadmap
n Robust principal component analysis
Ø Linear low-rank models and sparse outliers
n Scalable algorithms for big network data analytics
Ø (De-)centralized and online rank minimization
n Robust sparse embedding via dictionary learning
Ø Nonlinear low-rank models
Ø Data-adaptive compressed sensing
n Concluding remarks
Principal component analysis
n Motivation: (statistical) learning from high-dimensional data, e.g., DNA microarrays, traffic surveillance video
n Principal component analysis (PCA) [Pearson 1901]
Ø Extraction of low(est)-dimensional structure
Ø Applications: source (de)coding, anomaly identification, recommender systems
Ø PCA is non-robust to outliers [Huber 81], [Jolliffe 86], [Wright et al 09-12]
Objective: robustify PCA by controlling outlier sparsity
PCA formulations
n Training data: $\{\mathbf{y}_n\}_{n=1}^N \subset \mathbb{R}^p$ (assumed centered)
n Minimum reconstruction error: $\min_{\mathbf{U},\{\mathbf{s}_n\}} \sum_{n=1}^N \|\mathbf{y}_n - \mathbf{U}\mathbf{s}_n\|_2^2$ s.t. $\mathbf{U}^\top\mathbf{U} = \mathbf{I}_\rho$
Ø Compression operator: $\mathbf{s}_n = \mathbf{U}^\top \mathbf{y}_n \in \mathbb{R}^\rho$, $\rho \le p$
Ø Reconstruction operator: $\hat{\mathbf{y}}_n = \mathbf{U}\mathbf{s}_n$
n Component analysis model: $\mathbf{y}_n = \mathbf{U}\mathbf{s}_n + \mathbf{e}_n$
Solution: columns of $\mathbf{U}$ are the $\rho$ dominant eigenvectors of the sample covariance (equivalently, dominant left singular vectors of the data matrix); see the sketch below
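A minimal numpy sketch of this solution, assuming centered data stored column-wise; the truncated SVD supplies both the compression and reconstruction operators.

```python
import numpy as np

def pca(Y, rho):
    """PCA via truncated SVD: Y is p x N with centered data as columns.
    Returns the basis U (p x rho), components S (rho x N), and fit Y_hat."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    U = U[:, :rho]        # rho dominant left singular vectors
    S = U.T @ Y           # compression: s_n = U' y_n
    Y_hat = U @ S         # reconstruction: y_hat_n = U s_n
    return U, S, Y_hat
```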
Robustifying PCA
n Outlier variables: $\mathbf{o}_n \neq \mathbf{0}$ if $\mathbf{y}_n$ is an outlier, $\mathbf{o}_n = \mathbf{0}$ otherwise
Ø Nominal data obey $\mathbf{y}_n = \mathbf{U}\mathbf{s}_n + \mathbf{e}_n$; outliers something else
Ø Linear regression counterpart [Fuchs 99], [Giannakis et al 11]
Ø Both $\{\mathbf{s}_n\}$ and $\{\mathbf{o}_n\}$ unknown; $\{\mathbf{o}_n\}$ typically sparse!
n Natural (but intractable) estimator
(P0) $\min_{\mathbf{U},\{\mathbf{s}_n,\mathbf{o}_n\}} \sum_{n=1}^N \|\mathbf{y}_n - \mathbf{U}\mathbf{s}_n - \mathbf{o}_n\|_2^2 + \lambda_0 \|\mathbf{O}\|_0$
G. Mateos and G. B. Giannakis, "Robust PCA as bilinear decomposition with outlier sparsity regularization," IEEE Transactions on Signal Processing, pp. 5176-5190, Oct. 2012.
Universal robustness
n (P0) is NP-hard; relax the $\ell_0$-norm to its convex surrogate, e.g., [Tropp 06]
(P1) $\min_{\mathbf{U},\{\mathbf{s}_n,\mathbf{o}_n\}} \sum_{n=1}^N \|\mathbf{y}_n - \mathbf{U}\mathbf{s}_n - \mathbf{o}_n\|_2^2 + \lambda \sum_{n=1}^N \|\mathbf{o}_n\|_2$
Ø Role of the sparsity-controlling parameter $\lambda$ is central
Q: Does (P1) yield robust estimates? A: Yes! The Huber estimator is a special case
Alternating minimization for (P1)
Ø $\{\mathbf{U}, \mathbf{s}_n\}$ update: SVD of the outlier-compensated data $\mathbf{Y} - \mathbf{O}$
Ø $\{\mathbf{o}_n\}$ update: row-wise soft-thresholding of the residuals $\mathbf{Y} - \mathbf{U}\mathbf{S}$
Proposition: Algorithm 1's iterates converge to a stationary point of (P1)
(A numpy sketch of these two updates follows.)
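A minimal sketch of the alternating scheme, assuming the (P1) cost as written above with a column-wise $\ell_2$ outlier penalty; the shrinkage constant follows from that cost, not necessarily from the paper's exact algorithm.

```python
import numpy as np

def robust_pca(Y, rho, lam, n_iter=100):
    """Alternating minimization sketch for (P1):
    min_{U,S,O} ||Y - U S - O||_F^2 + lam * sum_n ||o_n||_2,
    with Y (p x N) holding the data vectors as columns."""
    O = np.zeros_like(Y)
    for _ in range(n_iter):
        # (U, S) update: rank-rho truncated SVD of outlier-compensated data
        U, s, Vt = np.linalg.svd(Y - O, full_matrices=False)
        U, S = U[:, :rho], np.diag(s[:rho]) @ Vt[:rho]
        # O update: column-wise (vector) soft-thresholding of the residuals
        R = Y - U @ S
        norms = np.maximum(np.linalg.norm(R, axis=0, keepdims=True), 1e-12)
        O = R * np.maximum(1.0 - lam / (2.0 * norms), 0.0)
    return U, S, O
```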
Video surveillance
n Background modeling from video feeds [De la Torre-Black 01]
[Figure: sample frames shown in columns Data, PCA, Robust PCA, Outliers]
Data: http://www.cs.cmu.edu/~ftorre/
Robust unveiling of communities
n Robust kernel PCA for identification of cohesive subgroups
n Network: NCAA football teams (vertices), Fall 2000 games (edges); ARI = 0.8967
Ø Identified exactly: Big 10, Big 12, ACC, SEC, ...; Outliers: independent teams
Data: http://www-personal.umich.edu/~mejn/netdata/
Online robust PCA
n Motivation: real-time big data and memory limitations
Ø Scalability via exponentially-weighted subspace tracking
Ø At time $t$, do not re-estimate past outliers $\{\mathbf{o}_\tau\}_{\tau < t}$
n Nominal data: $\mathbf{y}_t = \mathbf{U}\mathbf{s}_t + \mathbf{e}_t$
n Outliers: $\mathbf{y}_t = \mathbf{U}\mathbf{s}_t + \mathbf{o}_t + \mathbf{e}_t$
(A streaming sketch follows.)
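A streaming sketch under stated assumptions: the exponentially-weighted LS recursion of the talk is replaced by a plain stochastic-gradient update on the subspace for brevity, and lam, mu are illustrative tuning knobs, not values from the talk.

```python
import numpy as np

def online_robust_pca(stream, p, rho, lam=0.5, mu=0.05, seed=0):
    """Per-datum sketch: estimate (s_t, o_t) with the current subspace U
    fixed, then take a stochastic-gradient step on U."""
    rng = np.random.default_rng(seed)
    U = np.linalg.qr(rng.normal(size=(p, rho)))[0]   # random orthonormal init
    for y in stream:
        s = U.T @ y                                  # projection (U orthonormal)
        r = y - U @ s
        nrm = max(np.linalg.norm(r), 1e-12)
        o = r * max(1.0 - lam / (2.0 * nrm), 0.0)    # outlier via soft-thresholding
        s = U.T @ (y - o)                            # refit on the cleansed datum
        U += mu * np.outer(y - U @ s - o, s)         # gradient step on U
        U = np.linalg.qr(U)[0]                       # re-orthonormalize
        yield U, s, o
```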
Roadmap
n Robust principal component analysis
Ø Linear low-rank models and sparse outliers
n Scalable algorithms for big network data analytics
Ø (De-)centralized and online rank minimization
n Robust sparse embedding via dictionary learning
Ø Nonlinear low-rank models
Ø Data-adaptive compressed sensing
n Concluding remarks
Modeling traffic anomalies
n Anomalies: changes in origin-destination (OD) flows [Lakhina et al 04]
Ø Failures, congestions, DoS attacks, intrusions, flooding
n Graph G(N, L) with N nodes, L links, and F flows (F >> L); OD flow $z_{f,t}$
n Packet counts per link $l$ and time slot $t$: $y_{l,t} = \sum_{f} r_{l,f}\,(z_{f,t} + a_{f,t}) + v_{l,t}$, with routing entries $r_{l,f} \in \{0,1\}$
[Figure: example network with flows $f_1$, $f_2$ traversing link $l$; an anomaly rides on one flow]
n Matrix model across T time slots: $\mathbf{Y} = \mathbf{R}(\mathbf{Z} + \mathbf{A}) + \mathbf{V}$, with $\mathbf{Y} \in \mathbb{R}^{L \times T}$ and $\mathbf{R} \in \{0,1\}^{L \times F}$ (a synthetic instance follows)
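A hedged synthetic instance of this matrix model; the dimensions and distributions are taken from the numerical-validation slide at the end of the deck, while the rank r and anomaly density pi here are illustrative picks.

```python
import numpy as np

rng = np.random.default_rng(0)
L, F, T, r, pi = 105, 210, 420, 10, 0.01          # r, pi illustrative

R = rng.binomial(1, 0.5, size=(L, F)).astype(float)   # Bernoulli(1/2) routing
P = rng.normal(0, 1/np.sqrt(F*T), size=(F, r))        # factors ~ N(0, 1/FT)
Q = rng.normal(0, 1/np.sqrt(F*T), size=(T, r))
Z = P @ Q.T                                           # rank-r nominal OD flows
A = rng.choice([-1.0, 0.0, 1.0], size=(F, T),
               p=[pi/2, 1 - pi, pi/2])                # sparse ternary anomalies
Y = R @ (Z + A)                                       # L x T link-count matrix
print(Y.shape, np.linalg.matrix_rank(R @ Z), (A != 0).mean())
```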
Low-rank plus sparse matrices
n Z has low rank, e.g., [Zhang et al 05]; A is sparse across time and flows
[Figure: sample anomaly amplitudes $a_{f,t}$ (scale $\times 10^8$) versus time index t]
Data: http://math.bu.edu/people/kolaczyk/datasets.html
General decomposition problem
n Given $\mathbf{Y}$ and routing matrix $\mathbf{R}$, identify sparse $\mathbf{A}$ when $\mathbf{X} := \mathbf{R}\mathbf{Z}$ is low rank
Ø $\mathbf{R}$ fat, but $\mathbf{X}$ still low rank
(P1) $\min_{\mathbf{X},\mathbf{A}} \|\mathbf{Y} - \mathbf{X} - \mathbf{R}\mathbf{A}\|_F^2 + \lambda_* \|\mathbf{X}\|_* + \lambda_1 \|\mathbf{A}\|_1$
n Rank minimization surrogated by the nuclear norm $\|\mathbf{X}\|_* := \sum_k \sigma_k(\mathbf{X})$, e.g., [Recht-Fazel-Parrilo 10]
Ø Principal Components Pursuit (PCP) [Candes et al 10], [Chandrasekaran et al 11]
(A convex-programming sketch of (P1) follows.)
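A minimal cvxpy sketch of this convex program, assuming the regularization weights are tuned externally; this is the prototype estimator, not the scalable solvers developed on the next slides.

```python
import cvxpy as cp

def decompose(Y, R, lam_star=1.0, lam_1=0.1):
    """(P1) sketch: LS fit + nuclear norm on X + l1 norm on A.
    lam_star and lam_1 are illustrative; in practice they are tuned."""
    L, T = Y.shape
    F = R.shape[1]
    X = cp.Variable((L, T))
    A = cp.Variable((F, T))
    cost = (cp.sum_squares(Y - X - R @ A)
            + lam_star * cp.normNuc(X)      # convex surrogate for rank
            + lam_1 * cp.norm1(A))          # convex surrogate for sparsity
    cp.Problem(cp.Minimize(cost)).solve()
    return X.value, A.value
```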
Challenges and importance
n $\mathbf{R}\mathbf{A}$ not necessarily sparse and $\mathbf{R}$ fat, so PCP is not applicable
n Unknowns exceed data: LT + FT >> LT
n Important special cases
Ø R = I: matrix decomposition with PCP [Candes et al 10]
Ø X = 0: compressive sampling with basis pursuit [Chen et al 01]
Ø X = C_{L×ρ} W_{ρ×T} and A = 0: PCA [Pearson 1901]
Ø X = 0, R = D unknown: dictionary learning [Olshausen 97]
Exact recovery
n Noise-free case
(P0) $\min_{\mathbf{X},\mathbf{A}} \lambda_*\,\mathrm{rank}(\mathbf{X}) + \lambda_1 \|\mathbf{A}\|_0$ s.t. $\mathbf{Y} = \mathbf{X} + \mathbf{R}\mathbf{A}$
Q: Can one recover sparse $\mathbf{A}$ and low-rank $\mathbf{X}$ exactly? A: Yes! Under certain conditions on $\mathbf{X}$, $\mathbf{A}$, and $\mathbf{R}$
Theorem: Given $\mathbf{Y}$ and $\mathbf{R}$, assume every row and column of $\mathbf{A}$ has at most k non-zero entries, and $\mathbf{R}$ has full row rank. If the incoherence conditions C1)-C2) hold (stated in the paper), then, with appropriately chosen $\lambda_*$ and $\lambda_1$, (P0) exactly recovers $\{\mathbf{X}, \mathbf{A}\}$
M. Mardani, G. Mateos, and G. B. Giannakis, "Recovery of low-rank plus compressed sparse matrices with application to unveiling traffic anomalies," IEEE Trans. Information Theory, 2013.
In-network processing
n Robust imputation of a network data matrix (e.g., smart metering, network health cartography)
Goal: Given few rows per agent, perform distributed cleansing and imputation by leveraging the low rank of nominal data and the sparsity of the outliers
n Challenge: the nuclear norm is not separable across rows (links/agents)
G. Mateos and K. Rajawat, "Dynamic network cartography," IEEE Signal Processing Magazine, May 2013.
Separable regularization
n Key property: $\|\mathbf{X}\|_* = \min_{\{\mathbf{P},\mathbf{Q}\}:\,\mathbf{X} = \mathbf{P}\mathbf{Q}^\top} \tfrac{1}{2}\big(\|\mathbf{P}\|_F^2 + \|\mathbf{Q}\|_F^2\big)$ (numerical check below)
n Separable formulation equivalent to (P1), with $\mathbf{P} \in \mathbb{R}^{L\times\rho}$ and $\rho \ge \mathrm{rank}[\mathbf{X}]$
(P2) $\min_{\mathbf{P},\mathbf{Q},\mathbf{A}} \|\mathbf{Y} - \mathbf{P}\mathbf{Q}^\top - \mathbf{R}\mathbf{A}\|_F^2 + \tfrac{\lambda_*}{2}\big(\|\mathbf{P}\|_F^2 + \|\mathbf{Q}\|_F^2\big) + \lambda_1 \|\mathbf{A}\|_1$
Ø Nonconvex, but fewer variables
Proposition: If $\{\bar{\mathbf{P}}, \bar{\mathbf{Q}}, \bar{\mathbf{A}}\}$ is a stationary point of (P2) and a qualification condition on the residual holds (see the paper), then $\{\bar{\mathbf{X}} = \bar{\mathbf{P}}\bar{\mathbf{Q}}^\top, \bar{\mathbf{A}}\}$ is a global optimum of (P1).
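A short numerical check of the key property, assuming balanced SVD factors; the 0.5-scaled squared Frobenius norms of the factors match the nuclear norm exactly at the minimizer.

```python
import numpy as np

# For X = P Q^T, 0.5*(||P||_F^2 + ||Q||_F^2) >= ||X||_*, with equality at
# the balanced factors P = U sqrt(S), Q = V sqrt(S) from the SVD X = U S V^T.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 20)) @ rng.normal(size=(20, 40))   # rank <= 20
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = U @ np.diag(np.sqrt(s))          # balanced factors achieving the minimum
Q = Vt.T @ np.diag(np.sqrt(s))
nuc = s.sum()                        # nuclear norm = sum of singular values
bound = 0.5 * (np.linalg.norm(P, 'fro')**2 + np.linalg.norm(Q, 'fro')**2)
print(f"nuclear norm = {nuc:.4f}, balanced-factor bound = {bound:.4f}")
```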
Decentralized rank minimization
n Alternating-direction method of multipliers (ADMM) solver for (P2)
Ø Method: [Glowinski-Marrocco 75], [Gabay-Mercier 76]
Ø Learning over networks [Schizas-Ribeiro-Giannakis 07]
Ø Consensus-based optimization attains centralized performance
M. Mardani, G. Mateos, and G. B. Giannakis, "In-network sparsity regularized rank minimization: Algorithms and applications," IEEE Transactions on Signal Processing, 2013.
Internet2 data
n Real network data
Ø Dec. 8-28, 2008
Ø N=11, L=41, F=121, T=504
[Figure: ROC curves (detection vs. false-alarm probability) for the proposed method against [Lakhina04] and [Zhang05] at ranks 1-3]
[Figure: true vs. estimated anomaly volume across flows and time; P_fa = 0.03, P_d = 0.92]
Data: http://www.cs.bu.edu/~crovella/links.html
Online rank minimization
n Construct an estimated map of anomalies in real time
Ø Streaming data model: $\mathbf{y}_t = \mathbf{R}(\mathbf{z}_t + \mathbf{a}_t) + \mathbf{v}_t$
n Approach: regularized exponentially-weighted LS formulation (per-slot sketch below)
[Figure: tracking of cleansed link traffic (ATLA--HSTN, CHIN--ATLA, DNVR--KSCY, HSTN--ATLA) and real-time unveiling of anomalies (WASH--STTL, WASH--WASH), estimated vs. true]
M. Mardani, G. Mateos, and G. B. Giannakis, "Dynamic anomalography: Tracking network anomalies via sparsity and low rank," IEEE Journal of Selected Topics in Signal Processing, pp. 50-66, Feb. 2013.
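A per-slot sketch under stated assumptions: with the subspace factor P fixed, (q_t, a_t) are fit by a few proximal-gradient (ISTA) passes, then P takes a stochastic-gradient step; step sizes and lam1 are illustrative, not the paper's update rules.

```python
import numpy as np

def soft(x, thr):
    """Entrywise soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

def online_anomalography(stream, R, rank, lam1=0.1, mu=0.01, n_ista=20):
    """Fit y_t ~ P q_t + R a_t with an l1 penalty on a_t, slot by slot."""
    L, F = R.shape
    P = np.random.randn(L, rank) * 0.1
    for y in stream:
        a = np.zeros(F)
        for _ in range(n_ista):                       # ISTA with P fixed
            q = np.linalg.lstsq(P, y - R @ a, rcond=None)[0]
            r = y - P @ q - R @ a
            a = soft(a + mu * (R.T @ r), mu * lam1)   # proximal-gradient step
        P += mu * np.outer(y - P @ q - R @ a, q)      # gradient step on P
        yield P, q, a
```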
Roadmap
n Robust principal component analysis
Ø Linear low-rank models and sparse outliers
n Scalable algorithms for big network data analytics
Ø (De-)centralized and online rank minimization
n Robust sparse embedding via dictionary learning
Ø Nonlinear low-rank models; data-adaptive compressed sensing
n Concluding remarks
Nonlinear low-dimensional models?
q Compressive sampling (CS) [Donoho/Candes 06]: linear measurement operator
Ø CS vs. data-adaptive principal component analysis (PCA) [Pearson 1901]
Ø Data-adaptive nonlinear CS? Quadratic CS [Ohlsson et al 13]
q Nonlinear dimensionality reduction for data on manifolds
Ø Kernel PCA [Scholkopf et al 98]; SDE [Weinberger 04]; reconstruction?
Ø Locally linear embedding (LLE) [Roweis-Saul 00]; LEM; MDS; Isomap
Ø Sparsity-aware embeddings [Huang et al 10], [Vidal 11], [Kong et al 12]
Ø Dictionary learning (DL) [Olshausen 97]; online DL [Mairal et al 10], [Carin et al 11]
Learning sparse manifold models
q Training data $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_N]$ on a smooth but unknown manifold $\mathcal{M}$
Ø Use $\mathbf{Y}$ to learn a dictionary $\mathbf{D}$, jointly enforcing a sparse training-data fit and a smooth affine manifold fit (a generic DL sketch follows)
Ø $\mathbf{D}$ reduces and morphs the training data to yield a smoother basis for $\mathcal{M}$
Ø Robust sparse embedding via dictionary learning (RSE-DL)
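A generic dictionary-learning sketch via scikit-learn; it captures only the sparse-fit term Y ~ D S, not RSE-DL's added manifold-smoothness regularizer, and the dataset and hyperparameters here are placeholders.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 20))             # 200 training vectors in R^20

dl = DictionaryLearning(n_components=40,   # overcomplete: 40 atoms for R^20
                        transform_algorithm='lasso_lars',
                        alpha=0.5, max_iter=20, random_state=0)
S = dl.fit_transform(Y)                    # sparse codes, one row per datum
D = dl.components_                         # learned dictionary atoms
print(D.shape, (np.abs(S) > 1e-8).mean())  # atom count and code density
```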
Parsimonious nonlinear embedding
q Embedding preserves the learned manifold structure
Ø Reduced-complexity embedding step
q RSE-DL appropriate for (de-)compression and reconstruction
q Robust sparse coding also works for clustering/classification
RSE-DL compression and reconstruction
q Operational phase @ Tx: per data vector $\mathbf{y}$
q Compress: map $\mathbf{y}$ to a low-dimensional sparse code over the learned dictionary
q Operational phase @ Rx: given the (possibly noisy) code
q Reconstruct: synthesize $\hat{\mathbf{y}}$ from the code using the learned dictionary $\mathbf{D}$
Ø Less computationally demanding modules (a generic pipeline sketch follows)
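A generic stand-in for this Tx/Rx pipeline using lasso sparse coding against a learned dictionary; the exact RSE-DL compression and reconstruction maps are in the corresponding paper, so treat this as illustrative only.

```python
import numpy as np
from sklearn.linear_model import Lasso

def compress(y, D, lam=0.1):
    """Tx side: sparse-code y against dictionary D (columns are atoms);
    the few nonzero entries of s act as the compressed representation."""
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    lasso.fit(D, y)
    return lasso.coef_

def reconstruct(s_noisy, D):
    """Rx side: synthesize the estimate from the (possibly noisy) code."""
    return D @ s_noisy

# Usage with a random dictionary standing in for the learned D
rng = np.random.default_rng(0)
D = rng.normal(size=(64, 128)); D /= np.linalg.norm(D, axis=0)
y = D[:, :5] @ rng.normal(size=5)          # datum with a 5-sparse code
y_hat = reconstruct(compress(y, D), D)
print(np.linalg.norm(y - y_hat) / np.linalg.norm(y))
```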
Test case: Swiss roll
Ø Noise on the manifold and channel noise (levels lost in extraction)
[Figure: Swiss-roll embeddings and reconstructions; a baseline setup sketch follows]
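A baseline sketch reproducing only the test setup: a noisy Swiss roll embedded by LLE via scikit-learn. RSE-DL itself is not a library routine, and the noise level and neighbor count here are illustrative.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, t = make_swiss_roll(n_samples=1500, noise=0.5, random_state=0)  # noisy manifold
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
Z = lle.fit_transform(X)     # 2-D embedding of the 3-D roll
print(Z.shape)               # (1500, 2)
```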
Comparisons with LLE, RSE, RSGE
[Figure: performance comparisons, averaged over 100 realizations]
Missing data
q USC girl image (predates Lena!) with 50% of pixels missing
q RSE-DL: reduced complexity relative to, e.g., Bayesian-type methods [Chen et al 10]
Concluding summary
n Robust PCA; online via robust subspace tracking
Ø Leveraging linear low-rank models and outlier sparsity
n Unveiling anomalies in large-scale network data
Ø Scalable decentralized and online algorithms
n Data-adaptive, nonlinear, low-dimensional models
n The road ahead
Ø Performance bounds? Dynamical network data?
Ø Learning via quantized big data (few bits)?
Ø RSE-DL for nonlinear compressive sampling?
Thank you!
Numerical validation
n Setup: L=105, F=210, T=420; R ~ Bernoulli(1/2); X_0 = R P Q^T with P, Q ~ N(0, 1/FT); a_{ij} ∈ {-1, 0, 1} w.p. {π/2, 1-π, π/2}
n Relative recovery error
[Figure: phase-transition map of recovery error over rank(X_0) (10 to 50) and % of non-zero entries (s/FT, 0.1% to 12.5%)]