Randomized Robust Linear Regression for big data applications
Yannis Kopsinis, Dept. of Informatics & Telecommunications, UoA
Thursday, Apr 16, 2015
In collaboration with S. Chouvardas, Harris Georgiou, Sergios Theodoridis
Outline
1. Big Data era
2. Randomized Methods
3. Randomized Linear Regression
4. Robust Randomized Linear Regression
5. Iterative Randomized Robust Regression
6. Randomized Low Rank matrix approximation
Big Data era: Why all the fuss?
- Massive data volumes are not a new thing: about $65 \times 10^{18}$ bytes flowed through telecommunication networks in 2007.
- First new thing: established data analysis and machine learning techniques face big challenges.
- Second new thing: novel approaches for data capturing, handling and processing have emerged.
- Third new thing: new modalities and increased complexity (internet of things, cyber-physical systems, smart homes, smart cars, etc.).
Big Data era: Why all the fuss?
- Marketing policies and new emerging applications: from Big Data to insights, and from insights to big profits.
- 4.4 million data scientists needed by 2015 (IBM).
- Many challenging open problems / a paradigm shift.
Big Data era: What characterizes big data
- Volume (scale of data)
- Variety (different forms of data)
- Velocity (streaming data)
- Veracity (presence of outliers / corruptions)
Big Data era: How to deal with big data
Distributed Processing
- Centralized approach, e.g. MapReduce/Hadoop
- Decentralized approach, e.g. ad-hoc in-network processing
- Shared processing power and storage requirements
- Privacy protection
Online Learning
- Process data on the fly
- Limited storage demands
- Reduced computational complexity (stochastic gradient descent)
- Dealing with time-varying situations
Randomized Methods
Randomized Methods
Major principle that governs randomized methods: instead of working with the original large-scale data matrices, operate on compressed versions of them. The compression is realized via computationally efficient dimensionality reduction, which is performed in a randomized rather than in a deterministic way.
Some facts!
- It is a very appealing idea!
- Data are highly compressible.
- Low-speed memory units are the major bottleneck.
- It is applicable to ubiquitous data analysis and ML tasks, even to basic matrix operations:
  - Matrix multiplication
  - Linear regression
  - Low-rank matrix approximation (Singular Value Decomposition)
Randomized Methods
Some facts! (cont.)
What is the price to pay?
- Approximate rather than exact solutions are provided.
- There is a probability of failure.
Randomized Linear Regression
Linear LS Regression
$b = Ax + \eta$, with $b \in \mathbb{R}^{N}$, $A \in \mathbb{R}^{N \times l}$, $x \in \mathbb{R}^{l}$, $\eta \in \mathbb{R}^{N}$, $N \gg l$, and at least $N$ very large.
$\hat{x}_{LS} = \arg\min_{x \in \mathbb{R}^l} \|b - Ax\|_2^2$
$\hat{x}_{LS} = (A^T A)^{-1} A^T b$
Computational complexity: $O(Nl^2)$ via QR decomposition.
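As a quick illustration (not part of the slides), a minimal NumPy sketch of the closed-form LS solution on such a tall system; all sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
N, l = 10_000, 50                                  # tall system, N >> l
A = rng.standard_normal((N, l))
x_true = rng.standard_normal(l)
b = A @ x_true + 0.01 * rng.standard_normal(N)

# Normal equations: x = (A^T A)^{-1} A^T b, cost O(N l^2)
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# Library least-squares solver (orthogonal-factorization based, also O(N l^2))
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_ne, x_ls))                     # both give (numerically) the same solution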
Randomized Least Squares
Randomized Linear LS Regression
$b = Ax + \eta$, with $A \in \mathbb{R}^{N \times l}$; compress with $R \in \mathbb{R}^{d \times N}$: $\tilde{b} = Rb \in \mathbb{R}^{d}$, $\tilde{A} = RA \in \mathbb{R}^{d \times l}$, where $d \ll N$.
$\hat{x}_{R} = \arg\min_{x \in \mathbb{R}^l} \|\tilde{b} - \tilde{A}x\|_2^2$
Computational complexity: $O(dl^2) + C(R) + T(RA)$ (cost of generating $R$ plus cost of forming $RA$).
Some theoretic results [Drineas 2011]
If $d = O\!\left(l(\ln l)(\ln N) + \frac{l \ln N}{\epsilon}\right)$, then with probability at least 0.8:
- $\|b - A\hat{x}_{R}\|_2 \le (1 + \epsilon)\, \|b - A\hat{x}_{LS}\|_2$
- $\|\hat{x}_{LS} - \hat{x}_{R}\|_2 \le \sqrt{\epsilon}\, \kappa(A) \sqrt{\gamma^{-2} - 1}\, \|\hat{x}_{LS}\|_2$, provided $N \le e^{l}$ and $\|U_A U_A^T b\|_2 \ge \gamma \|b\|_2$, for some $\gamma \in (0, 1]$.
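A minimal sketch of this sketch-and-solve idea; a dense Gaussian $R$ stands in for the fast JL transforms discussed later, and the value of $d$ is chosen arbitrarily rather than via the theory.

import numpy as np

rng = np.random.default_rng(1)
N, l, d = 10_000, 20, 500                          # d << N
A = rng.standard_normal((N, l))
b = A @ rng.standard_normal(l) + 0.1 * rng.standard_normal(N)

# Compress with a (dense, Gaussian) R; the slides use a fast JL / Hadamard-based R instead
R = rng.standard_normal((d, N)) / np.sqrt(d)
x_R, *_ = np.linalg.lstsq(R @ A, R @ b, rcond=None)   # solve the small d x l problem
x_LS, *_ = np.linalg.lstsq(A, b, rcond=None)          # exact solution, for reference

# Residual ratio should be close to 1, i.e. within the (1 + eps) guarantee
print(np.linalg.norm(b - A @ x_R) / np.linalg.norm(b - A @ x_LS))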
Johnson Lindenstrauss (JL) seminal work (1984)
Lemma: For any set $S$ of $l$ points $u_1, u_2, \ldots$ in $\mathbb{R}^N$ there exists a linear mapping $R: \mathbb{R}^N \rightarrow \mathbb{R}^d$, with $d = O(\epsilon^{-2} \log l)$, such that all the pairwise distances are approximately preserved:
$\forall i, j:\ (1 - \epsilon)\|u_i - u_j\|_2^2 \le \|Ru_i - Ru_j\|_2^2 \le (1 + \epsilon)\|u_i - u_j\|_2^2$
W.B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, 1984.
JL Transforms (the matrix $R$)
- Johnson and Lindenstrauss (1984): choose $R$ uniformly at random from the space of projection matrices.
- Frankl and Maehara (1988): random orthogonal matrix.
- Indyk and Motwani (1998), DasGupta and Gupta (1999): entries drawn i.i.d. from $\mathcal{N}(0, \frac{1}{N})$.
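A small numerical illustration of the lemma, assuming the Gaussian construction above; the constant in $d = O(\epsilon^{-2}\log(\#\text{points}))$ is illustrative.

import numpy as np

rng = np.random.default_rng(2)
N, n_pts, eps = 5_000, 50, 0.2
d = int(np.ceil(8 * np.log(n_pts) / eps**2))       # d = O(eps^-2 log(#points)); the constant 8 is illustrative

U = rng.standard_normal((n_pts, N))                # the points u_1, ..., u_{n_pts} in R^N
R = rng.standard_normal((d, N)) / np.sqrt(d)       # Gaussian JL map R^N -> R^d
V = U @ R.T                                        # projected points in R^d

orig = np.linalg.norm(U[0] - U[1]) ** 2
proj = np.linalg.norm(V[0] - V[1]) ** 2
print(proj / orig)                                 # should lie in [1 - eps, 1 + eps] with high probability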
Johnson Lindenstrauss (JL) seminal work
JL geometry in the Linear Regression case
$b = Ax + \eta$, with $b \in \mathbb{R}^{N}$, $A \in \mathbb{R}^{N \times l}$, $x \in \mathbb{R}^{l}$.
$\hat{x}_{R} = \arg\min_{x \in \mathbb{R}^l} \|Rb - RAx\|_2^2$
Accelerating Johnson Lindenstrauss (JL) Transforms
Achlioptas (2003)
$a_{i,j} = \begin{cases} +\sqrt{3/d}, & \text{with probability } 1/6,\\ 0, & \text{with probability } 2/3,\\ -\sqrt{3/d}, & \text{with probability } 1/6. \end{cases}$
Then, if $d \ge \frac{4 + 2\beta}{\epsilon^2/2 - \epsilon^3/3} \log l$, each pairwise distance is preserved with probability at least $1 - l^{-\beta}$.
Fast JL Transforms (e.g. Sarlos 2006, Drineas et al. 2011)
$R = P H D$
- $D \in \mathbb{R}^{N \times N}$: diagonal matrix with random $\pm 1$ entries
- $H \in \mathbb{R}^{N \times N}$: (normalized) Hadamard matrix
- $P \in \mathbb{R}^{d \times N}$: a sparse matrix (or simply a sampling matrix)
Computational Complexity / Facts
- It is called the Randomized Hadamard Transform.
- Multiplication with $D$ is just selective sign changes.
- $Ha$ costs $O(N \log k)$, where $k \le N$ is the number of Hadamard components needed.
- Overall, $RA$ takes $O(lN \log k)$.
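A minimal sketch of applying $R = PHD$ to a single vector, assuming $N$ is a power of two; the fwht helper, the uniform sampling used for $P$, and the rescaling constant are illustrative choices.

import numpy as np

def fwht(x):
    # Fast Walsh-Hadamard transform (returns a new array), O(N log N), N a power of 2
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(len(x))                     # normalized (orthonormal) Hadamard transform

rng = np.random.default_rng(3)
N, d = 2 ** 12, 256
a = rng.standard_normal(N)

D = rng.choice([-1.0, 1.0], size=N)                # the random-sign diagonal D
keep = rng.choice(N, size=d, replace=False)        # uniformly sampled rows (the matrix P)
Ra = np.sqrt(N / d) * fwht(D * a)[keep]            # R a = P H D a, rescaled so E||Ra||^2 = ||a||^2

print(np.linalg.norm(a), np.linalg.norm(Ra))       # the two norms should be comparable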
Fast LS approximation: Example
Randomized Hadamard Transform; recall $d = O\!\left(l(\ln l)(\ln N) + \frac{l \ln N}{\epsilon}\right)$.
- Example 1: $N = 10^6$, $l = 200$, $\epsilon = 0.1$. Speed-up $\frac{Nl^2}{dl^2 + lN\log(d)} = 10$; compression ratio $\frac{N}{d} = 23$.
- Example 2: $N = 10^8$, $l = 1000$, $\epsilon = 0.1$. Speed-up $\frac{Nl^2}{dl^2 + lN\log(d)} = 63$; compression ratio $\frac{N}{d} = 321$.
Randomized projections vs Randomized sampling (figures)
Statistical Leverage
Hat Matrix / Statistical Leverage Scores
$b = Ax + \eta$, $\hat{x}_{LS} = (A^T A)^{-1} A^T b$
$\hat{b} = A\hat{x}_{LS} = A(A^T A)^{-1} A^T b$, i.e. $\hat{b} = Hb$
- $H_{ij}$ measures the influence exerted on the prediction $\hat{b}_i$ by the observation $b_j$.
- $l_i = H_{ii}$ measures the importance of $b_i$ in determining the best LS fit.
- $l_i$, $i = 1, \ldots, N$, are referred to as statistical leverage scores.
- $H = P_A = UU^T$, for any orthonormal matrix $U$ spanning the column space of $A$; hence $l_i = \|U_{i,\cdot}\|_2^2$.
- Very large $H_{ii}$ are indicators of outliers in $A$.
Randomized Sampling
Sampling Strategy
- Construct an importance sampling distribution $\{p_i\}_{i=1}^{N}$, with $p_i = \frac{l_i}{l}$. Intuitively, the larger $p_i$ is, the higher the probability of selecting the $i$th data sample $(b_i, A_{i,\cdot})$.
- Start with a zero matrix $R \in \mathbb{R}^{d \times N}$. Then successively fill a single entry of each row, say the $i$th, as follows:
  - Sample a random value $\rho \in \{1, \ldots, N\}$ from the importance sampling distribution.
  - Set $R_{i,\rho} = \frac{1}{\sqrt{d\, p_\rho}}$.
- Via $\tilde{A} = RA$, $\tilde{A}$ comprises rescaled rows of $A$, randomly sampled with replacement.
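A minimal sketch of this sampling strategy, assuming exact leverage scores computed via the SVD; the helper name and all sizes are illustrative.

import numpy as np

def leverage_sampling_matrix(p, d, rng):
    # Build the d x N sampling/rescaling matrix: each row has a single
    # nonzero entry 1/sqrt(d * p_rho) in a column rho drawn from p.
    N = len(p)
    R = np.zeros((d, N))
    rows = rng.choice(N, size=d, replace=True, p=p)     # sampling with replacement
    R[np.arange(d), rows] = 1.0 / np.sqrt(d * p[rows])
    return R

rng = np.random.default_rng(4)
N, l, d = 4_096, 10, 400
A = rng.standard_normal((N, l))
U, _, _ = np.linalg.svd(A, full_matrices=False)
lev = np.sum(U ** 2, axis=1)                            # exact leverage scores (they sum to l)
p = lev / lev.sum()                                     # importance sampling distribution p_i = l_i / l

R = leverage_sampling_matrix(p, d, rng)
A_tilde = R @ A                                         # d x l: rescaled, resampled rows of A
print(A_tilde.shape)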
Computation of the Statistical Leverage
Naive way
- $A = U\Sigma V^T$; then $U$ is an orthonormal basis spanning the column space of $A$. Alas, the complexity is $O(Nl^2)$.
Fast approximations (Drineas 2012)
- Exploit the fact that $l_i = \|(UU^T)_{i,\cdot}\|_2^2 = \|(AA^{\dagger})_{i,\cdot}\|_2^2$.
- Construct two fast JL transform matrices (e.g. randomized Hadamard transforms), $\Pi_1 \in \mathbb{R}^{r_1 \times N}$ and $\Pi_2 \in \mathbb{R}^{r_1 \times r_2}$.
- Estimate the leverage scores as $\hat{l}_i = \|(A(\Pi_1 A)^{\dagger}\Pi_2)_{i,\cdot}\|_2^2$.
- It is proved that $|l_i - \hat{l}_i| \le \epsilon\, l_i$, $\forall i$.
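A sketch contrasting the naive and the approximate computation; Gaussian matrices stand in for the fast JL transforms $\Pi_1, \Pi_2$, and $r_1$, $r_2$ are illustrative choices (the approximation improves as they grow).

import numpy as np

rng = np.random.default_rng(5)
N, l = 8_192, 12
A = rng.standard_normal((N, l))

# Naive way: full (thin) SVD of A, O(N l^2)
U = np.linalg.svd(A, full_matrices=False)[0]
lev_exact = np.sum(U ** 2, axis=1)

# Fast approximation in the spirit of [Drineas 2012]; Gaussian maps replace the fast JL transforms
r1, r2 = 40 * l, 20 * int(np.log(N))
Pi1 = rng.standard_normal((r1, N)) / np.sqrt(r1)
Pi2 = rng.standard_normal((r1, r2)) / np.sqrt(r2)
X = A @ (np.linalg.pinv(Pi1 @ A) @ Pi2)                 # N x r2; never forms an N x N matrix
lev_approx = np.sum(X ** 2, axis=1)

# Typical relative error is a few percent; it shrinks as r1 and r2 grow
print(np.median(np.abs(lev_approx - lev_exact) / lev_exact))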
Randomized projections vs Randomized sampling: common ground!
- Random projections uniformize the leverage scores (so simple uniform random sampling is adequate).
- Without random projection-based preprocessing, more advanced sampling is needed (and leverage score-based importance sampling does the job!).
Robust Randomized Linear Regression
Robust Linear Regression -- recall Veracity!
$b = Ax + \eta$, $\eta = n + o$ (noise plus sparse outliers)
$\hat{x}_{LAD} = \arg\min_{x \in \mathbb{R}^l} \|b - Ax\|_1$
- Least Absolute Deviations (LAD) does not admit a closed-form solution.
- Linear programming using, e.g., interior-point methods: $O(\mathrm{poly}(N))$.
- Use approximate, iterative solutions, e.g. ADMM [Boyd 2011] (see the sketch below).
A hard time for fast JL transforms
$Rb = RAx + Rn + Ro$
- The sparsity property is missing from $Ro$.
- The energy of the nonzero values of $o$ is spread across all $d$ dimensions.
- LAD is not appropriate anymore.
Randomized sampling is still OK
- $Ro$ is still sparse.
- LAD can be applied.
- Harder (at least in theory) to compute the approximate leverage scores.
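Since LAD has no closed form, a minimal ADMM sketch in the spirit of [Boyd 2011] (splitting $\min \|z\|_1$ s.t. $Ax - z = b$); the penalty rho, the fixed iteration count and the synthetic outliers are illustrative choices, not part of the slides.

import numpy as np

def lad_admm(A, b, rho=1.0, n_iter=200):
    # Least Absolute Deviations, min_x ||Ax - b||_1, via ADMM; no stopping criterion, just a sketch
    N, l = A.shape
    x, z, u = np.zeros(l), np.zeros(N), np.zeros(N)
    Q, Rf = np.linalg.qr(A)                              # cache a factorization for the repeated x-updates
    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
    for _ in range(n_iter):
        x = np.linalg.solve(Rf, Q.T @ (b + z - u))       # x-update: a least-squares solve
        Ax = A @ x
        z = soft(Ax - b + u, 1.0 / rho)                  # z-update: soft thresholding
        u = u + Ax - b - z                               # dual update
    return x

rng = np.random.default_rng(6)
N, l = 2_000, 8
A = rng.standard_normal((N, l))
x_true = rng.standard_normal(l)
b = A @ x_true + 0.05 * rng.standard_normal(N)
b[:50] += 20.0                                           # gross outliers in 50 observations
print(np.linalg.norm(lad_admm(A, b) - x_true))           # remains small despite the outliers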
Randomized Sampling for LAD
$\ell_1$ leverage scores
- Recall the LS case: $l_i^{(2)} = \|U_{i,\cdot}\|_2^2$, where $U$ can be any orthonormal basis spanning the column space of $A$.
- LAD regression case: leverage scores $l_i^{(1)} = \|U_{i,\cdot}\|_1$.
- $\|U_{i,\cdot}\|_1$ is not invariant under rotation, so a well-conditioned $U$ needs to be used.
- Cauchy-distributed variables / submatrices are needed.
- In practice, benefits over the $l^{(2)}$ construction are observed when $N$ is much larger than $l$.
Reasoning behind our approach
Proposed Approach
- Apply a fast JL transform: $\tilde{b} = Rb$, $\tilde{A} = RA$.
- Progressively clean the data from outliers in the reduced-dimensional space.
- Obtain the final solution with ordinary LS.
In an ideal world... (I)
- Let $\Lambda \subset \{1, \ldots, N\}$ be the index set indicating the corrupted data, and assume $\Lambda$ is known. Then $A_{\Lambda^c,\cdot}$, $b_{\Lambda^c}$ are the outlier-free data.
- Ideal solution: $\hat{x}_{LS} = \arg\min_{x \in \mathbb{R}^l} \|R_{\cdot,\Lambda^c}\, b_{\Lambda^c} - R_{\cdot,\Lambda^c}\, A_{\Lambda^c,\cdot}\, x\|_2$.
- Cleaning the compressed data directly in the low-dimensional domain:
  $R_{\cdot,\Lambda^c}\, b_{\Lambda^c} = \tilde{b} - R_{\cdot,\Lambda}\, b_{\Lambda}$, and $R_{\cdot,\Lambda^c}\, A_{\Lambda^c,\cdot} = \tilde{A} - R_{\cdot,\Lambda}\, A_{\Lambda,\cdot}$.
Reasoning behind our approach
In an ideal world... (II)
- Let the randomized Hadamard transform be applied to the full data set: $\tilde{b} = \tilde{A}x + \tilde{n} + Ro$, where $\tilde{n} = Rn$.
- Assume that $x$ can be estimated exactly. Then
  $z = Ro + \tilde{n}$,   (1)
  where $z$ is computed as $\tilde{b} - \tilde{A}x$.
- Request: is it possible to estimate the support of $o$, in the reduced-dimensional space, based on (1)? Indeed, this is a typical compressed sensing scenario:
  $\hat{o} = \arg\min_{o} \|z - Ro\|_2 \ \ \text{s.t.} \ \ \|o\|_0 \le K_o$
- We only need to estimate the support (or a subset of it).
Reasoning behind our approach
Back in reality...
- Let the randomized Hadamard transform be applied to the full data set: $\tilde{b} = \tilde{A}x + \tilde{n} + Ro$, where $\tilde{b} = Rb$, $\tilde{A} = RA$, $\tilde{n} = Rn$.
- $x$ is not known, but $Ro$ is likely to be (approximately) normally distributed, so
  $\hat{x} = \arg\min_{x \in \mathbb{R}^l} \|\tilde{b} - \tilde{A}x\|_2$, where $\hat{x} = x + x_e$.
- Then $z = \tilde{b} - \tilde{A}\hat{x} \approx Ro + \tilde{n}$.
- Request: estimate any part of the support of $o$.
- Suggestion: just use the CoSaMP proxy, $\psi = R^T z$, $\Lambda = \mathrm{Supp}(\psi, K)$.
The full picture
Iterative Randomized Robust LS: Concept
- Compress the data: $\tilde{b} = Rb$, $\tilde{A} = RA$.
- Start iterations:
  - Get a tentative estimate $\hat{x}$ via $\arg\min_{x \in \mathbb{R}^l} \|\tilde{b} - \tilde{A}x\|_2$.
  - Compute $\psi = R^T(\tilde{b} - \tilde{A}\hat{x})$.
  - Define $\Lambda$ as the set of indices of the $K$ largest (in magnitude) components of $\psi$.
    Key remark: we are happy if $\Lambda$ contains some, not necessarily $K$, outlier indices.
  - Exclude / clear the data indexed by $\Lambda$ from the compressed data set:
    $R_{\cdot,\Lambda^c}\, b_{\Lambda^c} = \tilde{b} - R_{\cdot,\Lambda}\, b_{\Lambda}$, and $R_{\cdot,\Lambda^c}\, A_{\Lambda^c,\cdot} = \tilde{A} - R_{\cdot,\Lambda}\, A_{\Lambda,\cdot}$.
    Key remark: note that some healthy data might be omitted as well.
  - Return, to hopefully get an improved $\hat{x}$, or stop.
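A minimal end-to-end sketch of the concept above; a Gaussian $R$ replaces the randomized Hadamard transform for brevity, and $d$, $K$, the iteration count and the synthetic outliers are illustrative choices.

import numpy as np

def iterative_randomized_robust_ls(A, b, d, K, n_iter=5, rng=None):
    # Sketch of the iterative concept above (Gaussian R instead of a randomized Hadamard transform)
    if rng is None:
        rng = np.random.default_rng()
    N, l = A.shape
    R = rng.standard_normal((d, N)) / np.sqrt(d)
    b_t, A_t = R @ b, R @ A                                 # compressed data
    removed = np.zeros(N, dtype=bool)
    for _ in range(n_iter):
        x_hat, *_ = np.linalg.lstsq(A_t, b_t, rcond=None)   # tentative LS estimate
        psi = R.T @ (b_t - A_t @ x_hat)                     # CoSaMP-style proxy
        psi[removed] = 0.0                                  # ignore data already cleared
        Lam = np.argsort(np.abs(psi))[-K:]                  # K largest-magnitude entries
        b_t = b_t - R[:, Lam] @ b[Lam]                      # clear the flagged data directly
        A_t = A_t - R[:, Lam] @ A[Lam, :]                   #   in the compressed domain
        removed[Lam] = True
    x_final, *_ = np.linalg.lstsq(A_t, b_t, rcond=None)
    return x_final

rng = np.random.default_rng(7)
N, l = 8_192, 10
A = rng.standard_normal((N, l))
x_true = rng.standard_normal(l)
o = np.zeros(N)
o[rng.choice(N, 80, replace=False)] = 100.0                 # sparse gross outliers
b = A @ x_true + 0.05 * rng.standard_normal(N) + o

x_naive, *_ = np.linalg.lstsq(A, b, rcond=None)             # plain LS on the corrupted data
x_hat = iterative_randomized_robust_ls(A, b, d=1_024, K=60, rng=rng)
print(np.linalg.norm(x_naive - x_true), np.linalg.norm(x_hat - x_true))   # cleaning should help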
Computational complexity analysis
Proposed
- Once: $O((l+1)N\log d)$.
- Per iteration: $O(dl^2) + d(l+1) + O(Nd) + O(N) + O(dKl)$.
Random Sampling
- For the leverage scores: $O((l+1)N\log r_1 + lNr_2 + r_1 l^2 + r_2 l^2)$, with $r_1 = d$ and $r_2 = O(\log l)$.
- For LAD: $\mathrm{poly}(d)$.
Some Results (figures)
Randomized Methods for Low Rank approximation
Sampling the column space is the key...
Task: given $A \in \mathbb{R}^{n \times m}$, solve $\min_{X:\ \mathrm{rank}(X) = k} \|A - X\|_F$.
Randomized projection-based range finder
- Generate a matrix $R \in \mathbb{R}^{m \times d}$ and compress: $Y = AR$.
- Some housekeeping: replace $Y$ with $Q$, whose columns form an orthonormal basis for the range of $Y$.
SVD estimation in 3 steps
- $B = Q^T A$
- Compute the low-dimensional SVD: $B = \tilde{U}\Sigma V^T$
- $U = Q\tilde{U}$
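A minimal sketch of the range finder and the 3-step SVD above; the oversampling parameter and all sizes are illustrative.

import numpy as np

def randomized_svd(A, k, oversample=10, rng=None):
    # Randomized range finder followed by the 3-step SVD estimation described above
    if rng is None:
        rng = np.random.default_rng()
    n, m = A.shape
    d = k + oversample
    R = rng.standard_normal((m, d))                 # random test matrix
    Y = A @ R                                       # sample the column space of A
    Q, _ = np.linalg.qr(Y)                          # orthonormal basis for range(Y)
    B = Q.T @ A                                     # small d x m matrix
    U_tilde, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_tilde                                 # lift back to the original space
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(8)
A = rng.standard_normal((2_000, 30)) @ rng.standard_normal((30, 1_000))   # an (exactly) rank-30 matrix
U, s, Vt = randomized_svd(A, k=30, rng=rng)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))               # ~0 for an exactly rank-k matrix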