Robust Regression on MapReduce


Xiangrui Meng
LinkedIn Corporation, 2029 Stierlin Ct, Mountain View, CA

Michael W. Mahoney
Department of Mathematics, Stanford University, Stanford, CA

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

Although the MapReduce framework is now the de facto standard for analyzing massive data sets, many algorithms (in particular, many iterative algorithms popular in machine learning, optimization, and linear algebra) are hard to fit into MapReduce. Consider, e.g., the $\ell_p$ regression problem: given a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $b \in \mathbb{R}^m$, find a vector $x^* \in \mathbb{R}^n$ that minimizes $f(x) = \|Ax - b\|_p$. The widely-used $\ell_2$ regression, i.e., linear least-squares, is known to be highly sensitive to outliers; and choosing $p \in [1, 2)$ can help improve robustness. In this work, we propose an efficient algorithm for solving strongly over-determined ($m \gg n$) robust $\ell_p$ regression problems to moderate precision on MapReduce. Our empirical results on data up to the terabyte scale demonstrate that our algorithm is a significant improvement over traditional iterative algorithms on MapReduce for $\ell_1$ regression, even for a fairly small number of iterations. In addition, our proposed interior-point cutting-plane method can also be extended to solving more general convex problems on MapReduce.

1 Introduction

Statistical analysis of massive data sets presents very substantial challenges both to data infrastructure and to algorithm development. In particular, many popular data analysis and machine learning algorithms that perform well when applied to small-scale and medium-scale data that can be stored in RAM are infeasible when applied to the terabyte-scale and petabyte-scale data sets that are stored in distributed environments and that are increasingly common. In this paper, we develop algorithms for variants of the robust regression problem, and we evaluate implementations of them on data of up to the terabyte scale.
In addition to being of interest since ours are the first algorithms for these problems that are appropriate for data of that scale, our results are also of interest since they highlight algorithm engineering challenges that will become more common as researchers try to scale up small-scale and medium-scale data analysis and machine learning methods. For example, at several points we had to work with variants of more primitive algorithms that were worse by traditional complexity measures but that had better communication properties.

1.1 MapReduce and Large-scale Data

The MapReduce framework, introduced by (Dean & Ghemawat, 2004), has emerged as the de facto standard parallel environment for analyzing massive data sets. Apache Hadoop [1], an open source software framework inspired by Google's MapReduce, is now extensively used by companies such as Facebook, LinkedIn, Yahoo!, etc. In a typical application, one builds clusters of thousands of nodes containing petabytes of storage in order to process terabytes or even petabytes of daily data. As a parallel computing framework, MapReduce is well-known for its scalability to massive data. However, the scalability comes at the price of a very restrictive interface: sequential access to data, and functions limited to map and reduce. Working within this framework demands that traditional algorithms be redesigned to respect this interface. For example, the Apache Mahout [2] project is building a collection of scalable machine learning algorithms that includes algorithms for collaborative filtering, clustering, matrix decomposition, etc.

[1] Apache Hadoop, http://hadoop.apache.org/
[2] Apache Mahout, http://mahout.apache.org/

Although some algorithms are easily adapted to the MapReduce framework, many algorithms (and in particular many iterative algorithms popular in machine learning, optimization, and linear algebra) are not. When the data are stored in RAM, each iteration is usually very cheap in terms of floating-point operations (FLOPs). However, when the data are stored on secondary storage, or in a distributed environment, each iteration requires at least one pass over the data. Since the cost of communication to and from secondary storage often dominates FLOP count costs, each pass can become very expensive for very large-scale problems. Moreover, there is generally no parallelism between iterations: an iterative algorithm must wait until the previous step is completed before the next step can begin.

1.2 Our Main Results

In this work, we are interested in developing algorithms for robust regression problems on MapReduce. Of greatest interest will be algorithms for the strongly over-determined $\ell_1$ regression problem [3], although our method will extend to more general $\ell_p$ regression. For simplicity of presentation, we will formulate most of our discussion in terms of $\ell_p$ regression; and toward the end we will describe the results of our implementation of our algorithm for $\ell_1$ regression.

Recall the strongly over-determined $\ell_p$ regression problem: given a matrix $A \in \mathbb{R}^{m \times n}$, with $m \gg n$, a vector $b \in \mathbb{R}^m$, a number $p \in [1, \infty)$, and an error parameter $\epsilon > 0$, find a $(1+\epsilon)$-approximate solution $\hat{x} \in \mathbb{R}^n$ to

$$f^* = \min_{x \in \mathbb{R}^n} \|Ax - b\|_p, \tag{1}$$

i.e., find a vector $\hat{x}$ such that

$$\|A\hat{x} - b\|_p \le (1 + \epsilon) f^*, \tag{2}$$

where the $\ell_p$ norm is given by $\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$. A more robust alternative to the widely-used $\ell_2$ regression is obtained by working with $\ell_p$ regression, with $p \in [1, 2)$, where $p = 1$ is by far the most popular alternative. This, however, comes at the cost of increased complexity.
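As a small illustration of (1) and (2), the following Python sketch evaluates the $\ell_p$ objective and checks the relative-error condition on a toy problem with one outlier. It is not part of the paper's implementation; the helper names (`lp_norm`, `objective`, `is_approx_solution`) and the toy data are assumptions made for this example, and matrices are plain lists of rows.

```python
def lp_norm(v, p):
    """Return the l_p norm (sum_i |v_i|^p)^(1/p) for p >= 1."""
    return sum(abs(t) ** p for t in v) ** (1.0 / p)


def residual(A, x, b):
    """Return Ax - b for a dense matrix A given as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i
            for row, b_i in zip(A, b)]


def objective(A, x, b, p):
    """Objective of (1): ||Ax - b||_p."""
    return lp_norm(residual(A, x, b), p)


def is_approx_solution(A, x_hat, b, p, f_star, eps):
    """Condition (2): ||A x_hat - b||_p <= (1 + eps) * f_star."""
    return objective(A, x_hat, b, p) <= (1.0 + eps) * f_star


# Toy data (hypothetical): one unknown, three measurements, the last an outlier.
A = [[1.0], [1.0], [1.0]]
b = [1.0, 2.0, 12.0]
f1_at_median = objective(A, [2.0], b, 1)  # residuals -> |1| + |0| + |-10|
f1_at_mean = objective(A, [5.0], b, 1)    # residuals -> |4| + |3| + |-7|
```

On this toy problem the $\ell_1$ objective is smaller at the median of the responses than at their mean, which is the robustness effect that motivates $p \in [1, 2)$.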
While $\ell_2$ regression can be solved with, e.g., a QR decomposition, $\ell_1$ regression problems can be formulated as linear programs, and other $\ell_p$ regression problems can be formulated as convex programs. In those cases, iterated weighted least-squares methods or simplex methods or interior-point methods are typically used in practice. These algorithms tend to require dot products, orthogonalization, and thus a great deal of communication, rendering them challenging to implement in the MapReduce framework.

[3] The $\ell_1$ regression problem is also known as the Least Absolute Deviations or Least Absolute Errors problem.

In this paper, we describe an algorithm with better communication properties that is efficient for solving strongly over-determined $\ell_p$ regression problems to moderate precision on MapReduce [4]. Several aspects of our main algorithm are of particular interest:

- Single-pass conditioning. We use a recently-developed fast rounding algorithm (which takes $O(mn^3 \log m)$ time to construct a $2n$-rounding of a centrally symmetric convex set in $\mathbb{R}^n$ (Clarkson et al., 2013)) to construct a single-pass deterministic conditioning algorithm for $\ell_p$ regression.
- Single-pass random sampling. By using a constrained form of $\ell_p$ regression (that was also used recently (Clarkson et al., 2013)), we show that the method of subspace-preserving random sampling (Dasgupta et al., 2009) can be (easily) implemented in the MapReduce framework, i.e., with map and reduce functions, in a single pass.
- Effective initialization. By using multiple subsampled solutions from the single-pass random sampling, we can construct a small initial search region for interior-point cutting-plane methods.
- Effective iterative solving. By performing in parallel multiple queries at each iteration, we develop a randomized IPCPM (interior-point cutting-plane method) for solving the convex $\ell_p$ regression program.
In addition to describing the basic algorithm, we also present empirical results from a numerical implementation of our algorithm applied to $\ell_1$ regression problems on data sets of size up to the terabyte scale.

[4] Interestingly, both our single-pass conditioning algorithm and our iterative procedure are worse in terms of FLOP counts than state-of-the-art algorithms (developed for RAM) for these problems (see Tables 1 and 2), but we prefer them since they perform better in the very large-scale distributed settings that are of interest to us here.

1.3 Prior Related Work

There is a large literature on robust regression, distributed computation, MapReduce, and randomized matrix algorithms that is beyond our scope to review. See, e.g., (Rousseeuw & Leroy, 1987), (Bertsekas & Tsitsiklis, 1991), (Dean & Ghemawat, 2004), and (Mahoney, 2011), respectively, for details. Here, we will review only the most recent related work.

Strongly over-determined $\ell_1$ regression problems were considered by (Portnoy & Koenker, 1997), who used a uniformly-subsampled solution for $\ell_1$ regression to estimate the signs of the optimal residuals in order

to reduce the problem size; their sample size is proportional to $(mn)^{2/3}$. (Clarkson, 2005) showed that, with proper conditioning, relative-error approximate solutions can be obtained from row norm-based sampling; and (Dasgupta et al., 2009) extended these subspace-preserving sampling schemes to $\ell_p$ regression, for $p \in [1, \infty)$, thereby obtaining relative-error approximations. (Sohler & Woodruff, 2011) proved that a Cauchy Transform can be used for $\ell_1$ conditioning and thus $\ell_1$ regression in $O(mn^2 \log n)$ time; this was improved to $O(mn \log n)$ time with the Fast Cauchy Transform by (Clarkson et al., 2013), who also developed an ellipsoidal rounding algorithm (see Lemma 1 below) and used it and a fast random projection to construct a fast single-pass conditioning algorithm (upon which our Algorithm 1 below is based). (Clarkson & Woodruff, 2013) and (Meng & Mahoney, 2013) show that both $\ell_2$ regression and $\ell_p$ regression can be solved in input-sparsity time via subspace-preserving sampling. The large body of work on fast randomized algorithms for $\ell_2$ regression (and related) problems has been reviewed recently (Mahoney, 2011).

To obtain a $(1+\epsilon)$-approximate solution in relative scale, the sample sizes required by all these algorithms are proportional to $1/\epsilon^2$, which limits sampling algorithms to low-precision solutions, e.g., $\epsilon \approx 10^{-2}$. By using the output of the sampling/projection step as a preconditioner for a traditional iterative method, thereby leading to an $O(\log(1/\epsilon))$ dependence, this problem has been overcome for $\ell_2$ regression (Mahoney, 2011). For $\ell_1$ regression, the $O(1/\epsilon^2)$ convergence rate of the subgradient method of (Clarkson, 2005) was improved by (Nesterov, 2009), who showed that, with a smoothing technique, the number of iterations can be reduced to $O(1/\epsilon)$.
Interestingly (and as we will return to in Section 3.3), the rarely-used ellipsoid method (see (Grötschel et al., 1981)) as well as IPCPMs (see (Mitchell, 2003)) can solve general convex problems and converge in $O(\log(1/\epsilon))$ iterations with extra $\mathrm{poly}(n)$ work per iteration.

More generally, there has been a lot of interest recently in distributed machine learning computations. For example, (Daumé et al., 2012) describes efficient protocols for distributed classification and optimization; (Balcan et al., 2012) analyzes communication complexity and privacy aspects of distributed learning; (Mackey et al., 2011) adopts a divide-and-conquer approach to matrix factorization such as CUR decompositions; and (Zhang et al., 2012) develop communication-efficient algorithms for statistical optimization. Algorithms for these and other problems can be analyzed in models for MapReduce (Karloff et al., 2010; Goodrich, 2010; Feldman et al., 2010); and work on parallel and distributed approaches to scaling up machine learning has been reviewed recently (Bekkerman et al., 2011).

2 Background and Overview

In the remainder of this paper, we use the following formulation of the $\ell_p$ regression problem:

$$\text{minimize}_{x \in \mathbb{R}^n} \ \|Ax\|_p \quad \text{subject to} \quad c^T x = 1. \tag{3}$$

This formulation of $\ell_p$ regression, which consists of a homogeneous objective and an affine constraint, can be shown to be equivalent to the formulation of (1) [5]. Denote the feasible region by $\Omega = \{x \in \mathbb{R}^n \mid c^T x = 1\}$, where recall that we are interested in the case when $m \gg n$. Let $X^*$ be the set of all optimal solutions to (3) and $x^*$ be an arbitrary optimal solution. Then, let $f(x) = \|Ax\|_p$, $f^* = \|Ax^*\|_p$, and let

$$g(x) = A^T [Ax]^{p-1} / \|Ax\|_p^{p-1} \in \partial f(x),$$

where $([Ax]^{p-1})_i = \mathrm{sign}(a_i^T x) \, |a_i^T x|^{p-1}$ and $a_i$ is the $i$-th row of $A$, $i = 1, \ldots, m$. Note that $g(x)^T x = f(x)$. For simplicity, we assume that $A$ has full column rank and $c \ne 0$. Our assumptions imply that $X^*$ is a nonempty and bounded convex set and $f^* > 0$.
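The homogeneous formulation (3) and the subgradient $g$ can be checked numerically on a tiny instance. The sketch below is illustrative only: the reduction in `homogenize` (appending $b$ as an extra column and taking $c = -e_{n+1}$, so that the constraint forces the last coordinate to $-1$) is one natural reading of the equivalence with (1) and is an assumption here, as are all function names.

```python
def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * t for a, t in zip(row, x)) for row in A]


def f(A, x, p):
    """Homogeneous objective f(x) = ||Ax||_p."""
    return sum(abs(r) ** p for r in matvec(A, x)) ** (1.0 / p)


def g(A, x, p):
    """Subgradient A^T [Ax]^{p-1} / ||Ax||_p^{p-1}; assumes Ax != 0."""
    r = matvec(A, x)
    scale = f(A, x, p) ** (p - 1)
    sign = lambda t: (t > 0) - (t < 0)
    w = [sign(t) * abs(t) ** (p - 1) / scale for t in r]
    return [sum(A[i][j] * w[i] for i in range(len(A)))
            for j in range(len(x))]


def homogenize(A, b):
    """Assumed reduction of (1) to (3): A' = [A b], c = -e_{n+1}.

    With c^T x' = 1 the last coordinate of x' equals -1, so A' x' = Ax - b.
    """
    A_new = [list(row) + [b_i] for row, b_i in zip(A, b)]
    c = [0.0] * len(A[0]) + [-1.0]
    return A_new, c
```

For any feasible point $x' = (x, -1)$, `f(A_new, x_new, p)` should agree with $\|Ax - b\|_p$, and the inner product of `g(A_new, x_new, p)` with `x_new` should reproduce `f(A_new, x_new, p)` up to rounding, matching the identity $g(x)^T x = f(x)$.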
Given $\epsilon > 0$, our goal is thus to find an $\hat{x} \in \Omega$ that is a $(1+\epsilon)$-approximate solution to (3) in relative scale, i.e., such that $f(\hat{x}) < (1+\epsilon) f^*$.

As with $\ell_2$ regression, $\ell_p$ regression problems are easier to solve when they are well-conditioned. The $\ell_p$-norm condition number of $A$, denoted $\kappa_p(A)$, is defined as

$$\kappa_p(A) = \sigma_p^{\max}(A) / \sigma_p^{\min}(A),$$

where

$$\sigma_p^{\max}(A) = \max_{\|x\|_2 \le 1} \|Ax\|_p \quad \text{and} \quad \sigma_p^{\min}(A) = \min_{\|x\|_2 \ge 1} \|Ax\|_p.$$

This implies

$$\sigma_p^{\min}(A) \, \|x\|_2 \le \|Ax\|_p \le \sigma_p^{\max}(A) \, \|x\|_2, \quad \forall x \in \mathbb{R}^n.$$

We use $\kappa_p$, $\sigma_p^{\min}$, and $\sigma_p^{\max}$ for simplicity when the underlying matrix is clear. The element-wise $\ell_p$-norm of $A$ is denoted by $|A|_p$. We use $\mathcal{E}(d, E) = \{x \in \mathbb{R}^n \mid x = d + Ez, \ \|z\|_2 \le 1\}$ to describe an ellipsoid, where $E \in \mathbb{R}^{n \times n}$ is a non-singular matrix. The volume of a full-dimensional ellipsoid $\mathcal{E}$ is denoted by $|\mathcal{E}|$.

[5] In particular, the new $A$ is $A$ concatenated with $b$, etc. Note that the same formulation is also used by (Nesterov, 2009) for solving unconstrained convex problems in relative scale as well as by (Clarkson et al., 2013).

We

use $\mathcal{S}(S, t) = \{x \in \mathbb{R}^n \mid Sx \le t\}$ to describe a polytope, where $S \in \mathbb{R}^{s \times n}$ and $t \in \mathbb{R}^s$ for some $s \ge n + 1$.

Given an $\ell_p$ regression problem, its condition number is generally unknown and can be arbitrarily large; and thus one needs to run a conditioning algorithm before randomly sampling and iteratively solving. Given any non-singular matrix $E \in \mathbb{R}^{n \times n}$, let $y^*$ be an optimal solution to the following problem:

$$\text{minimize}_{y \in \mathbb{R}^n} \ \|AEy\|_p \quad \text{subject to} \quad c^T E y = 1. \tag{4}$$

This problem is equivalent to (3), in that we have $x^* = Ey^* \in X^*$, but the condition number associated with (4) is $\kappa_p(AE)$, instead of $\kappa_p(A)$. So, the conditioning algorithm amounts to finding a non-singular matrix $E \in \mathbb{R}^{n \times n}$ such that $\kappa_p(AE)$ is small. One approach to conditioning is via ellipsoidal rounding. In this paper, we will modify the following result from (Clarkson et al., 2013) to compute a fast ellipsoidal rounding.

Lemma 1 ((Clarkson et al., 2013)). Given $A \in \mathbb{R}^{m \times n}$ with full column rank and $p \in [1, 2)$, it takes at most $O(mn^3 \log m)$ time to find a non-singular matrix $E \in \mathbb{R}^{n \times n}$ such that

$$\|y\|_2 \le \|AEy\|_p \le 2n \|y\|_2, \quad \forall y \in \mathbb{R}^n.$$

Finally, we call a piece of work online if it is executed on MapReduce, and offline otherwise. Online work deals with large-scale data stored on secondary storage, but the work can be well distributed on MapReduce; offline work deals with data stored in RAM.

3 $\ell_p$ Regression on MapReduce

In this section, we will describe our main algorithm for $\ell_p$ regression on MapReduce.

3.1 Single-pass Conditioning Algorithm

The algorithm of Lemma 1 for computing a $2n$-rounding is not immediately applicable to large-scale $\ell_p$ regression problems, since each call to the oracle requires a pass over the data [6]. We can group $n$ calls together within a single pass, but we would still need $O(n \log m)$ passes. Here, we present a deterministic single-pass conditioning algorithm that balances the cost-performance trade-off to provide a $2n^{2/p}$-conditioning of $A$. See Algorithm 1. Our main result for Algorithm 1 is given in the following lemma.

Lemma 2.
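Algorithm 1 is a $2n^{2/p}$-conditioning algorithm and it runs in $O((mn^2 + n^4) \log m)$ time.

[6] The algorithm takes a centrally-symmetric convex set described by a separation oracle that is a subgradient of $\|Ax\|_p$; see (Clarkson et al., 2013) for details.

Algorithm 1 A single-pass conditioning algorithm.
  Input: $A \in \mathbb{R}^{m \times n}$ with full column rank and $p \in [1, 2)$.
  Output: A non-singular matrix $E \in \mathbb{R}^{n \times n}$ such that $\|y\|_2 \le \|AEy\|_p \le 2n^{2/p} \|y\|_2$, $\forall y \in \mathbb{R}^n$.
  1: Partition $A$ along its rows into sub-matrices of size $n^2 \times n$, denoted by $A_1, \ldots, A_M$.
  2: For each $A_i$, compute its economy-sized singular value decomposition (SVD): $A_i = U_i \Sigma_i V_i^T$.
  3: Let $\tilde{A}_i = \Sigma_i V_i^T$ for $i = 1, \ldots, M$, let $\tilde{C} = \{x \mid (\sum_{i=1}^M \|\tilde{A}_i x\|_2^p)^{1/p} \le 1\}$, and let $\tilde{A}$ be the matrix obtained by stacking $\tilde{A}_1, \ldots, \tilde{A}_M$ vertically.
  4: Compute $\tilde{A}$'s SVD: $\tilde{A} = \tilde{U} \tilde{\Sigma} \tilde{V}^T$.
  5: Let $\mathcal{E}_0 = \mathcal{E}(0, E_0)$ where $E_0 = n^{1/p - 1/2} \tilde{V} \tilde{\Sigma}^{-1}$. $\mathcal{E}_0$ gives an $(Mn^2)^{1/p - 1/2}$-rounding of $\tilde{C}$.
  6: With the algorithm of Lemma 1, compute an ellipsoid $\mathcal{E} = \mathcal{E}(0, E)$ that gives a $2n$-rounding of $\tilde{C}$.
  7: Return $E$.

Proof. The idea is to use block-wise reduction in $\ell_2$-norm and apply fast rounding to a small problem. The tool we need is simply the equivalence of vector norms. Let $C = \{x \in \mathbb{R}^n \mid \|Ax\|_p \le 1\}$, which is convex, full-dimensional, bounded, and centrally symmetric. Adopting notation from Algorithm 1, we first have $n^{1 - 2/p} \tilde{C} \subseteq C \subseteq \tilde{C}$ because for all $x \in \mathbb{R}^n$,

$$\|Ax\|_p^p = \sum_{i=1}^M \|A_i x\|_p^p \le n^{2-p} \sum_{i=1}^M \|A_i x\|_2^p = n^{2-p} \sum_{i=1}^M \|\tilde{A}_i x\|_2^p$$

and

$$\|Ax\|_p^p = \sum_{i=1}^M \|A_i x\|_p^p \ge \sum_{i=1}^M \|A_i x\|_2^p = \sum_{i=1}^M \|\tilde{A}_i x\|_2^p.$$

Next we prove that $\mathcal{E}_0$ gives an $(Mn^2)^{1/p - 1/2}$-rounding of $\tilde{C}$. For all $x \in \mathbb{R}^n$, we have

$$\sum_{i=1}^M \|\tilde{A}_i x\|_2^p \le \sum_{i=1}^M \|\tilde{A}_i x\|_p^p = \|\tilde{A} x\|_p^p \le (Mn)^{1 - p/2} \|\tilde{A} x\|_2^p = (Mn)^{1 - p/2} \|\tilde{\Sigma} \tilde{V}^T x\|_2^p,$$

and

$$\sum_{i=1}^M \|\tilde{A}_i x\|_2^p \ge n^{p/2 - 1} \sum_{i=1}^M \|\tilde{A}_i x\|_p^p = n^{p/2 - 1} \|\tilde{A} x\|_p^p \ge n^{p/2 - 1} \|\tilde{A} x\|_2^p = n^{p/2 - 1} \|\tilde{\Sigma} \tilde{V}^T x\|_2^p.$$

Then by choosing $E_0 = n^{1/p - 1/2} \tilde{V} \tilde{\Sigma}^{-1}$, we get

$$\|E_0^{-1} x\|_2 \le \Big( \sum_{i=1}^M \|\tilde{A}_i x\|_2^p \Big)^{1/p} \le (Mn^2)^{1/p - 1/2} \|E_0^{-1} x\|_2$$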

for all $x \in \mathbb{R}^n$, and hence $\mathcal{E}_0$ gives an $(Mn^2)^{1/p - 1/2}$-rounding of $\tilde{C}$. Since $n^{1 - 2/p} \tilde{C} \subseteq C \subseteq \tilde{C}$, we know that any $2n$-rounding of $\tilde{C}$ is a $2n \cdot n^{2/p - 1} = 2n^{2/p}$-rounding of $C$. Therefore, Algorithm 1 computes a $2n^{2/p}$-conditioning of $A$. Note that the rounding procedure is applied to a problem of size $Mn \times n \approx m/n \times n$. Therefore, Algorithm 1 only needs a single pass through the data, with $O(mn^2)$ FLOPs and an offline work of $O((mn^2 + n^4) \log m)$ FLOPs.

The offline work requires $m$ RAM, which might be too much for large-scale problems. In such cases, we can increase the block size from $n^2$ to, for example, $n^3$. This gives us a $2n^{3/p - 1/2}$-conditioning algorithm that only needs $m/n$ offline RAM and $O((mn + n^4) \log m)$ offline FLOPs. The proof follows similar arguments.

Table 1. Comparison of $\ell_1$-norm conditioning algorithms on the running time and conditioning quality.

                               time                $\kappa_1$
  (Clarkson, 2005)             $O(mn^5 \log m)$    $(n(n+1))^{1/2}$
  Lemma 1                      $O(mn^3 \log m)$    $2n$
  Lemma 2 & Algorithm 1        $O(mn^2 \log m)$    $2n^2$
  (Sohler & Woodruff, 2011)    $O(mn^2 \log n)$    $O(n^{3/2} \log^{3/2} n)$
  (Clarkson et al., 2013)      $O(mn \log m)$      $O(n^{5/2} \log^{1/2} n)$
  (Clarkson et al., 2013)      $O(mn \log n)$      $O(n^{5/2} \log^{5/2} n)$
  (Meng & Mahoney, 2013)       $O(\mathrm{nnz}(A))$    $O(n^3 \log^3 n)$

See Table 1 for a comparison of the results of Algorithm 1 and Lemma 2 with prior work on $\ell_1$-norm conditioning (and note that some of these results, e.g., those of (Clarkson et al., 2013) and (Meng & Mahoney, 2013), have extensions that apply to $\ell_p$-norm conditioning). Although the Cauchy Transform (Sohler & Woodruff, 2011) and the Fast Cauchy Transform (Clarkson et al., 2013) are independent of $A$ and require little offline work, there are several concerns with using them in our application. First, the constants hidden in $\kappa_1$ are not explicitly given, and they may be too large for practical use, especially when $n$ is small. Second, although random sampling algorithms do not require $\sigma_p^{\min}$ and $\sigma_p^{\max}$ as inputs, some algorithms, e.g., IPCPMs, need accurate bounds on them.
Third, these transforms are randomized algorithms that fail with certain probability. Although we can repeat trials to make the failure rate arbitrarily small, we don't have a simple way to check whether or not any given trial succeeds. Finally, although the online work in Algorithm 1 remains $O(mn^2)$, it is embarrassingly parallel and can be well distributed on MapReduce. For large-scale strongly over-determined problems, Algorithm 1 with block size $n^3$ seems to be a good compromise in practice. This guarantees $2n^{3/p - 1/2}$-conditioning, and the $O(mn^2)$ online work can be easily distributed on MapReduce.

3.2 Single-pass Random Sampling

Here, we describe our method for implementing the subspace-sampling procedure with map and reduce functions. Suppose that after conditioning we have $\sigma_p^{\min}(A) = 1$ and $\kappa_p(A) = \mathrm{poly}(n)$. (Here, we use $A$ instead of $AE$ for simplicity.) Then the following method of (Dasgupta et al., 2009) can be used to perform subspace-preserving sampling.

Lemma 3 ((Dasgupta et al., 2009)). Given $A \in \mathbb{R}^{m \times n}$ that is $(\alpha, \beta, p)$-conditioned [7] and an error parameter $\epsilon < 1/7$, let

$$r \ge 16 (2^p + 2) (\alpha\beta)^p \left( n \log\frac{12}{\epsilon} + \log\frac{2}{\delta} \right) / (p^2 \epsilon^2),$$

and let $S \in \mathbb{R}^{m \times m}$ be a diagonal sampling matrix, with random entries

$$S_{ii} = \begin{cases} 1 / p_i^{1/p} & \text{with probability } p_i, \\ 0 & \text{otherwise,} \end{cases}$$

where the importance sampling probabilities satisfy

$$p_i \ge \min\left\{ 1, \ \frac{\|a_i\|_p^p}{|A|_p^p} \, r \right\}, \quad i = 1, \ldots, m.$$

Then, with probability at least $1 - \delta$, the following holds for all $x \in \mathbb{R}^n$:

$$(1 - \epsilon) \|Ax\|_p \le \|SAx\|_p \le (1 + \epsilon) \|Ax\|_p. \tag{5}$$

This subspace-preserving sampling lemma can be used, with the formulation (3), to obtain a relative-error approximation to the $\ell_p$ regression problem, the proof of which is immediate.

Lemma 4 ((Clarkson et al., 2013)). Let $S$ be constructed as in Lemma 3, and let $\hat{x}$ be the optimal solution to the subsampled problem:

$$\text{minimize}_{x \in \mathbb{R}^n} \ \|SAx\|_p \quad \text{subject to} \quad c^T x = 1.$$

Then with probability at least $1 - \delta$, $\hat{x}$ is a $\frac{1+\epsilon}{1-\epsilon}$-approximate solution to (3).

It is straightforward to implement this algorithm in MapReduce in a single pass. This is presented in Algorithm 2.
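To make the sampling step concrete, here is a minimal Python sketch of two pieces: the sample-size bound of Lemma 3 and a map-side row sampler in the spirit of the single-pass implementation. It is a simulation in plain Python, not Hadoop code, and several things are assumptions for illustration: the function names, the use of natural logarithms, the use of $n\kappa_p^p$ in place of $|A|_p^p$ when forming the row probability (so the probability can be smaller than what Lemma 3 asks for), and the in-memory `shuffle` that stands in for the framework's grouping by key. The reducer, which would solve the constrained subsampled problem for each key, is omitted.

```python
import math
import random
from collections import defaultdict


def lemma3_sample_size(n, p, alpha_beta, eps, delta):
    """Lower bound on r from Lemma 3 (natural logarithms assumed)."""
    if not 0.0 < eps < 1.0 / 7.0:
        raise ValueError("Lemma 3 requires 0 < eps < 1/7")
    return (16 * (2 ** p + 2) * alpha_beta ** p
            * (n * math.log(12 / eps) + math.log(2 / delta))
            / (p ** 2 * eps ** 2))


def mapper(a, r, n, kappa_p, p, N, rng):
    """Yield (k, rescaled row) for each of the N subsamples that keeps row a."""
    norm_pp = sum(abs(t) ** p for t in a)             # ||a||_p^p
    prob = min(r * norm_pp / (n * kappa_p ** p), 1.0)
    if prob == 0.0:
        return                                        # zero rows are never sampled
    scale = prob ** (1.0 / p)                         # rescale by 1 / prob^(1/p)
    for k in range(1, N + 1):
        if rng.random() < prob:
            yield k, [t / scale for t in a]


def shuffle(pairs):
    """Group emitted rows by key, as the MapReduce framework would."""
    groups = defaultdict(list)
    for k, row in pairs:
        groups[k].append(row)
    return groups
```

With $p = 1$, $n = 10$, $\alpha\beta = 1000$, $\epsilon = 10^{-3}$, and an assumed $\delta = 0.1$, `lemma3_sample_size` returns a value on the order of $10^{12}$, which shows how quickly the $1/\epsilon^2$ factor makes pure sampling impractical.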
Importantly, note that more than one subsampled solution can be obtained in a single pass. This translates to a higher precision or a lower failure rate; and, as described in Section 3.3, it can also be used to construct a better initialization.

[7] See (Clarkson et al., 2013) for the relationship between $\kappa_p(A)$ and the notion of $(\alpha, \beta, p)$-conditioning.

Several practical points are worth noting. First, $n\kappa_p^p$ is an upper bound of $|A|_p^p$, which makes the actual sample size likely to be smaller than $r$. For better control on the sample size, we can compute $|A|_p^p$ directly via one pass over $A$ prior to sampling, or we can set a

big $r$ in mappers and discard rows at random in reducers if the actual sample size is too big. Second, in practice, it is hard to accept $\epsilon$ as an input and determine the sample size $r$ based on Lemma 3. Instead, we choose $r$ directly based on our hardware capacity and running time requirements. For example, suppose we use a standard primal-dual path-following algorithm (see (Nesterov & Nemirovsky, 1994)) to solve subsampled problems. Then, since each problem needs $O(rn)$ RAM and $O(r^{3/2} n^2 \log(r/\epsilon))$ running time for a $(1+\epsilon)$-approximate solution, where $r$ is the sample size, this should dictate the choice of $r$. Similar considerations apply to the use of the ellipsoid method or IPCPMs.

Algorithm 2 A single-pass sampling algorithm.
  Input: $A \in \mathbb{R}^{m \times n}$ with $\sigma_p^{\max}(A) = \kappa_p$, $c \in \mathbb{R}^n$, a desired sample size $r$, and an integer $N$.
  Output: $N$ approximate solutions: $\hat{x}_k$, $k = 1, \ldots, N$.
  1: function mapper($a$: a row of $A$)
  2:   Let $q = \min\{ r \|a\|_p^p / (n \kappa_p^p), \ 1 \}$.
  3:   Emit $(k, a / q^{1/p})$ with probability $q$, $k = 1, \ldots, N$.
  4: end function
  5: function reducer($k$, $\{a_i\}$)
  6:   Assemble $A_k$ from $\{a_i\}$.
  7:   Compute $\hat{x}_k = \arg\min_{c^T x = 1} \|A_k x\|_p$.
  8:   Emit $(k, \hat{x}_k)$.
  9: end function

3.3 A Randomized IPCPM Algorithm

A problem with a vanilla application of the subspace-preserving random sampling algorithm is accuracy: it is very efficient if we only need one or two accurate digits (see (Clarkson et al., 2013) for details), but if we are looking for moderate-precision solutions, e.g., those with $\epsilon \approx 10^{-5}$, then we will very quickly be limited by the $O(1/\epsilon^2)$ sample size required by Lemma 3. For example, setting $p = 1$, $n = 10$, $\alpha\beta = 1000$, and $\epsilon = 10^{-3}$ in Lemma 3, we get a sample size of approximately $10^{12}$, which as a practical matter is certainly intractable for a subsampled problem. In this section, we will describe an algorithm with an $O(\log(1/\epsilon))$ dependence, which is thus appropriate for computing moderate-precision solutions. This algorithm will be a randomized IPCPM with several features specially designed for MapReduce.
In particular, the algorithm will take advantage of the multiple subsampled solutions and the parallelizability of the MapReduce framework.

As background, recall that IPCPMs are similar to the bisection method but work in a high dimensional space. An IPCPM requires a polytope $S_0$ that is known to contain a full-dimensional ball $B$ of desired solutions described by a separation oracle. At step $k$, a query point $x_k \in \mathrm{int}\, S_k$ is sent to the oracle. If the query point is not a desired solution, the oracle returns a half space $K_k$ which contains $B$ but not $x_k$, and then we set $S_{k+1} = S_k \cap K_k$ and continue. If $x_k$ is chosen such that $|S_{k+1}| / |S_k| \le \alpha$, $\forall k$, for some $\alpha < 1$, then the IPCPM converges geometrically. Such a choice of $x_k$ was first given by (Levin, 1965), who used (but did not provide a way to compute) the center of gravity of $S_k$. (Tarasov et al., 1988) proved that the center of the maximal-volume inscribed ellipsoid also works; (Vaidya, 1996) showed the volumetric center works, but he didn't give an explicit bound; and (Bertsimas & Vempala, 2004) suggest approximating the center of gravity by random walks, e.g., the hit-and-run algorithm (Lovász, 1999). Table 2 compares IPCPMs with other iterative methods on $\ell_p$ regression problems. Although they require extra work at each iteration, IPCPMs converge in the fewest number of iterations [8].

Table 2. Iterative algorithms for $\ell_p$ regression: number of iterations and extra work per iteration.

                                         num. iter.                         addl work
  subgradient (Clarkson, 2005)           $O(n^4 / \epsilon^2)$
  gradient (Nesterov, 2009)              $O(m^{1/2} \log m / \epsilon)$
  ellipsoid (Grötschel et al., 1981)     $O(n^2 \log(\kappa_p / \epsilon))$  $O(n^2)$
  IPCPMs (see text for refs.)            $O(n \log(\kappa_p / \epsilon))$    $\mathrm{poly}(n)$

For completeness, we will first describe a standard IPCPM approach to $\ell_p$ regression; and then we will describe the modifications we made to make it work in MapReduce. Assume that $\sigma_p^{\min}(A) = 1$ and $\kappa_p(A) = \mathrm{poly}(n)$. Let $\hat{f}$ always denote the best objective value we have obtained. Then for any $x \in \mathbb{R}^n$, by convexity,

$g(x)^T x^* = f(x) + g(x)^T (x^* - x) \le f^* \le \hat{f}.$
(6)

This subgradient gives us the separation oracle.

[8] It is for this reason that IPCPMs seem to be good candidates for improving subsampled solutions. Previous work assumes that data are in RAM, which means that the extra work per iteration is expensive. Since we consider large-scale distributed environments where data have to be accessed via passes, the number of iterations is the most precious resource, and thus the extra computation at each iteration is relatively inexpensive. Indeed, by using a randomized IPCPM, we will demonstrate that subsampled solutions can be improved in very few passes.

Let $x_0$ be the minimal $\ell_2$-norm point in $\Omega$, in which case

$$\|Ax_0\|_p \le \kappa_p \|x_0\|_2 \le \kappa_p \|x^*\|_2 \le \kappa_p \|Ax^*\|_p,$$

and hence $x_0$ is a $\kappa_p$-approximate solution. Moreover,

$$\|x^* - x_0\|_\infty \le \|x^*\|_2 \le \|Ax^*\|_p \le \|Ax_0\|_p, \tag{7}$$

which defines the initial polytope $S_0$. Given $\epsilon > 0$, for any $x \in B = \{x \in \Omega \mid \|x - x^*\|_2 \le \epsilon \|Ax_0\|_p / \kappa_p^2\}$,

$$\|Ax\|_p - \|Ax^*\|_p \le \|A(x - x^*)\|_p \le \kappa_p \|x - x^*\|_2 \le \epsilon \|Ax_0\|_p / \kappa_p \le \epsilon \|Ax^*\|_p.$$

So all points in $B$ are $(1+\epsilon)$-approximate solutions. The number of iterations to reach a $(1+\epsilon)$-approximation is $O(\log(|S_0| / |B|)) = O(\log((\kappa_p^2 / \epsilon)^n)) = O(n \log(n/\epsilon))$. This leads to an $O((mn^2 + \mathrm{poly}(n)) \log(n/\epsilon))$-time algorithm, which is better than sampling when $\epsilon$ is very small. Note that we will actually apply the IPCPM in a coordinate system defined on $\Omega$, where the mappings from and to the coordinate system of $\mathbb{R}^n$ are given by Householder transforms; we omit the details.

Algorithm 3 A randomized IPCPM
  Input: $A \in \mathbb{R}^{m \times n}$ with $\sigma_p^{\min}(A) \ge 1$, $c \in \mathbb{R}^n$, a set of initial points, number of iterations $M$, and $N \ge 1$.
  Output: An approximate solution $\hat{x}$.
  1: Choose $K = O(n)$.
  2: Compute $(f(x), g(x))$ for each initial point $x$.
  3: Let $\hat{f} = f(\hat{x})$ always denote the best we have.
  4: for $i = 0, \ldots, M-1$ do
  5:   Construct $S_i$ from known $(f, g)$ pairs and $\hat{f}$.
  6:   Generate random walks in $S_i$: $z_1^{(i)}, z_2^{(i)}, \ldots$
  7:   Let $x_k^{(i)} = \frac{1}{K} \sum_{j=(k-1)K+1}^{kK} z_j^{(i)}$, $k = 1, \ldots, N$.
  8:   Compute $(f(x_k^{(i)}), g(x_k^{(i)}))$ for each $k$.
  9: end for
  10: Return $\hat{x}$.

Our randomized IPCPM for use on MapReduce, which is given in Algorithm 3, differs from the standard approach just described in two aspects: sampling initialization; and multiple queries per iteration. In both cases, we take important advantage of the peculiar properties of the MapReduce framework.

For the initialization, note that constructing $S_0$ from $x_0$ may not be a good choice since we can only guarantee $\kappa_p = \mathrm{poly}(n)$. Recall, however, that we actually have $N$ subsampled solutions from Algorithm 2, and all of these solutions can be used to construct a better $S_0$. Thus, we first compute $\hat{f}_k = f(\hat{x}_k)$ and $\hat{g}_k = g(\hat{x}_k)$ for $k = 1, \ldots, N$ in a single pass. For each $\hat{x}_k$, we define a polytope containing $x^*$ using (6) and

$$\|x^* - \hat{x}_k\|_\infty \le \|A(x^* - \hat{x}_k)\|_p \le f^* + \hat{f}_k \le \hat{f} + \hat{f}_k.$$

We then merge all these polytopes to construct $S_0$, which is described by $2n + N$ constraints.
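The merging step just described can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the paper's code: it works directly in the coordinates of $\mathbb{R}^n$ rather than in the coordinate system on $\Omega$, it takes $\hat{f}$ to be the smallest of the $\hat{f}_k$, and the function name and the `(S, t)` list representation of $\{x \mid Sx \le t\}$ are hypothetical.

```python
def initial_polytope(x_hats, f_hats, g_hats):
    """Build (S, t) with S x <= t from N subsampled solutions.

    Box part:  |x_j - x_hat_k[j]| <= f_best + f_hat_k for every k and j;
               the N axis-aligned boxes intersect to a single box (2n rows).
    Cut part:  g_hat_k^T x <= f_best for every k (N rows), following (6).
    """
    n = len(x_hats[0])
    f_best = min(f_hats)  # best objective value seen so far
    lo = [max(x[j] - (f_best + fk) for x, fk in zip(x_hats, f_hats))
          for j in range(n)]
    hi = [min(x[j] + (f_best + fk) for x, fk in zip(x_hats, f_hats))
          for j in range(n)]
    S, t = [], []
    for j in range(n):
        e_j = [0.0] * n
        e_j[j] = 1.0
        S.append(e_j)                   #  x_j <= hi_j
        t.append(hi[j])
        S.append([-v for v in e_j])     # -x_j <= -lo_j
        t.append(-lo[j])
    for g_k in g_hats:
        S.append(list(g_k))             # subgradient cut
        t.append(f_best)
    return S, t
```

Because every box is axis-aligned, intersecting them only tightens the $2n$ coordinate bounds, which is why the total stays at $2n + N$ rows no matter how many subsampled solutions are merged.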
Note also that it would be hard to use all the available approximate solutions if we chose to iterate with a subgradient or gradient method.

For the iteration, the question is which query point we send at each step. Here, instead of one query, we send multiple queries. Recall that, for a data intensive job, the dominant cost is the cost of input/output, and hence we want to extract as much information as possible from each pass. Take an example of one of our runs on a 10-node Hadoop cluster: with a matrix $A$ of size [...], a pass with a single query took 282 seconds, while a pass with 100 queries only took 328 seconds, so the extra 99 queries come almost for free. To generate these multiple queries, we follow the random walk approach proposed by (Bertsimas & Vempala, 2004). The purpose of the random walk is to generate uniformly distributed points in $S_k$ such that we can estimate the center of gravity. Instead of computing one estimate, we compute multiple estimates.

We conclude our discussion of our randomized IPCPM algorithm with a few comments.

- The online work of computing $(f, g)$ pairs and the offline work of generating random walks can be done partially in parallel. Because $S_{i+1} \subseteq S_i$, we can continue generating random walks in $S_i$ while computing $(f, g)$ pairs. When we have $S_{i+1}$, we simply discard points outside $S_{i+1}$. Even if we don't have enough points left, it is very likely that we have a warm-start distribution that allows fast mixing.
- The way we choose query points works well in practice but doesn't guarantee faster convergence. How to choose query points for guaranteed faster convergence is worth further investigation. However, we do not expect that by sending $O(n)$ queries per step we can reduce the number of iterations to $O(\log(1/\epsilon))$, which may require exponentially many queries.
- Sending multiple queries makes the number of linear inequalities describing $S_k$ increase rapidly, which is a problem if we have too many iterations. But here we are just looking for, say, fewer than 30 iterations.
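As a rough sketch of how multiple query points can be produced from one random walk, the following Python code runs a basic hit-and-run walk inside a bounded polytope $\{x \mid Sx \le t\}$ and averages consecutive batches of $K$ walk points into $N$ estimates of the center of gravity. It is a textbook-style hit-and-run step written for illustration, not the authors' implementation; it assumes the polytope is bounded and the starting point is strictly interior, and the function names are hypothetical.

```python
import math
import random


def hit_and_run_step(S, t, x, rng):
    """One hit-and-run move from x inside the bounded polytope S x <= t."""
    d = [rng.gauss(0.0, 1.0) for _ in x]          # random direction
    lo, hi = -math.inf, math.inf
    for row, t_i in zip(S, t):
        sd = sum(s * v for s, v in zip(row, d))
        slack = t_i - sum(s * v for s, v in zip(row, x))
        if sd > 1e-12:                            # constraint limits forward steps
            hi = min(hi, slack / sd)
        elif sd < -1e-12:                         # constraint limits backward steps
            lo = max(lo, slack / sd)
    step = rng.uniform(lo, hi)                    # uniform point on the chord
    return [v + step * dv for v, dv in zip(x, d)]


def query_points(S, t, x0, K, N, rng):
    """Average consecutive batches of K walk points into N query points."""
    x, queries = list(x0), []
    for _ in range(N):
        batch = []
        for _ in range(K):
            x = hit_and_run_step(S, t, x, rng)
            batch.append(x)
        queries.append([sum(col) / K for col in zip(*batch)])
    return queries
```

Each query point is an average of points in the polytope and therefore lies in the polytope by convexity, so all $N$ queries are valid inputs to the separation oracle in the same pass.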
If more iterations are needed, we can purge redundant or unimportant linear constraints on the fly.

4 Empirical Evaluation

The computations are performed on a Hadoop cluster with 40 CPU cores. We used the $\ell_1$ regression test problem from (Clarkson et al., 2013). The problem is of size $5.24 \times 10^9 \times 15$, generated in the following way: The true signal $x^*$ is a standard Gaussian vector. Each row of the design matrix $A$ is a canonical

vector, which means that we only estimate a single entry of $x^*$ in each measurement. The number of measurements on the $i$-th entry of $x^*$ is twice as large as that on the $(i+1)$-th entry, $i = 1, \ldots, 14$. We have 2.62 billion measurements on the first entry while only 0.16 million measurements on the last. Imbalanced measurements apparently create difficulties for sampling-based algorithms. The response vector $b$ is given by

$$b_i = \begin{cases} 1000\,\epsilon_i & \text{with probability } 0.001, \\ a_i^T x^* + \epsilon_i & \text{otherwise,} \end{cases} \quad i = 1, \ldots, m,$$

where $\{\epsilon_i\}$ are i.i.d. samples drawn from the standard Laplace distribution. That is, 0.1% of the measurements are corrupted to simulate noisy real-world data. Since the problem is separable, we know that an optimal solution is simply given by the median of the responses corresponding to each entry. If we use $\ell_2$ regression, the optimal solution is given by the mean values, which is inaccurate due to the corrupted measurements.

We first check the accuracy of subsampled solutions. We implement Algorithm 1 with block size $n^3$ (ALG1), which gives $2n^{5/2}$-conditioning; and the Cauchy transform (CT) by (Sohler & Woodruff, 2011), which gives asymptotic $O(n^{3/2} \log^{3/2} n)$-conditioning; and then we use Algorithm 2 to compute 100 subsampled solutions in a single pass. We compute $|AE|_1$ explicitly prior to sampling for better control on the sample size. We choose $r = $ [...] in Algorithm 2. We also implement Algorithm 2 without conditioning (NOCD) and uniform sampling (UNIF) for comparison. The 1st and the 3rd quartiles of the relative errors in 1-, 2-, and $\infty$-norms are shown in Table 3.

Table 3. The 1st and the 3rd quartiles of the relative errors in 1-, 2-, and $\infty$-norms from 100 independent subsampled solutions of sample size [...].

          $\|x - x^*\|_1 / \|x^*\|_1$    $\|x - x^*\|_2 / \|x^*\|_2$    $\|x - x^*\|_\infty / \|x^*\|_\infty$
  ALG1    [0.0057, ]                     [0.0059, ]                     [0.0059, ]
  CT      [0.008, ]                      [0.0090, ]                     [0.0113, ]
  UNIF    [0.0572, ]                     [0.089, 0.166]                 [0.129, 0.254]
  NOCD    [0.0823, 22.1]                 [0.126, 70.8]                  [0.193, 134]

ALG1 clearly performs the best, achieving roughly 0.01 relative error in all the metrics we use.
CT has better asymptotic conditioning quality than ALG1 in theory, but it does not generate better solutions in this test. This confirms our concerns about the hidden constant in $\kappa_1$ and the failure probability. UNIF works, but it is about an order of magnitude worse than ALG1. NOCD generates large errors. So neither UNIF nor NOCD is a reliable approach.

Next we try to iteratively improve the subsampled solutions using Algorithm 3. We implement and compare the proposed IPCPM with a standard IPCPM based on random walks with single point initialization and single query per iteration. We set the number of iterations to 30. The running times for each of them are approximately the same. Figure 1 shows the convergence behavior in terms of relative error in objective value. IPCPMs are not monotonically decreasing algorithms. Hence we see that, even though the standard IPCPM begins with a 10-approximate solution, the error goes to $10^3$ after a few iterations and the initial guess is not improved in 30 iterations. The sampling initialization helps create a small initial search region; this makes the proposed IPCPM begin at an approximate solution, stay below that level, and reach $10^{-6}$ in only 30 iterations. Moreover, it is easy to see that the multiple-query strategy improves the rate of convergence, though still at a linear rate.

Figure 1. A standard IPCPM approach (single point initialization and single query per iteration) vs. the proposed approach (sampling initialization and multiple queries per iteration) on relative errors in function value. Horizontal axis: number of iterations; vertical axis: $(f - f^*)/f^*$; curves: standard IPCPM, proposed IPCPM.

5 Conclusion

We have proposed an algorithm for solving strongly over-determined $\ell_p$ regression problems, for $p \in [1, 2)$, with an emphasis on its theoretical and empirical properties for $p = 1$.
Although some of the building blocks of our algorithm are not better than state-of-the-art algorithms in terms of FLOP counts, we have shown that our algorithm has superior communication properties that permit it to be implemented in MapReduce and applied to terabyte-scale data to obtain a moderate-precision solution in only a few passes. The proposed method can also be extended to solving more general convex problems on MapReduce.

Acknowledgments

Most of the work was done while the first author was at ICME, Stanford University, supported by NSF DMS. The authors would like to thank Suresh Venkatasubramanian for helpful discussion and for bringing to our attention several helpful references.

References

Balcan, M.-F., Blum, A., Fine, S., and Mansour, Y. Distributed learning, communication complexity and privacy. arXiv preprint.
Bekkerman, R., Bilenko, M., and Langford, J. (eds.). Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press.
Bertsekas, D. P. and Tsitsiklis, J. N. Some aspects of parallel and distributed iterative algorithms - a survey. Automatica, 27(1):3-21.
Bertsimas, D. and Vempala, S. Solving convex programs by random walks. Journal of the ACM, 51(4).
Clarkson, K. L. Subgradient and sampling algorithms for $\ell_1$ regression. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). SIAM.
Clarkson, K. L. and Woodruff, D. P. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC).
Clarkson, K. L., Drineas, P., Magdon-Ismail, M., Mahoney, M. W., Meng, X., and Woodruff, D. P. The Fast Cauchy Transform and faster robust linear regression. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2013.
Dasgupta, A., Drineas, P., Harb, B., Kumar, R., and Mahoney, M. W. Sampling algorithms and coresets for $\ell_p$ regression. SIAM J. Comput., 38(5).
Daumé, III, H., Phillips, J. M., Saha, A., and Venkatasubramanian, S. Efficient protocols for distributed classification and optimization. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory.
Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI).
Feldman, J., Muthukrishnan, S., Sidiropoulos, A., Stein, C., and Svitkina, Z. On distributing symmetric streaming computations. ACM Transactions on Algorithms, 6(4):Article 66.
Goodrich, M. T. Simulating parallel algorithms in the MapReduce framework with applications to parallel computational geometry. arXiv preprint.
Grötschel, M., Lovász, L., and Schrijver, A. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2).
Karloff, H., Suri, S., and Vassilvitskii, S. A model of computation for MapReduce. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms, 2010.
Levin, A. Y. On an algorithm for the minimization of convex functions. In Soviet Mathematics Doklady, volume 160.
Lovász, L. Hit-and-run mixes fast. Math. Prog., 86(3).
Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS).
Mahoney, M. W. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning. NOW Publishers, Boston. Also available on arXiv.
Meng, X. and Mahoney, M. W. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC).
Mitchell, J. E. Polynomial interior point cutting plane methods. Optimization Methods and Software, 18(5).
Nesterov, Y. Unconstrained convex minimization in relative scale. Mathematics of Operations Research, 34(1).
Nesterov, Y. and Nemirovsky, A. Interior Point Polynomial Methods in Convex Programming. SIAM.
Portnoy, S. and Koenker, R. The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. Statistical Science, 12(4).
Rousseeuw, P. J. and Leroy, A. M. Robust Regression and Outlier Detection. Wiley.
Sohler, C. and Woodruff, D. P. Subspace embeddings for the $\ell_1$-norm with applications. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing (STOC). ACM, 2011.
Tarasov, S., Khachiyan, L. G., and Erlikh, I. The method of inscribed ellipsoids. In Soviet Mathematics Doklady, volume 37.
Vaidya, P. M. A new algorithm for minimizing convex functions over convex sets. Math. Prog., 73.
Zhang, Y., Duchi, J., and Wainwright, M. J. Communication-efficient algorithms for statistical optimization. In Annual Advances in Neural Information Processing Systems 26: Proceedings of the 2012 Conference.