Partitioning Data on Features or Samples in Communication-Efficient Distributed Optimization?

Chenxin Ma
Industrial and Systems Engineering
Lehigh University, USA
chm514@lehigh.edu

Martin Takáč
Industrial and Systems Engineering
Lehigh University, USA
takac.mt@gmail.com

Abstract

In this paper we study the effect of the way that the data is partitioned in distributed optimization. The original DiSCO algorithm [Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss, Yuchen Zhang and Lin Xiao, 2015] partitions the input data based on samples. We describe how the original algorithm has to be modified to allow partitioning on features, and show its efficiency both in theory and in practice.

1 Introduction

As the size of datasets becomes larger and larger, distributed optimization methods for machine learning have become increasingly important [2, 5, 13]. Existing methods often require a large amount of communication between computing nodes [7, 17, 9, 18], which is typically several magnitudes slower than reading data from their own memory [10]. Thus, distributed machine learning suffers from the communication bottleneck in real world applications.

In this paper we focus on the regularized empirical risk minimization problem. Suppose we have n data samples {x_i, y_i}_{i=1}^n, where each x_i ∈ R^d (i.e. we have d features) and y_i ∈ R. We denote X := [x_1, ..., x_n] ∈ R^{d×n}. The optimization problem is to minimize the regularized empirical loss (ERM)

    f(w) := (1/n) Σ_{i=1}^n φ_i(w, x_i) + (λ/2) ‖w‖₂²,    (1)

where the first part is the data fitting term and φ_i : R^d × R^d → R is a loss function which typically depends on y_i. Some popular loss functions include the hinge loss φ_i(w, x_i) = max{0, 1 − y_i w^T x_i}, the square loss φ_i(w, x_i) = (y_i − w^T x_i)², and the logistic loss φ_i(w, x_i) = log(1 + exp(−y_i w^T x_i)). The second part of the objective function (1) is an ℓ₂ regularizer (λ > 0) which helps to prevent over-fitting of the data. We assume that the loss function φ_i is convex and self-concordant [19]:

Assumption 1.
For all i ∈ [n] := {1, 2, ..., n} the convex function φ_i is self-concordant with parameter M, i.e. the following inequality holds:

    |u^T (f'''(w)[u]) u| ≤ M (u^T f''(w) u)^{3/2}    (2)

for any u ∈ R^d and w ∈ dom(f), where f'''(w)[u] := lim_{t→0} (1/t) (f''(w + tu) − f''(w)).

There has been an enormous interest in large-scale machine learning problems, and many parallel [4, 11] or distributed algorithms have been proposed [1, 6, 12, 14, 18]. The main bottleneck in distributed computing, communication, has been handled by many researchers differently. Some work considered ADMM type methods [3, 6]; other work used block-coordinate type algorithms [8, 7, 17, 9], in which the local sub-problems are solved more accurately (which should decrease the overall communication requirements when compared with more basic approaches [15, 16]).
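To make the objective (1) concrete, here is a minimal Python/NumPy sketch for the square loss, with X stored one column per sample as above. This is our own illustration, not code from the paper; the names `erm_objective`, `erm_gradient` and `lam` are ours.

```python
import numpy as np

def erm_objective(w, X, y, lam):
    """f(w) = (1/n) * sum_i (y_i - w^T x_i)^2 + (lam/2) * ||w||^2.
    X has shape (d, n): rows are features, columns are samples."""
    n = X.shape[1]
    residual = y - X.T @ w                    # r_i = y_i - w^T x_i
    return residual @ residual / n + 0.5 * lam * (w @ w)

def erm_gradient(w, X, y, lam):
    """f'(w) = -(2/n) * X (y - X^T w) + lam * w."""
    n = X.shape[1]
    return -2.0 / n * X @ (y - X.T @ w) + lam * w
```

Note that for the square loss the Hessian f''(w) = (2/n) X X^T + λI does not depend on w, a property the paper uses later when deriving the distributed algorithms.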
Algorithm 1 High-level DiSCO algorithm
1: Input: parameters ρ, µ ≥ 0, number of iterations K
2: Initializing w_0.
3: for k = 0, 1, 2, ..., K do
4:   Option 1: Given w_k, run DiSCO-S PCG (Algorithm 2), get v_k and δ_k
5:   Option 2: Given w_k, run DiSCO-F PCG (Algorithm 3), get v_k and δ_k
6:   Update w_{k+1} = w_k − v_k / (1 + δ_k)
7: end for
8: Output: w_{K+1}

2 Algorithm

We assume that we have m machines (computing nodes) available which can communicate with each other over the network. We assume that the space needed to store the data matrix X exceeds the memory of every single node, so we have to split the data (matrix X) over the m nodes. The natural question is: how should we split the data into m parts? There are many possible ways, but two obvious ones:

1. split the data matrix X by rows (i.e. create blocks of rows); because rows of X correspond to features, we denote the algorithm using this type of partitioning by DiSCO-F;
2. split the data matrix X by columns; because columns of X correspond to samples, we denote the algorithm using this type of partitioning by DiSCO-S.

Notice that DiSCO-S is exactly the same as the DiSCO algorithm proposed and analyzed in [19]. In each iteration of Algorithm 1, we need to compute an inexact Newton step v_k such that ‖f''(w_k) v_k − f'(w_k)‖₂ ≤ ε_k, i.e. an approximate solution of the Newton system f''(w_k) v_k = f'(w_k). The discussion of how to choose ε_k and K, together with convergence guarantees for Algorithm 1, can be found in [19]. The main goal of this work is to analyze the algorithmic modifications to DiSCO-S when the partitioning type is changed. It turns out that partitioning on features (DiSCO-F) can lead to an algorithm which uses less communication (depending on the relation between d and n); see Section 3.

DiSCO-S Algorithm. If the dataset is partitioned by samples, so that the j-th node stores only X_j = [x_{j,1}, ..., x_{j,n_j}] ∈ R^{d×n_j}, a part of X, then each machine can evaluate a local empirical loss function

    f_j(w) := (1/n_j) Σ_{i=1}^{n_j} φ(w, x_{j,i}) + (λ/2) ‖w‖₂².    (3)
Algorithm 2 Distributed DiSCO-S: PCG algorithm (data partitioned by samples)
1: Input: w_k ∈ R^d, and µ ≥ 0.
   communication: Broadcast w_k ∈ R^d and ReduceAll f'_i(w_k) ∈ R^d
2: Initialization: Let P be computed as (4). v_0 = 0, r_0 = f'(w_k), s_0 = P^{-1} r_0, u_0 = s_0.
3: for t = 0, 1, 2, ... do
4:   Compute H u_t.
     communication: Broadcast u_t ∈ R^d and ReduceAll f''_i(w_k) u_t ∈ R^d
5:   Compute α_t = ⟨r_t, s_t⟩ / ⟨u_t, H u_t⟩.
6:   Update v_{t+1} = v_t + α_t u_t, H v_{t+1} = H v_t + α_t H u_t, r_{t+1} = r_t − α_t H u_t.
7:   Update s_{t+1} = P^{-1} r_{t+1}.
8:   Compute β_t = ⟨r_{t+1}, s_{t+1}⟩ / ⟨r_t, s_t⟩.
9:   Update u_{t+1} = s_{t+1} + β_t u_t.
10:  until: ‖r_{t+1}‖₂ ≤ ε_k
11: end for
12: Return: v_k = v_{t+1}, δ_k = √(v_{t+1}^T H v_t + α_t v_{t+1}^T H u_t).
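Stripped of the distribution, the loop of Algorithm 2 is a standard preconditioned conjugate gradient iteration. The following serial Python/NumPy sketch is our own illustration: the Broadcast/ReduceAll assembly of H u_t is collapsed into a plain matrix product, the preconditioner solve is done with a direct dense solve, and the names `pcg` and `max_iter` are ours.

```python
import numpy as np

def pcg(H, g, P, eps, max_iter=200):
    """Solve H v = g by preconditioned conjugate gradient (serial sketch of
    Algorithm 2). H must be symmetric positive definite; P is the
    preconditioning matrix, applied via a direct solve P s = r."""
    v = np.zeros_like(g)
    r = g.copy()                      # r_0 = f'(w_k)
    s = np.linalg.solve(P, r)         # s_0 = P^{-1} r_0
    u = s.copy()
    for _ in range(max_iter):
        Hu = H @ u                    # distributed: Broadcast u, ReduceAll H u
        alpha = (r @ s) / (u @ Hu)
        v = v + alpha * u
        r_new = r - alpha * Hu
        if np.linalg.norm(r_new) <= eps:
            break
        s_new = np.linalg.solve(P, r_new)
        beta = (r_new @ s_new) / (r @ s)
        u = s_new + beta * u
        r, s = r_new, s_new
    return v
```

In the distributed algorithm only the master runs these vector operations; the workers contribute their pieces of H u_t through the collective steps marked in the listing above.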
Because {X_j} is a partition of X we have Σ_{j=1}^m n_j = n, and our goal now becomes to minimize the function f(w) = (1/m) Σ_{j=1}^m f_j(w). Let H denote the Hessian f''(w_k). For simplicity, in this paper we consider only the square loss, and hence f''(w_k) is constant (independent of w_k). In Algorithm 2, each machine uses its local data to compute the local gradient and the local Hessian, which are then aggregated together. We also have to choose one machine as the master, which computes all the vector operations of the PCG (Preconditioned Conjugate Gradient) loop, i.e., steps 5–9 in Algorithm 2. The preconditioning matrix for PCG is defined only on the master node and consists of the local Hessian approximated from a subset of the data available on the master node of size τ, i.e.

    P = (1/τ) Σ_{j=1}^τ φ''(w, x_{1,j}) + µI,    (4)

where µ is a small regularization parameter. Algorithm 2 presents the distributed PCG method for solving the preconditioned linear system

    P^{-1} H v_k = P^{-1} f'(w_k).    (5)

DiSCO-F Algorithm. If the dataset is partitioned by features, then the j-th machine stores X_j = [a_1^{[j]}, ..., a_n^{[j]}] ∈ R^{d_j×n}, which contains all the samples, but only a subset of the features. Also, each machine only stores w^{[j]} ∈ R^{d_j} and is thus only responsible for the computation and updates of R^{d_j} vectors. By doing so, we only need one ReduceAll on a vector of length n, in addition to two ReduceAll operations on scalars.

Algorithm 3 Distributed DiSCO-F: PCG algorithm (data partitioned by features)
1: Input: w_k^{[i]} ∈ R^{d_i} for i = 1, 2, ..., m, and µ ≥ 0.
2: Initialization: Let P be computed as (4). v_0^{[i]} = 0, r_0^{[i]} = f'(w_k)^{[i]}, s_0^{[i]} = (P^{-1})^{[i]} r_0^{[i]}, u_0^{[i]} = s_0^{[i]}.
3: while ‖r_{t+1}‖₂ > ε_k do
4:   Compute (H u_t)^{[i]}.
     communication: ReduceAll an R^n vector
5:   Compute α_t = Σ_{i=1}^m ⟨r_t^{[i]}, s_t^{[i]}⟩ / Σ_{i=1}^m ⟨u_t^{[i]}, (H u_t)^{[i]}⟩.
     communication: ReduceAll a number
6:   Update v_{t+1}^{[i]} = v_t^{[i]} + α_t u_t^{[i]}, (H v_{t+1})^{[i]} = (H v_t)^{[i]} + α_t (H u_t)^{[i]}, r_{t+1}^{[i]} = r_t^{[i]} − α_t (H u_t)^{[i]}.
7:   Update s_{t+1}^{[i]} = (P^{-1})^{[i]} r_{t+1}^{[i]}.
8:   Compute β_t = Σ_{i=1}^m ⟨r_{t+1}^{[i]}, s_{t+1}^{[i]}⟩ / Σ_{i=1}^m ⟨r_t^{[i]}, s_t^{[i]}⟩.
     communication: ReduceAll a number
9:   Update u_{t+1}^{[i]} = s_{t+1}^{[i]} + β_t u_t^{[i]}.
10: end while
11: Compute δ_k^{[i]} = (v_{t+1}^{[i]})^T (H v_t)^{[i]} + α_t (v_{t+1}^{[i]})^T (H u_t)^{[i]}.
12: Integration: v_k = [v_{t+1}^{[1]}, ..., v_{t+1}^{[m]}], δ_k = √(Σ_{i=1}^m δ_k^{[i]}).
     communication: Reduce an R^{d_i} vector
13: Return: v_k, δ_k.

Comparison of Communication and Computational Cost. In Table 1 we compare the communication cost of the two approaches DiSCO-S/DiSCO-F. As is obvious from the table, DiSCO-F requires only one ReduceAll of a vector of length n, whereas DiSCO-S needs one ReduceAll of a vector of length d and one Broadcast of a vector of size d. So, roughly speaking, when n < d, DiSCO-F needs less communication. Very interestingly, however, another advantage of DiSCO-F is that it uses the CPU of every node more effectively. It also requires less total work to be performed on each node, leading to a more balanced and efficient utilization of the nodes.

Table 1: Comparison of computation and communication between different ways of partitioning the data.

                                                        partition by samples      partition by features
  master      matrix-vector multiplication              1 (R^{d×d} · R^d)         1 (R^{d×d} · R^d)
  computation back solving linear system                1 (R^d)                   1 (R^d)
              sum of vectors                            4 (R^d)                   4 (R^d)
              inner product of vectors                  4 (R^d)                   4 (R^d)
  nodes       matrix-vector multiplication              1 (R^{d×d} · R^d)         1 (R^{d×d_i} · R^{d_i})
  computation back solving linear system                0                         1 (R^{d_i})
              sum of vectors                            0                         4 (R^{d_i})
              inner product of vectors                  0                         4 (R^{d_i})
  communication Broadcast                               one R^d vector            0
              ReduceAll                                 one R^d vector            one R^n vector, 2 R

3 Numerical Experiments

We present experiments on several standard large real-world datasets: news20.binary (d = 1,355,191; n = 19,996; 0.13GB); kdd2010 (test) (d = 29,890,095; n = 748,401; 0.19GB); and epsilon (d = 2,000; n = 100,000; 3.04GB). Each dataset was split across m machines. We implemented the DiSCO-S, DiSCO-F and CoCoA+ [9] algorithms in C++ for comparison, and ran them on the Amazon cloud, using 4 m3.xlarge EC2 instances.

[Figure 1: Comparison of DiSCO-S, DiSCO-F and CoCoA+ on various datasets; plots of f(w) versus elapsed time omitted.]

Figure 1 compares the evolution of f(w) as a function of elapsed time, number of communications, and number of iterations. As can be observed, DiSCO-F needs almost the same number of iterations as DiSCO-S; however, it needs roughly just half the communication, and it is therefore much faster (if we care about elapsed time).
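For the square loss, where f''(w) = (2/n) X X^T + λI, the key step of Algorithm 3, computing the blocks (H u_t)^{[i]} with a single length-n ReduceAll, can be sketched as follows. This is a serial simulation under our own names (the paper gives no code), with the ReduceAll replaced by an in-memory sum.

```python
import numpy as np

def hessian_times_u_by_features(X_blocks, u_blocks, lam, n):
    """Feature-partitioned Hessian-vector product for the square-loss
    Hessian H = (2/n) X X^T + lam * I. Node j holds the row block
    X_j (d_j x n) and its slice u^[j] (d_j).
      local:     z_j = X_j^T u^[j]   in R^n
      ReduceAll: z = sum_j z_j       (the single length-n collective)
      local:     (H u)^[j] = (2/n) X_j z + lam * u^[j]
    """
    z = sum(Xj.T @ uj for Xj, uj in zip(X_blocks, u_blocks))  # ReduceAll
    return [2.0 / n * Xj @ z + lam * uj
            for Xj, uj in zip(X_blocks, u_blocks)]
```

Only the length-n vector z crosses the network; the products X_j z and all subsequent PCG updates stay local to each node, which matches the one R^n ReduceAll plus two scalars per iteration listed in Table 1.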
References

[1] Alekh Agarwal and John C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
[2] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., 1989.
[3] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[4] Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for l1-regularized loss minimization. arXiv preprint arXiv:1105.5379, 2011.
[5] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 13(1):165–202, 2012.
[6] Wei Deng and Wotao Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, pages 1–28, 2012.
[7] Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 3068–3076, 2014.
[8] Ching-Pei Lee and Dan Roth. Distributed box-constrained quadratic optimization for dual linear SVM. ICML, 2015.
[9] Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan, Peter Richtárik, and Martin Takáč. Adding vs. averaging in distributed primal-dual optimization. In ICML 2015 – Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1973–1982. JMLR, 2015.
[10] Jakub Mareček, Peter Richtárik, and Martin Takáč. Distributed block coordinate descent for minimizing partially separable functions. Numerical Analysis and Optimization 2014, Springer Proceedings in Mathematics and Statistics, 2014.
[11] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
[12] Peter Richtárik and Martin Takáč. Distributed coordinate descent method for learning with big data. arXiv preprint arXiv:1310.2059, 2013.
[13] Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning. In Communication, Control, and Computing (Allerton), 2014 52nd Annual Allerton Conference on, pages 850–857. IEEE, 2014.
[14] Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication efficient distributed optimization using an approximate Newton-type method. arXiv preprint arXiv:1312.7853, 2013.
[15] Martin Takáč, Avleen Bijral, Peter Richtárik, and Nathan Srebro. Mini-batch primal and dual methods for SVMs. ICML, 2013.
[16] Martin Takáč, Peter Richtárik, and Nathan Srebro. Distributed mini-batch SDCA. arXiv preprint arXiv:1507.08322, 2015.
[17] Tianbao Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 629–637, 2013.
[18] Tianbao Yang, Shenghuo Zhu, Rong Jin, and Yuanqing Lin. Analysis of distributed stochastic dual coordinate ascent. arXiv preprint arXiv:1312.1031, 2013.
[19] Yuchen Zhang and Lin Xiao. Communication-efficient distributed optimization of self-concordant empirical loss. arXiv preprint arXiv:1501.00263, 2015.