Partitioning Data on Features or Samples in Communication-Efficient Distributed Optimization?



Chenxin Ma
Industrial and Systems Engineering, Lehigh University, USA
chm514@lehigh.edu

Martin Takáč
Industrial and Systems Engineering, Lehigh University, USA
takac.mt@gmail.com

Abstract

In this paper we study the effect of the way that the data is partitioned in distributed optimization. The original DiSCO algorithm [Communication-Efficient Distributed Optimization of Self-Concordant Empirical Loss, Yuchen Zhang and Lin Xiao, 2015] partitions the input data based on samples. We describe how the original algorithm has to be modified to allow partitioning on features, and show its efficiency both in theory and in practice.

1 Introduction

As the size of datasets becomes larger and larger, distributed optimization methods for machine learning have become increasingly important [2, 5, 3]. Existing methods often require a large amount of communication between computing nodes [7, 17, 9, 18], which is typically several orders of magnitude slower than reading data from their own memory [10]. Thus, distributed machine learning suffers from a communication bottleneck in real-world applications.

In this paper we focus on the regularized empirical risk minimization problem. Suppose we have n data samples {x_i, y_i}_{i=1}^n, where each x_i ∈ R^d (i.e. we have d features) and y_i ∈ R. We denote X := [x_1, ..., x_n] ∈ R^{d×n}. The optimization problem is to minimize the regularized empirical loss (ERM)

    f(w) := (1/n) Σ_{i=1}^n φ_i(w, x_i) + (λ/2) ||w||_2^2,    (1)

where the first part is the data-fitting term and φ : R^d × R^d → R is a loss function which typically depends on y_i. Some popular loss functions include the hinge loss φ_i(w, x_i) = max{0, 1 - y_i w^T x_i}, the square loss φ_i(w, x_i) = (y_i - w^T x_i)^2, and the logistic loss φ_i(w, x_i) = log(1 + exp(-y_i w^T x_i)). The second part of the objective function (1) is an l2 regularizer (λ > 0) which helps to prevent over-fitting of the data. We assume that the loss function φ_i is convex and self-concordant [19]:

Assumption 1. For all i ∈ [n] := {1, 2, ..., n} the convex function φ is self-concordant with parameter M, i.e. the following inequality holds:

    |u^T (f'''(w)[u]) u| ≤ M (u^T f''(w) u)^{3/2}    (2)

for any u ∈ R^d and w ∈ dom(f), where f'''(w)[u] := lim_{t→0} (1/t) (f''(w + tu) - f''(w)).

There has been enormous interest in large-scale machine learning problems, and many parallel [4, 11] or distributed algorithms have been proposed [1, 6, 12, 14, 18]. The main bottleneck in distributed computing, communication, has been handled by researchers in different ways. Some works considered ADMM-type methods [3, 6], others used block-coordinate-type algorithms [8, 7, 17, 9], which solve the local subproblems more accurately (this should decrease the overall communication requirements when compared with more basic approaches [15, 16]).
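As an illustration (not part of the original paper), the following minimal NumPy sketch evaluates the objective (1) and its gradient for the square loss; the function and variable names are ours, and X is stored with one column per sample, as above.

    import numpy as np

    def erm_square_loss(w, X, y, lam):
        # Illustrative sketch of objective (1) with the square loss; not code from the paper.
        # X has shape (d, n): rows are features, columns are samples.
        n = X.shape[1]
        residual = X.T @ w - y                  # residual_i = w^T x_i - y_i
        loss = np.mean(residual ** 2)           # (1/n) sum_i (y_i - w^T x_i)^2
        reg = 0.5 * lam * w.dot(w)              # (lambda/2) ||w||_2^2
        grad = (2.0 / n) * (X @ residual) + lam * w
        return loss + reg, grad

For the hinge or logistic loss only the data-fitting term and its derivative change; the l2 regularizer is unaffected.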

Algorithm 1  High-level DiSCO algorithm
1: Input: parameters ρ, µ ≥ 0, number of iterations K
2: Initialize w_0.
3: for k = 0, 1, 2, ..., K do
4:    Option 1: Given w_k, run the DiSCO-S PCG Algorithm 2 to get v_k and δ_k
5:    Option 2: Given w_k, run the DiSCO-F PCG Algorithm 3 to get v_k and δ_k
6:    Update w_{k+1} = w_k - (1/(1+δ_k)) v_k
7: end for
8: Output: w_{K+1}

2 Algorithm

We assume that we have m machines (computing nodes) available which can communicate with each other over the network. We assume that the space needed to store the data matrix X exceeds the memory of every single node. Thus we have to split the data (matrix X) across the m nodes. The natural question is: how should the data be split into m parts? There are many possible ways, but two obvious ones:

1. Split the data matrix X by rows (i.e. create blocks of rows). Because rows of X correspond to features, we denote the algorithm which uses this type of partitioning by DiSCO-F.
2. Split the data matrix X by columns. Because columns of X correspond to samples, we denote the algorithm which uses this type of partitioning by DiSCO-S.

Notice that DiSCO-S is exactly the same as DiSCO proposed and analyzed in [19]. In each iteration k of Algorithm 1, we need to compute an inexact Newton step v_k such that ||f''(w_k) v_k - f'(w_k)||_2 ≤ ε_k, which is an approximate solution to the Newton system f''(w_k) v_k = f'(w_k). The discussion of how to choose ε_k and K, together with convergence guarantees for Algorithm 1, can be found in [19]. The main goal of this work is to analyze the algorithmic modifications to DiSCO-S when the partitioning type is changed. It will turn out that partitioning on features (DiSCO-F) can lead to an algorithm which uses less communication, depending on the relation between d and n (see Section 3).

DiSCO-S Algorithm. If the dataset is partitioned by samples, such that the j-th node stores only X_j = [x_{j,1}, ..., x_{j,n_j}] ∈ R^{d×n_j}, which is a part of X, then each machine can evaluate a local empirical loss function

    f_j(w) := (1/n_j) Σ_{i=1}^{n_j} φ(w, x_{j,i}) + (λ/2) ||w||_2^2.    (3)

Algorithm 2  Distributed DiSCO-S: PCG algorithm, data partitioned by samples
1: Input: w_k ∈ R^d, and µ ≥ 0.
      communication: Broadcast w_k ∈ R^d and ReduceAll f'_i(w_k) ∈ R^d
2: Initialization: Let P be computed as (4). v_0 = 0, s_0 = P^{-1} r_0, r_0 = f'(w_k), u_0 = s_0.
3: for t = 0, 1, 2, ... do
4:    Compute H u_t
      communication: Broadcast u_t ∈ R^d and ReduceAll f''_i(w_k) u_t ∈ R^d
5:    Compute α_t = <r_t, s_t> / <u_t, H u_t>
6:    Update v_{t+1} = v_t + α_t u_t, H v_{t+1} = H v_t + α_t H u_t, r_{t+1} = r_t - α_t H u_t.
7:    Update s_{t+1} = P^{-1} r_{t+1}.
8:    Compute β_t = <r_{t+1}, s_{t+1}> / <r_t, s_t>
9:    Update u_{t+1} = s_{t+1} + β_t u_t.
10:   until ||r_{t+1}||_2 ≤ ε_k
11: end for
12: Return: v_k = v_{t+1}, δ_k = sqrt( v_{t+1}^T H v_t + α_t v_{t+1}^T H u_t )
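As an illustrative aside (not from the paper), the following single-process NumPy sketch mirrors the PCG recursion of Algorithm 2 with the Broadcast/ReduceAll steps omitted; H and P are passed as dense matrices, and the function name, the max_iter cap and the square-root form of δ_k are our assumptions.

    import numpy as np

    def pcg_newton_step(H, P, grad, eps, max_iter=100):
        # Sketch of steps 2-12 of Algorithm 2 on a single machine (no communication shown).
        # Approximately solves H v = grad with preconditioner P, returning v and delta.
        d = grad.shape[0]
        v, Hv = np.zeros(d), np.zeros(d)
        r = grad.copy()
        s = np.linalg.solve(P, r)              # s_0 = P^{-1} r_0
        u = s.copy()
        for _ in range(max_iter):
            Hu = H @ u
            alpha = r.dot(s) / u.dot(Hu)
            v, Hv = v + alpha * u, Hv + alpha * Hu
            r_new = r - alpha * Hu
            if np.linalg.norm(r_new) <= eps:   # stopping test of step 10
                break
            s_new = np.linalg.solve(P, r_new)  # preconditioning, step 7
            beta = r_new.dot(s_new) / r.dot(s)
            u = s_new + beta * u
            r, s = r_new, s_new
        delta = np.sqrt(v.dot(Hv))             # assumed Newton-decrement estimate used in step 6 of Algorithm 1
        return v, delta

The returned pair would then drive the update w_{k+1} = w_k - v/(1 + delta) in Algorithm 1.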

Because {X_j} is a partition of X we have Σ_{j=1}^m n_j = n, and our goal now becomes to minimize the function f(w) = (1/m) Σ_{j=1}^m f_j(w). Let H denote the Hessian f''(w_k). For simplicity, in this paper we consider only the square loss, so f''(w_k) is constant (independent of w_k). In Algorithm 2, each machine uses its local data to compute the local gradient and local Hessian, which are then aggregated. We also have to choose one machine as the master, which performs all the vector operations of the PCG (Preconditioned Conjugate Gradient) loop, i.e. steps 5-9 in Algorithm 2. The preconditioning matrix for PCG is defined only on the master node and consists of the local Hessian approximated from a subset of the data available on the master node, of size τ, i.e.

    P = (1/τ) Σ_{j=1}^τ φ''(w, x_{1,j}) + µ I,    (4)

where µ is a small regularization parameter. Algorithm 2 presents the distributed PCG method for solving the preconditioned linear system

    P^{-1} H v_k = P^{-1} f'(w_k).    (5)

DiSCO-F Algorithm. If the dataset is partitioned by features, then the j-th machine stores X_j = [a_1^{[j]}, ..., a_n^{[j]}] ∈ R^{d_j×n}, which contains all the samples, but only a subset of the features. Also, each machine stores only w^{[j]} ∈ R^{d_j} and is thus responsible only for computing and updating vectors in R^{d_j}. By doing so, we only need one ReduceAll on a vector of length n, in addition to two ReduceAll operations on scalars.

Algorithm 3  Distributed DiSCO-F: PCG algorithm, data partitioned by features
1: Input: w_k^{[i]} ∈ R^{d_i} for i = 1, 2, ..., m, and µ ≥ 0.
2: Initialization: Let P be computed as (4). v_0^{[i]} = 0, s_0^{[i]} = (P^{-1})^{[i]} r_0^{[i]}, r_0^{[i]} = [f'(w_k)]^{[i]}, u_0^{[i]} = s_0^{[i]}.
3: while ||r_{t+1}||_2 ≥ ε_k do
4:    Compute (H u_t)^{[i]}.
      communication: ReduceAll an R^{d_i} vector
5:    Compute α_t = Σ_{i=1}^m <r_t^{[i]}, s_t^{[i]}> / Σ_{i=1}^m <u_t^{[i]}, (H u_t)^{[i]}>
      communication: ReduceAll a number
6:    Update v_{t+1}^{[i]} = v_t^{[i]} + α_t u_t^{[i]}, (H v_{t+1})^{[i]} = (H v_t)^{[i]} + α_t (H u_t)^{[i]}, r_{t+1}^{[i]} = r_t^{[i]} - α_t (H u_t)^{[i]}.
7:    Update s_{t+1}^{[i]} = (P^{-1})^{[i]} r_{t+1}^{[i]}.
8:    Compute β_t = Σ_{i=1}^m <r_{t+1}^{[i]}, s_{t+1}^{[i]}> / Σ_{i=1}^m <r_t^{[i]}, s_t^{[i]}>
      communication: ReduceAll a number
9:    Update u_{t+1}^{[i]} = s_{t+1}^{[i]} + β_t u_t^{[i]}.
10: end while
11: Compute δ_{t+1}^{[i]} = (v_{t+1}^{[i]})^T (H v_t)^{[i]} + α_t (v_{t+1}^{[i]})^T (H u_t)^{[i]}.
12: Integration: v_k = [v_{t+1}^{[1]}, ..., v_{t+1}^{[m]}], δ_k = [δ_{t+1}^{[1]}, ..., δ_{t+1}^{[m]}]
      communication: Reduce the R^{d_i} vectors
13: Return: v_k, δ_k.

Comparison of Communication and Computational Cost. In Table 1 we compare the communication cost of the two approaches, DiSCO-S and DiSCO-F. As is obvious from the table, DiSCO-F requires only one ReduceAll of a vector of length n, whereas DiSCO-S needs one ReduceAll of a vector of length d and one Broadcast of a vector of length d. So, roughly speaking, when n < d, DiSCO-F will need less communication. Very interestingly, an additional advantage of DiSCO-F is that it uses the CPU on every node more effectively. It also requires less total work to be performed on each node, leading to a more balanced and efficient utilization of the nodes.

Table 1: Comparison of computation and communication between different ways of partitioning the data.

                                              partition by samples      partition by features
  computation, master:
    matrix-vector multiplication              1 (R^{d×d} · R^d)         1 (R^{d_1×d_1} · R^{d_1})
    back-solving the linear system            1 (R^d)                   1 (R^{d_1})
    sum of vectors                            4 (R^d)                   4 (R^{d_1})
    inner product of vectors                  4 (R^d)                   4 (R^{d_1})
  computation, nodes:
    matrix-vector multiplication              1 (R^{d×d} · R^d)         1 (R^{d×d_i} · R^{d_i})
    back-solving the linear system            0                         1 (R^{d_i})
    sum of vectors                            0                         4 (R^{d_i})
    inner product of vectors                  0                         4 (R^{d_i})
  communication:
    Broadcast                                 one R^d vector            0
    ReduceAll                                 one R^d vector            one R^n vector, 2 R^1
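As an illustration (not from the paper) of why feature partitioning needs only a single length-n ReduceAll per PCG iteration, the sketch below forms the Hessian-vector product for the square loss, for which H = (2/n) X X^T + λ I under objective (1); the m nodes are simulated inside one process and all names are ours.

    import numpy as np

    def hessian_times_u_feature_partition(X_blocks, u_blocks, lam):
        # Illustrative sketch: X_blocks[j] has shape (d_j, n), i.e. node j holds a block
        # of the features for all n samples; u_blocks[j] is the matching block of u.
        n = X_blocks[0].shape[1]
        # Each node computes its local contribution X_j^T u^{[j]} (a length-n vector) ...
        local = [Xj.T @ uj for Xj, uj in zip(X_blocks, u_blocks)]
        # ... and one ReduceAll sums them; this is the only length-n communication per iteration.
        z = np.sum(local, axis=0)               # z = X^T u in R^n
        # Each node then finishes its own block of H u locally, with no further vector communication.
        return [(2.0 / n) * (Xj @ z) + lam * uj for Xj, uj in zip(X_blocks, u_blocks)]

By contrast, the sample-partitioned variant must broadcast the full u ∈ R^d and ReduceAll a length-d result, which is more expensive whenever n < d.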

3 Numerical Experiments

We present experiments on several standard large real-world datasets: news20.binary (d = 1,355,191; n = 19,996; 0.13 GB); kdd2010 (test) (d = 29,890,095; n = 748,401; 0.9 GB); and epsilon (d = 2,000; n = 100,000; 3.04 GB). Each dataset was split across the m machines. We implement DiSCO-S, DiSCO-F and the algorithm of [9] in C++ for comparison, and run them on the Amazon cloud, using 4 m3.xlarge EC2 instances.

[Figure 1: Comparison of DiSCO-S, DiSCO-F and the algorithm of [9] on various datasets; the panels plot the objective value against elapsed time.]

Figure 1 compares the evolution of f(w) as a function of elapsed time, number of communications, and number of iterations. As can be observed, DiSCO-F needs almost the same number of iterations as DiSCO-S; however, it needs roughly just half the communication, and it is therefore much faster in terms of elapsed time.
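As a final illustrative note (not from the paper), the two partitioning schemes used in these experiments can be mimicked in a few lines; np.array_split and the function names below are purely for demonstration.

    import numpy as np

    def split_by_features(X, m):
        # DiSCO-F style: each of the m nodes gets a block of rows (features) for all n samples.
        return np.array_split(X, m, axis=0)

    def split_by_samples(X, m):
        # DiSCO-S style: each of the m nodes gets a block of columns (samples) with all d features.
        return np.array_split(X, m, axis=1)

    # Example: X = np.random.randn(8, 20); feature_blocks = split_by_features(X, 4)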

References

[1] Alekh Agarwal and John C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873-881, 2011.
[2] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., 1989.
[3] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
[4] Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for L1-regularized loss minimization. arXiv preprint arXiv:1105.5379, 2011.
[5] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 13(1):165-202, 2012.
[6] Wei Deng and Wotao Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, pages 1-28, 2012.
[7] Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 3068-3076, 2014.
[8] Ching-Pei Lee and Dan Roth. Distributed box-constrained quadratic optimization for dual linear SVM. ICML, 2015.
[9] Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan, Peter Richtárik, and Martin Takáč. Adding vs. averaging in distributed primal-dual optimization. In ICML 2015 - Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1973-1982. JMLR, 2015.
[10] Jakub Mareček, Peter Richtárik, and Martin Takáč. Distributed block coordinate descent for minimizing partially separable functions. Numerical Analysis and Optimization 2014, Springer Proceedings in Mathematics and Statistics, 2014.
[11] Benjamin Recht, Christopher Ré, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693-701, 2011.
[12] Peter Richtárik and Martin Takáč. Distributed coordinate descent method for learning with big data. arXiv preprint arXiv:1310.2059, 2013.
[13] Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning. In Communication, Control, and Computing (Allerton), 2014 52nd Annual Allerton Conference on, pages 850-857. IEEE, 2014.
[14] Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication efficient distributed optimization using an approximate Newton-type method. arXiv preprint arXiv:1312.7853, 2013.
[15] Martin Takáč, Avleen Bijral, Peter Richtárik, and Nathan Srebro. Mini-batch primal and dual methods for SVMs. ICML, 2013.
[16] Martin Takáč, Peter Richtárik, and Nathan Srebro. Distributed mini-batch SDCA. arXiv preprint arXiv:1507.08322, 2015.
[17] Tianbao Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 629-637, 2013.
[18] Tianbao Yang, Shenghuo Zhu, Rong Jin, and Yuanqing Lin. Analysis of distributed stochastic dual coordinate ascent. arXiv preprint arXiv:1312.1031, 2013.
[19] Yuchen Zhang and Lin Xiao. Communication-efficient distributed optimization of self-concordant empirical loss. arXiv preprint arXiv:1501.00263, 2015.