Distributed Structured Prediction for Big Data




A. G. Schwing (ETH Zurich, aschwing@inf.ethz.ch), T. Hazan (TTI Chicago), M. Pollefeys (ETH Zurich), R. Urtasun (TTI Chicago)

Abstract

The biggest limitations of learning structured predictors from big data are the computation time and the memory demands. In this paper, we propose to handle those big data problems efficiently by distributing and parallelizing the resource requirements. We present a distributed structured prediction learning algorithm for large scale models that cannot be effectively handled by a single cluster node. Importantly, convergence and optimality guarantees of recently developed algorithms are preserved while keeping between-node communication low.

1 Introduction

In the past few years, structured models have become an important tool in domains such as natural language processing, computer vision and computational biology. The growing variability within data sets requires an increasing expressiveness that is achieved by modeling the influence of more and more variables. Hence the memory and computational limits of desktop computers are reached quickly. In computer vision, for example, uncompressed full HD video streams produce 10 megabytes of data per second.

Several structured prediction frameworks have been developed in the past. Notable examples are Conditional Random Fields (CRFs) [2], structured support vector machines (SSVMs) [7, 8] and their generalizations [1]. All three frameworks aim at minimizing a regularized surrogate loss. While CRFs and SSVMs are the method of choice for tree-structured or sub-modular models, approximations, e.g., [1], are in general required. Note that all three approaches are inherently parallel in the training data. But none of the aforementioned frameworks addresses the underlying memory limitations of large scale models arising from real-world problems. This is important since nowadays big data tasks of increasing volume, variety and velocity call for large models.
Hence we are interested in making structured prediction algorithms practical for large scale scenarios. We present an algorithm which distributes and parallelizes the computation and memory requirements while reducing communication between cluster nodes and conserving convergence and optimality guarantees. Our approach is based on the principle of dual decomposition, i.e., computation is done in parallel by partitioning the model and imposing agreement on independent variables that are required to be consistent. Thus, we split the graph-based optimization program into several local optimization problems solved in parallel, and cluster nodes exchange information occasionally to enforce consistency.

2 A Review on Structured Prediction

Let us first consider a setting where X denotes the input space (e.g., a video or a document) and S is a structured label space (e.g., a video segmentation or a set of parse trees). Further, let φ : X × S → R^F denote a mapping from the input and label space to an F-dimensional feature space. When using structured prediction approaches, we are commonly interested in finding the parameters w ∈ R^F of a log-linear model p_w(s | x) ∝ exp(w^⊤ φ(x, s)/ε) with covariance ε, which best describes the possible labelings s ∈ S of x ∈ X. For training, we are given a data set D = {(x_i, s_i)}_{i=1}^N containing N pairs, each composed of an input space object x ∈ X and a label space object s ∈ S. In order to find the model parameters w that best describe the annotations, we are often able to construct a task loss ℓ_{(x,s)}(ŝ) which measures the fitness of any labeling ŝ ∈ S. The vector v = Σ_{(x,s)∈D} φ(x, s) denotes the empirical mean, and we commonly assume independent and identically distributed data in addition to a prior p(w) ∝ exp(−‖w‖_p^p). During learning we minimize the negative loss-augmented data log-posterior, i.e.,

min_w  Σ_{(x,s)∈D} ε ln Σ_{ŝ∈S} exp( (ℓ_{(x,s)}(ŝ) + w^⊤ φ(x, ŝ)) / ε )  −  v^⊤ w  +  (C/p) ‖w‖_p^p.   (1)

Note that the covariance ε = 1 recovers the CRF objective [2], while ε → 0 smoothly approximates the max-function, hence recovering the SSVM formulation [7, 8]. Since the sum over all label space configurations ŝ ∈ S is generally exponential in size, the unconstrained minimization problem given in Eq. (1) is NP-hard in general.

Elements φ_r of the feature vector φ often describe interactions between subsets of random variables, i.e., φ_r(x, s) = Σ_{i∈V_{r,x}} φ_{r,i}(x, s_i) + Σ_{α∈E_{r,x}} φ_{r,α}(x, s_α). Note that a labeling s = (s_i)_{i∈V} ∈ S is a tuple subsuming |V| variables, each having |S_i| discrete states. The sparse interactions induced by the feature functions φ_r(x, s) are visually depicted by a factor graph G_{r,x}, with the individual variables i ∈ V_{r,x} of sample (x, s) being vertices that are connected to factors α ∈ E_{r,x} iff vertex i is a neighbor of factor α. The union graph G_x = ∪_r G_{r,x} describes the relationship over all features r, and we say that vertex i ∈ V_x = ∪_r V_{r,x} is a neighbor of factor α ∈ E_x = ∪_r E_{r,x} if variable s_i is part of the variable set s_α in any of the features of sample (x, s), i.e., i ∈ N(α). Conversely, all factors that variable i participates in are referred to by α ∈ N(i).

Approximations, e.g., [1], are one way to deal with the previously outlined intractability. The dual to the program given in Eq. (1) is described by means of joint distributions ranging, for each data sample (x, s), over the label space S.
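The union graph G_x above is simply a bipartite adjacency between variables and factors, so the neighborhoods N(i) and N(α) follow directly from the factor scopes. A minimal sketch (the variable ids, factor names and helper functions are ours, for illustration only, not the paper's code):

```python
# Build the bipartite factor graph: variables i and factors alpha with scopes.
# factor_scopes maps a (hypothetical) factor name to the variables it touches.
factor_scopes = {
    "f_12": (1, 2),   # pairwise factor over variables 1 and 2
    "f_23": (2, 3),
    "f_3":  (3,),     # unary factor over variable 3
}

# N(alpha): the variables neighboring factor alpha (given directly by its scope).
def factor_neighbors(alpha):
    return set(factor_scopes[alpha])

# N(i): the factors neighboring variable i.
def variable_neighbors(i):
    return {a for a, scope in factor_scopes.items() if i in scope}
```

The same adjacency is all that is needed later to decide which factors straddle a partition boundary.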
We describe this probability by its variable and factor marginals b_{(x,s),i}(s_i) and b_{(x,s),α}(s_α), and approximate the entropies of those joint distributions by the marginal entropies H(b_{(x,s),i}) and H(b_{(x,s),α}), using chosen counting numbers c_i and c_α for better approximation accuracy. To ensure consistency, we require the beliefs to fulfill marginalization constraints corresponding to the structure of the graph G_x while maximizing the approximated dual cost function

Σ_{(x,s)} [ Σ_i ε c_i H(b_{(x,s),i}) + Σ_α ε c_α H(b_{(x,s),α}) + Σ_{i,ŝ_i} b_{(x,s),i}(ŝ_i) ℓ_{(x,s),i}(ŝ_i) + Σ_{α,ŝ_α} b_{(x,s),α}(ŝ_α) ℓ_{(x,s),α}(ŝ_α) ]
− (C^{1−q}/q) Σ_r | Σ_{(x,s), i∈V_{r,x}, ŝ_i} b_{(x,s),i}(ŝ_i) φ_{r,i}(x, ŝ_i) + Σ_{(x,s), α∈E_{r,x}, ŝ_α} b_{(x,s),α}(ŝ_α) φ_{r,α}(x, ŝ_α) − v_r |^q,   (2)

with 1/p + 1/q = 1. The sum ranging over the training samples being the first term in both the original primal (Eq. (1)) and the approximated dual (Eq. (2)) suggests that computation of the gradient is inherently parallel in the data set elements. With real-world models G_x often being too large for the resources provided by a single cluster node, we next discuss a possibility to partition the optimization task while preserving the original convergence properties.

3 Distributed Structured Prediction

To cope with current model size needs, we are interested in an algorithm that maximizes Eq. (2) while leveraging the sparsity given by the graph structure G_x. In addition, we partition the vertices of the model such that each of the distributed cluster nodes solves an independent program defined on a subgraph induced by the variables of each partition (Fig. 1(a)). To ensure consistency for the global model, the distributed solutions are combined by exchanging information between connected subgraphs. The distributed structured prediction algorithm extends existing frameworks by introducing a high-level factor graph (Fig. 1(b)) describing the cluster node interactions. Occasional exchange of information corresponds to messages being sent on this factor graph.
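To make the replication bookkeeping concrete: a factor whose scope crosses the partition is held by every cluster node owning one of its variables, and (as formalized below via ĉ_α = c_α/|N_{P_x}(α)|) its counting number is split among those copies so the replicated terms sum back to the original. A small sketch with hypothetical data structures, not the paper's implementation:

```python
# partition: cluster node -> set of variables it owns (a disjoint cover of V_x).
partition = {"node_a": {1, 2}, "node_b": {3, 4}}
factor_scopes = {"f_12": (1, 2), "f_23": (2, 3), "f_34": (3, 4)}
c = {"f_12": 1.0, "f_23": 1.0, "f_34": 1.0}  # counting numbers c_alpha

# N_P(alpha): the cluster nodes that hold a copy of factor alpha, i.e. all
# nodes owning at least one variable in its scope.
def owner_nodes(alpha):
    return {n for n, verts in partition.items()
            if any(i in verts for i in factor_scopes[alpha])}

# c_hat: c_alpha divided by the number of replicas; only boundary factors
# (here "f_23") are actually scaled down.
c_hat = {a: c[a] / len(owner_nodes(a)) for a in factor_scopes}
```

The loss and feature terms of shared factors are divided in exactly the same way.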
It is important to note that we do not require an exchange of information at every iteration. More concretely, let P_x be a partition of all the vertices i ∈ V_x for sample (x, s) into disjoint subsets n_x ∈ P_x, each containing the variables i ∈ n_x that are assigned to the cluster node n_x. The vertices assigned to node n_x ∈ P_x induce a subgraph G_{x,n_x}.

Figure 1: (a) 2 samples, each distributed over 2 cluster nodes (indicated by color). (b) The cluster node factor graph for consistency messages. (c), (d) Convergence (dual energy) of the inference task w.r.t. iterations and time.

As before, this subgraph describes the marginalization constraints required to be enforced on cluster node n_x for its assigned variable beliefs b̄_{(x,s),i}(ŝ_i) ∀ (x, s), i ∈ n_x, ŝ_i and the factor beliefs b̄_{(x,s),α}(ŝ_α) ∀ (x, s), i ∈ n_x, α ∈ N(i), ŝ_α, i.e., Σ_{ŝ_α\ŝ_i} b̄_{(x,s),α}(ŝ_α) = b̄_{(x,s),i}(ŝ_i). A factor α that is assigned to multiple subgraphs G_{x,n_x} corresponds to a set of beliefs b̄_{(x,s),α}, each of them optimized independently on the cluster nodes n_x ∈ N_{P_x}(α). Since these distributed beliefs originate from a single b_{(x,s),α} in Eq. (2), we are required to ensure consistency. Formally, we construct a factor graph G_{P_x} with cluster nodes n_x being the vertices that are connected to shared factors α iff n_x ∈ N_{P_x}(α). Conversely, we denote by N_{P_x}(n_x) all factors α that are shared between multiple nodes, one of them being n_x. To keep the shared beliefs consistent, we add the constraints b̄_{(x,s),α}(ŝ_α) = b_{(x,s),α}(ŝ_α) ∀ (x, s), α, n_x ∈ N_{P_x}(α), ŝ_α. To ensure optimization of the cost function given in Eq. (2), we further need to balance the entropy H(b_{(x,s),α}), the loss ℓ_{(x,s),α} and the features φ_{r,α} for those factors α distributed onto different cluster nodes. To this end, we let ĉ_α = c_α / |N_{P_x}(α)|, ℓ̂_{(x,s),α} = ℓ_{(x,s),α} / |N_{P_x}(α)| and φ̂_{(x,s),α} = φ_{(x,s),α} / |N_{P_x}(α)| for all shared factors. For the remaining factors the variables augmented by the hat symbol ˆ correspond to the original variables. Consequently, we obtain the following maximization, equivalent to Eq.
(2):

max  Σ_{(x,s), n_x∈P_x} [ Σ_{i∈G_{x,n_x}} ε c_i H(b̄_{(x,s),i}) + Σ_{α∈G_{x,n_x}} ε ĉ_α H(b̄_{(x,s),α}) + Σ_{i∈G_{x,n_x}, ŝ_i} b̄_{(x,s),i}(ŝ_i) ℓ_{(x,s),i}(ŝ_i) + Σ_{α∈G_{x,n_x}, ŝ_α} b̄_{(x,s),α}(ŝ_α) ℓ̂_{(x,s),α}(ŝ_α) ] − (C^{1−q}/q) ‖z − v‖_q^q,   (3)

with marginalization constraints Σ_{ŝ_α\ŝ_i} b̄_{(x,s),α}(ŝ_α) = b̄_{(x,s),i}(ŝ_i) ∀ (x, s), n_x, i, ŝ_i, α ∈ N(i), consistency constraints b̄_{(x,s),α}(ŝ_α) = b_{(x,s),α}(ŝ_α) ∀ (x, s), n_x, α ∈ N_{P_{(x,s)}}(n_x), ŝ_α, and the variable z_r = Σ_{(x,s), n_x, i, ŝ_i} b̄_{(x,s),i}(ŝ_i) φ_{r,i}(x, ŝ_i) + Σ_{(x,s), n_x, α, ŝ_α} b̄_{(x,s),α}(ŝ_α) φ̂_{r,α}(x, ŝ_α) ∀ r ∈ {1, ..., F}.

We would like to utilize the structure of the graph to obtain memory efficient and fast algorithms. Since the structure is employed to express the marginalization constraints, the dual program of Eq. (3), with its Lagrange multipliers λ_{(x,s),i→α}(ŝ_i) corresponding to the marginalization constraints and ν_{(x,s),n_x→α}(ŝ_α) originating from the consistency constraints between different cluster nodes, is our preferred task. The dual program to Eq. (3) is given by the following claim.

Claim 1. Set ν_{(x,s),n_x→α} = 0 for every α ∉ G_{P_x} and enforce Σ_{n_x∈N_{P_{(x,s)}}(α)} ν_{(x,s),n_x→α}(ŝ_α) = 0 ∀ (x, s), α, ŝ_α. With φ̂_{(x,s),i}(ŝ_i) = ℓ_{(x,s),i}(ŝ_i) + Σ_{r: i∈V_{r,x,n_x}} w_r φ_{r,i}(x, ŝ_i) and φ̂_{(x,s),α}(ŝ_α) = ℓ̂_{(x,s),α}(ŝ_α) + Σ_{r: α∈E_{r,x,n_x}} w_r φ̂_{r,α}(x, ŝ_α), the dual program of the approximated structured prediction dual in Eq. (3) reads as

g = Σ_{(x,s), n_x, i∈G_{x,n_x}} ε c_i ln Σ_{ŝ_i} exp( ( φ̂_{(x,s),i}(ŝ_i) − Σ_{α∈N(i)} λ_{(x,s),i→α}(ŝ_i) ) / (ε c_i) ) − v^⊤ w + (C/p) ‖w‖_p^p
+ Σ_{(x,s), n_x, α∈G_{x,n_x}} ε ĉ_α ln Σ_{ŝ_α} exp( ( φ̂_{(x,s),α}(ŝ_α) + Σ_{i∈N(α)} λ_{(x,s),i→α}(ŝ_i) + ν_{(x,s),n_x→α}(ŝ_α) ) / (ε ĉ_α) ).   (4)

Proof: Follows [1, 5].

Looking at the distributed approximated primal given in Eq. (4) more closely, we note that both terms involving the two types of Lagrange multipliers are now preceded by sums ranging over the samples as well as the compute nodes n_x.
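The ε-scaled log-sum-exp ("soft-max") terms appearing in Eq. (4), and again in the μ messages of Claim 2, are typically evaluated with the usual max-shift for numerical stability. A sketch on a toy pairwise factor (all names and potentials here are illustrative, not the paper's code):

```python
import math

def soft_max_marginal(values, eps_c):
    # eps_c * ln(sum_k exp(values[k] / eps_c)), stabilized by factoring
    # out the maximum so the exponentials cannot overflow.
    m = max(values)
    return m + eps_c * math.log(sum(math.exp((v - m) / eps_c) for v in values))

# A mu-style message: soft-max-marginalize a pairwise potential over s_j
# while keeping s_i fixed; phi plays the role of phi_hat (+ nu) and lam_j
# that of an incoming lambda message.
def mu_message(phi, lam_j, eps_c_hat, s_i, states_j):
    vals = [phi[(s_i, s_j)] + lam_j[s_j] for s_j in states_j]
    return soft_max_marginal(vals, eps_c_hat)
```

As eps_c_hat → 0 the soft-max approaches an ordinary max, mirroring the SSVM limit ε → 0 discussed in Section 2.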

To derive an efficient algorithm we perform block-coordinate descent on this approximated primal. Fixing the consistency messages ν_{(x,s),n_x→α}(ŝ_α), the optimal λ_{(x,s),i→α}(ŝ_i) ∀ i ∈ G_{x,n_x} is computed without considering current information from other cluster nodes. A status update in the form of consistency messages ν_{(x,s),n_x→α}(ŝ_α) is analytically computed by synchronizing messages between the different machines. The Armijo iterations performed to optimize w_r require computation of the beliefs as well as the primal cost function value, which is done on the distributed nodes before another synchronization. The resulting block-coordinate descent and gradient steps are given by the following claim.

Claim 2. With μ_{(x,s),α→i}(ŝ_i) = ε ĉ_α ln Σ_{ŝ_α\ŝ_i} exp( ( φ̂_{(x,s),α}(ŝ_α) + Σ_{j∈N(α)\i} λ_{(x,s),j→α}(ŝ_j) + ν_{(x,s),n_x→α}(ŝ_α) ) / (ε ĉ_α) ), the steps in λ, ν and the gradient in w_r are:

λ_{(x,s),i→α}(ŝ_i) ← ĉ_α / (c_i + Σ_{β∈N(i)} ĉ_β) · ( φ̂_{(x,s),i}(ŝ_i) + Σ_{β∈N(i)} μ_{(x,s),β→i}(ŝ_i) ) − μ_{(x,s),α→i}(ŝ_i),

ν_{(x,s),n_x→α}(ŝ_α) ← 1/|N_{P_{(x,s)}}(α)| Σ_{n'_x∈N_{P_{(x,s)}}(α)} Σ_{i∈N(α)} λ^{n'_x}_{(x,s),i→α}(ŝ_i) − Σ_{i∈N(α)} λ^{n_x}_{(x,s),i→α}(ŝ_i),

∂g/∂w_r = Σ_{(x,s), n_x, i, ŝ_i} b̄_{(x,s),i}(ŝ_i) φ_{r,i}(x, ŝ_i) + Σ_{(x,s), n_x, α, ŝ_α} b̄_{(x,s),α}(ŝ_α) φ̂_{r,α}(x, ŝ_α) − v_r + C |w_r|^{p−1} sgn(w_r).

Proof: Follows [1, 5].

Since the order of the block-coordinate descent steps does not impact the convergence guarantees, we iteratively update the λ messages within a cluster node and the model parameters w_r before exchanging information between machines in the form of consistency messages. Note that updating the model parameters requires cluster nodes to exchange only single numbers, while the size of a consistency message depends on the size of the shared factor, which is commonly larger than a single real value.

4 Related Work and Discussion

Data parallel frameworks, like MapReduce, simplify the implementation of large-scale data processing but do not naturally support the development of efficient learning algorithms.
One of the most notable publicly available engines working towards efficient distributed algorithms is GraphLab which, originally supporting only shared-memory environments [3], was recently extended to distributed environments [4]. However, minimization of the communication overhead between cluster nodes is not considered, which potentially reduces computational performance. Our recent work on a parallel inference task that explicitly minimizes the communication overhead was presented in [5]. Fig. 1(c) and Fig. 1(d), taken from [5], show the convergence of an inference task w.r.t. iterations and time when communicating between machines every 1, 10, ..., 100 iterations. Although convergence in terms of iterations is best when transmitting information frequently, communication overhead reduces wall-clock performance when exchanging variables often. The drop in performance depends on the graph connectivity and clique size (e.g., a common pairwise 4-connected grid in our case) and the cluster infrastructure (LAN or InfiniBand connection). Since learning involves inference, a similar time dependence is expected.

5 Conclusion

We have presented a distributed structured prediction algorithm that is able to process models that exceed the resource restrictions of a single cluster node. Our approach divides the computation and memory requirements onto multiple machines while convergence and optimality guarantees are preserved by introducing a new type of consistency message. Our algorithm benefits particularly from the availability of multiple cluster nodes, but it is also useful on a single machine since we derive explicit rules for swapping parts of the model between memory and hard disk. Extensions towards latent variable models [6] and towards automatically finding an effective partitioning of graphical models are subject to future research.
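The communication trade-off above can be reproduced on a toy consensus problem: dual decomposition on min_x (x−1)² + (x−3)², with two "nodes" holding copies x1, x2 that agree only through a multiplier updated every sync_every iterations (a self-contained sketch under our own toy setup, not the authors' code):

```python
def solve(num_iters=200, sync_every=1, step=0.5):
    # Two nodes hold copies x1, x2 of the shared variable; lam is the
    # consistency multiplier (playing the role of a nu message).
    lam = x1 = x2 = 0.0
    for t in range(1, num_iters + 1):
        x1 = 1.0 - lam / 2.0  # node 1: argmin_x (x - 1)^2 + lam * x
        x2 = 3.0 + lam / 2.0  # node 2: argmin_x (x - 3)^2 - lam * x
        if t % sync_every == 0:
            lam += step * (x1 - x2)  # occasional exchange: dual ascent step
    return x1, x2
```

Syncing only every 10th iteration still drives both copies to the consensus optimum x1 = x2 = 2, just through cheaper, mostly local iterations — qualitatively mirroring the iteration/time trade-off of Fig. 1(c), (d).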

References

[1] T. Hazan and R. Urtasun. A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction. In Proc. NIPS, 2010.
[2] J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML, 2001.
[3] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A New Parallel Framework for Machine Learning. In Proc. UAI, 2010.
[4] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning in the Cloud. In Proc. VLDB, 2012.
[5] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Distributed Message-Passing for Large-Scale Graphical Models. In Proc. CVPR, 2011.
[6] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Efficient Structured Prediction with Latent Variables for General Graphical Models. In Proc. ICML, 2012.
[7] B. Taskar, C. Guestrin, and D. Koller. Max-Margin Markov Networks. In Proc. NIPS, 2003.
[8] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for Interdependent and Structured Output Spaces. In Proc. ICML, 2004.