Distributed Structured Prediction for Big Data

A. G. Schwing (ETH Zurich, aschwing@inf.ethz.ch), T. Hazan (TTI Chicago), M. Pollefeys (ETH Zurich), R. Urtasun (TTI Chicago)

Abstract

The biggest limitations when learning structured predictors from big data are computation time and memory demands. In this paper, we propose to handle such big data problems efficiently by distributing and parallelizing both the computational and the memory requirements. We present a distributed structured prediction learning algorithm for large scale models that cannot be handled effectively by a single cluster node. Importantly, the convergence and optimality guarantees of recently developed algorithms are preserved while keeping between-node communication low.

1 Introduction

In the past few years, structured models have become an important tool in domains such as natural language processing, computer vision and computational biology. The growing variability within data sets requires an increasing expressiveness that is achieved by modeling the influence of more and more variables. Hence the memory and computational limits of desktop computers are reached quickly. In computer vision, for example, uncompressed full HD video streams produce roughly 150 megabytes of data per second.

Several structured prediction frameworks have been developed in the past. Notable examples are Conditional Random Fields (CRFs) [2], structured support vector machines (SSVMs) [7, 8] and their generalizations [1]. All three frameworks aim at minimizing a regularized surrogate loss. While CRFs and SSVMs are the method of choice for tree-structured or sub-modular models, approximations, e.g., [1], are in general required. Note that all three approaches are inherently parallel in the training data, but none of the aforementioned frameworks addresses the underlying memory limitations of large scale models arising from real-world problems. This is important since nowadays big data tasks of increasing volume, variety and velocity call for large models. Hence we are interested in making structured prediction algorithms practical for large scale scenarios.

We present an algorithm which distributes and parallelizes the computation and memory requirements while reducing communication between cluster nodes and conserving convergence and optimality guarantees. Our approach is based on the principle of dual decomposition, i.e., computation is done in parallel by partitioning the model and imposing agreement on the independently optimized copies of variables that are required to be consistent. Thus, we split the graph-based optimization program into several local optimization problems solved in parallel, and cluster nodes exchange information occasionally to enforce consistency.

2 A Review on Structured Prediction

Let us first consider a setting where X denotes the input space (e.g., a video or a document) and S is a structured label space (e.g., a video segmentation or a set of parse trees). Further, let φ : X × S → R^F denote a mapping from the input and label space to an F-dimensional feature space. When using structured prediction approaches, we are commonly interested in finding the parameters w ∈ R^F of a log-linear model p_w(s | x) ∝ exp(w^⊤φ(x, s)/ε) with temperature ε, which best describes the possible labelings s ∈ S of x ∈ X.
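As an illustration of this log-linear model, the following Python sketch (our own toy example, not part of the paper; the tiny label space and feature map are assumptions) evaluates p_w(s | x) ∝ exp(w^⊤φ(x, s)/ε) by explicit enumeration of a small label space. For realistic structured label spaces this enumeration is exactly what becomes intractable.

    import itertools
    import numpy as np

    def log_linear_distribution(w, phi, labelings, eps=1.0):
        """Evaluate p_w(s | x) ~ exp(w^T phi(x, s) / eps) by brute-force enumeration.

        w         : (F,) parameter vector
        phi       : callable mapping a labeling s to its F-dimensional feature vector
        labelings : iterable of candidate labelings s (the label space S)
        eps       : temperature; eps = 1 gives the CRF distribution, eps -> 0
                    concentrates the mass on the maximizing labeling (SSVM-like)
        """
        scores = np.array([w @ phi(s) for s in labelings]) / eps
        scores -= scores.max()                 # shift for numerical stability
        p = np.exp(scores)
        return p / p.sum()

    # Toy example: three binary variables, unary features only (hypothetical).
    labelings = list(itertools.product([0, 1], repeat=3))
    phi = lambda s: np.array(s, dtype=float)   # feature vector = the labels themselves
    w = np.array([0.5, -1.0, 2.0])
    print(log_linear_distribution(w, phi, labelings, eps=1.0))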
For training, we are given a data set D = {(x_i, s_i)}_{i=1}^N containing N pairs, each composed of an input space object x ∈ X and a label space object s ∈ S. In order to find the model parameters w that best describe the annotations, we are often able to construct a task loss l_{(x,s)}(ŝ) which measures the fitness of any labeling ŝ ∈ S. The vector v = Σ_{(x,s)∈D} φ(x, s) denotes the empirical mean, and we commonly assume independent and identically distributed data in addition to a prior p(w) ∝ exp(−C‖w‖_p^p/(εp)). During learning we minimize the negative loss-augmented log-posterior of the data, i.e.,

\[
\min_w \;\; \sum_{(x,s)\in D} \epsilon \ln\Big(\sum_{\hat{s}\in S} \exp\big((l_{(x,s)}(\hat{s}) + w^\top \phi(x,\hat{s}))/\epsilon\big)\Big) \;-\; v^\top w \;+\; \frac{C}{p}\,\|w\|_p^p. \qquad (1)
\]

Note that the temperature ε = 1 recovers the CRF objective [2], while ε → 0 smoothly approximates the max-function, hence recovering the SSVM formulation [7, 8]. Since the sum over all label space configurations ŝ ∈ S is generally exponential in size, the unconstrained minimization problem given in Eq. (1) is NP-hard in general.

Elements φ_r of the feature vector φ often describe interactions between subsets of random variables, i.e., φ_r(x, s) = Σ_{i∈V_{r,x}} φ_{r,i}(x, s_i) + Σ_{α∈E_{r,x}} φ_{r,α}(x, s_α). Note that a labeling s = (s_i)_{i∈V} ∈ S is a tuple subsuming |V| variables, each having |S_i| discrete states. The sparse interactions induced by the feature functions φ_r(x, s) are visually depicted by a factor graph G_{r,x}, with the individual variables i ∈ V_{r,x} of sample (x, s) being vertices that are connected to factors α ∈ E_{r,x} iff vertex i is a neighbor of factor α ∈ E_{r,x}. The union graph G_x = ∪_r G_{r,x} describes the relationship over all features r, and we say that vertex i ∈ V_x = ∪_r V_{r,x} is a neighbor of factor α ∈ E_x = ∪_r E_{r,x} if variable s_i is part of the variable set s_α in any of the features of sample (x, s), i.e., i ∈ N(α). Conversely, all factors that variable i participates in are referred to by α ∈ N(i).

Approximations [1] are one way to deal with the previously outlined intractability. The dual to the program given in Eq. (1) is described by means of joint distributions ranging, for each data sample (x, s), over the label space S. We describe this probability by its variable and factor marginals b_{(x,s),i}(s_i) and b_{(x,s),α}(s_α), and approximate the entropies of those joint distributions by their marginal entropies H(b_{(x,s),i}) and H(b_{(x,s),α}), using counting numbers c_i and c_α chosen for better approximation accuracy. To ensure consistency, we require the beliefs to fulfill marginalization constraints corresponding to the structure of the graph G_x while maximizing the approximated dual cost function

\[
\sum_{(x,s),\,i} \epsilon c_i H(b_{(x,s),i}) + \sum_{(x,s),\,\alpha} \epsilon c_\alpha H(b_{(x,s),\alpha}) + \sum_{(x,s),\,i,\hat{s}_i} b_{(x,s),i}(\hat{s}_i)\, l_{(x,s),i}(\hat{s}_i) + \sum_{(x,s),\,\alpha,\hat{s}_\alpha} b_{(x,s),\alpha}(\hat{s}_\alpha)\, l_{(x,s),\alpha}(\hat{s}_\alpha)
\]
\[
- \frac{C}{q} \sum_r \Big| \sum_{(x,s),\, i\in V_{r,x},\, \hat{s}_i} b_{(x,s),i}(\hat{s}_i)\,\phi_{r,i}(x,\hat{s}_i) + \sum_{(x,s),\, \alpha\in E_{r,x},\, \hat{s}_\alpha} b_{(x,s),\alpha}(\hat{s}_\alpha)\,\phi_{r,\alpha}(x,\hat{s}_\alpha) - v_r \Big|^q, \qquad (2)
\]

with 1/p + 1/q = 1. Since the sum over the training samples constitutes the first term in both the original primal (Eq. (1)) and the approximated dual (Eq. (2)), computation of the gradient is inherently parallel in the data set elements. With real-world models G_x often being too large for the resources provided by a single cluster node, we next discuss a possibility to partition the optimization task while preserving the original convergence properties.

3 Distributed Structured Prediction

To cope with current model size needs we are interested in an algorithm that maximizes Eq. (2) while leveraging the sparsity given by the graph structure G_x. In addition, we partition the vertices of the model such that each of the distributed cluster nodes solves an independent program defined on a subgraph induced by the variables of its partition (Fig. 1(a)).
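To make the partitioning concrete, here is a minimal Python sketch (our own illustration, not the authors' implementation; all names are placeholders) that splits the vertices of a factor graph into per-node subsets and identifies the factors shared between cluster nodes, i.e., the factors for which consistency will later have to be enforced.

    from collections import defaultdict

    def partition_factor_graph(factors, partition):
        """Split a factor graph into per-cluster-node subgraphs.

        factors   : dict mapping factor id alpha -> set of incident vertices N(alpha)
        partition : dict mapping vertex i -> cluster node n_x it is assigned to
        Returns the vertices and touched factors of every node and the set of
        factors shared between several nodes (where consistency must be enforced).
        """
        node_vertices = defaultdict(set)
        node_factors = defaultdict(set)
        factor_owners = defaultdict(set)
        for i, n in partition.items():
            node_vertices[n].add(i)
        for alpha, scope in factors.items():
            for i in scope:
                n = partition[i]
                node_factors[n].add(alpha)
                factor_owners[alpha].add(n)
        shared = {a for a, owners in factor_owners.items() if len(owners) > 1}
        return node_vertices, node_factors, shared

    # 2x2 grid with pairwise factors, split into a left and a right cluster node.
    factors = {'a': {0, 1}, 'b': {2, 3}, 'c': {0, 2}, 'd': {1, 3}}
    partition = {0: 'left', 2: 'left', 1: 'right', 3: 'right'}
    vertices, touched, shared = partition_factor_graph(factors, partition)
    print(shared)   # {'a', 'b'}: the factors crossing the cut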
To ensure consistency of the global model, the distributed solutions are combined by exchanging information between connected subgraphs. The distributed structured prediction algorithm extends existing frameworks by introducing a high-level factor graph (Fig. 1(b)) describing the cluster node interactions. Occasional exchange of information corresponds to messages being sent on this factor graph. It is important to note that we do not require an exchange of information at every iteration.

More concretely, let P_x be a partition of all the vertices i ∈ V_x of sample (x, s) into disjoint subsets n_x ∈ P_x, each containing the variables i ∈ n_x that are assigned to cluster node n_x. The vertices assigned to node n_x ∈ P_x induce a subgraph G_{x,n_x}.
Figure 1: (a): 2 samples, each distributed onto 2 cluster nodes (indicated by color). (b): The cluster node factor graph for consistency messages. (c), (d): Convergence of the inference task (dual energy) w.r.t. iterations and time.

As before, this subgraph describes the marginalization constraints required to be enforced on cluster node n_x for its assigned variable beliefs b^{n_x}_{(x,s),i}(ŝ_i) ∀(x, s), i ∈ n_x, ŝ_i and the factor beliefs b^{n_x}_{(x,s),α}(ŝ_α) ∀(x, s), i ∈ n_x, α ∈ N(i), ŝ_α, i.e., Σ_{ŝ_α∖ŝ_i} b^{n_x}_{(x,s),α}(ŝ_α) = b^{n_x}_{(x,s),i}(ŝ_i). A factor α that is assigned to multiple subgraphs G_{x,n_x} corresponds to a set of beliefs b^{n_x}_{(x,s),α}, each of them optimized independently on the cluster nodes n_x ∈ N_{P_x}(α). Since these distributed beliefs originate from a single b_{(x,s),α} in Eq. (2), we are required to ensure consistency. Formally, we construct a factor graph G_{P_x} with the cluster nodes n_x being vertices that are connected to shared factors α iff n_x ∈ N_{P_x}(α). Conversely, we denote by N_{P_x}(n_x) all factors α that are shared between multiple nodes, one of them being n_x. To keep the shared beliefs consistent, we add the constraints b^{n_x}_{(x,s),α}(ŝ_α) = b_{(x,s),α}(ŝ_α) ∀(x, s), α, n_x ∈ N_{P_x}(α), ŝ_α.

To ensure optimization of the cost function given in Eq. (2), we further need to balance the entropy H(b_{(x,s),α}), the loss l_{(x,s),α} and the features φ_{r,α} for those factors α that are distributed onto different cluster nodes. To this end, we let ĉ_α = c_α/|N_{P_x}(α)|, l̂_{(x,s),α} = l_{(x,s),α}/|N_{P_x}(α)| and φ̂_{(x,s),α} = φ_{(x,s),α}/|N_{P_x}(α)| for all shared factors. For the remaining factors the variables augmented by the hat symbol ˆ correspond to the original variables. Consequently, we obtain the following maximization, equivalent to Eq. (2):

\[
\sum_{(x,s),\, n_x\in P_x} \Big[ \sum_{i\in G_{x,n_x}} \epsilon c_i H(b^{n_x}_{(x,s),i}) + \sum_{\alpha\in G_{x,n_x}} \epsilon \hat{c}_\alpha H(b^{n_x}_{(x,s),\alpha}) + \sum_{i\in G_{x,n_x},\,\hat{s}_i} b^{n_x}_{(x,s),i}(\hat{s}_i)\, l_{(x,s),i}(\hat{s}_i) + \sum_{\alpha\in G_{x,n_x},\,\hat{s}_\alpha} b^{n_x}_{(x,s),\alpha}(\hat{s}_\alpha)\, \hat{l}_{(x,s),\alpha}(\hat{s}_\alpha) \Big] - \frac{C}{q}\,\|z - v\|_q^q, \qquad (3)
\]

with marginalization constraints Σ_{ŝ_α∖ŝ_i} b^{n_x}_{(x,s),α}(ŝ_α) = b^{n_x}_{(x,s),i}(ŝ_i) ∀(x, s), n_x, i, ŝ_i, α ∈ N(i), consistency constraints b^{n_x}_{(x,s),α}(ŝ_α) = b_{(x,s),α}(ŝ_α) ∀(x, s), n_x, α ∈ N_{P_x}(n_x), ŝ_α, and the variables z_r = Σ_{(x,s),n_x,i,ŝ_i} b^{n_x}_{(x,s),i}(ŝ_i) φ_{r,i}(x, ŝ_i) + Σ_{(x,s),n_x,α,ŝ_α} b^{n_x}_{(x,s),α}(ŝ_α) φ̂_{r,α}(x, ŝ_α) for r ∈ {1, ..., F}.

We would like to utilize the structure of the graph to obtain memory efficient and fast algorithms. Since this structure is employed to express the marginalization constraints, the dual program of Eq. (3), with its Lagrange multipliers λ_{(x,s),i→α}(ŝ_i) corresponding to the marginalization constraints and ν_{(x,s),n_x→α}(ŝ_α) originating from the consistency constraints between different cluster nodes, is our preferred task. The dual program to Eq. (3) is given by the following claim.

Claim 1. Set ν_{(x,s),n_x→α} = 0 for every factor α not contained in G_{P_x} and enforce Σ_{n_x∈N_{P_x}(α)} ν_{(x,s),n_x→α}(ŝ_α) = 0 ∀(x, s), α, ŝ_α. With φ̂_{(x,s),i}(ŝ_i) = l_{(x,s),i}(ŝ_i) + Σ_{r: i∈V_{r,x,n_x}} w_r φ_{r,i}(x, ŝ_i) and φ̂_{(x,s),α}(ŝ_α) = l̂_{(x,s),α}(ŝ_α) + Σ_{r: α∈E_{r,x,n_x}} w_r φ̂_{r,α}(x, ŝ_α), the dual program of the approximated structured prediction dual in Eq. (3) reads as

\[
g = \sum_{(x,s),\, n_x,\, i\in G_{x,n_x}} \epsilon c_i \ln \sum_{\hat{s}_i} \exp\Big(\frac{\hat{\phi}_{(x,s),i}(\hat{s}_i) - \sum_{\alpha\in N(i)} \lambda_{(x,s),i\to\alpha}(\hat{s}_i)}{\epsilon c_i}\Big) - v^\top w + \frac{C}{p}\,\|w\|_p^p
\]
\[
+ \sum_{(x,s),\, n_x,\, \alpha\in G_{x,n_x}} \epsilon \hat{c}_\alpha \ln \sum_{\hat{s}_\alpha} \exp\Big(\frac{\hat{\phi}_{(x,s),\alpha}(\hat{s}_\alpha) + \sum_{i\in N(\alpha)\cap n_x} \lambda_{(x,s),i\to\alpha}(\hat{s}_i) + \nu_{(x,s),n_x\to\alpha}(\hat{s}_\alpha)}{\epsilon \hat{c}_\alpha}\Big). \qquad (4)
\]

Proof: Follows [1, 5].
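Each summand in Eq. (4) is an ε-scaled log-sum-exp (a "soft max") of the loss-augmented potentials shifted by the incoming λ and ν messages. A numerically stable way to evaluate such a term is sketched below in Python (our own illustration; the variable names and values are assumptions).

    import numpy as np

    def soft_max(scores, eps_c):
        """Compute eps_c * ln(sum_k exp(scores[k] / eps_c)) in a numerically stable way.

        For eps_c -> 0 this approaches max_k scores[k]; for eps_c = 1 it is the
        usual log-partition function of the (shifted) potentials.
        """
        scores = np.asarray(scores, dtype=float)
        m = scores.max()
        return m + eps_c * np.log(np.exp((scores - m) / eps_c).sum())

    # A per-variable term of Eq. (4): potentials phi_hat minus the sum of
    # outgoing lambda messages, evaluated over the states of variable i.
    phi_hat_i = np.array([0.3, -0.1, 1.2])   # hypothetical loss-augmented potentials
    lam_out = np.array([0.05, 0.0, 0.4])     # sum over alpha in N(i) of lambda_{i->alpha}
    print(soft_max(phi_hat_i - lam_out, eps_c=0.1))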
Looking at the distributed approximated primal given in Eq. (4) more closely, we note that both terms involving the two types of Lagrange multipliers are preceded by sums ranging over the samples as well as the compute nodes n_x.
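Because Eq. (4) is a sum over samples and compute nodes, plus the global −v^⊤w + (C/p)‖w‖_p^p term, its value and its gradient with respect to w can be accumulated from independent per-node contributions. The following Python sketch shows this accumulation pattern; in an actual cluster the sum over nodes would be an allreduce-style reduction rather than a loop, and all function names here are our own placeholders.

    import numpy as np

    def dual_value_and_gradient(local_terms, w, v, C, p=2):
        """Accumulate the distributed objective of Eq. (4) from per-node contributions.

        local_terms : list of callables, one per cluster node n_x; each returns
                      (g_local, grad_w_local) computed only from the node's own
                      variables, factors, lambda and nu messages (placeholders here).
        w, v        : parameter vector and empirical feature means.
        """
        g = -v @ w + (C / p) * np.sum(np.abs(w) ** p)        # global regularization part
        grad = -v + C * np.abs(w) ** (p - 1) * np.sign(w)
        for term in local_terms:                              # conceptually an allreduce
            g_loc, grad_loc = term(w)
            g += g_loc
            grad += grad_loc
        return g, grad

    # Two dummy "nodes" whose local contributions are simple functions of w (for illustration).
    local_terms = [lambda w: (0.5 * (w @ w), w),
                   lambda w: (float(np.sum(w)), np.ones_like(w))]
    w = np.array([0.2, -0.3]); v = np.array([1.0, 0.5])
    print(dual_value_and_gradient(local_terms, w, v, C=1.0))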
To derive an efficient algorithm we perform block-coordinate descent on this approximated primal. Fixing the consistency messages ν_{(x,s),n_x→α}(ŝ_α), the optimal λ_{(x,s),i→α}(ŝ_i), i ∈ G_{x,n_x}, is computed without considering current information from other cluster nodes. A status update in the form of consistency messages ν_{(x,s),n_x→α}(ŝ_α) is computed analytically by synchronizing messages between the different machines. The Armijo line-search iterations performed to optimize w_r require computation of the beliefs as well as the primal cost function value, which is done on the distributed nodes before another synchronization. The resulting block-coordinate descent and gradient steps are given by the following claim.

Claim 2. With

\[
\mu_{(x,s),\alpha\to i}(\hat{s}_i) = \epsilon \hat{c}_\alpha \ln \sum_{\hat{s}_\alpha\setminus\hat{s}_i} \exp\Big(\frac{\hat{\phi}_{(x,s),\alpha}(\hat{s}_\alpha) + \sum_{j\in N(\alpha)\cap n_x\setminus i} \lambda_{(x,s),j\to\alpha}(\hat{s}_j) + \nu_{(x,s),n_x\to\alpha}(\hat{s}_\alpha)}{\epsilon \hat{c}_\alpha}\Big),
\]

the update steps in λ and ν and the gradient in w_r are

\[
\lambda_{(x,s),i\to\alpha}(\hat{s}_i) \;\leftarrow\; \frac{\hat{c}_\alpha}{c_i + \sum_{\beta\in N(i)} \hat{c}_\beta}\Big(\hat{\phi}_{(x,s),i}(\hat{s}_i) + \sum_{\beta\in N(i)} \mu_{(x,s),\beta\to i}(\hat{s}_i)\Big) - \mu_{(x,s),\alpha\to i}(\hat{s}_i),
\]
\[
\nu_{(x,s),n_x\to\alpha}(\hat{s}_\alpha) \;\leftarrow\; \frac{1}{|N_{P_x}(\alpha)|}\sum_{i\in N(\alpha)} \lambda_{(x,s),i\to\alpha}(\hat{s}_i) - \sum_{i\in N(\alpha)\cap n_x} \lambda_{(x,s),i\to\alpha}(\hat{s}_i),
\]
\[
\frac{\partial g}{\partial w_r} = \sum_{(x,s),\,n_x,\,i,\,\hat{s}_i} b^{n_x}_{(x,s),i}(\hat{s}_i)\,\phi_{r,i}(x,\hat{s}_i) + \sum_{(x,s),\,n_x,\,\alpha,\,\hat{s}_\alpha} b^{n_x}_{(x,s),\alpha}(\hat{s}_\alpha)\,\hat{\phi}_{r,\alpha}(x,\hat{s}_\alpha) - v_r + C\,|w_r|^{p-1}\,\mathrm{sgn}(w_r).
\]

Proof: Follows [1, 5].

Since the order of the block-coordinate descent steps does not impact the convergence guarantees, we iteratively update the λ messages within a cluster node and the model parameters w_r before exchanging information between machines in the form of consistency messages. Note that updating the model parameters requires cluster nodes to exchange only F numbers, while the size of the consistency messages depends on the size of the shared factors, which is commonly larger than a single real value.

4 Related Work and Discussion

Data parallel frameworks, like MapReduce, simplify the implementation of large-scale data processing but do not naturally support the development of efficient learning algorithms. One of the most notable publicly available engines working towards efficient distributed algorithms is GraphLab which, originally supporting only shared-memory environments [3], was recently extended to distributed environments [4]. However, minimization of the communication overhead between cluster nodes is not considered, which potentially reduces computational performance. Our recent work on a parallel inference task that explicitly minimizes the communication overhead was presented in [5]. Fig. 1(c) and Fig. 1(d), taken from [5], show the convergence of an inference task w.r.t. iterations and time when communicating between machines every 1, 5, 10, ..., 100 iterations. Although convergence in terms of iterations is best when transmitting information frequently, communication overhead reduces wall-clock performance when exchanging variables often. The drop in performance depends on the graph connectivity and clique size (e.g., a common pairwise 4-connected grid in our case) and the cluster infrastructure (LAN or InfiniBand connection). Since learning involves inference, a similar time dependence is expected.
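The communication pattern discussed above, i.e., several local λ and w updates between two exchanges of consistency messages, can be summarized by the following loop structure. This is a schematic Python sketch under our own assumptions; the node class and the communication routines are placeholders, not the authors' code.

    class Node:
        """Stand-in for one cluster node; a real implementation holds the subgraph,
        beliefs, and lambda/nu messages of that node (all placeholders here)."""
        def __init__(self, name):
            self.name = name
        def update_lambda_messages(self):
            pass        # closed-form lambda updates of Claim 2, local to the machine
        def accumulate_w_gradient(self):
            return 0.0  # local contribution to dg/dw_r

    def aggregate_and_update_w(nodes):
        pass            # sum the per-node gradients (F numbers each) and take an Armijo step

    def exchange_nu_messages(nodes):
        pass            # synchronize consistency messages for the shared factors

    def train_distributed(nodes, num_outer_iters, sync_interval):
        """Outer loop: several local rounds between two consistency synchronizations."""
        for t in range(num_outer_iters):
            for node in nodes:                  # in practice one process per machine, in parallel
                node.update_lambda_messages()
                node.accumulate_w_gradient()
            aggregate_and_update_w(nodes)
            if (t + 1) % sync_interval == 0:
                exchange_nu_messages(nodes)     # kept infrequent to keep communication low

    train_distributed([Node('left'), Node('right')], num_outer_iters=100, sync_interval=10)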
5 Conclusion

We have presented a distributed structured prediction algorithm that is able to process models that exceed the resource restrictions of a single cluster node. Our approach divides the computation and memory requirements onto multiple machines while convergence and optimality guarantees are preserved by introducing a new type of consistency message.

Our algorithm benefits particularly from the availability of multiple cluster nodes, but it is also useful on a single machine since we derive explicit rules for swapping parts of the model between memory and hard disk. Extensions towards latent variable models [6] and towards automatically finding an effective partitioning of graphical models are subject to future research.
References

[1] T. Hazan and R. Urtasun. A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction. In Proc. NIPS, 2010.
[2] J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML, 2001.
[3] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A New Parallel Framework for Machine Learning. In Proc. UAI, 2010.
[4] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning in the Cloud. In Proc. Very Large Data Bases (VLDB), 2012.
[5] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Distributed Message Passing for Large Scale Graphical Models. In Proc. CVPR, 2011.
[6] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Efficient Structured Prediction with Latent Variables for General Graphical Models. In Proc. ICML, 2012.
[7] B. Taskar, C. Guestrin, and D. Koller. Max-Margin Markov Networks. In Proc. NIPS, 2003.
[8] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. In Proc. ICML, 2004.