Dynamic Mapping and Load Balancing on Scalable Interconnection Networks

Alan Heirich, California Institute of Technology, Center for Advanced Computing Research

The problems of mapping and load balancing arbitrary programs and data structures on networks of computers are considered. A novel diffusion algorithm is presented to solve these problems. It complements the diffusion algorithms for load balancing which have enjoyed success on massively parallel processors. Like those algorithms it is infinitely scalable and adapts in real time to the requirements of dynamic problems. This is in contrast to existing mapping strategies, which are either not dynamic, not scalable, and/or inapplicable to irregular networks. Many recursive bisection algorithms have been proposed for mapping and load balancing on regular and irregular networks. Recursive spectral bisection (RSB) has been very popular because it has a solid theoretical foundation. RSB solves a series of constrained one-dimensional quadratic minimization subproblems in order to obtain a locally optimal solution to a multidimensional problem. The constraint prevents the algorithm from converging on a trivial solution. Incorporating the constraint into each subproblem leads to a series of eigenvalue problems which are typically solved by (expensive) Lanczos iterations. The diffusion algorithm presented here solves the same quadratic minimization problem as RSB, but solves it in multiple dimensions in a single step. As in RSB it is necessary to incorporate constraints in order to avoid the trivial solution. These constraints are incorporated directly from the problem instance and have a natural representation in the problem semantics. The resulting algorithm is scalable, dynamic, and applicable to any interconnection topology which can be contiguously embedded in R^n.
DIFFUSING COMPUTATIONS

Diffusion has been a metaphor for concurrent computation from the earliest days of parallel computing. Early papers introduced the metaphor, considered the problem of termination detection, and discussed mapping and process management.

- Computation: a form of energy which diffuses through a (parallel) computer system.
- Challenges: manage this (dynamic) energy for peak efficiency.
  - Mapping: to minimize communication.
  - Load balancing: to minimize idleness.
- Detect termination.

Dijkstra & Scholten, Termination detection for diffusing computations (1980).
Martin, A distributed implementation method for parallel programming (1980).
Chandy & Misra, Termination detection of diffusing computations in communicating sequential processes (1982).
DIFFUSION ALGORITHMS

With the advent of readily accessible parallel computers, several practical diffusion algorithms were proposed. The algorithms are based on computing an equilibrium of a dynamical system.

- "Solve" a Laplace system ∇²x = 0 by iteration.
- Converge to a fixed point (equilibrium) from any initial condition.
- Cybenko (1989) showed correctness for the load balancing problem.
- Heirich (1994), Heirich & Taylor (1995) showed infinite scalability.
- Heirich (1996) solves the mapping problem.

Cybenko, Dynamic load balancing for distributed memory multiprocessors (1989).
Heirich, Scalable load balancing by diffusion (1994).
Heirich & Taylor, A parabolic load balancing method (1995).
Heirich, Dynamic mapping and load balancing on scalable interconnection networks (1996).
BIBLIOGRAPHY: DIFFUSION IN PARALLEL COMPUTING

At least a dozen closely related papers have been published since 1988. All of these apply diffusion to the load balancing problem.

1. Baden, SIAM J Sci Stat Comp, 12:1 (1991), 145-157.
2. Boillat, Concurrency, 2 (1990), 289-313.
3. Boillat, Bruge & Kropf, J Comp Phys, 96:1 (1991), 1-14.
4. Bruge & Fornili, Comp Phys Comm, 60 (1990), 39-45.
5. Chandy & Misra, ACM TOPLAS, 4:1 (1982).
6. Conley, Argonne Natl Lab Tech Rep ANL-93/40 (1993).
7. Cybenko, J Par Dist Comp, 7 (1989), 279-301.
8. Dijkstra & Scholten, Inf Proc Lett, 11:1 (1980), 1-4.
9. Heirich, Caltech Comp Sci Dept Tech Rep CS-TR-94-04 (1994).
10. Heirich & Taylor, Proc 24th Intl Conf Par Proc (1995), v. III, 192-202.
11. Hong, Tan & Chen, Proc ACM Sigmetrics Conf (1988), 73-82.
12. Horton, Par Comp, 19 (1993), 209-218.
13. Hosseini, Litow, Malwaki, Nadella & Vairavan, J Par Dist Comp, 10 (1990), 160-166.
14. Martin, Inf Proc, 80 (1980), 309-314.
15. Muniz & Zaluska, Par Comp, 21 (1995), 287-301.
16. Willebeek-LeMair & Reeves, IEEE Tr Par Dist Sys, 4 (1993), 979-993.
17. Xu & Lau, J Par Dist Comp, 16 (1992), 385-393.
18. Xu & Lau, J Par Dist Comp, 24 (1995), 72-85.
PROPERTIES OF GSL ITERATIONS

A Gauss-Seidel iteration on a Laplace equation is a general algorithmic paradigm for equilibrium computations. From a practical standpoint it has properties which make it robust and scalable in distributed systems. From a theoretical standpoint it fits a general framework of equilibrium computations in dynamical systems.

- Gauss-Seidel iteration on the Laplace equation ∇²x = 0.
- Discrete Laplacian matrix Q, split into upper (U), lower (L), and diagonal (D) parts.
- Discrete iteration: x ← −(D + L)⁻¹ U x.
- Properties: concurrent, asynchronous, fault tolerant, scalable, fast.
- Scalability and convergence:
  - Model problem (degree-4 lattice) has known eigenvalues.
  - Time-dependent amplitude of a point disturbance is a linear superposition of modal amplitudes.
  - Convergence is exponential with respect to initial amplitude (i.e. logarithmic in time).
  - An arbitrary disturbance can be modeled as a composition of point disturbances.
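As a concrete illustration of the iteration above, here is a minimal sketch of Gauss-Seidel relaxation on a one-dimensional discrete Laplace equation with fixed (Dirichlet) endpoints. The function name and values are illustrative, not code from the talk; at equilibrium each interior value is the average of its neighbors, so the solution is a straight line between the endpoint values regardless of the initial condition.

```python
def gauss_seidel_laplace(x, sweeps):
    """In-place Gauss-Seidel relaxation; x[0] and x[-1] are held fixed."""
    for _ in range(sweeps):
        for i in range(1, len(x) - 1):
            # Discrete Laplace equation: each point relaxes to the
            # average of its two neighbors.
            x[i] = 0.5 * (x[i - 1] + x[i + 1])
    return x

# Start from an arbitrary initial condition between fixed values 0 and 1.
x = [0.0, 5.0, -3.0, 2.0, 1.0]
gauss_seidel_laplace(x, 200)
# Converges toward the linear equilibrium profile 0, 0.25, 0.5, 0.75, 1.
```

The inner loop uses values already updated in the current sweep (the Gauss-Seidel property), which is what permits the asynchronous, distributed execution claimed above.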
SCALING FOR POINT DISTURBANCES

The scalability of GSL iterations comes from the near scale invariance of the Laplacian spectrum. The figures below show the eigenvalues of two GSL iterations on a model problem. Both model problems are a degree-4 lattice. On the left are three spectral components from a 64 x 64 lattice, and on the right the corresponding components from a 1024 x 1024 lattice.

[Figure: left, spectral components "ev0_64", "ev16_64", "ev31_64" of the 64 x 64 lattice; right, the corresponding components "ev0_1024", "ev256_1024", "ev511_1024" of the 1024 x 1024 lattice.]

The eigenvalues of a GSL iteration on an n x n model problem are

    λ_{i,j} = (1/4) (cos(iπ/n) + cos(jπ/n))²
The time-dependent amplitude of a point disturbance after ν iterates is

    x^(ν)_{0,0} = Σ_{i,j} c_{i,j} (λ_{i,j})^ν = Σ_{i,j} c_{i,j} [(1/4) (cos(iπ/n) + cos(jπ/n))²]^ν

Convergence to equilibrium occurs exponentially, i.e. time is logarithmic with respect to the height of the disturbance. Below left, height of the point disturbance during 32 successive GSL iterates. Below right, number of iterates required to decrease the disturbance by 90% for increasing problem sizes. This number is constant above a certain size.

[Figure: left, disturbance height over 32 successive GSL iterates; right, iterates required for 90% decay versus problem sizes up to 256.]

Young, Iterative solution of large linear systems (1971).
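The right-hand experiment above can be sketched numerically. The following is an illustrative reconstruction (all names and parameters are assumptions): relax a unit point disturbance on an n x n degree-4 lattice with Gauss-Seidel sweeps and count the sweeps needed to reduce its height by 90%. The count is the same for different lattice sizes, which is the size-independence claim.

```python
def sweeps_to_decay(n, factor=0.1):
    """Count Gauss-Seidel sweeps until a unit point disturbance at the
    center of an n x n lattice (zero boundary) falls below `factor`."""
    grid = [[0.0] * n for _ in range(n)]
    c = n // 2
    grid[c][c] = 1.0                                  # unit point disturbance
    sweeps = 0
    while grid[c][c] > factor:
        for i in range(1, n - 1):                     # interior points only
            for j in range(1, n - 1):
                # Degree-4 lattice: relax to the average of 4 neighbors.
                grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                     + grid[i][j - 1] + grid[i][j + 1])
        sweeps += 1
    return sweeps

# The sweep count is independent of lattice size once n is large enough.
same = sweeps_to_decay(16) == sweeps_to_decay(32) == sweeps_to_decay(64)
```

This is only a serial model of the distributed iteration, but it exhibits the logarithmic-in-height, size-independent convergence described above.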
GRAPH LAYOUT BY QUADRATIC MINIMIZATION

Many NP-hard and NP-complete problems can be characterized as graph layout problems: given a (capacitated) graph, find an arrangement of vertices which minimizes the sum of a metric over pairs of connected points.

- Minimize (nontrivially) aggregate distance z among a (weighted) set of connected points.
- One dimension: z = Σ_{i,j} (x_i − x_j)² c_{i,j}.
  - Matrix bandwidth reduction.
  - Mapping on a ring.
- Two dimensions: z = Σ_{i,j} [(x_i − x_j)² + (y_i − y_j)²] c_{i,j}.
  - Mapping on a mesh, hierarchical, or irregular network.
  - Detecting implicit control structures in compiled programs.
  - VLSI placement.
  - Graph isomorphism, many others.

Kung & Stevenson, A software technique for reducing the routing time on a parallel computer with a fixed interconnection network (1977).
Bokhari, On the mapping problem (1981).
Read & Corneil, The graph isomorphism disease (1977).
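The one-dimensional objective can be made concrete with a small sketch (graph, weights, and coordinates below are illustrative choices, not from the talk): the layout that places connected vertices close together achieves a lower cost z.

```python
def layout_cost(coords, edges):
    """z = sum over edges of w * (x_i - x_j)^2.
    edges: list of (i, j, weight); coords: 1-D position of each vertex."""
    return sum(w * (coords[i] - coords[j]) ** 2 for i, j, w in edges)

# A path graph 0-1-2-3 with unit edge weights.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]

good = layout_cost([0.0, 1.0, 2.0, 3.0], edges)  # laid out in path order
bad = layout_cost([0.0, 2.0, 1.0, 3.0], edges)   # two vertices swapped
# good = 3.0 (1 + 1 + 1); bad = 9.0 (4 + 1 + 4)
```

Matrix bandwidth reduction and ring mapping both reduce to finding the permutation of coordinates that minimizes exactly this quantity.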
MAPPING PROGRAMS BY GRAPH EMBEDDING

The mapping problem can be described as a graph layout problem in which a graph of communicating processes is embedded into a graph of connected computers.

- "Guest" graph G: vertices are processes, edges are communication channels.
- "Host" graph H: vertices are computers, edges are network links.
- Graph embedding problem (Rosenberg): map vertices of G onto vertices of H to minimize dilation (average distance) and equalize density.
- Mapping problem (Martin): map processes onto computers to minimize aggregate communication and equalize workload.

Rosenberg, Issues in the study of graph embeddings (1981).
Martin, A distributed implementation method for parallel programming (1980).
COORDINATE BISECTION ALGORITHMS

In specialized problems, such as solving partial differential equations, a mapping is implicit in the problem instance. The coordinates of a PDE grid provide a solution which can be found by repeated coordinate bisection.

- Repeatedly bisect along alternating axes using intrinsic problem coordinates.
- Example: partitioning an unstructured PDE grid.
- Advantages: scalable, O(n log n/p); probably the best method for PDE problems.
- Problems: restricted to special cases in which vertices of G possess intrinsic coordinates. Inapplicable to irregular networks or dynamic problems. No provision for capacitated problems (weighted edges or vertices).

Williams, Performance of dynamic load balancing algorithms for unstructured mesh calculations (1991).
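The bisection step above can be sketched as follows. This is a minimal illustration of the idea, not Williams's implementation: split a set of 2-D points at the median along alternating axes until the desired number of parts is reached.

```python
def coordinate_bisection(points, parts, axis=0):
    """Recursively split `points` into `parts` groups (parts a power of 2)
    by median bisection along alternating coordinate axes."""
    if parts == 1:
        return [points]
    pts = sorted(points, key=lambda p: p[axis])  # order along current axis
    mid = len(pts) // 2                          # median split
    nxt = 1 - axis                               # alternate x / y axes
    return (coordinate_bisection(pts[:mid], parts // 2, nxt)
            + coordinate_bisection(pts[mid:], parts // 2, nxt))

# Partition a 4 x 4 grid of points among 4 computers.
grid = [(i, j) for i in range(4) for j in range(4)]
groups = coordinate_bisection(grid, 4)
# Four groups of four points each, spatially contiguous in the plane.
```

The limitation noted above is visible in the sketch: the method needs the intrinsic coordinates `(i, j)` and ignores any edge or vertex weights.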
GRAPH CONTRACTION ALGORITHMS

Any problem instance can be solved by repeatedly coalescing vertices of a graph until it equals a standard topology. Efficient and scalable.

- Advantages: finds optimal solutions for standard host topologies.
- Problems: suboptimal solutions for irregular networks, inapplicable to dynamic problems, no provision for capacitated problems.

Ben-Natan & Barak, Parallel contraction of grids for task assignment to processor networks (1992).
Karabeg, Process partitioning through graph compaction (1995).
THE LAPLACIAN SPECTRUM

Call C the matrix of edge weights of G. Then optimal bisections of G can be found by solving a one-dimensional eigenvalue problem for C. Minimize z:

    z = 0.5 Σ_{i,j} (x_i − x_j)² c_{i,j}
      = 0.5 Σ_{i,j} (x_i² − 2 x_i x_j + x_j²) c_{i,j}
      = 0.5 ( Σ_i x_i² Σ_j c_{i,j} − 2 Σ_{i,j} x_i x_j c_{i,j} + Σ_j x_j² Σ_i c_{i,j} )
      = Σ_i d_i x_i² − Σ_{i,j} x_i x_j c_{i,j}        where d_i = Σ_j c_{i,j}, D = diag(d_i)
      = x^T (D − C) x = x^T Q x

To avoid triviality (x = 0) append a constraint x^T x = 1. Minimize the resulting Lagrangian:

    L = x^T Q x − λ (x^T x − 1)
    ∂L/∂x = 0  ⇒  2Qx − 2λx = 0  ⇒  Qx = λx

In two dimensions minimize x^T Q x + y^T Q y, and similarly in higher dimensions.

Hall, An r-dimensional quadratic placement algorithm (1970).
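The key identity in the derivation, z = 0.5 Σ c_{i,j} (x_i − x_j)² = x^T (D − C) x, can be checked numerically. The small weighted graph and vector below are arbitrary illustrative choices:

```python
# Symmetric edge weights c_ij of a 3-vertex path graph (weights 2 and 1).
C = [[0.0, 2.0, 0.0],
     [2.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
n = len(C)
D = [sum(C[i]) for i in range(n)]   # weighted degrees d_i = sum_j c_ij
x = [0.3, -0.2, 0.7]                # an arbitrary test vector

# Left-hand side: half the weighted sum of squared coordinate differences.
pairwise = 0.5 * sum(C[i][j] * (x[i] - x[j]) ** 2
                     for i in range(n) for j in range(n))

# Right-hand side: x^T (D - C) x = x^T Q x, expanded term by term.
quadratic = (sum(D[i] * x[i] ** 2 for i in range(n))
             - sum(C[i][j] * x[i] * x[j]
                   for i in range(n) for j in range(n)))
# Both evaluate to 1.31 for this graph and vector.
```

Since Q = D − C is positive semidefinite with Q·1 = 0, the constraint x^T x = 1 (and orthogonality to the constant vector) is what forces the minimizer to the nontrivial eigenvector used by spectral bisection.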
RECURSIVE SPECTRAL BISECTION

Over 100 papers have been written in recent years about the RSB algorithm, most of which address its high cost. More recently it has been challenged on the grounds that equivalent quality solutions can be obtained by heuristics (Karypis).

- Recursively bisect using eigenvectors for optimal splitting at each step.
- Implementations by the Lanczos algorithm can be expensive and non-robust.
- The ultimate result can be suboptimal.
- Advantages: solid theoretical foundation; allows edge weights in G.
- Problems: no edge weights in H; no vertex weights in G or H; suboptimal solutions; expensive algorithm; unscalable; inapplicable to irregular networks or dynamic problems.

Barnard & Simon, A parallel implementation of multilevel recursive spectral bisection for application to adaptive unstructured meshes (1995).
Karypis & Kumar, Multilevel graph partitioning schemes (1995).
A VISUAL METAPHOR

In 1963 Tutte proposed an algorithm to draw a planar graph in the plane. The proposal was to solve a Laplace equation on the graph vertices while constraining the values of some vertices in order to avoid the trivial solution. This proposal can be extended to general graphs in R^n if the algorithm is based on a GSL iteration.

- Equalizing edge lengths of a planar graph gives a (nonunique) optimal placement in the plane (Tutte).
- The metaphor extends to nonplanar graphs under a least-squares minimization.
- Graph embedding can be accomplished by identifying local regions of the plane (or R^n) with vertices of H.
- Natural interpretation of capacities:
  - Process weights: areas of vertices in G.
  - Communication capacity: weights of edges in G.
  - Computer work capacities: areas of regions associated with vertices in H.

Tutte, How to draw a graph (1963).
Heirich, Dynamic mapping and load balancing on scalable interconnection networks (1996).
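Tutte's constrained-Laplace idea can be sketched with a Gauss-Seidel iteration: pin a few "distinguished" vertices and relax every other vertex to the average of its graph neighbors. The function name, graph, and pinned positions below are illustrative assumptions:

```python
def tutte_layout(neighbors, pinned, sweeps=200):
    """neighbors: adjacency lists; pinned: {vertex: (x, y)} held fixed.
    Unpinned vertices relax to the centroid of their neighbors."""
    pos = {v: pinned.get(v, (0.0, 0.0)) for v in neighbors}
    for _ in range(sweeps):
        for v, nbrs in neighbors.items():
            if v in pinned:
                continue  # constraints prevent collapse to the trivial solution
            pos[v] = (sum(pos[u][0] for u in nbrs) / len(nbrs),
                      sum(pos[u][1] for u in nbrs) / len(nbrs))
    return pos

# A 5-vertex wheel: hub 4 connected to a square 0..3 pinned at the corners.
neighbors = {0: [1, 3, 4], 1: [0, 2, 4], 2: [1, 3, 4], 3: [0, 2, 4],
             4: [0, 1, 2, 3]}
pinned = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.0, 1.0), 3: (0.0, 1.0)}
pos = tutte_layout(neighbors, pinned)
# The hub settles at the centroid of the square, (0.5, 0.5).
```

Without the pinned vertices every position would relax to a single point, which is exactly the trivial solution the constraints exclude.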
DIFFUSIVE MAPPING AND LOAD BALANCING

A robust, scalable, and dynamic scheme for mapping and load balancing can be constructed from GSL iterations. In the first step, solve the mapping problem by solving a Laplace equation on a graph of communicating processes or a connected data structure. In the second step, solve the load balancing problem by solving a Laplace equation on the workloads of the computers.

1. Define the mapping space R^n. R^n represents the network of connected computers. The dimension n is chosen so that R^n can be partitioned according to the connectivity of the network. For example, R^2 provides a natural partitioning for a two-dimensional mesh.

2. Partition R^n according to the network connectivity. In R^2 this can be done by embedding the network in a square region of the plane, bisecting each network link, and using the bisectors to define the regions associated with each computer. The area of each region represents the workload capacity of each computer. Equivalent procedures can be used for R^n.

3. Obtain the guest graph. In many problems this is the form of the problem instance, for example a connected data structure or a graph of communicating processes. In other problems it may be necessary to construct this graph from a data set, for example in ray tracing disconnected polygons.

4. Place a (small) set of "distinguished" vertices of the guest graph. This requires an oracle, or knowledge specific to the problem instance. For example, a small number of processes are typically associated with I/O. These nodes will constrain the algorithm and prevent it from finding the trivial solution. The placement of these nodes is critical.
5. Perform a GSL iteration on the vertices of the guest graph. The vertices which were placed in the previous step do not move. The work requirement of guest vertices can be represented by the areas they occupy in R^n; this requires a trivial modification to the basic GSL iteration. The communication requirements of guest channels can be represented by nonuniform weights in Q. The GSL iteration will take these nonuniform weights into account in the desired way, by grouping more closely those vertices which are connected by higher weighted edges.

6. Perform a GSL iteration on the workloads of the computers. The first GSL iteration will approach a fixed point and then converge very slowly to the ultimate configuration. At this point the vertices of G will be properly ordered and further iterates only refine the solution. Shortcut the final iterates by directly fluxing vertices of G across boundaries between the computers. A vertex of G is a candidate for fluxing only if it is connected by an edge to a vertex on a different computer.

The resulting algorithm is scalable, fault tolerant, delay insensitive, fast, and dynamic. It finds a locally minimal solution to the quadratic minimization problem just as RSB does. It naturally incorporates the capacities of computers, processes, and communication requirements. It does not address limited communication capacities of network links, nor network topologies which cannot be embedded in R^n for some n. The constraints result from proper initial placement of guest vertices, and these placements are critical to the quality of the resulting solution. These techniques are currently being applied to the problem of Monte Carlo ray tracing on parallel computers.
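The full scheme can be caricatured in a few lines. This is a deliberately simplified sketch under many assumptions (unit-square mapping space, a 2 x 2 mesh of equal-capacity computers, uniform edge weights, no fluxing step): pin two distinguished guest vertices, relax the rest by Gauss-Seidel, then assign each guest vertex to the region of R^2 it lands in.

```python
def relax(neighbors, pinned, sweeps=100):
    """Step 5: Gauss-Seidel on guest-vertex positions; pinned vertices fixed."""
    pos = {v: pinned.get(v, (0.5, 0.5)) for v in neighbors}
    for _ in range(sweeps):
        for v, nbrs in neighbors.items():
            if v in pinned:
                continue
            pos[v] = (sum(pos[u][0] for u in nbrs) / len(nbrs),
                      sum(pos[u][1] for u in nbrs) / len(nbrs))
    return pos

def assign(pos):
    """Step 2 in reverse: map each guest vertex to one computer of a
    2 x 2 mesh partitioning the unit square into four quadrants."""
    return {v: (int(x >= 0.5), int(y >= 0.5)) for v, (x, y) in pos.items()}

# Guest graph: a chain 0-1-2-3 with its endpoints (the "distinguished"
# vertices, e.g. I/O processes) pinned in opposite corners (step 4).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
pinned = {0: (0.1, 0.1), 3: (0.9, 0.9)}
placement = assign(relax(neighbors, pinned))
# Connected vertices land in nearby regions, limiting communication.
```

A production version would weight the averaging by edge capacities, account for vertex areas, and finish with the boundary-fluxing load-balancing iteration of step 6.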
APPLICATION TO PHOTOREALISTIC ANIMATION

(Figure and model courtesy of Greg Ward, Lawrence Berkeley Laboratories)

This model of an imaginary office is comprised of 18,579 polygons. The polygons are assembled into a graph and dynamically mapped across processors of the 512-node IBM SP2 at the Cornell Theory Center. A Euclidean metric in R^3 is used to cluster the polygons in a two-dimensional mapping space. Dynamic requirements in animation result from the movement of objects, including light sources.